Online Catalog Design Models: Are We Moving in the Right Direction?

Charles R. Hildreth, Ph.D.

4. ADVANCED IR RESEARCH AND THE PROBABILISTIC RETRIEVAL MODEL

For most of its 30-year history, automated information retrieval experiment and analysis has focused on query-oriented, query-based matching and retrieval strategies and techniques. As Rorvig (1988) commented in his review of research on the mental aspects of information retrieval, "Certain assumptions about information retrieval (IR) in general have existed for so long that they are rarely seriously re-examined. Foremost among these is the notion that IR is primarily about the retrieval of documents and that these documents are textual in nature and will be retrieved by textual queries." (emphasis mine) As Cox attests (1994): "Little seems to have changed since he [Rorvig] wrote that review."

During this period, researchers in advanced information retrieval have contributed a large body of experimental findings, theory, and published literature (see, for example, Robertson, 1977; Van Rijsbergen, 1979; Maron, 1983; Belkin and Vickery, 1985; Salton and McGill, 1983; Cleverdon, 1984; Bookstein, 1983 and 1985; Gerrie, 1983; Belkin and Croft, 1987; Willett, 1988). There have been many advances in the field, and the work of the IR researchers has produced a great many successful experiments and enlightening results. The primary accomplishment of this group of researchers and scholars has been the transformation of traditional indexing and retrieval analysis, opinion, and design activities into an empirical science resting on sound theoretical bases. With the benefit of hindsight, much of this research can be seen as a response to both the popularity and deficiencies of Boolean retrieval. Major outcomes of this scientific work include:

  1. the development of a deeper understanding of the inherent complexities in the information retrieval process and surrounding situation,
  2. new theoretical models of the IR environment, models which have more explanatory and predictive power,
  3. widely applicable evaluation methods and performance measures, and
  4. tested, more effective non-conventional retrieval techniques and more usable user-system interfaces.

Where conventional information retrieval (IR) systems employ an exact match retrieval strategy, non-conventional query-oriented systems employ a "closest" or "best match" strategy where degree of closeness or similarity of a candidate item's content description to the textual query is taken into account. Various measures of "closeness" or "similarity" between query and document representation or document have been used, and non-conventional query-oriented systems have employed a variety of "post-Boolean" retrieval strategies such as vector space processing, fuzzy match, and probabilistic techniques. The latter typically employ weighted-term searching, ranked output, and automatic or interactive relevance feedback. The relevance feedback is used to "expand" and enhance an existing query, with the aim of producing a better results set.
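The contrast between exact match and best match retrieval can be illustrated with a minimal sketch. The three sample records, the whitespace tokenizer, and the choice of cosine similarity over raw term counts are all assumptions made for illustration; cosine is only one of the similarity measures mentioned above, and no particular system is being reproduced.

```python
from collections import Counter
from math import sqrt

# Hypothetical sample records (not from any real catalog).
DOCS = {
    "d1": "probabilistic models of document retrieval",
    "d2": "boolean retrieval in online catalogs",
    "d3": "cataloging rules for serials",
}

def tokenize(text):
    """Lowercase and split on whitespace (no stemming, an assumption)."""
    return text.lower().split()

def boolean_and(query, docs):
    """Exact match: keep only records containing every query term."""
    wanted = set(tokenize(query))
    return [doc_id for doc_id, text in docs.items()
            if wanted <= set(tokenize(text))]

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_match(query, docs):
    """Best match: rank every record sharing at least one query term."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id)
              for doc_id, text in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0]

query = "probabilistic catalog retrieval"
print(boolean_and(query, DOCS))  # [] : no record contains all three terms
print(best_match(query, DOCS))   # ['d1', 'd2'] : partial matches, ranked
```

With these sample records the exact match search returns nothing, while the best match search returns the two partially matching records in order of similarity.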

4.1. The Indeterminate Information Retrieval Situation

There is wide agreement among information scientists that the Boolean retrieval model is theoretically flawed because it does not reflect or account for all the inherent subtleties and complexities which comprise the real world information retrieval situation. Researchers have proposed alternative models of the IR function and situation which they believe more accurately identify critical aspects of the problem area under study. If theory is to lead to improvements in practice (i.e., IR system design and use), theoretical models must take into account both the simple and the complex characteristics of the activity being modeled.

Much of the complexity of the IR situation can be attributed to the large degree of indeterminacy, uncertainty, and variability inherent in various levels of the whole domain (Bates, 1986). Researchers have shown that the IR situation is loaded with variability on all sides and, as a result, "Uncertainty seems to be a characteristic intrinsic to the information retrieval (IR) process" (Bookstein, 1983). From document description and subject analysis of texts to IR system design, efforts to improve matters must confront the inherently probabilistic nature of the entire retrieval environment.

Maron (1983) described the information retrieval situation in the following passage:

This probability ranking principle was also enunciated by Robertson (1977), who elucidated its solid theoretical grounding. In brief, the principle states that optimal performance is achieved by that retrieval system which ranks retrieved documents in decreasing order of their probability of relevance to the query which has been submitted.
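The ranking principle just stated can be written compactly. The notation is an assumption introduced here for illustration, not taken from Robertson or Maron: $P(R \mid d, q)$ denotes the system's estimate of the probability that document $d$ is relevant to query $q$, and $d_{(1)}, \ldots, d_{(n)}$ are the retrieved documents in output order.

```latex
\[
  P(R \mid d_{(1)}, q) \;\ge\; P(R \mid d_{(2)}, q) \;\ge\; \cdots \;\ge\; P(R \mid d_{(n)}, q)
\]
```

How the probabilities themselves are estimated is left open by the principle; it constrains only the order of presentation.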

According to IR researchers, the challenge facing system designers is to exploit the science and technology of automated information retrieval to achieve the "best" retrieval for a given user query in an inherently imprecise and uncertain situation. Compounding the variabilities and complexities of subject cataloging/indexing, file structure, and matching and retrieval algorithms, the user may not know or be able to adequately express his need, or may simply change his mind during the retrieval process about what he wants or is interested in.

4.2. Information Retrieval as an Inference Process

The aim of a bibliographic retrieval system is to retrieve documents likely to be relevant to a particular user's query, or, more precisely, documents relevant in the eyes of the user. Thus, relevance is a function of user assessment and cannot be established by the simple, mechanistic query-document matching procedures employed in conventional retrieval systems. We do not know enough about how and on what basis users make relevancy or utility judgments about retrieved bibliographic citations. Such reasoning activity may consist of a careful process of inference after examination of all the pertinent data in a citation. On the other hand it may consist of simple "flash" recognition, a drawing on a quick analogy to other known items, or just playing a hunch. The first case seems to be ruled out in present-day online catalog subject searching: every study of transaction logs indicates online catalog searchers seldom if ever look at a display of the full citation, which is the only way to find any explicit relevancy assessment data other than that contained in the title. The less subject data there is in the citation, the less likely a systematic process of inference will be undertaken to decide the matter. The user's knowledge of the subject field and any prior knowledge of the contents of the database would no doubt be significant variables in this assessment and selection activity.

The IR situation requires that we view information retrieval as an iterative, truly interactive, inductive process, one which engages the user throughout to gain relevance feedback that the system can use to correct its assumptions or to modify its automatically applied, heuristics-based matching and document ranking procedures. In other words, information retrieval, especially document retrieval, should be viewed as an interactive, cooperative process of mutually supportive inference. Conventional retrieval system matching mechanisms which exploit the inverted file structures of their databases may be internally efficient, but they too often produce large, unordered results sets that turn users off and away. Few end-users display the desire or ability to use existing system query syntax to modify their queries to achieve better, more manageable results.

An information retrieval system is effective to the degree that it supports and facilitates these document-relevancy assessment, selection or rejection activities. Since this human reasoning/recognition activity is not singular, one-dimensional, or usually predictable in a mechanistic way, it is unlikely a matching mechanism that does not interactively seek clues and rank its output will get the job done.

A number of methods have been tested for supporting this inference process, including automatic indexing techniques and retrieval techniques that employ statistical criteria and procedures. Statistical properties of text or terms in a database of bibliographic records are used to assign special values or weights to words, phrases, groups of related words, or clusters of citations. These techniques are in turn used in probabilistic or extended Boolean retrieval methods.

Willett (1988) describes the concept of "best match" or "nearest neighbor" searching:

A probabilistic retrieval system, simply understood, retrieves all documents that match a query in any degree, even if the match occurs on only one word or word root (stem), infers (computes) the probability of relevance of these documents to that specific query and ranks them accordingly. The ranking algorithm orders the set of retrieved documents according to their decreasing similarity to the query. The probability of relevance may be calculated from the frequencies of occurrence of query/index terms in the entire database and/or the retrieved documents, or on the basis of a variety of other query-document similarity measures. As an example of term weighting, a query term that occurs with very low frequency in the entire database but has a high occurrence count in particular documents would be considered to have special (high) value, and the documents it indexes would be considered to have a high probability of relevance.
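The term-weighting example in the preceding paragraph can be made concrete with a small sketch. It assumes a simple heuristic score, within-document frequency multiplied by the logarithm of inverse database frequency, as a stand-in for a true probability estimate; the sample records and the function name `rank` are hypothetical, and real probabilistic systems estimate relevance in more refined ways.

```python
from collections import Counter
from math import log

# Hypothetical sample records (not from any real catalog).
RECORDS = {
    "d1": "okapi okapi ranking experiments",
    "d2": "library catalog ranking",
    "d3": "library catalog searching",
    "d4": "library subject headings",
}

def rank(query, docs):
    """Retrieve every record matching at least one query term and order
    the records by decreasing score (ties broken by record id)."""
    n_docs = len(docs)
    counts = {doc_id: Counter(text.lower().split())
              for doc_id, text in docs.items()}

    # Database frequency: the number of records containing each term.
    db_freq = Counter()
    for term_counts in counts.values():
        db_freq.update(term_counts.keys())

    scores = {}
    for doc_id, term_counts in counts.items():
        matched = [t for t in set(query.lower().split()) if term_counts[t]]
        if matched:  # a match on a single term is enough to be retrieved
            scores[doc_id] = sum(
                term_counts[t] * log(n_docs / db_freq[t]) for t in matched
            )
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

for doc_id, score in rank("okapi library", RECORDS):
    print(doc_id, round(score, 3))
```

In this sample, "okapi" is rare in the database but frequent within d1, so d1 is ranked first; the three records matching only the common term "library" follow with equal, much lower scores.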

Relevance feedback from the searcher is considered essential to the effective performance of probabilistic retrieval systems. The searcher may explicitly change the values (system-calculated weights) assigned to search terms or may respond to the first-listed, top-ranked documents. Relevance feedback may lead to a refinement or expansion of the user's query and "fuel" the system for even better performance (Robertson and Sparck Jones, 1976; Oddy, 1977; Harper, 1980; Hendry et al., 1986).
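One way such feedback can be turned into query expansion is sketched below. It assumes the term relevance weight usually associated with Robertson and Sparck Jones (1976), in its common form with a 0.5 correction; the sample records, the function names, and the policy of adding the two highest-weighted new terms are illustrative assumptions, not a description of any particular system.

```python
from math import log

def relevance_weight(N, n, R, r):
    """Term relevance weight with the 0.5 correction.
    N: records in the database      n: records containing the term
    R: records judged relevant      r: relevant records containing the term
    """
    return log(((r + 0.5) * (N - n - R + r + 0.5))
               / ((n - r + 0.5) * (R - r + 0.5)))

def expand_query(query_terms, docs, relevant_ids, extra=2):
    """Add the `extra` best-weighted terms found in the records the
    searcher judged relevant (a hypothetical expansion policy)."""
    doc_terms = {d: set(text.lower().split()) for d, text in docs.items()}
    N, R = len(docs), len(relevant_ids)
    candidates = (set().union(*(doc_terms[d] for d in relevant_ids))
                  - set(query_terms))
    weights = {}
    for term in candidates:
        n = sum(term in terms for terms in doc_terms.values())
        r = sum(term in doc_terms[d] for d in relevant_ids)
        weights[term] = relevance_weight(N, n, R, r)
    best = sorted(weights, key=lambda t: (-weights[t], t))[:extra]
    return list(query_terms) + best

# Hypothetical sample records and relevance judgments.
SAMPLE = {
    "d1": "ranking relevance feedback",
    "d2": "relevance feedback weights",
    "d3": "ranking catalog cards",
    "d4": "catalog cards filing",
    "d5": "serials filing",
}
print(expand_query(["relevance"], SAMPLE, ["d1", "d2"]))
```

Terms concentrated in the records judged relevant receive the highest weights, so the expanded query leans toward vocabulary the searcher did not supply but implicitly endorsed.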

Probabilistic retrieval with relevance feedback is especially useful and effective in searching bibliographic databases because the user, on his own, cannot possibly know or specify all the possible linkages, associations, and relevancy ties among documents in a large multidisciplinary database. Probabilistic retrieval techniques, automatic search heuristics, and relevance feedback can exploit pre-coordinated conceptual structures and statistical associations to improve retrieval in such a universe.

Croft and Thompson (1987) summarize the advantages of probabilistic, statistical retrieval techniques:

Probabilistic and combinatoric retrieval methods, together with rule-based search strategy selection (if one retrieval strategy fails, automatically attempt another), can supplement the human tasks of relevancy assessment, inference, and selection better than Boolean methods, but neither can replace the human factor entirely. Human judgment is not only richer; it is the human who wants the documents or the information they contain. An intelligent retrieval system may never have the proper motivation to do a perfect job, that is, to retrieve all relevant documents (assuming a comprehensive search is desired) and no non-relevant documents, and to rank order the retrieved documents according to degree of relevance. Croft and Thompson (1987) remind us that the other source of imperfection in any machine retrieval environment is the system's inability to achieve in its interpretation of a query anything more than a close approximation to the actual information need. They conclude that the two best ways to improve retrieval performance are "to enhance the inference process used by the system and to acquire better descriptions of the information need."
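The rule-based strategy selection mentioned above (if one retrieval strategy fails, automatically attempt another) might be sketched as follows. The three strategies, their ordering from strictest to loosest, and the sample records are assumptions chosen for illustration; an actual system could use quite different rules.

```python
def exact_phrase(query, docs):
    """Strictest rule: the query must appear verbatim in the record."""
    q = query.lower()
    return [d for d, text in docs.items() if q in text.lower()]

def all_terms(query, docs):
    """Every query term must occur somewhere in the record."""
    wanted = set(query.lower().split())
    return [d for d, text in docs.items()
            if wanted <= set(text.lower().split())]

def any_term(query, docs):
    """Loosest rule: at least one query term occurs in the record."""
    wanted = set(query.lower().split())
    return [d for d, text in docs.items()
            if wanted & set(text.lower().split())]

# Hypothetical ordering of rules, strictest first.
STRATEGIES = [
    ("exact_phrase", exact_phrase),
    ("all_terms", all_terms),
    ("any_term", any_term),
]

def search_with_fallback(query, docs, strategies=STRATEGIES):
    """Try each strategy in turn and stop at the first that retrieves
    something; report which strategy succeeded."""
    for name, strategy in strategies:
        hits = strategy(query, docs)
        if hits:
            return name, hits
    return None, []

# Hypothetical sample records.
CATALOG = {
    "d1": "history of online catalogs",
    "d2": "catalog design and online searching",
}
print(search_with_fallback("online catalog design", CATALOG))
print(search_with_fallback("catalog maintenance", CATALOG))
```

Reporting the name of the strategy that succeeded lets the interface tell the searcher how the results were obtained, which keeps the human in a position to judge them.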

4.3. Summary of Contributions from Information Retrieval Research and Experimentation

IR research has resulted in a significant gain in our knowledge of the information retrieval process and environment, more effective and feasible retrieval methods, and useful performance evaluation measures and methods. One of the strengths of the IR research tradition is the community's emphasis on testing and evaluation.

Doszkocs (1986) cites the following advanced information retrieval functions and features as being among the paramount achievements of the IR research and experimentation community: "the notion of accepting unrestricted natural-language user queries, flexible matching functions, ranking of retrieval output according to potential relevance to the query, and dynamic utilization of user feedback in automatic search strategy modification."

The design principles and retrieval methods contributed by probabilistic retrieval theorists and researchers have successfully addressed many of the major limitations and problems associated with conventional information retrieval systems. When employed, these methods -- weighted-term, best match searching, relevance feedback, and ranking of retrieved documents -- generally lead to significantly better search performance than that obtainable with exact match, Boolean retrieval techniques. By not requiring the use of Boolean retrieval techniques, probabilistic retrieval resolves the problems users have with understanding and correctly using Boolean logic operators. Document ranking largely solves the problem of coping with large retrieval sets, the problem Maron calls "output overload." This problem occurs when the patron is "swamped by records which match his query. ... Simply narrowing his query by use of the logical AND causes serious deterioration of recall" (1983). Online catalog research shows that most users say they do not wish to look at more than 15 records in a large unranked results set. In their study of user persistence in scanning displays of search results, Wiberley et al. analyzed data from transaction logs and learned that many searchers will persist in looking beyond 15 records, but that "Users' persistence falls off significantly when the number of postings retrieved exceeds 30" (1990). Furthermore, when searches retrieve more than 30 records, a majority of users display and look at no records at all. Document ranking addresses this problem by placing the documents most likely to be relevant at the top of the output list.

Other subject searches fail when nothing is retrieved. In large, heterogeneous online catalog databases, expanded or not, it cannot be assumed that there is nothing in the database that might be relevant to the user's information need or interest. It is theoretically possible, of course, that this case could occur; that is, there simply is not any information or material represented in the database potentially relevant to the user's request. However, this should not be an operational assumption, primarily because the user may want the opportunity, interactively, to refine or change his subject query. The relaxation of "exact" matching in probabilistic retrieval techniques greatly reduces the number of subject searches that result in no items retrieved. Searchers are almost always provided some retrieved items to assess and to which they may respond. This makes relevance feedback, the process of obtaining relevance information from the user and using it to expand or enhance a search, possible in almost all searches. IR researchers have come to accept the view that relevance feedback from real searchers is a major factor in improving the performance of retrieval systems. Automatic or interactive relevance feedback techniques can provide the system useful information not contained in, nor derivable from, the original query, or not available at all from a user who begins to search and browse without a specific, clear expression of his information need. The multi-year series of Okapi online catalog experiments and evaluations in London has shown this to be an effective retrieval strategy (Walker, 1989; Walker and DeVere, 1990; Walker et al., 1991).