Online Catalog Design Models: Are We Moving in the Right Direction?

Charles R. Hildreth, Ph.D.

3. CONVENTIONAL INFORMATION RETRIEVAL SYSTEMS:
THE BOOLEAN MODEL

In our classification of retrieval systems, at the highest level we distinguish query-oriented systems from non-query-oriented, exploratory systems (see Figure 5 below). Query-oriented systems may be characterized as conventional or non-conventional. Conventional retrieval systems are exact-match systems. Online catalogs modeled after the card catalog or Boolean information retrieval systems, or both, are exact-match systems. Where conventional information retrieval systems employ an exact match retrieval strategy, non-conventional query-oriented systems employ a "closest" or "best match" strategy where degree of closeness or similarity of a candidate item's content description to the textual query is taken into account. Non-conventional retrieval systems will be discussed in the following section.

Figure 5. Classification of Information Retrieval Systems

Salton and McGill (1983) have reminded us that conventional retrieval systems are based on a common set of principles and methods. These systems typically carry out Boolean operations on search term match sets found in the inverted search index files. But, more importantly,

3.1. The Exact Match Search Paradigm.

Most of today's operational information retrieval systems and second-generation online catalogs use exact match retrieval techniques, featuring keyword, Boolean, proximity, and string searching. Search field specification and truncation and/or wildcard searching are usually supported as well. These exact match techniques require that the specifications of the query (e.g., the search terms and their specified logical or textual relationships) be satisfied precisely by any and all document representations that would make up the retrieval set.
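As a minimal sketch of the exact match paradigm described above, the following Python fragment builds an inverted index over a few invented records and evaluates Boolean AND, OR, and NOT as set operations on the match sets. The records, the function names (`build_index`, `boolean_and`, and so on), and the whitespace tokenizer are assumptions made for illustration; they are not drawn from any particular system.

```python
from collections import defaultdict

# Hypothetical sample records: record id -> title text.
RECORDS = {
    1: "medieval art in europe",
    2: "history of medieval europe",
    3: "modern art and design",
}


def build_index(records):
    """Map each keyword to the set of record ids that contain it."""
    index = defaultdict(set)
    for rec_id, text in records.items():
        for word in text.lower().split():
            index[word].add(rec_id)
    return index


def boolean_and(index, *terms):
    """Records containing every term (set intersection)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(index, *terms):
    """Records containing at least one term (set union)."""
    return set().union(*(index.get(t, set()) for t in terms))


def boolean_not(index, include, exclude):
    """Records containing `include` but not `exclude` (set difference)."""
    return index.get(include, set()) - index.get(exclude, set())


INDEX = build_index(RECORDS)
```

A record either satisfies the whole expression or is left out of the retrieval set. In this sketch, `boolean_and(INDEX, "medieval", "sculpture")` returns an empty set even though records 1 and 2 each match one of the two terms.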

Although the object of widespread criticism by researchers and many librarians, exact match searching remains the paradigm for operational online information retrieval, CD-ROM, and online catalog systems. There is much discussion and debate in the research literature regarding the reasons for this situation and why it continues. Two explanatory factors should be mentioned: 1) some techniques have been employed by system designers that relax the constraints of exact match searching, for example, stemming of query or index terms, and the provision of "wild card" searching; and 2) "The traditional Boolean, analytical search strategy is widely used by professional searchers of online bibliographic databases because of its potential for expressing an information need accurately" (Marchionini, 1988). In other words, it can be plausibly argued that Boolean propositions provide the flexibility and finesse to represent fine aspects of a user's information need with great precision. Researchers and designers have given database searchers post-coordinate searching tools that are both powerful and flexible for constructing expressions of users' information needs.

Designers of second-generation online catalogs implemented this model in the 1980s largely because it was the model incorporated by the major commercial online search services, and because it was preferred by the librarians who had become the trained, experienced users of those services. Willett points to the inertia factor: "Boolean systems have been with us for many years now and there is a natural disinclination on the part of both users and system providers to develop new techniques" (1988).

3.2. Boolean Retrieval Performance and Use Problems.

In the early 1980s many librarians and online catalogue researchers urged vendors and system designers to upgrade their online catalogues to keyword, Boolean retrieval systems. Many thought that the provision of Boolean search formulation and retrieval methods in online catalogues would provide more search flexibility and subject access points than first-generation online catalogues, and, by offering term/query post-coordinated searching, would give subject searchers an effective alternative to exact matches on unknown LC subject headings. Some librarians welcomed keyword/Boolean online catalogues as the panacea for the problems of subject searching in early online catalogues.

This enthusiastic anticipation of the arrival of Boolean online catalogues in libraries is easy to explain. First-generation online catalogues were not very good retrieval systems. Many of them are still in place alongside the new interactive CD-ROM reference retrieval systems and the online search terminals being used to access the commercial database search services. Boolean retrieval is the predominant mode of access in the world of commercial online reference searching, and has acquired over the years an immense prestige in the eyes of librarians. The commercial database search services have steadily grown in the number and size of their databases, and the number of libraries and librarians using one or more of them has increased greatly in the 1980s.

Conventional query-oriented information retrieval systems, especially keyword, Boolean systems, have been heavily criticized by information scientists for more than twenty years. Objections have been raised on both theoretical and empirical grounds. Much of the early criticism of conventional systems was theoretical in nature, and focused on the relatively poor performance and ineffectiveness of these systems in various document retrieval tasks. Controlled experiments were often conducted in artificial test environments (e.g., no human searchers were used) to confirm the hypotheses regarding the poor performance of conventional systems and/or the superior performance of alternative models and techniques. The flood of evidence produced by more recent studies of online catalog use and end-user searching, demonstrating major performance and use problems, has provided additional ammunition for the critics of conventional retrieval systems and online catalogs.

There have been a number of complaints and objections to conventional, Boolean retrieval systems voiced by researchers who specialize in information retrieval theory and design. Upon review, these criticisms fall naturally into two broad categories: 1) performance and use problems or deficiencies, and 2) questions about the adequacy of the model itself to represent the "retrieval situation" and the varieties of subject searching behavior.

Ironically, at the very dawning of Boolean retrieval in OPAC design, IR researchers were issuing warnings about the inherent deficiencies of the Boolean methodology. Evidence as to the difficulties end-users were encountering with the use of Boolean retrieval operations continued to mount through the decade of the 1980s. As Boolean OPACs began to crop up across the library landscape, Doszkocs (1983) commented:

Reflecting on the popularity of Boolean retrieval, Porter and Galpin remark, "This is unfortunate, since it has a number of inherent weaknesses" (1988). What is behind this prevalent anti-Boolean opinion which seems to pervade the IR research community? In a nutshell it is this: much research and experience with Boolean retrieval systems (including online catalogues) indicates clearly and repeatedly that Boolean search formulation syntax and retrieval techniques are not very effective in search performance and not very usable or efficient search methods for end users. The accumulating evidence clearly supports this summary critique of Boolean retrieval by Porter and Galpin (1988):

Criticisms of Boolean retrieval are commonplace in the IR research literature. Early on, Salton et al. (1983) described the difficulties searchers encounter in achieving both high recall and high precision in search results when using Boolean methods:

In their review of the literature on end-user searching of online bibliographic databases, Mischo and Lee (1987) reported: "There is also much anecdotal evidence and observation showing that end users have particular difficulties with search strategy formulation and the use of Boolean logic." Mischo and Lee discovered that numerous authors tell of the significant difficulties end users are experiencing with the proper use of Boolean logic: "Several evaluation studies indicate that the use of Boolean operators is viewed as the most difficult aspect of retrieval" (1987). Apparently, the solution is not more and better training programs. (How do you force training on public library patrons and dial-up online catalog users?) In one study reviewed by Mischo and Lee, the experimental end-user group was required to attend several training sessions and have their search strategy approved prior to searching online (BRS). The investigators found that end users had major problems with the choice of terminology, the use of Boolean operators, and search strategy modification. Even after this training and supervision, each subject who returned to attempt new searches needed to meet with expert search specialists beforehand to review system commands and Boolean logic. These findings have been corroborated by a number of similar studies.

In explaining their motives in designing an alternative, non-Boolean, natural language retrieval system for their users ("STATUS with IQ"), Pape and Jones (1988) refer to the "basic problem" with Boolean logic systems: "namely that high precision and high recall seem incompatible in this environment." And they go on: "There is also the important issue of allowing queries to be entered in natural language and saving users from the horrors of a typical boolean based query syntax."

Noreault et al. (1977), the founders of SIRE, an experimental prototype bibliographic retrieval system developed at Syracuse University, focus their criticism of traditional Boolean retrieval on a major problem associated with the output (results) of Boolean systems. The problem is especially felt by users when their searches produce large results sets:

A near-consensus exists among information retrieval theorists and investigators regarding the shortcomings of IR systems that rely solely on Boolean-logic query formulation and matching. Salton (1984) gives us the best overall summary of criticisms of Boolean retrieval systems voiced or shared by most researchers and experimenters in the information science community:

More recently, Crawford (1992) has stated the case against Boolean very succinctly: Boolean "is difficult to learn and easy to misuse." It is difficult to learn the meaning and the correct application of Boolean operations to concepts, terms in indexes, bibliographic records or sets of these records, and it is difficult to remember these semantic and syntactic complexities of Boolean query construction and retrieval strategies when this knowledge is not put to use on a regular basis. Also, for the less knowledgeable of searchers, it is easy to misinterpret the results of Boolean retrieval operations. They may have no idea how or why the system retrieved the items produced in the results set, or, worse, they may believe that only and all of the most important items have been retrieved. In the all too frequent case of 0-hit retrievals, the searcher may interpret this to mean that the database holds no relevant item(s). This may result in a greater degree of puzzlement in the searcher's mind than they brought to the system in the first place. In the case of very large results sets, the searcher may interpret this to mean that each item stands an equal chance of being relevant, necessitating an exhaustive review of many items if nothing important is to be missed or passed by. And, finally, even when Boolean techniques are used correctly and used well, the result is a binary split of the database being searched; all items in the database are ruled either to be included in the search results set or to be excluded. Various degrees of inclusion are not permitted, and retrieved items are not ranked according to the probability of their relevance to the query.
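The binary split can be contrasted with a ranked alternative in a few lines of Python. The sketch below is illustrative only: the records are invented, and the ranking function simply counts how many query terms a record contains, which is a crude stand-in for best match weighting schemes rather than an implementation of any of them.

```python
# Hypothetical records: id -> set of index terms.
RECORDS = {
    "A": {"online", "catalog", "design"},
    "B": {"online", "catalog", "boolean", "retrieval"},
    "C": {"boolean", "algebra"},
    "D": {"card", "catalog"},
}


def boolean_split(records, query_terms):
    """Exact match AND: each record is either in or out; no ordering."""
    query = set(query_terms)
    retrieved = {rid for rid, terms in records.items() if query <= terms}
    excluded = set(records) - retrieved
    return retrieved, excluded


def ranked_by_overlap(records, query_terms):
    """Order records by how many query terms they share (ties by id)."""
    query = set(query_terms)
    scored = [(len(query & terms), rid) for rid, terms in records.items()]
    # Drop records with no matching term; sort by score descending, then id.
    return [rid for score, rid in sorted(scored, key=lambda p: (-p[0], p[1]))
            if score > 0]
```

For the query terms "online", "catalog", and "boolean", the Boolean split retrieves only record B and excludes the rest without distinction, while the overlap ranking lists B first, then A, then C and D, so that partial matches remain visible in order of closeness.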

Keen (1994) has commented that the basis of effective Boolean retrieval in practice consists of "a plethora of technical devices and human heuristic tactics geared to adjust the elusive parameter of desired search breadth." He laments that "Professionals as well as end-users find it to be unsatisfactory, but it is all they have available." The arrival of hypertext and natural language search and retrieval methods on the World Wide Web offers some hope that OPAC users will have a brighter future.

Recent online catalogue design efforts have centered on making post-coordinate, exact match, Boolean logic, library retrieval systems easier to learn and easier to use than the commercial models used by trained intermediaries. However, too little attention has been given to the performance limitations of Boolean online catalogues, and no commercially available online catalogue uses any of the advanced post-Boolean retrieval methods which have been tested with some success in the retrieval labs by the probabilistic and fuzzy-set retrieval theorists.

3.3. Explicit and Implicit Online Catalogs

After end-users' difficulties with Boolean query systems began to be widely reported and discussed, some online catalog designers implemented various techniques aimed at reducing the difficulties associated with formulating and entering complex queries. For example, menus were provided for command selection, and users of these system interfaces had only to enter search terms and optionally specify a type of search or field to be the target of the search. The online catalog software then "constructed" the query and supplied the Boolean or proximity operator to coordinate the terms entered by the user. The default or "implicit" operator used to specify the relationship between the search terms could, in many cases, be changed by system managers if they felt it was necessary to change the logic of the relationship between search terms. For example, changing the system-supplied implicit operator "between" search terms from adjacency to the Boolean "AND" would likely broaden the search and usually yield a larger results set or reduce the number of no match, "no hit" search failures. This change was found to be necessary when users began to complain of not being able to find titles of books they knew were in the collection, and consultation with transaction logs confirmed the problem. Searchers typically remember and enter two or three significant words in a title, rather than the complete title or precise order of words in the title.
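The effect of switching the implicit operator can be illustrated with a small Python sketch. The titles and function names below are hypothetical, and adjacency is modelled simply as the query words appearing consecutively and in order, which is an assumption about how such an operator behaves rather than a description of any specific catalog.

```python
# Hypothetical title index: record id -> title.
TITLES = {
    1: "the art of medieval warfare",
    2: "medieval art and architecture",
    3: "gothic architecture",
}


def match_adjacent(title, words):
    """True if `words` occur consecutively, in order, in the title."""
    tokens = title.lower().split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def match_and(title, words):
    """True if every word occurs somewhere in the title, in any order."""
    tokens = set(title.lower().split())
    return all(w in tokens for w in words)


def search(titles, user_input, implicit_operator):
    """Apply the system-supplied operator to the user's bare search terms."""
    words = user_input.lower().split()
    matcher = match_adjacent if implicit_operator == "ADJ" else match_and
    return sorted(rid for rid, t in titles.items() if matcher(t, words))
```

With the adjacency default, a searcher who enters the remembered words in the wrong order ("art medieval") retrieves nothing, whereas the same input under an implicit AND retrieves records 1 and 2. The larger results set comes at the cost of admitting titles in which the words are unrelated.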

With or without these "user friendly" techniques, most online catalogs in operation are still Boolean query or string-matching, exact match retrieval systems. One might refer to these two kinds of second-generation online catalogs as "explicit" and "implicit" exact match systems. In implicit online catalogs, the query formulation requirements placed upon the user are greatly reduced or removed altogether. In the former case, the searcher is required to enter a term or terms that represent his information need, and, perhaps, specify a type of search or search field by selecting it from a menu or by using a simplified command language (e.g., FIND TITLE medieval art). The system then supplies the combinatorial logic which specifies a relationship between the terms to be assumed and acted upon in the matching operation. Implicit truncation, for example, might also be applied to the terms such that a match could occur on both "medieval" and "medievalist", or "art" and "artists".
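Implicit truncation of this kind amounts to prefix matching against the index vocabulary. A minimal Python sketch, using an invented vocabulary and a hypothetical `expand` helper, shows both the intended matches and how a short search term can pull in additional index terms:

```python
# Hypothetical index vocabulary, kept in sorted order.
VOCABULARY = sorted([
    "art", "arthritis", "artists", "medieval", "medievalist", "medicine",
])


def expand(stem, vocabulary):
    """Return every index term that begins with the user's term."""
    return [term for term in vocabulary if term.startswith(stem)]
```

In this sketch the term "medieval" expands to "medieval" and "medievalist" as intended, but "art" expands to "art", "arthritis", and "artists", and the searcher is never shown that the expansion took place.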

Such implicit online catalogs leave the user entirely in the dark about the term combinatorial logic, truncation (if any) and matching functions they automatically employ. As a consequence, most searchers will not have a clue as to why some searches fail to retrieve any documents, or why other searches retrieve large numbers of non-relevant documents. Thus, they have no information feedback to aid in the modification or reformulation of their search queries for a second or third try. Even if they guess that the online catalog they are using searches on "medieval art" as a unitary string of contiguous characters, these implicit online catalogs generally do not provide the means for a searcher to re-specify the request as "medieval AND art", for example.

Another category of implicit online catalogs includes those that remove the requirement to formulate and enter a query altogether. Using these online catalogs, the searcher may optionally select a type of search from a menu (e.g., author, title, subject, etc.) or proceed directly to a display of index terms or brief document titles usually presented in an alphabetically ordered list. (Some online catalogs display title lists in class number order.) Markey calls this approach "alphabetical searching" (Markey, 1989). This approach closely mimics the way searchers access and scan document records in the earlier manual card catalogs. Searchers choose a location in the displayed alphabetical list (or drawer of cards) of "headings" terms as an entry point to the database, then scan nearby terms or the bibliographic records filed under them. In the online catalog, a selection of a single term from the list (terms can be keywords extracted from text or pre-coordinated phrases from a controlled subject vocabulary) will typically call up a display of all bibliographic records associated with the selected term. These usually abbreviated document "title" records may, in turn, be scanned for further selection, fuller display, and assessment.

This alphabetical list scanning and selection approach to searching, found in many online catalogs, is often named the "BROWSE" mode or searching option. These "browse" online catalogs make the fewest demands on the searcher with regard to the process of query formulation and entry. The searcher merely scans a list, selects a term from the list (rather than entering one of his own), then sees what document records are retrieved. The searcher may have a term or terms in mind, of course, then consult the system's lists to find it or one like it in some sense and thus suitable for searching. When a term has been selected, the system carries out the "built-in" matching and retrieval operations. Such browse online catalogs may still be classified in the category of exact match systems. Although it is the only search approach offered in a few online catalogs, in most second-generation online catalogs it is offered as one search option, along with a keyword, Boolean search option (explicit in some, implicit in others).
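The browse approach can be sketched as a lookup into a sorted list of headings: the system positions the searcher at the nearest alphabetical entry point and displays the neighbouring headings. The headings, the record numbers, and the `browse` and `select` functions below are invented for illustration, and the use of binary search is an implementation assumption.

```python
import bisect

# Hypothetical subject headings with the record ids filed under each.
HEADINGS = {
    "art, gothic": [14, 27],
    "art, medieval": [3, 8, 19],
    "art, modern": [5],
    "artists": [11, 12],
    "astronomy": [2],
}
SORTED_HEADINGS = sorted(HEADINGS)


def browse(entry, window=3):
    """Return `window` headings starting at the alphabetical entry point."""
    start = bisect.bisect_left(SORTED_HEADINGS, entry.lower())
    return SORTED_HEADINGS[start:start + window]


def select(heading):
    """Selecting a displayed heading retrieves exactly the records filed under it."""
    return HEADINGS[heading]
```

The searcher never formulates a query here, yet the retrieval step in `select` is still an exact match on the chosen heading, which is why such catalogs remain in the exact match category.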

We have now identified three types of present-day operational online catalogs: 1) explicit, exact match systems (usually Boolean and string searching systems); 2) implicit Boolean exact match systems (in which the system software defines the term relationships); and 3) "browse" online catalogs that feature alphabetical searching of index term or citation lists. In all three types of operational online catalogs - and most are mixed, hybrid systems - effective subject searching requires the user to express his need for information in a form or terminology acceptable to the system. This means that users must not only specify their need in advance, but think about what sort of documents will satisfy their need, and also translate these concepts into the terms used in the indexing vocabulary of the system. These terms may then be used in a formal query, if the particular system requires one, or sought for in an alphabetical list displayed for this purpose. The system then takes the query or selected term and applies a matching function to determine which records are to be retrieved for display and evaluation by the user.

Larson explains that either process of query formulation or term selection from lists required in conventional information retrieval systems and online catalogs "involves predicting which terms in the indexing language of the system have been used to index the documents that the user would want to retrieve" (Larson, 1991). He points to evidence that indicates online catalog users do not conceive of subject searching in this way, and that when required to, they usually do not do a very good job of predicting or guessing the terms used to index the desired or potentially useful documents. Some of the guessing required may be reduced in systems that permit or require searchers to scan lists of index or thesaurus terms to identify search terms. However, in large online databases, the length of these lists, or the complex structure of lists such as thesauri, may place an unreasonable burden on the untrained, infrequent user.

3.4. Problems Experienced By Users

Online catalog research studies have uncovered a number of common problems experienced by users of second-generation online catalogs. In general terms the major problems include:

3.5. Subject Searching Difficulties

There is a growing body of research-based evidence which demonstrates that present-day, second-generation online catalogs are not very effective instruments in meeting the information access needs of library users. This research continues to reveal that there are pervasive problems with subject searching on online catalogs, and, as Larson (1991) notes, "the major problems pointed out by users of online catalog systems were symptomatic of a lack of effective subject access to the contents of the collection."

As Larson and others point out (e.g., Hancock-Beaulieu, 1989a; Drabenstott, 1995), many of the reported problems with subject access have a long history, and may be inherent in the complex processes of document content analysis and indexing, or associated with limitations in the media and functionality that have been provided for subject searching and retrieval in the past. There seems little doubt, however, that the extent and impact of the difficulties associated with subject access and the deficiencies in subject searching performance will increase as end-user searching in the expanded, extended online catalog increases.

Based on a review of early online catalog subject searching research studies and her own research, Markey (1984) listed the major problems the earliest subject searchers faced, with little or no assistance from online catalogs:

These findings have been corroborated in many subsequent studies of other online catalogs (Cochrane, 1986; Bates, 1986; Pritchard, 1986; Carlyle, 1989; Smith, 1989; Lester, 1989; Peters, 1991; Hunter, 1991; Johnson and Carey, 1992).

Reviewing online catalogs at the end of the 1980s, Hildreth (1989) provided this summary of subject-searching deficiencies of second-generation online catalogs:

3.6. Conclusion

Most researchers who have conducted studies of subject searching in online catalogs have recommended improvements to the subject searching capabilities of online catalogs. In the last ten years developers have responded with improvements to subject searching. These improvements include keyword, Boolean searching, better browsing displays of subject indexes, and labeled displays of bibliographic records. In their recent monograph on online subject searching, Drabenstott (formerly Markey) and Vizine-Goetz (1994) comment: "A review of studies conducted since the [1982] CLR study shows that despite a decade of improvements in online catalogs, the difficulties today's users have performing subject searches in online catalogs are similar to those experienced by the earliest online catalog users." This opinion receives further support from the findings of the authors' massive transaction log study of subject searching in the online catalogs of three university research libraries. Operating on the premise that improvements in term matching and retrieval algorithms are the best way to improve the subject searching capabilities of existing online catalogs, Drabenstott and Vizine-Goetz's continuing research on "search trees" (1992, 1995) is query-oriented in that it addresses the questions of which retrieval process is best for a given type of user query, and what alternative retrieval process should be applied (perhaps automatically) to a query when a given retrieval process fails.
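The general idea of applying an alternative retrieval process automatically when one fails can be sketched as a simple fallback chain. The Python fragment below is a hypothetical illustration only: the catalog records, the three retrieval processes, and their ordering are assumptions, and the code does not reproduce the search trees proposed by Drabenstott and Vizine-Goetz.

```python
# Hypothetical catalog: record id -> (subject heading, title).
CATALOG = {
    1: ("art, medieval", "painting in the middle ages"),
    2: ("architecture, gothic", "medieval cathedrals of france"),
    3: ("art, modern", "art since 1945"),
}


def _tokens(text):
    """Split text into words, treating commas as separators."""
    return text.replace(",", " ").split()


def exact_heading(query):
    """Match the whole query against the subject heading."""
    return [rid for rid, (heading, _) in CATALOG.items() if heading == query]


def keyword_and(query):
    """Every query word must occur in the heading or title."""
    words = _tokens(query)
    return [rid for rid, (heading, title) in CATALOG.items()
            if all(w in set(_tokens(heading + " " + title)) for w in words)]


def keyword_or(query):
    """At least one query word must occur in the heading or title."""
    words = _tokens(query)
    return [rid for rid, (heading, title) in CATALOG.items()
            if any(w in set(_tokens(heading + " " + title)) for w in words)]


def search_with_fallback(query):
    """Try each process in turn; stop at the first that retrieves anything."""
    query = query.lower()
    for name, process in [("exact heading", exact_heading),
                          ("keyword AND", keyword_and),
                          ("keyword OR", keyword_or)]:
        hits = process(query)
        if hits:
            return name, hits
    return "no match", []
```

Returning the name of the process that succeeded, alongside the hits, is a design choice of this sketch: it gives the interface something to report to the searcher about how the results were obtained.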

To summarize, studies of end-user searching of online catalogs show that: 1) online catalog users experience the most difficulties with subject searching and 2) users experience difficulties in conducting effective searches in the "friendlier" but conventional Boolean online catalogs. Not surprisingly, when surveyed, online catalog users rank improvements to subject searching a high priority among various system enhancements. Nonetheless, many research studies report high levels of user enthusiasm and satisfaction with the use and performance of these retrieval systems, even though there is considerable evidence that search success rates are far from optimal (Ankeny, 1991; Kenny and Schroeder, 1992). Studies of search outcomes reveal seemingly high rates of search failure: for example, nothing is retrieved by many searches, or the retrieval sets are very large and often remain unscanned by the user. Indications are that between one-third and one-half of all searches result in no items retrieved; and user-entered subject search terms seldom match the indexing subject vocabulary of the catalog. A good, recent review of these troublesome research findings can be found in Thorne and Whitlatch's "Patron Online Catalog Success" (1994).

The areas that present the most difficulties to searchers of today's online catalogs are the following: understanding and using Boolean operators; implementing good, appropriate search strategies for the task at hand; expressing accurate and complete queries in a form acceptable to the system; and matching search terms to the system's indexing language. That these conceptual and "entry" requirements for good searching represent formidable challenges even for highly-trained database search specialists indicates that fundamentally new and different approaches may need to be applied in the design of end-user search interfaces and retrieval methods. In their research on users of conventional online database search services, both Vigil and Bellardo point out that formulating accurate, clear search queries is a complex cognitive activity that imposes a very high cognitive load; and further, according to Vigil, often results in "a cognitive strain that limits the mental resource and energy which can be devoted to the primary task of judging relevancy" (Vigil, 1983; Bellardo, 1985). Borgman reached similar conclusions in her study of online catalog users (1986).

Boolean retrieval may be a powerful, manageable, and effective search method in the hands of knowledgeable, trained, experienced search intermediaries whose most common search objective is to produce that one best set of results as a product that can be delivered to the user. Boolean methods give the intermediaries a degree of control over the search and retrieval process they apparently find desirable, although research indicates they generally underutilize these capabilities and conduct brief and simple searches. Hsieh-Yee has published an informative review of studies of Boolean searchers (1993). When Boolean is proposed as an end-user search interface, it is hard to marshal arguments for an approach that requires for its effective use a large amount of specialized knowledge (set-theoretic knowledge), fairly sophisticated cognitive and linguistic abilities, and ongoing skills training. The easier-to-use implicit Boolean OPACs may be doing their users a great disservice by concealing these requirements for effective searching within the Boolean model, increasing the chances that search results will be misinterpreted. Much research has been directed toward the design of user-oriented interfaces for conventional Boolean IR systems and OPACs with the commendable aim of making them easier to use by the growing population of end users. Unfortunately, as Hsieh-Yee reminds us (1993), "most of them were developed on the basis of interface designers' idea of how people search, instead of empirical evidence on end users' search behavior."

In her paper, "User Friendliness and Human-Computer Interaction in Online Library Catalogues," Hancock-Beaulieu (1992) identifies today's menu-driven interfaces, Boolean searching, and poor document analysis and representation as the major barriers to effective searching and interaction. We agree with her argument that more usable, more effective systems will not result from improvements to the interface alone: "Clearly a more friendly interface which enables the user to search more intuitively cannot be developed independently, without taking account of the functionality of the search software and the nature of the raw database."

In the next section, the recent theoretical and experimental contributions of information retrieval researchers are reviewed, and it is argued that a new model of information seeking and retrieval is needed, a model that more closely describes much of the subject searching and browsing activity actually conducted by library users. It is suggested that a new non-query-oriented search and browse paradigm may be appropriate for guiding the design of subject access mechanisms in future online catalogs.