Subject Searching: Why a Taxonomy, Thesaurus, or Controlled Vocabulary Still Helps in the Age of Search

Subjects, topics, index terms, keywords, controlled vocabulary, thesaurus, taxonomy. These all refer to an organized, precise way to find and retrieve desired information, where that information has been indexed to terms. Indexing content with subject terms can be manual or automated, but in either case the focus is on what the content is about, not what words appear in the text. The subject terms represent unambiguous concepts. A concept may have synonyms, which are often included as cross-references that redirect to the preferred term and thus to the same set of content. Before the era of digital content, subject categories or index terms were the only method to find specific information, such as in a back-of-the-book index or business categories in the yellow pages.
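
To make the cross-reference idea concrete, here is a minimal sketch in Python. The vocabulary, the document IDs, and the function names (resolve, find_by_subject) are all invented for illustration and do not come from any particular thesaurus or product.

```python
# Hypothetical mini-vocabulary: every entry term (preferred term or synonym)
# points to exactly one preferred term. All terms here are made up.
USE_REFERENCES = {
    "automobiles": "Automobiles",
    "cars": "Automobiles",
    "motor cars": "Automobiles",
    "physicians": "Physicians",
    "doctors": "Physicians",
}

# Hypothetical subject index: preferred term -> IDs of content indexed to it.
SUBJECT_INDEX = {
    "Automobiles": ["doc-1", "doc-4"],
    "Physicians": ["doc-2"],
}


def resolve(term):
    """Return the preferred term for an entry term, or None if it is unknown."""
    return USE_REFERENCES.get(term.strip().lower())


def find_by_subject(term):
    """Look up content by subject, following any synonym cross-reference."""
    preferred = resolve(term)
    return SUBJECT_INDEX.get(preferred, []) if preferred else []


# A synonym and the preferred term lead to the same set of content.
print(find_by_subject("Cars"))
print(find_by_subject("Automobiles"))
```

Whichever synonym the searcher starts from, the lookup ends at one preferred term, so the content only has to be indexed once.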

Using subject terms to find desired content contrasts with using a search engine for full-text search. Search is based on the occurrence of words, not concepts, so appropriate results can be missed if they use different wording for the same concept, and inappropriate results can be retrieved if a word has multiple meanings. The accuracy of search, without the additional support of index terms or subjects, depends instead on the sophistication of the algorithms. The combinations of algorithms have improved only slightly in the past decade or two. What has made a bigger difference in retrieving good results through search (without subject indexing) is that in many cases the volume of content has grown, so when search results are arranged by relevancy, a larger number of the initially displayed results are satisfactory.
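
The two failure modes of word-based matching can be seen in a deliberately naive sketch. The three documents and the word_search function are invented for illustration; a real search engine does far more than this (stemming, ranking, and so on), so treat it only as a picture of the underlying problem.

```python
import re

# Hypothetical documents, made up for this example.
DOCS = {
    "doc-1": "The new car models have better mileage.",
    "doc-2": "Automobile sales rose last quarter.",
    "doc-3": "The train car was uncoupled at the station.",
}


def word_search(query):
    """Return IDs of documents whose text contains the query as a whole word."""
    q = query.lower()
    return sorted(
        doc_id
        for doc_id, text in DOCS.items()
        if q in re.findall(r"[a-z]+", text.lower())
    )


# "car" misses doc-2 (same concept, different wording) and
# retrieves doc-3 (same word, different meaning: a railway car).
print(word_search("car"))
```

Under these assumptions, a search for "car" returns doc-1 and doc-3 but not doc-2, even though doc-2 is about the concept the searcher wanted and doc-3 is not.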

There are two issues with this kind of search. First, ordering results by relevancy is not always the preferred option. Sometimes searchers are interested in timely stories, so they want their results to be ordered by date, newest first, but when relying on a search engine, the newer results might not all be sufficiently relevant. Second, such results are good for the searcher who only wants to get some or enough information on a topic. If instead the searcher wants to perform an exhaustive search and retrieve everything available on a topic, there will likely be relevant content that is missed in the search retrieval because it was worded differently. Indexing with subject terms improves both precision (accuracy, where incorrect content is not retrieved) and recall (comprehensiveness, where appropriate content is not missed).
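
Precision and recall have standard definitions that are easy to compute once you know which items are truly relevant. The sketch below uses those definitions; the function name and the sample document IDs are made up for illustration and are not measurements from any real system.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one search.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four documents are truly about the topic.
relevant = ["d1", "d2", "d3", "d4"]

# A search that finds two of them plus one off-topic document.
precision, recall = precision_recall(["d1", "d2", "d9"], relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case, one of three retrieved items is wrong (lower precision) and two of the four relevant items are missed (lower recall), which is the pattern described above for searches that depend on wording.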

The role that index terms play in the search process has evolved. Originally, researchers started with browsing a full list of subjects that may have been arranged alphabetically (as a traditional book-style index) or hierarchically (as a taxonomy), and they navigated the index to find more specific subdivisions as aspects of the main heading, or they navigated the taxonomy to drill down to the most specific term. As the volume of indexed documents or other content items has grown over the years, browsing and selecting a term from a taxonomy or thesaurus is often no longer as practical or sufficient. An individual term may have too many records indexed to it. Furthermore, many taxonomies and thesauri have grown too large to easily browse.

So, instead of taxonomy terms being used as the primary starting point to find desired content, taxonomy terms are more often being used to narrow or filter search results. The user executes a search in the search engine, and if they get too many results, they can limit or filter the results by various aspects listed in the margin, including by indexed subject. (Other aspects could be date, document type, author, source, etc.) The subjects can display in order of frequency of occurrence on the records in the search result set, and the user can select among them, rather than having to browse the entire taxonomy or thesaurus.
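
Here is a minimal sketch of how such a subject filter could be built from a result set. The records, their field names ("id", "subjects"), and the two functions are assumptions made for illustration; actual search platforms provide their own faceting features.

```python
from collections import Counter

# Hypothetical search results, each already indexed with subject terms.
RESULTS = [
    {"id": "r1", "subjects": ["Taxonomies", "Search engines"]},
    {"id": "r2", "subjects": ["Taxonomies"]},
    {"id": "r3", "subjects": ["Search engines"]},
    {"id": "r4", "subjects": ["Taxonomies", "Indexing"]},
]


def subject_facet(results):
    """List (subject, count) pairs for the result set, most frequent first."""
    counts = Counter(s for record in results for s in record["subjects"])
    return counts.most_common()


def filter_by_subject(results, subject):
    """Keep only the results indexed with the selected subject."""
    return [record for record in results if subject in record["subjects"]]


# Show the subjects in the margin, then narrow by the one the user picks.
for subject, count in subject_facet(RESULTS):
    print(f"{subject} ({count})")
narrowed = filter_by_subject(RESULTS, "Indexing")
print([record["id"] for record in narrowed])
```

Only the subjects that actually occur in the current results are listed, which is what spares the user from browsing the whole taxonomy.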

Use of subjects and other attributes to limit search results is becoming very common across various implementations, such as enterprise search systems for finding internal corporate documents, ecommerce websites for selecting products, and library databases for selecting research articles, so most people are familiar with using them.

The use of subjects to limit search results is similar to a faceted taxonomy, although the designation “faceted taxonomy” typically refers to a taxonomy where different types of terms are grouped into multiple facets. In other words, a faceted taxonomy involves several facets or filters, whereas a traditional taxonomy or thesaurus may comprise a single facet or filter, which may be used in combination with other, non-taxonomy filters.

I will be exploring and demonstrating this topic, specifically in the case of library subscription databases, in a presentation, “Customer Focused Thesauri,” in addition to a pre-conference workshop on taxonomy creation, at the Computers in Libraries conference in Arlington, VA, in April.