In planning a taxonomy, I have often said that it is important to define the taxonomy’s scope, specifically the subject area scope of the taxonomy’s terms, but without going into more detail. Recently a client asked me how to define a taxonomy’s scope. This is a good question. The taxonomy should be suited to the subject area scope of the content that will be tagged with it and to the scope of users’ expectations. Terms or topics only marginal to the subject scope could nevertheless occur in the content, which raises the question of whether they should also be included in the taxonomy. Ultimately, that should depend on whether user expectations justify it, as the needs of users should also be a factor in creating a taxonomy. A taxonomy should suit both its content and its users.
Sources for Taxonomy Terms
For content as a source of taxonomy terms, a combination of manual and
automated approaches is recommended. By manually reviewing sample
individual documents or content items, you can discern the main ideas
and main topics, which should form the start and basic structure of the
taxonomy and also help define its scope. Automated methods of extracting
terms, through text analytics technologies, can bring in many
additional terms from a much larger corpus of documents more quickly,
picking up terms that a limited manual review would miss. Even though
automated text analytics extracts terms based on relevancy and frequency
of occurrence, such terms could be out of scope of the subject domain.
That’s why it’s important to start first with a manual review of content
to define the subject scope. Then, when you enrich the taxonomy with
automated extraction, you can approve terms that appear to be in scope
or at least closely relevant and reject others. But should you reject
all that are out of scope, even if they appear with sufficient frequency
and relevancy? My advice is to try to assume the role of the user. Ask
yourself: Might a user want to search this content for that term?
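The review step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a substitute for a real text analytics tool: the sample documents, the `scope_keywords` set (standing in for the scope found during manual review), the stopword list, and the `min_docs` threshold are all hypothetical. Frequent candidates that touch the manually defined scope are proposed for approval, and the rest are set aside for a human decision rather than rejected outright.

```python
import re
from collections import Counter

# Hypothetical sample content and scope keywords, for illustration only.
documents = [
    "Roast chicken with lemon and thyme. Nutrition facts per serving.",
    "Grilled salmon with lemon butter. Nutrition facts included.",
    "Slow braised beef with red wine.",
]
scope_keywords = {"chicken", "salmon", "beef", "lemon", "roast", "grilled", "braised"}
STOPWORDS = {"with", "and", "per"}


def candidate_terms(docs, max_words=2):
    """Count how many documents each word or two-word phrase appears in."""
    doc_freq = Counter()
    for text in docs:
        words = re.findall(r"[a-z]+", text.lower())
        seen = set()
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                phrase = words[i:i + n]
                # Skip phrases that begin or end with a stopword.
                if phrase[0] in STOPWORDS or phrase[-1] in STOPWORDS:
                    continue
                seen.add(" ".join(phrase))
        doc_freq.update(seen)
    return doc_freq


def triage(doc_freq, keywords, min_docs=2):
    """Split frequent candidates into likely in-scope and needs-review lists."""
    in_scope, review = [], []
    for term, count in sorted(doc_freq.items()):
        if count < min_docs:
            continue
        if any(word in keywords for word in term.split()):
            in_scope.append(term)
        else:
            review.append(term)
    return in_scope, review


in_scope, review = triage(candidate_terms(documents), scope_keywords)
print("Likely in scope:", in_scope)
print("Needs a human decision:", review)
```

With this sample data, “nutrition facts” ends up in the review list: it is frequent, but it does not match the scope defined so far, which is exactly the kind of term to weigh from the user’s point of view.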
For user needs and expectations as a contributing source of taxonomy terms, obtaining this information can be very direct, such as by creating a user questionnaire (at least for your internal users) that asks what the topics of importance are, how they would define the scope, and what “marginal” topics would be acceptable to include. You could also request sample challenging (not expected, basic, typical) queries that the users would make. Another good way to obtain input from the user side is to look at search query logs that list search strings that users have entered over a period of time, ranked by frequency. If a search phrase that is slightly out of scope of the subject occurs frequently, then the term should still be considered for inclusion in the taxonomy.
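As a rough sketch of the search log review, the following Python counts normalized query strings and lists the frequent ones that do not yet match a taxonomy term. The log entries, the existing `taxonomy_terms`, and the `min_count` threshold are hypothetical, and exact string matching is a simplifying assumption; a real log would need more careful normalization.

```python
from collections import Counter

# Hypothetical search log and existing taxonomy terms, for illustration only.
query_log = [
    "chicken soup",
    "Nutrition Facts",
    "nutrition facts",
    "gluten free",
    "chicken soup",
    "nutrition  facts",
    "wine pairing",
]
taxonomy_terms = {"Chicken soup", "Gluten free"}


def normalize(text):
    """Lowercase the text and collapse repeated whitespace."""
    return " ".join(text.lower().split())


def unmatched_queries(log, terms, min_count=2):
    """Return frequent queries, most frequent first, that match no taxonomy term."""
    counts = Counter(normalize(query) for query in log)
    known = {normalize(term) for term in terms}
    return [
        (query, count)
        for query, count in counts.most_common()
        if count >= min_count and query not in known
    ]


for query, count in unmatched_queries(query_log, taxonomy_terms):
    print(f"{count:>3}  {query}")
```

The output is a short list of candidate terms for a person to consider, not an automatic addition to the taxonomy.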
In either case, the scope of the subject gets better defined as the taxonomy is created. For example, a taxonomy for recipes may initially be scoped to comprise terms for the names of dishes, ingredients, and cooking methods. But then a different term shows up with significant frequency, “Nutrition Facts.” If it occurs in both the content and the user research, then it likely should be included. If it shows up in the content only, but is not validated in user research, then it is more questionable.
The initial taxonomy structure itself tends to impose limits on scope.
Taxonomies tend to be hierarchical with a limited number of top terms.
If a candidate term that appears in the content does not seem to belong
anywhere in the current taxonomic hierarchy, you might be inclined to
exclude it. Factors of user needs (they might want to look up this term
in this content), however, should take precedence. For example, the term
“COVID-19” might be marginal but still worth including in many
taxonomies on varied subjects, yet those taxonomies would have no broader
term for diseases. Then adjustments need to be made, such as
renaming or adding broader terms, or perhaps, more likely, the proposed
term should be modified to fit the context of the taxonomy, such as
becoming “COVID-19 impacts.”
Another thing to consider is adopting more of a thesaurus structure than a taxonomy structure, at least for the facet or concept scheme of the taxonomy that is for miscellaneous “topics.” One characteristic of thesauri is that they do not rely so heavily on extensive hierarchical trees. This means you could decide that it is acceptable for some terms to have no broader term, and thus acceptable to have a very large number of top terms, with the more specific terms linked to other terms by related-term relationships (another feature of thesauri) where broader/narrower-term relationships do not apply. Abandoning the full hierarchical tree structure should only be considered if this hierarchy is not displayed to end users as navigation.
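To make the structural difference concrete, here is a minimal Python sketch of a thesaurus-style concept scheme in which a broader term is optional and related-term links are symmetric. The `Concept` and `Thesaurus` classes and the sample terms are hypothetical and do not represent the data model of any particular taxonomy tool.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    label: str
    broader: set[str] = field(default_factory=set)  # may stay empty
    related: set[str] = field(default_factory=set)


class Thesaurus:
    def __init__(self):
        self.concepts: dict[str, Concept] = {}

    def add(self, label, broader=None):
        """Add a concept; a broader term is optional in a thesaurus."""
        concept = self.concepts.setdefault(label, Concept(label))
        if broader is not None:
            self.concepts.setdefault(broader, Concept(broader))
            concept.broader.add(broader)
        return concept

    def relate(self, first, second):
        """Record a symmetric related-term link between two concepts."""
        self.add(first).related.add(second)
        self.add(second).related.add(first)

    def top_terms(self):
        """Concepts with no broader term, in alphabetical order."""
        return sorted(
            label for label, concept in self.concepts.items() if not concept.broader
        )


thesaurus = Thesaurus()
thesaurus.add("Soups", broader="Dishes")
thesaurus.add("Nutrition Facts")  # no broader term, so it becomes a top term
thesaurus.relate("Nutrition Facts", "Ingredients")
print(thesaurus.top_terms())
```

In this sketch, “Nutrition Facts” sits at the top level without a parent and is still reachable from “Ingredients” through the related-term link, so the number of top terms grows while the terms stay connected.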
In any case, you need to define policies regarding what kinds of terms can
be added and what kinds should not. This will evolve out of the activity
of building the taxonomy, especially from evaluating what extracted
terms to approve and what search log terms to approve. Whoever is doing
this task (hopefully more than one person) should document each
instance of uncertainty. While many term approvals and rejections will
be obvious, there will be a gray area. These uncertain cases should be
collected and discussed together, and then a policy can emerge.