Overview
Many languages have processes by which words can change systematically through inflection. Examples of inflectional changes include number (singular/plural), gender, case, verb tense and mood. For example, the terms book and books are inflectionally related in English, as are the terms coupé and coupée (in French) and the terms sprechen and sprach (in German).
It is usually desirable for a word in a query to match inflected forms of that word in documents. AIE uses dictionaries of inflections to improve search results. Within each family of related terms (run, ran, running), one word is chosen to represent the whole family (run), and this word is referred to as the lemma. The dictionary is referred to loosely as the lemmatization dictionary, and the process of matching inflectionally related terms is referred to as lemmatization. Collapsing a family of inflected words into a single lemma is known as lemmatization by reduction. This is AIE's preferred method of lemmatization across all languages.
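The idea of lemmatization by reduction can be illustrated with a minimal sketch. This is not AIE's implementation or API: the dictionary contents and the names LEMMA_DICTIONARY, lemmatize, and matches are hypothetical, chosen only to show how mapping every inflected form to one lemma lets a query term match its variants.

```python
# Toy lemmatization dictionary (hypothetical data, not AIE's actual dictionary).
# Every inflected form in a family maps to the single lemma chosen for it.
LEMMA_DICTIONARY = {
    "run": "run",
    "ran": "run",
    "running": "run",
    "book": "book",
    "books": "book",
}


def lemmatize(token: str) -> str:
    """Return the lemma for a token, or the lowercased token if it is unknown."""
    key = token.lower()
    return LEMMA_DICTIONARY.get(key, key)


def matches(query_term: str, document_text: str) -> bool:
    """Return True if any whitespace-separated document token shares the query's lemma."""
    query_lemma = lemmatize(query_term)
    return any(lemmatize(tok) == query_lemma for tok in document_text.split())
```

Because both the query term and the document tokens are reduced to the same lemma before comparison, a query for "run" would match a document containing "ran" in this sketch. A real system would apply the same reduction at index time rather than scanning text per query.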
English has relatively few (2-4) inflectionally related forms for a given word. Languages such as French or German may have dozens of inflectional variants, and some languages, such as Turkish and Finnish, have hundreds of inflectional variants for a given word. In addition, languages such as Hebrew and Arabic (which are typically written without vowels) require language-specific rules to determine the lemma corresponding to an ambiguous word. The languages supported by AIE, either natively (English), through the Advanced Linguistics Module, or through individual language modules, implicitly include language-specific rules and lemmatization dictionaries.
Note that, as of release 4.2, lemmatization is no longer configurable in AIE. Lemmatization is applied automatically during tokenization of ingested text. To disable lemmatization for English, see English (en) - Lemmatization.