AIE performs linguistic transformation on incoming documents in order to create the richest possible index. Query strings must undergo the same linguistic transformations in order to match indexed documents accurately.
AIE is very good at identifying the language of incoming documents (and parts of documents), and then applying the transformations that are appropriate to that language.
Queries, however, are usually too short for language-identification algorithms to work. For this reason, it is critically important that you tell AIE which language a non-English query was written in. Otherwise AIE will analyze the query as if it were written in English, and search results may not be optimal.
Language identification is performed by both Attivio Core Language Identification and ALM Language Identification. The AIE core recognizes more than 25 languages; the ALM recognizes more than 50.
Locale codes are mapped to language-specific tokenizers in <project-dir>\conf\features\core\TokenizerModel... files.
Exposing Locale Properties demonstrates how to inspect locale settings of an IngestDocument during ingestion.
Examples of Non-English Queries
Designating the language of a non-English query lets AIE apply a specific set of linguistic transformations to the text of the query. The examples that follow illustrate differences in Lemmatization among several languages.
- If we search for the ubiquitous word "domino" using the default English query processing, AIE looks for documents that contain "domino."
- However, if we tell AIE that "domino" is written in Spanish or Portuguese, AIE searches for "domino" and also for its lemma "dominar."
- If we identify "domino" as Italian, AIE searches for "domino" and "domare."
- In Hungarian, AIE searches for "domino" and "dominó."
- For Polish, AIE matches documents containing "domino" and "domina."
From these simple examples you can imagine the impact this can have on complex queries. The default English version of the query could be significantly different from the non-English version, with a corresponding difference in matching documents.
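The effect of the locale on the "domino" examples can be sketched as a small lookup. This is an illustration only, not AIE's lemmatizer: the table `LEMMAS` and the function `expand_query_term` are hypothetical names, and the entries are just the examples listed on this page. A missing or unknown locale falls back to English, which mirrors the default behavior described earlier.

```python
# Hypothetical lemma table built only from the "domino" examples on this page.
# It stands in for AIE's real linguistic resources, which are not shown here.
LEMMAS = {
    "en": {},
    "es": {"domino": ["dominar"]},
    "pt": {"domino": ["dominar"]},
    "it": {"domino": ["domare"]},
    "hu": {"domino": ["dominó"]},
    "pl": {"domino": ["domina"]},
}

DEFAULT_LOCALE = "en"  # queries are analyzed as English unless told otherwise


def expand_query_term(term, locale=None):
    """Return the term followed by any lemmas known for it in the locale."""
    table = LEMMAS.get(locale or DEFAULT_LOCALE, LEMMAS[DEFAULT_LOCALE])
    return [term] + table.get(term.lower(), [])


if __name__ == "__main__":
    for code in ("en", "es", "it", "hu", "pl"):
        print(code, expand_query_term("domino", code))
```

The same query term produces a different set of search terms for each locale, which is why an unlabeled non-English query can miss documents that a correctly labeled one would match.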
Language of the Query
In this context, the "language" of the query means the spoken language that the words and phrases of the query were written in, such as English, French, German, or Thai.
However, the query's language can be specified in several ways, depending in part on which of AIE's Query Languages is in use. Here, "Query Languages" refers to AIE's Simple Query Language and Advanced Query Language.
Setting the Locale in the UI
The SAIL interface sets the query locale in the Locale field on the Results View tab of the Preferences dialog. To use the Locale field, type a single ISO 639-1 two-letter language code. For instance, enter "fr" to pose a query in French.
Debug Search Interface
The Debug Search interface uses the Locale control to set the language of both Simple and Advanced queries. To use the Locale control, type the ISO 639-1 two-letter code for the language.
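Both UI fields expect a two-letter language code. A minimal sketch of a format check that a client could run before filling in the field follows. `normalize_locale_code` is a hypothetical helper, not part of AIE, and it checks only the shape of the code (two ASCII letters), not whether AIE has a tokenizer mapped to that language.

```python
import re

# Two lowercase ASCII letters, the shape of an ISO 639-1 code.
_TWO_LETTER = re.compile(r"[a-z]{2}")


def normalize_locale_code(raw):
    """Return a lowercase two-letter code, or raise ValueError.

    Hypothetical helper: validates format only, not language support.
    """
    code = raw.strip().lower()
    if not _TWO_LETTER.fullmatch(code):
        raise ValueError(f"expected a two-letter language code, got {raw!r}")
    return code
```

For example, `normalize_locale_code(" FR ")` returns `"fr"`, while a three-letter code such as `"fra"` raises `ValueError`.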
Setting the Spoken Language in the Query
Under some circumstances you can specify the spoken language of the query (or of parts of the query) from within the query itself.
Simple Query Language
There is no construct within the Simple Query Language that lets you directly specify the spoken language of the query.
However, simple queries can be submitted in a variety of ways that let you designate the spoken language outside of the actual query string. See the Debug Search Interface section, above.
Advanced Query Language