Overview
The ALM uses language-identification algorithms that recognize over fifty modern languages. Most of these languages (over thirty) also support language-specific tokenization and text analytics.
Some languages can be identified but have no text-analytics features available. Text in these languages is tokenized as if it were English, producing search results of varying quality depending on the language. You can also route such text to the tokenizer for a different language that, even if not a perfect match, might be more appropriate than English. This configuration is described on this page.
Required Modules
This feature requires that you include the alm module when you run createproject to create the project directories.
Related Topics
Language identification is also performed by Attivio Core Language Identification.
Locale codes are mapped to language-specific tokenizers in <project-dir>\conf\features\core\TokenizerModel... files.
Non-English Queries also use language codes to guide text analysis of the query.
Exposing Locale Properties demonstrates how to inspect locale settings of an IngestDocument during ingestion.
Recognition is not Analysis
The fact that AIE can recognize a language does not mean that there are analytic tools available for that language.
Non-English Queries
AIE assumes that queries are written in English unless you specify the language of the query. See Non-English Queries for more information.
Important Concepts
Language identification can be a confusing topic. Here are some useful concepts to help keep everything in perspective.
Locale vs. Language
Locale values are two-letter language-identification codes that are temporarily attached to documents, fields, and field values during ingestion. (A typical locale code would be "en", meaning English.) The locale properties are consulted during text analysis to determine which language-specific tools should be applied to a particular field value. The locale properties are normally invisible and ephemeral. A field's locale may be accessed programmatically, but is not visible when a field or field value is copied or displayed. Locale properties are not stored in the AIE index.
The language and languages fields are string-valued fields with human-readable values ("English" instead of "en"). They are set as a byproduct of locale detection. These fields are indexed and stored, and can be queried or displayed as facets.
If you write a custom workflow component that sets locale or language values, the "best practice" is to set both values. This is what AIE's language identifier component does.
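The idea of setting both values together can be sketched as follows. This is a minimal illustration only: the document is modeled as a plain Python dict, and the names `LANGUAGE_NAMES` and `set_locale_and_language` are hypothetical, not part of the AIE API.

```python
# Hypothetical stand-in for an ingested document: a plain dict, not the AIE API.
LANGUAGE_NAMES = {"en": "English", "fr": "French", "de": "German"}


def set_locale_and_language(doc, locale):
    """Set the locale code and the human-readable language fields together."""
    name = LANGUAGE_NAMES[locale]
    doc["locale"] = locale      # ephemeral code consulted during text analysis
    doc["language"] = name      # human-readable value, overwritten each time
    languages = doc.setdefault("languages", [])
    if name not in languages:   # accumulate every language seen so far
        languages.append(name)
    return doc


doc = set_locale_and_language({"text": "Bonjour tout le monde"}, "fr")
print(doc["locale"], doc["language"], doc["languages"])
```

Keeping the two assignments in one helper makes it harder for a custom component to update the locale while leaving a stale value in the language field.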
Language Codes, Country Codes, and Variants
When the ALM identifies the language of a block of text, it generally assigns one of the standard two-letter language codes to the locale property of that text. English is en; Spanish is es, and so forth. See below for a complete list.
However, in some instances the ALM will express a more specific opinion by appending a country code to the language code. For instance, it sometimes identifies English text as en-US or en-UK. In other situations, it may encode a language variant, such as zh_TW, which is Chinese as used in Taiwan.
Country and variant codes are relatively unreliable, so the "best practice" is to ignore them.
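One way to ignore country and variant codes in your own code is to keep only the leading language code. The sketch below assumes that the suffix is separated by either a hyphen or an underscore, as in the examples above; `base_language` is a hypothetical helper name, not an AIE function.

```python
import re


def base_language(locale: str) -> str:
    """Return the leading language code, dropping any country or variant suffix."""
    # Both "-" and "_" appear as separators in locale codes, so split on either.
    return re.split(r"[-_]", locale.strip(), maxsplit=1)[0].lower()


print(base_language("en-US"))  # en
print(base_language("zh_TW"))  # zh
print(base_language("es"))     # es
```

Note that a blanket rule like this also collapses codes whose suffix you may want to keep, such as codes that distinguish simplified from traditional Chinese, so you might exempt those before stripping.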
LocaleDetector Component
The localeDetector component of the ingestInit workflow encapsulates an instance of the BasisTechDetectLocale transformer. The localeDetector component is configured in <install-dir>\conf\basistech\module.xml, and can be reconfigured in the AIE Administrator.
The localeDetector processes one or more fields in a document, and every value within those fields. The default behavior is to process the text field only, but you can configure it to include other fields. Locale detection is sensitive to sample size, and the text field usually provides a large enough sample for reliable results.
Once the locale of the first field value is determined, it is automatically assigned as the locale of the field and of the document. At the same time, the human-readable name of the language overwrites the value in the language field, and is added to the list of languages in the languages field.
The Minimum Length parameter is the number of characters below which locale detection will not be attempted for a field value. We recommend setting this value to 0 for ALM language identification, allowing the detector to operate on field values of any length.
If the detector fails to assign a locale, the Fallback Locale is used.
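The interaction between Minimum Length and Fallback Locale can be sketched roughly as below. This is not the detector's implementation. It assumes that a value shorter than the minimum length simply receives the fallback locale, and it uses "en" as an illustrative fallback. The function name, parameter names, and the toy detector are all hypothetical.

```python
from typing import Callable, Optional


def detect_with_fallback(
    value: str,
    detect: Callable[[str], Optional[str]],
    minimum_length: int = 0,
    fallback_locale: str = "en",
) -> str:
    """Skip detection for short values; use the fallback when no locale is assigned."""
    if len(value) < minimum_length:
        # Assumption: a skipped value is treated like a failed detection.
        return fallback_locale
    return detect(value) or fallback_locale


def toy_detector(text: str) -> Optional[str]:
    """Stand-in for a real detector: recognizes one French word, nothing else."""
    return "fr" if "bonjour" in text.lower() else None


print(detect_with_fallback("Bonjour tout le monde", toy_detector))      # fr
print(detect_with_fallback("xyz", toy_detector))                        # en
print(detect_with_fallback("Bonjour", toy_detector, minimum_length=20)) # en
```

With `minimum_length=0`, as recommended above, the length check never skips a value, so only a failed detection triggers the fallback.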
The Short Detector bean is not used by the ALM locale detector and should be ignored.
The ALM also assigns locale values to individual phrases within a larger field value. It can identify French phrases in an English paragraph, for instance. If these individual phrases are too short, locale identification can be unreliable. The Shortest Language Length setting applies a minimum length to these phrases (default 50). If Filter Short Results is set to true, the ALM will revert these short phrases back to the default locale of the field. This lets us filter out erratic locale assignments on short phrases while keeping the more-reliable locale assignments for larger phrases within the same field.
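The effect of Shortest Language Length and Filter Short Results can be illustrated with a small sketch. It assumes that phrase length is measured in characters and that a phrase exactly at the threshold keeps its detected locale; neither detail is stated above. The function and parameter names are hypothetical.

```python
from typing import List, Tuple


def revert_short_phrases(
    phrases: List[Tuple[str, str]],
    field_locale: str,
    shortest_language_length: int = 50,
    filter_short: bool = True,
) -> List[Tuple[str, str]]:
    """Revert phrases shorter than the threshold to the field's default locale."""
    if not filter_short:
        return list(phrases)
    return [
        (text, locale if len(text) >= shortest_language_length else field_locale)
        for text, locale in phrases
    ]


phrases = [
    ("Bonjour", "fr"),  # 7 characters: too short to trust
    ("Il faut cultiver notre jardin, disait Candide à la fin du conte.", "fr"),
]
for text, locale in revert_short_phrases(phrases, field_locale="en"):
    print(locale, len(text))
```

In this example the short greeting reverts to the field locale, while the longer sentence keeps its detected French locale.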
Recognized Languages
The following languages (and the corresponding ISO language codes) can be identified by ALM. Most have language-specific tokenization and text analytics available, but some do not.
Language name | Language code (ISO 639-1) | Support |
---|---|---|
Albanian | sq | Identification and Analytics |
Arabic | ar | Identification and Analytics |
Bengali | bn | Identification only |
Bosnian | bs | Identification and Analytics |
Bulgarian | bg | Identification and Analytics |
Catalan | ca | Identification and Analytics |
Chinese (simplified) | zh_SC | Identification and Analytics |
Chinese (traditional) | zh_TC | Identification and Analytics |
Croatian | hr | Identification and Analytics |
Czech | cs | Identification and Analytics |
Danish | da | Identification and Analytics |
Dutch | nl | Identification and Analytics |
English | en | Identification and Analytics |
Estonian | et | Identification and Analytics |
Farsi (Persian, Dari) | fa | Identification and Analytics |
Finnish | fi | Identification and Analytics |
French | fr | Identification and Analytics |
German | de | Identification and Analytics |
Greek | el | Identification and Analytics |
Gujarati | gu | Identification only |
Hebrew | he (formerly "iw") | Identification and Analytics |
Hindi | hi | Identification only |
Hungarian | hu | Identification and Analytics |
Icelandic | is | Identification only |
Indonesian | id | Identification and Analytics |
Italian | it | Identification and Analytics |
Japanese | ja | Identification and Analytics |
Kannada | kn | Identification only |
Korean | ko | Identification and Analytics |
Kurdish | ku | Identification only |
Latvian | lv | Identification and Analytics |
Malay | ms | Identification and Analytics |
Malayalam | ml | Identification only |
Norwegian | nb | Identification and Analytics |
Pashto | ps | Identification only |
Polish | pl | Identification and Analytics |
Portuguese | pt | Identification and Analytics |
Romanian | ro | Identification and Analytics |
Russian | ru | Identification and Analytics |
Serbian | sr | Identification and Analytics |
Slovak | sk | Identification and Analytics |
Slovenian | sl | Identification and Analytics |
Somali | so | Identification only |
Spanish | es | Identification and Analytics |
Swedish | sv | Identification and Analytics |
Tagalog | tl | Identification only |
Tamil | ta | Identification only |
Telugu | te | Identification only |
Thai | th | Identification and Analytics |
Turkish | tr | Identification and Analytics |
Ukrainian | uk | Identification only |
Urdu | ur | Identification and Analytics |
Uzbek | uz | Identification only |
Vietnamese | vi | Identification only |
Note that "iw" and "he" are both accepted as language codes for Hebrew.
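If you need to check support levels in your own code, the table can be captured in a small lookup. The sketch below copies the codes from the table above and treats "iw" as an alias for "he"; the names `ANALYTICS`, `IDENTIFICATION_ONLY`, and `support_level` are hypothetical and not part of AIE.

```python
# Codes copied from the table above.
ANALYTICS = {
    "sq", "ar", "bs", "bg", "ca", "zh_SC", "zh_TC", "hr", "cs", "da", "nl",
    "en", "et", "fa", "fi", "fr", "de", "el", "he", "hu", "id", "it", "ja",
    "ko", "lv", "ms", "nb", "pl", "pt", "ro", "ru", "sr", "sk", "sl", "es",
    "sv", "th", "tr", "ur",
}
IDENTIFICATION_ONLY = {
    "bn", "gu", "hi", "is", "kn", "ku", "ml", "ps", "so", "tl", "ta", "te",
    "uk", "uz", "vi",
}
ALIASES = {"iw": "he"}  # older code for Hebrew


def support_level(code: str) -> str:
    """Classify a language code as 'analytics', 'identification', or 'unrecognized'."""
    code = ALIASES.get(code, code)
    if code in ANALYTICS:
        return "analytics"
    if code in IDENTIFICATION_ONLY:
        return "identification"
    return "unrecognized"


print(support_level("iw"))  # analytics
print(support_level("uk"))  # identification
print(support_level("xx"))  # unrecognized
```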
Configure a Tokenizer for an Unsupported Language
Note that the ALM preconfigures the linguistic analysis of over thirty common languages. One would not normally override these settings.
However, the languages that are listed as "Identification only" in the table above do not have language-specific tokenizers. They pass through a default tokenizer instead, as if the text were in English. Although this is not ideal, it often produces adequate search results.
If one of the non-English ALM tokenizers would be more appropriate to the language, you can redirect the incoming text to that tokenizer by editing the appropriate TokenizerModel file. For instance, this snippet of XML directs German and Dutch text to the "Germanic" tokenizer:
<f:tokenizer enabled="true" fallbackLocale="en" group="default" locales="de,nl" name="germanic.default" ref="tokenizer.basistech"/>
For best results, remember that queries need to be tokenized the same way as the documents. See Non-English Queries for more information.
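The routing idea behind the locales attribute can be modeled as a simple lookup. This sketch is an illustration only, not how AIE resolves tokenizers internally. It reuses the name and locale list from the XML snippet above, adds Icelandic ("is") purely as a hypothetical example of redirecting an identification-only language, and uses "default" as a placeholder name for the default tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenizerEntry:
    name: str
    locales: List[str]


def parse_locales(attr: str) -> List[str]:
    """Split a comma-separated locales attribute such as 'de,nl'."""
    return [code.strip() for code in attr.split(",") if code.strip()]


def select_tokenizer(locale: str, entries: List[TokenizerEntry],
                     default: str = "default") -> str:
    """Return the first tokenizer whose locale list contains the locale."""
    for entry in entries:
        if locale in entry.locales:
            return entry.name
    return default  # placeholder name for the default tokenizer


# Hypothetical edit: "is" appended to the locale list from the snippet above.
entries = [TokenizerEntry("germanic.default", parse_locales("de,nl,is"))]

print(select_tokenizer("de", entries))  # germanic.default
print(select_tokenizer("is", entries))  # germanic.default
print(select_tokenizer("bn", entries))  # default
```

Whether a given tokenizer actually suits a redirected language is something to verify against your own search results before relying on it.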