Overview

Language identification (often called "locale detection") is the process of automatically identifying the language or languages found in a block of text. The first step in the linguistic analysis is to perform language identification on the text field of the document. (The text field usually provides the largest sample of text to analyze.)  The language is temporarily encoded in the field's locale property for internal use. 

Many Attivio transformers use the locale property to inform their operation. For example, tokenizers are selected by language, synonym expansion looks for a language-specific synonym dictionary, and key-phrase extraction uses the language-specific model.  In addition, the SAIL interface displays a graph of detected languages when appropriate.

This page explains how to enable and set up automatic locale identification for Attivio Core Linguistics, and how to override the default behavior.

 

Related Topics

Language identification is also performed by ALM Language Identification.

Locale codes are mapped to language-specific tokenizers in <project-dir>\conf\features\core\TokenizerModel... files.

Non-English Queries also use language codes to guide text analysis of the query.

Exposing Locale Properties demonstrates how to inspect locale settings of an IngestDocument during ingestion.

Recognition is not Analysis

The fact that Attivio can recognize a language does not mean that there are analytic tools available for that language.  

Non-English Queries

Attivio assumes that queries are written in English unless you specify the language used in the query. See Non-English Queries for more information.

Important Concepts

Language identification can be a confusing topic.  Here are some concepts that help to keep everything in perspective.

Locale vs. Language

Locale values are two-letter language-identification codes that are temporarily attached to documents, fields, and field values during ingestion.  (A typical locale code would be "en", meaning English.) The locale properties are consulted during text analysis to determine which language-specific tools should be applied to a particular field value.  The locale properties are normally invisible and ephemeral. A field's locale may be accessed programmatically, but is not visible when a field or field value is copied or displayed. Locale properties are not stored in the Attivio index.

The language and languages fields are string-valued fields with human-readable values ("English" instead of "en"). They are set as a byproduct of locale detection.  These fields are indexed and stored, and can be queried or displayed as facets. 

Best Practice

If you write a custom workflow component that sets locale or language values, the “best practice” is to set both values. This is what Attivio's language identifier component does.

Language Codes, Country Codes, and Variants

For the most part, a document's locale is described as a two-letter language code, such as en (English) or zh (Chinese).

However, it is possible for a locale to be much more specific.  For instance, a locale can sometimes include a country code, as in en-US or en-GB.  In other situations, the locale can encode a regional variant of the language, such as zh_TW, which is Chinese as used in Taiwan.

Country and variant codes are relatively unreliable, so the "best practice" is to ignore them. 
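Ignoring country and variant codes amounts to keeping only the part of the locale string before the first separator. The following is a minimal sketch in Python, not Attivio code; the function name language_code is hypothetical, and it assumes that locale strings use either a hyphen or an underscore as the separator.

```python
def language_code(locale: str) -> str:
    """Return only the language part of a locale string.

    Hypothetical helper: accepts "en", "en-US", or "zh_TW" and
    discards any country or variant suffix.
    """
    # Treat "_" and "-" alike, then keep the text before the first separator.
    primary = locale.strip().replace("_", "-").split("-", 1)[0]
    return primary.lower()
```

With this helper, both en-US and en resolve to en, and zh_TW resolves to zh, so language-specific tools can be selected on the language code alone.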

LocaleDetector Component

The localeDetector component of the ingestInit workflow encapsulates an instance of the DetectLocale  transformer.  The localeDetector component is configured in <project-dir>\conf\components\localeDetector.xml, and can be reconfigured in the Attivio Administrator.

The localeDetector processes one or more fields in a document, and every value within those fields.  The default behavior is to process the text field only, but you can configure it to include other fields.  Locale detection is sensitive to sample size, and the text field usually provides a large enough sample for reliable results.  The Maximum Length parameter stops Attivio from processing the entire body of a book if it should happen to appear in a text field. If Maximum Length is set to 2000, then Attivio uses only the first 2000 characters of each field when performing language identification.

Once the locale of the first field value is determined, it is automatically assigned as the locale of the field and of the document.  At the same time, the human-readable name of the language overwrites the value in the language field, and is added to the list of languages in the languages field.
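The bookkeeping described above can be sketched as follows. This is an illustration only: the Doc class and record_locale function are hypothetical stand-ins, not Attivio classes, and skipping duplicate entries in the languages list is an assumption that the text above does not confirm.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a document during ingestion."""
    locale: Optional[str] = None
    field_locales: Dict[str, str] = field(default_factory=dict)
    fields: Dict[str, List[str]] = field(default_factory=dict)


def record_locale(doc: Doc, field_name: str, code: str, name: str) -> None:
    """Record the locale detected for the first value of field_name."""
    doc.field_locales[field_name] = code   # locale of the field
    doc.locale = code                      # locale of the document
    doc.fields["language"] = [name]        # overwrites any earlier value
    languages = doc.fields.setdefault("languages", [])
    if name not in languages:              # assumption: no duplicate entries
        languages.append(name)
```

For a document whose language field already holds "French", recording "en" and "English" replaces the language value with "English" and leaves both names in the languages field.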

The Minimum Length parameter is the cutoff below which locale detection is unreliable.  If Minimum Length is 50 and the available text field value is shorter than 50 characters, locale detection is not attempted.  Instead, the text falls through to a "Short Detector" function.

The default Short Detector is called RuleBasedLanguageIdentifierDefaultEn.  It can identify Chinese, Japanese, Korean, Hebrew, Russian, and Arabic from very small samples of the characters used.  It also recognizes Latin characters, which it assumes indicate English.  
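A character-based detector of this kind can work from Unicode code-point ranges alone. The sketch below is an illustration in Python and does not reproduce the internals of RuleBasedLanguageIdentifierDefaultEn, which are not documented here. The ranges, the function name short_detect, and the rule that ideographs mixed with kana count as Japanese are all assumptions.

```python
from typing import Dict, Optional

# (first code point, last code point, locale); a simplified, assumed rule set.
_RANGES = [
    (0x0590, 0x05FF, "he"),  # Hebrew block
    (0x0600, 0x06FF, "ar"),  # Arabic block
    (0x0400, 0x04FF, "ru"),  # Cyrillic block, assumed to mean Russian
    (0xAC00, 0xD7AF, "ko"),  # Hangul syllables
    (0x3040, 0x30FF, "ja"),  # Hiragana and Katakana
    (0x4E00, 0x9FFF, "zh"),  # CJK unified ideographs
    (0x0041, 0x005A, "en"),  # Latin capital letters, assumed to mean English
    (0x0061, 0x007A, "en"),  # Latin small letters
]


def short_detect(text: str) -> Optional[str]:
    """Guess a locale from the scripts used in a short string."""
    counts: Dict[str, int] = {}
    for ch in text:
        cp = ord(ch)
        for first, last, code in _RANGES:
            if first <= cp <= last:
                counts[code] = counts.get(code, 0) + 1
                break
    if not counts:
        return None  # no recognizable script; the caller decides what to do
    # Japanese mixes kana with ideographs, so count both as Japanese
    # whenever kana is present.
    if "ja" in counts and "zh" in counts:
        counts["ja"] += counts.pop("zh")
    return max(counts, key=counts.get)
```

A string of digits or punctuation matches none of the ranges, so the function returns None and leaves the decision to the caller.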

If the Short Detector fails to assign a locale, the Fallback Locale is used. 
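The interaction of Maximum Length, Minimum Length, the Short Detector, and the Fallback Locale can be summarized in a short sketch. The function and parameter names are hypothetical, the default values (50, 2000, and "en") are placeholders rather than confirmed Attivio defaults, and using the fallback when the full detector returns nothing is an assumption.

```python
from typing import Callable, Optional

Detector = Callable[[str], Optional[str]]


def detect_locale(
    value: str,
    full_detector: Detector,
    short_detector: Detector,
    fallback_locale: str = "en",   # placeholder value
    min_length: int = 50,          # placeholder value
    max_length: int = 2000,        # placeholder value
) -> str:
    """Choose a locale for one field value."""
    # Maximum Length: only the leading characters are examined.
    sample = value[:max_length]
    if len(sample) >= min_length:
        code = full_detector(sample)
    else:
        # Minimum Length: too little text, so use the Short Detector.
        code = short_detector(sample)
    # Fallback Locale: used when no detector produced a result.
    return code if code else fallback_locale
```

Passing the detectors in as arguments keeps the sketch independent of any particular detection algorithm.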

Override Locale Detection

There are a few languages that Attivio can analyze (using dedicated language-specific tokenizers) but cannot automatically recognize.  If you have documents in these languages, it will be necessary to override the localeDetector and force the correct locale code on each document.

This example presumes that the Factbook news connector has been reconfigured to read news feeds that we know are all in the Kazakh language.  The first step is to edit the news connector (System Management > Connectors).  Go to the Field Mappings tab, in the Static Field Values section.  Add these fields:

  • KKLocale (or any name you wish) with the value kk (or any language code you wish). 
  • language (required field) with the value Kazakh (or any human-readable language name).
  • languages (required field) with the same value that you used for the language field.

Note that we have not actually set the locale of the document yet.  Locale is an internal property of the document, not a field that we can set directly. Instead, we have stored the desired locale value in the KKLocale field, and we have populated the language and languages fields with the "Kazakh" label.  Documents ingested through other connectors will not have these values.

The second step is to create a new document transformer that can pick up the KKLocale value and use it to set the document's locale. Navigate to Palette > New > Filter by "locale" > Document Transformers > SetLocale and open it in an editor.  We called the transformer setLocaleToKK.  Set the Locale Fieldname to the KKLocale field. This is where the transformer will look for the desired locale value. It will set the document's locale property to whatever value it finds in this field.  It is important to leave the Default Locale field blank!  This transformer will be processing all incoming documents.  We want it to set the locale on the Kazakh documents only, not on all documents. Save this transformer.

Add setLocaleToKK to the ingestInit workflow.  Move it to the position just before the localeDetector component. 

At runtime, setLocaleToKK scans all incoming documents for the ones that have the KKLocale field, and sets the document locale to the value of that field.  If the document does not have a KKLocale field, it passes through unchanged.  The next stage is the localeDetector, which identifies the document's locale in the usual manner.  Note that localeDetector (by default) will not override an existing locale setting, so it lets the Kazakh documents pass by unchanged. 
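The runtime behavior of the two stages can be modeled in a few lines. This is a sketch of the logic only, written in Python with plain dictionaries standing in for documents; the function names and the override_existing flag are hypothetical and are not the Attivio API.

```python
from typing import Callable, Optional


def set_locale(doc: dict, locale_field: str,
               default_locale: Optional[str] = None) -> None:
    """Copy a locale code from a regular field to the locale property."""
    code = doc["fields"].get(locale_field) or default_locale
    if code:
        doc["locale"] = code
    # With no field value and no default, the document passes unchanged.


def locale_detector(doc: dict, detect: Callable[[str], str],
                    override_existing: bool = False) -> None:
    """Detect the locale from the text field unless one is already set."""
    if doc.get("locale") and not override_existing:
        return  # an existing locale is left alone
    doc["locale"] = detect(doc["fields"].get("text", ""))


def ingest_init(doc: dict, detect: Callable[[str], str]) -> dict:
    """Run the two stages in workflow order."""
    set_locale(doc, "KKLocale")   # Default Locale deliberately left blank
    locale_detector(doc, detect)
    return doc
```

A document that carries KKLocale with the value kk leaves the second stage with the locale kk, while a document without that field receives whatever the detector returns. If a default locale of kk were supplied to the first stage, every document would be marked as Kazakh, which is why the Default Locale is left blank.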

Recognized Languages

The Attivio Core Linguistics module can recognize the following languages:

Language Name    ISO 639-1 two-letter code
Arabic           ar
Danish           da
German           de
Ewe              ee
Greek            el
English          en
Spanish          es
Estonian         et
Farsi            fa
Finnish          fi
French           fr
Hebrew           he
Hungarian        hu
Icelandic        is
Italian          it
Japanese         ja
Korean           ko
Dutch            nl
Norwegian        no
Polish           pl
Portuguese       pt
Russian          ru
Swedish          sv
Thai             th
Turkish          tr
Chinese          zh

 

 

 

 
