
Overview

The ALM employs language-identification algorithms that can recognize over fifty modern languages. About forty of these languages can also be subjected to language-specific tokenization and text analytics.

Some languages can be identified even though no text-analytics features are available for them. By default, these languages are tokenized as if they were English, producing search results of varying quality depending on the language. It is also possible to route such text to the tokenizer of another language that, even if not a perfect match, might be more appropriate than English. This configuration is described on this page.

Required Modules

This feature requires that you include the alm module when you run createproject to create the project directories.

Related Topics

Language identification is also performed by Attivio Core Language Identification.

Locale codes are mapped to language-specific tokenizers in <project-dir>\conf\features\core\TokenizerModel... files.

Non-English Queries also use language codes to guide text analysis of the query.

Exposing Locale Properties demonstrates how to inspect locale settings of an IngestDocument during ingestion.

Recognition is not Analysis

The fact that AIE can recognize a language does not mean that there are analytic tools available for that language.  

Non-English Queries

AIE assumes that queries are written in English unless you specify the language used in the query. See Non-English Queries for more information.


Important Concepts

Language identification can be a confusing topic.  Here are some useful concepts to help keep everything in perspective.

Locale vs. Language

Locale values are two-letter language-identification codes that are temporarily attached to documents, fields, and field values during ingestion.  (A typical locale code would be "en", meaning English.) The locale properties are consulted during text analysis to determine which language-specific tools should be applied to a particular field value.  The locale properties are normally invisible and ephemeral. A field's locale may be accessed programmatically, but is not visible when a field or field value is copied or displayed. Locale properties are not stored in the AIE index.

The language and languages fields are string-valued fields with human-readable values ("English" instead of "en"). They are set as a byproduct of locale detection.  These fields are indexed and stored, and can be queried or displayed as facets.

If you write a custom workflow component that sets locale or language values, the “best practice” is to set both values. This is what AIE's language identifier component does.

Language Codes, Country Codes, and Variants

When the ALM identifies the language of a block of text, it generally assigns one of the standard two-letter language codes to the locale property of that text.  English is en; Spanish is es, and so forth.  See below for a complete list.

However, in some instances the ALM will express a more specific opinion by appending a country code to the language code.  For instance, it sometimes identifies English text as en-US or en-UK.  In other situations, it may encode a language variant, such as zh_TW, which is Chinese as used in Taiwan.

Country and variant codes are relatively unreliable, so the "best practice" is to ignore them.
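
One way to ignore country and variant codes in custom code is to reduce every locale string to its leading language code. The following is a minimal sketch, not part of the ALM: the function name base_language is hypothetical, and it assumes that the language part is always separated from the rest by a hyphen or an underscore. Note that the table of recognized languages on this page lists zh_SC and zh_TC for Chinese, so a real implementation might need to leave those two codes untouched.

```python
import re


def base_language(locale_code: str) -> str:
    """Return the language part of a locale code, dropping country or variant.

    Hypothetical helper for illustration only; it is not an ALM or AIE API.
    """
    # Split on the first hyphen or underscore: "en-US" -> "en", "zh_TW" -> "zh".
    language = re.split(r"[-_]", locale_code.strip(), maxsplit=1)[0]
    return language.lower()
```

For example, base_language("en-US") and base_language("en") both yield the same two-letter code, so downstream logic can treat them identically.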

LocaleDetector Component

The localeDetector component of the ingestInit workflow encapsulates an instance of the BasisTechDetectLocale transformer.  The localeDetector component is configured in <install-dir>\conf\basistech\module.xml, and can be reconfigured in the AIE Administrator.

The localeDetector processes one or more fields in a document, and every value within those fields.  The default behavior is to process the text field only, but you can configure it to include other fields.  Locale detection is sensitive to sample size, and the text field usually provides a large enough sample for reliable results. 

Once the locale of the first field value is determined, it is automatically assigned as the locale of the field and of the document.  At the same time, the human-readable name of the language overwrites the value in the language field, and is added to the list of languages in the languages field.
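
The bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not the component's actual code: the document is modeled as a plain Python dict, the key names locale and field_locales are hypothetical, the LANGUAGE_NAMES map holds only a few sample entries, and skipping duplicates in the languages list is an assumption.

```python
# Sample subset of code-to-name pairs; the real mapping is internal to AIE.
LANGUAGE_NAMES = {"en": "English", "fr": "French", "es": "Spanish"}


def apply_detected_locale(doc: dict, field: str, locale_code: str) -> None:
    """Record a detected locale on a dict-based stand-in for a document."""
    name = LANGUAGE_NAMES.get(locale_code, locale_code)
    # The detected locale becomes the locale of the field and of the document.
    doc.setdefault("field_locales", {})[field] = locale_code
    doc["locale"] = locale_code
    # The human-readable name overwrites the single-valued language field ...
    doc["language"] = name
    # ... and is added to the multi-valued languages field (duplicates are
    # skipped here, which is an assumption).
    languages = doc.setdefault("languages", [])
    if name not in languages:
        languages.append(name)
```

A custom component that sets locale values would follow the same pattern, keeping the locale code and the human-readable language fields in step.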

The Minimum Length parameter is the number of characters below which locale detection will not be attempted for a field value. We recommend setting this value to 0 for ALM language identification, allowing the detector to operate on field values of any length.

If the detector fails to assign a locale, the Fallback Locale is used.
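
The interaction between Minimum Length and Fallback Locale can be pictured with a small sketch. The detector argument is a stand-in callable, not the BasisTechDetectLocale transformer, and the parameter names are hypothetical. The sketch also assumes that a value shorter than the minimum length receives the fallback locale; the page does not state what happens in that case.

```python
from typing import Callable, Optional


def detect_locale(
    value: str,
    detector: Callable[[str], Optional[str]],
    minimum_length: int = 0,
    fallback_locale: str = "en",
) -> str:
    """Return a locale code for value, or the fallback when none is assigned."""
    # Below the minimum length, detection is not attempted at all.
    # Returning the fallback here is an assumption.
    if len(value) < minimum_length:
        return fallback_locale
    detected = detector(value)
    # If the detector fails to assign a locale, the fallback is used.
    return detected if detected else fallback_locale
```

With minimum_length left at 0, as recommended above, the first branch never triggers and every value reaches the detector.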

The Short Detector bean is not used by the ALM locale detector and should be ignored.

The ALM also assigns locale values to individual phrases within a larger field value.  It can identify French phrases in an English paragraph, for instance.  If these individual phrases are too short, locale identification can be unreliable.  The Shortest Language Length setting applies a minimum length to these phrases (default 50). If Filter Short Results is set to true, the ALM will revert these short phrases back to the default locale of the field.  This lets us filter out erratic locale assignments on short phrases while keeping the more-reliable locale assignments for larger phrases within the same field.
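
The effect of Shortest Language Length and Filter Short Results can be sketched as below. This is a simplified model rather than the ALM's implementation: phrases are represented as (text, locale) pairs, the function and parameter names are hypothetical, and length is assumed to be measured in characters.

```python
from typing import List, Tuple

Span = Tuple[str, str]  # (phrase text, detected locale code)


def filter_short_spans(
    spans: List[Span],
    field_locale: str,
    shortest_language_length: int = 50,
    filter_short_results: bool = True,
) -> List[Span]:
    """Revert phrases that are too short back to the field's default locale."""
    if not filter_short_results:
        return list(spans)
    return [
        # Keep the detected locale only for phrases that are long enough.
        (text, locale if len(text) >= shortest_language_length else field_locale)
        for text, locale in spans
    ]
```

In this model a seven-character French phrase inside an English field is relabeled as English, while a sixty-character French phrase keeps its fr locale.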

Recognized Languages

The following languages (and the corresponding ISO language codes) can be identified by ALM.  Most have language-specific tokenization and text analytics available, but some do not. 

Language name           ISO-639-1 2-letter alpha code  Comment
Albanian                sq                             Identification and Analytics
Arabic                  ar                             Identification and Analytics
Bengali                 bn                             Identification only
Bosnian                 bs                             Identification and Analytics
Bulgarian               bg                             Identification and Analytics
Catalan                 ca                             Identification and Analytics
Chinese (simplified)    zh_SC                          Identification and Analytics
Chinese (traditional)   zh_TC                          Identification and Analytics
Croatian                hr                             Identification and Analytics
Czech                   cs                             Identification and Analytics
Danish                  da                             Identification and Analytics
Dutch                   nl                             Identification and Analytics
English                 en                             Identification and Analytics
Estonian                et                             Identification and Analytics
Farsi (Persian, Dari)   fa                             Identification and Analytics
Finnish                 fi                             Identification and Analytics
French                  fr                             Identification and Analytics
German                  de                             Identification and Analytics
Greek                   el                             Identification and Analytics
Gujarati                gu                             Identification only
Hebrew                  he (formerly "iw")             Identification and Analytics
Hindi                   hi                             Identification only
Hungarian               hu                             Identification and Analytics
Icelandic               is                             Identification only
Indonesian              id                             Identification and Analytics
Italian                 it                             Identification and Analytics
Japanese                ja                             Identification and Analytics
Kannada                 kn                             Identification only
Korean                  ko                             Identification and Analytics
Kurdish                 ku                             Identification only
Latvian                 lv                             Identification and Analytics
Malay                   ms                             Identification and Analytics
Malayalam               ml                             Identification only
Norwegian               nb                             Identification and Analytics
Pashto                  ps                             Identification only
Polish                  pl                             Identification and Analytics
Portuguese              pt                             Identification and Analytics
Romanian                ro                             Identification and Analytics
Russian                 ru                             Identification and Analytics
Serbian                 sr                             Identification and Analytics
Slovak                  sk                             Identification and Analytics
Slovenian               sl                             Identification and Analytics
Somali                  so                             Identification only
Spanish                 es                             Identification and Analytics
Swedish                 sv                             Identification and Analytics
Tagalog                 tl                             Identification only
Tamil                   ta                             Identification only
Telugu                  te                             Identification only
Thai                    th                             Identification and Analytics
Turkish                 tr                             Identification and Analytics
Ukrainian               uk                             Identification only
Urdu                    ur                             Identification and Analytics
Uzbek                   uz                             Identification only
Vietnamese              vi                             Identification only

Note that "iw" and "he" are both accepted as language codes for Hebrew.

Configure a Tokenizer for an Unsupported Language

Note that the ALM preconfigures the linguistic analysis of over thirty common languages.  One would not normally override these settings.

However, the languages that are listed as "Identification only" in the table above do not have language-specific tokenizers.  They pass through a default tokenizer instead, as if the text were in English.  Although this is not ideal, it often produces adequate search results. 
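
When deciding which languages might benefit from rerouting, it can help to have the "Identification only" codes in one place. The sketch below copies those codes from the table on this page; the constant and function names are hypothetical and are not part of AIE.

```python
# Codes marked "Identification only" in the Recognized Languages table.
IDENTIFICATION_ONLY = frozenset({
    "bn", "gu", "hi", "is", "kn", "ku", "ml", "ps",
    "so", "tl", "ta", "te", "uk", "uz", "vi",
})


def uses_default_tokenizer(language_code: str) -> bool:
    """Return True if text in this language falls through to the default tokenizer."""
    return language_code in IDENTIFICATION_ONLY
```

Any code for which uses_default_tokenizer returns True is a candidate for the redirection described in this section.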

If one of the non-English ALM tokenizers would be more appropriate to the language, you can redirect the incoming text to that tokenizer by editing the appropriate TokenizerModel file. For instance, this snippet of XML directs German and Dutch text to the "Germanic" tokenizer:

<project-dir>\conf\features\core\TokenizerModel.germanic.default.xml
  <f:tokenizer enabled="true" fallbackLocale="en" group="default" locales="de,nl" name="germanic.default" ref="tokenizer.basistech"/>
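
As an illustration of the kind of edit involved, the following sketch appends one more locale code to the locales attribute of a tokenizer element. Several things here are assumptions: the namespace URI bound to the f prefix and the f:features wrapper element are placeholders (the real file's root element and namespace are not shown on this page), and routing Icelandic ("is") to the Germanic tokenizer is only an example, not a recommendation. In practice the same change can be made by hand in a text editor.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI; the real URI for the "f" prefix is not shown here.
F_NS = "urn:example:placeholder-features"
ET.register_namespace("f", F_NS)

# Hypothetical wrapper element around the tokenizer line from this page.
SNIPPET = f"""<f:features xmlns:f="{F_NS}">
  <f:tokenizer enabled="true" fallbackLocale="en" group="default" locales="de,nl" name="germanic.default" ref="tokenizer.basistech"/>
</f:features>"""


def add_locale(xml_text: str, tokenizer_name: str, locale_code: str) -> str:
    """Append locale_code to the locales list of the named tokenizer."""
    root = ET.fromstring(xml_text)
    for tokenizer in root.iter(f"{{{F_NS}}}tokenizer"):
        if tokenizer.get("name") != tokenizer_name:
            continue
        locales = [code for code in tokenizer.get("locales", "").split(",") if code]
        if locale_code not in locales:
            locales.append(locale_code)
        tokenizer.set("locales", ",".join(locales))
    return ET.tostring(root, encoding="unicode")


updated = add_locale(SNIPPET, "germanic.default", "is")
```

After the call, the tokenizer element in updated carries the original two locale codes followed by the newly added one.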

For best results, remember that queries need to be tokenized the same way as the documents.  See Non-English Queries for more information.
