The Advanced Linguistics Module (ALM) augments the analysis of languages that are not as richly supported by the Attivio platform's native analytic tools.
Depending on the terms of your license, the ALM augments Attivio's analysis of English, and adds support for Azerbaijani, Bosnian, Chinese, Czech, Danish, Dutch, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Kazakh, Korean, Lithuanian, Macedonian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish, and Urdu. (See the Features by Language Table for a summary of the exact features that are offered for each of these languages.)
The ALM provides the following improvements over the standard Attivio platform linguistics capabilities:
- Recognizes over fifty languages.
- Provides text-analytics tools for more than half of these languages.
- Provides language-specific stemming and/or lemmatization.
- Breaks up compounded words (in German, for instance).
- Provides segmentation (word chunking) for languages that do not normally put spaces between words (such as Chinese).
- Provides powerful statistical entity extraction for seventeen languages.
These features require the inclusion of the alm module when you run createproject to create the project directories.
The Advanced Linguistics Module is installed as part of the default Attivio platform installer and requires a license key file, which determines which linguistics features are enabled. A license key can be obtained from email@example.com or your Attivio Sales Representative.
ALM Resource Requirements
The lexical resources used by the ALM vary in size from approximately 1MB to over 80MB. These resources map to the virtual memory space of the running Java process. Most server-class machines have adequate virtual memory resources, but on some laptops, the memory map request fails silently on startup, resulting in log file messages such as Take5FileMapException, or Failed to open model. This is more likely to happen if many languages are in use simultaneously. If such messages are encountered, try increasing the virtual memory on the machine, reducing the memory allocated to the JVM, or contacting Attivio Support.
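One way to check for these failures is to scan the server log for the two messages named above. The following is a minimal sketch, not an Attivio tool: the function name is hypothetical, and the sample log lines in the demo are invented for illustration.

```python
from typing import Iterable, List, Tuple

# Message fragments quoted in the text above.
ALM_MEMORY_MAP_MARKERS = ("Take5FileMapException", "Failed to open model")


def find_alm_mapping_failures(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for each log line that mentions a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in ALM_MEMORY_MAP_MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Invented sample lines; a real log will be formatted differently.
    sample = [
        "INFO  starting node\n",
        "ERROR Take5FileMapException while loading lexicon\n",
        "WARN  Failed to open model\n",
    ]
    for number, text in find_alm_mapping_failures(sample):
        print(f"line {number}: {text}")
```

To scan a real file, pass an open file handle as `lines`; the location of the log file depends on your installation.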
Setting Up the ALM
The Advanced Linguistics Module (ALM) is almost ready to run out-of-the-box. This section describes how to obtain the proper licenses and install the ALM in your Attivio installation tree.
ALM-related licensing has several aspects, and results in two license files that must be installed on your Attivio platform server.
- You need a basic Attivio license in order to download and run the Attivio platform. (This is the usual Attivio license.)
- The Attivio license includes permission to run ALM for English Entity Extraction.
- The Attivio license must be extended, at additional cost, to permit you to use the ALM for other features and languages.
- Each ALM language you intend to use must be licensed separately (at additional cost).
- If you want to use the ALM's Statistical Entity Extraction feature, you must license an ALM Advanced Module for each one of the licensed languages. The ALM's Statistical Entity Extraction is turned on for all licensed languages or for none of them; mixing and matching is not permitted.
Contact firstname.lastname@example.org for pricing and for license delivery.
You will receive a license file called attivio.license. Copy it into a directory that you can reference during Attivio platform installation.
You will also receive an rlp-license.xml file that enables specific ALM languages. After installing the Advanced Linguistics Module, copy this file to <install-dir>\lib\basisTech\licenses.
License Key Installation
The Advanced Linguistics Module license file is installed by placing the rlp-license.xml file into the <install_dir>/lib/basisTech/licenses directory. To determine which features are enabled, view the license file contents.
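Before inspecting the license, it can help to confirm that the file is where the ALM expects it. The sketch below only builds the path described above and reads the file as text. The function names are hypothetical, and nothing is assumed about the internal layout of rlp-license.xml.

```python
from pathlib import Path


def alm_license_path(install_dir: str) -> Path:
    """Build <install_dir>/lib/basisTech/licenses/rlp-license.xml."""
    return Path(install_dir) / "lib" / "basisTech" / "licenses" / "rlp-license.xml"


def read_alm_license(install_dir: str) -> str:
    """Return the license file contents, or raise if the file is missing."""
    path = alm_license_path(install_dir)
    if not path.is_file():
        raise FileNotFoundError(f"ALM license not found at {path}")
    return path.read_text(encoding="utf-8")
```

Printing the result of `read_alm_license(...)` for your installation directory shows the raw contents to review.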
Attivio applies tokenization to incoming documents in the standardAnalyzer component of the ingestPreProcess workflow.
The ALM configures several tokenizers that are registered by group, and which analyze specific languages. For instance, this is the feature declaration for the French tokenizer:
<f:tokenizer locales="fr" name="bt-french" group="default" class="com.attivio.basistech.tokenizer.BasisTechTokenizer">
  <f:property name="language" value="FRENCH"/>
  <f:property name="lemmatize" value="true"/>
</f:tokenizer>
The "language" property must be set so that the tokenizer can be used directly, in cases where the locale may not be set.
|Property||Type||Default||Description|
|language||String||required||Specify the default language for the tokenizer (can be an ALM language code, or a locale language tag)|
|Enable/disable decompounding tokens|
|normalize||boolean||false||Enable/disable using normalized form of lemma|
|readings||boolean||false||Enable/disable readings (Japanese only)|
|sentences||boolean||true||Enable/disable sentence scope annotations|
|useSubstringOffsets||boolean||false||Enable/disable using substring offsets for highlighting lemmas (Korean only)|
|shortQueryThreshold||int||6||Minimum length for query processing of strings that are all katakana or all hiragana|
|substringMatch||boolean||false||Enable/disable substring matching. See Substring Matching for more information.|
The <install-dir>\conf\alm\tokenization.xml file now contains a tokenizer definition for each language.
This allows full out-of-the-box configurability on a per-language basis.
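To see which settings a given tokenizer declaration ends up with, the defaults from the property table can be combined with the properties set explicitly in the XML. The sketch below illustrates that reading of the table and is not how the platform itself loads the file. The namespace URI bound to the `f` prefix is a placeholder, because the real one is not shown here, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace: the real URI for the "f" prefix is not shown in
# this document, so this value is an assumption used only for parsing.
F_NS = "urn:example:attivio-features"

# Defaults copied from the property table above.
DOCUMENTED_DEFAULTS = {
    "normalize": "false",
    "readings": "false",
    "sentences": "true",
    "useSubstringOffsets": "false",
    "shortQueryThreshold": "6",
    "substringMatch": "false",
}

# The French declaration from above, with the placeholder namespace added.
FRENCH_TOKENIZER = f"""
<f:tokenizer xmlns:f="{F_NS}" locales="fr" name="bt-french" group="default"
    class="com.attivio.basistech.tokenizer.BasisTechTokenizer">
  <f:property name="language" value="FRENCH"/>
  <f:property name="lemmatize" value="true"/>
</f:tokenizer>
"""


def effective_properties(tokenizer_xml: str) -> dict:
    """Overlay explicitly set properties on the documented defaults."""
    root = ET.fromstring(tokenizer_xml.strip())
    explicit = {
        prop.get("name"): prop.get("value")
        for prop in root.findall(f"{{{F_NS}}}property")
    }
    if "language" not in explicit:
        raise ValueError("the 'language' property is required")
    return {**DOCUMENTED_DEFAULTS, **explicit}


if __name__ == "__main__":
    for name, value in sorted(effective_properties(FRENCH_TOKENIZER).items()):
        print(f"{name} = {value}")
```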
Enable ALM for English
Attivio does not need the ALM for most linguistic processing of English. However, the ALM's Statistical Entity Extraction and its Lemmatization for English are quite powerful. To switch over to ALM as the default tokenizer for English, edit <install-dir>\conf\alm\tokenization.xml and add group="default" to the English tokenizer. From that point on, new projects that utilize the ALM will have ALM tokenization of English enabled.
Substring matching requires 5.5.1 patch 62 or higher and version 1.0.2 or higher of the Advanced Linguistics Module.
The tokenization of Chinese, Korean and Japanese (CJK) languages can sometimes differ between query and ingest time due to differences in the amount of contextual information. There is typically much more surrounding text at ingest time than at query time, and this can lead to the linguistic processing choosing to tokenize the same text slightly differently.
For example, if A, B and C were CJK characters, we could end up with:
|Text||Query Tokens||Ingest Tokens|
|ABC||AB C||A BC|
At query time, AB is considered one token and C a second token, while during ingestion of a document containing this text, A is considered one token and BC a second due to the surrounding context. This misalignment will prevent the query from matching the expected document.
As an alternative to the standard tokenization offered for CJK languages, Attivio provides the ability to enable substring matching, which will stack additional tokens for each character of a token.
To enable substring matching, set the substringMatch property to true within the <project-dir>/conf/features/core/TokenizerModel.bt-<language>.xml configuration file.
Following is an example of some Japanese text that has been tokenized with substring matching enabled:
['sentence'@20000002#3~3] ['エルツリン'#3~8, 'エ'#3~4, 'ル'#4~5, 'ツ'#5~6, 'リ'#6~7, 'ン'#7~8] ['、'@3#8~9] ['character_s'@20000002#9~9] ['朝日'#9~11, '朝'#9~10, '日'#10~11, 'アサヒ'@8#9~11] ['character_s'@40000002#11~11] ['き'#11~12, 'キ'@8#11~12] ['の'#12~13, 'ノ'@8#12~13] ['うき'#13~15, 'う'#13~14, 'き'#14~15, 'うい'@8#13~15, 'ウキ'@8#13~15] ['た'#15~16, 'タ'@8#15~16] ['よー'#16~18, 'よ'#16~17, 'ー'#17~18, 'ヨー'@8#16~18] ['sentence'@40000002#18~18]
Notice the token エルツリン, which starts at position 3 of the field. It has the additional tokens エ, ル, ツ, リ and ン listed separately. This allows any substring of the original token to be matched and highlighted. At query time, when a query of AB is received for a locale with substring matching enabled, the generated query is OR(AB, phrase(A, B, boost=50)).
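The stacking and the query rewrite described above can be mimicked in a few lines. This is a sketch of the behavior as it appears in the example output, not the ALM implementation: the function names are hypothetical, readings such as 'アサヒ' are ignored, and single-character tokens are left unstacked because that is what the sample output shows.

```python
from typing import List, Tuple

Token = Tuple[str, int, int]  # (text, start offset, end offset)


def stack_substring_tokens(text: str, start: int) -> List[Token]:
    """Return the original token followed by one token per character."""
    tokens = [(text, start, start + len(text))]
    if len(text) > 1:  # single-character tokens get no extra tokens
        tokens.extend(
            (char, start + i, start + i + 1) for i, char in enumerate(text)
        )
    return tokens


def substring_query(characters: List[str]) -> str:
    """Build the OR(whole, phrase(chars..., boost=50)) form shown above."""
    whole = "".join(characters)
    return f"OR({whole}, phrase({', '.join(characters)}, boost=50))"


if __name__ == "__main__":
    for token in stack_substring_tokens("エルツリン", 3):
        print(token)
    print(substring_query(["A", "B"]))
```

For the token エルツリン at offset 3, the sketch yields the same offsets as the sample output (3~8 for the whole token, then 3~4 through 7~8 for the characters).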
When substring matching is enabled, the number of tokens generated for each document increases. The additional tokens created for each character count towards a document's total number of tokens. This increase is estimated to be up to 200% of the number of tokens for the document without substring matching enabled. This can increase the likelihood of a document exceeding the
value for the field and result in less of the document being searchable. This should be considered when enabling substring matching.
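A rough estimate of the growth for a given token list can be computed as follows. This is a simplified sketch under stated assumptions: every token longer than one character contributes one extra token per character, readings and scope annotations are not counted, and the function names are hypothetical.

```python
from typing import List, Tuple


def token_counts(tokens: List[str]) -> Tuple[int, int]:
    """Return (count without substring matching, count with it)."""
    base = len(tokens)
    extra = sum(len(token) for token in tokens if len(token) > 1)
    return base, base + extra


def growth_percent(tokens: List[str]) -> float:
    """Percentage increase in token count caused by substring matching."""
    base, total = token_counts(tokens)
    if base == 0:
        return 0.0
    return 100.0 * (total - base) / base


if __name__ == "__main__":
    # Word tokens from the Japanese example above.
    sample = ["エルツリン", "、", "朝日", "き", "の", "うき", "た", "よー"]
    print(token_counts(sample), growth_percent(sample))
```

For the eight word tokens of the Japanese example, this gives 19 tokens with substring matching, an increase of 137.5%.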
User Defined Dictionaries
User-defined dictionaries can be provided in order to improve tokenization/analysis of text.
User-defined dictionaries are activated automatically when you place them in the correct location in the resources directory of your project.
Dictionaries must be placed in the following per-language directory:
LANG indicates the 3-letter ISO 639-3 language code used to identify the language. (EX: eng for English, zho for Chinese, zhs for Simplified Chinese)
The type and options of a dictionary are determined by its name:
- Tokenization dictionaries - must include the term "token" in the name (EX: token.bin)
- Analysis dictionaries - must include the term "morpho" in the name (EX: morpho.bin)
- Case-insensitive dictionaries - Analysis dictionaries can have case-insensitive versions; these must include the term "lowercase" in the name (EX: morpho-lowercase.bin)
See User-Defined Dictionaries for more information on creating user-defined dictionaries.
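The naming rules above can be checked with a small helper before dictionaries are copied into a project. This is an illustrative sketch only: the function name is hypothetical, matching is done on the exact lowercase terms quoted above, and a name containing "token" is treated as a tokenization dictionary before "morpho" is considered, which is an assumption about precedence.

```python
from typing import Optional, Tuple


def classify_dictionary(filename: str) -> Tuple[Optional[str], bool]:
    """Return (kind, case_insensitive) based on terms in the file name.

    kind is "tokenization", "analysis", or None when neither term appears.
    """
    if "token" in filename:
        kind = "tokenization"
    elif "morpho" in filename:
        kind = "analysis"
    else:
        kind = None
    # Only analysis dictionaries are described as having lowercase versions.
    case_insensitive = kind == "analysis" and "lowercase" in filename
    return kind, case_insensitive


if __name__ == "__main__":
    for name in ("token.bin", "morpho.bin", "morpho-lowercase.bin", "other.bin"):
        print(name, classify_dictionary(name))
```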
Language Identification Component
Attivio's DetectLocale component identifies the language of the text of one or more fields in a document. The locale detector chooses one language that "best" describes each field value. It cannot detect that the field value contains multiple languages. (The LocaleDetector component is configured in <project-dir>\conf\components\localeDetector.xml. It is one of the components of the ingestInit workflow.)
The ALM overrides the default locale detector. The ALM detector provides broader and more fine-grained language identification. It can subdivide a field value into regions of text written in multiple languages. The ALM then processes each region of text using the appropriate tokenizer for that language.
Main Article: ALM Language Identification
The configuration for the ALM language detector appears in <install_dir>\conf\alm\module.xml. It overrides <project-dir>\conf\components\localeDetector.xml with a new component of the same name. Here is the relevant XML from that file:
<component name="localeDetector" class="com.attivio.basistech.transformer.ingest.linguistics.BasisTechDetectLocale" override="true">
  <properties>
    <property name="languages" value="languages"/>
    <list name="input">
      <entry value="text"/>
    </list>
    <property name="minimumLength" value="50"/>
  </properties>
</component>
The attribute override="true" replaces the default localeDetector with the ALM version.
The properties include:
- languages supplies the name of the document field that the localeDetector fills when more than one language is detected in a document. All the detected languages are listed, making it possible to facet on them.
- input lists the names of all the document fields to analyze for language identification.
- minimumLength supplies the minimum length (in characters) for a field to receive an identified language. If the field is shorter than specified, language identification does not run, because the language of a field with little text cannot be identified reliably.
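The effect of the input and minimumLength properties can be pictured as a simple filter over a document's fields. The sketch below only illustrates the rule stated above. It represents a document as a plain dictionary of field names to strings, which is an assumption, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List


def fields_eligible_for_detection(
    document: Dict[str, str],
    input_fields: Iterable[str],
    minimum_length: int = 50,
) -> List[str]:
    """Return the input fields long enough for language identification."""
    eligible = []
    for name in input_fields:
        value = document.get(name)
        # Fields shorter than minimum_length characters are skipped.
        if isinstance(value, str) and len(value) >= minimum_length:
            eligible.append(name)
    return eligible


if __name__ == "__main__":
    doc = {"title": "Bonjour", "text": "x" * 120}
    print(fields_eligible_for_detection(doc, ["title", "text"]))
```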
Configuring Entity Extraction
Attivio offers several native forms of entity extraction, usually intended for English. The ALM form of entity extraction is enabled by default when including the ALM module.
The ALM offers additional statistical entity extraction features for these languages: Arabic, Chinese, Dutch, English, Farsi, French, German, Hebrew, Italian, Japanese, Korean, Portuguese, Spanish, Russian, Urdu. Using entity extraction with these languages requires additional licensing.
ALM entity extraction is performed in the BasisTechEntityExtractor component of the ingestPreProcess workflow. It is ready to use out-of-the-box unless you want to change the field mappings. Look in <project-dir>\conf\components\BasisTechEntityExtractor.xml.
<component xmlns="http://www.attivio.com/configuration/type/componentType"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    name="basisTechEntityExtraction"
    class="com.attivio.basistech.transformer.ingest.linguistics.BasisTechEntityExtractor"
    xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType http://www.attivio.com/configuration/type/componentType.xsd ">
  <properties>
    <list name="input">
      <entry value="text"/>
    </list>
    <list name="languages">
      <entry value="*"/>
    </list>
    <property name="outputAsScope" value="true"/>
    <map name="entityMappings">
      <property name="LOCATION" value="%ENTITY_LOCATION%"/>
      <property name="GPE" value="%ENTITY_LOCATION%"/>
      <property name="ORGANIZATION" value="%ENTITY_COMPANY%"/>
      <property name="PERSON" value="%ENTITY_PERSON%"/>
    </map>
    <map name="fieldMappings">
      <property name="%ENTITY_LOCATION%" value="location"/>
      <property name="%ENTITY_COMPANY%" value="company"/>
      <property name="%ENTITY_PERSON%" value="people"/>
    </map>
  </properties>
</component>
You can also edit this component using dynamic configuration in the Attivio Administrator.
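When changing the field mappings, it helps to trace where each entity type ends up. The sketch below copies the two maps from the component XML and assumes they are applied in sequence, first entity type to placeholder and then placeholder to field. The XML suggests this chaining but the text does not state it, and the function name is hypothetical.

```python
from typing import Optional

# Copied from the entityMappings map in the component XML.
ENTITY_MAPPINGS = {
    "LOCATION": "%ENTITY_LOCATION%",
    "GPE": "%ENTITY_LOCATION%",
    "ORGANIZATION": "%ENTITY_COMPANY%",
    "PERSON": "%ENTITY_PERSON%",
}

# Copied from the fieldMappings map in the component XML.
FIELD_MAPPINGS = {
    "%ENTITY_LOCATION%": "location",
    "%ENTITY_COMPANY%": "company",
    "%ENTITY_PERSON%": "people",
}


def output_field(entity_type: str) -> Optional[str]:
    """Resolve an entity type to its output field, or None if unmapped."""
    placeholder = ENTITY_MAPPINGS.get(entity_type)
    if placeholder is None:
        return None
    return FIELD_MAPPINGS.get(placeholder)


if __name__ == "__main__":
    for entity_type in ENTITY_MAPPINGS:
        print(entity_type, "->", output_field(entity_type))
```

Under that assumption, both LOCATION and GPE entities land in the location field.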
Disabling ALM Entity Extraction
ALM entity extraction can be disabled if it is not needed. To do so, edit one line of <install-dir>\conf\alm\alm.properties.
Set alm.extractEntities to "false".
This change will disable initialization of entity extraction and will result in ALM no longer performing entity extraction.
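If the change needs to be scripted, the property can be rewritten in the file's text. This is a simplified sketch with a hypothetical function name. It handles only `key=value` lines and `#` or `!` comments, not the full Java properties syntax, and it writes the value without quotation marks, on the assumption that the quotes in the instruction above are typographic.

```python
def set_property(text: str, key: str, value: str) -> str:
    """Return properties text with key set to value, appending if absent."""
    lines = text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith(("#", "!")) or "=" not in stripped:
            continue  # leave comments and non key=value lines untouched
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[index] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    original = "# ALM settings\nalm.extractEntities=true\n"
    print(set_property(original, "alm.extractEntities", "false"), end="")
```

Back up alm.properties before writing the result to it.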
Querying with the ALM
The Attivio platform automatically tokenizes queries so that they can be matched accurately with terms in the index. When the ALM tokenizer feature is active, language-specific tokenizers are automatically applied to incoming query strings.
Since query strings are generally rather short, automatic language identification is not attempted. For this reason, Attivio's query languages let us assign a locale (a language code) to each field of an incoming query. This guides Attivio in selecting the right tokenizers. See the page on Non-English Queries for more information on this subject.