

All projects created by the createProject tool have a set of tokenizer bean files that contain all of the tokenizers registered with an Attivio Intelligence Engine (AIE) system. These files control which tokenizers are used when performing text analytics and indexing of textual content. They are standard Spring beans files that can be edited to add, remove, or customize tokenizers within AIE. In addition, the AIE schema can be edited to use specific tokenizers for individual fields.



Standard Tokenizers

AIE applies tokenization to incoming documents in the standardAnalyzer component of the ingestPreProcess workflow.  AIE defines two "standard" tokenizers: the default tokenizer and the alphanum tokenizer.

  • The default tokenizer includes punctuation-mark tokens among the words and numbers. This supports features such as noun-phrase extraction and entity extraction. The punctuation tokens are used during linguistic analysis, but are not added to the index. This is the default tokenizer used for all tokenized fields.
  • The alphanum tokenizer provides simple, language-agnostic tokenization, splitting tokens on whitespace and punctuation.
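The contrast between the two behaviors can be sketched with simple regular expressions. This is an illustrative approximation only, not AIE's actual implementation; the function names are hypothetical:

```python
import re

def alphanum_tokenize(text):
    """alphanum-style: split on any non-alphanumeric run and drop the separators."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def default_like_tokenize(text):
    """default-style: emit word/number tokens AND individual punctuation tokens."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)]

print(alphanum_tokenize('Foo "Bar" BAZ;'))      # ['foo', 'bar', 'baz']
print(default_like_tokenize('Foo "Bar" BAZ;'))  # ['foo', '"', 'bar', '"', 'baz', ';']
```

The punctuation tokens emitted by the default-style tokenizer are what downstream linguistic analysis (such as entity extraction) consumes; they are not added to the index.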

The default tokenizer is based on AIE's MultiLanguageTokenizerFeature, which allows a master tokenizer to dispatch text to a "group" of language-specific tokenizers. New language-specific tokenizers can be added to the default tokenizer by specifying the "default" group.

AIE also provides some special-purpose tokenizers (listed below).  The Advanced Linguistics Module introduces many language-specific tokenizers.

To configure certain fields to support only exact matches, set tokenize="false" (and lowercase="true" or "false" as appropriate) in the AIE schema entry for those fields. With tokenize="false", the tokenizer configuration is ignored for those fields and the field value is indexed as a single token. See the examples below and refer to the schema Field Attributes documentation for more information on these options.
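For example, a hypothetical part-number field could be declared for exact matching as follows (the field name and the attributes other than tokenize and lowercase are illustrative, not from a shipped schema):

```xml
<!-- Hypothetical exact-match field: tokenize="false" disables tokenization;
     lowercase="false" preserves case, so only exact-case queries match. -->
<field name="partNumber" displayName="Part Number" type="STRING"
       indexed="true" stored="true" tokenize="false" lowercase="false"/>
```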

Sample Standard Tokenizer Output

For the sample input text below, the examples that follow show the tokenized text resulting from different combinations of tokenizer, the tokenize schema field attribute, and the lowercase schema field attribute. The tokenized output determines how queries against AIE match results, as well as whether advanced linguistics techniques such as entity extraction can be applied to the resulting output.

Main Article: Ingestion Tokenization
Main Article: Query Tokenization

Sample Input:

Foo "Bar" BAZ; 123-DEF 5678 abc_def test@email.com foo.txt hij.

Tokenized output by configuration:

  • alphanum tokenizer, tokenize="true", lowercase="true":
    ['foo', 'bar', 'baz', '123', 'def', 'abc', 'def', 'test', 'email', 'com', 'foo', 'txt', 'hij']
  • alphanum tokenizer, tokenize="true", lowercase="false":
    ['Foo', 'Bar', 'BAZ', '123', 'DEF', 'abc', 'def', 'test', 'email', 'com', 'foo', 'txt', 'hij']
  • default tokenizer, tokenize="true", lowercase="true":
    ['foo', '"', 'bar', '"', 'baz', ';', '123', '-', 'def', '5678', 'abc', '_', 'def', 'test', '@', 'email', '.', 'com', 'foo', '.', 'txt', 'hij', '.']
  • default tokenizer, tokenize="true", lowercase="false":
    ['Foo', '"', 'Bar', '"', 'BAZ', ';', '123', '-', 'DEF', '5678', 'abc', '_', 'def', 'test', '@', 'email', '.', 'com', 'foo', '.', 'txt', 'hij', '.']
  • tokenize="false", lowercase="true" (any tokenizer; the value is indexed as a single token):
    [foo "bar" baz; 123-def 5678 abc_def test@email.com foo.txt hij.]
  • tokenize="false", lowercase="false":
    [Foo "Bar" BAZ; 123-DEF 5678 abc_def test@email.com foo.txt hij.]

Other Tokenizers

AIE provides several other tokenizers that can be used depending upon specific use cases.

  • EnglishTokenizer: Provides tokenization for English-language text, including lemmatization of English words. Punctuation tokens are also emitted for linguistic analysis, but are not indexed. This is the default tokenizer used by AIE.
  • AlphaNumericTokenizer: Emits tokens consisting of sequences of alphanumeric characters. The AIE schema uses this tokenizer as the default for the uri field, the email address fields (to, bcc, cc), the sourcepath field, and the filename field. (Configured in <project-dir>\conf\beans\tokenizer.alphanum.xml.)
  • DigestTokenizer: Outputs a token that is the md5sum of each input token. (Configured in <project-dir>\conf\beans\tokenizerWrapper.md5.xml.)
  • LetterTokenizer: Lowercases and splits on non-letters (as defined by Character.isLetter(char)).
  • MultiLanguageTokenizer: Proxies calls to other tokenizers based on a matching locale.
  • WhitespaceTokenizer: Lowercases and splits on whitespace.
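The DigestTokenizer's md5sum-per-token behavior can be approximated as follows (a sketch of the idea, not AIE's implementation):

```python
import hashlib

def digest_tokenize(tokens):
    """Replace each input token with the hex MD5 digest of its UTF-8 bytes."""
    return [hashlib.md5(t.encode("utf-8")).hexdigest() for t in tokens]

print(digest_tokenize(["foo"]))  # ['acbd18db4cc2f85cedef654fccc4a4d8']
```

Because each distinct token maps to a fixed-length digest, this kind of tokenizer is useful when the literal token text should not be stored in the index.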

Attivio recommends consulting with Attivio Support prior to changing tokenization configuration to ensure that the right tokenizer is used based on the particular use case.


Tokenizer Registration

When a new AIE project is created, any included modules that specify tokenization configuration will have that configuration copied into the project.  Any changes to tokenization for a project should be made in the project's configuration files.

Tokenization is an advanced topic. Attivio recommends contacting Attivio Support before modifying tokenization configurations.

For instance, the alphanum tokenizer is configured via a tokenizer feature in <project-dir>\conf\features\core\TokenizerModel.alphanum.xml:

   <f:tokenizer enabled="true" name="alphanum" class="com.attivio.platform.tokenizer.AlphaNumericTokenizer"/>

The tokenizer is linked to specific document fields in <project-dir>\conf\schema\default.xml:

     <field name="filename" displayName="File Name" type="STRING" indexed="true" stored="true" sort="false">
        <property name="index.tokenizer" value="alphanum"/>
     </field>

Multi-Language Tokenization

Fields are configured to use a single tokenizer for both index- and query-side processing; however, different locales may require different tokenizers. For instance, AIE offers tokenizers specific to Japanese, Chinese, Korean, and many additional languages via the Advanced Linguistics Module.

These tokenizers can all be configured as a single logical tokenizer using a MultiLanguageTokenizer class that is configured using the MultiLanguageTokenizer feature.  This multi-language tokenizer maintains a map of locales to locale-specific tokenizers such that every field value for a given field can be processed using the appropriate tokenizer based on the field value's detected locale.

The default MultiLanguageTokenizer feature is configured in <project-dir>\conf\features\core\MultiLanguageTokenizerModel.default.xml. This configuration creates a new multi-language tokenizer which can be referenced in the AIE schema using the "default" label, and to which new tokenizers in the "default" group can be registered. If the locale of a field value cannot be detected, the locale is assumed to be "en" (English). If the detected locale has no tokenizer mapped to it, the default multi-language tokenizer falls back to the English tokenizer.

  <f:multiLangTokenizer enabled="true" fallbackLocale="en" fallbackTokenizer="english" group="default" name="default"/>
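The dispatch-and-fallback behavior described above can be sketched as follows (an illustrative Python approximation; AIE's actual implementation is Java and is configured via the feature XML above, so all names here are hypothetical):

```python
def make_multi_language_tokenizer(tokenizers, fallback_locale="en",
                                  fallback_tokenizer=None):
    """Build a tokenizer that dispatches on the field value's detected locale.

    tokenizers: dict mapping locale codes (e.g. 'en', 'ja') to tokenizer
    functions. fallback_locale is assumed when no locale was detected;
    fallback_tokenizer handles locales with no registered tokenizer.
    """
    def tokenize(text, locale=None):
        if locale is None:                 # locale could not be detected
            locale = fallback_locale
        fn = tokenizers.get(locale, fallback_tokenizer)
        return fn(text)
    return tokenize

# Hypothetical locale-specific tokenizer (whitespace + lowercase):
english = lambda t: t.lower().split()

tok = make_multi_language_tokenizer({"en": english}, fallback_locale="en",
                                    fallback_tokenizer=english)
print(tok("Hello World"))            # undetected locale -> assumed 'en'
print(tok("Bonjour", locale="fr"))   # no 'fr' tokenizer -> fallback tokenizer
```

This mirrors the two fallback paths in the feature configuration: fallbackLocale covers undetectable locales, and fallbackTokenizer covers detected locales with no registered tokenizer.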

MultiLanguageTokenizer Properties




  • group: Can be used in new tokenizer declarations to auto-register tokenizers for use by a multi-language tokenizer based on the new tokenizers' declared locales. Using the group attribute, different AIE modules can register tokenizers without having to modify the base multi-language tokenizer configuration.
  • fallbackLocale: The locale used to determine which tokenizer to apply when a locale cannot be detected, either due to an unknown language or insufficient text to make a locale determination.
  • fallbackTokenizer: Specifies which tokenizer to use when there is no tokenizer registered for a detected locale.