This page provides the technical details for Tokenization in Attivio.
To support free-text search, searchable field values in Attivio are tokenized. Tokens are the fundamental matching unit in Attivio. Tokenization is the process of breaking up text strings into tokens. A typical token contains a word and its exact location within the string.
For most languages, tokenization is achieved by separating alphanumeric strings, which are indexed, from the surrounding whitespace and punctuation, which is tokenized but not indexed. There are also tokens for lemmas, synonyms, entities, and scope tags. In languages where whitespace isn't always used between words (such as Chinese, Japanese, and Korean), determining which words to index is called segmentation.
Tokens can be "stacked," meaning that multiple tokens all have the same location values. This supports synonyms, lemmas, and other features that enrich the incoming text with alternate interpretations of the original words.
You can apply tokenization to a document field by marking the field as tokenize="true" in the AIE Schema.
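An illustrative schema fragment might look like the following. The field name and the other attributes shown are assumptions for the example; consult the AIE Schema documentation for your version for the exact syntax.

```xml
<!-- Illustrative only: the field name and surrounding attributes are
     hypothetical; only tokenize="true" is the point of this example. -->
<field name="title" type="string" indexed="true" stored="true" tokenize="true"/>
```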
Field Guide to Tokens
To view and interpret tokenized text, see Field Guide to Tokens. Note that the details of Attivio tokenization are internal and are subject to change without notice.
Attivio's standard set of tokenizers is defined in <install-dir>\conf\core-app\tokenization.xml.
Tokenization is coordinated by the "default" tokenizer. It responds to locale information attached to each field by applying a language-specific tokenizer where one is available. (Various optional language modules contain their own language-specific tokenizers.)
When an appropriate language-specific tokenizer is unavailable (as is the case in most projects), the default tokenizer falls back to Attivio's native "English" tokenizer.
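The dispatch-and-fallback behavior described above can be sketched as follows. This is a hypothetical illustration, not Attivio's actual code; the registry, function names, and stand-in tokenizers are all assumptions.

```python
# Hypothetical sketch of locale-based tokenizer dispatch with an English
# fallback. The tokenizer registry and names are assumptions, not Attivio
# internals.

def english_tokenize(text):
    return text.split()  # stand-in for the real English tokenizer

def chinese_segment(text):
    return list(text)    # stand-in segmenter: one token per character

# Language-specific tokenizers, as might be contributed by optional
# language modules.
TOKENIZERS = {
    "zh": chinese_segment,
}

def default_tokenize(text, locale):
    # Use a language-specific tokenizer when one is registered for the
    # field's locale; otherwise fall back to the English tokenizer.
    tokenizer = TOKENIZERS.get(locale, english_tokenize)
    return tokenizer(text)

print(default_tokenize("quick brown fox", "fr"))  # ['quick', 'brown', 'fox']
```

With no tokenizer registered for "fr", the call above falls back to the English tokenizer.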
The Attivio English tokenizer splits text blocks into individual strings that are single words, punctuation marks, numbers, etc. Word boundaries are determined by punctuation and whitespace, most notably the space character. Where possible, the English tokenizer inserts English lemmas into the token stream.
The English tokenizer preserves the case of the text (the indexer later ignores case). Punctuation is also preserved, along with any attached word segments. When two or more punctuation marks occur together, they are separated into individual tokens. Note that this tokenizer splits email addresses and hyphenated words into multiple tokens.
Example input:
This dog's for ?<>sale!!! Cocker-spaniel, "Friendly", $58 OBO, C.O.D. firstname.lastname@example.org, 1-800-dog-4-sale
Tokenized output of the default tokenizer:
This dog ' s for ?<> sale !!! Cocker - spaniel , " Friendly " , $ 58 OBO , C . O . D . firstname . lastname @ example . org , 1 - 800 - dog - 4 - sale
The retention of punctuation and case is essential for later text analytics stages. During the indexing process, however, all text is transformed into lower case, and tokens containing punctuation are ignored.
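The splitting behavior described above can be approximated with a short regular expression. This is a rough sketch, not Attivio's actual implementation, which also inserts lemmas and may group some punctuation runs differently (as the ?<> and !!! tokens in the example output show).

```python
import re

def approx_tokenize(text):
    # Rough approximation of the English tokenizer's splitting rules:
    # runs of word characters form tokens, and each punctuation mark
    # becomes its own token. Case is preserved.
    return re.findall(r"\w+|[^\w\s]", text)

print(approx_tokenize("This dog's for sale!!!"))
# ['This', 'dog', "'", 's', 'for', 'sale', '!', '!', '!']
```

Note how the possessive "dog's" yields three tokens and each exclamation mark is a separate token, mirroring the word/punctuation separation described above.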
The alphanum tokenizer is used on fields that should not have lemmas inserted. Fields containing email addresses, for instance, do not require lemmatization. The alphanum tokenizer is applied to specific fields through the index.tokenizer field property in the AIE Schema.
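A schema fragment using this property might look like the following. The field name and surrounding structure here are hypothetical; check the AIE Schema reference for your Attivio version for the exact syntax of field properties.

```xml
<!-- Illustrative only: the field name and element layout are assumptions;
     the point is setting index.tokenizer to alphanum for a field whose
     values (such as email addresses) should not receive lemmas. -->
<field name="email" type="string" indexed="true" tokenize="true">
  <properties>
    <property name="index.tokenizer" value="alphanum"/>
  </properties>
</field>
```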