Overview

This page provides the technical details of tokenization in Attivio.

Tokenization

To support free-text search, searchable field values in Attivio are tokenized. Tokens are the fundamental matching unit in Attivio. Tokenization is the process of breaking up text strings into tokens. A typical token contains a word and its exact location within the string.

For most languages, tokenization is achieved by separating alphanumeric strings, which are indexed, from the surrounding whitespace and punctuation, which are tokenized but not indexed. There are also tokens for lemmas, synonyms, entities, and scope tags. In languages where whitespace isn't always used between words (such as Chinese, Japanese, and Korean), determining which words to index is called segmentation.

Tokens can be "stacked," meaning that multiple tokens all have the same location values. This supports synonyms, lemmas, and other features that enrich the incoming text with alternate interpretations of the original words.
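
For example, consider an input such as "dogs bark" (an illustration only; the exact annotations recorded for each token are described in Field Guide to Tokens). If lemmatization inserts the lemma "dog", that lemma is stacked at the same position as its surface form:

position 1: dogs
position 1: dog (lemma, stacked)
position 2: bark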

You can apply tokenization to a document field by setting tokenize="true" on the field in the AIE Schema.
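
For example, a tokenized field definition might look like the following sketch. The field name and the other attributes are illustrative; only the tokenize attribute is taken from this page.

<field name="title" type="string" indexed="true" stored="true" tokenize="true"/>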

Field Guide to Tokens

To view and interpret tokenized text, see Field Guide to Tokens.  Note that the details of Attivio tokenization are internal and are subject to change without notice.

Tokenizers

Attivio's standard set of tokenizers is defined in <install-dir>\conf\core-app\tokenization.xml.

Default Tokenizer

Tokenization is coordinated by the "default" tokenizer. It responds to locale information attached to each field by applying a language-specific tokenizer where one is available.  (Various optional language modules contain their own language-specific tokenizers.)

When an appropriate language-specific tokenizer is unavailable (as is the case in most projects), the default tokenizer falls back on Attivio's native "English" tokenizer.

English Tokenizer

The Attivio English tokenizer splits blocks of text into individual strings: single words, punctuation marks, numbers, and so on. Word boundaries are determined by whitespace and punctuation. Where possible, the English tokenizer inserts English lemmas into the token stream.

The English tokenizer preserves the case of the text (the indexer later ignores case). Punctuation is also preserved, along with any attached word segments. If two or more punctuation marks occur together, they are separated into individual tokens. Note that this tokenizer splits email addresses and hyphenated words into multiple tokens.

For example:

This dog's for ?<>sale!!!  Cocker-spaniel, "Friendly", $58 OBO, C.O.D.
dog-for-sale@domain.com, 1-800-dog-4-sale

Tokenized output of the English tokenizer:

This
dog
'
s
for
?<>
sale
!!!
Cocker
-
spaniel
,
"
Friendly
"
,
$
58
OBO
,
C
.
O
.
D
.
dog
-
for
-
sale
@
domain
.
com
,
1
-
800
-
dog
-
4
-
sale

The retention of punctuation and case is essential for later text analytics stages. During the indexing process, however, all text is transformed into lower case, and tokens containing punctuation are ignored.
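
For illustration, and setting aside any inserted lemmas, the terms actually indexed from the token stream above would be approximately the lowercased tokens that contain no punctuation:

this, dog, s, for, sale, cocker, spaniel, friendly, 58, obo, c, o, d, dog, for, sale, domain, com, 1, 800, dog, 4, sale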

Alphanum Tokenizer

The alphanum tokenizer is used on fields that should not have lemmas inserted. Fields containing email addresses, for instance, do not require lemmatization.  The alphanum tokenizer is applied to specific fields through the index.tokenizer field property in the AIE Schema.
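
For example, a field holding email addresses might be configured along these lines. The element nesting, the field name, and the "alphanum" value are assumptions made for illustration; only the index.tokenizer property name is taken from this page.

<field name="emailaddress" type="string" indexed="true" stored="true">
  <properties>
    <property name="index.tokenizer" value="alphanum"/>
  </properties>
</field>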
