Overview
Several AIE features work by tokenizing text and "stacking" various kinds of tokens in a TokenList. It is sometimes useful to examine these tokens directly, for instance when submitting a support issue that involves tokenization features.
This page demonstrates how to view tokenized text, and how to interpret the token annotations you will see there.
Expert Use Only
Raw token format evolves as new features are added to AIE. This page is a field guide to tokens spotted "in the wild" to help you interpret the source and meaning of the tokens. The details of token encoding may change without notice.
Viewing Tokenized Documents
AIE provides many logging classes that can record the state of IngestDocuments as they pass through ingestion workflows. (See Logging Level Settings for general information about this feature.) The following example shows how to log the state of IngestDocuments just before they enter the index, but it is equally possible to capture a snapshot of the document state before and/or after any workflow or component.
The log file is in XML format and is rich with information about each IngestDocument. It includes side-by-side examples of document text before and after tokenization. (The transformer outputs TokenLists as strings.)
LogDocument File Size
The output file quickly becomes very large. It is best to limit the sample to just a few documents of interest.
Logging IngestDocument State
To view raw tokens in the log output:

1. In the AIE Administrator, navigate to the Logging > Logging Level Settings page.
2. Scroll down to the com.attivio.platform.DocumentLogging.DocumentLog-indexer-index-content-dispatcher class and set its logging level to trace.
3. Click the Update button.
4. Go to the System Management > Connectors page and ingest just a few documents by turning a connector on and then immediately turning it off.
The log output file is <data-agent>\projects\<project-name>\default\logs\logs-<node-name>\aie-node.documentDetails.log. Navigate the XML to a document of interest, and note the locations where cleartext and tokenLists are displayed side-by-side:
<field name="terrain" type="string"> <value> <tokens>['Mostly'#0~6] ['High'#7~11] ['Plateau'#12~19] ['and'#20~23] ['Desert'#24~30]</tokens> <string>Mostly High Plateau and Desert</string> </value> <value> <tokens>['Some'#0~4] ['Mountains'#5~14, 'mountain'@8#5~14]</tokens> <string>Some Mountains</string> </value> <value> <tokens>['Narrow'#0~6] ['Discontinuous'#8~21] ['Coastal'#22~29] ['Plain'#30~35]</tokens> <string>Narrow, Discontinuous Coastal Plain</string> </value> </field>
To stop the logging, set the logging level back to "fatal" or simply restart the AIE node.
See Interpreting Token Annotations, below, for more information on interpreting tokens.
Interpreting Token Annotations
Basic tokenization breaks a string into individual words (and sometimes punctuation marks) and attaches position information to each word. If a string contains the word "running" at character offsets 100 through 106, its simple token might be:
['running'#100~107]
Note that the second position number indicates the character following the end of the word.
The word "running" would be entered in the AIE index, and the position pointers would be used to support phrase matching, "nearby" matches, and highlighting of matching text in the results.
If the stemming feature were enabled, it might reduce "running" to "run" and add a second token, "stacked" at the same position as the first:
['running'#100~107, 'run'@10#100~107]
This shows two tokens at the same location (within the same pair of square brackets). A search might match either one of them. The "@10" is an annotation indicating that this token is the result of stemming. (See the table below.)
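The stacking idea can be pictured as a list of positions, each holding one or more alternative tokens that share the same character offsets. The sketch below is a toy illustration only; it is not the actual AIE TokenList class, and the Token record is invented for the example.

```java
import java.util.List;

public class StackedTokens {
    /** A toy token: the term text plus its start and end character offsets. */
    record Token(String term, int start, int end) { }

    public static void main(String[] args) {
        // One position in the list holds two alternatives "stacked" at the same
        // offsets; a query could match either of them.
        List<List<Token>> tokenList = List.of(
                List.of(new Token("running", 100, 107),    // original surface form
                        new Token("run",     100, 107)));  // stemmed form at the same position

        for (List<Token> stack : tokenList) {
            for (Token t : stack) {
                System.out.printf("'%s'#%d~%d ", t.term(), t.start(), t.end());
            }
            System.out.println();
        }
    }
}
```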
Token Annotations
There are thirty-four token-annotation flags, numbered from 0 to 33 in the table below. When added to tokens, these codes are indicated by setting a "1" (a flag) in the appropriate column of a 64-bit binary number. The column number corresponds to the code number, counting to the left from the least-significant column, which is "column zero." This binary number is then rendered in hexadecimal format in displays of raw tokens.
Code 0 (not indexed) becomes binary 1 (a "1" flag in column _zero_), which is 1 in hexadecimal. Code 4 (stem) becomes binary 10000 (a "1" flag in column 4), which is 10 in hexadecimal. Code 23 (stopword) becomes binary 1000 0000 0000 0000 0000 0000 (a "1" flag in column 23), which is 800000 in hexadecimal.
The codes can be combined by doing a bitwise OR of the respective binary numbers.
Code 1 (tokenized) can be combined with code 28 (case-sensitive) by performing a bitwise OR of binary 10 with binary 1 0000 0000 0000 0000 0000 0000 0000, resulting in 1 0000 0000 0000 0000 0000 0000 0010, which is 10000002 in hexadecimal.
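The flag arithmetic above is ordinary bit manipulation and can be verified with a few lines of Java (this is not an AIE API, just a check of the numbers):

```java
public class TokenFlags {
    public static void main(String[] args) {
        // Each annotation code N is a single bit: a "1" in binary column N, i.e. 1L << N.
        long tokenized     = 1L << 1;   // code 1  -> hex 2
        long stem          = 1L << 4;   // code 4  -> hex 10
        long stopword      = 1L << 23;  // code 23 -> hex 800000
        long caseSensitive = 1L << 28;  // code 28 -> hex 10000000

        // Codes are combined with a bitwise OR of the individual flags.
        long combined = tokenized | caseSensitive;

        System.out.println(Long.toHexString(stem));      // prints 10
        System.out.println(Long.toHexString(stopword));  // prints 800000
        System.out.println(Long.toHexString(combined));  // prints 10000002
    }
}
```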
To interpret an unknown token annotation, convert it from hexadecimal to binary, and then simply note which columns contain the flags. The column numbers are the code numbers. Remember that the column number is zero-based from the least-significant side.
For instance, interpret the annotation in ['sentence'@20000002] in this manner: convert hex 20000002 to binary 10 0000 0000 0000 0000 0000 0000 0010. This shows two flags, one in column 1 and one in column 29. Look up these flags in the Binary Column Number column of the table below.
- 1 means this token has been tokenized.
- 29 means that the token marks the beginning of a structure element, in this case the beginning of a sentence.
If the example is very simple, like this one, you could just look up hex 20000000 and hex 2 directly in the Binary Flag as Hex column.
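If you would rather let code do the conversion, the following sketch (plain Java, not an AIE utility) prints the set columns of the annotation from the example above:

```java
public class AnnotationDecoder {
    public static void main(String[] args) {
        // The annotation from ['sentence'@20000002], parsed as a hexadecimal number.
        long annotation = Long.parseLong("20000002", 16);

        // Report every binary column that holds a "1" flag; the column numbers
        // are the code numbers in the table below.
        for (int column = 0; column < 64; column++) {
            if ((annotation & (1L << column)) != 0) {
                System.out.println("flag in column " + column);
            }
        }
        // Prints: flag in column 1
        //         flag in column 29
    }
}
```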
The table below provides both the column number and the corresponding binary-flag-to-hexadecimal value. It is often possible to interpret a token annotation simply by looking up the hexadecimal value in the table.
Annotation | Binary Column Number | Binary Flag as Hex | Description |
---|---|---|---|
Not Indexed | 0 | 1 | Non-indexable token. |
Tokenized | 1 | 2 | Means that any subsequent tokenizing pass should skip over this token; it has already been tokenized. (The Advanced Linguistics Module, for instance, will sometimes re-tokenize text.) |
Not preceded by whitespace | 2 | 4 | The token does not follow a whitespace character. |
Lemma | 3 | 8 | This token is a lemma. |
Stem | 4 | 10 | This token is a stem. |
Entity | 5 | 20 | This is an entity. |
Start of Document | 6 | 40 | Marks the beginning of a document. |
End of Document | 7 | 80 | Marks the end of a document. |
Start of Paragraph | 8 | 100 | Beginning of a paragraph. |
End of Paragraph | 9 | 200 | End of a paragraph. |
Start of Sentence | 10 | 400 | Beginning of a sentence. |
End of Sentence | 11 | 800 | End of a sentence. |
Start Noun Phrase | 12 | 1000 | Beginning of a noun phrase. |
Included in Noun Phrase | 13 | 2000 | Part of a noun phrase, but not the start of it. |
Noun | 14 | 4000 | Token is a noun. Not normally used in AIE. |
Pronoun | 15 | 8000 | This token is a pronoun. |
Verb | 16 | 10000 | Not used in AIE. |
Adjective | 17 | 20000 | Not used in AIE. |
Adverb | 18 | 40000 | Not used in AIE. |
Preposition | 19 | 80000 | Not used in AIE. |
Conjunction | 20 | 100000 | Not used in AIE. |
Interjection | 21 | 200000 | Not used in AIE. |
Boundary | 22 | 400000 | Marks a token which immediately follows the boundary of a table or spreadsheet cell, table or spreadsheet row, or a paragraph break. |
Stopword | 23 | 800000 | This token is a stopword. |
Prefix | 24 | 1000000 | This token is a prefix. |
Suffix | 25 | 2000000 | This token is a suffix. |
Wildcard | 26 | 4000000 | This token is a wildcard character. |
Wildcard Component | 27 | 8000000 | This token is part of a wildcard expression. |
Case-Sensitive | 28 | 10000000 | Marks that a token should not be lowercased by the indexer. |
Start Element | 29 | 20000000 | This token is the start of a structure element, such as the beginning of a scope. |
End Element | 30 | 40000000 | This token is the end of a structure element, such as the end of a scope. |
Element Attribute | 31 | 80000000 | This token is an attribute of a structure element. |
Characters | 32 | 100000000 | Character data. |
Marker | 33 | 200000000 | Serves as a marker (checkpoint) to enable seeking to a specific point in a token list. |
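For convenience, the table can also be captured in code. The sketch below is ordinary Java, not part of the AIE API; the class and describe method are invented for illustration. It maps each binary column number to the annotation name from the table above and reports which flags are set in a hex annotation.

```java
import java.util.ArrayList;
import java.util.List;

public class AnnotationNames {
    // Annotation names indexed by binary column number, copied from the table above.
    private static final String[] NAMES = {
        "Not Indexed", "Tokenized", "Not preceded by whitespace", "Lemma",
        "Stem", "Entity", "Start of Document", "End of Document",
        "Start of Paragraph", "End of Paragraph", "Start of Sentence", "End of Sentence",
        "Start Noun Phrase", "Included in Noun Phrase", "Noun", "Pronoun",
        "Verb", "Adjective", "Adverb", "Preposition",
        "Conjunction", "Interjection", "Boundary", "Stopword",
        "Prefix", "Suffix", "Wildcard", "Wildcard Component",
        "Case-Sensitive", "Start Element", "End Element", "Element Attribute",
        "Characters", "Marker"
    };

    /** Returns the names of all flags set in a hex annotation such as "20000002". */
    static List<String> describe(String hexAnnotation) {
        long annotation = Long.parseLong(hexAnnotation, 16);
        List<String> flags = new ArrayList<>();
        for (int column = 0; column < NAMES.length; column++) {
            if ((annotation & (1L << column)) != 0) {
                flags.add(NAMES[column]);
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        System.out.println(describe("20000002")); // [Tokenized, Start Element]
        System.out.println(describe("10000002")); // [Tokenized, Case-Sensitive]
    }
}
```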