
Overview

Several AIE features work by tokenizing text and "stacking" various kinds of tokens in a TokenList. It is sometimes useful to inspect these tokens, for instance when submitting a support issue that involves tokenizing features.

This page demonstrates how to view tokenized text, and how to interpret the token annotations you will see there.

Expert Use Only

Raw token format evolves as new features are added to AIE. This page is a field guide to tokens spotted "in the wild" to help you interpret the source and meaning of the tokens. The details of token encoding may change without notice.


Viewing Tokenized Documents

AIE provides many logging classes that can record the state of IngestDocuments as they pass through ingestion workflows. (See Logging Level Settings for general information about this feature.) The following example shows how to log the state of IngestDocuments just before they enter the index, but it is equally possible to capture a snapshot of the document state before and/or after any workflow or component.

The log file is in XML format and is rich with information about each IngestDocument. It includes side-by-side examples of document text before and after tokenization. (The transformer outputs TokenLists as strings.)

LogDocument File Size

The output file quickly becomes very large. It is best to limit the sample to just a few documents of interest.

Logging IngestDocument State

To view raw tokens in text, open the AIE Administrator and navigate to the Logging > Logging Level Settings page.

Scroll down to the com.attivio.platform.DocumentLogging.DocumentLog-indexer-index-content-dispatcher class and set its logging level to trace.

Click the Update button.

Go to the System Management > Connectors page and ingest just a few documents by turning a connector on and then immediately turning it off.

The log output file is <data-agent>\projects\<project-name>\default\logs\logs-<node-name>\aie-node.documentDetails.log. Navigate the XML to a document of interest, and note the locations where cleartext and tokenLists are displayed side-by-side:

Tokenized Output
      <field name="terrain" type="string">
        <value>
          <tokens>['Mostly'#0~6] ['High'#7~11] ['Plateau'#12~19] ['and'#20~23] ['Desert'#24~30]</tokens>
          <string>Mostly High Plateau and Desert</string>
        </value>
        <value>
          <tokens>['Some'#0~4] ['Mountains'#5~14, 'mountain'@8#5~14]</tokens>
          <string>Some Mountains</string>
        </value>
        <value>
          <tokens>['Narrow'#0~6] ['Discontinuous'#8~21] ['Coastal'#22~29] ['Plain'#30~35]</tokens>
          <string>Narrow, Discontinuous Coastal Plain</string>
        </value>
      </field>

To stop the logging, set the logging level back to "fatal" or simply restart the AIE node.

See Interpreting Token Annotations, below, for more information on interpreting tokens.
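As a rough illustration of pairing cleartext with its tokens, the following Python sketch reads a single field element shaped like the sample above. It only assumes the field > value > tokens/string nesting shown in that sample; the layout of the rest of the log file is not assumed, and the FRAGMENT constant and token_pairs helper are hypothetical names used for this example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment copied from the sample output above.
FRAGMENT = """
<field name="terrain" type="string">
  <value>
    <tokens>['Some'#0~4] ['Mountains'#5~14, 'mountain'@8#5~14]</tokens>
    <string>Some Mountains</string>
  </value>
  <value>
    <tokens>['Narrow'#0~6] ['Discontinuous'#8~21] ['Coastal'#22~29] ['Plain'#30~35]</tokens>
    <string>Narrow, Discontinuous Coastal Plain</string>
  </value>
</field>
"""


def token_pairs(field_xml):
    """Return (field name, cleartext, raw token string) for each <value>."""
    field = ET.fromstring(field_xml)
    pairs = []
    for value in field.findall("value"):
        tokens = value.findtext("tokens")
        if tokens is None:
            continue  # this value was logged without a token string
        pairs.append((field.get("name"), value.findtext("string"), tokens))
    return pairs


pairs = token_pairs(FRAGMENT)
for name, text, tokens in pairs:
    print(f"{name}: {text!r}")
    print(f"    {tokens}")
```

Because the log file can grow large, a real script would probably stream the file rather than load it whole; this sketch deliberately works on one small fragment.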

 

Interpreting Token Annotations

Basic tokenization breaks a string into individual words (and sometimes punctuation marks) and attaches position information to each one. If a string contains the word "running" at character offsets 100 through 106, its simple token might be:

['running'#100~107]

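Note that the second position number is the offset of the character immediately following the end of the word, so the length of the word is the difference between the two numbers (here, 107 - 100 = 7).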

The word "running" would be entered in the AIE index, and the position pointers would be used to support phrase matching, "nearby" matches, and highlighting of matching text in the results.

If the stemming feature were enabled, it might change "running" to "run" and add a second token, "stacked" in the same position as the first:

['running'#100~107, 'run'@10#100~107]

This shows two tokens at the same location (within the same pair of square brackets). A search might match either one of them. The "@10" is an annotation indicating that this token is the result of stemming. (See the table below.)
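The token notation described above can be taken apart with a pair of regular expressions. The following Python sketch is based only on the examples shown on this page; it assumes that token text contains no single quotes or square brackets (the escaping rules are not documented here), and parse_token_list is a hypothetical helper, not part of AIE.

```python
import re

# One token: 'text', an optional @hex annotation, then #start~end offsets.
TOKEN_RE = re.compile(r"'([^']*)'(?:@([0-9A-Fa-f]+))?#(\d+)~(\d+)")
# One position: everything between a pair of square brackets.
GROUP_RE = re.compile(r"\[([^\]]*)\]")


def parse_token_list(raw):
    """Parse a raw token string into a list of positions.

    Each position is a list of (text, annotation, start, end) tuples.
    Tokens stacked at one position share a pair of square brackets.
    A token without an @ annotation gets annotation 0.
    """
    positions = []
    for group in GROUP_RE.findall(raw):
        stacked = [
            (text, int(flags, 16) if flags else 0, int(start), int(end))
            for text, flags, start, end in TOKEN_RE.findall(group)
        ]
        positions.append(stacked)
    return positions


source = "Some Mountains"
positions = parse_token_list("['Some'#0~4] ['Mountains'#5~14, 'mountain'@8#5~14]")

for stacked in positions:
    for text, annotation, start, end in stacked:
        # The end offset is exclusive, so a plain slice recovers the source text.
        print(text, hex(annotation), repr(source[start:end]))
```

Since the second offset points just past the word, source[start:end] returns the original cleartext for every token in a stack, including stems and lemmas whose own text differs from it.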

Token Annotations

There are thirty-four token-annotation flags, numbered from 0 to 33 in the table below. When added to tokens, these codes are indicated by setting a "1" (a flag) in the appropriate column of a 64-bit binary number. The column number corresponds to the code number, counting to the left from the least-significant column, which is "column zero." This binary number is then rendered in hexadecimal format in displays of raw tokens.

Code 0 (not indexed) becomes binary 1 (a "1" flag in column _zero_), 
which is 1 in hexadecimal.

Code 4 (stem) becomes binary 10000 (a "1" flag in column 4), 
which is 10 in hexadecimal. 

Code 23 (stopword) becomes binary 100000000000000000000000 
(a "1" flag in column 23), which is 800000 in hexadecimal.

The codes can be combined by doing a bitwise OR of the respective binary numbers.

Code 1 (tokenized) can be combined with code 28 (case-sensitive) 
by performing a bitwise OR of binary 10 with binary 
10000000000000000000000000000, resulting in 
10000000000000000000000000010, which is 10000002 in hexadecimal.
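The encoding rule can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only; flag and annotation_hex are hypothetical helper names, and nothing here calls an AIE API.

```python
def flag(code):
    """Return the integer whose only set bit is in column `code`."""
    return 1 << code


def annotation_hex(*codes):
    """Combine one or more annotation codes with bitwise OR and render as hex."""
    value = 0
    for code in codes:
        value |= flag(code)
    return format(value, "x")


print(annotation_hex(0))      # 1
print(annotation_hex(4))      # 10
print(annotation_hex(23))     # 800000
print(annotation_hex(1, 28))  # 10000002
```

The order of the codes does not matter, because bitwise OR is commutative.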

To interpret an unknown token annotation, convert it from hexadecimal to binary, and then simply note which columns contain the flags. The column numbers are the code numbers. Remember that the column number is zero-based from the least-significant side.

For instance, interpret the annotation of ['sentence'@20000002] in this manner: Convert hex 20000002 to binary 100000000000000000000000000010. Interpret this as two flags in column 1 and in column 29. Look up these flags in the Binary Column Number column of the table below.

  • 1 means this token has been tokenized.
  • 29 means that the token marks the beginning of a structure element, in this case the beginning of a sentence.

If the example is very simple, like this one, you could just look up hex 20000000 and hex 2 directly in the Binary Flag as Hex column.
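The decoding procedure can also be sketched in Python. The flag names are copied from the annotation table on this page, indexed by binary column number; decode_annotation is a hypothetical helper, and columns beyond the documented ones are reported as "unknown" on the assumption that newer AIE versions may add flags.

```python
# Annotation names indexed by binary column number (0 to 33).
FLAG_NAMES = [
    "Not Indexed", "Tokenized", "Not Preceded by Whitespace", "Lemma",
    "Stem", "Entity", "Start of Document", "End of Document",
    "Start of Paragraph", "End of Paragraph", "Start of Sentence",
    "End of Sentence", "Start Noun Phrase", "Included in Noun Phrase",
    "Noun", "Pronoun", "Verb", "Adjective", "Adverb", "Preposition",
    "Conjunction", "Interjection", "Boundary", "Stopword", "Prefix",
    "Suffix", "Wildcard", "Wildcard Component", "Case-Sensitive",
    "Start Element", "End Element", "Element Attribute", "Characters",
    "Marker",
]


def decode_annotation(hex_value):
    """Return (column, name) for every flag set in a hex annotation."""
    value = int(hex_value, 16)
    flags = []
    for column in range(value.bit_length()):
        if (value >> column) & 1:
            name = FLAG_NAMES[column] if column < len(FLAG_NAMES) else "unknown"
            flags.append((column, name))
    return flags


print(decode_annotation("20000002"))  # [(1, 'Tokenized'), (29, 'Start Element')]
print(decode_annotation("10"))        # [(4, 'Stem')]
```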

The table below provides both the column number and the corresponding binary-flag-to-hexadecimal value. It is often possible to interpret a token annotation simply by looking up the hexadecimal value in the table.

Annotation                  Binary Column Number  Binary Flag as Hex  Description
Not Indexed                 0                     1                   Non-indexable token.
Tokenized                   1                     2                   Means that any subsequent tokenizing pass should skip over this token. It has already been tokenized. (The Advanced Linguistics Module, for instance, will sometimes re-tokenize text.)
Not Preceded by Whitespace  2                     4                   The token does not follow a whitespace character.
Lemma                       3                     8                   This token is a lemma.
Stem                        4                     10                  This token is a stem.
Entity                      5                     20                  This is an entity.
Start of Document           6                     40                  Marks the beginning of a document.
End of Document             7                     80                  Marks the end of a document.
Start of Paragraph          8                     100                 Beginning of a paragraph.
End of Paragraph            9                     200                 End of a paragraph.
Start of Sentence           10                    400                 Beginning of a sentence.
End of Sentence             11                    800                 End of a sentence.
Start Noun Phrase           12                    1000                Beginning of a noun phrase.
Included in Noun Phrase     13                    2000                Part of a noun phrase, but not the start of it.
Noun                        14                    4000                Token is a noun. Not normally used in AIE.
Pronoun                     15                    8000                This token is a pronoun.
Verb                        16                    10000               Not used in AIE.
Adjective                   17                    20000               Not used in AIE.
Adverb                      18                    40000               Not used in AIE.
Preposition                 19                    80000               Not used in AIE.
Conjunction                 20                    100000              Not used in AIE.
Interjection                21                    200000              Not used in AIE.
Boundary                    22                    400000              Marks a token which immediately follows the boundary of a table or spreadsheet cell, table or spreadsheet row, or a paragraph break.
Stopword                    23                    800000              This token is a stopword.
Prefix                      24                    1000000             This token is a prefix.
Suffix                      25                    2000000             This token is a suffix.
Wildcard                    26                    4000000             This token is a wildcard character.
Wildcard Component          27                    8000000             This token is part of a wildcard expression.
Case-Sensitive              28                    10000000            Marks that a token should not be lowercased by the indexer.
Start Element               29                    20000000            This token is the start of a structure element, such as the beginning of a scope.
End Element                 30                    40000000            This token is the end of a structure element, such as the end of a scope.
Element Attribute           31                    80000000            This token is an attribute of a structure element.
Characters                  32                    100000000           Character data.
Marker                      33                    200000000           Serves as a marker (checkpoint) to enable seeking to a specific point in a token list.
