
Overview

Attivio provides a set of language-specific features intended to increase the accuracy (both precision and recall) of search results in multi-language collections. Incoming text is identified by language, and language-specific transformations (lemmatization, synonym substitution) are applied. At query time, queries are put through the same language-specific transformations so that they match the appropriate records in the index more accurately.
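The value of applying the same transformation at index time and at query time can be shown with a toy example. This is a minimal Python sketch, not Attivio code: the `LEMMAS` tables and the `transform` helper are hypothetical, and the language label is supplied by hand rather than detected.

```python
# Hypothetical per-language lemma tables; real lemmatizers are far larger.
LEMMAS = {
    "en": {"mice": "mouse", "ran": "run"},
    "fr": {"chevaux": "cheval", "yeux": "oeil"},
}


def transform(text, language):
    """Lowercase, split on whitespace, and map each token to its lemma."""
    table = LEMMAS.get(language, {})
    return [table.get(tok, tok) for tok in text.lower().split()]


# Index time: the document's language label selects the lemma table.
# (The label is given by hand here; language identification is out of scope.)
indexed_terms = set(transform("Les chevaux blancs", "fr"))

# Query time: the same transformation maps the query onto the indexed form.
query_terms = transform("cheval", "fr")
matches = all(term in indexed_terms for term in query_terms)

# A query for the singular form would miss a document indexed verbatim.
verbatim_terms = set("Les chevaux blancs".lower().split())
misses_without_lemmas = "cheval" not in verbatim_terms
```

In this sketch the singular query finds the plural document only because both sides were reduced to the same lemma.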

Language features are packaged in the following general categories:

  • Attivio Core Linguistics - This is the set of language analysis and transformation tools that are native to Attivio.  They were developed for English text, but can be generally useful for languages that use the Latin alphabet.
  • Advanced Linguistics Module - This add-on module supports more than fifty European, Middle Eastern and Asian languages. (Not all features are available in all languages.)  Individual languages are licensed separately.

This document first discusses the linguistic workflows that are part of the Core Linguistics Module, on both the document ingestion side and the query side. Then it introduces the Attivio Advanced Linguistics Module and individual components in more detail.

See the self-study page on Attivio Text Analytics for an introduction to text analytics with many video highlights.


Core Linguistics Module

Attivio's Core Linguistics Module offers several tools for full-text indexing, from low-level tools like tokenization to higher level tools like dictionary entity extraction and first+last name extraction. This section describes how these tools work together to extract desired information from unstructured text.

Linguistics Ingestion Components

For more information about the ingestion workflows and components, see:

Linguistics Query Components

For more information about the query workflows and components, see:

Schema and Field Properties

The AIE Schema is the description of all the document fields that Attivio knows how to process.  Schema fields are described in terms of their properties.   In the Field Properties page, there are several properties that determine how the text in a field should be processed. These properties, their legal values, and their defaults, are as follows:

property          setting              default
stopwords.mode    off, index, query    off
synonyms.mode     off, on, query       off
acronyms.mode     off, on, query       off

However, the settings have slightly different meanings for the different properties:

  • For stopwords.mode, the settings are off (default), query (query-time only), and index (index-time and query-time, since removing stopwords from the index but not from queries produces zero hits for queries that contain stopwords).
  • Synonym and acronym expansion can be off (the default), on (enabled on all query terms), or auto, where the ExpandSynonyms or ExpandAcronyms component expands only terms within a pre-configured list of fields.
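The reason index-time stopword removal must be paired with query-time removal can be illustrated with a toy inverted index. This is a minimal Python sketch, not Attivio's implementation: the stopword list, the helper functions, and the rule that every query term must match are assumptions made for illustration.

```python
# Toy model of stopword handling; not Attivio's implementation.
STOPWORDS = {"the", "of", "a"}  # hypothetical stopword list


def tokenize(text, remove_stopwords):
    """Lowercase, split on whitespace, optionally drop stopwords."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def build_index(docs, remove_stopwords):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text, remove_stopwords):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, remove_stopwords):
    """Return ids of documents that contain every query token."""
    tokens = tokenize(query, remove_stopwords)
    if not tokens:
        return set()
    postings = [index.get(t, set()) for t in tokens]
    return set.intersection(*postings)


docs = {1: "The Lord of the Rings", 2: "A Tale of Two Cities"}
index = build_index(docs, remove_stopwords=True)

# Stopwords stripped from the index but left in the query: nothing matches.
mismatched = search(index, "lord of the rings", remove_stopwords=False)
# Stopwords stripped on both sides, as the "index" setting describes.
consistent = search(index, "lord of the rings", remove_stopwords=True)
```

Here `mismatched` is empty because "of" and "the" have no postings in the index, while `consistent` finds document 1.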

You can customize the linguistic performance of Attivio on a per-field basis by changing these settings.

Interaction with Inclusion Fields

Some fields are populated by copying text from other fields, such as Attivio's default content field.  The content field "includes" all of the text from the title and author fields, among others. It is Attivio's "default search" field.

If linguistics processing is required on a field that includes the content of other fields, the same processing should be configured for the combined field and for all of the component fields.

Consider the following example:

<project-dir>/conf/schema.xml
<field name="content" type="string" indexed="true" facet="false" stored="false">
        <include-field name="title" />
        <include-field name="author" />
</field>

For linguistics processing to work properly on the content field above, you must define the linguistics properties for all included fields, not just the content field.  For example, to perform stemming on the content field, set:

<project-dir>/conf/schema.xml
<property name="index.stemming" value="true"/>

in every <field> specification (content, title, author) as follows:

<project-dir>/conf/schema.xml
<field name="content" type="string" indexed="true" facet="false" stored="false">
        <properties>
            <property name="index.stemming" value="true"/>
        </properties>
        <include-field name="title" />
        <include-field name="author" />
</field>

<field name="title" type="string" indexed="true" stored="true" sort="true" facet="false">
        <properties>
              <property name="index.stemming" value="true"/>
        </properties>
</field>

<field name="author" type="string" indexed="true" stored="true" sort="true" facet="false">
        <properties>
              <property name="index.stemming" value="true"/>
        </properties>
</field>
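A consistency rule like this one is easy to check mechanically. The following Python sketch is an illustration only: the `<schema>` wrapper element, the trimmed field attributes, and the `find_mismatches` helper are assumptions, not part of Attivio's tooling. It reports every included field that lacks a property set on the combined field.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment modeled on the example above; the <schema>
# wrapper is an assumption added so the fragment parses as one document.
# The author field deliberately omits index.stemming.
SCHEMA = """
<schema>
  <field name="content" type="string" indexed="true" stored="false">
    <properties>
      <property name="index.stemming" value="true"/>
    </properties>
    <include-field name="title"/>
    <include-field name="author"/>
  </field>
  <field name="title" type="string" indexed="true" stored="true">
    <properties>
      <property name="index.stemming" value="true"/>
    </properties>
  </field>
  <field name="author" type="string" indexed="true" stored="true"/>
</schema>
"""


def field_properties(field):
    """Return {property name: value} for one <field> element."""
    return {p.get("name"): p.get("value")
            for p in field.findall("properties/property")}


def find_mismatches(schema_xml):
    """List (combined field, included field, property) triples that disagree."""
    root = ET.fromstring(schema_xml)
    fields = {f.get("name"): f for f in root.findall("field")}
    problems = []
    for name, field in fields.items():
        combined = field_properties(field)
        for inc in field.findall("include-field"):
            inc_name = inc.get("name")
            included = field_properties(fields[inc_name]) if inc_name in fields else {}
            for prop, value in combined.items():
                if included.get(prop) != value:
                    problems.append((name, inc_name, prop))
    return problems


mismatches = find_mismatches(SCHEMA)
```

For the fragment above, the check flags the author field, which is included in content but does not declare index.stemming.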

Multilingual Support - Core Linguistics Module

Attivio's Core Linguistics Module is intended for use with English text only. That said, the Core Linguistics Module can accept and index words in multiple languages by treating the text as if it were written in English. For instance, a French document might be analyzed as if it were English. This is not the most accurate way to index a French document, but for some applications it is adequate.

Attivio uses Unicode to represent text internally. Unicode includes support for a large number of character sets, as well as punctuation, different currency markers, accents, combined characters, and character variants. Attivio performs Unicode normalization as part of the ingestion and query workflows. For more information, refer to the section on Unicode Normalization.
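The need for normalization on both the ingestion and query sides can be seen with Python's standard `unicodedata` module. This is a sketch of the general idea only; NFC is chosen here as an example, and the normalization form Attivio applies is not stated in this section.

```python
import unicodedata

# "café" written two ways: precomposed é (U+00E9) versus
# plain e followed by a combining acute accent (U+0301).
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# Without normalization the two strings differ, so an exact term match fails.
raw_equal = precomposed == decomposed

# NFC is an illustrative choice, not a statement about Attivio's behavior.
normalized_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
```

Both strings render identically, yet they only compare equal once the indexed text and the query have been normalized to the same form.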

Advanced Linguistics Module

The optional Advanced Linguistics Module extends the standard linguistics capabilities of Attivio to provide broader language support and algorithmic entity extraction for many European, Middle Eastern and Asian languages. Languages are licensed individually.

 
