
Overview

The Attivio platform schema defines the fields that can appear in the Attivio Universal Index.

The default schema file location is: <project-dir>\conf\schema\default.xml

The index fields have a variety of attributes and properties that control how the field will be analyzed and indexed, such as the type of stemming that should be performed on different types of text fields.

Fields derived from your incoming documents must be mapped into Attivio schema fields in order to insert the data into the index.

Virtual Fields

In addition to the fields defined in the Attivio schema, there are several Virtual Fields that Attivio uses internally, such as the .id field. These fields cannot be set or modified at ingest time, but can be referenced in queries.


Field Mapping in Attivio

The Attivio Schema is intimately related to the larger issue of mapping database fields into the Attivio Universal Index. For instance, database fields name and summary might need to be mapped into Attivio Schema fields title and text. The correct mapping is necessary in order to get the content properly analyzed and into the index.

This mapping from database fields to schema fields can occur in a variety of ways:

  • You can edit the Attivio Schema for your project to use the existing database field names as new fields in the Attivio Universal Index.
  • Mapping can be performed in the SQL query that pulls content from the database, through the use of field aliases.
  • Mapping can be set up quite easily as part of the database connector definition in the Connector Editor. You just fill out a table.
  • Mapping can also occur by editing various components of the input workflow.

This page is largely about the first option in this list: modifying the Attivio Schema to handle incoming fields that are not provided by default. The other opportunities for mapping are described where they are encountered in context in the documentation.
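As a minimal sketch of the first option, the entries below add the database column names from the example above (name and summary) to the project schema so that no renaming is needed. The type and attribute choices are assumptions made for illustration; copy the settings from a similar existing field in your own schema.

```xml
<schema name="default" merge="true">
  <fields>
    <!-- Short database column, kept as a single-valued string (assumed settings) -->
    <field name="name" type="string" multivalue="false" indexed="true" stored="true"/>
    <!-- Longer free text, so a text field is assumed here -->
    <field name="summary" type="text" multivalue="false" indexed="true" stored="true"/>
  </fields>
</schema>
```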

Fields of IngestDocuments

Configure the Attivio Schema describes the datatypes and behavior of strongly-typed fields in the Universal Index. Many users assume that these strong datatype restrictions must also apply to IngestDocument fields, but they do not.

Schema field definitions are not binding on the fields of an IngestDocument, even when the fields have the same names as Schema fields. An IngestDocument is a scratch pad where Attivio connectors and workflow components work up a description of an index entry. Field values can be transformed from one datatype to another during this process.

The schema field definitions are applied during indexing. The indexer attempts to cast the document's field values into the types required by the schema. If one or more field values cannot be correctly cast, they are discarded. This will also happen with fields that are not defined in the Attivio Schema, with the exception of dynamic fields. Note that all other valid field data will still be indexed.

Working with the Attivio Schema

During ingestion, connectors create IngestDocument instances and pass them into ingestion workflows, where the IngestDocuments are processed by workflow components. Eventually the IngestDocuments reach the indexer subflow, which processes their content and inserts it into the Attivio index.


The connectors and components typically add many fields to the IngestDocuments during ingestion. There is no restriction on the names of these fields; they are simply properties of the IngestDocument that are used to store information temporarily. When you configure a connector or workflow component, you can invent any field names that seem useful.

When the IngestDocuments reach the indexer subflow, the fields whose names match entries in the Attivio schema are inserted into the index. Undefined fields are ignored by default, with the exception of the "wildcard" or "dynamic" fields mentioned below.

To send content to the index, the containing field must be:

  • A field that is defined in the default schema, or
  • A field that you have customized in the schema, or
  • A field that you have added to the schema, or
  • Any field name that matches a "wildcard" or "dynamic" field definition in the schema.
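The fragment below sketches how these rules play out. The author entry is the default definition quoted later on this page and the *_s entry is the default string wildcard; department is a hypothetical field added for illustration, as are the IngestDocument field names mentioned in the comments.

```xml
<fields>
  <!-- Defined in the default schema: an IngestDocument field named "author" is indexed -->
  <field name="author" displayName="Author" type="STRING" multivalue="true" indexed="true" stored="true" sort="false"/>
  <!-- Hypothetical field added to the project schema: "department" is indexed -->
  <field name="department" type="string" multivalue="false" indexed="true" stored="true"/>
  <!-- Wildcard definition: a field such as "region_s" is indexed as a dynamic field -->
  <field name="*_s" type="STRING" multivalue="false" indexed="true" stored="true" sort="false"/>
  <!-- A field such as "scratch_notes" matches none of the entries above and is ignored by default -->
</fields>
```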

Selecting Fields from the Schema

The easy way to index content from IngestDocuments is to use fields that are already defined in the default schema. Just open the project schema file and browse. You may find a preconfigured field that describes your content, such as "author."

This is the default author field:

<project-dir>\conf\schema\default.xml
<field name="author" displayName="Author" type="STRING" multivalue="true" indexed="true" stored="true" sort="false"/>

This means when your connector or workflow component puts the author's name into an IngestDocument field called "author", it will be indexed. Further, the author's name will be stored for retrieval.

The default schema defines over sixty fields, and sometimes contains up to 300 depending on which modules you have loaded.

Customizing Fields in the Schema

Best Practice

The best practice is to create new schema fields rather than to change the properties of Attivio's default schema fields.

Some changes, such as modifying the type of a field, can make the schema incompatible with an existing index, in which case Attivio will be unable to start. The remedy is to revert the changes or delete the existing index files.

Adding Fields to the Schema

To add a field to the schema, open your schema file (by default, <project-dir>\conf\schema\default.xml) and add another <field> element.

  1. Scan the file for a field that is similar to the one you wish to create. This will show you appropriate default settings for that type of field. Copy this entry.
  2. Paste the entry at the bottom of the list of fields, just before the </fields> tag. Fields are evaluated top-down, so if there are name conflicts, the last entry will overwrite any previous ones.
  3. Customize the entry and save the file.
  4. In the AIE CLI, deploy the modified project configuration to the Configuration Server(s); this should restart all running nodes with the new configuration.

Note that schema changes are applied during ingestion, so you may have to re-ingest your previous content.
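For example, following the steps above, a copy of the default author entry could be adapted into a new field. The name publisher and its display name are hypothetical; the remaining attributes are carried over from the author definition.

```xml
<fields>
  <!-- ... existing field definitions ... -->

  <!-- New entry pasted as the last field, just before the closing fields tag -->
  <field name="publisher" displayName="Publisher" type="STRING" multivalue="true" indexed="true" stored="true" sort="false"/>
</fields>
```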

When customizing a field, there are several attributes that one typically sets:

  • name: Choose a unique name, unless you are deliberately overriding an existing field definition.
  • type: string, text, boolean, integer, long, float, double, money, decimal, date or shape. (Data-type details are listed in the reference tables below.)
  • indexed: Will this field be indexed? Most fields are indexed but some utility fields, such as the uri of an image, are unlikely to be searched.
  • stored: Should Attivio store this field value for use in query results? Usually we store field values, but there are exceptions. Some fields like the content field are assembled by concatenating other fields that are already indexed and stored independently. It is useful to index this concatenated content to provide a default-search-field, but there is no need to store it. That information is already stored.
  • displayName: If the field name is cryptic and will be displayed in query results, you can optionally give it a more readable display name.
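A sketch of how these attributes combine, using hypothetical field names: thumbnail_uri is stored for display but not searchable, while all_text is searchable but not stored because its source fields are already stored individually.

```xml
<fields>
  <!-- Returned with results but never searched, so indexing is turned off -->
  <field name="thumbnail_uri" displayName="Thumbnail" type="string" multivalue="false" indexed="false" stored="true"/>
  <!-- Searchable concatenation of other fields; storing it would only duplicate data -->
  <field name="all_text" type="string" multivalue="false" indexed="true" stored="false">
    <include-field name="title"/>
    <include-field name="text"/>
  </field>
</fields>
```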

Many other field attributes are listed in the tables below. Fields can also carry properties; see Field Properties for the full list.

Indexing Dynamic Fields using Wildcards

Attivio provides a shortcut that lets you index the content of some of your dynamic fields without first modifying the schema. This shortcut is very convenient during prototyping and when creating a demo in real time.

When you create the dynamic field (in your connector or workflow component) you can add a special suffix to the field name. The suffix tells Attivio how to index the field. These are the dynamic-field suffixes that are defined in the default Attivio schema:

Suffix

Action

<fieldname>_s

This is a string field. This is a single-valued field that will be indexed and stored.  It can be used as a facet, and it has been optimized for sorting results quickly. Attivio will also concatenate this field with existing content in the document's content field, meaning that it will match queries which have no field restriction specified.

<fieldname>_text

This is a text field. It will be indexed and stored. It cannot be used as a facet or join key and has not been optimized for sorting. It will receive special tokenization to support natural-language processing. Like <fieldname>_s, above, it will be concatenated with the content field for indexing and will match queries which have no field restriction specified.

<fieldname>_id

This is a single-valued string field. Since this value is intended as a document ID, it will be indexed and stored but not tokenized, and cannot be used as a facet.  

<fieldname>_i

This single-valued field will be indexed and stored as an integer.

<fieldname>_l

This single-valued field will be indexed and stored as a long.

<fieldname>_d

This single-valued field will be indexed and stored as a double.

<fieldname>_f

This single-valued field will be indexed and stored as a float.

<fieldname>_dec

This single-valued field will be indexed and stored as a decimal number with eight places to the right of the decimal point.

<fieldname>_m

This single-valued field will be indexed and stored to represent money (as a decimal type with four places after the decimal point).

<fieldname>_r

This multi-valued field will be indexed and stored as a real-time update field.

<fieldname>_t

This single-valued field will be indexed and stored as a date, using the format "MM/dd/yyyy".

<fieldname>_mvs

Multi-valued string field. It is multi-valued, indexed, and stored, but not optimized for sorting. Attivio will also concatenate this field with existing content in the document's content field.

The dynamic-field suffixes are often referred to as "wildcards" in the documentation because they are defined using wildcard notation in the Attivio schema.  This is how the "_s" field is defined in the schema:

<project-dir>\conf\schema\default.xml
<field name="*_s" type="STRING" multivalue="false" indexed="true" stored="true" sort="false"/>

As you can see, the "*_s" field matches all dynamic fields that end with "_s".
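Other suffix definitions presumably follow the same pattern. The entries below are a sketch inferred from the suffix table above, not a copy of the shipped schema, so check your project's default.xml for the actual attribute values.

```xml
<fields>
  <!-- Single-valued integer suffix, e.g. a connector field named "pagecount_i" -->
  <field name="*_i" type="INTEGER" multivalue="false" indexed="true" stored="true"/>
  <!-- Multi-valued string suffix, e.g. "tags_mvs"; not optimized for sorting -->
  <field name="*_mvs" type="STRING" multivalue="true" indexed="true" stored="true" sort="false"/>
</fields>
```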

Defining a "Catch-All" Field

The wildcard "*" standing alone matches all dynamic fields, whether they have special suffixes or not.

<project-dir>\conf\schema\default.xml
<field name="*" type="string" indexed="true" stored="true"/>

Note that Attivio attempts to match unknown fields to these wildcard entries in a top-down manner, and stops with the first successful match. Therefore, if you use the "*" wildcard to describe a "catch-all" field, it needs to be the final wildcard field in the schema. Otherwise it will intercept dynamic fields that were intended to match other wildcards.

The wildcard suffixes apply only to the unknown fields that the indexer finds in IngestDocuments. The wildcards do not match fields that are described in the Attivio schema, even if they use the same suffixes.
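A minimal sketch of a safe ordering, reusing the definitions shown above: specific suffix wildcards come first and the catch-all comes last. The *_i entry is an assumed definition included only to show the ordering.

```xml
<fields>
  <!-- Specific wildcards first: "price_i" stops at *_i, "region_s" stops at *_s -->
  <field name="*_i" type="integer" multivalue="false" indexed="true" stored="true"/>
  <field name="*_s" type="STRING" multivalue="false" indexed="true" stored="true" sort="false"/>
  <!-- Catch-all last: only fields that matched nothing above reach this entry -->
  <field name="*" type="string" indexed="true" stored="true"/>
</fields>
```

If the * entry were moved above *_i, a field such as price_i would match the catch-all first and be indexed as a string.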

<schema> Element

When you create a project with the createproject utility, Attivio creates a local schema file based on the modules that you included in the project. This file is <project-dir>\conf\schema\default.xml. Any schema extensions or modifications for the project should be made in this file.

Do not edit AIE's default schema!

Do not make changes in Attivio's default schema file, which is <install-dir>\conf\core-app\attivio-schema.xml. Edit the project's schema file instead.

This lets you make independent schema changes in each of your projects, and ensures that the changes will not be lost when you upgrade Attivio to the next version.

<schema> Attributes

The top-level XML element defining a schema is the <schema> element. The <schema> element has two attributes:

  • name: The primary Attivio schema is called "default". Do not change the name of the default schema. Multiple components of Attivio look for the default schema by name.
  • merge: Your project schema file may contain multiple "default" schema definitions. This is due to createproject adding small schema definitions from various modules that were included in the project. If merge is true, the schema definition will be merged with the existing one. If set to false, the schema will replace the existing one.

For instance, the keyphrases module adds a single field to the project's schema.xml file. Note that it refers to schema name "default" with merge set to "true."

  <!--
  From module: keyphrases
  -->
  <schema name="default" merge="true">
    <fields>
      <field name="keyphrases" type="string" tokenize="false" indexed="true" stored="true"  displayName="Key Phrases"/>
    </fields>
  </schema>

The default schema contains many field definitions with preconfigured properties that are usually appropriate to that type of field. In any specific project, most of those fields will never be used. These unused fields entail no overhead, so there's no benefit to removing fields from the schema.

<schema> Properties

The following property must be set on the <schema> element.

Property

Type

Default Value

Description

fieldNames.mode

enum (single_value, multi_value)

multi_value

Indicates how fields will be indexed: single_value (one value indexed per document) or multi_value (multiple values indexed per document). The default is multi_value.

Example:

<schema name="default">
  <properties>
    <property name="fieldNames.mode" value="multi_value"/>
  </properties>
  ...
</schema>

Any other properties set on the <schema> element provide the default values for the corresponding field-level properties.

<fields> Element

The <fields> element is a container for all the <field> and <realtimeField> declarations, which define the attributes of the fields in the index.

<fields> Attributes

The <fields> element has one attribute:

  • default-search-field: Defines which field is used to match query terms that do not specify a field. This attribute is set to the content field in the default Attivio schema.


The content field concatenates the document's title, author and text values, plus the values of all *_s and *_i dynamic fields, into a single field value. When the user issues an unfielded query, Attivio looks for a match in the content field.  If a match is found, the search results are compiled using the stored values of the individual fields.  The concatenated value is indexed, but is not stored.

<field> Elements

The <field> elements are read into Attivio in a top-down manner. If there are two or more <field> elements with the same name, the last one will replace the earlier ones.
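As an illustration, the second definition below replaces the first because it appears later in the file. The changed attribute values are arbitrary and chosen only to make the override visible.

```xml
<fields>
  <!-- Earlier definition (the default author entry) -->
  <field name="author" displayName="Author" type="STRING" multivalue="true" indexed="true" stored="true" sort="false"/>
  <!-- Later definition with the same name: this is the one that takes effect -->
  <field name="author" displayName="Written By" type="STRING" multivalue="true" indexed="true" stored="true" sort="true"/>
</fields>
```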

<field> Attributes

The following attributes can be set as <field> tag attributes.

Attribute

Description

Default

name

The name of the field.  Field names must start with a number, letter or underscore and contain only numbers, letters, periods, dashes, or underscores.
A name can also start with the '*' character, which matches zero or more legal field name characters. For example, "*_s" matches any field name ending with "_s".

As of Attivio Platform 5.0, field names are no longer case-sensitive.

Numbers and letters are determined by the Java Character.isLetterOrDigit method. Period, dash, and underscore mean only those specific ASCII characters, not other Unicode equivalents.

N/A (Required)

type

The data type of the field. Note that all field types are facetable and joinable by default, except for TEXT and SHAPE fields. (If you want to prevent faceting on a tokenized field, define it as a TEXT field, not a STRING field.)

STRING

Short text string (limited to 4,096 characters by default).

TEXT

Long text string (unlimited length).
(TEXT fields must be tokenized, are not optimized for sorting, cannot be used as join keys, and do not support faceting.)

BOOLEAN

True/False value.

INTEGER

Binary 32-bit precision integer.

LONG

Binary 64-bit precision integer.

FLOAT

Binary 32-bit precision floating-point number.

DOUBLE

Binary 64-bit precision floating-point number.

MONEY

Fixed-precision number value with 4 decimal digits after decimal point. Supports values in range of [-922337203685477.5808, 922337203685477.5807].
(Input values should be added to IngestDocument as String or BigDecimal values to ensure precision is honored.)

DECIMAL

Fixed-precision number value that supports large numbers with configurable scale. See Field Properties for configuring this type.
(Input values should be added to IngestDocument as String or BigDecimal values to ensure precision is honored.)

DATE

Binary date represented as a 64-bit precision integer.
(Highlighting is not supported for DATE fields.)

SHAPE

Geometric shape used for shape filtering. See Shape Intersection Filtering for use of this type.
(Faceting is not supported for SHAPE fields.)

string

indexed

<true>

Field is indexed and searchable.

<false>

Field is not searchable. Making a field unsearchable is useful for fields that you want to display at result time but that are never searched by an end user.

true

tokenize

<auto>

Field is tokenized in the ingest workflow if it is indexed and is of type STRING or TEXT.

<yes>

Field is tokenized in the ingest workflow.

<no>

Field is not tokenized.
(This setting is not supported for fields of type TEXT.)

<true>

Same as <yes>

<false>

Same as <no>

auto

lowercase

<auto>

Field is indexed in lowercase if tokenized or case-sensitive otherwise; sorting is case-sensitive.

<yes>

Field is indexed in lowercase; sorting is case-insensitive.

<no>

Field is indexed case-sensitive; sorting is case-sensitive.

<true>

Same as <yes>

<false>

Same as <no>

This setting does not affect JOIN behavior. If two join-key field values have mismatched case, those values will not join in a JOIN query.

auto

stored

<true>

Field is stored so that it can be returned in a result list.

<false>

Field is not stored. Should be used for fields that are indexed but are not displayed in a result list, such as the default search field.

true

sort

<true>

Field has been optimized for rapid sorting of search results.

<false>

Field can be used to sort search results, but the process may be very slow.

false

displayName

Value to use for the field name in search results.

null - (displays field name)

default

The default value for the field.  "NOW" is allowed as a default value for fields of type DATE.

null

multivalue

Required on all fields

<yes>

Field is multivalued (can have multiple values per document).

<no>

Field is single-valued (can only have one value per document).

<true>

Same as <yes>

<false>

Same as <no>

default
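The sketch below combines several of the attributes from this table. The field names sku and ingest_date are hypothetical, and the attribute values are one plausible combination rather than recommended settings.

```xml
<fields>
  <!-- Exact-match identifier: not tokenized, indexed in lowercase, optimized for sorting -->
  <field name="sku" displayName="SKU" type="string" multivalue="false"
         indexed="true" stored="true" tokenize="false" lowercase="yes" sort="true"/>
  <!-- Date field whose default value is NOW -->
  <field name="ingest_date" type="date" multivalue="false"
         indexed="true" stored="true" default="NOW"/>
</fields>
```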

<field> Properties

See Field Properties for a list of all properties that apply to fields.

<include-field> Elements

Include-field elements are used to create a single field by concatenating the values of other fields. For example, the content field is composed of other fields as follows:

<field name="content" type="string" indexed="true" stored="false">
  <include-field name="title" />
  <include-field name="author" />
  <include-field name="text" />
  <include-field name="*_s" />
</field>
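The same mechanism can build other combined fields. In this sketch, people is a hypothetical field that concatenates author with an assumed editor field so both can be searched together.

```xml
<fields>
  <!-- Indexed for searching only; the source fields hold the stored values -->
  <field name="people" type="string" indexed="true" stored="false">
    <include-field name="author"/>
    <include-field name="editor"/>
  </field>
</fields>
```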

<realtimeField> Elements

Real-time fields are configured using the <realtimeField> element.

Real-time fields have a memory overhead that is proportional to the number of documents in the index. See Real-Time Updates for more details.

<realtimeField> Attributes

Real-time field attributes are the same as <field> attributes.
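A minimal sketch of a real-time field declaration, assuming it accepts the same attributes as a regular field as stated above. The name stock_level and the integer type are hypothetical; see Real-Time Updates for any restrictions that apply.

```xml
<fields>
  <!-- Declared with realtimeField instead of field; the attributes mirror a field entry -->
  <realtimeField name="stock_level" type="integer" multivalue="true" indexed="true" stored="true"/>
</fields>
```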

 

<udfs> Element

The <udfs> element is a container for all the <udf> declarations, which define User Defined Fields.

If you incrementally add a module using 'createproject -i' to your project, you will need to re-add the <udfs> and <udf> elements into your schema.

<udf> Elements

Defines a User Defined Field. This includes the name of the User Defined Field, the class that provides the implementation, and output type information.

<udf> Attributes

The following attributes can be set as <udf> tag attributes.

Attribute

Type

Description

Default

name

string

The name of the user defined field expression.

<required>

type

enum

The output type for the user defined field expression.
Valid types are: string, date, integer, long, float, double, money, decimal, boolean, point, shape

<required>

class

string

The name of the class that implements UserDefinedFieldEvaluator.

<required>

dateResolution

enum

The output resolution for date type.
Valid resolutions are: milliseconds, seconds, minutes, hours, days

seconds

decimalScale

integer

The output scale for decimal type.

0
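Putting the attributes together, a declaration might look like the sketch below. The UDF names and the com.example classes are hypothetical, and the position of <udfs> directly inside <schema> is an assumption; the classes are expected to implement UserDefinedFieldEvaluator.

```xml
<schema name="default" merge="true">
  <udfs>
    <!-- Decimal output with a scale of 2 -->
    <udf name="discounted_price" type="decimal" decimalScale="2"
         class="com.example.udf.DiscountedPriceEvaluator"/>
    <!-- Date output at day resolution instead of the default seconds -->
    <udf name="first_seen" type="date" dateResolution="days"
         class="com.example.udf.FirstSeenEvaluator"/>
  </udfs>
</schema>
```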

Additional Properties Defined in the Schema

The <schema>, <field> and <realtimeField> elements may contain additional properties that can be accessed by various workflow components. They may include, for example, the type of tokenization to use for a field. Properties are typically defined in the schema, rather than in other configuration files, when they are related to the index in some way and are shared by multiple components.

Field Caches and Memory Usage

Schema field configuration can affect the amount of RAM used by the system.

Field caches contain data that is stored in RAM while the Attivio engine is running. They are used for the following:

  • faceting on a field
  • sorting by a field
  • boosting by a field's value
  • performing inner/outer/allow/deny join on a field

See Configuring Query Caches for more detailed information about configuring when these caches are loaded into memory, and monitoring their memory usage.


Default schema fields

The Attivio schema has a number of default fields.  This section calls out some of these fields that have universal application to Attivio stages.

Field Name

Description

processing.feedback.level

Contains the level at which the processing feedback occurred: DEBUG, WARN, or ERROR.

processing.feedback.message

Contains the specific message associated with the processing feedback.

processing.feedback.component

Contains the component (stage) that provided the processing feedback.

processing.feedback.code

Contains the system error code (e.g., INDEX_WORKFLOW-15), if any, associated with the processing feedback.

 

Processing Feedback

Processing feedback can be generated by any ingestion workflow stage in the system. In general, any debug, warning, or error information associated with the processing of a document can be attached to that document. This allows applications that use the search index to access the processing results for the content. For example, the following (non-exhaustive) list of workflow events will generate processing feedback that is attached to the document:

  • processing timeouts
  • failure to extract content due to file encryption or password protected content
  • data truncation
  • unknown content types
  • I/O errors

Processing feedback fields (those starting with processing.feedback) are associated multi-valued fields. When processing feedback occurs for the document, each field receives data appropriate to the feedback. The value at position x in the multi-value list for one field (e.g., processing.feedback.level) corresponds to the values at the same position x in the other processing feedback fields. For example, if the second value of processing.feedback.level is WARN, the second value of processing.feedback.message holds the message for that warning.