Page tree
Skip to end of metadata
Go to start of metadata

Overview

An Attivio project scales up through partitioning and replication.

  • Partitioning divides a large index into multiple parts, with each part running in parallel (on separate hardware) to speed up ingestion and searching.
  • Replication creates additional copies of the partition set, called rows, so that you can scale to meet your application's query load.


Gliffy Macro Error

You do not have permission to view this diagram.

The Index Feature automatically generates these components based on a higher-level index configuration. The Index Feature coordinates very complex interactions among these nodes based on:

  • The number of desired partitions.
  • The names of the indexing nodes (in blue, above).
  • The names of the searching nodes (in green, above).

For the above system, the Index Feature configuration could be (almost) as simple as this:

<PROJECT_DIR>/conf/features/core/Index.index.xml
<f:index name="index">
  <f:partitionSet size="5"/>
</f:index>

This configuration allocates documents to partitions by hashing on the document ID. If you really wanted to divide the documents alphabetically (as implied by the diagram) you would need to invoke Partition Routing.

Multi-node Semantic Data Catalog

The Semantic Data Catalog can be run in a clustered configuration to allow full SQL and BI tool access. Certain capabilities may be limited in clustered mode. 

View incoming links.

Index Feature

Configuration for an index feature is provided by the <f:index> element in the <PROJECT_DIR>/conf/features/core/Index.<INDEX_NAME>.xml file.

By default, a new Attivio project includes a single index, named index, configured in the <PROJECT_DIR>/conf/features/core/Index.index.xml file.

<f:index> Attributes

AttributeDefault ValueDescription
nameindexThe name of the index. The name of the index is used as a part of all generated component and bean names, as seen on the Attivio Administrator UI System Management > Indexes display. (Note: If you change the name of the index, be sure to change it everywhere it is referenced. Otherwise Attivio will give you errors during startup.)
schemadefaultThe name of the schema to use for this index. The default value refers to the schema in the <PROJECT_DIR>/conf/schema/default.xml file.
profiletrueSet this to false to disable profiling of index processes running in YARN.
It is recommended that you leave this set to true, otherwise it will not be possible to capture profiling snapshots for index processes.

 

<f:index> Elements

ElementDescription
<f:memory>Index Process Memory Configuration
<f:storage>Index Storage Configuration
<f:contentDispatcher>Content Dispatcher Configuration
<f:queryDispatcher>Query Dispatcher Configuration
<f:partitionSet>Partitioning Configuration

<f:writer>

Index Writer Configuration
<f:searchers>Index Searcher Configuration

Child elements for <f:index> must be specified in the order in which they are listed above.

Minimal Example Config

<f:index name="index"/>

Index Process Memory

The <f:memory> element is used to configure memory allocation for index processes in a clustered Attivio project. 

AttributeDefault ValueDescription
size4G

The amount of memory to allocate to each index process running in YARN. Values ending with "G" will be specified in gigabytes (so "4G" is 4 GB). Values ending in "M" will be specified in megabytes. Values ending in "K" will be specified in kilobytes. All other values will be specified as bytes.

diskCacheSize50%The amount of memory to allocate for HDFS disk cache. If specified as a percentage, the disk cache size will be computed based on the configured size minus some overhead.
Alternatively, disk cache size can be specified explicitly, e.g., "2G" for 2 GB.

Example

<f:index name="index">
  <f:memory size="4G" diskCacheSize="50%"/>
</f:index>

Index Storage

The index storage can be configured using the <f:storage> element of the <f:index> feature.

For clustered projects, the index is stored in HDFS. For unclustered projects, the index is stored on the local file system.

Unclustered Storage Properties

The following properties are available for configuring storage for unclustered projects.

Property NameTypeDefaultDescription
pathStringprojectDataDir/indexLocation on disk for storing index
mmapbooleantrueConfigures if memory mapping is used for reading index files. Setting this to false disables all memory mapping.
unmapbooleantrueIf mmap is true, this option will support enabling/disabling ability to use unsafe unmap operations.
mmapTermIndexbooleantrueIf mmap is true, and mmapTermIndex is true, then the Term Index will be memory mapped.
mmapStoredFieldsbooleantrueIf mmap is true, and mmapStoredFields is true, then Stored Fields will be memory mapped.
mmapTermVectorsbooleantrue

If mmap is true, and mmapTermVectors is true, then Term Vectors will be memory mapped.
Term Vectors are used for highlighting returned documents.

mmapDocValuesbooleantrueIf mmap is true, and mmapDocValues is true, then Doc Values will be memory mapped.
Doc Values are used for sorting and other various aggregation functions.
It is recommended that doc values are always memory mapped. 
mmapNormsbooleantrue

If mmap is true, and mmapNorms is true, then field norms will be memory mapped.
Field norms are used for computing a document's score.
It is recommended that norms are always memory mapped. 

mmapLiveDocsbooleantrueIf mmap is true, and mmapLiveDocs is true, then Live Docs will be memory mapped.
Live Docs files contain a list of all documents that have not been deleted for filtering deleted documents from returned results 

Example

<PROJECT_DIR>/conf/features/core/Index.index.xml
<f:index>
  <f:storage>
    <f:property name="path" value="/opt/attivio/disk_alt/data-index" />
  </f:storage>
</f:index>

MMap Example

<f:index>
  <f:storage>
    <f:property name="mmap" value="true" />
    <!-- disable mmap for stored fields/term vectors -->
    <f:property name="mmapStoredFields" value="false"/>
    <f:property name="mmapTermVectors" value="false"/>
  </f:storage>
</f:index>

Clustered Storage Properties

The following properties are available for configuring storage for clustered projects.

Property NameTypeDefaultDescription
totalMemorylongderived

Total memory for caching HDFS index (this replaces the operating system filesystem cache). The default value is derived from the <f:memory> configuration.

Setting a value of "0" disables caching. Configuring this value directly is not recommended.

blockSizeint32KBlock size for cached HDFS reads (must be a power of 2)
bufferSizeint4M

Maximum amount of memory allocated for HDFS block reads

directMemorybooleantrue

If true, direct memory allocations are performed

flushBlockSizelong1MSize of HDFS blocks for flushes (should be small since flushes tend to produce many small files)
mergeBlockSizelongderived Size of HDFS blocks for merges. The default value is derived from HDFS site configuration.
slabSizeint128 M

Size of a slab of memory. (This is the internal size of allocated ByteBuffers)

syncBlockbooleanfalseIf true, blocks will be sync'd to disk when closed

Example

<PROJECT_DIR>/conf/features/core/Index.index.xml
<f:index>
  <f:storage>
    <f:property name="totalMemory" value="17179869184"/> <!-- 16G -->
    <f:property name="directMemory" value="true"/> <!-- use non-heap memory -->
  </f:storage>
</f:index>

Index Writer

The <f:writer> element is used to configure the index writer row. This is the Attivio engine instance which writes out index files as new documents are indexed.

AttributeDefault ValueDescription
searchtrueIf false, index writers will not be used to serve any user queries. If true, the query dispatcher will only query the writer in the event of failover when no dedicated search nodes are available.
defaultZoneNamenullSpecifies the name of the zone to route documents if the document does not explicity specify its zone routing.
logCommitsfalseIf true, commit events will be logged at info level

It is a best practice to configure one or more searcher rows (see below) and to disable search on the writer row, so that queries are not slowed down by the CPU-intensive ingestion tasks.


Index Writer Properties

Advanced settings on the Index Writer can be configured using the <f:property> element on the <f:writer> element.  See  Index Engines for more detail on the possible properties, and  Index Writer Properties for a complete list of available index writer properties.

Example: Set the maximum number of segments produced by optimization to 3 (instead of the default of 5):

<PROJECT_DIR>/conf/features/core/Index.index.xml
<f:index name="index">
  <f:writer>
    <f:property name="optimizeMaxSegments" value="3" />
  </f:writer>
</f:index>

Zones

It is sometimes more efficient to create a project where the index has been subdivided into zones. These zones behave like separate indexes; they can be queried separately or in parallel, creating opportunities for more efficient queries. Zones can be confured on the index writer using the <f:zone> element. For full discussion on zones, see: Zoned Indexing.

<f:zone> Attributes

AttributeDefaultDescription
namenullSpecifies the name of the zone. Required.
namespacenullNamespace for the zone. Namespaces are in general only used internally by Attivio components.
Configuring namespaces for custom defined zones is not recommended.

Zone Routing

The <f:route> element of the <f:zone> element allows configuring a field value based routing to a zone.

For instance, this will typically be used to route documents to a zone based on a table (or similar collection identification) field.

AttributeDescription
fieldField name to use for routing
valueField value to match against.

Example: Route documents with table schema field values of "foo" to the bar zone. All other documents will be routed to the default zone.

<f:index name="index">
  <f:writer defaultZoneName="default">
    <f:zone name="default"/>
    <f:zone name="bar">
      <f:route field="table" value="foo"/>
    </f:zone>
  </f:writer>
</f:index>

Zone Properties

Index Writer properties can be configured on a per-zone basis by using the <f:property> element of the <f:zone> element. See Index Writer Properties for a complete list of available index writer properties.

The Zone Adding Feature

Sometimes it is desirable to add a zone to an index which is configured somewhere else. For example, a module may require its own separate zone in the default index. The following defintion will add a hidden zone called myzone to the index named index:

Add Zone Feature
    <f:addIndexZone index="index">
      <f:zone name="myzone">
        <f:route field="table" value="somevalue"/>
        <f:property name="hidden" value="true"/>
      </f:zone>
    </f:addIndexZone>

 

Index Searchers

The <f:searchers> element is used to configure index searcher rows. This configuration applies to all searcher rows and the writer row (if search="true" and a dedicated searcher is not available or configured).

AttributeDefault ValueDescription
rows0Specify the number of search rows to make available at start up. The rows setting can be dynamically updated on clustered systems using the addrow and removerow commands of the AIE-CLI.
defaultJoinModeremoteSpecify the default mode for resolving join queries in a multi-partition configuration. If local is specified, parent documents will only join to child documents that are on the same node as the parent.

Index Searcher Properties

Advanced settings on the Index Searcher can be configured using the <f:property> element on the <f:searchers> element. See Index Searcher Properties for a complete list of available index searcher properties.

Example: Allow queries to request offsets up to 20000:

<PROJECT_DIR>/conf/features/core/Index.index.xml
<f:index name="index">
  <searchers>
    <f:property name="searchMaxRows" value="20000" />
  </searchers>
</f:index>

Join Configuration

Join execution performance can be explicitly configured to execute locally per-partition, or to execute across all partitions. The default is to execute join queries across all partitions for correctness. This default can be changed, or overridden on a per join key basis. The default join configuration can be changed by setting the defaultJoinMode attribute of the <f:searchers> element. Per-field join execution can be configured using the <f:join> element of <f:searchers>.

Join Modes

mode

description

remote

Join is executed across all partitions in the index. This means that child documents will be joined to parent documents from any index partition.

local

Join is executed locally on each partition. This means that child documents from a given index partition will only be joined to parent documents from the same partition.

When using local joins, it is recommended that you also configure routing of documents to ensure that all documents that should join together will be routed to the same partition.
If this guarantee is not met, then join queries will not return accurate results when running in local mode.

 

<f:join> Attributes

Attribute

Description

primary

Specifies the primary key for a join clause to configure

foreign

Optional: Specifies the foreign key (if any) for the join clause to configure

mode

Specifies the mode for executing joins for this key.

Example: Set default join configuration to execute join queries locally per partition:

<PROJECT_DIR>/conf/features/core/Index.index.xml
<f:index name="index">
  <f:searchers defaultJoinMode="local"/>
</f:index>

Example: Override join execution on a per-key basis:

<PROJECT_DIR>/conf/features/core/Index.index.xml
<f:index name="index">
  <f:searchers defaultJoinMode="remote">
    <f:join primary="primaryKey" foreign="foreignKey" mode="local"/>
  </f:searchers>
</f:index>

Index Partitioning

Most Attivio users index very large collections of data. Trying to hold all that data in a single, monolithic index is not practical. Some indexes are just too large for any one server to handle. Index building, maintenance, and querying all impose loads on the system that are proportional to the size of the index. For all of these reasons, it is advantageous to break up the monolithic index into partitions, and treat each of the partitions as if it were a small independent index. Attivio automatically distributes incoming documents across partitions, and dispatches each query to a set of relevant partitions before combining their partial results into a single response document.

PartitioningBigAndSmall

During content ingestion, each document is sent to a specific partition based on the value of one or more fields. Each partition has a dedicated indexer node that ingests the document and adds it to the local index. Therefore, each partition contains only a fraction of the full index, and is only a fraction of the size of the full index.

Partition before you begin ingesting

Index partitioning is the easiest and most efficient mechanism for increasing system capacity, with one caveat.To redistribute existing index records into a larger number of partitions requires that you discard the existing index, define a new index with more partitions, and then re-ingest all of the content from the beginning. For this reason it is a best practice to create a generous number of partitions at the outset, even though the partitions will be very small at first and might originally all reside on one node. When more hardware capacity is eventually required, the partitions can be migrated to new hardware without requiring re-indexing of content.

Partitioning is configured by adding the <f:partitionSet> element to the Index Feature definition. This element has a size attribute that indicates the number of partitions for the index. Just tell Attivio how many partitions you want and Attivio will set them up.

AttributeDefault ValueDescription
size1The number of partitions to generate.
loadFactor1

The maximum number of index partitions to host in a single process. Valid load factors must be in the range 0 < loadFactor <= size and size must be evenly divisible by loadfactor. Specifying a loadFactor greater than zero will allow oversubscription of hardware. As the index grows, the loadFactor can be reduced in order to distribute the load over more hardware.

NOTE: This setting only applies to clustered systems. The loadFactor can be changed dynamically via the flexindex command in the AIE-CLI.

This is a simple two-partition index:

<PROJECT_DIR>/conf/features/core/Index.index.xml
<f:index name="index">
  <f:partitionSet size="2">
  </f:partitionSet>
</f:index>


How many partitions should you create?

This is a complex question depending on the size and dynamism of your data and your anticipated query load; there is not a "one size fits all" answer. The best course is to engage with Attivio's Professional Services team for project sizing guidelines. Please contact sales@attivio.com to discuss a Professional Services engagement.


Partition Routing

Attivio indexes can route incoming documents to specific partitions based on document properties. Conditional routing is accomplished by adding multiple <f:route> elements to the index declaration. Routing elements act as a series of filters: the first one to apply to a document is used. The list of route elements should end with an empty <f:route/> element to catch documents which do not match any preceding routing element.

The <f:route> element lets us set up a series of conditional routing tests to apply to each incoming document. The first successful match determines the partition where the document will be stored. It also optionally lets us send a document to all partitions.

Attribute

Default Value

Description

field

null

If specified, this field will be used for hash based partitioning of documents. If not specified, documents will be hash partitioned according to the document id.

broadcast

false

If true, this router broadcasts matching documents to all partitions, otherwise the document will be partitioned as normal.

consistent

true

This feature is for systems that use field-based routing. It has no impact on Attivio's default ID-hash partition routing.

The feature addresses the case where a document has been ingested more than once using a field-based routing scheme, and the value of the routing field changed in the meantime. If this happens, the document may be duplicated on multiple partitions. When the document is routed to a partition, this feature attempts to delete all previous copies that might exist on other partitions.

If true, delete messages for the document will be sent to all partitions except the one where the document has just been added. If false, no delete messages will be sent. This property should only be set to false if the partition field value for a document will never change, or if documents will never be updated.

With Attivio's default ID-hash partition routing, delete messages will not be sent (effectively consistent=false).

multiValue

false

If true, all values in the partitioning field will be evaluated to determine the set of destination partitions. For example, if the field contains two field values which hash to different partitions, then both partitions will receive copies of the document. If false, only the first field value is used for partitioning purposes.

The <f:filter>  element allows applying filters to an <f:route> element .

AttributeDefault ValueDescription
fieldnullSpecifies the document field to use for filtering. Required.
valuenullSpecify the value that the specified field must have in order to match this filter.
If not specified, this filter will require that the specified field exists (with any value). 
accepttrueIf false, this filter is inverted.

The partition element allows specifying which partitions to route across. If no partition elements are configured for a route, routing will occur across all partitions.

AttributeDefault ValueDescription
indexnullSpecify the index number for a partition. Required.

Example: Route documents using hashfield as the partition field:

<PROJECT_DIR>/conf/features/core/Index.index.xml
<f:index>
  <f:partitionSet size="2">
    <!-- Default route - hash by value of "hashfield" -->
    <f:route field="hashfield"/>
  </f:partitionSet> 
</f:index>

Example: Route documents with table schema field values of "acl" to all index partitions:

<PROJECT_DIR>/conf/features/core/Index.index.xml
<f:index>
  <f:partitionSet size="2">
    <f:route broadcast="true">
      <f:filter field="table" value="acl"/>
    </f:route>

    <!-- Default Route -->
    <f:route/>
  </f:partitionSet>
</f:index>

Example: Route documents with table schema field values of "user" with different properties than those routed by the default route:

<PROJECT_DIR>/conf/features/core/Index.index.xml
<f:index>
  <f:partitionSet size="2">
    <f:route multiValue="true" consistent="true">
      <f:filter field="table" value="user"/>
    </f:route>

    <!-- Default Route -->
    <f:route consistent="false"/>
  </f:partitionSet>
</f:index>

Example: Field value-based partition routing:

<PROJECT_DIR>/conf/features/core/Index.index.xml
<f:index name="index">
  <f:partitionSet size="4">
    <!-- route documents with table=country to partition 0 -->
    <f:route>
      <f:filter field="table" value="country"/>
      <f:partition index="0"/>
    </f:route>

    <!-- route documents with table=city to partition 1 -->
    <f:route>
      <f:filter field="table" value="city"/>
      <f:partition index="1"/>
    </f:route>

    <!-- route documents with table=news to partition 2 -->
    <f:route>
      <f:filter field="table" value="news"/>
      <f:partition index="2"/>
    </f:route>

    <!-- route documents with table=medal to partition 3 -->
    <f:route>
      <f:filter field="table" value="medal"/>
      <f:partition index="3"/>
    </f:route>
  </f:partitionSet>
</f:index>


Content and Query Dispatchers

Sometimes it is useful to specify which workflow the content dispatchers (the entry point for documents to the index) or query dispatchers (the entry point for queries to the index) will be placed in. In the example below, the content dispatcher is placed in a workflow named customIndexer and the query dispatcher is placed in a workflow named customSearcher. Multiple content and query dispatcher specifications may be provided as needed.

<PROJECT_DIR>/conf/features/core/Index.index.xml
<f:index>
  <f:contentDispatcher workflow="customIndexer"/>
  <f:queryDispatcher workflow="customSearcher"/>
</f:index>

Content and Query Dispatcher Elements

The <f:contentDispatcher> element and the <f:queryDispatcher> element share the following elements:

ElementDefault ValueDescription
workflowindexer (Content Dispatcher),
searcher (Query Dispatcher)

The workflow into which the dispatcher component should be inserted.

Since the default values match the default Attivio workflows, this attribute is not commonly used. However, if multiple index features are in use, this attribute becomes necessary in order to place the dispatchers for each unique index in different workflows so that content and queries are properly directed.

positionfirstWhere in the workflow to place the dispatcher component. May be first, last, before, or after. If set to before or after, must also set relative-component (see below).
relative-componentnullWhen position is set to before or after, the dispatcher component is placed before or after this component. Not used if position is set to first or last.
skip-if-existsfalseIf true, skip insertion of the dispatcher component if it already exists at the specified place in the destination workflow. This setting prevents double-insertion of components that are independently created by multiple modules.


Query Dispatcher Properties

Query dispatcher properties can be configured on the <f:queryDispatcher> element.

Property Name

Type

Default Value

Description

connectTimeout

integer

1000

Connection timeout (in milliseconds) for establishing connection to engine

maxConnectTimeoutinteger60000

Maximum engine connection timeout (in milliseconds) for establishing connection to an engine. This connection timeout will be used if there is only one available engine for a partition in order to increase query stability when no failover engines are available.

For best system stability, this should be set to be longer than the longest running Full GC seen by engines.

Example: Configure a 5-second connection timeout for the Query Dispatcher:

<PROJECT_DIR>/conf/features/core/Index.index.xml
 <f:index>
  <f:queryDispatcher>
    <f:property name="connectTimeout" value="5000"/>
  </f:queryDispatcher>
  ...
</f:index>

Content Dispatcher Properties

Content dispatcher properties can be configured on the <f:contentDispatcher> element.

Property Name

Type

Default Value

Description

connectTimeout

integer

60000

Connection timeout (in milliseconds) for establishing connection to engines.
If session connection times out, the operation will be retried until successful, or until the system is stopped.

Example: Configure a 2-minute connection timeout for the Content Dispatcher:

<PROJECT_DIR>/conf/features/core/Index.index.xml
 <f:index>
  <f:contentDispatcher>
    <f:property name="connectTimeout" value="120000"/>
  </f:contentDispatcher>
  ...
</f:index>