An Attivio project scales up through partitioning and replication.
- Partitioning divides a large index into multiple parts, with each part running in parallel (on separate hardware) to speed up ingestion and searching.
- Replication creates additional copies of the partition set, called rows, so that you can scale to meet your application's query load.
The Index Feature automatically generates these components based on a higher-level index configuration. The Index Feature coordinates very complex interactions among these nodes based on:
- The number of desired partitions.
- The names of the indexing nodes (in blue, above).
- The names of the searching nodes (in green, above).
For the above system, the Index Feature configuration could be (almost) as simple as this:
This configuration allocates documents to partitions by hashing on the document ID. If you really wanted to divide the documents alphabetically (as implied by the diagram) you would need to invoke Partition Routing.
Be sure to see the Index Feature Detailed Diagrams.
View incoming links.
Configuration for an index feature is provided by the
<f:index> element in the
By default, a new Attivio project includes a single index, named index, configured in the
|index||The name of the index. The name of the index is used as a part of all generated component and bean names, as seen on the Attivio Administrator UI System Management > Indexes display. (Note: If you change the name of the index, be sure to change it everywhere it is referenced. Otherwise Attivio will give you errors during startup.)|
|default||The name of the schema to use for this index. The default value refers to the schema in the |
|true||Set this to false to disable profiling of index processes running in YARN.|
It is recommended that you leave this set to true, otherwise it will not be possible to capture profiling snapshots for index processes.
|Index Process Memory Configuration|
|Index Storage Configuration|
|Content Dispatcher Configuration|
|Query Dispatcher Configuration|
|Index Writer Configuration|
|Index Searcher Configuration|
Child elements for
<f:index> must be specified in the order in which they are listed above.
Minimal Example Config
Index Process Memory
<f:memory> element is used to configure memory allocation for index processes in a clustered Attivio project.
The amount of memory to allocate to each index process running in YARN. Values ending with "G" will be specified in gigabytes (so "4G" is 4 GB). Values ending in "M" will be specified in megabytes. Values ending in "K" will be specified in kilobytes. All other values will be specified as bytes.
|50%||The amount of memory to allocate for HDFS disk cache. If specified as a percentage, the disk cache size will be computed based on the configured size minus some overhead. |
Alternatively, disk cache size can be specified explicitly, e.g., "2G" for 2 GB.
The index storage can be configured using the
<f:storage> element of the
For clustered projects, the index is stored in HDFS. For unclustered projects, the index is stored on the local file system.
Unclustered Storage Properties
The following properties are available for configuring storage for unclustered projects.
|String||projectDataDir/index||Location on disk for storing index|
|boolean||true||Configures if memory mapping is used for reading index files. Setting this to false disables all memory mapping.|
Doc Values are used for sorting and other various aggregation functions.
It is recommended that doc values are always memory mapped.
Live Docs files contain a list of all documents that have not been deleted for filtering deleted documents from returned results
Clustered Storage Properties
The following properties are available for configuring storage for clustered projects.
Total memory for caching HDFS index (this replaces the operating system filesystem cache). The default value is derived from the
Setting a value of "0" disables caching. Configuring this value directly is not recommended.
|int||32K||Block size for cached HDFS reads (must be a power of 2)|
Maximum amount of memory allocated for HDFS block reads
If true, direct memory allocations are performed
|long||1M||Size of HDFS blocks for flushes (should be small since flushes tend to produce many small files)|
|long||derived||Size of HDFS blocks for merges. The default value is derived from HDFS site configuration.|
Size of a slab of memory. (This is the internal size of allocated ByteBuffers)
|boolean||false||If true, blocks will be sync'd to disk when closed|
<f:writer> element is used to configure the index writer row. This is the Attivio engine instance which writes out index files as new documents are indexed.
|true||If false, index writers will not be used to serve any user queries. If true, the query dispatcher will only query the writer in the event of failover when no dedicated search nodes are available.|
|null||Specifies the name of the zone to route documents if the document does not explicity specify its zone routing.|
|false||If true, commit events will be logged at info level|
It is a best practice to configure one or more searcher rows (see below) and to disable search on the writer row, so that queries are not slowed down by the CPU-intensive ingestion tasks.
Index Writer Properties
Advanced settings on the Index Writer can be configured using the
<f:property> element on the
<f:writer> element. See Index Engines for more detail on the possible properties, and Index Writer Properties for a complete list of available index writer properties.
Example: Set the maximum number of segments produced by optimization to 3 (instead of the default of 5):
It is sometimes more efficient to create a project where the index has been subdivided into zones. These zones behave like separate indexes; they can be queried separately or in parallel, creating opportunities for more efficient queries. Zones can be confured on the index writer using the
<f:zone> element. For full discussion on zones, see: Zoned Indexing.
|null||Specifies the name of the zone. Required.|
|null||Namespace for the zone. Namespaces are in general only used internally by Attivio components. |
Configuring namespaces for custom defined zones is not recommended.
<f:route> element of the
<f:zone> element allows configuring a field value based routing to a zone.
For instance, this will typically be used to route documents to a zone based on a table (or similar collection identification) field.
|Field name to use for routing|
|Field value to match against.|
Example: Route documents with table schema field values of "foo" to the bar zone. All other documents will be routed to the default zone.
Index Writer properties can be configured on a per-zone basis by using the
<f:property> element of the
<f:zone> element. See Index Writer Properties for a complete list of available index writer properties.
The Zone Adding Feature
Sometimes it is desirable to add a zone to an index which is configured somewhere else. For example, a module may require its own separate zone in the default index. The following defintion will add a hidden zone called myzone to the index named index:
<f:searchers> element is used to configure index searcher rows. This configuration applies to all searcher rows and the writer row (if
search="true" and a dedicated searcher is not available or configured).
|0||Specify the number of search rows to make available at start up. The rows setting can be dynamically updated on clustered systems using the addrow and removerow commands of the AIE-CLI.|
|remote||Specify the default mode for resolving join queries in a multi-partition configuration. If local is specified, parent documents will only join to child documents that are on the same node as the parent.|
Index Searcher Properties
Advanced settings on the Index Searcher can be configured using the
<f:property> element on the
<f:searchers> element. See Index Searcher Properties for a complete list of available index searcher properties.
Example: Allow queries to request offsets up to 20000:
Join execution performance can be explicitly configured to execute locally per-partition, or to execute across all partitions. The default is to execute join queries across all partitions for correctness. This default can be changed, or overridden on a per join key basis. The default join configuration can be changed by setting the
defaultJoinMode attribute of the
<f:searchers> element. Per-field join execution can be configured using the
<f:join> element of
Join is executed across all partitions in the index. This means that child documents will be joined to parent documents from any index partition.
Join is executed locally on each partition. This means that child documents from a given index partition will only be joined to parent documents from the same partition.
When using local joins, it is recommended that you also configure routing of documents to ensure that all documents that should join together will be routed to the same partition.
Specifies the primary key for a join clause to configure
Optional: Specifies the foreign key (if any) for the join clause to configure
Specifies the mode for executing joins for this key.
Example: Set default join configuration to execute join queries locally per partition:
Example: Override join execution on a per-key basis:
Most Attivio users index very large collections of data. Trying to hold all that data in a single, monolithic index is not practical. Some indexes are just too large for any one server to handle. Index building, maintenance, and querying all impose loads on the system that are proportional to the size of the index. For all of these reasons, it is advantageous to break up the monolithic index into partitions, and treat each of the partitions as if it were a small independent index. Attivio automatically distributes incoming documents across partitions, and dispatches each query to a set of relevant partitions before combining their partial results into a single response document.
During content ingestion, each document is sent to a specific partition based on the value of one or more fields. Each partition has a dedicated indexer node that ingests the document and adds it to the local index. Therefore, each partition contains only a fraction of the full index, and is only a fraction of the size of the full index.
Partition before you begin ingesting
Index partitioning is the easiest and most efficient mechanism for increasing system capacity, with one caveat.To redistribute existing index records into a larger number of partitions requires that you discard the existing index, define a new index with more partitions, and then re-ingest all of the content from the beginning. For this reason it is a best practice to create a generous number of partitions at the outset, even though the partitions will be very small at first and might originally all reside on one node. When more hardware capacity is eventually required, the partitions can be migrated to new hardware without requiring re-indexing of content.
Partitioning is configured by adding the
<f:partitionSet> element to the Index Feature definition. This element has a
size attribute that indicates the number of partitions for the index. Just tell Attivio how many partitions you want and Attivio will set them up.
|1||The number of partitions to generate.|
The maximum number of index partitions to host in a single process. Valid load factors must be in the range 0 <
NOTE: This setting only applies to clustered systems. The loadFactor can be changed dynamically via the
This is a simple two-partition index:
How many partitions should you create?
This is a complex question depending on the size and dynamism of your data and your anticipated query load; there is not a "one size fits all" answer. The best course is to engage with Attivio's Professional Services team for project sizing guidelines. Please contact email@example.com to discuss a Professional Services engagement.
Attivio indexes can route incoming documents to specific partitions based on document properties. Conditional routing is accomplished by adding multiple
<f:route> elements to the index declaration. Routing elements act as a series of filters: the first one to apply to a document is used. The list of route elements should end with an empty
<f:route/> element to catch documents which do not match any preceding routing element.
<f:route> element lets us set up a series of conditional routing tests to apply to each incoming document. The first successful match determines the partition where the document will be stored. It also optionally lets us send a document to all partitions.
If specified, this field will be used for hash based partitioning of documents. If not specified, documents will be hash partitioned according to the document id.
If true, this router broadcasts matching documents to all partitions, otherwise the document will be partitioned as normal.
This feature is for systems that use field-based routing. It has no impact on Attivio's default ID-hash partition routing.
The feature addresses the case where a document has been ingested more than once using a field-based routing scheme, and the value of the routing field changed in the meantime. If this happens, the document may be duplicated on multiple partitions. When the document is routed to a partition, this feature attempts to delete all previous copies that might exist on other partitions.
If true, delete messages for the document will be sent to all partitions except the one where the document has just been added. If false, no delete messages will be sent. This property should only be set to false if the partition field value for a document will never change, or if documents will never be updated.
With Attivio's default ID-hash partition routing, delete messages will not be sent (effectively consistent=false).
If true, all values in the partitioning
<f:filter> element allows applying filters to an
<f:route> element .
|null||Specifies the document field to use for filtering. Required.|
|null||Specify the value that the specified field must have in order to match this filter.|
If not specified, this filter will require that the specified field exists (with any value).
|true||If false, this filter is inverted.|
The partition element allows specifying which partitions to route across. If no partition elements are configured for a route, routing will occur across all partitions.
|null||Specify the index number for a partition. Required.|
Example: Route documents using hashfield as the partition field:
Example: Route documents with table schema field values of "acl" to all index partitions:
Example: Route documents with table schema field values of "user" with different properties than those routed by the default route:
Example: Field value-based partition routing:
Content and Query Dispatchers
Sometimes it is useful to specify which workflow the content dispatchers (the entry point for documents to the index) or query dispatchers (the entry point for queries to the index) will be placed in. In the example below, the content dispatcher is placed in a workflow named customIndexer and the query dispatcher is placed in a workflow named customSearcher. Multiple content and query dispatcher specifications may be provided as needed.
Content and Query Dispatcher Elements
<f:contentDispatcher> element and the
<f:queryDispatcher> element share the following elements:
|indexer (Content Dispatcher),|
searcher (Query Dispatcher)
The workflow into which the dispatcher component should be inserted.
Since the default values match the default Attivio workflows, this attribute is not commonly used. However, if multiple index features are in use, this attribute becomes necessary in order to place the dispatchers for each unique index in different workflows so that content and queries are properly directed.
|first||Where in the workflow to place the dispatcher component. May be first, last, before, or after. If set to before or after, must also set |
|false||If true, skip insertion of the dispatcher component if it already exists at the specified place in the destination workflow. This setting prevents double-insertion of components that are independently created by multiple modules.|
Query Dispatcher Properties
Query dispatcher properties can be configured on the <f:queryDispatcher> element.
Connection timeout (in milliseconds) for establishing connection to engine
Maximum engine connection timeout (in milliseconds) for establishing connection to an engine. This connection timeout will be used if there is only one available engine for a partition in order to increase query stability when no failover engines are available.
For best system stability, this should be set to be longer than the longest running Full GC seen by engines.
Example: Configure a 5-second connection timeout for the Query Dispatcher:
Content Dispatcher Properties
Content dispatcher properties can be configured on the
Connection timeout (in milliseconds) for establishing connection to engines.
Example: Configure a 2-minute connection timeout for the Content Dispatcher: