A typical AIE project uses a single index. Though you can partition and/or duplicate that index across many AIE instances, it is still one index – updated, committed, and optimized as a single entity.
It is sometimes more efficient to subdivide the index into zones. Each document is placed in one zone, which is specified in the document. By default, all queries are executed against all zones in parallel, but they can be limited to one or more zones. This creates opportunities for more efficient queries and more efficient indexing.
This topic discusses the benefits and drawbacks of zoned indexing, the difference between zones and partitions, and the configuration of zoned indexes. It also provides a Tutorial Example.
This is an expert-level feature.
In some circumstances, using zones increases query performance. In other circumstances, zones can slow down queries. We recommend discussing your index design with Attivio's Support Department before creating projects with zoned indexes.
What is a Zoned Index?
A zoned index is a single index divided into zones, each containing a subset of the indexed documents. Logically there is still only one index, but each zone uses a separate set of index files. This lets you update each zone separately, while querying all zones together as one index or querying one or more zones individually. Additionally:
- Non-zoned indexes can be thought of as a zoned index with a single zone, called the default zone.
- Zoned indexes also have a default zone, to contain the records not explicitly directed to one of the other zones.
- A single index can include multiple zones.
Benefits of Zoned Indexing
In certain cases, dividing an index into zones can:
- Reduce cache sizes
- Reduce indexing time and index size
- Accelerate query times for some join queries
Each of these situations is discussed below.
Reducing Cache Size
If an index contains fields used for sorting or joining, and these fields appear only in a well-defined subset of records (such as specific tables or record types), putting such records in their own zones reduces the size of sort and join caches. This is because:
- The size of join and sort caches is proportional to the number of records.
- Caches are created per-segment. If no record in a segment contains a given field, that segment will not have a cache for that field.
In a common document-level security model, there is one security record containing an access control list (ACL) per data record. This model doubles the number of records in the index, and as a result, all sort and join caches double in size. Putting the security records in their own zone alleviates this issue.
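The arithmetic behind this can be sketched with a toy model. This is not AIE code: the function, zone names, and field names are hypothetical, and the model simply assumes one cache slot per record in every zone where at least one record carries the field.

```python
def cache_entries(zones, field):
    """Count cache slots for `field` under a simplified model.

    Assumption: a zone allocates one slot per record if any of its
    records contains the field, and no cache at all otherwise.
    """
    total = 0
    for records in zones.values():
        if any(field in record for record in records):
            total += len(records)
    return total


# 1000 data records with a sortable "date" field, plus one ACL record each.
data = [{"title": f"doc{i}", "date": i} for i in range(1000)]
acls = [{"acl": f"group{i}"} for i in range(1000)]

single_zone = {"default": data + acls}
two_zones = {"default": data, "security": acls}

print(cache_entries(single_zone, "date"))  # 2000
print(cache_entries(two_zones, "date"))    # 1000
```

In this model, moving the ACL records to their own zone halves the cache for the date field, because the security zone never builds a cache for a field it does not contain.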
Metadata vs Data Records
A common use case indexes two records for every document: a metadata record containing a number of small fields (for example, name, date, address) and a data record containing the text extracted from the source document. As with the security scenario, this can double memory usage for caching, and placing the metadata records in their own zone alleviates the issue.
Reducing Indexing Time/Size
The cases in which zoned indexing can reduce both indexing time and index size are described below.
Real-Time Fields and Partial Updating
Both real-time fields and partial updating of records are zone attributes that slow indexing and require more disk space. Typically, only some of the records in the index require these features. For example, a document's metadata (which requires partial updating) may be split off from the main record into real-time fields or into a separate, partially updatable record. If such records share a zone with data records that do not require these features, all records incur the time penalty. Placing the metadata records in their own zone alleviates the issue.
Sometimes large amounts of data require indexing over time. This typically causes the index to grow very large and to require increasing indexing resources over time.
One way to reduce the required indexing resources is to divide the index into silos (for example, one per month), so that all current indexing activity occurs within one silo at a time. You can place each silo into a separate partition, so only one partition ingests documents at any one time. But querying across multiple partitions is significantly slower, especially for non-collocated joins.
Zoned indexing – using one zone per silo – achieves the same reduction in the resources needed for indexing without hindering query performance.
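As a rough illustration of the one-zone-per-silo idea, the sketch below maps a document date to a monthly zone name. The naming scheme and the function are assumptions made for this example only. In AIE the zones themselves must still be defined in the index configuration.

```python
from collections import Counter
from datetime import date


def silo_zone(day: date) -> str:
    """Return a hypothetical monthly zone name such as 'silo_2014_03'."""
    return f"silo_{day.year:04d}_{day.month:02d}"


# Documents arriving during March all land in the same zone,
# so only that zone's index files are being written to.
arrivals = [date(2014, 3, 1), date(2014, 3, 17), date(2014, 3, 31), date(2014, 4, 1)]
per_zone = Counter(silo_zone(day) for day in arrivals)

print(per_zone["silo_2014_03"])  # 3
print(per_zone["silo_2014_04"])  # 1
```

Because each month adds a zone, a scheme like this needs a retention or consolidation plan so that the total number of zones stays small.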
Accelerating Query Times
Join queries may execute faster if one or more of the index zones do not contain any of the join keys used. This is not a common case, but sometimes it can have a significant impact on performance.
Drawbacks of Zoned Indexing
Zoned indexing is efficient when used skillfully, but there is also a downside. Zoned indexing can slow down query execution due to the overhead of dispatching the query to multiple zones and then assembling the matches into a single response document. The more zones there are, the slower the queries become. The impact of a handful of zones is negligible, but dividing an index into more than 20 or so zones is not recommended.
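The dispatch-and-assemble step can be pictured with the simplified model below. It is not how AIE is implemented. It only assumes that each zone returns its own hits already sorted by descending score, and that the results must then be merged into one ranked list. All names and scores are made up for illustration.

```python
import heapq
from itertools import islice


def merge_zone_results(per_zone_hits, limit):
    """Merge per-zone hit lists (each sorted by descending score) into one list."""
    merged = heapq.merge(
        *per_zone_hits.values(),
        key=lambda hit: hit[0],  # compare hits by score
        reverse=True,            # inputs are sorted highest score first
    )
    return list(islice(merged, limit))


per_zone_hits = {
    "static": [(0.9, "c1"), (0.4, "c2")],
    "dynamic": [(0.7, "n1"), (0.2, "n2")],
    "default": [],
}

print(merge_zone_results(per_zone_hits, 3))
# [(0.9, 'c1'), (0.7, 'n1'), (0.4, 'c2')]
```

Every additional zone adds one more result list to request and merge, which is the kind of per-zone overhead that grows with the zone count.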
Why Split Documents into Separate Zones?
Look for these opportunities to apply zoned indexing:
- The system contains distinct classes of documents (countries, medals, news stories, cities):
  - Classes that sort, facet, or join on different fields: Placing each class in its own zone reduces memory usage and accelerates commits.
  - Classes that are frequently searched together using the same fields: Placing these classes in the same zone accelerates searches and reduces the index size.
- Documents that require partial updates vs. documents that are replaced in their entirety: Placing the former in separate zones increases the indexing speed and reduces the resources used for indexing the latter.
Zones vs. Partitions
- Zones separate different types of content to minimize cache reloading during ingestion, reduce the memory footprint of searches that only interact with one zone, and improve performance.
- Index partitions assist in scaling to large numbers of documents, and also aid in parallelization of query execution.
- In a partitioned index, all partitions contain the same zones. As a result, zones can be viewed logically as a top-level construct, affecting all partitions and all replicas of the partitions.
If the project uses both partitions and zones, each partition contains all of the zones.
Configuring a Zoned Index
Zones are configured via the Index Feature, in <project-dir>\conf\features\core\Index.index.xml.
See Index Writer Configuration for syntax for configuring zones.
The following example shows the XML syntax for configuring zoned-index features. Each feature is described in the paragraphs that follow.
Default Implied Zone Configuration
If zones are not configured at all, the default implied zone configuration is in effect. The implied default zone behaves as if it had the following configuration.
Therefore, if you don't configure any zones, there is always one zone named default. However, if you configure zones, you must explicitly configure the default zone in addition to the other zones.
Setting the .zone Field
In a zoned project, incoming documents have an associated virtual field called .zone.
It is possible to set the value of .zone programmatically using the IngestDocument.setZone() method. This approach is not recommended because the indexer will set the .zone value anyway, and if the indexer's value does not match the one you set programmatically, AIE will issue an error and drop the document.
Note that .zone can be useful when querying. See the discussion later on this page.
- Removing zones from the configuration does not automatically delete index files on disk.
- Renaming zones is not permitted.
- Zones may be added to the system over time.
- You must restart AIE for changes to take effect.
Committing a Zoned Index
Updating Documents and Fields
By using zoned indexing, you can declare all fields in a zone's index as individually updatable, as if each one of them participated in the Real-Time Updates feature. The new mechanism, however, is quite different and avoids some of the performance penalties associated with Real-Time Updates.
See Index Writer Properties for more information.
How to Update Individual Fields
Documents in a zone with field granularity can be updated using PARTIAL-mode documents as described on the Real-Time Updates page. Unlike real-time updates, however, these PARTIAL-mode updates will not take effect on an index refresh; you must commit the index (or at least the zone in question) before your updates will take effect.
The factbook project in the Quick Start Tutorial is a good example for demonstrating zoned indexing. The factbook example has four feeds. Three of them (country, medals, cities) load static records that change or update very infrequently. The list of Olympic medals needs updating every four years. The list of cities needs updating with census data about every ten years. New countries sometimes appear, invalidating countless maps and globes, but it doesn't happen often. The news feed, however, likely updates at least once a day, and possibly more often. Each time this feed reloads, there are new RSS articles to ingest, and each round of ingestion ends with committing the index. This example separates the static records from the dynamic ones by using two zones.
In the tutorial below, you set up these two zones, load data into them, and learn to write queries that are zone-aware.
This tutorial presumes that you have completed the Quick Start exercise and understand its data, connectors, and query characteristics.
To work through the zone example described here, run createproject to create a new factbook project, then configure the zones by editing <project-dir>\conf\features\core\Index.index.xml before you run the project for the first time.
Configure the Zones
Edit <project-dir>\conf\features\core\Index.index.xml. Look for the index feature element of the configuration (<f:index>... </f:index>). Delete that element and substitute this one:
The writer element configures the index writer. The defaultZoneName attribute must be the name of a defined zone, which will be the zone where unzoned records are sent. You can direct unzoned records to any one of your zones. You could set defaultZoneName="static", for instance. In this example you create an explicit default zone to receive any unzoned records that may appear.
- The default zone is the explicit zone for unzoned records.
- The static zone uses the default updateGranularity of document. This means that any updates to this zone occur as entire documents. This is the normal, default behavior for AIE indexes.
- The dynamic zone is defined with updateGranularity set to field. This means you expect to update individual fields rather than whole documents. This isn't strictly necessary, but it lets you do dynamic updates of individual fields in the document record, as if correcting an error in a news article. (Field zones have some special query characteristics that this example demonstrates.)
The route elements read the value of a document's table field and route the document to the appropriate zone. In this case, when the table field contains "country", "city", or "medal", the document goes to the static zone. If the table field says "news", the document goes to the dynamic zone. (The table field is explicitly set by the connectors that load the four types of data in the factbook example.)
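The routing decision described above can be modeled in a few lines. This is an illustrative sketch only, not the AIE configuration syntax or API: the table and zone names come from the factbook example, while the dictionary, the function, and the use of a plain dict for a document are assumptions.

```python
# Table value -> zone name, mirroring the routes described for the factbook example.
ROUTES = {
    "country": "static",
    "city": "static",
    "medal": "static",
    "news": "dynamic",
}
DEFAULT_ZONE = "default"  # plays the role of defaultZoneName


def route(document):
    """Pick a zone from the document's table field; unrouted documents go to the default zone."""
    return ROUTES.get(document.get("table"), DEFAULT_ZONE)


print(route({"table": "city", "title": "Boston"}))  # static
print(route({"table": "news", "title": "Headline"}))  # dynamic
print(route({"title": "No table field"}))  # default
```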
That is all it takes to set up zones in the factbook demo. Save the file.
Create the Project
Run createproject to create a new factbook project, as in this Windows command:
Start AIE using the AIE Agent and its Command-Line Interface (CLI):
Run the Agent in a Command Window:
Run the Command-Line Interface in a second Command Window. Note that the CLI is invoked for a specific project.
To run the project, use the start all command in the Command-Line Interface:
Examine the Index Files
After starting AIE, go to the <data-agent>\projects\<project-name>\default\data\data-local\index\index subdirectory and view the directory structure below that point. You'll see separate index files for the three zones (default, static and dynamic).
The zones are all one index, and you can query them all at the same time. However, since the zones are implemented as separate sets of index files, you can also query them separately. Doing so can reduce memory requirements for some queries.
Loading the Feeds
Load the feeds in the usual way. AIE dispatches documents to different zones, based (in this example) on the value of the document's table field. There is nothing you need to do except run the connectors.
Zoned indexes do offer some interesting query extensions, and one or two limitations. These are shown in the following example queries.
Using the .zone Field in Queries
You can use the .zone field in queries to restrict a query to a specific zone.
If you mention a zone in the query, AIE attempts to dispatch the query in the smartest possible way. This might mean sending the query to that one zone only, which can improve performance. The Simple Query Language and Advanced Query Language both recognize the .zone field, so you can restrict their queries to records in a specific zone.
For example, this query returns all records from the static zone:
This Simple Query Language query returns all records from the static zone that contain the word "diamond":
In the Advanced Query Language that is:
Both of these queries are dispatched to the static zone only.
Zoned indexing requires some adjustments in how AIE handles document ID fields in IngestDocuments. In unzoned documents, the document ID is the value in the docid field. With zoned documents, however, the primary key of a document is a (zonename, docid) tuple. As such, it is technically possible for more than one document to have the same value in the docid field. However, the (zonename, docid) pair is guaranteed to be unique.
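A small model makes the composite key concrete. The dictionary below stands in for the index and is purely illustrative. It is not an AIE data structure, and the helper name is hypothetical.

```python
index = {}


def add_document(zone, docid, fields):
    """Store a document under its (zonename, docid) primary key, replacing any previous version."""
    index[(zone, docid)] = fields


# The same docid can exist once in each zone without a conflict.
add_document("static", "42", {"title": "Canada"})
add_document("dynamic", "42", {"title": "Morning headlines"})
print(len(index))  # 2

# Re-adding the same (zone, docid) pair replaces that document only.
add_document("static", "42", {"title": "Canada (updated)"})
print(len(index))  # 2
print(index[("dynamic", "42")]["title"])  # Morning headlines
```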
IngestDocument Methods for Zoned Indexing
The IngestDocument has two zone methods:
- setZone(String) - set the name of the zone the document will be placed in.
- getZone() - get the name of the zone the document is in.
Furthermore, the get/setField() methods can handle the .zone field. For instance, if setField() is called with the field being ".zone", then the internal zone name attribute would be populated.
This allows the zone name to be manipulated by all standard ingest transformers.
Virtual Field .zone
The virtual field ".zone" is:
- returnable (stored)
This field is protected from having multiple values set. Trying to add multiple values to this field will result in a runtime exception.
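The behavior described in this section can be pictured with the toy Python class below. It is not the IngestDocument class and does not reproduce its Java API: the class, its method names, and the use of ValueError in place of the runtime exception are all assumptions made for illustration.

```python
class SketchDocument:
    """Hypothetical stand-in for a document with a single-valued virtual .zone field."""

    def __init__(self, docid):
        self.docid = docid
        self.zone = None   # internal zone name attribute
        self.fields = {}   # ordinary multi-valued fields

    def set_field(self, name, *values):
        if name == ".zone":
            # The virtual field accepts exactly one value.
            if len(values) != 1:
                raise ValueError(".zone accepts exactly one value")
            self.zone = values[0]
        else:
            self.fields[name] = list(values)

    def get_field(self, name):
        if name == ".zone":
            return self.zone
        return self.fields.get(name)


doc = SketchDocument("42")
doc.set_field("title", "Canada")
doc.set_field(".zone", "static")  # populates the zone attribute
print(doc.zone)  # static

try:
    doc.set_field(".zone", "static", "dynamic")
except ValueError as error:
    print(error)  # .zone accepts exactly one value
```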
Deleting from a Zoned Index
If you are using Zoned Indexing, note that the Delete Document Connector (delete by document ID number) operates on the default zone only. It cannot reach documents in non-default zones.
The Delete By Query Connector, however, uses a query to locate the target documents. The query searches all zones by default, and can be adjusted to operate on specific zones if needed.
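The difference between the two connectors can be sketched with a toy index keyed by (zonename, docid). Neither function is an AIE API. Both are hypothetical models of the behavior described above, with a Python predicate standing in for the query.

```python
def delete_by_id(index, docid, default_zone="default"):
    """Model of delete-by-ID: only the default zone is consulted."""
    return index.pop((default_zone, docid), None) is not None


def delete_by_query(index, predicate, zones=None):
    """Model of delete-by-query: matches in all zones, or only in the listed zones."""
    doomed = [
        key for key, fields in index.items()
        if (zones is None or key[0] in zones) and predicate(fields)
    ]
    for key in doomed:
        del index[key]
    return len(doomed)


index = {
    ("default", "1"): {"table": "misc"},
    ("static", "1"): {"table": "city"},
    ("dynamic", "2"): {"table": "news"},
}

print(delete_by_id(index, "1"))   # True: removes ("default", "1") only
print(("static", "1") in index)   # True: the non-default zone was not touched
print(delete_by_query(index, lambda f: f["table"] == "city"))  # 1
print(sorted(index))              # [('dynamic', '2')]
```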