Overview
AIE's service-oriented architecture provides granular performance tuning capabilities that let developers take full advantage of system resources. This flexibility also allows developers to push the system beyond the capabilities of the underlying hardware infrastructure if the configuration is not properly tuned. This section describes how to tune an AIE configuration to provide optimal performance while maintaining operational stability for a given environment.
Required Modules
These features require the inclusion of the memory module when you run createproject to create the project directories. The memory module is part of the default module group, and is therefore usually present in every project.
Introduction
The default AIE configuration enables much of AIE's full range of capabilities. These features can also increase AIE's memory footprint. By default, AIE's configuration is tuned to provide good ingestion rates for moderately sized documents or records in 4GB of JVM heap space.
You can tune AIE's configuration based on the following factors:
- Average document size
- Maximum document size
- Available memory
- Required ingestion rate
- Desired ingestion features (e.g., linguistics processing)
Tuning these parameters can improve ingestion rates and throughput, but can also impact memory usage.
Factors Affecting Memory Usage
The configuration settings that affect memory usage are:
- Document Batch Size - The number of documents held in memory by connectors and the default client before they are sent as a batch to the server.
- Component Queue Size - Indicates the size of the in-memory queue for each component. The queue size controls the number of messages that are held in memory as input to each stage of a workflow.
- Maximum File Size - Some connectors, such as the FileConnector, are configured to reject documents above a certain size.
- Maximum Tokens - Limits the number of tokens that are indexed for a document. Documents with an unbounded number of tokens can dramatically increase memory usage.
- Number of Component Instances - Each component in an AIE workflow can have one or more instances.
Other factors that affect memory usage are:
- Component operations - Certain stages add metadata to a document. For example, tokenization creates a token list containing all tokens from a given document and appends that to the message, increasing the memory footprint of the document significantly.
Default Configuration Settings
Modifying the default settings is NOT recommended. Make modifications only to accommodate very high-speed, large-volume bulk loading of content.
The following table contains AIE default configuration settings. Adjusting these factors can enhance performance.
Architecture | JVM Heap Size | Component Instances | Component Queue size | Batch Size | Message Ordering | Max File Size | Maximum Tokens |
---|---|---|---|---|---|---|---|
64-bit | 4GB or 50% of available memory if available memory is less than 5.33GB | Number of Processing Cores | 3 | 5,000 | ordered commits | 5MB | 10,000 |
Memory Usage Configuration
The maximum memory available to the JVM is controlled by the maxmem attribute for the node. See Shared Configuration for details.
Configuring Search Caches
You can configure search caches to reduce/control memory use for executing queries. See Configuring Query Caches for more information.
Configuring Component Queue Size
AIE's architecture is based on the Staged Event-Driven Architecture (SEDA) pattern. Each component has a work queue which receives messages intended for the component. Component instances pull work from their input queue and forward system messages to the queue of the next component in the workflow. This architecture results in a well-conditioned service environment that ideally keeps the system running at ~80% CPU utilization, which is viewed as optimal for performance in many applications.
The more messages that are stored in a component's input queue, the more memory is consumed during processing. If a component in a workflow is much slower than other components, its associated message queue fills up. This blocks the preceding component, which in turn fills up its input queues. This intended consequence naturally throttles the system by eventually blocking the originator of system load (the client). It is a normal condition for all queues to fill at various times (for instance, during index commits), so the maximum queue size acts as a throttle on system memory consumption.
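The throttling behavior described above can be sketched with a plain bounded queue from the JDK. This is an illustration only, not AIE's implementation: the class name BoundedQueueSketch and the method offerAll are hypothetical, and the non-blocking offer() call stands in for the point where a real sender would block.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Illustration only: a bounded queue standing in for a component input queue. */
final class BoundedQueueSketch {

    /**
     * Offers up to count messages and returns how many were accepted.
     * offer() returns false when the queue is full; a real producer would
     * block at that point until the consumer frees a slot.
     */
    static int offerAll(BlockingQueue<String> queue, int count) {
        int accepted = 0;
        for (int i = 0; i < count; i++) {
            if (!queue.offer("doc-" + i)) {
                break; // queue is full: this is where the sender would be throttled
            }
            accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Capacity 3 mirrors the default queue size of 3 messages.
        BlockingQueue<String> inputQueue = new ArrayBlockingQueue<>(3);
        System.out.println(offerAll(inputQueue, 10)); // prints 3: only three messages fit
        inputQueue.poll();                            // the consumer finishes one message
        System.out.println(offerAll(inputQueue, 10)); // prints 1: one slot was freed
    }
}
```

Because each queue holds at most a fixed number of messages, the memory held in queues is bounded by the queue size multiplied by the size of the largest messages, which is why a small queue size acts as a throttle.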
You can set the number of messages held in a single queue with the message.queueSize property. By default, the <install-dir>\conf\core-app\attivio.core-app.properties file sets this value to 3. The core-app/attivio-base.xml file references this property as follows:
<message-queue size="${message.queueSize}" />
The queue size must be greater than zero. Larger values consume more memory but provide insulation against skew when disparate processing times are encountered on documents.
You can adjust queue sizes for individual components by appending the queue name to the message.queueSize property. The following example sets the queue size for just the indexer workflow:
message.queueSize.avm://indexer=10
The configuration of components also impacts memory usage. See Processor Utilization Tuning for more information.
Using Memory-Capped Component Queues
For applications which require predictable memory utilization, you can configure AIE to use memory-capped SEDA component message queues for ingestion workflows. Memory-capped queues restrict the size of a message queue based on the amount of memory required to store the message rather than the quantity of messages. When memory-capped queues are in use, adding a message to a component input queue will block if the size of the message would cause the queue to exceed its memory capacity. As in the case of standard message queues, blocking the sender results in the desired SEDA client-load blocking behavior. Since query workflows are executed synchronously, they do not participate in memory-capped queue restrictions.
Memory-capped queue configuration is controlled via a new transport definition (see below) and a new transport scheme: mavm. This transport is defined by the memory module's default XML file, module.xml. The megabytesAvailableForQueues property sets the amount of memory, in megabytes, devoted to message queues. This amount is shared among all active memory-capped queues (the queues of all components that are receiving or holding messages), so it caps the total memory the system uses for queuing messages.
<transports>
  <transport class="com.attivio.memory.MemoryCappedTransport">
    <properties>
      <property name="megabytesAvailableForQueues" value="${memory.queues.max.megabytes}"/>
    </properties>
  </transport>
</transports>
The memory.queues.max.megabytes property is set in <project-dir>\conf\properties\memory\memory.properties. The default value is 512 megabytes.
In addition to the transport, a new feature controls whether memory-capped queues are used by default (all in-memory queues become memory-capped) or not. The following bean is commented out in <install-dir>\conf\memory\features.xml. When you uncomment it and create a new project using the memory module, you will find it in <project-dir>\conf\features\memory\ReplaceAvmQueuesFeature.xml:
<ff:features xmlns:ff="http://www.attivio.com/configuration/config"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xmlns:fbase="http://www.attivio.com/configuration/features/base"
             xmlns:f="http://www.attivio.com/configuration/features/core"
             xsi:schemaLocation="http://www.attivio.com/configuration/config http://www.attivio.com/configuration/config.xsd
                                 http://www.attivio.com/configuration/features/base http://www.attivio.com/configuration/features/baseFeatures.xsd
                                 http://www.attivio.com/configuration/features/core http://www.attivio.com/configuration/features/coreFeatures.xsd">
  <f:featurePostProcessor class="com.attivio.memory.config.feature.ReplaceAvmQueuesFeature" />
</ff:features>
Memory-capped queues always allow at least one message regardless of size. This is required to prevent deadlock in message processing. Due to this requirement, in certain cases, the memory consumed by memory-capped queues can exceed the specified maximum.
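The "at least one message" rule can be illustrated with a small sketch. It is not the MemoryCappedTransport implementation: the class MemoryCappedQueueSketch and its methods are hypothetical, message size is approximated by the length of a byte array, and tryAdd returns false at the point where a real queue would block the sender.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustration only: a queue limited by total message bytes, not message count. */
final class MemoryCappedQueueSketch {
    private final long capacityBytes;
    private final Deque<byte[]> messages = new ArrayDeque<>();
    private long usedBytes = 0;

    MemoryCappedQueueSketch(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /**
     * Non-blocking add: returns false where a real queue would block the sender.
     * An empty queue always accepts one message, whatever its size, so that an
     * oversized message cannot stall processing forever.
     */
    synchronized boolean tryAdd(byte[] message) {
        boolean fits = usedBytes + message.length <= capacityBytes;
        if (!fits && !messages.isEmpty()) {
            return false;
        }
        messages.addLast(message);
        usedBytes += message.length;
        return true;
    }

    /** Removes the oldest message, or returns null if the queue is empty. */
    synchronized byte[] poll() {
        byte[] message = messages.pollFirst();
        if (message != null) {
            usedBytes -= message.length;
        }
        return message;
    }

    synchronized long usedBytes() {
        return usedBytes;
    }

    public static void main(String[] args) {
        MemoryCappedQueueSketch queue = new MemoryCappedQueueSketch(1024);
        System.out.println(queue.tryAdd(new byte[4096])); // true: the queue was empty
        System.out.println(queue.tryAdd(new byte[10]));   // false: the cap is already exceeded
        System.out.println(queue.usedBytes());            // 4096, above the 1024-byte cap
    }
}
```

The last line of main shows how a single oversized message can push usage above the configured maximum, which is the case the paragraph above warns about.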
The System Monitoring diagnostics page contains detailed information about the current status of memory-based queues.
Configuring Document Batch Size
For Expert Users Only
Adjusting batch size is for expert users only. A poorly chosen batch size can cause AIE to run out of memory and crash.
Document processing can occur in document batches, meaning that groups of documents can be passed through AIE ingestion workflows as a single message.
The advantage of batching is apparent when there are many small documents flooding the ingestion workflows. Batching these small documents reduces network and messaging overhead, and speeds up document ingestion. Batches can be made too large, however, which can drain system resources and slow down or crash the system. Use this feature with caution.
Ingest large documents individually. The default batch size is 5,000, meaning five thousand documents per message.
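To make the trade-off concrete, the following sketch groups documents into fixed-size batches and estimates how much document data one batch carries. It is an illustration only: BatchingSketch, toBatches, and estimatedBatchBytes are hypothetical names, and the 50 KB average document size is an assumed figure, not an AIE default.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: splitting a document list into fixed-size batches. */
final class BatchingSketch {

    /** Splits documents into consecutive batches of at most batchSize items. */
    static <T> List<List<T>> toBatches(List<T> documents, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < documents.size(); start += batchSize) {
            int end = Math.min(start + batchSize, documents.size());
            batches.add(new ArrayList<>(documents.subList(start, end)));
        }
        return batches;
    }

    /** Rough size of the document data carried by one full batch, in bytes. */
    static long estimatedBatchBytes(int batchSize, long averageDocumentBytes) {
        return (long) batchSize * averageDocumentBytes;
    }

    public static void main(String[] args) {
        List<Integer> documents = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            documents.add(i);
        }
        // 45 documents with a batch size of 20 give batches of 20, 20 and 5.
        System.out.println(toBatches(documents, 20).size()); // prints 3

        // Assumed average of 50 KB (51,200 bytes) per document, default batch of 5,000.
        System.out.println(estimatedBatchBytes(5000, 51200L)); // prints 256000000
    }
}
```

Since each queued message holds a whole batch, this estimate multiplies again by the component queue size, which is why large documents are better ingested individually.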
Depending on how content is fed into AIE there are several different ways to configure the batch size. In the examples below, the batch size is set to 20.
Via Configured Connector
If a configured connector is being used within an AIE instance to feed content, you can configure the batch size using the documentBatchSize property in the connector configuration.
<connector name="myFileConnector">
  <scanner class="com.attivio.connector.FileScanner">
    <property name="startDirectory" value="/opt/test/data" />
  </scanner>
  <feeder class="com.attivio.connector.DirectMessagePublisher">
    <properties>
      <property name="ingestWorkflowName" value="ingest" />
      <property name="documentBatchSize" value="20" />
    </properties>
  </feeder>
</connector>
Via Command-line Connector
When feeding documents via the command line, set the batch size using the --batch-size option.
aie-connect.exe file -A http://localhost:17001 -W ingest --batch-size 20 -d /opt/mydata
Via Content API
If content is being fed via the API using the com.attivio.client.ContentFeeder, the batch size can be set with the setDocumentBatchSize() setter.
Code Sample
ContentFeeder feeder = new DefaultAieClientFactory().createLocalContentFeeder();
feeder.setIngestWorkflow(workflows);
feeder.setDocumentBatchSize(20);
Configuring Message Ordering
Please refer to the main article on Message Ordering for memory utilization issues associated with this feature.
Configuring Max File Size
The AIE com.attivio.connector.FileScanner can specify the maximum size of a file that is sent into the engine for processing. By default, the maxFileSize parameter is 5MB. You can set the FileScanner maxFileSize parameter in configuration or specify it when invoking the FileScanner from the command line.
Via Configured Connector
If a configured connector is being used within an AIE instance to feed content, you can configure the maximum file size using the maxFileSize property in the connector configuration.
<connector name="myFileConnector">
  <scanner class="com.attivio.connector.FileScanner">
    <property name="startDirectory" value="/opt/test/data" />
    <property name="maxFileSize" value="10"/> <!-- Sets Maximum File Size to 10MB -->
  </scanner>
  <feeder class="com.attivio.connector.DirectMessagePublisher">
    <properties>
      <property name="ingestWorkflowName" value="ingest" />
    </properties>
  </feeder>
</connector>
Via Command-line Connector
To set maxFileSize when feeding from the command line, use the --max-file-size option of the connect command.
aie-connect.exe file -A http://localhost:17001 -W ingest --max-file-size 20 -d /opt/mydata
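The size check itself amounts to comparing a file's length with the configured limit. The sketch below is an assumption-laden illustration, not the FileScanner's code: MaxFileSizeSketch and withinLimit are hypothetical names, and it assumes that 1MB means 1,048,576 bytes and that a file exactly at the limit is accepted, neither of which is stated in this document.

```java
/** Illustration only: deciding whether a file is small enough to ingest. */
final class MaxFileSizeSketch {

    /**
     * Returns true if a file of fileSizeBytes is within maxFileSizeMb.
     * Assumptions: 1MB = 1,048,576 bytes, and the limit is inclusive.
     */
    static boolean withinLimit(long fileSizeBytes, int maxFileSizeMb) {
        long limitBytes = (long) maxFileSizeMb * 1024 * 1024;
        return fileSizeBytes <= limitBytes;
    }

    public static void main(String[] args) {
        System.out.println(withinLimit(4_000_000L, 5)); // true: under the 5MB default
        System.out.println(withinLimit(6_000_000L, 5)); // false: would be rejected
    }
}
```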
Configuring Maximum Tokens
By default, AIE tokenizes at most the first 10,000 words (tokens) in a document. This setting prevents large documents from consuming all memory in the system during tokenization. You can increase the maximum number of tokens that are processed by setting the following schema field property:
<field name="text" type="string" tokenize="true" indexed="false" stored="true">
  <properties>
    <property name="index.maxTokens" value="100000" />
    <!-- ... other field properties -->
  </properties>
</field>
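The effect of a token limit can be shown with a minimal sketch that stops tokenizing once the limit is reached, so the token list never grows beyond maxTokens entries. This is not AIE's tokenizer: MaxTokensSketch and tokenize are hypothetical names, and splitting on whitespace is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Illustration only: whitespace tokenization capped at a maximum token count. */
final class MaxTokensSketch {

    /** Returns at most maxTokens whitespace-separated tokens from the start of text. */
    static List<String> tokenize(String text, int maxTokens) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer words = new StringTokenizer(text); // splits on whitespace
        while (words.hasMoreTokens() && tokens.size() < maxTokens) {
            tokens.add(words.nextToken());
        }
        return tokens; // anything after the limit is never added to the list
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a b c d e", 3));  // prints [a, b, c]
        System.out.println(tokenize("a b c d e", 10)); // prints [a, b, c, d, e]
    }
}
```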
HTTP input configuration
Diagnosing Memory Issues
Memory Utilization Statistics
The AIE Administrator contains a link to a Memory Statistics page that shows information such as the free memory, number of full garbage collections, tenured heap size, etc. The PeriodicMemoryLoggingService also logs some of this information, every five minutes by default. You can modify the frequency (in seconds) by setting the memory.logging.frequency property.
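For readers who want comparable figures from their own client or test code, the JDK's standard management beans expose similar data. The sketch below is not the PeriodicMemoryLoggingService: MemoryStatsSketch and snapshot are hypothetical names, and the output format is made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Illustration only: reading heap usage and GC counts from the JDK's MXBeans. */
final class MemoryStatsSketch {
    private static final long BYTES_PER_MB = 1024L * 1024L;

    /** Builds a one-line summary of current heap usage and total GC runs. */
    static String snapshot() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long collections = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            if (count > 0) {
                collections += count;
            }
        }
        // getMax() can be -1 when no maximum is defined; integer division then yields 0.
        return String.format("heapUsedMB=%d heapMaxMB=%d gcCount=%d",
                heap.getUsed() / BYTES_PER_MB, heap.getMax() / BYTES_PER_MB, collections);
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

Calling snapshot() from a scheduled task would approximate periodic logging; the values vary with each run, so no sample output is shown.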
Garbage Collection Statistics
By default, AIE logs JVM garbage collection information. Each time AIE starts, a new log file is created in <install_dir>/logs/gc-YY-MM-DD-HHMMSS.log. You can disable this logging by adding the argument -XX:-PrintGCDetails to the command line when starting AIE.
Disabling garbage collection logging is NOT recommended for new production systems, installations, or configurations. These can introduce new system loads, and the garbage collection logs may contain critical information for debugging performance and stability issues in the running system.