Overview

AIE's service-oriented architecture provides granular performance tuning capabilities that let developers take full advantage of system resources. This flexibility also means that an improperly tuned configuration can push the system beyond the capabilities of the underlying hardware. This section describes how to tune an AIE configuration for optimal performance while maintaining operational stability in a given environment.

Required Modules

These features require the inclusion of the memory module when you run createproject to create the project directories. The memory module is part of the default module group, and is therefore usually present in every project.

Introduction

The default AIE configuration enables many of AIE's capabilities, and these features affect AIE's memory footprint. By default, the configuration is tuned to provide good ingestion rates for moderately sized documents or records in 4GB of JVM heap space.

You can tune AIE's configuration based on the following factors:

  • Average document size
  • Maximum document size
  • Available memory
  • Required ingestion rate
  • Desired ingestion features (e.g., linguistics processing)

Tuning these parameters can improve ingestion rates and throughput, but can also impact memory usage.

Factors Affecting Memory Usage

The configuration settings that affect memory usage are:

  • Document Batch Size - The number of documents held in memory by connectors and the default client before they are sent as a batch to the server.
  • Component Queue Size - Indicates the size of the in-memory queue for each component. The queue size controls the number of messages that are held in memory as input to each stage of a workflow.
  • Maximum File Size - Some connectors such as the FileConnector are configured to reject documents above a certain size.
  • Maximum Tokens - Limits the number of tokens that are indexed for a document. Documents with an unbounded number of tokens can dramatically increase memory usage.
  • Number of Component Instances - Each component in an AIE workflow can have one or more instances.

Other factors that affect memory usage are:

  • Component operations - Certain stages add metadata to a document. For example, tokenization creates a token list containing all tokens from a given document and appends that to the message, increasing the memory footprint of the document significantly.

Default Configuration Settings

Modifying the default settings is NOT recommended. Modifications should only be made to accommodate very high-speed, large-volume bulk loading of content.

The following table contains AIE default configuration settings. Adjusting these factors can enhance performance.

| Setting | Default |
| --- | --- |
| Architecture | 64-bit |
| JVM Heap Size | 4GB or 50% of available memory if available memory is less than 5.33GB |
| Component Instances | Number of Processing Cores |
| Component Queue Size | 3 |
| Batch Size | 5,000 |
| Message Ordering | ordered commits |
| Max File Size | 5MB |
| Maximum Tokens | 10,000 |

Memory Usage Configuration

The maximum memory available to the JVM is controlled by the maxmem attribute for the node. See Shared Configuration for details.

Configuring Search Caches

You can configure search caches to reduce or control the memory used for executing queries. See Configuring Query Caches for more information.

Configuring Component Queue Size

AIE's architecture is based on the Staged Event-Driven Architecture (SEDA) pattern. Each component has a work queue which receives messages intended for the component. Component instances pull work from their input queue and forward system messages to the queue of the next component in the workflow. This architecture results in a well-conditioned service environment that ideally keeps the system running at ~80% CPU utilization, which is viewed as optimal for performance in many applications.

The more messages stored in a component's input queue, the more memory is consumed during processing. If a component in a workflow is much slower than the others, its message queue fills up. This blocks the preceding component, whose input queue fills up in turn. This intended consequence naturally throttles the system by eventually blocking the originator of the load (the client). It is normal for all queues to fill at various times (for instance, during index commits), so the maximum queue size acts as a throttle on system memory consumption.
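
The blocking behavior described above can be sketched with a plain bounded queue from the JDK. This is an illustration only, not AIE code: the class and method names are hypothetical, and the non-blocking offer() call stands in for a sender that would otherwise block until a slot frees up.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: a bounded queue standing in for one component's input queue.
final class QueueBackpressureSketch {

    /** Offers `count` messages with no consumer running; returns how many were accepted. */
    static int fillWithoutConsumer(int queueSize, int count) {
        BlockingQueue<String> inputQueue = new ArrayBlockingQueue<>(queueSize);
        int accepted = 0;
        for (int i = 0; i < count; i++) {
            // offer() returns false when the queue is full. A real sender would
            // call put() instead and block here, which is what throttles the client.
            if (inputQueue.offer("message-" + i)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // With a queue size of 3, only 3 of 10 messages fit until a consumer drains the queue.
        System.out.println(fillWithoutConsumer(3, 10));
    }
}
```

Because the queue never holds more than its configured size, the memory devoted to waiting messages is bounded by the queue size times the size of the largest message.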

You can set the number of messages held in a single queue with the message.queueSize property. By default the <install-dir>\conf\core-app\attivio.core-app.properties file sets this value to 3. The core-app/attivio-base.xml file references this property as follows:

<message-queue size="${message.queueSize}" />

The queue size must be greater than zero. Larger values consume more memory but provide insulation against skew when disparate processing times are encountered on documents.

You can adjust queue sizes for individual components by appending the queue name to the message.queueSize property. The following example sets the queue size for just the indexer workflow: message.queueSize.avm://indexer=10
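
As a rough mental model of the per-queue override, the sketch below looks up a queue-specific key first and falls back to the global message.queueSize value. This is an assumption about the lookup order made for illustration, not AIE's actual implementation; the class, the method, and the avm://other queue name are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: resolves a queue size from a per-queue key, else the global default.
final class QueueSizeLookupSketch {

    static int queueSizeFor(Map<String, String> props, String queueName) {
        String specific = props.get("message.queueSize." + queueName);
        String value = (specific != null) ? specific : props.get("message.queueSize");
        if (value == null) {
            throw new IllegalStateException("message.queueSize is not set");
        }
        int size = Integer.parseInt(value.trim());
        if (size <= 0) {
            // The queue size must be greater than zero.
            throw new IllegalArgumentException("queue size must be greater than zero: " + size);
        }
        return size;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("message.queueSize", "3");
        props.put("message.queueSize.avm://indexer", "10");
        System.out.println(queueSizeFor(props, "avm://indexer")); // queue-specific value
        System.out.println(queueSizeFor(props, "avm://other"));   // falls back to the default
    }
}
```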

The configuration of components also impacts memory usage. See Processor Utilization Tuning for more information.

Using Memory-Capped Component Queues

For applications which require predictable memory utilization, you can configure AIE to use memory-capped SEDA component message queues for ingestion workflows. Memory-capped queues restrict the size of a message queue based on the amount of memory required to store the message rather than the quantity of messages. When memory-capped queues are in use, adding a message to a component input queue will block if the size of the message would cause the queue to exceed its memory capacity. As in the case of standard message queues, blocking the sender results in the desired SEDA client-load blocking behavior. Since query workflows are executed synchronously, they do not participate in memory-capped queue restrictions.

Memory-capped queues are configured through a transport definition (see below) and a transport scheme, mavm. The transport is defined in module.xml, the default XML file of the memory module. The megabytesAvailableForQueues property sets the amount of memory, in megabytes, devoted to message queues. This amount is shared among all active memory-capped queues, that is, the queues of all components that are receiving or holding messages, so it bounds the memory the system uses for queuing messages.

<install-dir>\conf\memory\module.xml
  <transports>
    <transport class="com.attivio.memory.MemoryCappedTransport">
       <properties>
         <property name="megabytesAvailableForQueues" value="${memory.queues.max.megabytes}"/>
       </properties>
    </transport>
  </transports>

The memory.queues.max.megabytes property is set in <project-dir>\conf\properties\memory\memory.properties. The default value is 512 megabytes.

In addition to the transport, a feature controls whether memory-capped queues are used by default, in which case all in-memory queues become memory-capped. The following bean is commented out in <install-dir>\conf\memory\features.xml. When you uncomment it and create a new project using the memory module, you will find it in <project-dir>\conf\features\memory\ReplaceAvmQueuesFeature.xml:

<project-dir>\conf\features\memory\ReplaceAvmQueuesFeature.xml
<ff:features xmlns:ff="http://www.attivio.com/configuration/config"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xmlns:fbase="http://www.attivio.com/configuration/features/base"
             xmlns:f="http://www.attivio.com/configuration/features/core"
             xsi:schemaLocation="http://www.attivio.com/configuration/config http://www.attivio.com/configuration/config.xsd http://www.attivio.com/configuration/features/base http://www.attivio.com/configuration/features/baseFeatures.xsd http://www.attivio.com/configuration/features/core http://www.attivio.com/configuration/features/coreFeatures.xsd">

        <f:featurePostProcessor class="com.attivio.memory.config.feature.ReplaceAvmQueuesFeature" />
</ff:features>

Memory-capped queues always allow at least one message regardless of size. This is required to prevent deadlock in message processing. Due to this requirement, in certain cases, the memory consumed by memory-capped queues can exceed the specified maximum.
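
The admission rule can be sketched as follows. This is a minimal single-queue model, not the MemoryCappedTransport implementation: the class name is hypothetical, a byte array stands in for a message, and tryAdd() returns false where a real sender would block. It also ignores the sharing of the budget across several queues.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: a queue capped by total message bytes rather than message count.
final class MemoryCappedQueueSketch {
    private final long capacityBytes;
    private final Deque<byte[]> messages = new ArrayDeque<>();
    private long usedBytes = 0;

    MemoryCappedQueueSketch(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /** Returns true if the message was accepted; false means a real sender would block. */
    synchronized boolean tryAdd(byte[] message) {
        boolean fits = usedBytes + message.length <= capacityBytes;
        // An empty queue always accepts one message, whatever its size, to avoid deadlock.
        if (!fits && !messages.isEmpty()) {
            return false;
        }
        messages.addLast(message);
        usedBytes += message.length;
        return true;
    }

    /** Removes the oldest message and releases its bytes; returns null if the queue is empty. */
    synchronized byte[] poll() {
        byte[] message = messages.pollFirst();
        if (message != null) {
            usedBytes -= message.length;
        }
        return message;
    }

    synchronized long usedBytes() {
        return usedBytes;
    }
}
```

With a 100-byte cap, an empty queue still accepts a 150-byte message, which is how usage can temporarily exceed the configured maximum.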

The System Monitoring diagnostics page contains detailed information about the current status of memory-based queues.

Configuring Document Batch Size

For Expert Users Only

Adjusting batch size is for expert users only. A poorly chosen batch size can cause AIE to run out of memory and crash.

Document processing can occur in document batches, meaning that groups of documents pass through AIE ingestion workflows as a single message.

The advantage of batching is apparent when many small documents flood the ingestion workflows. Batching these small documents reduces network and messaging overhead and speeds up document ingestion. Batches that are too large, however, can drain system resources and slow down or crash the system. Use this feature with caution.

Ingest large documents individually. The default batch size is 5,000, meaning five thousand documents per message.
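
To make the effect of the batch size concrete, the sketch below splits a list of documents into batches, each of which would travel as one message. It is a generic illustration and not the connector's or client's actual batching code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: groups documents into fixed-size batches.
final class DocumentBatchingSketch {

    static <T> List<List<T>> toBatches(List<T> documents, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < documents.size(); start += batchSize) {
            int end = Math.min(start + batchSize, documents.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(documents.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            docs.add("doc-" + i);
        }
        // 45 documents with a batch size of 20 produce batches of 20, 20, and 5.
        for (List<String> batch : toBatches(docs, 20)) {
            System.out.println(batch.size());
        }
    }
}
```

Each batch is held in memory as a unit, so the memory cost of one message grows with both the batch size and the size of the documents in it.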

Depending on how content is fed into AIE, there are several ways to configure the batch size. In the examples below, the batch size is set to 20.

Via Configured Connector

If a configured connector is being used within an AIE instance to feed content, you can configure the batch size using the documentBatchSize property in the connector configuration.

<connector name="myFileConnector" >
    <scanner class="com.attivio.connector.FileScanner">
      <property name="startDirectory" value="/opt/test/data" />
    </scanner>
    <feeder class="com.attivio.connector.DirectMessagePublisher">
      <properties>
        <property name="ingestWorkflowName" value="ingest" />
        <property name="documentBatchSize" value="20" />
      </properties>
    </feeder>
</connector>

Via Command-line Connector

When feeding documents via the command line, set the batch size using the --batch-size option.

aie-connect.exe file -A http://localhost:17001 -W ingest --batch-size 20 -d /opt/mydata

Via Content API

If content is being fed via the API using the com.attivio.client.ContentFeeder, the batch size can be set with the setDocumentBatchSize() setter.

Code Sample

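// Create a feeder; `workflows` is defined elsewhere and names the target ingest workflow(s).
ContentFeeder feeder = new DefaultAieClientFactory().createLocalContentFeeder();
feeder.setIngestWorkflow(workflows);
// Send documents to the server in batches of 20.
feeder.setDocumentBatchSize(20);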

Configuring Message Ordering

Please refer to the main article on Message Ordering for memory utilization issues associated with this feature.

Configuring Max File Size

The AIE com.attivio.connector.FileScanner can limit the maximum size of a file that is sent into the engine for processing. By default, the maxFileSize parameter is 5MB. You can set the FileScanner maxFileSize parameter in configuration or specify it when invoking the FileScanner from the command line.

Via Configured Connector

If a configured connector is being used within an AIE instance to feed content, the maxFileSize can be configured using the maxFileSize property in the connector configuration.

<connector name="myFileConnector">
    <scanner class="com.attivio.connector.FileScanner">
      <property name="startDirectory" value="/opt/test/data" />
      <property name="maxFileSize" value="10"/> <!-- Sets Maximum File Size to 10MB -->
    </scanner>
    <feeder class="com.attivio.connector.DirectMessagePublisher">
      <properties>
        <property name="ingestWorkflowName" value="ingest" />
      </properties>
    </feeder>
</connector>

Via Command-line Connector

To set the maximum file size when feeding documents via the command line, use the --max-file-size option.

aie-connect.exe file -A http://localhost:17001 -W ingest --max-file-size 20 -d /opt/mydata
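
The size check itself amounts to a comparison against the configured limit. The sketch below is an illustration under two assumptions that the text above does not confirm: that one megabyte is counted as 1024 * 1024 bytes, and that a file exactly at the limit is accepted. The class and method names are hypothetical and this is not FileScanner code.

```java
// Illustrative only: decides whether a file is small enough to be sent for processing.
final class MaxFileSizeSketch {
    // Assumption: 1MB = 1024 * 1024 bytes.
    private static final long BYTES_PER_MB = 1024L * 1024L;

    /** Returns true if the file size does not exceed the limit given in megabytes. */
    static boolean accepts(long fileSizeBytes, int maxFileSizeMb) {
        // Assumption: a file exactly at the limit is accepted.
        return fileSizeBytes <= maxFileSizeMb * BYTES_PER_MB;
    }

    public static void main(String[] args) {
        System.out.println(accepts(4L * 1024 * 1024, 5)); // under the 5MB default
        System.out.println(accepts(6L * 1024 * 1024, 5)); // over the 5MB default
    }
}
```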

Configuring Maximum Tokens

By default, AIE tokenizes at most the first 10,000 words (tokens) in a document. This setting prevents large documents from consuming all memory in the system during tokenization. You can increase the maximum number of tokens processed by setting the following schema field property:

<field name="text" type="string" tokenize="true" indexed="false" stored="true" >
  <properties>
    <property name="index.maxTokens" value="100000" />
    <!-- ...  other field properties -->
  </properties>
</field>
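
The effect of a token limit can be illustrated with a simple whitespace tokenizer that stops once the limit is reached. This is only a sketch: AIE's tokenizer is more sophisticated, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative only: whitespace tokenization that stops at a maximum token count.
final class MaxTokensSketch {

    static List<String> tokenize(String text, int maxTokens) {
        List<String> tokens = new ArrayList<>();
        // StringTokenizer walks the text lazily, so the remainder of a large
        // document is never turned into tokens once the limit is reached.
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens() && tokens.size() < maxTokens) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Only the first three words are kept.
        System.out.println(tokenize("the quick brown fox jumps", 3));
    }
}
```

The token list attached to a document therefore never grows beyond the limit, which is what keeps very large documents from dominating memory during tokenization.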

HTTP input configuration

Main Article

Diagnosing Memory Issues

Memory Utilization Statistics

The AIE Administrator contains a link to a Memory Statistics page that shows information such as free memory, the number of full garbage collections, and tenured heap size. The PeriodicMemoryLoggingService also logs some of this information, by default every five minutes. You can modify the frequency (in seconds) by setting the memory.logging.frequency property.
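
Similar figures can be read from any JVM through the standard java.lang.management API. The sketch below is not the PeriodicMemoryLoggingService; it is a hypothetical stand-alone class that shows the kind of values involved, and the output format is an invention for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Illustrative only: reads heap usage and garbage collection counts from the running JVM.
final class MemoryStatsSketch {
    private static final long BYTES_PER_MB = 1024L * 1024L;

    static String snapshot() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long gcCount = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report a count
            if (count > 0) {
                gcCount += count;
            }
        }
        // getMax() returns -1 when the maximum is undefined; the division then yields 0.
        return String.format("heapUsedMB=%d heapCommittedMB=%d heapMaxMB=%d gcCount=%d",
                heap.getUsed() / BYTES_PER_MB,
                heap.getCommitted() / BYTES_PER_MB,
                heap.getMax() / BYTES_PER_MB,
                gcCount);
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

Calling snapshot() from a scheduled task would give a periodic log comparable in spirit to the five-minute default described above.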

Garbage Collection Statistics

By default, AIE logs JVM garbage collection information. Each time AIE starts, a new log file is created in <install_dir>/logs/gc-YY-MM-DD-HHMMSS.log. You can disable this logging by adding the argument -XX:-PrintGCDetails to the command line when starting AIE.

Disabling garbage collection logging is NOT recommended for new production systems, installations, or configurations. These can present new system loads, and the garbage collection logs may contain information that is critical for debugging performance and stability issues in the running system.