
Overview

Clustered (multi-node) Attivio projects use Apache™ Hadoop®, along with YARN, the Hadoop File System (HDFS), Apache ZooKeeper™, and Apache HBase™, to create a scalable, highly available Attivio platform configuration.

This page demonstrates how to set up your Attivio project to run on an external Hadoop cluster.

Before continuing, verify that the steps on Configure the Hadoop Cluster have been followed to prepare the Hadoop cluster for Attivio project creation and deployment.

 


Architecture

When running Attivio with an external Hadoop instance, there are usually two distinct types of nodes: Hadoop cluster nodes and Attivio cluster (also known as Gateway or Edge) nodes. The Hadoop cluster nodes run YARN, HDFS, ZooKeeper, and HBase. The Attivio Index runs on a subset of these nodes, managed by YARN. The Attivio cluster nodes are non-Hadoop nodes that run all other Attivio processes and communicate with the Hadoop nodes.

Although Attivio nodes can run on the Hadoop nodes, this is usually not recommended, as the processes would compete for memory, disk, and CPU resources.

What follows is a simplified view of the Attivio and Hadoop clusters to help visualize the different running processes. The placement and number of nodes will vary based on project requirements, as will the number of nodes running YARN, ZooKeeper, HBase, and the Attivio Index.

 

AIE Clustered Overview

  

Setting Up the Attivio Cluster (the system where Attivio is installed and runs)

 

Install Attivio Software

The first step is to install the Attivio software on all Attivio cluster nodes as described on the Install Attivio page.

If your Hadoop cluster is configured with Kerberos security, you must add the Java Cryptography Extension (JCE) Unlimited Strength policy files to the Attivio installation. To do this, follow the instructions at Java Cryptography Extension Installation for Hadoop Clusters with Kerberos.

Verify the Linux Configuration

If you have not already done so on this machine, verify that the Linux machine meets Attivio requirements and passes the Linux Checker script, as described in the System Requirements - Linux Settings section.

Verify the Hadoop Configuration

If you have not already done so on the Hadoop cluster, verify your Hadoop configuration by running the checkHadoop program. Note that this program cannot check all Attivio requirements for Hadoop, so it is important to first verify that your settings are correct based on the Hadoop Requirements section before continuing here.

Reminder: The steps at Configure the Hadoop Cluster must be followed before continuing, including uploading cluster configuration via the aieuploadclusterinfo program.

 

To run the checkHadoop program, follow these steps:

  1. Determine your ZooKeeper hostnames and ports. For this example, we will assume ZooKeeper is running at example.com on port 2181.
  2. Create a properties file locally, named something like checkHadoop.properties. This properties file can be empty for some configurations. However, if any of the following situations applies, you should provide a properties file with appropriate values as specified in the Optional Properties section:

    • your Hadoop Cluster has Kerberos enabled

    • your user has limited namespace creation privileges and you are not specifying a project on the checkHadoop command

  3. Run the checkHadoop program:

    <attivio-installation-directory>/bin/aie-exec checkHadoop -z example.com:2181 -p checkHadoop.properties

If all goes well, the checkHadoop output will look something like:

 

Checking HDFS .............
Checking HDFS connection                [OK]
Checking HDFS directory create          [OK]
Checking HDFS file create               [OK]
Checking HDFS file append               [OK]
Checking HDFS replication               [OK]
Checking HDFS file read                 [OK]

Checking HBase .............
Checking HBase Namespace create         [OK]
Checking HBase Table create             [OK]

Checking Slider .............
Checking Slider list                    [OK]
Checking Slider validate user           [OK]
Checking Slider read and write to HDFS  [OK]

 

Create the Attivio Clustered Project

Create The Project

For complete information on creating a project, see the steps described on the Create a New Project page. A quick way to create a very simple project is to run the createproject command as follows:

Create a simple project named project1
<attivio-installation-directory>/bin/createproject -n project1

Modify the Configuration

Configure ZooKeepers

Attivio coordinates projects through one or more ZooKeeper instances (known as configuration servers in previous versions of Attivio). Configuration servers monitor the status of the Attivio project nodes and coordinate changes in the Attivio project's configuration, keeping all nodes in harmony.

In a clustered (multi-node) project, Attivio leverages the ZooKeeper instance in the Hadoop cluster, and the list of configuration servers is specified using the zookeepers element in topology-nodes.xml. A highly available production system requires at least three configuration servers to provide fault-tolerant recovery.

<zookeepers>
  <zookeeper host="host1" baseport="16980"/>
  <zookeeper host="host2" baseport="16980"/>
  <zookeeper host="host3" baseport="16980"/>
</zookeepers>  

Modify configuration.xml

Edit <project-dir>/conf/configuration.xml in a text editor and set externalCluster="true":

<configuration projectName="Factbook" externalCluster="true"> 


Modify the Index Feature

No changes are required for the default Index configuration, but if a non-default Index configuration is desired, modify the <project-dir>/conf/features/core/Index.index.xml feature to specify the number of writers, searchers and partitions desired. See Configure the Index for full documentation on configuring this feature.

An example of configuring the index to use 4 GB of memory and to have 2 search rows, with a single writer (that does not also act as a searcher) is as follows:

 

Sample Index Configuration
<f:index enabled="true" name="index" profile="true">
    <f:memory size="4G"/>
    <f:writer logCommits="true" nodeset="ingestion" search="false"/>
    <f:searchers rows="2"/>
</f:index>

 

Memory Considerations

YARN will spawn an Attivio Index writer process/container for every partition configured. YARN will also spawn an Attivio Index searcher process/container for every partition and row configured. It is important that there be enough memory available on the Hadoop cluster to handle this Index configuration and that YARN be configured to allocate the containers.

The full amount of memory required by Attivio on the cluster is determined by the Hadoop components, the Attivio index processes, and any Attivio modules. If the cluster does not have enough memory to launch a process, YARN will silently wait until enough resources become available.

To calculate the amount of memory required for the Hadoop cluster, sum up the following:

  • Hadoop component usage: 
    • HBase, HDFS, YARN, and ZooKeeper.
  • Attivio Index processes:
    • Calculate the amount of memory your project will require for the Attivio index on the Hadoop cluster with the following formula:

      Memory required
       Memory Required = (index_memory_size * number_of_partitions * (number_of_searcher_rows + 1)) + app_master_size

where:

      • index_memory_size defaults to 4 GB, but can be set with the f:memory element of the index feature's configuration,
      • number_of_partitions defaults to 1, but can be set with the f:partitionSet element of the index feature's configuration,
      • number_of_searcher_rows is set in the f:searchers element of the index feature's configuration,
      • app_master_size should be assumed to be 2 GB, but is configurable via the slider.am.memory property in the <project-dir>/conf/properties/core-app/attivio.core-app.properties file.
    • For example, a configuration with 2 partitions and 2 searcher rows, using the default index memory size of 4 GB, will require 26 GB for the Attivio components running on Hadoop:

      (4 GB * 2 partitions * (2 searcher rows + 1 writer row)) + 2 GB application master = 26 GB total
    • See Configure the Index for more information on index memory size, partitions, and searchers.
  • AIE modules:
    • Business Center — if installed, this module will require an additional 4 GB of memory for a separate index named abc-index (with index_memory_size=4, number_of_partitions=1, and number_of_searcher_rows=0).

Once the total memory required is calculated:

  • Set yarn.nodemanager.resource.memory-mb (the total memory available for the index in YARN) to the calculated total value, expressed in MB.
    • Example: For the 26 GB index described above, set yarn.nodemanager.resource.memory-mb=26624 (26 GB).
  • Set yarn.scheduler.maximum-allocation-mb (the maximum memory allowed for a single index container) to the larger of index_memory_size or app_master_size plus 1–2 GB overhead, expressed in MB.
    • Example: For the index described above, set yarn.scheduler.maximum-allocation-mb=5120 (5 GB); this is the index_memory_size of 4 GB plus 1 GB overhead.
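The memory arithmetic above can be sketched as a small calculation. This is an illustrative sketch only; the class and method names are hypothetical and not part of Attivio:

```java
public class IndexMemoryCalc {

    // Memory Required = (index_memory_size * number_of_partitions
    //                    * (number_of_searcher_rows + 1)) + app_master_size
    // All sizes are in GB.
    static int requiredMemoryGb(int indexMemoryGb, int partitions,
                                int searcherRows, int appMasterGb) {
        return indexMemoryGb * partitions * (searcherRows + 1) + appMasterGb;
    }

    public static void main(String[] args) {
        // The example from this page: 4 GB index memory, 2 partitions,
        // 2 searcher rows, 2 GB application master => 26 GB total.
        int totalGb = requiredMemoryGb(4, 2, 2, 2);
        // yarn.nodemanager.resource.memory-mb is expressed in MB.
        System.out.println("yarn.nodemanager.resource.memory-mb=" + totalGb * 1024);
        // prints yarn.nodemanager.resource.memory-mb=26624
    }
}
```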

Containers

The example above will result in four containers in total: two searchers, one writer, and one application master (the Slider "AppMaster"). If we had specified a partition set size of 2, then there would have been four searchers plus two writers plus one application master, for a total of seven containers.
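The container counts in these examples can be checked with the same kind of arithmetic. A hypothetical sketch (the class and method names are not part of Attivio):

```java
public class ContainerCount {

    // YARN spawns one writer per partition, one searcher per partition
    // per searcher row, plus one Slider application master.
    static int totalContainers(int partitions, int searcherRows) {
        return partitions + partitions * searcherRows + 1;
    }

    public static void main(String[] args) {
        // 1 partition, 2 searcher rows: 1 writer + 2 searchers + 1 AppMaster
        System.out.println(totalContainers(1, 2)); // prints 4
        // 2 partitions, 2 searcher rows: 2 writers + 4 searchers + 1 AppMaster
        System.out.println(totalContainers(2, 2)); // prints 7
    }
}
```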

Set Project Properties Needed in attivio.core-app.properties

A property file named attivio.core-app.properties is created in the project under <project-dir>/conf/properties/core-app. In this file you must set hadoop.cluster.java_home to the location of a Java JDK 1.8 installation on the Hadoop cluster nodes.

Property: hadoop.cluster.java_home=/usr/java/jdk1.8.0_60

Description: Java JDK 1.8 or greater is required on all Hadoop nodes.

Additional properties may need to be set, based on the explanation in Optional Properties below.

 

CPU considerations

A YARN cluster is composed of host machines. Hosts provide memory and CPU resources. A vcore, or virtual core, is a usage share of a host CPU. YARN will spawn an Attivio Index writer process/container for every partition configured. YARN will also spawn an Attivio Index searcher process/container for every partition and row configured. By default, 1 vcore will be requested for each of these containers. It is important that there be enough CPU resources available on the Hadoop cluster to handle this Index configuration and that YARN be configured to allocate the containers.

The total number of vcores required by Attivio on the cluster is determined by the number of writers and searchers, which is based on the number of indexes and the number of partitions in each index. If the cluster does not have enough vcores to launch a process, YARN will silently wait until enough resources become available.

It is important to set the yarn.vcores property so that for each container spawned in YARN, an appropriate number of vcores will be requested. This is a global setting, applied to all containers for both writers and searchers, so you should select the maximum vcore setting required for any single container and apply this across all containers.

To calculate the number of vcores required for a container, sum up the following:

Required vcores = loadfactor + 1 (for the master container) 

For example, a writer for an index with a loadfactor of 4 suggests a yarn.vcores value of 5.
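As an illustration, assuming yarn.vcores is set alongside the other properties in attivio.core-app.properties (an assumption to verify for your release), the setting for a loadfactor of 4 would be:

```
# Hypothetical example: loadfactor of 4, plus 1 for the master container
yarn.vcores=5
```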

Property: yarn.vcores

Value(s): The number of vcores requested for each container spawned by YARN.

Performance Tip

Increasing the number of vcores allocated per YARN container may yield better ingestion and/or query performance. There are other factors to consider, such as whether CGroups are enabled in YARN. The yarn.vcores property can be leveraged to request more than 1 vcore per YARN container.

Attivio generally recommends that only one Attivio index process be placed on any one node. Otherwise, you should increase your load factor to put more partitions in one process. Consult with Attivio for recommendations for your specific application.

Index Writer Locks

Each index partition has a single writer container, which runs in YARN and is responsible for making edits to the partition (adding/deleting documents). If the writer process dies unexpectedly, the index's application master starts a new writer on a node with available capacity. The index partition, however, remains locked by the failed writer, and ingestion will eventually back up and stop until the lock is removed. These locks automatically expire and are removed, freeing the partition to be claimed by the new writer.

The default time-to-live (TTL) of these writer locks is 1 hour. This value can be overridden by setting the following property:

Property: attivio.index.writer.ttl

Value(s): The time-to-live for index writer locks, set in milliseconds. Default is 3600000 (1 hour).
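For example, to shorten the lock TTL to 30 minutes, the property could be overridden as follows (a sketch following the property-file pattern used elsewhere on this page):

```
# In <project-dir>/conf/properties/core-app/attivio.core-app.properties
# 30 minutes, in milliseconds (default is 3600000, i.e. 1 hour)
attivio.index.writer.ttl=1800000
```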

Cloudera CDH Properties

If your Attivio project is to connect to a Cloudera CDH external Hadoop cluster, you must set these two additional properties in the <PROJECT_DIR>/conf/properties/core-app/attivio.core-app.properties file:

Property: hadoop.adapter

Value(s): The external Hadoop-library module to load. The default value is hadoop-3.x-hbase-2.x for a cluster running Hadoop 3.0.3 and HBase 2.1.4.

Set this property's value to hadoop-cdh5 if your external Hadoop cluster runs Cloudera CDH 5.16.1, or to hadoop-cdh6 if it runs Cloudera CDH 6.2.

Set this property's value to hadoop-2.x-hbase-1.x for legacy projects with clusters running Hadoop 2.7.7 and HBase 1.2.12.

Available in: Attivio Platform 5.6.2 and later releases.

Property: yarn.aa.placement.racks

Value(s): A comma-separated list (e.g., /bostondc,/seattledc,/tokyodc) of the rack names configured for your external Hadoop cluster nodes on which index writers and searchers are allowed to run. Each rack name must include the leading slash.

You can find the names of your Hadoop cluster's racks in the Cloudera Manager UI; from the main page, select the Hosts > All Hosts menu entry, then expand the Rack entry in the left-hand Filters menu to see a list of rack names and the number of Hadoop nodes in each rack. Cloudera CDH creates a single rack named /default if no custom rack names are configured.

This property's value must be set if the index's sliderPlacementPolicy property is set to 32 (the default value) to enable High Availability for the index. Failure to set this value when using the default sliderPlacementPolicy may prevent index containers from being started correctly by YARN.

Available in: Attivio Platform 5.6.2 and later releases.
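Putting the two properties together, a project connecting to a CDH 6.2 cluster that uses only CDH's default rack might contain the following entries (illustrative values):

```
# In <PROJECT_DIR>/conf/properties/core-app/attivio.core-app.properties
hadoop.adapter=hadoop-cdh6
# Rack names must include the leading slash; CDH uses /default
# when no custom rack names are configured.
yarn.aa.placement.racks=/default
```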

Optional Properties

The following properties may need to be set in your properties file, if the conditions mentioned in the explanation column of the table apply:

Property: security.hadoop.principal=<user>
Property: security.hadoop.keytab=<keytab-file>

When required: These two properties must be set for connection to the cluster if Kerberos is enabled on the Hadoop cluster. Refer to Kerberos on Hadoop for information on properly setting these properties' values.

Property: hdfs.store.root=<directory>

When required: If the default /attivio HDFS directory is not being used for storage of Attivio libraries and data, you must provide the alternate directory path here.

Property: hadoop.rpc.protection=<value>

When required: If this setting is configured on your Hadoop cluster, this property must be set to the same value.

Property: dfs.data.transfer.protection=<value>

When required: If this setting is configured on your Hadoop cluster, this property must be set to the same value.

Property: hbase.store.namespace=<value>

When required: If the default namespace of attivio has been overridden, this property must be set to the alternate namespace.
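As an illustration, a properties file for a Kerberos-enabled cluster that also uses a non-default HDFS root might look like this (the principal, keytab path, and directory are placeholders):

```
security.hadoop.principal=attivio@YOUR-LOCAL-REALM.COM
security.hadoop.keytab=/opt/keytabs/attivio.keytab
hdfs.store.root=/custom/attivio
```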

Start the Attivio Agent

The Attivio Agent provides communications between the project control tools (AIE-CLI) and the applications and services that make up the Attivio project. The Agent must be running on all Attivio Cluster nodes in your deployment before you can start any other processes.  Detailed instructions on starting and configuring the agent can be found on the Attivio Agent page.

For example, to start a single agent on a local machine, writing data to the directory /opt/attivio/data-agent, you would run the following command:

<install-dir>/bin/aie-agent -d /opt/attivio/data-agent

Deploy and Start the Attivio Project

The Attivio Command Line Interface (AIE-CLI, or just CLI) is a small-footprint utility that runs in an interactive command window. It lets you start, stop, and monitor multiple servers, and also provides tools for deploying the project's source files to the configuration servers. A full explanation of the AIE-CLI along with its commands can be found on the AIE-CLI page.

For example, to start the CLI on your newly created project in /opt/attivio/project1, you would execute this command:

<install-dir>/bin/aie-cli -p /opt/attivio/project1

Once the CLI starts, it will present an "aie>" prompt where you can type commands. To deploy and start all of the Attivio project, type start all.

aie> start all

deploying to YARN

Be patient during this time as deploying and starting all of Attivio can take some time.

 

The state of the project can be determined by running the status command in the CLI.  The status of a running system might look something like this:

aie> status
Clustered (default)
TYPE    NAME  HOST      BASEPORT HANDLE PID   STATUS 
------------------------------------------------------------------------
perfmon       localhost 16960    3      20980 RUNNING
store         localhost 16970    1      20414 RUNNING
node    local localhost 17000    4      21144 RUNNING

index
          writer    row1      
------------------------------
part0     ONLINE    ONLINE    
     reader: 1 desired, 1 live, 0 active requests, 0 failed recently
     writer: 1 desired, 1 live, 0 active requests, 0 failed recently

 

Using the Attivio Java Client API

When writing custom Java code using the Java Client API, you may need to set system properties in the Java code. For example, if Kerberos is enabled, you must set security.hadoop.principal and security.hadoop.keytab. For a full list of properties that may be required, see the list of Optional Properties.

class JavaClientApp {
    public static void main(String[] args) {
        System.setProperty("security.hadoop.principal", "attivio@YOUR-LOCAL-REALM.COM");
        System.setProperty("security.hadoop.keytab", "/opt/keytabs/attivio.keytab");
        ...
        <client-api-calls>
    }
}