The Attivio Platform provides powerful search, workflow, and content-processing capabilities plus robust, standards-based transport operation in a layered infrastructure. This guide provides a high-level overview of Attivio Platform architecture and component programs.
Attivio Platform Architecture
The architecture of the Attivio Platform is based on the Staged Event-Driven Architecture (SEDA) pattern. In SEDA, each pool of components has a work queue in front of it. Components work on their input queue and forward system messages to the queue of the next component in the workflow. This pattern lets Attivio manage processing by sizing the queues and the pools of component instances, while content is processed asynchronously.
Attivio functionality can be viewed at a high level as a set of concentric spheres. Visualizing Attivio as a sphere highlights the single-API nature of the system, which allows for bi-directional operation. The high-level architecture diagram below shows each of the functionality spheres:
- Security - Suite of security functions for securing Attivio and the information within Attivio
- Connectivity - Encompasses logic for connecting to and from external systems
- Workflow - Lightweight Enterprise Service Bus (ESB) style workflows
- Universal Index - Repository for unstructured and structured information that supports precise SQL-like queries as well as full-text searches
Figure 1: Attivio High-level Architecture
Attivio comprises the infrastructure for these spheres and provides a set of default workflows for ingesting, indexing, and querying content. The In Depth Architecture Diagrams page shows the key system components and how they communicate.
Attivio is composed of workflows, components, transports, connectors, and engines. Information flows through Attivio via asynchronous and synchronous messages. Each instance of Attivio runs as a single multi-threaded Java process. Instances of Attivio may be combined together to provide highly-scalable multi-node solutions, to isolate particular components, or to construct highly available and/or fault tolerant systems.
The nodes of a running Attivio system are configurable in a number of different ways depending on the usage, load, and other considerations. Please see Multi-Node Topologies for more information on how to configure this aspect of Attivio.
Workflows and Message Processing
The workflow layer provides infrastructure for executing complex data processing tasks, including content ingestion, querying, and processing of query results. Workflows may operate synchronously or asynchronously. They support loops, conditional execution, branching (providing different processing paths for different documents), and aggregation for recombining branched paths (see Workflow Routing for a more detailed discussion). Workflows consist of a series of data-processing components and may include references to other workflows. Generally, queries operate synchronously so that results are fed back to the client, and ingestion operates asynchronously for optimal performance.
Attivio uses a message-based system for moving content, queries, and administrative commands between components in a workflow. For example, when a document is submitted via the API to an ingestion workflow, it is transported as part of a message. Attivio messages are designed to efficiently represent both structured and unstructured types of data. When processing a request asynchronously, workflows route messages from component to component via message queues. This decoupling allows each component to execute as efficiently as possible since it does not have to block waiting for downstream components to complete processing before it can begin processing the next message. When a component is ready to process a message, it pulls the next message off of its input queue.
Message queues have a fixed size based on number of messages or memory size. As noted earlier, this pattern is the Staged Event-Driven Architecture (SEDA). When component A attempts to transfer work to component B and the queue in front of component B is full, component A waits for component B to complete some tasks from the queue.
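The waiting behavior described above can be illustrated with a standard bounded queue from the JDK. This is a minimal sketch and not Attivio's implementation: the class name `BoundedStageDemo`, its method names, and the use of `ArrayBlockingQueue` are all assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical illustration of a fixed-size stage queue; not Attivio code. */
class BoundedStageDemo {

    /** Offers messages to a bounded queue and counts how many fit without waiting. */
    static int acceptedWithoutWaiting(int capacity, int offered) {
        BlockingQueue<String> queueForB = new ArrayBlockingQueue<>(capacity);
        int accepted = 0;
        for (int i = 0; i < offered; i++) {
            // offer() returns false instead of blocking when the queue is full.
            if (queueForB.offer("message-" + i)) {
                accepted++;
            }
        }
        return accepted;
    }

    /** Returns true if component A blocked on a full queue and resumed once B drained it. */
    static boolean producerWaitsUntilConsumerDrains() {
        BlockingQueue<String> queueForB = new ArrayBlockingQueue<>(1);
        queueForB.add("first");                  // the queue is now full

        Thread componentA = new Thread(() -> {
            try {
                queueForB.put("second");         // blocks: no free slot
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        componentA.start();
        try {
            componentA.join(200);                // give A time to reach put()
            boolean wasBlocked = componentA.isAlive();
            queueForB.take();                    // component B completes one task
            componentA.join(2000);               // A can now finish its put()
            return wasBlocked && !componentA.isAlive() && queueForB.size() == 1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Here `acceptedWithoutWaiting(3, 5)` returns 3, because only three messages fit before the queue is full. `producerWaitsUntilConsumerDrains()` shows the upstream component stalling until the downstream component frees a slot.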
Attivio is capable of maintaining message order in the workflow even when branching and looping. Maintaining message order is important in some applications, for instance where it is important to know when a particular set of documents is ready for searching or when updates to a document appear within the same time period. In the former case, it is essential that the commit message for the set of documents be processed after the set of documents has been indexed. In the latter case, the latest version of a document must be indexed last, or the desired content will be out of date.
When processing a request synchronously, message queues are bypassed and messages are directly sent to the next available instance of a component. In this mode, the processing of a message occurs within a single Java thread.
By default, asynchronous message transfers through sub-workflows bypass message queues when possible. This memory and execution optimization is referred to as Synchronous Subflow Execution. Synchronous subflow execution optimizes memory use by not consuming in-memory queue space unless absolutely necessary. Runtime analysis of the messages and the workflow components determines when fallback to pure asynchronous transfer is necessary.
The ingestion audit system does not provide auditing of system API access, queries, or actions within the Admin UI.
Every message and document processed by the ingestion system is audited and associated with a unique client ID. Auditing tracks the moment a document or message is added to the system (CREATE), processing events (OK), arrival for processing on a particular node (RECEIVE), loss due to system or node failures (LOST), re-feeding of lost data (REFEED), and completion of normal processing (COMPLETE). The audit system replaces the previous message results system with a robust, durable, and accessible API.
The audit system allows Attivio users to answer questions based on client, timestamp, or individual document such as:
- Is processing for client X complete?
- How many document ingestion warnings or errors were encountered?
- When did document Y become searchable?
- What were the processing errors for document Z?
- What occurred on the system between date/time A and date/time B?
Audit data is retained until purged. Audit data can be purged by client or in its entirety. Audit data is stored in the platform storage system, as described in the detailed architecture diagrams. The full capabilities of the audit system can be found in the Javadoc for the AuditReaderApi. Message and document auditing are central to the correctness of asynchronous ingestion in Attivio and cannot be disabled.
Attivio checks every 15 minutes whether any incomplete clients have been inactive for more than 2 hours (attivio.ft.client.idle.max, default=2 hours). If so, a loss-detection round is triggered for the client, and a system event to this effect is generated. The loss-detection round examines the audit trail of the client to find the documents and/or messages which did not complete. For each such document, the last node where the document was processed is checked to see if the document is still being processed for some reason. If not, then a LOST audit record is added for the document or message. In non-fault-tolerant systems this audit record balances the accounts and allows the client to be considered complete.
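The loss-detection round can be sketched as a pass over a per-document audit trail. This is an illustrative model only: the class `LossDetectionSketch`, its in-memory map, and the rule that a document is settled when its latest event is COMPLETE or LOST are assumptions. Only the event names come from the text above, and the real audit data is read through the AuditReaderApi.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory model of one client's audit trail; not Attivio code. */
class LossDetectionSketch {
    enum AuditEvent { CREATE, OK, RECEIVE, LOST, REFEED, COMPLETE }

    /** Document ID mapped to its audit events, in the order they were recorded. */
    private final Map<String, List<AuditEvent>> trail = new LinkedHashMap<>();

    void record(String docId, AuditEvent event) {
        trail.computeIfAbsent(docId, k -> new ArrayList<>()).add(event);
    }

    /** Assumption: a document is settled when its latest event is COMPLETE or LOST. */
    private static boolean isSettled(List<AuditEvent> events) {
        AuditEvent last = events.get(events.size() - 1);
        return last == AuditEvent.COMPLETE || last == AuditEvent.LOST;
    }

    /**
     * Adds a LOST record for every unsettled document that no node is still
     * processing, and returns the IDs that were marked.
     */
    List<String> runLossDetection(Set<String> stillProcessing) {
        List<String> markedLost = new ArrayList<>();
        for (Map.Entry<String, List<AuditEvent>> entry : trail.entrySet()) {
            if (!isSettled(entry.getValue()) && !stillProcessing.contains(entry.getKey())) {
                entry.getValue().add(AuditEvent.LOST);
                markedLost.add(entry.getKey());
            }
        }
        return markedLost;
    }

    /** The client is complete once every document is settled. */
    boolean isClientComplete() {
        return trail.values().stream().allMatch(LossDetectionSketch::isSettled);
    }
}
```

In this model, a document that was created and received but never completed is marked LOST unless a node reports it as still in progress. Once every document is settled, the client counts as complete.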
Automatically Purging Inactive Clients
Over time, the amount of audit information for the system can grow very large. To help manage this, Attivio can automatically clean up inactive clients by purging the audit information of clients that have been marked inactive for an extended period of time. This is controlled by two properties: the frequency at which to check for inactive clients (attivio.audit.purge.interval) and the amount of time to retain an inactive client, measured from when it was marked inactive (attivio.audit.purge.inactiveretentiontime, default=6 hours). Both properties are specified in seconds. To disable purging, set the interval to -1 or leave it commented out. Since fault tolerance relies on the audit details, automatic purging is disabled when the system is set to be fault tolerant (even if the interval is set).
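The purge rules above can be expressed as two small checks. This is a sketch under stated assumptions: the two property names and the 6-hour default come from the text, while the class `AuditPurgeSketch`, its method names, the use of `java.util.Properties`, and treating any non-positive interval as disabled are hypothetical.

```java
import java.util.Properties;

/** Hypothetical helper that mirrors the purge rules described above; not Attivio code. */
class AuditPurgeSketch {
    static final String INTERVAL_KEY = "attivio.audit.purge.interval";
    static final String RETENTION_KEY = "attivio.audit.purge.inactiveretentiontime";
    static final long DEFAULT_RETENTION_SECONDS = 6L * 60 * 60; // 6 hours, in seconds

    /** Purging runs only when an interval is set and the system is not fault tolerant. */
    static boolean purgingEnabled(Properties props, boolean faultTolerant) {
        if (faultTolerant) {
            return false;                    // fault tolerance needs the audit details
        }
        String interval = props.getProperty(INTERVAL_KEY);
        if (interval == null) {
            return false;                    // property left commented out
        }
        // Assumption: -1 and any other non-positive value mean "disabled".
        return Long.parseLong(interval.trim()) > 0;
    }

    /** True once a client has been inactive for at least the retention time. */
    static boolean eligibleForPurge(Properties props, long markedInactiveEpochSec, long nowEpochSec) {
        String raw = props.getProperty(RETENTION_KEY, String.valueOf(DEFAULT_RETENTION_SECONDS));
        long retentionSeconds = Long.parseLong(raw.trim());
        return nowEpochSec - markedInactiveEpochSec >= retentionSeconds;
    }
}
```

With no retention property set, a client marked inactive at time 0 becomes eligible at 21600 seconds, which is the 6-hour default expressed in seconds.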
Attivio provides the capability for ingestion fault tolerance. This mode is disabled by default. When fault tolerance is active, copies of documents and associated content are saved to Attivio's storage layer as they are introduced to the system. This occurs either as documents are created by clients or as child documents are created as the result of content extraction. The fault tolerance system relies on the ingestion audit system to keep track of which documents have successfully reached the universal index.
When enabled, Attivio ensures that the fault tolerance service is always running on one of the Attivio nodes (if it is running on a node that dies, it will be restarted on another). This service scans the audit log for LOST documents or messages associated with a client. When such documents or messages are found, the last good state for the item is fetched from the storage system and refed. This action generates a REFEED audit record for the item. Once all items have been refed successfully, the associated client will be considered complete. Document copies and associated content are deleted from the storage system as soon as they have been processed (OK audit) by the universal index.
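The life cycle of a saved copy under fault tolerance can be sketched as follows. This is a simplified in-memory model rather than Attivio's storage layer: the class `RefeedSketch`, its method names, and the string-based audit log are hypothetical, and only the audit event names are taken from the text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of saved document copies and refeeding; not Attivio code. */
class RefeedSketch {
    private final Map<String, String> savedCopies = new HashMap<>();
    private final List<String> auditLog = new ArrayList<>();

    /** A copy is saved as the document is introduced to the system. */
    void onDocumentIntroduced(String docId, String content) {
        savedCopies.put(docId, content);
        auditLog.add("CREATE " + docId);
    }

    /** Loss detection has determined that the document did not complete. */
    void onLost(String docId) {
        auditLog.add("LOST " + docId);
    }

    /** Fetches the last good state and records a REFEED; returns null if no copy exists. */
    String refeed(String docId) {
        String lastGoodState = savedCopies.get(docId);
        if (lastGoodState != null) {
            auditLog.add("REFEED " + docId);
        }
        return lastGoodState;
    }

    /** Once the index reports OK, the saved copy is no longer needed. */
    void onIndexed(String docId) {
        auditLog.add("OK " + docId);
        savedCopies.remove(docId);
    }

    boolean hasSavedCopy(String docId) {
        return savedCopies.containsKey(docId);
    }

    List<String> auditLog() {
        return auditLog;
    }
}
```

The design point this illustrates is that the saved copy exists only between introduction and the index's OK record, so storage use is bounded by the number of documents in flight.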
The universal index provides a durable work log of processed documents. This work log ensures that documents which have been successfully processed by the index are safely retained. This state is safe even though the index may not have yet been committed. In the case of a system or node crash, the new master index will process the work log prior to accepting new documents. In the case of an unclustered system, the work log can be lost due to disk failure. For clustered systems, the work log is stored in the Hadoop HDFS; the safety of the work log then depends on the redundancy and fault tolerance of the cluster.
Components are general data processing units that operate on PlatformMessages. Components may perform simple conditional evaluation, transform content, or provide a custom implementation that wraps embedded components or interacts with external ones. A component is any Java class which derives from the core class, PlatformComponent. Components execute "multi-instanced": all instances with the same name share the same input message queue. In a running system, the same Java class can be used as several different components (each with a different name), each with its own set of instances. Each component is configured with a maximum number of instances (or defaults to the configured system default). When all instances are in use, new messages to the component start to fill the component's input message queue.
Connectors are components whose purpose is to collect content or changes from external data sources and convert them into IngestDocuments suitable for ingestion within Attivio. Connectors pull data from diverse sources such as file systems, databases, and CSV files. Connectors can be run inside Attivio as services, embedded within a Java program, or executed externally via the command line. For more information, please refer to the Connectors page.
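The multi-instanced model can be pictured as a fixed pool of workers draining one shared bounded queue. The sketch below uses plain JDK concurrency classes and does not use PlatformComponent or PlatformMessage. The class `SharedQueueComponentSketch`, the integer messages, and the sentinel used to stop each instance are assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of component instances sharing one input queue; not Attivio code. */
class SharedQueueComponentSketch {
    private static final int STOP = -1; // sentinel telling one instance to exit

    /** Runs maxInstances workers over one shared queue and returns the processed count. */
    static int process(int maxInstances, int queueCapacity, int messageCount) {
        BlockingQueue<Integer> inputQueue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicInteger processed = new AtomicInteger();
        ExecutorService instances = Executors.newFixedThreadPool(maxInstances);

        for (int i = 0; i < maxInstances; i++) {
            instances.submit(() -> {
                try {
                    while (true) {
                        int message = inputQueue.take(); // pull the next message
                        if (message == STOP) {
                            return;
                        }
                        processed.incrementAndGet();     // stand-in for real work
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            for (int m = 0; m < messageCount; m++) {
                inputQueue.put(m);                       // blocks while the queue is full
            }
            for (int i = 0; i < maxInstances; i++) {
                inputQueue.put(STOP);                    // one sentinel per instance
            }
            instances.shutdown();
            instances.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }
}
```

Calling `process(3, 2, 50)` runs three instances against a queue of capacity two. When all three are busy, further messages accumulate in the queue, and the caller blocks once the queue is full.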
Engines provide the core capability to store and retrieve data in the system. The Universal Index may be composed of one or more engines, each tailored for specific use cases.
The following engine-related services facilitate the processes of content ingestion, indexing, and querying:
- Engine - Provides core indexing and search technology.
- Content Dispatcher - Accepts content to be indexed, and routes the content to one or more Attivio engines. Content Dispatchers can be stacked to support fault tolerance and distribution of indexes across multiple servers.
- Query Dispatcher - Accepts queries and routes them to one or more Attivio engines. Query Dispatchers can be stacked to support querying against indexes distributed across multiple servers.
The Attivio platform provides a complete, self-contained, abstract interface for sending information to Attivio and handling responses, including both query responses and callbacks for data processing results. It makes use of one or more transport connectors.
Main article: Developing with Attivio
The client API is used for developing applications which interact with an Attivio engine. Client APIs are provided for Java and HTTP REST.
Main article: Java Client API
Main article: JSON REST API
The Java server API is used for adding extensions to the Attivio platform.
Main article: Java Server API
Attivio includes both a JDBC driver and an ODBC driver.
Main article: Attivio JDBC Driver
Main article: Attivio ODBC Driver
Endpoints are the server-side interaction point for all client-based communications. Standard receivers process new content, queries, callbacks, and a host of other tasks. Any component can be turned into a receiver by simply adding an input element to the component's declaration as follows: