Welcome to Attivio's Cognitive and Insights Platform! Using Attivio becomes much easier if you have a general grasp of the Attivio Concepts and Tools that are mentioned throughout the Attivio documentation.
This tutorial explores ingestion, which is the process of gathering content and preparing it for inclusion in the Attivio Universal Index.
Concepts and Vocabulary
The process of content ingestion has many interacting parts that need to be understood in context. This section presents the basic concepts and vocabulary of ingestion, building context for later expansion.
Ingestion is the process of gathering content and preparing it for inclusion in the Attivio Universal Index.
A connector is the interface to the external documents. Some connectors extract text and metadata from the content source. Others pull entire files into the system for later text extraction. The incoming content is repackaged into IngestDocuments. The IngestDocuments are then passed to an ingestion workflow as messages.
A workflow is a chain of stages that the IngestDocument must pass through on its way to the index. Many of the stages are components that perform useful transformations on the IngestDocument. An Attivio ingestion path usually contains multiple workflows and many components. This is the simplified schematic diagram:
This is a very simplified overview of ingestion concepts and vocabulary. We will gradually expand on this theme as we explore the many types of connectors, scanners, workflow stages, components and transformers that Attivio has to offer.
What Can Attivio Ingest?
Attivio can read over 500 file formats, which include nearly every kind of document you would normally find on a computer file system. Attivio can ingest content from Web sites, content management systems, document management systems, and email systems, and can use SQL to read records from most common databases. In addition, Attivio provides an easily-extensible connector framework to handle customized integration with virtually any repository.
With the proper supporting modules, Attivio can ingest, analyze, index and search documents in forty-five languages. This includes documents that include multiple languages, such as Chinese documents with embedded English text.
What is an IngestDocument?
During the process of ingestion, Attivio temporarily stores a document's content in a container object called an IngestDocument. An IngestDocument is a multimap, which is a container for named fields. Fields can have multiple values. You can think of the IngestDocument as a two-column table where field names are paired with one or more values.
Document ID cannot be changed
Once an IngestDocument has been created, its ID field value cannot be changed.
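The multimap idea can be sketched in a few lines of Python. This is only an illustrative model, not the Attivio API: the class name IngestDocumentSketch and its methods add_value and get_values are hypothetical names chosen for this example.

```python
from collections import defaultdict


class IngestDocumentSketch:
    """Hypothetical stand-in for an IngestDocument: an ID plus a multimap of fields."""

    def __init__(self, doc_id):
        self._id = doc_id                 # fixed when the document is created
        self._fields = defaultdict(list)  # field name -> list of values

    @property
    def id(self):
        # Read-only: no setter is defined, so assigning to .id raises AttributeError.
        return self._id

    def add_value(self, name, value):
        # A field may hold several values, so each call appends rather than overwrites.
        self._fields[name].append(value)

    def get_values(self, name):
        # Return a copy so callers cannot modify the stored list by accident.
        return list(self._fields.get(name, []))


doc = IngestDocumentSketch("doc-1")
doc.add_value("title", "Annual Report")
doc.add_value("author", "Smith")
doc.add_value("author", "Jones")
```

Here the author field ends up with two values, while the document ID stays fixed for the life of the object.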
At the simplest level, IngestDocuments are created by connectors and are refined and augmented by workflows. IngestDocuments can also be created by the Advanced Text Extraction workflow, which can operate on nested ZIP files. It is possible to create external applications using the Java Client API that build IngestDocuments and insert them into workflows, bypassing the connector level.
Eventually, all of the IngestDocument fields that are configured in the schema are written into the Attivio index. (There can be fields that are not destined for the index. These are discarded by the indexer.) At the end of its life cycle, each IngestDocument is passed to a special sink workflow where its memory resources are reclaimed.
Fields of IngestDocuments
The Attivio Schema describes the datatypes and behavior of strongly-typed fields in the Universal Index. Many users assume that these strong datatype restrictions must apply to IngestDocument fields, too, but this is a misleading idea.
Schema field definitions are not binding on the fields of an IngestDocument, even when the fields have the same names as Schema fields. An IngestDocument is a scratch pad where Attivio connectors and workflow components work up a description of an index entry. Field values can be transformed from one datatype to another during this process.
The schema field definitions are applied during indexing. The indexer attempts to cast the document's field values into the types required by the schema. If one or more field values cannot be correctly cast, the document is dropped.
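The "cast at indexing time" behavior can be modeled roughly as follows. This is a sketch under stated assumptions, not Attivio's indexer: the SCHEMA dictionary, the field names pages and price, and the function cast_for_index are hypothetical, and Python types stand in for schema datatypes.

```python
# Hypothetical schema: field name -> Python type standing in for a schema datatype.
SCHEMA = {"title": str, "pages": int, "price": float}


def cast_for_index(fields, schema=SCHEMA):
    """Return an index entry, or None if any recognized value cannot be cast."""
    entry = {}
    for name, values in fields.items():
        if name not in schema:
            continue  # fields the schema does not recognize are not indexed
        try:
            entry[name] = [schema[name](v) for v in values]
        except (TypeError, ValueError):
            return None  # one value that cannot be cast drops the whole document
    return entry
```

Until this step runs, a field such as pages can freely hold the string "12"; the type only matters when the document reaches the indexer.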
What is a Message?
The architecture of Attivio is based on the Staged Event-Driven Architecture (SEDA) pattern. In SEDA, each pool of components has a message queue in front of it. Each component works on its input queue and forwards system messages to the queue of the next component in the workflow.
Attivio uses DocumentList messages for moving IngestDocuments from one stage to another in a workflow. A DocumentList message usually contains only one IngestDocument, although it is capable of transporting batches of documents when desired.
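The queue-per-stage pattern can be illustrated with Python's standard queue and threading modules. This is a minimal sketch of the general SEDA idea, not Attivio's messaging code: it assumes one worker thread per stage instead of a pool, and a plain list of dictionaries stands in for a DocumentList message. All function names are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def run_stage(transform, inbox, outbox):
    """Take messages from this stage's queue, process them, forward to the next queue."""
    while True:
        message = inbox.get()
        if message is STOP:
            outbox.put(STOP)  # pass the shutdown signal down the chain
            return
        # A message is a list of documents, loosely modeled on a DocumentList.
        outbox.put([transform(doc) for doc in message])


def add_length(doc):
    return {**doc, "length": len(doc["text"])}


def uppercase_title(doc):
    return {**doc, "title": doc["title"].upper()}


q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
workers = [
    threading.Thread(target=run_stage, args=(add_length, q1, q2)),
    threading.Thread(target=run_stage, args=(uppercase_title, q2, q3)),
]
for w in workers:
    w.start()

q1.put([{"title": "report", "text": "hello"}])  # usually one document per message
q1.put(STOP)
for w in workers:
    w.join()

result = q3.get()  # the message after it has passed through both stages
```

Because each stage only reads its own queue and writes to the next one, a slow stage builds up a backlog in its queue without blocking the stages ahead of it.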
What is the Attivio Schema?
The first step of setting up ingestion is to decide how the data in your source documents should be mapped into the fields of an IngestDocument for indexing.
The Attivio Schema is a file (<project-dir>\conf\schema\default.xml) that defines every IngestDocument field recognized by the Attivio indexer. Your connectors and workflow stages can create, name and manipulate new IngestDocument fields quite freely. When an IngestDocument reaches the indexer, however, only the recognized fields are written into the index. Unrecognized fields are ignored.
Therefore, if you want a field to be indexed, you must either use a field that is already defined in the schema (such as title), or add a description of your new field to the schema file.
There is a convenient exception to this rule, involving new fields that have specific suffixes. For example, the schema file contains a wildcard definition that says any field ending with _s will be an indexed string field. If your document contains a field such as myTitle_s, it will be indexed although it is not explicitly named in the schema.
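The suffix rule behaves like a wildcard match on the field name. The sketch below models that lookup in Python; it is an illustration only, and the names EXPLICIT_FIELDS, DYNAMIC_PATTERNS and index_type are hypothetical. Only the _s rule and the title field come from the text above.

```python
from fnmatch import fnmatchcase

EXPLICIT_FIELDS = {"title"}           # fields named outright in the (hypothetical) schema
DYNAMIC_PATTERNS = {"*_s": "string"}  # wildcard definition: any field ending with _s


def index_type(field_name):
    """Return how a field would be indexed, or None if it would be ignored."""
    if field_name in EXPLICIT_FIELDS:
        return "explicit"
    for pattern, datatype in DYNAMIC_PATTERNS.items():
        if fnmatchcase(field_name, pattern):
            return datatype
    return None
```

With these definitions, myTitle_s resolves to a string field even though it is never named explicitly, while a field such as scratch matches nothing and would be ignored.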
Seepage for information on this topic.
What is the Store?
The Store is the Attivio utility that caches various kinds of data, either temporarily or permanently, against the time when it is needed. The store supports very different activities in Attivio:
- Document Store
- Connector History Store
What is the Document Store?
Sometimes extracting the documents from their native silo is the expensive part of ingestion. For instance, extracting records from a database can place an unwelcome load on the database engine. In such cases, Attivio can make a permanent cache of the extracted documents in case it becomes necessary to load them again at a later date. This is the fastest and least-expensive way to rebuild an index, for instance.
What is the Connector History Store?
The Store contains a fourth store, rarely mentioned because it doesn't need user configuration, that maintains a record of connector activity to support incremental updating of connectors.
What is a Connector?
An Attivio connector acquires and reads documents (or database records), repackages them into IngestDocuments, wraps each IngestDocument in a message, and dispatches the message to an ingestion workflow. It also optionally listens for return messages confirming that the document has been indexed successfully.
A Connector encapsulates a scanner, a message publisher, and a message result listener.
It is very easy to create or modify connectors either by using the Attivio Administrator connector editor, or by directly editing your project's XML configuration files.
A scanner is the underlying Java class that acquires content for a connector. It is the first of three elements that define a connector.
Scanners convert the source data format into an IngestDocument. There are a number of classes in Attivio that implement the Scanner interface. Examples are the FileScanner and the XMLScanner, as well as two special scanners that can delete content from the Attivio index. There are also a number of specialized modules in Attivio that provide connectors which implement the Scanner interface. The dbconnector module, for example, contains the DatabaseScanner, and the more specialized JoiningDatabaseScanner.
Message Publisher (Feeder)
The second essential part of a connector is the publisher (or "feeder"). The publisher sends the new IngestDocuments to an ingestion workflow. Configuration is as simple as providing the name of the desired workflow.
The default feeder is the DirectMessagePublisher.
Message Result Listener
The third part of a connector is the message result listener. This mechanism listens for messages coming back from the ingestion workflow confirming that any specific document was, or was not, successfully indexed. The message result listener can optionally relay this information to an external application if desired.
The default message result listener is the LoggingResultListener, whose job is to log the returned results. This listener translates any document warnings and failures into WARN and ERROR log messages, accordingly.
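The division of labor among the three parts of a connector can be sketched as three small functions. This is a conceptual model, not Attivio's Scanner, publisher, or listener classes: every name below is hypothetical, and the workflow is reduced to a function that returns a status string.

```python
import logging

log = logging.getLogger("connector_sketch")


def scan(records):
    """Scanner: turn source records into document dictionaries."""
    for i, record in enumerate(records):
        yield {"id": f"doc-{i}", "text": record}


def publish(doc, workflow):
    """Publisher: hand the document to the chosen workflow and return its result."""
    return workflow(doc)


def log_result(doc_id, result):
    """Result listener: translate failures and warnings into ERROR and WARN log messages."""
    if result == "failed":
        log.error("document %s failed", doc_id)
    elif result == "warning":
        log.warning("document %s was indexed with warnings", doc_id)


def run_connector(records, workflow):
    """Connector: scan, publish, and listen for the result of each document."""
    results = {}
    for doc in scan(records):
        result = publish(doc, workflow)
        log_result(doc["id"], result)
        results[doc["id"]] = result
    return results


def toy_workflow(doc):
    # Stand-in for ingestion: a document with empty text counts as a failure.
    return "indexed" if doc["text"] else "failed"


results = run_connector(["alpha", ""], toy_workflow)
```

Swapping in a different scan function changes where content comes from without touching the publishing or listening logic, which mirrors how one connector container can wrap different scanners.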
Attivio includes many kinds of connectors, either as part of the basic product or in add-on modules. Follow the links for more information.
We can configure connectors using Dynamic Configuration in the Attivio Administrator.
The first step in configuring a connector is to open the Attivio Administrator and examine the New Connector dialog box.
This dialog box contains the list of connectors that are available in your project. The CSV Connector, File Connector, and XML Connector are always present, as are the connectors for deleting content. The other connectors come from special modules that must be added to your project using the createProject tool.
Connectors are very simple to configure. The connector object is mainly a container for the parameters that govern scanner behavior. Configuring a connector is just a question of filling in the blanks.
For help interpreting the fields on the Connector Editor, hover the mouse over a field to view the tooltip message. It is also helpful to examine the methods in the scanner's javadoc page. The editor fields correspond to the scanner's "set" methods.
Examples of Configured Connectors
In addition, the XML definitions of the four Quick Start Tutorial connectors can be found in <install-dir>\conf\sdk\factbook.xml. Note that this file is in your Attivio installation tree, not your project tree.
What is a Workflow?
A "document" (or "ingest") workflow describes a path for incoming IngestDocuments to follow. A top-level workflow begins at a connector, proceeds through a series of subflows and transformers, and terminates at an indexer. You may think of a workflow as a conveyor belt that moves a document from one station to the next as it is processed.
A workflow can be configured to split, rejoin, and even loop if such is required.
There are also "query" workflows and "response" workflows. We will take them up in a later lesson.
Parts of a Workflow
This diagram shows part of the standard ingest workflow from the Quick Start Tutorial exercise. The large boxes are the workflows. The smaller labels tucked inside the boxes are subflows and components. Components are often transformers that modify the IngestDocument, but can also be routers (splitters and joiners) that control document flow from one workflow to another.
In the following diagram, the large boxes are workflows. The smaller boxes are subflows and components (transformers, splitters and joiners).
As you can see from the image above, the ingest workflow is simply a list of subflows that each incoming document will visit in sequence. The image shows the attivioLinguistics workflow expanded in detail. There are several components that perform text analysis on the documents, and there is one conditional splitter/joiner component in the entityExtraction workflow.
In this section we'll briefly examine workflow routing.
A subflow component temporarily diverts a flow of IngestDocuments to another workflow. When that workflow is finished with the documents, they return to the original subflow component, where they rejoin the parent workflow.
The return path is not shown in these graphs. You have to supply it in your imagination.
The Attivio ingest workflow consists entirely of subflow components that send IngestDocuments to other workflows in a prescribed order.
Note that the Component Details display identifies the underlying Java class object, the name of the destination workflow, and the source file where the component is defined.
Splitters and Joiners
Attivio offers a variety of splitter components that conditionally send some documents to one workflow while sending other documents down a different path. The textExtraction workflow in the Advanced Text Extraction Module provides two excellent examples of this feature in use.
In the first example, the textExtraction component sends HTML documents off to the htmlTextExtraction workflow, while sending more complex documents to the advteConvert workflow ("all other cases"). This split behaves like a conditional subflow. The documents eventually come back to the splitter location, rejoining the original workflow at the textExtraction-trustSiteHarvesterContentType-joiner component.
As it happens, the advteConvert workflow sometimes unpacks compound documents. Each of the "child" files is assigned to its own new IngestDocument. The new IngestDocuments are then routed back to the textExtractionWithDocTypeDetection-splitBasicAndAdvancedTextExtraction-joiner component.
These new IngestDocuments need to be analyzed as new input. The childDocRouter splitter recognizes the new documents and sends them back to the beginning of the textExtraction workflow. This splitter does not use a joiner. It sends documents on a one-way trip to a new destination. They simply flow through the system as new input. Eventually they will pass through both splitters again before eventually being dispatched to the ingest workflow for analysis and indexing.
Attivio's splitters can optionally generate a joiner when a "conditional subflow" behavior is needed, and can also support absolute splits where the documents do not automatically return to the parent workflow. This feature is controlled by the splitter's rejoin property, which is demonstrated in the configuration examples, below.
The most common splitter in Attivio workflows is SplitDocumentListByFieldValue.
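Routing by field value, with and without a rejoin, can be modeled in a few lines. This sketch is not the SplitDocumentListByFieldValue implementation: the function split_by_field, the mimetype field name, and the two toy workflows are hypothetical, and the rejoin parameter only imitates the behavior described above.

```python
def split_by_field(docs, field, routes, default, rejoin=True):
    """Route each document by the value of `field`.

    routes maps a field value to a workflow function; default handles all other cases.
    With rejoin=True the processed documents come back to the caller (a conditional
    subflow). With rejoin=False the trip is one-way and nothing is returned here.
    """
    returned = []
    for doc in docs:
        workflow = routes.get(doc.get(field), default)
        processed = workflow(doc)
        if rejoin:
            returned.append(processed)
    return returned


def html_extraction(doc):
    return {**doc, "extractedBy": "html"}


def advanced_convert(doc):
    return {**doc, "extractedBy": "advanced"}


docs = [
    {"id": "a", "mimetype": "text/html"},
    {"id": "b", "mimetype": "application/pdf"},
]
joined = split_by_field(docs, "mimetype", {"text/html": html_extraction}, advanced_convert)
one_way = split_by_field(
    docs, "mimetype", {"text/html": html_extraction}, advanced_convert, rejoin=False
)
```

In the first call both documents return to the caller after taking different paths; in the second call the list that comes back is empty because the documents continue elsewhere.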
Configuring Workflow Components
Workflows are assembled from components, splitters and other elements. Let's define the elements first, and then the workflow.
Using the Attivio Administrator
Components are created in the Palette page of the Attivio Administrator. Click the New link to open a long list of component types. The ingestion transformers and splitters are in the "Document" branch of the tree.
Choose a transformer to open the Component Editor. This is the editor for the childDocRouter splitter mentioned in a previous example.
Fill in the essential fields and save the component. Repeat as necessary until you have all the pieces of your workflow ready to be assembled.
Tips on Component Editing
Attivio edits your project files in two locations, depending on whether you use Attivio Administrator or edit the files on disk.
- The createProject tool manages a <project-dir> tree of local configuration files. These are local files on a development computer that can be backed up. Local edits change these files.
- Attivio Administrator uses Dynamic Configuration to change configuration files on the Configuration Servers. These are files that are downloaded to each of the Attivio nodes when it starts this project.
One needs to be careful about reconciling the two types of changes:
- To move Dynamic Configuration changes from the Configuration Servers to the <project-dir> tree, use the update command from the Attivio-CLI.exe tool.
- To move <project-dir> changes from the source tree to the Configuration Servers, use the deploy command from the Attivio-CLI.exe tool.
When using Dynamic Configuration, it is a best practice to update the changes and then immediately deploy them. This adds the dynamic changes to the <project-dir> source tree, and then deploys the source files out to the configuration servers. This lets you capture the dynamic changes in the <project-dir> tree so you can back them up.
Assembling a Workflow
Once your components are defined, you can assemble them into a workflow. A workflow is just a list of stages that IngestDocuments will visit in top-down order.
A typical ingestion (document) workflow consists of three kinds of stages:
- documentTransformer: A component that analyzes and makes changes in a document. (In the Workflow Editor, this is called a "Document Stage Type.")
- splitter: A component that reroutes documents to other workflows. It can optionally generate a joiner so that the documents return to the parent workflow. (In the Workflow Editor, this is called a "Splitter Stage Type.")
- subflow: Sends all documents to another workflow, and then returns them all to this spot for further processing. (It is not technically a "stage" and therefore has no stage type.)
In addition you can include a <description> string to document the workflow.
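The "list of stages visited in top-down order" idea, including the way a subflow diverts a document and then returns it, can be sketched as follows. This is a simplified model rather than Attivio's workflow engine: the function run_workflow, the WORKFLOWS table and the two transformers are hypothetical, and splitter stages are left out for brevity.

```python
def run_workflow(name, doc, workflows):
    """Pass one document through the named workflow's stages in top-down order."""
    for kind, target in workflows[name]:
        if kind == "documentTransformer":
            doc = target(doc)  # target is a function that changes the document
        elif kind == "subflow":
            # target is a workflow name; the document returns here when it finishes
            doc = run_workflow(target, doc, workflows)
        else:
            raise ValueError(f"unknown stage kind: {kind}")
    return doc


def trim_title(doc):
    return {**doc, "title": doc["title"].strip()}


def count_words(doc):
    return {**doc, "wordCount": len(doc["text"].split())}


WORKFLOWS = {
    "ingest": [("documentTransformer", trim_title), ("subflow", "analysis")],
    "analysis": [("documentTransformer", count_words)],
}

out = run_workflow("ingest", {"title": "  Report ", "text": "one two three"}, WORKFLOWS)
```

The recursive call is what makes the return path implicit: when the analysis workflow ends, control falls back to the stage that follows the subflow in the parent list.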
Using the Attivio Administrator
To create or edit an ingestion workflow in the Attivio Administrator, navigate to System Management > Workflows > Ingest. Click the New link at the top of the list or select an existing workflow to edit. This opens the INGEST workflow editor.
Editing the workflow is mainly a matter of naming it and then picking components from lists. There are buttons to move the stages up or down the list. There's a button for adding a subflow stage.
The document transformers and splitters must be manually identified by using the Edit Stage Type button. Be sure you have correctly identified each of these components as a "document" or "splitter" before saving the workflow.
Examples of Configured Workflows and Components
You can find workflow and component examples in the following files in your <install-dir> tree:
The ultimate test of a workflow is the set of values returned by a query. Are you getting back the things you expected to see? The correct or incorrect values arriving in the index will pinpoint any transformer issues in the ingestion process.
Once you have the correct results, the next question is efficiency. Some transformers just copy a value from one field to another, but others undertake whole-document analysis and transformation. These stages naturally take more time to complete and can limit the overall throughput of the system.
The Attivio memory module provides multiple diagnostic tables in the Attivio Administrator. (The memory module is loaded by default with all Attivio projects, unless you specifically disable it.) Be sure to examine all of the interface displays listed under the Diagnostics page.
Distributing Workflow Stages
Suppose that, having corrected all known configuration errors, we still have a component that is overloaded. How can we distribute that workload across multiple Attivio nodes to take advantage of some parallel processing?
Distribute workflows, not components!
Note that one can distribute a workflow across multiple nodes, but not individual components. To distribute one troublesome component, wrap it in its own workflow and then proceed.
In the Quick Start Tutorial we set up the "factbook" demo and ran it. In the subsequent Multi-Node Topologies lesson we expanded the original demo to run across four Attivio nodes on a network, with all ingestion workflows running on node1. Let's modify the four-node example to distribute the extractBaseEntities workflow across the four nodes for greater efficiency.
This turns out to be very easy to do. All we need is a named nodeset containing all four nodes:
We also need a single <workflow-location> element added to the project's existing topology-layout.xml file.
In a non-distributed project, the extractBaseEntities workflow is a subflow of the entityExtraction workflow:
When you distribute the workflow across multiple nodes, this subflow is replaced by a "distributor" node. It has a pre-processing stage ahead of it, and a joiner stage after:
The distributor sends IngestDocuments to the distributed copies of the workflow, and coordinates bringing the processed documents back to this point in the original workflow. Processing resumes as it did before.
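The distribute-then-rejoin pattern can be illustrated with a thread pool, where each thread stands in for a copy of the workflow on one node. This is only a sketch: round-robin assignment is an assumption made for the example, not a statement about how Attivio balances load, and the names distribute and extract_entities are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def distribute(docs, workflow, nodes):
    """Send documents round-robin to per-node copies of a workflow, then rejoin in order."""
    assignments = list(zip(docs, cycle(nodes)))
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # map() yields results in submission order, which plays the role of the joiner.
        return list(pool.map(lambda pair: workflow(pair[0], pair[1]), assignments))


def extract_entities(doc, node):
    # Stand-in for the distributed workflow; it records which node did the work.
    return {**doc, "processedOn": node}


nodes = ["node1", "node2", "node3", "node4"]
docs = [{"id": f"doc-{i}"} for i in range(6)]
processed = distribute(docs, extract_entities, nodes)
```

The caller sees the same documents in the same order as before, which is why the rest of the parent workflow can continue unchanged after the distributed step.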
Indexer and Sink Workflows
The last stage of the ingest workflow is the indexer.
The indexer workflow contains a content dispatcher. This stage processes the IngestDocument into an index entry, and then distributes the new entry among the index partitions that require it.
When Attivio has completely finished with an IngestDocument, either because it has been indexed or because it has been dropped due to an error, the IngestDocument is sent to the sink workflow. There it is unbuilt, freed, and recycled.
This lesson has followed the information path from your original documents to the Attivio Universal Index. We have explored all of the concepts and tools that are relevant to Attivio's ingestion workflows.