
Overview

Several AIE Connectors allow copies of incoming documents to be set aside in a special Document Store. Later, if it becomes necessary to reload exactly the same documents (not updated documents), we can use the Document Store Connector to load the local copies instead of repeating the potentially expensive process of tracking down the original documents again. 

Required Modules

These features require the inclusion of the documentstore module when you run createproject to create the project directories. 

 


How It Works

This section presents a general narrative of the Document Store mechanism.

Storing from the Original Connector

In the following diagram the lower tier depicts a typical AIE ingestion connector and workflow pathway. On the left are the original document files, presumed to be located in some file directory or database.  They are ingested by one of several kinds of AIE Connectors.  The resulting ingestDocuments

[Figure: DocumentStoreSchematic]

Storing from a StoreDocument Component

There could be situations where you want to capture the ingestDocument after some expensive text-analysis stage has been completed. Then you can reload the ingestDocuments and insert them into a workflow that is downstream of the expensive stage, avoiding a lot of unnecessary recomputation. The StoreDocument component lets us capture ingestDocuments at any point in the ingestion process and copy them to the Document Store. They can be retrieved later using a Document Store Connector.


[Figure: ShowingStoreDocumentComponent50]

Reloading with a Document Store Connector

As shown in the diagrams above, ingestDocuments in the Document Store can be reloaded using a Document Store Connector. The connector can be configured to drop the ingestDocuments at the beginning of any ingestion workflow.


Example

The following sections demonstrate how to configure document storage and retrieval, using the standard Attivio Factbook example. In it we'll configure country records to be reloadable.

To save the documents to the Document Store, either configure a standard connector to save them, or create a StoreDocument component and put it in a workflow.  It is not necessary to do both.

Configure a Connector

Many connectors can store pristine (unaltered) ingestDocuments in the Document Store.  Edit the connector in the AIE Administrator. Navigate to the Advanced Tab of the connector editor. Expand the Document Store group of properties.

There are four properties to configure:

Property | Remarks
Store Documents Before Feeding | If true, the connector will copy ingestDocuments to the Document Store. If false (the default), documents will not be stored.
Collection Name | Within the Document Store, ingestDocuments are grouped in named collections. The collection name is used to retrieve the correct set of documents later. Generally, each connector would be configured to use its own collection, although multiple connectors can be configured to use the same collection.
Store ContentPointers | If true (the default), the connector will store the content associated with an ingestDocument's contentPointer along with the ingestDocument. This preserves the text of the original document so it can be used by later ingestion stages. When the ingestDocument is reloaded, the content is reposted to the Content Store.
Filter Fields | The named fields each have their own indexes in the Document Store, allowing us to retrieve subsets of a collection based on exact-match queries of these fields.

From this point on, running the connector will silently store copies of the generated ingestDocuments for potential reloading.
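The effect of the Collection Name and Filter Fields properties can be pictured with a small in-memory model. This is only an illustrative sketch, not AIE code: the class CollectionSketch and its methods are hypothetical names, and the real Document Store keeps its indexes in its own storage. It shows one index per filter field and exact-match retrieval of a subset of a collection.

```python
from collections import defaultdict


class CollectionSketch:
    """Hypothetical in-memory stand-in for one named Document Store collection."""

    def __init__(self, name, filter_fields):
        self.name = name
        self.filter_fields = list(filter_fields)
        self.docs = {}  # document id -> stored copy of the field dict
        # One index per filter field: field -> value -> set of document ids
        self.indexes = {f: defaultdict(set) for f in self.filter_fields}

    def store(self, doc_id, fields):
        # Drop stale index entries if this id was stored before.
        old = self.docs.get(doc_id)
        if old is not None:
            for f in self.filter_fields:
                if f in old:
                    self.indexes[f][old[f]].discard(doc_id)
        self.docs[doc_id] = dict(fields)  # keep a copy, not a reference
        for f in self.filter_fields:
            if f in fields:
                self.indexes[f][fields[f]].add(doc_id)

    def find(self, field, value):
        """Exact-match lookup; only configured filter fields are searchable."""
        if field not in self.indexes:
            raise KeyError(f"{field!r} is not a filter field of {self.name!r}")
        return sorted(self.indexes[field].get(value, set()))


countries = CollectionSketch("preWorkflow", filter_fields=["table"])
countries.store("ce.xml", {"table": "country", "size": "6626"})
countries.store("notes.txt", {"table": "misc"})
print(countries.find("table", "country"))  # ['ce.xml']
```

In this model a field that was not listed as a filter field cannot be used for retrieval, which is why it is worth choosing the filter fields before the first documents are stored.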

Create a StoreDocument Component

A StoreDocument component is a workflow stage that intercepts passing ingestDocuments and copies them to the Document Store.  It then passes the ingestDocuments to the next workflow stage without altering them.
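The pass-through behavior can be sketched as follows. This is a conceptual model under stated assumptions, not the StoreDocument implementation: documents are plain dicts, the store is a dict of lists, and make_store_stage, run_workflow, and uppercase_title are hypothetical helpers invented for the illustration.

```python
import copy


def make_store_stage(store, collection):
    """Return a stage that copies each document to the store, then passes it on."""

    def stage(doc):
        # Deep copy so later stages cannot change what was stored.
        store.setdefault(collection, []).append(copy.deepcopy(doc))
        return doc  # the same document, unaltered

    return stage


def run_workflow(stages, doc):
    """Feed one document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


def uppercase_title(doc):
    """Stand-in for a later (possibly expensive) workflow stage."""
    doc["title"] = doc["title"].upper()
    return doc


store = {}
workflow = [make_store_stage(store, "countries"), uppercase_title]
result = run_workflow(workflow, {"id": "ce.xml", "title": "Sample Country"})
print(result["title"])                 # SAMPLE COUNTRY
print(store["countries"][0]["title"])  # Sample Country
```

The stored copy reflects the document as it looked at the point where the stage sits in the workflow, so moving the stage downstream of an expensive step captures the results of that step.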

In the AIE Administrator, navigate to the System Management > Palette screen. Open the New dialog and search for the StoreDocument document transformer.  Select it to open a component editor.

There are several properties to configure:

Property | Remarks
Name | The first time you edit the component, you will have to give it a name. After that the name is fixed and cannot be changed.
Type | This field displays the type of component.  It is not editable.
Workflow Referenced In | After you insert this component in a workflow, this field will display the name of the workflow.  It is not editable.
Source URI | This field indicates the location of the dynamic configuration source file on the configuration server(s).  It is not editable.
Collection Name | Within the Document Store, ingestDocuments are grouped in named collections. The collection name is used to retrieve the correct set of documents later.
Filter Fields | The named fields each have their own indexes in the Document Store, allowing us to retrieve subsets of a collection based on exact-match queries of these fields.
Store ContentPointers | If true (the default), the component will store the content associated with an ingestDocument's contentPointer along with the ingestDocument. This preserves the text of the original document so it can be used by later ingestion stages. When the ingestDocument is reloaded, the content is reposted to the Content Store.

Creating the component is not enough. We must also insert the component in a workflow.  Navigate to System Management > Workflows > Ingest Workflows.  Then edit the desired workflow.  In this example, we insert the StoreDocument component (MyStoreDocumentTransformer) into the countryXML workflow, which is used by the Factbook country connector.

From this point on, ingestDocuments from any source that pass through this component will be silently copied to the Document Store for potential reloading.

Create a Document Store Connector

To reload documents from a collection in the Document Store, we must create and run a Document Store Connector.

Navigate to System Management > Connectors and click on the New link. Create a new Document Store Connector.

There are several properties to be configured here:

Property | Remarks
Connector Name | The first time you edit the connector, you will have to give it a name. After that the name is fixed and cannot be changed.
Collection Name | Within the Document Store, ingestDocuments are grouped in named collections. The collection name is used to retrieve the correct set of documents.
Filters | Enter a field name and a value. Documents in the collection that exactly match this filter will be reloaded. This is an alternative to using the Query feature. Do not attempt to use both at once.
Query | This field can contain an Advanced Query Language query. Only those documents in the named collection that match the query will be reloaded. This is an alternative to using the Filters feature. Do not attempt to use both at once.
Document ID Prefix | Prepend this prefix to the Document ID during processing.
Ingest Workflow | This is the name of the workflow where the reloaded ingestDocuments will be injected. The documents can only be injected at the beginning of a workflow, not between workflow components.

Run this connector at any time, just like any other connector. It will reload the specified ingestDocuments into the ingestion pathway.
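The selection logic described by these properties can be sketched as below. This is a simplified model, not the connector's code: reload_documents is a hypothetical function, the stored documents are plain dicts, the Query property is represented by an ordinary Python predicate rather than a real Advanced Query Language query, and the sketch assumes the Document ID Prefix is placed in front of the ID, as the word "prefix" implies.

```python
def reload_documents(collection_docs, feed, id_prefix="", field_filter=None, query=None):
    """Re-feed stored documents to the start of a workflow.

    field_filter: optional (field, value) pair, matched exactly.
    query: optional predicate on a stored document (stand-in for an AQL query).
    feed: callable that receives each reloaded document.
    """
    if field_filter is not None and query is not None:
        # The two selection mechanisms are alternatives, so reject using both.
        raise ValueError("use either a filter or a query, not both")
    count = 0
    for doc in collection_docs:
        if field_filter is not None:
            field, value = field_filter
            if doc["fields"].get(field) != value:
                continue
        if query is not None and not query(doc):
            continue
        # Build a fresh document so the stored copy stays untouched.
        feed({"id": id_prefix + doc["id"], "fields": dict(doc["fields"])})
        count += 1
    return count


stored = [
    {"id": "ce.xml", "fields": {"table": "country"}},
    {"id": "notes.txt", "fields": {"table": "misc"}},
]
received = []
n = reload_documents(stored, received.append,
                     id_prefix="reload-", field_filter=("table", "country"))
print(n, [d["id"] for d in received])  # 1 ['reload-ce.xml']
```

Raising an error when both a filter and a query are supplied is a choice made for this sketch; the source only says not to use both at once and does not state what the connector does in that case.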

How to Test Document Reloading

When experimenting with this feature it is a simple matter to verify that it is working.

As a first check, confirm that <data-agent>\projects\<project-name>\<environment>\data\data-store\store\db.log contains entries showing the documents being stored in the Document Store.

<data-agent>\projects\<project-name>\<environment>\data\data-store\store\db.log
INSERT INTO DOCUMENT VALUES(605,'preWorkflow','country-C:\attivio50\conf\factbook\content\countries\ce.xml','2015-10-21 09:19:47.320000',NULL,NULL,NULL,NULL,NULL)
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','filename',NULL,NULL,NULL,4,'ce.xml')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','sourcepath',NULL,NULL,NULL,4,'C:\attivio50\conf\factbook\content\countries\ce.xml')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','date',NULL,NULL,NULL,5,'1445032282000')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','size',NULL,NULL,NULL,3,'6626')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','bytes',NULL,NULL,NULL,7,'<contentPointer storeName="aie.docstore.preWorkflow">C:\attivio50\conf\factbook\content\countries\ce.xml-14301e3a-6940-4883-826c-93ffa91c092d-42</contentPointer>')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','asap.application',NULL,NULL,NULL,4,'Factbook')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','asap.source.structured',NULL,NULL,NULL,4,'true')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','table',NULL,NULL,NULL,4,'country')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','asap.source.type',NULL,NULL,NULL,4,'File Connector')
COMMIT
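For a larger load, counting the entries by hand is impractical. The sketch below tallies the INSERT INTO DOCUMENT lines in such a log. It is based only on the excerpt above: it assumes each stored document produces one such line and that the second value ('preWorkflow' here) is the collection name, which the excerpt suggests but does not confirm. The function name count_stored_documents is hypothetical.

```python
import re
from collections import Counter

# Matches: INSERT INTO DOCUMENT VALUES(<number>,'<second value>','<third value>',...
DOCUMENT_ROW = re.compile(r"^\s*INSERT INTO DOCUMENT VALUES\((\d+),'([^']*)','([^']*)'")


def count_stored_documents(lines):
    """Count DOCUMENT inserts, grouped by the second value of each row."""
    counts = Counter()
    for line in lines:
        match = DOCUMENT_ROW.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


sample = [
    r"INSERT INTO DOCUMENT VALUES(605,'preWorkflow','country-C:\attivio50\conf\factbook\content\countries\ce.xml','2015-10-21 09:19:47.320000',NULL,NULL,NULL,NULL,NULL)",
    r"INSERT INTO DOCFIELD VALUES(605,'preWorkflow','filename',NULL,NULL,NULL,4,'ce.xml')",
    "COMMIT",
]
print(dict(count_stored_documents(sample)))  # {'preWorkflow': 1}
```

To run it against a real log, pass an open file object instead of the sample list, since iterating over a text file yields its lines.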

To test the retrieval mechanism, follow these steps:

  1. Starting with an empty index, load a set of documents through an appropriate connector and workflow so that documents are added to the Document Store.
  2. Navigate to the System Management > Indexes page and note how many documents are in the index.
  3. Delete the index (use the "Delete All" link on the indexes page).
  4. Run your Document Store Connector to reload the documents.
  5. Check the indexes page again to see how many documents are in the index.

The restored index should have the same number of documents as the original index. 

Partial Updates

The Document Store can be used in conjunction with the partial update feature.

You can use a custom ingest transformer or a ChangeDocumentMode component to intercept ingestDocuments and change their DocumentMode property to PARTIAL instead of the default ADD.

When a StoreDocument component encounters an ingestDocument with DocumentMode.PARTIAL, it retrieves the previously stored document from the Document Store (if one exists), and then updates that document with the fields in this PARTIAL update before storing the updated document back in the store.
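The merge step can be illustrated with a few lines of code. This is a sketch of the behavior described above, not the component's implementation: the store is a plain dict keyed by document ID, store_document is a hypothetical function, and the handling of a PARTIAL update with no previously stored document (the fields are simply stored as given) is an assumption, since the text does not describe that case.

```python
def store_document(store, doc_id, fields, mode="ADD"):
    """Store a document; in PARTIAL mode, merge into the existing copy."""
    if mode == "PARTIAL" and doc_id in store:
        merged = dict(store[doc_id])  # start from the previously stored document
        merged.update(fields)         # overlay the fields carried by the update
        store[doc_id] = merged
    else:
        # ADD mode, or PARTIAL with nothing stored yet (assumed behavior).
        store[doc_id] = dict(fields)
    return store[doc_id]


store = {}
store_document(store, "ce.xml", {"table": "country", "size": "6626"})
store_document(store, "ce.xml", {"size": "7000"}, mode="PARTIAL")
print(store["ce.xml"])  # {'table': 'country', 'size': '7000'}
```

Under this model the stored copy always holds the full, merged document, so a later reload would feed the complete document rather than only the fields of the last partial update.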

 
