Overview
Several AIE Connectors allow copies of incoming documents to be set aside in a special Document Store. Later, if it becomes necessary to reload exactly the same documents (not updated documents), we can use the Document Store Connector to load the local copies instead of repeating the potentially expensive process of tracking down the original documents again.
Required Modules
These features require the inclusion of the documentstore module when you run createproject to create the project directories.
How It Works
This section presents a general narrative of the Document Store mechanism.
Storing from the Original Connector
In the following diagram the lower tier depicts a typical AIE ingestion connector and workflow pathway. On the left are the original document files, presumed to be located in some file directory or database. They are ingested by one of several kinds of AIE Connectors. The resulting ingestDocuments are sent to some ingest workflow by the connector. From there they undergo further processing and are eventually indexed.
Connectors that can put ingestDocuments into the Document Store can be identified by the presence of Document Store properties on the Advanced Tab of the connector editor in the AIE Administrator.
The Document Store properties allow a connector to store copies of unmodified ingestDocuments in the Document Store. The ingestDocuments can later be retrieved with a Document Store Connector. Usually, we configure the Document Store Connector to insert the reloaded ingestDocuments into the same ingestion workflow that the original connector used.
Storing from a StoreDocument Component
There could be situations where you want to capture the ingestDocument after some expensive text-analysis stage has been completed. Then you can reload the ingestDocuments and insert them into a workflow that is downstream of the expensive stage, avoiding a lot of unnecessary recomputation. The StoreDocument component lets us capture ingestDocuments at any point in the ingestion process and copy them to the Document Store. They can be retrieved later using a Document Store Connector.
Reloading with a Document Store Connector
As shown in the diagrams above, ingestDocuments in the Document Store can be reloaded using a Document Store Connector. The connector can be configured to drop the ingestDocuments at the beginning of any ingestion workflow.
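The store-then-reload cycle described in this section can be pictured with a small in-memory model. This is only a sketch of the idea, not AIE code: `ToyDocumentStore`, `original_connector`, and `document_store_connector` are hypothetical names, and documents are plain dictionaries rather than real ingestDocuments.

```python
import copy


class ToyDocumentStore:
    """Hypothetical in-memory stand-in for the Document Store."""

    def __init__(self):
        # collection name -> list of stored document snapshots
        self._collections = {}

    def store(self, collection, doc):
        # Deep-copy so later changes to the live document do not alter the snapshot.
        self._collections.setdefault(collection, []).append(copy.deepcopy(doc))

    def reload(self, collection):
        # Return copies so the stored snapshots stay pristine.
        return [copy.deepcopy(d) for d in self._collections.get(collection, [])]


def original_connector(source_docs, store, collection, workflow):
    """Set a copy of each document aside, then feed it to the workflow."""
    for doc in source_docs:
        store.store(collection, doc)
        workflow(doc)


def document_store_connector(store, collection, workflow):
    """Feed the stored copies to the start of a workflow."""
    for doc in store.reload(collection):
        workflow(doc)


indexed = []


def workflow(doc):
    # Stand-in for an ingest workflow: mark the document and "index" it.
    doc["processed"] = True
    indexed.append(doc)


store = ToyDocumentStore()
source = [{"id": "ce.xml", "table": "country"}]

original_connector(source, store, "countries", workflow)
indexed.clear()  # simulate losing the index
document_store_connector(store, "countries", workflow)
```

After the second run the toy index is repopulated without touching `source` again, and the snapshot in the store still lacks the `processed` flag that the workflow adds.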
Example
The following sections demonstrate how to configure document storage and retrieval, using the standard Attivio Factbook example. In it we'll configure country records to be reloadable.
To save the documents to the Document Store, either configure a standard connector to save them, or create a StoreDocument component and put it in a workflow. It is not necessary to do both.
Configure a Connector
Many connectors can store pristine (unaltered) ingestDocuments in the Document Store. Edit the connector in the AIE Administrator. Navigate to the Advanced Tab of the connector editor. Expand the Document Store group of properties.
There are four properties to configure:
Property | Remarks |
---|---|
Store Documents Before Feeding | If true, the connector will copy ingestDocuments to the Document Store. If false (the default), documents will not be stored. |
Collection Name | Within the Document Store, ingestDocuments are grouped in named collections. The collection name is used to retrieve the correct set of documents later. Generally, each connector would be configured to use its own collection, although multiple connectors can be configured to use the same collection. |
Store ContentPointers | If true (the default), the connector will store the content associated with an ingestDocument's contentPointer along with the ingestDocument. This preserves the text of the original document so it can be used by later ingestion stages. When the ingestDocument is reloaded, the content is reposted to the Content Store. |
Filter Fields | The named fields each have their own indexes in the Document Store, allowing us to retrieve subsets of a collection based on exact-match queries of these fields. |
From this point on, running the connector will silently store copies of the generated ingestDocuments for potential reloading.
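To illustrate why Filter Fields matter, the sketch below models a collection that keeps one exact-match index per named field. It is an assumption-laden toy, not the Document Store's actual storage layout: `ToyCollection` and its methods are hypothetical, and field values are assumed to be simple strings.

```python
from collections import defaultdict


class ToyCollection:
    """Hypothetical collection with one exact-match index per filter field."""

    def __init__(self, filter_fields):
        self.filter_fields = list(filter_fields)
        self.docs = {}  # document ID -> field dictionary
        # field name -> field value -> set of document IDs
        self.indexes = {name: defaultdict(set) for name in self.filter_fields}

    def remove(self, doc_id):
        old = self.docs.pop(doc_id, None)
        if old is None:
            return
        for name in self.filter_fields:
            if name in old:
                self.indexes[name][old[name]].discard(doc_id)

    def add(self, doc_id, fields):
        # Replace any earlier copy so the indexes never point at stale values.
        self.remove(doc_id)
        self.docs[doc_id] = dict(fields)
        for name in self.filter_fields:
            if name in fields:
                self.indexes[name][fields[name]].add(doc_id)

    def find(self, field, value):
        # Only fields named as filter fields can be used for retrieval.
        if field not in self.indexes:
            raise KeyError(f"{field!r} is not a filter field")
        return sorted(self.indexes[field].get(value, set()))


countries = ToyCollection(filter_fields=["table", "filename"])
countries.add("doc-1", {"table": "country", "filename": "ce.xml", "size": "6626"})
countries.add("doc-2", {"table": "country", "filename": "us.xml", "size": "7000"})

subset = countries.find("filename", "ce.xml")
```

In this model a lookup on `size` fails because `size` was not named as a filter field, which mirrors the point that only the named fields support subset retrieval.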
Create a StoreDocument Component
A StoreDocument component is a workflow stage that intercepts passing ingestDocuments and copies them to the Document Store. It then passes the ingestDocuments to the next workflow stage without altering them.
In the AIE Administrator, navigate to the System Management > Palette screen. Open the New dialog and search for the StoreDocument document transformer. Select it to open a component editor.
There are several properties to configure:
Property | Remarks |
---|---|
Name | The first time you edit the component, you will have to give it a name. After that the name is fixed and cannot be changed. |
Type | This field displays the type of component. It is not editable. |
Workflow Referenced In | After you insert this component in a workflow, this field will display the name of the workflow. It is not editable. |
Source URI | This field indicates the location of the dynamic configuration source file on the configuration server(s). It is not editable. |
Collection Name | Within the Document Store, ingestDocuments are grouped in named collections. The collection name is used to retrieve the correct set of documents later. |
Filter Fields | The named fields each have their own indexes in the Document Store, allowing us to retrieve subsets of a collection based on exact-match queries of these fields. |
Store ContentPointers | If true (the default), the component will store the content associated with an ingestDocument's contentPointer along with the ingestDocument. This preserves the text of the original document so it can be used by later ingestion stages. When the ingestDocument is reloaded, the content is reposted to the Content Store. |
Creating the component is not enough. We must also insert the component in a workflow. Navigate to System Management > Workflows > Ingest Workflows. Then edit the desired workflow. In this example, we inserted the StoreDocument component (MyStoreDocumentTransformer) into the countryXML workflow, which is used by the Factbook country connector:
From this point on, ingestDocuments from any source that pass through this component will be silently copied to the Document Store for potential reloading.
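The pass-through behavior of such a stage can be sketched as a function that copies each document and returns it unchanged. This is a conceptual model only: `make_store_document_stage`, `run_workflow`, and the stage functions are hypothetical, and the "store" is a plain dictionary of lists.

```python
import copy


def make_store_document_stage(store, collection):
    """Build a stage that snapshots each document and passes it on unaltered."""

    def stage(doc):
        store.setdefault(collection, []).append(copy.deepcopy(doc))
        return doc  # the same document continues down the workflow

    return stage


def expensive_analysis(doc):
    # Stand-in for a costly text-analysis stage.
    return {**doc, "tokens": doc["text"].split()}


def mark_indexed(doc):
    # Stand-in for the final indexing stage.
    return {**doc, "indexed": True}


def run_workflow(stages, doc):
    for stage in stages:
        doc = stage(doc)
    return doc


store = {}
stages = [
    expensive_analysis,
    make_store_document_stage(store, "postAnalysis"),  # capture after the costly step
    mark_indexed,
]

result = run_workflow(stages, {"id": "ce.xml", "text": "an island country"})
```

Because the capture stage sits after `expensive_analysis`, the stored snapshot already carries the `tokens` field, so a later reload could start downstream of the costly step.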
Create a Document Store Connector
To reload documents from a collection in the Document Store, we must create and run a Document Store Connector.
Navigate to System Management > Connectors and click on the New link. Create a new Document Store Connector.
There are several properties to be configured here:
Property | Remarks |
---|---|
Connector Name | The first time you edit the connector, you will have to give it a name. After that the name is fixed and cannot be changed. |
Collection Name | Within the Document Store, ingestDocuments are grouped in named collections. The collection name is used to retrieve the correct set of documents. |
Filters | Enter a field name and a value. Documents in the collection that exactly match this filter will be reloaded. This is an alternative to using the Query feature. Do not attempt to use both at once. |
Query | This field can contain an Advanced Query Language query. Only those documents in the named collection that match the query will be reloaded. This is an alternative to using the Filter feature. Do not attempt to use both at once. |
Document ID Prefix | Prepend this prefix to the Document ID during processing. |
Ingest Workflow | This is the name of the workflow where the reloaded ingestDocuments will be injected. The documents can only be injected at the beginning of a workflow, not between workflow components. |
Run this connector at any time, just like any other connector. It will reload the specified ingestDocuments into the ingestion pathway.
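The properties in the table above interact in two ways worth spelling out: Filters and Query are alternatives, and the prefix changes the Document ID. The sketch below models both. `ReloadSettings` is a hypothetical class, not part of AIE, and it assumes the prefix is placed in front of the original ID.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ReloadSettings:
    """Hypothetical mirror of the Document Store Connector properties."""

    collection_name: str
    ingest_workflow: str
    filters: Dict[str, str] = field(default_factory=dict)
    query: Optional[str] = None
    document_id_prefix: str = ""

    def validate(self):
        if not self.collection_name:
            raise ValueError("a collection name is required")
        if not self.ingest_workflow:
            raise ValueError("an ingest workflow is required")
        # Filters and Query are alternatives; reject settings that use both.
        if self.filters and self.query:
            raise ValueError("use either filters or a query, not both")

    def reloaded_id(self, doc_id):
        # Assumption: the prefix goes in front of the original Document ID.
        return self.document_id_prefix + doc_id


settings = ReloadSettings(
    collection_name="preWorkflow",
    ingest_workflow="countryXML",
    filters={"table": "country"},
    document_id_prefix="reload-",
)
settings.validate()
new_id = settings.reloaded_id("ce.xml")
```

Checking the settings before a run makes the "do not use both at once" rule explicit instead of leaving it to be discovered at load time.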
How to Test Document Reloading
When experimenting with this feature it is a simple matter to verify that it is working.
As a first check, note that <data-agent>\projects\<project-name>\<environment>\data\data-store\store\db.log contains entries showing the documents being stored in the Document Store.
INSERT INTO DOCUMENT VALUES(605,'preWorkflow','country-C:\attivio50\conf\factbook\content\countries\ce.xml','2015-10-21 09:19:47.320000',NULL,NULL,NULL,NULL,NULL)
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','filename',NULL,NULL,NULL,4,'ce.xml')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','sourcepath',NULL,NULL,NULL,4,'C:\attivio50\conf\factbook\content\countries\ce.xml')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','date',NULL,NULL,NULL,5,'1445032282000')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','size',NULL,NULL,NULL,3,'6626')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','bytes',NULL,NULL,NULL,7,'<contentPointer storeName="aie.docstore.preWorkflow">C:\attivio50\conf\factbook\content\countries\ce.xml-14301e3a-6940-4883-826c-93ffa91c092d-42</contentPointer>')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','asap.application',NULL,NULL,NULL,4,'Factbook')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','asap.source.structured',NULL,NULL,NULL,4,'true')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','table',NULL,NULL,NULL,4,'country')
INSERT INTO DOCFIELD VALUES(605,'preWorkflow','asap.source.type',NULL,NULL,NULL,4,'File Connector')
COMMIT
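For a quick count of what was stored, the log text can be scanned for DOCUMENT inserts. The sketch below is a rough helper, not an AIE tool. It assumes the entries look like the sample above, and that the first three values are a numeric key, the collection name, and the Document ID; that column order is inferred from the sample rather than documented here. Document IDs containing a single quote would not be matched correctly.

```python
import re
from collections import Counter

# Matches the start of a DOCUMENT insert: numeric key, collection, document ID.
# DOCFIELD inserts are deliberately not matched.
DOCUMENT_INSERT = re.compile(
    r"INSERT INTO DOCUMENT VALUES\((\d+),'([^']*)','([^']*)'"
)


def stored_documents(log_text):
    """Return (key, collection, document ID) for each DOCUMENT insert found."""
    return [
        (int(key), collection, doc_id)
        for key, collection, doc_id in DOCUMENT_INSERT.findall(log_text)
    ]


def count_by_collection(log_text):
    """Count DOCUMENT inserts per collection name."""
    return Counter(collection for _, collection, _ in stored_documents(log_text))


sample = (
    r"INSERT INTO DOCUMENT VALUES(605,'preWorkflow',"
    r"'country-C:\attivio50\conf\factbook\content\countries\ce.xml',"
    r"'2015-10-21 09:19:47.320000',NULL,NULL,NULL,NULL,NULL) "
    r"INSERT INTO DOCFIELD VALUES(605,'preWorkflow','filename',NULL,NULL,NULL,4,'ce.xml') "
    "COMMIT"
)

counts = count_by_collection(sample)
```

To apply this to a real log, read the contents of db.log into a string and pass it to `count_by_collection`. The pattern works whether the statements appear on one line or on separate lines.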
To test the retrieval mechanism, follow these steps:
- Starting with an empty index, load a set of documents through an appropriate connector and workflow so that documents are added to the Document Store.
- Navigate to the System Management > Indexes page and note how many documents are in the index.
- Delete the index (use the "Delete All" link on the indexes page).
- Run your Document Store Connector to reload the documents.
- Check the indexes page again to see how many documents are in the index.
The restored index should have the same number of documents as the original index.
Partial Updates
The Document Store can be used in conjunction with the partial update feature.
You can use a custom ingest transformer or a ChangeDocumentMode component to intercept ingestDocuments and change their DocumentMode property to PARTIAL instead of the default ADD.
When a StoreDocument component encounters an ingestDocument with DocumentMode.PARTIAL, it retrieves the previously stored document from the Document Store (if one exists), updates that document with the fields in the PARTIAL update, and then stores the updated document back in the store.
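The merge behavior can be sketched with dictionaries standing in for stored documents. This is a simplified model, not AIE's implementation: the `DocumentMode` enum here is a local stand-in, and `store_document` is hypothetical. Two points are assumptions: a PARTIAL update replaces the values of the fields it names, and a PARTIAL update with no previously stored document is simply stored as-is.

```python
from enum import Enum


class DocumentMode(Enum):
    """Local stand-in for the two modes discussed above."""

    ADD = "ADD"
    PARTIAL = "PARTIAL"


def store_document(store, doc_id, fields, mode=DocumentMode.ADD):
    """Store a document, merging into the existing copy for PARTIAL updates."""
    if mode is DocumentMode.PARTIAL and doc_id in store:
        merged = dict(store[doc_id])  # start from the previously stored document
        merged.update(fields)         # assumption: named fields are overwritten
        store[doc_id] = merged
    else:
        # ADD, or PARTIAL with nothing stored yet (assumption): store as given.
        store[doc_id] = dict(fields)
    return store[doc_id]


store = {}
store_document(store, "ce.xml", {"table": "country", "size": "6626"})
store_document(store, "ce.xml", {"size": "7000"}, DocumentMode.PARTIAL)
```

After the PARTIAL update the stored copy keeps its `table` field and carries the new `size`, so a later reload yields the merged document rather than only the partial fields.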