Overview
The Incremental Update feature lets AIE track loaded documents, and re-ingest only those documents that have been added or changed since the previous ingest.
Incremental updating is available for the AIE connector types shown here.
This page shows a detailed example of incremental updating based on the JDBC Database Scanner. It is a continuation of the tutorial example for this connector. Although the example is specific to the database connector, the process of setting up incremental updates is very similar for all relevant connectors.
View incoming links.
Incremental Update using the JDBC Database Scanner
The JDBC Database Scanner is one of the scanners that supports incremental updating. This support lets us update the database with only those records that changed since the previous visit.
The database scanner supports two different incremental indexing methods. The first method is based on database tracking of changed or deleted records, while the second requires no specific information in the database. Database managed of incremental indexing is much more efficient, but does place additional requirements on the database itself.
Database Managed Incremental Indexing
To support database managed incremental indexing, a database must meet all of the following criteria:
- All tables being queried/indexed must have a 'last modified time stamp' column that is populated for each record. If multiple tables are joined together in the SQL you are executing, then all tables being joined must have a last modified column populated in the database. The resulting SQL being executed by the DB scanner must account for a change in any of these tables as a part of the where clause.
- All tables being queried must have a change tracking table in place that lets AIE administrators query for recently deleted records.
If a database meets the above criteria, you can use the DB scanner's following properties to support incremental indexing:
scanner property | description for incremental indexing support |
---|---|
sql | the variables listed below (%%LAST_SCAN_TIMESTAMP_MS%%, %%TIMESTAMP_MS%% , etc.) will be substituted at run-time to reflect the correct values persisted across system restarts. |
deletedSql | the variables listed below will be substituted at run-time to reflect the correct values. In addition, customers may choose to delete/clear this table using the postSql property. |
incrementalModeActivated | Set to "true" to enable incremental updates. |
Timestamp Variables for DB Scanner
The four timestamp variables usable in a database scanner's sql
, preSql
, postSql
, and deletedSql
property values include:
%%TIMESTAMP%%
Start time of the connector's current crawl, expressed in seconds since the epoch.
%%TIMESTAMP_MS%%
Start time of the connector's current crawl, expressed in milliseconds since the epoch.
%%LAST_SCAN_TIMESTAMP%%
Start time of the connector's previous crawl, expressed in seconds since the epoch. Requires incrementalModeActivated="true".
%%LAST_SCAN_TIMESTAMP_MS%%
Start time of the connector's previous crawl, expressed in milliseconds since the epoch. Requires incrementalModeActivated="true".
AIE Managed Incremental Indexing
AIE managed incremental indexing requires AIE to keep a signature of every document fed to AIE. As each document is retrieved from the database it's signature is checked to see if the document is new, or changed, and should be indexed. If the signature is the same, the document is not fed to AIE. After all documents are fed, the documents not seen in the current run are presumed deleted from the database, and a request is sent to AIE to delete each document. This part is optional.
If AIE managed incremental indexing is desired, use the following properties to support incremental indexing:
scanner property | description for incremental indexing support |
---|---|
sql | Select all records in the target table/view. |
deletedSql | Leave empty. |
incrementalModeActivated | Set to "true" to incremental updates. |
incrementalDeletes | Set to "true" to send deletes in for all new documents not seen in any given run. |
rsFetcherClass | You must set this to com.attivio.connector.IncrementalResultSetFetcher. |
incrementalModeActivated
To enable incremental updates, you must set the incrementalModeActivated property for the scanner to "true". Once enabled, the DatabaseScanner pre-processes all SQL statements replacing %%LAST_SCAN_TIME_MS%% and %%LAST_SCAN_TIME%% with the time the scanner last ran. As a result, the following SQL statement fetches all rows between the last scan and now.
SELECT * FROM docs WHERE NOT(deleted) AND ts > %%LAST_SCAN_TIMESTAMP_MS%% AND ts <= %%TIMESTAMP_MS%%
incrementalDeletes
If incrementalDeletes property is enabled, the DatabaseScanner deletes AIE index records for documents that no longer exist within the database.
deletedSql
The deletedSql property contains a SQL statement used to track records that no longer exist. For each row returned, the DatabaseScanner issues a delete for the idField column.
postSql
The postSql property contains a SQL statement that executes after the scanner is complete to cleanup any database tables.
com.attivio.connector.IncrementalResultSetFetcher
You can use the IncrementalResultSetFetcher to track changes within each row. It preforms a check sum on each row returned and only re-ingests changed documents.
Examples
These examples use the following table where id is the document id, ts is the last modified timestamp, del is a deleted flag.
CREATE TABLE docs ( id varchar, ts LONG , del BOOLEAN )
Full Table Scanning
The following example shows how to use the com.attivio.connector.IncrementalResultSetFetcher to perform a full table scan.
<connector name="dbConnector"> <scanner class="com.attivio.connector.DBConnector"> <properties> <property name="sql" value="SELECT * FROM docs" /> <property name="rsFetcherClass" value="com.attivio.connector.IncrementalResultSetFetcher" /> <property name="incrementalModeActivated" value="true" /> <property name="incrementalDeletes" value="true" /> <property name="idField" value="id" /> <property name="jdbcDriver" value="org.h2.Driver" /> <property name="connectionString" value="jdbc:h2:mem:aname" /> <property name="username" value="sa" /> <property name="password" value="" /> </properties> </scanner> </connector>
Changes/Delete Table
In the following example assumes that docs table only contains the changed documents. The sql statement fetches only updated documents, the deletedSql issue deletes for deleted documents, and postSql drops all ingested changes.
<connector name="dbConnector"> <scanner class="com.attivio.connector.DBConnector"> <properties> <property name="sql" value="SELECT * FROM docs WHERE NOT( del ) AND ts < %%TIMESTAMP_MS%% "/> <property name="deletedSql" value="SELECT * FROM docs WHERE del AND ts < %%TIMESTAMP_MS%% " /> <property name="postSql" value="DELETE FROM docs WHERE ts < %%TIMESTAMP_MS%%"/> <property name="idField" value="id" /> <property name="jdbcDriver" value="org.h2.Driver" /> <property name="connectionString" value="jdbc:h2:mem:aname" /> <property name="username" value="sa" /> <property name="password" value="" /> </properties> </scanner> </connector>
Timestamp Tables
The following example shows how to use the ts column to fetch only changed rows. Unlike, the previous example, it does not preform any updates.
<connector name="dbConnector"> <scanner class="com.attivio.connector.DBConnector"> <properties> <property name="sql" value="SELECT * FROM docs WHERE NOT(del) AND ts > %%LAST_SCAN_TIMESTAMP_MS%% AND ts <= %%TIMESTAMP_MS%%" /> <property name="deletedSql" value="SELECT id FROM docs WHERE del AND ts > %%LAST_SCAN_TIMESTAMP_MS%% AND ts <= %%TIMESTAMP_MS%%" /> <property name="rsFetcherClass" value="com.attivio.connector.IncrementalResultSetFetcher" /> <property name="incrementalModeActivated" value="true" /> <property name="incrementalDeletes" value="true" /> <property name="idField" value="id" /> <property name="jdbcDriver" value="org.h2.Driver" /> <property name="connectionString" value="jdbc:h2:mem:aname" /> <property name="username" value="sa" /> <property name="password" value="" /> </properties> </scanner> </connector>