The Attivio Intelligence Engine (AIE) can be configured to ingest content from a database through a JDBC interface. AIE provides a default scanner for loading content from a database: the DatabaseScanner.
These features require that the dbconnector module be included when you run createproject to create the project directories.
Installing a JDBC Driver
Install the appropriate JDBC driver, provided by the database vendor, for the source database. Drivers can usually be obtained from the vendor's customer support or download site. Sun Microsystems also provides an online JDBC driver reference: JDBC™ Data Access API.
Add the JDBC driver to an AIE installation by copying its JAR file to the <install_dir>\lib\addons\module-dbconnector folder of the AIE installation directory (e.g. copy mysql-connector-java-5.1.x-bin.jar to C:\Attivio\lib\addons\module-dbconnector\).
By default, the DB Connector module installs and loads the following default JDBC drivers. Replace the default version in <install_dir>\lib\addons\module-dbconnector if a different version is needed.
- PostgreSQL - http://jdbc.postgresql.org/download.html
For Oracle databases, the appropriate JDBC driver can be downloaded from the following URL:
For an open-source (LGPL) JDBC driver for Microsoft SQL Server and Sybase, one option is jTDS, which can be downloaded from the following URL:
Microsoft JDBC Driver for SQL Server
If your JDBC Database connector uses the Microsoft JDBC Driver for SQL Server instead of the third-party JTDS driver linked above, you must set the connector's Use Embedded Driver property to "false" to avoid connector errors when attempting to load this driver. See the Database Scanner Properties section below for more details on this property.
JDBC Memory / Cursor Considerations
Some database drivers attempt to cache the full JDBC result set on the client side before allowing clients to iterate through the results. This works for small result sets; with large result sets, however, it can cause AIE to run out of memory. Some drivers provide a mechanism to enable result set paging or streaming so that only a few records are held in memory at a time. Use these JDBC driver mechanisms to keep memory use within acceptable limits.
|Database||Cursor (result set paging)|
|Microsoft SQL Server||Add selectMethod=cursor as a JDBC connection string argument.|
|MySQL||Add useCursorFetch=true as a JDBC connection string argument. Set defaultFetchSize to a reasonably small number (1000 should work for most scanner configurations).|
To enable row-by-row streaming of results instead, which further reduces memory consumption on the client side, set fetchSize to Integer.MIN_VALUE (-2147483648), as explained in the MySQL documentation.
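As an illustration, a MySQL connection string with cursor fetching enabled might look like the following (host, port, and database name are placeholders to adapt to your environment):

```
jdbc:mysql://localhost:3306/mondial?useCursorFetch=true&defaultFetchSize=1000
```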
Document ID Mapping
The document ID field value is very important when ingesting from a structured data source. The document ID should reflect the source system, database and record ID so as to avoid record collisions. For example, if a customer record and order record have overlapping numeric ID ranges, whichever document is ingested last will overwrite the previously ingested records. Creating a compound document ID resolves this potential problem.
For the example above, the document ID could be explicitly defined in the SQL statement:
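The original statement is not reproduced here; a minimal sketch (MySQL syntax; the customer table and its columns are hypothetical) might look like:

```sql
-- Build a compound document ID so that customer rows cannot collide
-- with order rows that share the same numeric ID range.
SELECT CONCAT('DB.customer.', customer_id) AS id,
       name,
       city
FROM customer;
```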
The DatabaseScanner configuration would then specify the ID field to be "id". In the Connector Editor interface we would simply type "id" into the ID Column field on the Scanner Tab:
When ingesting from structured data sources, it is recommended that a "table" field be appended to all documents to simplify segmenting and joining of content within AIE at query time. A default table field is included in the AIE Schema, so you don't have to define your own.
The easy way to tag each incoming document with a table name is to open the Connector Editor for the connector, turn to the Field Mappings tab, and make an entry under Static Field Values.
We can also create a static "table" value using the example SELECT statement from above. It would be modified as follows to populate the table field:
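As a sketch (MySQL syntax; the customer table and its columns are hypothetical), a static table value can be added directly in the SELECT list. Note that table is a reserved word in many databases, so it is quoted here with backticks:

```sql
-- The literal 'DB.customer' populates the table field on every document.
SELECT CONCAT('DB.customer.', customer_id) AS id,
       'DB.customer' AS `table`,
       name
FROM customer;
```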
The table field can now be used in queries to create a virtual table structure at run-time. For example, a query for "table:DB.foo" is equivalent to "SELECT * FROM DB.foo".
During ingestion of database content, AIE lowercases all field names by default. Keep this in mind and be sure to use lowercase versions of your database field names when configuring the AIE schema.
Data Mapping Guidelines
Each relational database maintains its own set of data types that it can use for field values. Most objects are mapped from the java.sql.ResultSet directly by calling rs.getObject() with the appropriate column identifier. For further information on the underlying JDBC type mapping please refer to the JDBC documentation.
AIE handles certain types with specific logic, as follows:
- Binary large object (BLOB) data: Bytes are streamed from the RDB to temporary disk files.
- Character large object (CLOB) data: Bytes are streamed from the RDB in the UTF-8 encoding; if source content is not in UTF-8, it should be converted as part of the SQL statement.
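As a sketch of converting non-UTF-8 character data within the SQL statement (MySQL syntax; the table and column names are hypothetical):

```sql
-- Convert a latin1 column to UTF-8 before AIE reads it.
SELECT id,
       CONVERT(description USING utf8mb4) AS description
FROM articles;
```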
Configure a Standard Database Connector
The easy way to create a new connector is to use the AIE Administrator's Connector Editor.
Start AIE using the AIE Agent and its Command-Line Interface (CLI). This will start AIE and will make the Administration UI available, usually at http://<host>:17000/admin.
In the Administration UI, navigate to System Management > Connectors. Click New in the menu bar. Select the Database Connector from the list.
On the Scanner tab of the resulting dialog box, enter the Connector Name (DBConnector). You'll also need the SQL query and the database connection parameters. Note that the query is expressed in the native SQL of the database.
The ID Column field is for the name of the field that contains the document ID value. In our example, this would be the code field of the SQL query results.
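A query for this walkthrough, assuming the mondial sample database's country table (column names are illustrative), might be:

```sql
-- The code column serves as the document ID.
SELECT code, name, capital, population
FROM country;
```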
The other fields are explained in the Database Scanner Properties section below.
It is a good practice to label incoming database records with a table field to identify the source of the record. We'll put these records into the country table. This lets us issue queries against specific virtual tables in the AIE index.
Click the Field Mappings tab of the connector editor and enter a static "table" field with the value "country". This value will be applied to every record ingested by this connector.
Click Save. The Connector UI writes out the connection configuration to the project's configuration servers.
The database connector is ready to run. The instructions are in Running the Standard Database Connector, below.
Running the Standard Database Connector
Erasing the Index
While testing a new connector, you will frequently need to empty the index and try again. Four methods of deleting the index are described here.
To run the Database Connector, open the AIE Administration UI, and navigate to the System Management > Connectors page. Right-click on DBConnector and click on Start.
Then navigate to SAIL (Query > SAIL) and search for *:*, which retrieves all records in all tables, to confirm that the scanner ran successfully.
For an example of configuring a Database Scanner connector in the Connector Editor UI, see the example under Loading Database Data.
Database Scanner Properties
The Database Scanner can be configured by setting property values in fields of the Connector Editor.
|Name||The name of the connector as seen in the UI or in XML.|
|Node Set||The nodeset the connector should run on. Defaults to default-service-nodeset. The Editor can set this value only on new, unsaved connectors.|
|JDBC Driver Class||The database JDBC driver class, as in "com.mysql.jdbc.Driver". Required.|
|JDBC Connection String||The database connection string, such as "jdbc:mysql://localhost:3306/mondial". Required.|
|User Name||The database user's username. Required.|
|Password||The database user's password. Required.|
|SQL||The SQL query that pulls the desired content from the database. Required.|
|ID column||Name of the database field to be used as the document ID. Required.|
|Document ID Prefix||This prefix is prepended to the document ID during processing.|
|Ingest Workflow||Ingestion workflow to receive the ingested documents. String.|
|JDBC fetch size hint||A hint to the JDBC driver for the number of rows to fetch per round trip when retrieving result sets.|
|Pre SQL||The SQL that will be run before the execution of the main SQL statement.|
|Post SQL||The SQL that will be run after the execution of the main SQL statement.|
|Delete SQL||SQL that returns a list of document IDs that should be removed from AIE.|
|Fetcher Class||The class that filters unchanged rows from a SQL statement, as in "com.attivio.connector.IncrementalResultSetFetcher".|
|Auto Commit||Whether the connector should issue a commit message after ingesting the records.|
|Read Only||Expert use only. Puts this connection in read-only mode as a hint to the driver to enable database optimizations.|
|SQL Query Timeout||The number of seconds the connector waits for a query to complete before timing out. -1 means no timeout.|
|Store Clob in Store||"false" (default) means that character large object (CLOB) data is written to the Attivio ingest document; "true" means that CLOB data is streamed to the Attivio content store and that a content pointer for this data is written to the Attivio ingest document, reducing its size. Set to "true" if your database includes large CLOB fields.|
|Use Embedded Driver||"true" (default) means the JDBC driver runs in the Attivio connector node's Java Virtual Machine (JVM); "false" means it runs in a separate JVM. Set to "false" when using Microsoft's JDBC driver for SQL Server.|
|Incremental Mode Activated||Enables incremental updates. Boolean.|
|Incremental Deletes||Optional. Used with the 'incremental-activated' parameter to control whether AIE should delete documents that have been removed from the source. Default is true.|
The table above is for the JDBC Database Scanner Tab. The other tabs in the Connector Editor are described on the Connectors page.
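As a sketch of a Delete SQL statement, assuming a hypothetical country_deletes table that tracks removed rows, the query returns one document ID per deleted row:

```sql
-- Each returned id must match a document ID previously ingested
-- by this connector.
SELECT code AS id
FROM country_deletes;
```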
After running the connector to ingest documents with Incremental Mode activated, be careful with any future configuration changes to the connector, as such changes can cause one or more of the following issues:
- Some incremental changes might not be properly identified, and hence, not get ingested into AIE in future runs.
- Some documents can remain in your index that are no longer managed by any connector. These documents can eventually become out of date and contain outdated content security permissions.
If you must change the connector configuration after running it, follow these steps to keep your system fully up to date:
1. Delete any previous documents the connector created in your AIE index.
2. Select your connector from the AIE Administrator's Connectors tab, and Reset the connector.
Hive Connector Considerations
Hive Query Development
Hive connectors are configured in the same way as other JDBC-based connectors. The Hive query language (QL) is very close to standard SQL. However, the Hive engine differs from commonly used database engines such as Oracle and SQL Server: Hive queries execute on multiple Hadoop nodes, can consume large amounts of CPU and memory, and are subject to YARN's limits and resource-allocation policies.
We recommend that you develop and test the SQL query used by the connector in a standalone fashion, using Hadoop tools. Configure the connector only after the query is tested and fine-tuned. Use the following documentation to fine-tune Hive queries:
- For Cloudera use the Impala Documentation.
- For Hortonworks use Hortonworks Hive Documentation.
- You can also use generic Hadoop/Hive documentation such as Hive Performance Tuning.
Help from the Hadoop administrator might be required to allocate the appropriate resources for the Hive query.
Monitoring the Hive Connector
The Connectors screen of the Attivio Administrator and the Attivio logs can be used to monitor the progress of the Hive connector. In particular, monitor these stages:
- Connector start.
- Connector connection to Hive.
- Query execution.
- The rate of document ingestion.
Troubleshooting Hive Connectors
If there is a query execution failure or document ingestion is slow, use the Hadoop administration tools (Ambari, Cloudera Manager) to monitor the underlying YARN service, Hive processes, and logs. The most common reasons for these issues are:
- YARN has insufficient resources to start the MapReduce job.
- YARN is under-allocating resources to the MapReduce job.
If you see that the YARN memory or CPU usage is high, then it is expected that the connector will not start or will be slow, as there are inadequate resources within the cluster to support running the MapReduce job. Please work with the cluster administrator to increase the memory/CPU or stop running jobs.
Recommended Actions When Ingestion Is Sluggish
- Wait for the YARN job to start: if the YARN job has yet to begin, it is best to let the job complete rather than stopping the connector.
- Make sure the cluster is "green". The services in the cluster should be running without error.
- Check the MapReduce priorities set by YARN using the command-line call: hadoop job -list.
- Don't pause the connector: pausing does not pause the underlying Hive MapReduce jobs and does not release Attivio or Hadoop resources.
- Don't start additional connectors: if one connector is performing poorly, starting others won't solve the problem. If the underlying issue is that too many MapReduce jobs are executing, additional connectors only create more jobs, queuing each job further without increasing the overall rate of execution.