Overview
The Excel Connector reads XSL and .XSLX files and converts each row into an IngestDocument .
To use the Excel Connector, you must include the excelconnector module when you create the project configuration files using the createproject tool.
View incoming links.
Sample XLSX File
The follow example presumes that we have loaded this CSV file into Excel:
id,title,author,creationdate,location,teaser 1,Oliver Twist,Charles Dickens,1837,London,"Oliver Twist, Fagin, Nancy, Bill Sykes, and the Artful Dodger live by their wits in this dark tale of Victorian England." 2,Journey to the Centre of the Earth,Jules Verne,1864,Centre of the Earth,"Eccentric Uncle Lindenbrock is determined to search for a fabled land in the center of the earth. He takes his nephew, Axel, and their servant, Hans on the ultimate spelunking adventure."
It looks like this when viewed in Excel:
The first row contains column names. The column names are fields defined in <project-dir>\conf\schema\default.xml. Since they are present in the schema, these field names will be passed directly into the AIE index without forcing us to set up any customized field mapping during ingestion. Fields that are not present in the schema are not added to the index.
Save the file as books.xlsx. We'll load that file in the following example.
Configuring an Excel Connector
You can configure an Excel connector by using the Connector UI.
Start AIE using the AIE Agent's Command-Line Interface. This will start AIE and will make the Administration UI available at http://<host>:17000/admin.
In the Administration UI, navigate to System Management > Connectors. Click New in the menu bar. Select the Excel Connector from the list.
On the Scanner tab of the resulting dialog box, enter the Connector Name (ExcelBookConnector) and the Start Directory (c:\documents\books.xslx.) With the example we have set up on this page, you can accept the default behavior of all the other fields.
UNC Paths
File connectors such as the Excel Connector support the Uniform Naming Convention (UNC) path format used to designate Windows network shares. However, UNC paths are not supported for other path specifications in AIE for example the location of AIE logs or indexes. It is also possible to use a mapped network drive to specify a Windows file share as if it were a local drive. Note that scanners running on Linux hosts cannot access file content via UNC paths or local Windows paths - these scanners must run on Windows hosts.
The next step is to visit the Other tab to set the name of the sheet to import. (Leave it blank to import all the sheets in the workbook.)
Then click the Field Mappings tab. Add one static field, declaring that each of these incoming IngestDocuments should be included in the "book" table:
Click Save. The Connector UI writes out the connection configuration to the project's configuration servers.
That's all it takes. You can load your file. The instructions are in "Running the Excel Connector," below.
Excel Connector Properties
The Excel Connector is configured by setting properties on the editor.
Excel Scanner Tab | Remarks | |
---|---|---|
Connector Name | The name of the connector as seen in the UI or in XML. | |
Node Set | The nodeset the connector should run on. Defaults to default-service-nodeset. The Editor can set this value only on new, unsaved connectors. | |
Start Directory | The directory containing the files to scan, or the root directory of the tree to scan.. Avoid using the same start directory in multiple connectors. This can confuse the incremental deletion feature, causing unexpected deletions. | |
Field Names in First Row | Set to true if the first row of the worksheet is a header containing the field names. Otherwise, you must supply fieldNames. | |
Fields | A list of field names, used if Field Names in First Row is false. | |
ID Fields | A list of fields to concatenate to create a unique id value. | |
ID Field Format | Describes how to concatenate the values from one or more ID Fields into a single value, which will be used as the record's unique id. The value is a string that follows the behavior of the format method of the Java String class. | |
Lines to Skip | Number of leading data lines to ignore. | |
Follow Symbolic Links | Whether or not the scanner should follow symbolic links while | |
Maximum Directory Depth | Maximum number of nested directory levels to traverse. "-1" | |
Minimum File Size (MB) | Minimum file size to send (in MB). Smaller files will be dropped. | |
Maximum File Size (MB) | Maximum file size to send in megabytes. | |
Wildcard Include Filter | File-extension wildcards. Matching files will be scanned. | |
Wildcard Exclude Filter | File-extension wildcards. Matching files will not be scanned. | |
Document ID Prefix | Append this prefix to the Document ID during processing. | |
Ingest Workflow | Ingestion workflow to receive the ingested documents. String. | |
Incremental | ||
Incremental Deletes | Optional. Used with the Incremental Mode Activated parameter to control if AIE should delete documents that have been removed from the source files. Default is true. | |
Advanced | ||
Delete After Crawl | Delete the files after they have been scanned. Do not use with the Incremental Mode Activated feature. | |
Move to directory after crawl | Move the scanned files to this directory after they are scanned. Do not use with the Incremental Mode Activated feature. |
The Excel Connector has an Other tab that contains several properties of special interest.
Excel Other Tab | Remarks |
---|---|
Worksheet List | List of sheets within a workbook to process. Leave blank to process all sheets. |
Extra Column Field | Name of a field in which to place any extra column data that might be found. |
Strict Load | If true, all row data must be accounted for. Trailing empty cells at the end of a row will cause the row to be skipped. |
Keytab | Location of keytab file for Kerberos authentication. |
Principal Name | Principal name for Kerberos authorization. |
Incremental Mode Activated | If true, this connector will run in Incremental Mode. |
Name Node Principal | Configuration property for enabling support for Kerberos. |
File System URI | Use this field to access an HDFS file system. The syntax is hdfs://[username@] host:port, for example, hdfs://acevm0681.lab.attivio.com:8020/. Otherwise leave it empty. |
Additional Start Directories | If there is only one root directory to scan, put it in the Start Directory field and optionally specify a Move to Directory After Crawl directory where the files should be placed after the crawl. If there is more than one root directory to scan, put the first one in the Start Directory field (and optionally specify the Move to Directory After Crawl field) and then add the other directories here. Each entry is two strings. The first string is the Start Directory. The second string is the optional Move To Directory After Crawl directory. |
Scan hidden files | If true, scan all readable files including system and hidden files. |
Directory Listing Timeout | Provide configurable directory listing times (in seconds). |
The tables above are for the Excel Connector. The other tabs in the Connector Editor are described on the Connectors page.
Running the Excel Connector
Erasing the Index
While testing a new connector, you will frequently need to empty the index and try again. Methods of deleting the index are described here.
To run the Excel Connector, open the AIE Administration UI, and navigate to the System Management > Connectors page. Right-click on ExcelBookConnector and click Start.
Then navigate to SAIL, which is Query > SAIL. Search for *:*, which retrieves all records in all tables. We can see that the scanner was successful:
To view all of these fields in the search results, open the Search Options dialog box and check the Debug checkbox. Click the Search button to return to the display of results.
Incremental Updating
If one or more input files are renamed or deleted after the first run of an incrementally-enabled connector, the next connector run deletes all of the documents associated with those files from the index. But if the file is left in place with one or more rows removed, the next connector run does not delete documents associated with the missing rows, because it only checks whether the source file is still present.
This connector supports the Incremental Updating features. There is a tutorial example of incremental updating here.
After running the connector to ingest documents with Incremental Mode activated, be careful with any future configuration changes to the connector, as such changes can cause one or more of the following issues:
- Some incremental changes might not be properly identified, and hence, not get ingested into AIE in future runs.
- Some documents can remain in your index that are no longer managed by any connector. These documents can eventually become out of date and contain outdated content security permissions.
If you must make changes to the connector configuration after running it, follow these steps to keep your system fully up to date:
1. Delete any previous documents the connector created in your AIE index.
2. Select your connector from the AIE Administrator's Connectors tab, and Reset the connector.