Page tree

 

Overview

The Excel Connector  reads XSL and .XSLX files and converts each row into an IngestDocument .


To use the Excel Connector, you must include the excelconnector module when you create the project configuration files using the createproject tool.

 

View incoming links.

Sample XLSX File

The follow example presumes that we have loaded this CSV file into Excel: 

books.csv
id,title,author,creationdate,location,teaser
1,Oliver Twist,Charles Dickens,1837,London,"Oliver Twist, Fagin, Nancy, Bill Sykes, and the Artful Dodger live by their wits in this dark tale of Victorian England."
2,Journey to the Centre of the Earth,Jules Verne,1864,Centre of the Earth,"Eccentric Uncle Lindenbrock is determined to search for a fabled land in the center of the earth. He takes his nephew, Axel, and their servant, Hans on the ultimate spelunking adventure."

It looks like this when viewed in Excel:

The first row contains column names. The column names are fields defined in <project-dir>\conf\schema\default.xml. Since they are present in the schema, these field names will be passed directly into the AIE index without forcing us to set up any customized field mapping during ingestion. Fields that are not present in the schema are not added to the index.

Save the file as books.xlsx.  We'll load that file in the following example.

Configuring an Excel Connector

Start AIE using the AIE Agent's Command-Line Interface. This will start AIE and will make the Administration UI available at http://<host>:17000/admin.

In the Administration UI, navigate to System Management > Connectors. Click New in the menu bar. Select the Excel Connector from the list.

On the Scanner tab of the resulting dialog box, enter the Connector Name (ExcelBookConnector) and the Start Directory (c:\documents\books.xslx.) With the example we have set up on this page, you can accept the default behavior of all the other fields.

UNC Paths

File connectors such as the Excel Connector support the Uniform Naming Convention (UNC) path format used to designate Windows network shares. However, UNC paths are not supported for other path specifications in AIE for example the location of AIE logs or indexes. It is also possible to use a mapped network drive to specify a Windows file share as if it were a local drive. Note that scanners running on Linux hosts cannot access file content via UNC paths or local Windows paths - these scanners must run on Windows hosts.

 

The next step is to visit the Other tab to set the name of the sheet to import. (Leave it blank to import all the sheets in the workbook.)

Then click the Field Mappings tab.  Add one static field, declaring that each of these incoming IngestDocuments should be included in the "book" table:

Click Save. The Connector UI writes out the connection configuration to the project's configuration servers.

That's all it takes. You can load your file. The instructions are in "Running the Excel Connector," below.

 

Excel Connector Properties

The Excel Connector is configured by setting properties on the editor.

Excel Scanner Tab

Remarks

Connector Name

The name of the connector as seen in the UI or in XML.

Node Set

The nodeset the connector should run on. Defaults to default-service-nodeset. The Editor can set this value only on new, unsaved connectors.

Start Directory

The directory containing the files to scan, or the root directory of the tree to scan..

Avoid using the same start directory in multiple connectors. This can confuse the incremental deletion feature, causing unexpected deletions.

Field Names in First RowSet to true if the first row of the worksheet is a header containing the field names. Otherwise, you must supply fieldNames.
FieldsA list of field names, used if Field Names in First Row is false.
ID Fields

A list of fields to concatenate to create a unique id value.
Used with idFieldFormat. Default is "id".

ID Field FormatDescribes how to concatenate the values from one or more
ID Fields into a single value, which will be used as the record's
unique id. The value is a string that follows the behavior of the
format method of the Java String class.
Lines to SkipNumber of leading data lines to ignore.
Follow Symbolic Links

Whether or not the scanner should follow symbolic links while
crawling the file system.

Maximum Directory Depth

Maximum number of nested directory levels to traverse. "-1"
means no limit.

Minimum File Size (MB)Minimum file size to send (in MB). Smaller files will be dropped.
Maximum File Size (MB)

Maximum file size to send in megabytes.

Wildcard Include Filter

File-extension wildcards. Matching files will be scanned.

Wildcard Exclude Filter

File-extension wildcards. Matching files will not be scanned.

Document ID Prefix

Append this prefix to the Document ID during processing.

Ingest Workflow

Ingestion workflow to receive the ingested documents. String.

Incremental 

Incremental Deletes

Optional. Used with the Incremental Mode Activated parameter to control if AIE should delete documents that have been removed from the source files. Default is true.

Advanced 
Delete After Crawl

Delete the files after they have been scanned. Do not use with the  Incremental Mode Activated feature.

 
Move to directory after crawl

Move the scanned files to this directory after they are scanned. Do not use with the Incremental Mode Activated feature.

 

The Excel Connector has an Other tab that contains several properties of special interest.

Excel Other TabRemarks
Worksheet ListList of sheets within a workbook to process. Leave blank to process all sheets.
Extra Column FieldName of a field in which to place any extra column data that might be found.
Strict LoadIf true, all row data must be accounted for. Trailing empty cells at the end of a row will cause the row to be skipped.
KeytabLocation of keytab file for Kerberos authentication.
Principal NamePrincipal name for Kerberos authorization.
Incremental Mode ActivatedIf true, this connector will run in Incremental Mode.
Name Node PrincipalConfiguration property for enabling support for Kerberos.
File System URI
 
Use this field to access an HDFS file system. The syntax is hdfs://[username@] host:port, for example, hdfs://acevm0681.lab.attivio.com:8020/. Otherwise leave it empty.
Additional Start DirectoriesIf there is only one root directory to scan, put it in the Start Directory field and optionally specify a Move to Directory After Crawl directory where the files should be placed after the crawl.

If there is more than one root directory to scan, put the first one in the Start Directory field (and optionally specify the Move to Directory After Crawl field) and then add the other directories here. Each entry is two strings. The first string is the Start Directory. The second string is the optional Move To Directory After Crawl directory.
Scan hidden filesIf true, scan all readable files including system and hidden files.
Directory Listing TimeoutProvide configurable directory listing times (in seconds).

 

The tables above are for the Excel Connector. The other tabs in the Connector Editor are described on the Connectors page.

Running the Excel Connector

Erasing the Index

While testing a new connector, you will frequently need to empty the index and try again. Methods of deleting the index are described here.

To run the Excel Connector, open the AIE Administration UI, and navigate to the System Management > Connectors page. Right-click on ExcelBookConnector and click  Start.

Then navigate to SAIL, which is Query > SAIL. Search for *:*, which retrieves all records in all tables. We can see that the scanner was successful:

To view all of these fields in the search results, open the Search Options dialog box and check the Debug checkbox. Click the Search button to return to the display of results. 

Incremental Updating

If one or more input files are renamed or deleted after the first run of an incrementally-enabled connector, the next connector run deletes all of the documents associated with those files from the index. But if the file is left in place with one or more rows removed, the next connector run does not delete documents associated with the missing rows, because it only checks whether the source file is still present.

This connector supports the Incremental Updating features. There is a tutorial example of incremental updating here.

After running the connector to ingest documents with Incremental Mode activated, be careful with any future configuration changes to the connector, as such changes can cause one or more of the following issues:

  • Some incremental changes might not be properly identified, and hence, not get ingested into AIE in future runs.
  • Some documents can remain in your index that are no longer managed by any connector. These documents can eventually become out of date and contain outdated content security permissions.

If you must make changes to the connector configuration after running it, follow these steps to keep your system fully up to date:
1. Delete any previous documents the connector created in your AIE index.
2. Select your connector from the AIE Administrator's Connectors tab, and Reset the connector.

  • No labels