Overview

A CSV file is a text file of comma-separated values such as one might export from Excel.  The Attivio Intelligence Engine (AIE) provides a tool for ingesting CSV content. 

CSV Scanner

A CsvScanner parses a CSV file and sends its contents to AIE as a set of IngestDocuments. The CsvScanner produces one IngestDocument for each line of the CSV file.


Sample CSV File

The following scanner configuration example presumes that we are loading this CSV file from the C:\Documents\CSV directory:

C:\Documents\CSV\books.csv
id,title,author,creationdate,location,teaser
1,Oliver Twist,Charles Dickens,1837,London,"Oliver Twist, Fagin, Nancy, Bill Sykes, and the Artful Dodger live by their wits in this dark tale of Victorian England."
2,Journey to the Centre of the Earth,Jules Verne,1864,Centre of the Earth,"Eccentric Uncle Lindenbrock is determined to search for a fabled land in the center of the earth. He takes his nephew, Axel, and their servant, Hans on the ultimate spelunking adventure."

The first row contains column names. The column names are fields defined in <project-dir>\conf\schema\default.xml. Since they are present in the schema, these field names will be passed directly into the AIE index without forcing us to set up any customized field mapping during ingestion.
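To make the row-to-document idea concrete, here is a minimal sketch of how the sample file breaks down into one set of named fields per data line. It does not use the AIE API: the class CsvRowSketch and its methods are hypothetical names, a plain Map stands in for an IngestDocument, and the parser assumes comma separators, double-quote quoting, and no line breaks inside quoted values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CsvRowSketch {

    // Split one CSV line on commas, treating text inside double quotes as one value.
    static List<String> splitLine(String line) {
        List<String> values = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a quoted value
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote, not part of the value
                }
            } else if (c == ',' && !inQuotes) {
                values.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        values.add(current.toString());
        return values;
    }

    // Build one field map per data line, using the first line as the field names.
    static List<Map<String, String>> toRecords(List<String> lines) {
        List<String> header = splitLine(lines.get(0));
        List<Map<String, String>> records = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            List<String> values = splitLine(line);
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < header.size() && i < values.size(); i++) {
                record.put(header.get(i), values.get(i));
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "id,title,author,creationdate,location,teaser",
            "1,Oliver Twist,Charles Dickens,1837,London,"
                + "\"Oliver Twist, Fagin, Nancy, Bill Sykes, and the Artful Dodger live by their wits"
                + " in this dark tale of Victorian England.\"",
            "2,Journey to the Centre of the Earth,Jules Verne,1864,Centre of the Earth,"
                + "\"Eccentric Uncle Lindenbrock is determined to search for a fabled land in the center"
                + " of the earth. He takes his nephew, Axel, and their servant, Hans on the ultimate"
                + " spelunking adventure.\"");
        for (Map<String, String> record : toRecords(lines)) {
            System.out.println(record.get("id") + ": " + record.get("title")
                + " (" + record.size() + " fields)");
        }
    }
}
```

Each of the two data lines yields a record with six fields. The commas inside the quoted teaser values do not split the column, which is why the teaser text is wrapped in double quotes in the sample file.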

Content Encoding

By default, the CSV connector interprets input files using UTF-8 content encoding. You can override this default encoding by setting the value of the connector's Input File Encoding property to any encoding supported by Java.

All files read by a single CSV connector must have the same encoding; if you have CSV files in multiple encodings, create a separate CSV connector for each encoding.
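If you are unsure whether a given encoding name is one that Java supports, you can check it against the JVM with the standard java.nio.charset.Charset class. The sketch below is a hedged illustration, not connector code: the class EncodingCheckSketch and the method resolve are hypothetical names, and falling back to UTF-8 for an empty name simply mirrors the default described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingCheckSketch {

    // Resolve an encoding name to a Charset; an empty name falls back to UTF-8.
    static Charset resolve(String name) {
        if (name == null || name.trim().isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        // isSupported returns false for well-formed names this JVM does not know.
        if (!Charset.isSupported(name)) {
            throw new IllegalArgumentException("Encoding not supported by this JVM: " + name);
        }
        return Charset.forName(name);
    }

    public static void main(String[] args) {
        System.out.println(resolve(""));           // default case
        System.out.println(resolve("ISO-8859-1")); // explicit override
    }
}
```

The set of supported encodings depends on the JVM, so run a check like this with the same Java runtime that runs AIE.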

Configuring a CSV Connector

Start AIE using the AIE Agent's Command-Line Interface. This will start AIE and will make the Administration UI available at http://<host>:17000/admin.

In the Administration UI, navigate to System Management > Connectors. Click New in the menu bar. Select the CSV Connector from the list.

On the Scanner tab of the resulting dialog box, enter the Connector Name (csvBookConnector) and the Start Directory (C:\Documents\CSV). With the example we have set up on this page, you can accept the default behavior of all the other fields.

UNC Paths

File connectors such as the CSV Connector support the Uniform Naming Convention (UNC) path format used to designate Windows network shares. However, UNC paths are not supported for other path specifications in AIE, for example the location of AIE logs or indexes. It is also possible to use a mapped network drive to specify a Windows file share as if it were a local drive. Note that scanners running on Linux hosts cannot access file content via UNC paths or local Windows paths; scanners that use such paths must run on Windows hosts.

Then click the Field Mappings tab. In our example, we have bypassed most of the field-mapping issue by using AIE schema field names in the first line of the CSV file. However, we would like to add one static field, declaring that each of these incoming IngestDocuments should be included in the "book" table.

Click Save. The Connector UI writes out the connection configuration to the project's configuration servers.

That's all it takes. You can load your file. The instructions are in "Running the CsvConnector," below.

CSV Connector Properties

The CsvScanner is configured by setting properties on the editor.

CSV Scanner Editor

Remarks

Connector Name

The name of the connector as seen in the UI or in XML.

Node Set

The nodeset the connector should run on. Defaults to default-service-nodeset. The Editor can set this value only on new, unsaved connectors.

File System URI

Use this field to access an HDFS file system. The syntax is hdfs://[username@]host:port, for example, hdfs://acevm0681.lab.attivio.com:8020/. Otherwise, leave it empty.

Start Directory

The directory containing the files to scan, or the root directory of the tree to scan. [REQUIRED]

Avoid using the same start directory in multiple CSV scanners. This can confuse the incremental deletion feature, causing unexpected deletions.

Separator Character

Separator character between columns; handles TAB as an option. Default is comma.

Quote Character

Quote character around a single text column; handles 'quote' and 'doublequote' as options. Default is 'doublequote'.

Field Names in First Row

Set to true if the first row of the CSV file is a header containing the field names. Otherwise, you must supply fieldNames.

CSV Fields

A list of field names, used if firstRowAreFieldNames is false.

Line Number Field Name

Name of the field to put the line number in.

ID Fields

A list of CSV fields to concatenate to create a unique id value. Used with idFieldFormat. Default is "id".

ID Field Format

Describes how to concatenate the values from one or more idFields into a single value, which will be used as the record's unique id. The value is a string that follows the behavior of the format method of the Java String class.

Lines to Skip

Number of leading data lines to ignore.

Input File Encoding

Character set or encoding to use when reading the CSV file.

Follow Symbolic Links

Whether or not the scanner should follow symbolic links while crawling the file system.

Maximum Directory Depth

Maximum number of nested directory levels to traverse. "-1" means no limit.

Minimum File Size (MB)

Minimum file size to send (in MB). Smaller files will be dropped.

Maximum File Size (MB)

Maximum file size to send in megabytes.

Wildcard Include Filter

File-extension wildcards. Matching files will be scanned.

Wildcard Exclude Filter

File-extension wildcards. Matching files will not be scanned.

Directory Listing Timeout

Time limit for listing the contents of a directory, in seconds.

Document ID Prefix

Add this prefix to the Document ID during processing.

Ingest Workflow

Ingestion workflow to receive the ingested documents. String.

Incremental 

Incremental Mode Activated

Enables incremental updates.  Boolean.

Incremental Deletes

Optional. Used with the 'incremental-activated' parameter to control whether AIE deletes documents that have been removed from the source files. Default is true.

Advanced

Delete After Crawl

Boolean. Delete the files after they have been scanned. Do not use with the incrementalModeActivated feature.

Move to directory after crawl

Move the scanned files to this directory after they are scanned. Do not use with the incrementalModeActivated feature.

Additional Start Directories

If there is only one root directory to scan, put it in the Start Directory field and optionally specify a Move to Directory After Crawl directory where the files should be placed after the crawl.

If there is more than one root directory to scan, put the first one in the Start Directory field (and optionally specify the Move to Directory After Crawl field) and then add the other directories here.

Each entry is two strings. The first string is the Start Directory. The second string is the optional Move To Directory After Crawl directory.

Max Rows

Number of rows to read from the file.

Scan hidden files

If true, scan all readable files including system and hidden files.

Kerberos

Keytab

Location of keytab file for Kerberos authentication.

Principal Name

Principal name for Kerberos authorization.

The table above is for the CSV Connector Scanner Tab.  The other tabs in the Connector Editor are described on the Connectors page.
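The ID Fields and ID Field Format properties in the table state that the id is built following the behavior of the format method of the Java String class. The sketch below shows what that looks like with String.format itself. It is an illustration rather than AIE code: the class IdFormatSketch and the method buildId are hypothetical, and it assumes the id field values are passed to the format string as strings, in the order the fields are listed.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IdFormatSketch {

    // Apply a String.format pattern to the values of the chosen id fields, in list order.
    static String buildId(String format, Map<String, String> row, List<String> idFields) {
        Object[] values = new Object[idFields.size()];
        for (int i = 0; i < idFields.size(); i++) {
            values[i] = row.get(idFields.get(i));
        }
        return String.format(format, values);
    }

    public static void main(String[] args) {
        // One row from the sample books.csv file, reduced to the fields used here.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "1");
        row.put("title", "Oliver Twist");
        row.put("author", "Charles Dickens");

        // Assumed settings: ID Fields = author, id; ID Field Format = %s-%s
        System.out.println(buildId("%s-%s", row, Arrays.asList("author", "id")));
    }
}
```

With these assumed settings the sketch prints Charles Dickens-1. Combining more than one field this way is useful when no single CSV column is unique on its own.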

Running the CsvConnector

Erasing the Index

While testing a new connector, you will frequently need to empty the index and try again. Four methods of deleting the index are described here.

To run the CsvConnector, open the AIE Administration UI, and navigate to the System Management > Connectors page. Right-click on csvBookConnector and click on Start.

Then navigate to SAIL (Modules > SAIL). Search for *:*, which retrieves all records in all tables. The results show that the scanner was successful.

To view all of these fields in the search results, open the SAIL Properties dialog box (click on the gear icon) and add the field names to the Other Results Fields list on the Field Expressions tab.

Incremental Updating

If one or more CSV input files are renamed or deleted after the first run of an incrementally-enabled CSV connector, the next connector run deletes all of the documents associated with those files from the index. But if the file is left in place with one or more rows removed, the next connector run does not delete documents associated with the missing rows, because it only checks whether the source file is still present.
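The file-level behavior described above can be pictured with a small sketch. It is not the connector's implementation: the class IncrementalDeleteSketch, the method idsToDelete, and the file authors.csv are hypothetical, and it assumes the connector remembers which document ids each source file produced on the previous run.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class IncrementalDeleteSketch {

    // Return the ids of documents whose source file is no longer present.
    // Only file presence is checked; the rows inside a surviving file are not compared.
    static Set<String> idsToDelete(Map<String, List<String>> previousRun, Set<String> currentFiles) {
        Set<String> toDelete = new TreeSet<>();
        for (Map.Entry<String, List<String>> entry : previousRun.entrySet()) {
            if (!currentFiles.contains(entry.getKey())) {
                toDelete.addAll(entry.getValue());
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        // Document ids produced by each file on the first run (authors.csv is hypothetical).
        Map<String, List<String>> previousRun = new LinkedHashMap<>();
        previousRun.put("books.csv", Arrays.asList("1", "2"));
        previousRun.put("authors.csv", Arrays.asList("a1", "a2"));

        // Second run: authors.csv was deleted; books.csv remains but row 2 was removed from it.
        Set<String> currentFiles = Collections.singleton("books.csv");

        System.out.println(idsToDelete(previousRun, currentFiles));
    }
}
```

In this scenario only the documents from authors.csv are selected for deletion. Document 2 stays in the index even though its row is gone from books.csv, which matches the limitation described above.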

This connector supports the incremental updating features described in Activating Incremental Updating. There is a tutorial example of incremental updating here.

After running the connector to ingest documents with Incremental Mode activated, be careful with any future configuration changes to the connector, as such changes can cause one or more of the following issues:

  • Some incremental changes might not be properly identified, and hence, not get ingested into AIE in future runs.
  • Some documents can remain in your index that are no longer managed by any connector. These documents can eventually become out of date and contain outdated content security permissions.

If you must change the connector configuration after running it, follow these steps to keep your system fully up to date:
1. Delete any previous documents the connector created in your AIE index.
2. Select your connector from the AIE Administrator's Connectors tab, and Reset the connector.

 
