AIE can ingest a variety of text-like files from a file system using the File Connector. This page demonstrates how to use the File Connector to crawl a target document container, scan its documents, and extract text to be loaded into the AIE index.
Note that the File Connector sends output to the fileIngest workflow, which invokes Advanced Text Extraction features. These features extract text from over 500 file types, and can automatically break up composite files (such as zip files) into individual documents.
Before You Begin
Ensure that the environment is prepared as follows:
- Create a new project that includes the advancedtextextraction and SAIL modules, or just include the demo group of modules, which includes both.
- The advancedtextextraction module provides the fileIngest workflow used in the examples.
- SAIL lets us view the ingested files as search results.
Start AIE using the AIE Agent and its Command-Line Interface (CLI).
Setting Up the Target Files
To experiment with the File Connector, create a new directory and put a few document files in it. In these examples, using a Windows environment, we created the directory c:\documents.
The top-level directory is not a requirement; it just makes the example code easier to read. We copied a few .docx and .html files into the directory.
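If you do not have sample documents at hand, a short script can generate some. The following is a minimal sketch, not part of AIE: the function name create_sample_documents and the file names are hypothetical, it writes only .html files (creating .docx files would need a third-party library), and it uses a temporary directory as a stand-in for c:\documents.

```python
import tempfile
from pathlib import Path


def create_sample_documents(root: Path) -> list[Path]:
    """Create three small HTML files under root for a test crawl."""
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for i in range(1, 4):
        path = root / f"sample{i}.html"
        path.write_text(
            f"<html><head><title>Sample {i}</title></head>"
            f"<body><p>Test document number {i}.</p></body></html>",
            encoding="utf-8",
        )
        created.append(path)
    return created


if __name__ == "__main__":
    # A temporary directory stands in for the example start directory.
    target = Path(tempfile.mkdtemp()) / "documents"
    for p in create_sample_documents(target):
        print(p)
```

To populate the directory used on this page instead, pass that directory to create_sample_documents.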
Configuring a File Connector
The Connector Editor is part of the AIE Administrator.
Once you have started the AIE node, the AIE Administrator will be available at http://<host>:17000/admin.
In the AIE Administrator, navigate to System Management > Connectors. Click New in the menu bar. Select the File Connector from the list.
On the Scanner tab of the resulting editor, enter the Connector Name (textFileConnector) and the Start Directory (c:\documents). With the example we have set up on this page, you can accept all other default values.
File connectors support the Uniform Naming Convention (UNC) path format used to designate Windows network shares. However, UNC paths are not supported for other path specifications in AIE, such as the location of AIE logs or indexes. It is also possible to use a mapped network drive to specify a Windows file share as if it were a local drive. Note that scanners running on Linux hosts cannot access file content via UNC paths or local Windows paths; these scanners must run on Windows hosts.
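When preparing connector settings in bulk, it can help to check which candidate start directories are UNC paths and which are drive-letter paths. The sketch below uses only the Python standard library and does not call any AIE API; the helper name is_unc_path and the server and share names in the usage note are hypothetical.

```python
from pathlib import PureWindowsPath


def is_unc_path(path: str) -> bool:
    r"""Return True if path is a UNC path such as \\server\share\docs."""
    # For a UNC path, PureWindowsPath reports \\server\share as the drive;
    # for a local or mapped drive it reports the drive letter, e.g. "c:".
    drive = PureWindowsPath(path).drive
    return drive.startswith("\\\\")
```

For example, is_unc_path(r"\\fileserver\share\documents") is true, while is_unc_path(r"c:\documents") and a mapped drive such as is_unc_path(r"Z:\documents") are both false. PureWindowsPath parses Windows paths on any operating system, so the check itself can run on a Linux host even though the scanner cannot.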
The remaining fields in the editor are listed in the section on File Connector Properties, below on this page.
Click Save. The Connector UI sends the connector's XML configuration to the Configuration Server. To view this configuration, use the CLI update command, and then look for the new connector file in the <project-dir>\conf\connectors directory. The file has the same name that you gave to the connector.
File Connector Properties
The File Connector is configured by setting properties in the editor.
File Connector Scanner Tab
|Connector Name||The name of the connector as seen in the UI or in XML.|
The nodeset the connector should run on. Defaults to default-service-nodeset. The Editor can set this value only on new, unsaved connectors.
|File System URI||Use this field to access an HDFS file system. The syntax is hdfs://[username@]host:port, for example, hdfs://acevm0681.lab.attivio.com:8020/. Otherwise leave it empty.|
|Start Directory||The directory containing the files to scan, or the root directory of the tree to scan. Required. Avoid using the same start directory in multiple scanners. This can confuse the incremental deletion feature, causing unexpected deletions.|
|Follow Symbolic Links||Whether or not the scanner should follow symbolic links while crawling the file system.|
|Maximum Directory Depth||Maximum number of nested directory levels to traverse. "-1" means no limit.|
|Minimum File Size (MB)||Minimum file size to send (in MB). Smaller files will be dropped.|
|Maximum File Size (MB)||Maximum file size to send (in MB).|
|Wildcard Include Filter||File-extension wildcards. Matching files will be scanned.|
|Wildcard Exclude Filter||File-extension wildcards. Matching files will not be scanned.|
|Directory Listing Timeout||Timeout for directory listing operations (in seconds).|
|Document ID Prefix||Prepend this prefix to the Document ID during processing.|
Ingestion workflow to receive the ingested documents. String.
|Incremental Mode Activated||Enables incremental updates. Boolean.|
Optional. Used with the Incremental Mode Activated parameter to control if AIE should delete documents that have been removed from the source files. Default is true.
|Delete After Crawl||Boolean. Delete the files after they have been scanned. Do not use with the Incremental Mode Activated feature.|
|Move to directory after crawl||Move the scanned files to this directory after they are scanned. Do not use with the Incremental Mode Activated feature.|
|Additional Start Directories||If there is only one root directory to scan, put it in the Start Directory field and optionally specify a Move to Directory After Crawl directory where the files should be placed after the crawl. Each entry is two strings. The first string is the Start Directory. The second string is the optional Move To Directory After Crawl directory.|
|Scan hidden files||If true, scan all readable files including system and hidden files.|
|Keytab||Location of keytab file for Kerberos authentication.|
|Principal Name||Principal name for Kerberos authorization.|
|Name Node Principal||Configuration property for enabling support for Kerberos.|
The table above is for the File Connector Scanner Tab. The other tabs in the Connector Editor are described on the Connectors page.
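To reason about how the size, depth, and wildcard properties combine, it can be useful to model them outside AIE. The following is a rough sketch under stated assumptions, not the scanner's actual logic: the ScanFilter class is hypothetical, it assumes that an empty include list accepts every file, that the exclude filter wins over the include filter, that matching is case-insensitive, and that a negative maximum size means no limit. None of these behaviors are confirmed by this page.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch

MB = 1024 * 1024


@dataclass
class ScanFilter:
    """Hypothetical stand-in for a few Scanner tab properties."""
    include: list[str] = field(default_factory=list)  # Wildcard Include Filter
    exclude: list[str] = field(default_factory=list)  # Wildcard Exclude Filter
    min_size_mb: float = 0.0    # Minimum File Size (MB)
    max_size_mb: float = -1.0   # Maximum File Size (MB); negative = no limit (assumed)
    max_depth: int = -1         # Maximum Directory Depth; -1 = no limit

    def accepts(self, name: str, size_bytes: int, depth: int) -> bool:
        """Return True if a file with this name, size, and depth would be sent."""
        if self.max_depth != -1 and depth > self.max_depth:
            return False
        if size_bytes < self.min_size_mb * MB:
            return False
        if self.max_size_mb >= 0 and size_bytes > self.max_size_mb * MB:
            return False
        lowered = name.lower()
        # Assumption: an exclude match always rejects the file.
        if any(fnmatch(lowered, pat.lower()) for pat in self.exclude):
            return False
        # Assumption: an empty include list means "include everything".
        if self.include and not any(
            fnmatch(lowered, pat.lower()) for pat in self.include
        ):
            return False
        return True
```

With ScanFilter(include=["*.docx", "*.html"], max_depth=2), a file named report.docx at depth 1 is accepted, while report.pdf, or any file at depth 3, is rejected. Verify the real behavior against a small test crawl before relying on a model like this.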
Running the File Connector
Erasing the Index
While testing a new connector, you will frequently need to empty the index and try again. Methods of deleting the index are described here.
Open the AIE Administrator in a web browser, using a URL similar to http://localhost:17000/admin/.
Navigate to System Management > Connectors. The textFileConnector we just defined appears in the list of connectors.
Click the checkbox to select the connector. Click the Start link to load the files.
Navigate to the SAIL interface: Query > SAIL. You may search by clicking on a Key Phrase, or by typing keywords into the Search field. To match all available records, search for *:*.
SAIL will show you a great deal more information if you click the Search Options link and check the Debug checkbox.
Loading Single Record XML Files with the File Connector
If the XML content to be loaded is formatted such that each XML file corresponds to a single IngestDocument, the content can be loaded by using the standard file connector and the xmlIngest ingestion workflow.
For example, say your XML input data looks like this:
<?xml version="1.0" encoding="utf-8"?>
<doc id="1">
  <field name="name1" value="value1" />
  <field name="name2" value="value2" />
  <field name="less_useful_data" value="value3" />
</doc>

<?xml version="1.0" encoding="utf-8"?>
<doc id="2">
  <field name="name1" value="value4" />
  <field name="name2" value="value5" />
  <field name="less_useful_data" value="value6" />
</doc>
and you want to bring each XML file into AIE as a separate IngestDocument, preserving the name1 and name2 fields and omitting the less_useful_data field. Do this by configuring a standard file connector to operate on XML files and to route the incoming documents to the xmlIngest workflow, and by modifying the xPathExtractor component of that workflow to map name1 and name2 into AIE schema fields.
Of these steps, configuring the file connector will be left as an exercise for the reader, since the process is almost identical to the example that is on this page. The following discussion shows how to map XML values into AIE schema fields using the XPathExtractor component.
AIE uses XPath Expressions to map XML elements and attributes to AIE document fields. Mappings are realized through the xPathExtractor ingestion component. The default AIE configuration (<install_dir>\conf\core-app\attivio-components.xml) contains an xPathExtractor transformer in the standard xmlIngest ingestion workflow.
The configuration of the default xPathExtractor transformer is very basic. To edit xPathExtractor in the AIE Administrator, navigate to the System Management > Palette page. Use the Search Components field to search for "xPathExtractor." Click on the name of the component to open an editor.
You can accept all of the existing settings except for the XPaths settings. These settings map XML field values to AIE schema fields, so they must be customized to match your XML data.
The format of the XPath expressions (the strings "/doc/field[@name = 'name1']/@value" and "/doc/field[@name = 'name2']/@value") is described in more detail at http://en.wikipedia.org/wiki/XPath. Suffice it to say that these expressions have the following interpretation:
- Find the top level <doc> element. Find the <field> element underneath it whose name attribute equals "name2". The value of this element's value attribute will be mapped to the AIE schema field "text".
- Find the top level <doc> element. Find the <field> element underneath it whose name attribute equals "name1". The value of this element's value attribute will be mapped to the AIE schema field "title".
Note that the field whose name attribute is set to "less_useful_data" will be ignored when ingesting these documents.
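Before editing the xPathExtractor configuration, you may want to confirm that the mapping selects the values you expect from a sample file. The sketch below reproduces the name1-to-title and name2-to-text mapping with Python's standard xml.etree.ElementTree module. It is an approximation for checking sample data, not the xPathExtractor component itself: ElementTree supports only a subset of XPath and cannot select an attribute with /@value, so the code finds the element and then reads its value attribute. The names FIELD_MAP and extract_fields are hypothetical.

```python
import xml.etree.ElementTree as ET

# First sample document from this page, as bytes so that the
# encoding declaration is honored by the parser.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<doc id="1">
  <field name="name1" value="value1" />
  <field name="name2" value="value2" />
  <field name="less_useful_data" value="value3" />
</doc>"""

# Source "name" attribute -> target schema field, as described above.
FIELD_MAP = {"name1": "title", "name2": "text"}


def extract_fields(xml_bytes: bytes) -> dict[str, str]:
    """Map selected <field> values of a <doc> element to schema field names."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "doc":
        raise ValueError("expected a top-level <doc> element")
    result = {}
    for source_name, schema_field in FIELD_MAP.items():
        # Equivalent in intent to /doc/field[@name = '...']/@value.
        element = root.find(f"field[@name='{source_name}']")
        if element is not None:
            result[schema_field] = element.get("value", "")
    return result


if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

Fields that have no entry in FIELD_MAP, such as less_useful_data, are simply never read, which mirrors how unmapped fields are ignored during ingestion.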
For changes to the XPath Extractor configuration to take effect, AIE must be restarted and the XML files must be re-ingested.
The Factbook Quick Start Tutorial configuration provides another example of XPath mapping of fields.
Main Article: Loading Factbook XML
This connector extracts content in binary form, and thus has no encoding limitations. However, the ingest workflow (typically fileIngest), which generates field values from the binary content, may not support all content encoding schemes.
This is one of the connectors that support incremental updating (see Activating Incremental Updating). There is a tutorial example of incremental updating here.
After running the connector to ingest documents with Incremental Mode activated, be careful with any future configuration changes to the connector, as such changes can cause one or more of the following issues:
- Some incremental changes might not be properly identified, and hence, not get ingested into AIE in future runs.
- Some documents can remain in your index that are no longer managed by any connector. These documents can eventually become out of date and contain outdated content security permissions.
If you must change the connector configuration after running it, follow these steps to keep your system fully up to date:
1. Delete any previous documents the connector created in your AIE index.
2. Select your connector from the AIE Administrator's Connectors tab, and Reset the connector.
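To see why configuration changes interact badly with incremental runs, consider one common way of detecting incremental changes: compare a snapshot of the file tree from the previous run with a snapshot from the current run. The sketch below illustrates that general technique only. It is not a description of how AIE tracks changes internally, and the names take_snapshot and diff_snapshots are hypothetical.

```python
import os
import tempfile
from pathlib import Path

# Relative path -> (size in bytes, modification time in nanoseconds)
Snapshot = dict[str, tuple[int, int]]


def take_snapshot(root: Path) -> Snapshot:
    """Record size and modification time for every file under root."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            stat = full.stat()
            key = full.relative_to(root).as_posix()
            snapshot[key] = (stat.st_size, stat.st_mtime_ns)
    return snapshot


def diff_snapshots(previous: Snapshot, current: Snapshot):
    """Return sorted lists of (added, changed, deleted) relative paths."""
    added = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    changed = sorted(
        p for p in current.keys() & previous.keys() if current[p] != previous[p]
    )
    return added, changed, deleted


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    (root / "a.html").write_text("one")
    before = take_snapshot(root)
    (root / "b.html").write_text("two")
    print(diff_snapshots(before, take_snapshot(root)))
```

In a scheme like this, the previous snapshot is only meaningful for the configuration that produced it. If the start directory or the filters change between runs, files that were ingested earlier may no longer appear in either snapshot in a comparable way, which is consistent with the recommendation above to delete the connector's documents and reset the connector after a configuration change.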