Overview
The AVRO Connector reads AVRO files and converts each row into an IngestDocument. Files in Apache Avro format are frequently encountered in Apache Hadoop file systems.
Sample Data
The following scanner configuration example presumes that we have loaded this CSV file into a Hadoop cluster using AVRO binary format:
N_NATIONKEY,N_NAME,N_REGIONKEY,table,N_COMMENT
0,ALGERIA,0,NATION, haggle. carefully final deposits detect slyly agai
1,ARGENTINA,1,NATION,al foxes promise slyly according to the regular accounts. bold requests alon
2,BRAZIL,1,NATION,y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special
3,CANADA,1,NATION,eas hang ironic silent packages. slyly regular packages are furiously over the tithes. fluffily bold
4,EGYPT,4,NATION,y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d
5,ETHIOPIA,0,NATION,ven packages wake quickly. regu
The first row contains column names. Note that the N_COMMENT field contains random text for testing purposes.
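An Avro data file carries its own schema, written as JSON. As a minimal sketch, a schema for the sample data could be derived from the header row as shown here. It assumes that every column is stored as a string and uses a hypothetical record name, `nation`; the schema of the attached AVRO file may differ.

```python
import csv
import io
import json

# Header row of the sample CSV shown above.
HEADER_LINE = "N_NATIONKEY,N_NAME,N_REGIONKEY,table,N_COMMENT"


def avro_schema_from_header(header_line, record_name="nation"):
    """Build an Avro record schema (as a dict) from a CSV header line.

    Assumption: every column is typed as "string". The record name is
    hypothetical and chosen only for this illustration.
    """
    column_names = next(csv.reader(io.StringIO(header_line)))
    return {
        "type": "record",
        "name": record_name,
        "fields": [{"name": name, "type": "string"} for name in column_names],
    }


schema = avro_schema_from_header(HEADER_LINE)
print(json.dumps(schema, indent=2))
```

Each CSV column becomes one entry in the schema's `fields` list, which is what the connector later sees as a field on each IngestDocument.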
The column names must be defined as fields in <project-dir>\conf\schema\default.xml or those fields will not be indexed. Add the following field definitions to the schema file:
<field name="N_NATIONKEY" type="STRING" multivalue="false" indexed="false" stored="true" sort="false" tokenize="yes"/>
<field name="N_NAME" type="STRING" multivalue="false" indexed="false" stored="true" sort="false" tokenize="yes"/>
<field name="N_REGIONKEY" type="STRING" multivalue="false" indexed="false" stored="true" sort="false" tokenize="yes"/>
<field name="N_TABLE" type="STRING" multivalue="false" indexed="false" stored="true" sort="false" tokenize="yes"/>
<field name="N_COMMENT" type="STRING" multivalue="false" indexed="false" stored="true" sort="false" tokenize="yes"/>
The AVRO and CSV versions of the "nations" data are attached to this page for your convenience.
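Because the field definitions for the schema file differ only in the field name, they can be generated rather than typed by hand, which helps with wider files. The sketch below is a hypothetical helper (not part of AIE) that prints one definition per field name, copying the attribute values used in the definitions listed on this page.

```python
from xml.sax.saxutils import quoteattr

# Field names taken from the schema definitions listed on this page.
FIELD_NAMES = ["N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_TABLE", "N_COMMENT"]


def field_definition(name):
    """Return one <field> element for the schema file.

    The attribute values are copied from the example definitions; adjust
    them if a field needs a different type or indexing behavior.
    """
    return (
        "<field name=%s type=\"STRING\" multivalue=\"false\" indexed=\"false\" "
        "stored=\"true\" sort=\"false\" tokenize=\"yes\"/>" % quoteattr(name)
    )


definitions = [field_definition(name) for name in FIELD_NAMES]
print("\n".join(definitions))
```

The printed lines can then be pasted into the schema file.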
Configuring an AVRO Connector
You can configure an AVRO connector by using the Connector UI.
Start AIE using the AIE Agent's Command-Line Interface. This will start AIE and will make the Administration UI available at http://<host>:17000/admin.
In the Administration UI, navigate to System Management > Connectors. Click New in the menu bar. Select the AVRO Connector from the list.
On the Scanner tab of the resulting dialog box, enter the Connector Name (AVROConnector) and the Start Directory (c:\documents\nation_avro.avro). With the example we have set up on this page, you can accept the default behavior of all the other fields.
UNC Paths
File connectors such as the AVRO Connector support the Uniform Naming Convention (UNC) path format used to designate Windows network shares. However, UNC paths are not supported for other path specifications in AIE, for example, the location of AIE logs or indexes. It is also possible to use a mapped network drive to specify a Windows file share as if it were a local drive. Note that scanners running on Linux hosts cannot access file content via UNC paths or local Windows paths; these scanners must run on Windows hosts.
Then click the Field Mappings tab. Add one static field, declaring that each of these incoming IngestDocuments should be included in the "avro" table.
Click Save. The Connector UI writes out the connection configuration to the project's configuration servers.
That's all it takes. You can load your file. The instructions are in "Running the AVRO Connector," below.
AVRO Connector Properties
The AVRO Connector is configured by setting properties in the Connector Editor.
AVRO Scanner Tab | Remarks | |
---|---|---|
Connector Name | The name of the connector as seen in the UI or in XML. | |
Node Set | The nodeset the connector should run on. Defaults to default-service-nodeset. The Editor can set this value only on new, unsaved connectors. | |
Start Directory | The directory containing the files to scan, or the root directory of the tree to scan.[REQUIRED]. Avoid using the same start directory in multiple connectors. This can confuse the incremental deletion feature, causing unexpected deletions. | |
Row Number Field Name | The field in which to store the row number. Default is to ignore the row number. | |
ID Field Format | Describes how to concatenate the values from one or more idFields into a single value, which will be used as the record's unique id. The value is a string that follows the behavior of the format method of the Java String class. | |
ID Fields | A list of fields to concatenate to create a unique id value. | |
Follow Symbolic Links | Whether or not the scanner should follow symbolic links while | |
Maximum Directory Depth | Maximum number of nested directory levels to traverse. "-1" | |
Minimum File Size (MB) | Minimum file size to send (in MB). Smaller files will be dropped. | |
Maximum File Size (MB) | Maximum file size to send (in MB). | |
Wildcard Include Filter | File-extension wildcards. Matching files will be scanned. | |
Wildcard Exclude Filter | File-extension wildcards. Matching files will not be scanned. | |
Directory Listing Timeout | Timeout for listing the contents of a directory (in seconds). | |
Document ID Prefix | Prepend this prefix to the Document ID during processing. | |
Ingest Workflow | Ingestion workflow to receive the ingested documents. String. | |
Incremental | ||
Incremental Mode Activated | If true, this connector will run in Incremental Mode. | |
Incremental Deletes | Optional. Used with the 'incremental-activated' parameter to control whether AIE deletes documents that have been removed from the source files. Default is true. | |
Advanced | ||
Delete After Crawl | Boolean. Delete the files after they have been scanned. Do not use with the incrementalModeActivated feature. | |
Move to directory after crawl | Move the scanned files to this directory after they are scanned. Do not use with the incrementalModeActivated feature. | |
Additional Start Directories | If there is only one root directory to scan, put it in the Start Directory field and optionally specify a Move to Directory After Crawl directory where the files should be placed after the crawl. If there is more than one root directory to scan, put the first one in the Start Directory field (and optionally specify the Move to Directory After Crawl field) and then add the other directories here. Each entry is two strings. The first string is the Start Directory. The second string is the optional Move To Directory After Crawl directory. | |
Max Rows | Number of rows to read from the file. | |
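To illustrate how ID Field Format and ID Fields work together: the format string uses the placeholder syntax of the Java String format method, and for plain %s placeholders Python's % operator behaves the same way. The sketch below assumes a hypothetical configuration with ID Fields N_NATIONKEY and N_NAME and an ID Field Format of %s-%s. It illustrates the concatenation only and is not the connector's own code.

```python
def build_record_id(id_format, id_fields, row):
    """Combine the values of the ID fields into a single record id.

    id_format: printf-style format string, one placeholder per ID field.
    id_fields: field names, in the order the placeholders consume them.
    row:       mapping of field name to value for one record.
    """
    values = tuple(row[name] for name in id_fields)
    return id_format % values


# Hypothetical configuration applied to the BRAZIL row of the sample data.
row = {"N_NATIONKEY": "2", "N_NAME": "BRAZIL", "N_REGIONKEY": "1"}
record_id = build_record_id("%s-%s", ["N_NATIONKEY", "N_NAME"], row)
print(record_id)  # 2-BRAZIL
```

The order of the ID Fields matters, because each placeholder in the format consumes the next field value in turn.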
The AVRO Connector has an Other tab that contains several properties of special interest.
AVRO Other Tab | Remarks |
---|---|
Keytab | Location of keytab file for Kerberos authentication. |
Principal Name | Principal name for Kerberos authorization. |
Name Node Principal | Configuration property for enabling support for Kerberos. |
Scan hidden files | If true, scan all readable files including system and hidden files. |
The tables above are for the AVRO Connector. The other tabs in the Connector Editor are described on the Connectors page.
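As a rough mental model of how the size and wildcard filters on the Scanner tab narrow the set of scanned files, consider the sketch below. The order of the checks, the treatment of an empty include list, the case-sensitive matching, and the definition of 1 MB as 1024 * 1024 bytes are all assumptions made for illustration; the connector's actual rules may differ.

```python
from fnmatch import fnmatchcase

MB = 1024 * 1024  # assumption: the connector may define a megabyte differently


def should_scan(name, size_bytes, include=(), exclude=(), min_mb=None, max_mb=None):
    """Decide whether a file passes the size and wildcard filters.

    Hypothetical logic: a file is scanned when its size is within the
    optional limits, it matches no exclude pattern, and it matches at
    least one include pattern (or no include patterns are configured).
    """
    if min_mb is not None and size_bytes < min_mb * MB:
        return False
    if max_mb is not None and size_bytes > max_mb * MB:
        return False
    if any(fnmatchcase(name, pattern) for pattern in exclude):
        return False
    if include and not any(fnmatchcase(name, pattern) for pattern in include):
        return False
    return True


print(should_scan("nation_avro.avro", 2 * MB, include=["*.avro"]))
print(should_scan("nation.csv", 2 * MB, include=["*.avro"]))
```

Under these assumptions the first call accepts the AVRO file and the second rejects the CSV file, because it matches no include pattern.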
Running the AVRO Connector
Erasing the Index
While testing a new connector, you will frequently need to empty the index and try again. Four methods of deleting the index are described here.
To run the AVRO Connector, open the AIE Administration UI, and navigate to the System Management > Connectors page. Right-click on AVROConnector and click on Start.
Then navigate to the Search UI (Query > Search UI). Search for *:*, which retrieves all records in all tables. Set the Details toggle to "ON" to see all of the documents' fields. We can see that the scanner was successful.
The highlighted fields were the ones loaded from the AVRO file. The remaining fields are document metadata and the score explanation.
Incremental Updating
If one or more input files are renamed or deleted after the first run of an incrementally-enabled connector, the next connector run deletes all of the documents associated with those files from the index. But if the file is left in place with one or more rows removed, the next connector run does not delete documents associated with the missing rows, because it only checks whether the source file is still present.
This connector supports the Incremental Updating features. There is a tutorial example of incremental updating here.
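The file-level deletion behavior described above can be pictured with a small simulation. This is a sketch of the stated rule only (documents are removed when their source file is gone, not when a row disappears from a file that is still present). The data structures and file names are hypothetical and do not reflect how AIE stores its incremental state.

```python
def documents_to_delete(previous_run, current_files):
    """Return ids of documents whose source file is no longer present.

    previous_run:  dict mapping source file path to the set of document
                   ids created from that file in the earlier run.
    current_files: set of file paths found in the current run.
    """
    stale = set()
    for path, doc_ids in previous_run.items():
        if path not in current_files:
            stale |= doc_ids
    return stale


# First run: two files were ingested.
previous_run = {
    "nation_a.avro": {"a-0", "a-1", "a-2"},
    "nation_b.avro": {"b-0", "b-1"},
}

# Second run: nation_b.avro was deleted, and one row was removed from
# nation_a.avro while the file itself stayed in place.
current_files = {"nation_a.avro"}

stale_ids = documents_to_delete(previous_run, current_files)
print(sorted(stale_ids))  # ['b-0', 'b-1']
```

Only the documents from the deleted file are removed. The document for the row dropped from nation_a.avro stays in the index, because the check looks at files rather than rows.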
After running the connector to ingest documents with Incremental Mode activated, be careful with any future configuration changes to the connector, as such changes can cause one or more of the following issues:
- Some incremental changes might not be properly identified, and hence, not get ingested into AIE in future runs.
- Some documents can remain in your index that are no longer managed by any connector. These documents can eventually become out of date and contain outdated content security permissions.
If you must make changes to the connector configuration after running it, follow these steps to keep your system fully up to date:
1. Delete any previous documents the connector created in your AIE index.
2. Select your connector from the AIE Administrator's Connectors tab, and Reset the connector.