
Overview

This is an example of using the XML Connector to load a large XML file (>10MB) that contains multiple documents. See Loading XML Content for other examples. 


XML Connector, Multiple Documents in a Large File

An XML Connector encapsulates an XmlScanner, which plucks individual documents from a large XML file without loading the whole file into memory first. "Large" means over 10MB in size.

This scanner splits each source file into multiple IngestDocuments. Each IngestDocument represents one book, which in this example is an <item> element of the feed.
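
The streaming idea can be illustrated with a minimal sketch. It assumes Python's standard xml.etree.ElementTree module rather than AIE's XmlScanner, whose internals are not shown on this page; the iter_documents helper and the two-item sample feed are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def iter_documents(source, doc_tag):
    """Yield one dict per <doc_tag> element as the parser reaches its end tag."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == doc_tag:
            # Collect the direct children (title, link, ...) of this document.
            yield {child.tag: (child.text or "").strip() for child in elem}
            elem.clear()  # discard the element's children and text after use

# Hypothetical sample standing in for a much larger file.
sample = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
  <channel>
    <title>New Online Books</title>
    <item><title>Book A (Smith)</title><link>http://example.org/a</link></item>
    <item><title>Book B (Jones)</title><link>http://example.org/b</link></item>
  </channel>
</rss>"""

docs = list(iter_documents(io.BytesIO(sample), "item"))
print(len(docs), docs[0]["link"])
```

Because each element is cleared once it has been emitted, the parser never needs to hold every document's content in memory at once, which is the property that matters for files over 10MB.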

Content Encoding

The XML Connector uses the encoding scheme (character set) declared in the XML file's header. If the header does not specify a character set, the connector defaults to UTF-8.
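
As a rough illustration of this rule, the sketch below reads the encoding from an XML declaration and falls back to UTF-8. It is not the connector's own logic: the declared_encoding function is hypothetical, and it deliberately ignores byte-order marks and UTF-16 input, which a real XML parser must also handle.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(head: bytes, default: str = "UTF-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    match = _DECL.match(head)
    return match.group(1).decode("ascii") if match else default

print(declared_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?><rss/>'))
print(declared_encoding(b'<?xml version="1.0"?><rss/>'))
```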

Setting Up the Demonstration

Ensure that the environment is prepared as follows:

  1. Create a new project that includes the demo group of AIE Modules. (None of the modules in the demo group is necessary to XML ingestion.)

    aie-exec.exe createProject -n xmlbooks -g demo -o c:\attivio-projects
  2. Start AIE using the AIE Agent and its Command-Line Interface (CLI):

    1. Run the Agent in a Command Window:

      <install-dir>\bin\aie-agent.exe -d <data-agent-dir> 
    2. Run the Command-Line Interface in a second Command Window. Note that the CLI is invoked for a specific project.

      <install-dir>\bin\aie-cli -p <project-dir>
    3. To run the project, use the start all command in the Command-Line Interface.

Mapping XML Elements to AIE Schema Fields

In this example we use only the title, author, and teaser fields, so no special customization of the AIE Schema is required.

Setting Up the Target File

To experiment with XML ingestion, we need to create a directory and put an XML source file in it.

In this exercise we used the following source directory:

c:\documents

To demonstrate the XML Connector we need a "large" file of books. We visited the web site of Online Books and downloaded their RSS feed of new acquisitions. (RSS is a dialect of XML.) The file, onlinebooks.xml, isn't large compared to 10MB, but it lets us work with real-world data and extract many IngestDocuments from a single file.

This is a snapshot of the content in the file:

RSS feed
<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
  <channel>
    <title>New Online Books</title>
      <link>http://onlinebooks.library.upenn.edu/new.html</link>
      <description>New listings of free online books from The Online Books Page</description>

    <item>
      <title>Life Portraits of William Shakespeare (Friswell)</title>
      <link>http://catalog.hathitrust.org/Record/001018477</link>
      <description>Life Portraits of William Shakespeare: A History of the Various Representations of the Poet, With an Examination Into Their Authenticity (London: S. Low, Son, and Marston, 1864),  by J. Hain Friswell (page images at HathiTrust)</description>
    </item>

    < many other items...>

  </channel>
</rss>

We'll use the XPath features of the XML Connector to extract information from the XML and reorganize it into an AIE index entry. Refer back to this display to see how the XPath formulas map into the content of the file.

Configure the XML Connector

You can configure an XML Connector in the Attivio Administrator.

Start AIE using the AIE Agent and its Command-Line Interface (CLI). This will start AIE and will make the Administration UI available at http://<host>:17000/admin.

In the Administration UI, navigate to System Management > Connectors. Click New in the menu bar. Select the XML Connector from the list.

XmlConnectorNewUI

On the Scanner tab of the resulting dialog box, enter the Connector Name (BigFileManyDocs) and the Start Directory (c:\documents).

UNC Paths

File connectors such as the XML Connector support the Uniform Naming Convention (UNC) path format used to designate Windows network shares. However, UNC paths are not supported for other path specifications in AIE, such as the location of AIE logs or indexes. It is also possible to use a mapped network drive to specify a Windows file share as if it were a local drive. Note that scanners running on Linux hosts cannot access file content via UNC paths or local Windows paths; these scanners must run on Windows hosts.

XmlScannerConfigUI

Continue with the Document Root (/rss/channel/item/), which is the XML element that encloses an entire document. Then add the ID Path (/rss/channel/item/link/), which in this case fetches the URL link to the document. Direct the feed to load the onlinebooks.xml file. Finally, we want to pass the new IngestDocuments to the xmlIngest workflow.
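
To see how the Document Root and ID Path relate to the feed, here is a minimal sketch using Python's standard xml.etree.ElementTree. It is an illustration only, not the connector's implementation, and the one-item FEED string is a trimmed copy of the snapshot shown earlier.

```python
import xml.etree.ElementTree as ET

FEED = """<rss version="2.0">
  <channel>
    <title>New Online Books</title>
    <item>
      <title>Life Portraits of William Shakespeare (Friswell)</title>
      <link>http://catalog.hathitrust.org/Record/001018477</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(FEED)  # root is the <rss> element

# ElementTree paths are relative to the root element, so the Document Root
# /rss/channel/item is written here as ./channel/item.
items = root.findall("./channel/item")

# The ID Path /rss/channel/item/link is simply "link" relative to each item.
doc_ids = [item.findtext("link") for item in items]
print(doc_ids)
```

Each element matched by the Document Root becomes one document, and the text of its link child serves as that document's ID.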

Click Save. The Connector UI writes out the connection configuration to the project's configuration servers. 

Additional Start Directories

If there is only one root directory to scan, put it in the Start Directory field and optionally specify a Move to Directory After Crawl directory where the files should be placed after the crawl.

If there is more than one root directory to scan, put the first one in the Start Directory field (and optionally specify the Move to Directory After Crawl field) and then add the other directories in the Additional Start Directories field.

Each entry is two strings. The first string is the Start Directory. The second string is the optional Move To Directory After Crawl directory.

XML Connector Properties

The XML Connector is configured by setting properties on the editor.

XML Connector Editor

Remarks

Connector Name

The name of the connector as seen in the UI or in XML.

Node Set

The nodeset the connector should run on. Defaults to default-service-nodeset. The Editor can set this value only on new, unsaved connectors.

Document Root

XPath to the node that encloses the document, such as /rss/channel/item/.

ID Path

XPath to the document ID element, such as /doc/@id for an attribute or /doc/id for an element.

Xml Incremental Mode

XML or file-based incremental mode.

File System URI

Use this field to access an HDFS file system. The syntax is hdfs://[username@]host:port, for example, hdfs://acevm0681.lab.attivio.com:8020/. Otherwise leave it empty.

Start Directory

The directory containing the files to scan, or the root directory of the tree to scan. Required.

Avoid using the same start directory in multiple scanners. This can confuse the incremental deletion feature, causing unexpected deletions.

Follow Symbolic Links

Whether the scanner should follow symbolic links while crawling the file system.

Maximum Directory Depth

Maximum number of nested directory levels to traverse. "-1" means no limit.

Maximum File Size (MB)

Maximum file size to send in megabytes.

 

Wildcard Include Filter

File-extension wildcards. Matching files will be scanned.

Wildcard Exclude Filter

File-extension wildcards. Matching files will not be scanned.

Directory Listing Timeout

Configurable timeout for directory listing operations, in seconds.

Document ID Prefix

Add this prefix to the beginning of the Document ID during processing.

Ingest Workflow

Ingestion workflow to receive the ingested documents. String.

Incremental 

Incremental Mode Activated

Enables incremental updates.  Boolean.

Incremental Deletes

Optional. Used with the 'incremental-activated' parameter to control whether AIE deletes documents that have been removed from the source files. Default is true.

Advanced 
Delete After Crawl

Boolean.  Delete the files after they have been scanned. Do not use with the incrementalModeActivated feature.

 
Move to directory after crawl

Move the scanned files to this directory after they are scanned. Do not use with the incrementalModeActivated feature.

 
Additional Start Directories

If there is only one root directory to scan, put it in the Start Directory field and optionally specify a Move to Directory After Crawl directory where the files should be placed after the crawl.

If there is more than one root directory to scan, put the first one in the Start Directory field (and optionally specify the Move to Directory After Crawl field) and then add the other directories here.

Each entry is two strings. The first string is the Start Directory. The second string is the optional Move To Directory After Crawl directory.

 
Scan hidden files

If true, scan all readable files including system and hidden files.

Kerberos

Keytab

Location of keytab file for Kerberos authentication.

Principal Name

Principal name for Kerberos authorization.

Name Node Principal

Configuration property for enabling support for Kerberos.

The table above is for the File Connector Scanner Tab.  The other tabs in the Connector Editor are described on the Connectors page.
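
The Wildcard Include Filter and Wildcard Exclude Filter rows in the table can be pictured with a short sketch. It uses Python's standard fnmatch module and assumes that an exclude match takes precedence over an include match; that precedence, the should_scan helper, and the sample file names are all assumptions for illustration, not documented AIE behavior.

```python
from fnmatch import fnmatch

def should_scan(filename, include=("*",), exclude=()):
    """True if filename matches an include pattern and no exclude pattern."""
    included = any(fnmatch(filename, pattern) for pattern in include)
    excluded = any(fnmatch(filename, pattern) for pattern in exclude)
    # Assumption: exclusion wins when a file matches both filters.
    return included and not excluded

names = ["onlinebooks.xml", "notes.txt", "backup.xml.bak"]
selected = [n for n in names if should_scan(n, include=["*.xml"], exclude=["*.bak"])]
print(selected)
```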

Configure the parseXml Transformer

The first component in the xmlIngest workflow is called parseXml. Configuring parseXml is not typically needed, but if your XML input files contain namespaces you may want to do so.

Navigate to the System Management > Workflows > Ingest page of the Admin UI. Type "xmlIngest" into the search field and click the Search Workflows button. Click the xmlIngest entry to open the Workflow Editor. Then select the parseXml component of the workflow and click the Edit Component button. This opens the Component Editor.

parseXML

If your XML file uses namespaces, you can enter them here and AIE will automatically expand them as they are encountered. If there are no namespaces in the input files, you probably don't need to edit this component.
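
For readers unfamiliar with namespace expansion, the following sketch shows the general idea with Python's standard xml.etree.ElementTree. It does not use the parseXml component; the lib prefix, the http://example.org/library URI, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<lib:catalog xmlns:lib="http://example.org/library">
  <lib:book><lib:title>Example Title</lib:title></lib:book>
</lib:catalog>"""

root = ET.fromstring(DOC)

# Prefix-to-URI map supplied by the caller, similar in spirit to declaring
# the namespaces a parser should recognize.
ns = {"lib": "http://example.org/library"}
titles = [t.text for t in root.findall("./lib:book/lib:title", ns)]

print(root.tag)  # the expanded name, in {URI}localname form
print(titles)
```

The parser replaces each prefix with its full URI, so paths that use a prefix only match when the same prefix-to-URI mapping is supplied.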

Configure the xPathExtractor Transformer

xPathExtractor is a predefined workflow component that is part of the xmlIngest workflow.  xPathExtractor maps XML elements to AIE-compatible index fields in an IngestDocument. In this example, xmlIngest modifies IngestDocuments received from the XmlScanner, and eventually passes them into the ingest workflow for further analysis before reaching the AIE index.

Navigate to the System Management > Workflows > Ingest page of the Admin UI. Type "xmlIngest" into the search field and click the Search Workflows button. Click the xmlIngest entry to open the Workflow Editor. Then select the xPathExtractor component of the workflow and click the Edit Component button. This opens the Component Editor.

xPathExtractorDiagram

The properties are AIE index fields (such as teaser) paired with XML elements (such as description). The XML elements are identified using XPath notation, which means we can take advantage of XPath's string functions to extract pieces of strings to use as values in our IngestDocument fields.

XPath Note

Note that XPaths are relative to the Document Root (/rss/channel/item/, as specified in the XML Scanner). For example, the teaser above uses the XPath /item/description. The full XPath is actually /rss/channel/item/description; however, we can use /item/description because it is relative to the Document Root.

 

For instance, the AIE "teaser" field can be populated with the value of the /rss/channel/item/description element:

xPath to Description Field.
/item/description

The titles of the RSS entries contain a title string followed by the author's last name in parentheses. The XPath string functions select the characters before the opening parenthesis for the title, and the characters between the parentheses for the author name.

xPath to Title
substring-before(/item/title, '(')
xPath to Author
substring-before(substring-after(/item/title, '('), ')')

Just paste these expressions into the fields of the Component Editor.
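
To check what these two expressions produce, the sketch below re-creates the XPath 1.0 substring-before and substring-after functions in plain Python and applies them to the sample title from the feed. The helper functions are hypothetical stand-ins written for this illustration (they assume a non-empty marker); they are not part of AIE.

```python
def substring_before(text: str, marker: str) -> str:
    """Text preceding the first occurrence of marker, or '' if absent."""
    head, sep, _ = text.partition(marker)
    return head if sep else ""

def substring_after(text: str, marker: str) -> str:
    """Text following the first occurrence of marker, or '' if absent."""
    _, sep, tail = text.partition(marker)
    return tail if sep else ""

raw = "Life Portraits of William Shakespeare (Friswell)"
title = substring_before(raw, "(")
author = substring_before(substring_after(raw, "("), ")")

print(repr(title))   # note the trailing space left before the parenthesis
print(repr(author))
```

The title keeps the space that preceded the opening parenthesis. XPath 1.0 also defines normalize-space(), which strips leading and trailing whitespace; wrapping the title expression in it should remove that space, assuming the component accepts arbitrary XPath 1.0 expressions.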

Testing the Configuration

Erasing the Index

While testing a new connector, you will frequently need to empty the index and try again. Four methods of deleting the index are described here.

Run the custom connector from the AIE Administration Web Interface by clicking on the System Management -> Connectors menu item. Right-click BigFileManyDocs and select Start from the context menu.

When the connector has finished loading, open SAIL and search for *:*. This will retrieve all records from all tables in the index. Here's an example. (You'll see more detail if you click the Search Options link and check the Debug checkbox.)

Shakespeare

As you can see from the illustration, the book descriptions from the onlinebooks.xml file have found their way into the AIE index, and the title, author, and teaser fields are properly populated.
