Overview

This is an example of using the File Connector to load a small XML file (<10MB) that contains multiple documents. See Loading XML Content for other examples.

File Connector, Multiple Documents per File

Since our sample XML file is small, we'll use a File Connector and a special XML workflow to load the file. The exercise involves the following steps:

  1. Edit the project's schema.xml file as described in the previous exercise
  2. Create a File Connector (BookFileConnector) that opens the XML file in c:\documents, and turns it into an IngestDocument. Send this IngestDocument to the xmlBookIngest workflow. We'll create that workflow in the following step.
  3. Make a copy of the xmlIngest workflow, called xmlBookIngest. The components of this workflow (parseXml, xPathExtractor, dropDom, and the ingest subflow) are almost what we need to read in our sample data.
  4. Because our sample data consists of one XML file that contains multiple XML documents, we'll have to create a splitBookXml transformer to break out the individual documents into independent IngestDocument messages.
  5. Add splitBookXml to the xmlBookIngest workflow, between parseXml and xPathExtractor.
  6. Configure the xPathExtractor component to copy values from the XML elements to IngestDocument fields.
  7. Run the BookFileConnector and inspect the resulting records in the SAIL. Verify that we got two book records, and that the content fields are as expected.

The next few subsections demonstrate how to use the AIE Administrator to accomplish these steps.

Setting Up the Demonstration

Ensure that the environment is prepared as follows:

  1. Create a new project that includes the demo group of AIE Modules. (None of the modules in the demo group is necessary to XML ingestion.)

    aie-exec.exe createProject -n xmlbooks -g demo -o c:\attivio-projects
  2. Start AIE using the AIE Agent and its Command-Line Interface (CLI):

    1. Run the Agent in a Command Window:

      <install-dir>\bin\aie-agent.exe -d <data-agent-dir> 
    2. Run the Command-Line Interface in a second Command Window. Note that the CLI is invoked for a specific project.

      <install-dir>\bin\aie-cli -p <project-dir>
    3. To run the project, use the start all command in the Command-Line Interface.

Mapping XML Elements to AIE Index Fields

This exercise uses the same schema mappings as the previous exercise.

Setting Up the Target File

To experiment with XML ingestion, we need to create a directory and put an XML source file in it.

In this exercise we used the following source directory:

c:\documents

The top-level directory is not a requirement; it just makes the example code easier to read.

The source file, books.xml, is an example of an XML file that contains multiple independent documents. AIE will have to locate and split out these documents within the file.

books.xml
<?xml version='1.0'?>
<catalog>
  <book id="1">
    <title>Oliver Twist</title>
    <author>
      <firstname>Charles</firstname>
      <lastname>Dickens</lastname>
    </author>
    <yearpublished>1837</yearpublished>
    <description>Oliver Twist, Fagin, Nancy, Bill Sykes, and the
 	Artful Dodger live by their wits in this
 	dark tale of Victorian England.
    </description>
    <location>London</location>
    <location>England</location>
  </book>

  <book id="2">
    <title>Journey to the Centre of the Earth</title>
    <author>
      <firstname>Jules</firstname>
      <lastname>Verne</lastname>
    </author>
    <yearpublished>1864</yearpublished>
    <description>Eccentric Uncle Lindenbrock is determined to search
        for a fabled land in the center of the earth. He takes his
        nephew, Axel, and their servant, Hans on the ultimate spelunking
        adventure.
    </description>
    <location>Centre of the Earth</location>
  </book>
</catalog>

This is one XML file containing two documents within a top-level element called catalog. Each of the <book> documents contains a nested XML description of the book's author.
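
To make the structure concrete, the following Python sketch parses an abbreviated copy of books.xml and lists what each <book> element contains. It uses only the Python standard library and is independent of AIE; the summarize function name is chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of books.xml (descriptions and years omitted for brevity).
BOOKS_XML = """<?xml version='1.0'?>
<catalog>
  <book id="1">
    <title>Oliver Twist</title>
    <author><firstname>Charles</firstname><lastname>Dickens</lastname></author>
    <location>London</location>
    <location>England</location>
  </book>
  <book id="2">
    <title>Journey to the Centre of the Earth</title>
    <author><firstname>Jules</firstname><lastname>Verne</lastname></author>
    <location>Centre of the Earth</location>
  </book>
</catalog>
"""


def summarize(xml_text):
    """Return the root tag and an (id, title, location count) tuple per book."""
    root = ET.fromstring(xml_text)
    books = [
        (book.get("id"), book.findtext("title"), len(book.findall("location")))
        for book in root.findall("book")
    ]
    return root.tag, books


if __name__ == "__main__":
    root_tag, books = summarize(BOOKS_XML)
    print(root_tag, books)
```

The two tuples correspond to the two documents that AIE is expected to split out of the single file.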

This file, too, goes into the source directory.

Create the BookFileConnector

Open the AIE Administrator and navigate to System Management > Connectors. Click the New link. This opens the New Connector dialog box. Select File Connector and click OK.

NewFileConnector

On the Scanner tab of the New Connector editor, fill in the following values:

  • The name of the connector. We called it BookFileConnector.
  • The directory that contains the source file.
  • Adjust the include/exclude filters so that the connector looks for *.xml files only.
  • Designate the xmlBookIngest workflow as the destination of the new IngestDocuments.
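
As a rough illustration of what an include/exclude filter for *.xml does, the sketch below applies wildcard patterns to a list of file names. It is not the connector's implementation: the select_files function, its parameters, and the case-insensitive matching are assumptions made for this example.

```python
from fnmatch import fnmatch


def select_files(names, include=("*.xml",), exclude=()):
    """Keep names that match an include pattern and no exclude pattern.

    Matching is made case-insensitive here by lowercasing both sides;
    whether the real connector behaves this way is an assumption.
    """
    selected = []
    for name in names:
        lowered = name.lower()
        included = any(fnmatch(lowered, pattern.lower()) for pattern in include)
        excluded = any(fnmatch(lowered, pattern.lower()) for pattern in exclude)
        if included and not excluded:
            selected.append(name)
    return selected


if __name__ == "__main__":
    print(select_files(["books.xml", "notes.txt", "books.xml.bak"]))
```

With these patterns, a backup file such as books.xml.bak is skipped because the pattern must match the whole name.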

UNC Paths

File connectors support the Uniform Naming Convention (UNC) path format used to designate Windows network shares. However, UNC paths are not supported for other path specifications in AIE, such as the location of AIE logs or indexes. It is also possible to use a mapped network drive to specify a Windows file share as if it were a local drive. Note that scanners running on Linux hosts cannot access file content via UNC paths or local Windows paths; these scanners must run on Windows hosts.

BookFileConnectorEdit

Click Save to store the new connector.

Configuring the Workflow

In this part of the exercise, we need to create a new workflow and add a splitXml stage to it.

Create the xmlBookIngest Workflow

AIE comes with an XML ingestion workflow called xmlIngest. This workflow is almost sufficient for our needs, but it assumes that each file contains exactly one document. Our example file contains two documents, so we need to add a component to split these XML documents into separate IngestDocument messages.

Since it is generally a good practice not to modify out-of-the-box constructions like xmlIngest, we'll make a copy of it first and then modify the copy.

Navigate to System Management > Workflows > Ingest and select xmlIngest from the list. Right-click on xmlIngest and select Copy.

CloneXmlIngest

This opens the Workflow Editor for the new copy of xmlIngest. Give it a unique name, such as xmlBookIngest in this case.

Create the SplitBooksXml Transformer

With the Workflow Editor still open, click the Add New Component button.

XMLBookIngestA

This opens the New Component list. Open the Document Transformers list and select splitXml. Click the OK button.

AddSplitXmlToWorkflow

Now we're looking at a Component Editor stacked on top of the Workflow Editor. You need to fill in these fields:

  • Give the component a name, such as SplitBooksXml.
  • Resist the temptation to experiment with the input field. The XML-related transformers have default input and output channels that should not be altered.
  • The splitXml transformer needs to know how to recognize a unique document in the aggregate XML, and how to give that document a unique ID.
    • Type /catalog/book in the XPath to Document field. This is the path that identifies the beginning of a new document in the XML.
    • Type @id in the XPath to ID field. This extends the previous path to the location of the document ID, which is an attribute (@) of the book element.
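
The effect of these two settings can be approximated outside AIE. The sketch below is a conceptual stand-in for the splitXml transformer, not its actual implementation: split_documents is a hypothetical helper, and it handles only a simple absolute element path plus an attribute reference, because the Python standard library supports a limited XPath subset.

```python
import xml.etree.ElementTree as ET


def split_documents(xml_text, doc_path="/catalog/book", id_path="@id"):
    """Return one (id, xml fragment) pair per element matched by doc_path.

    doc_path must be a plain absolute path such as /catalog/book, and
    id_path must be an attribute reference such as @id.
    """
    root = ET.fromstring(xml_text)
    steps = doc_path.strip("/").split("/")
    if steps[0] != root.tag:
        return []  # the path does not start at this file's root element
    # ElementTree resolves paths relative to the root, so drop the first step.
    relative = "/".join(steps[1:]) or "."
    attribute = id_path.lstrip("@")
    documents = []
    for element in root.findall(relative):
        fragment = ET.tostring(element, encoding="unicode")
        documents.append((element.get(attribute), fragment))
    return documents


if __name__ == "__main__":
    sample = (
        "<catalog>"
        "<book id='1'><title>Oliver Twist</title></book>"
        "<book id='2'><title>Journey to the Centre of the Earth</title></book>"
        "</catalog>"
    )
    for doc_id, fragment in split_documents(sample):
        print(doc_id, fragment)
```

Each pair plays the role of one independent IngestDocument: the ID comes from the id attribute and the fragment carries the book's own XML.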

Click the Save button.

EditSplitBookXml

This closes the Component Editor dialog box, exposing the Workflow Editor again. SplitBooksXml appears as the final stage in the workflow. Use the Move Up button repeatedly to move it up to the second position.

PositionSplitBookXml

Don't close the Workflow Editor just yet.

Configure the xPathExtractor Transformer

We are not done with the Workflow Editor yet. Select the xPathExtractor stage and click the Edit Component button. This opens a new Component Editor.

SelectXPathExtractorComponent

In this editor, enter the list of AIE Schema fields (such as creationdate) that you intend to use, paired with XPath expressions that tell AIE how to find the values for each field.

For the sake of this exercise, the configuration of xPathExtractor is exactly the same as in the previous exercise.
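
For readers without the previous exercise at hand, the sketch below shows the general idea of pairing field names with XPath expressions. The field names and paths in FIELD_XPATHS are hypothetical stand-ins, not the actual mappings from the previous exercise, and extract_fields is not part of AIE.

```python
import xml.etree.ElementTree as ET

# Hypothetical field-to-XPath pairs, relative to a single <book> element.
FIELD_XPATHS = {
    "title": "title",
    "author": "author/lastname",
    "location": "location",
}


def extract_fields(book_xml, field_xpaths=FIELD_XPATHS):
    """Return {field: [values]} for every path that matches non-empty text."""
    book = ET.fromstring(book_xml)
    fields = {}
    for field, path in field_xpaths.items():
        values = [
            " ".join(element.text.split())  # collapse internal whitespace
            for element in book.findall(path)
            if element.text and element.text.strip()
        ]
        if values:
            fields[field] = values
    return fields


if __name__ == "__main__":
    book = (
        '<book id="1"><title>Oliver Twist</title>'
        "<author><firstname>Charles</firstname><lastname>Dickens</lastname></author>"
        "<location>London</location><location>England</location></book>"
    )
    print(extract_fields(book))
```

A repeated element such as <location> yields several values for one field, which is why each field maps to a list in this sketch.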

ExitXPathExtractor

Click the Save button to store the modified component. This exposes the Workflow Editor again. Click the Save button on that editor, too.

This completes the configuration of the XML-processing components of the new workflow. Note that the final stage of the workflow, the ingest subflow, sends the transformed IngestDocuments to the standard ingest workflow for linguistic processing and, ultimately, indexing.

Testing the Configuration

Erasing the Index

While testing a new connector, you will frequently need to empty the index and try again. Four methods of deleting the index are described here.

To test the configuration, use the BookFileConnector to load the file of book definitions.

Then open SAIL and search for *:* (asterisk-colon-asterisk). This retrieves all records in the index. If all has gone well, there should be two records:

 

The fact that there are two records means that the splitBookXml stage operated correctly. The fact that the title and teaser fields are displayed correctly indicates that xPathExtractor functioned for at least those two fields.

 
