Overview

The Attivio Intelligence Engine (AIE) provides simple tools for ingesting XML content.  You can configure the tools for the structure and size of the XML source, as well as for metadata-extraction requirements. How you load XML files into AIE depends on whether the XML files contain multiple documents, and on the size of the XML files being ingested.

Decision Tree

  • If the XML file contains a single document, use a File Connector and the standard xmlIngest workflow. This approach is described on the One Doc per XML File page.
  • If the XML file contains multiple documents and is smaller than 10 MB, the most efficient method is to use the File Connector with a customized workflow that includes a splitXml transformer. This procedure is described on the Multiple Docs per Small XML File page.
  • If the XML file contains multiple documents and is larger than 10 MB, it is more efficient to use the XML Connector (which incorporates the XmlScanner) with the xmlIngest standard workflow. See the Multiple Docs per Large XML File page.
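
The decision tree above can be summarized as a small function. This is only an illustrative sketch in Python, not part of AIE: the function name and the returned labels are hypothetical, "10 MB" is assumed to mean 10 * 1024 * 1024 bytes, and a file of exactly 10 MB (which the text does not cover) is treated as large.

```python
TEN_MB = 10 * 1024 * 1024  # assumption: "10 MB" is read as 10 * 1024 * 1024 bytes


def choose_xml_ingest_approach(multiple_docs: bool, size_bytes: int) -> str:
    """Return the ingestion approach that the decision tree recommends."""
    if not multiple_docs:
        # One document per file: file size does not affect the choice.
        return "File Connector + xmlIngest workflow"
    if size_bytes < TEN_MB:
        return "File Connector + custom workflow with splitXml"
    # Exactly 10 MB is not covered by the text; it is treated as large here.
    return "XML Connector (XmlScanner) + xmlIngest workflow"


print(choose_xml_ingest_approach(multiple_docs=True, size_bytes=2 * 1024 * 1024))
```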

XML Processing Workflow Components

AIE contains several components specifically designed to support XML ingestion. Most of them are part of the standard xmlIngest workflow.

ParseXml

The ParseXml stage parses XML from ContentPointers, strings, or byte arrays into a DOM object. It is typically the first stage in any workflow that handles XML, because later stages all depend on the parsed XML. In addition, the ParseXml stage supports adding namespace prefix-to-URI mappings to the DOM, which ensures that later stages can select nodes with namespace-aware XPath expressions.
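
To illustrate why prefix-to-URI mappings matter, here is a minimal sketch using Python's standard xml.etree.ElementTree module rather than AIE itself. The sample XML, the namespace URI, and the `parse_xml` helper are invented for illustration. Note that the query uses the prefix `b` while the document uses `bk`: matching is done on the URI, not on the prefix text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; the namespace URI is invented.
XML_TEXT = """<catalog xmlns:bk="http://example.com/books">
  <bk:book><bk:title>Dune</bk:title></bk:book>
  <bk:book><bk:title>Emma</bk:title></bk:book>
</catalog>"""

# Prefix-to-URI mappings supplied by the caller, independent of the
# prefixes that happen to be used inside the document.
NAMESPACES = {"b": "http://example.com/books"}


def parse_xml(source):
    """Parse XML given as str or bytes into an in-memory element tree."""
    return ET.fromstring(source)


root = parse_xml(XML_TEXT)
# Namespace-aware selection: "b" resolves to the same URI as "bk" above.
titles = [node.text for node in root.findall(".//b:title", NAMESPACES)]
print(titles)  # ['Dune', 'Emma']
```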

Content Encoding

The ParseXml stage uses the encoding scheme (character set) declared in the XML header. If the header does not specify a character set, the stage defaults to UTF-8.
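
The same rule (honor the declared encoding, otherwise assume UTF-8) is standard XML parser behavior, and can be observed with Python's xml.etree.ElementTree. This sketch is an analogy only and does not call any AIE code; the sample documents are invented.

```python
import xml.etree.ElementTree as ET

# Bytes encoded as ISO-8859-1, with a matching declaration in the XML header.
declared = '<?xml version="1.0" encoding="ISO-8859-1"?><t>caf\u00e9</t>'.encode("iso-8859-1")

# Bytes encoded as UTF-8, with no declaration: the parser assumes UTF-8.
undeclared = "<t>caf\u00e9</t>".encode("utf-8")

declared_text = ET.fromstring(declared).text
undeclared_text = ET.fromstring(undeclared).text

# Both byte sequences differ on the wire but decode to the same text.
print(declared_text == undeclared_text)
```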

SplitXml

The SplitXml stage splits a single IngestDocument containing XML into multiple IngestDocuments, each containing a subsection of the XML. For example, the SplitXml stage could take a catalog of books in a single XML record and create a separate IngestDocument for each book in the catalog. The SplitXml stage can also copy parent field values to the child records. By default, the parent XML IngestDocument is dropped from ingestion after the stage completes.
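
The splitting behavior can be sketched conceptually as follows. This is not the SplitXml implementation: documents are modeled as plain Python dictionaries, and the `split_xml` function, the field names, the child ID scheme, and the sample catalog are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical parent record: one catalog holding several books.
CATALOG = """<catalog>
  <book id="b1"><title>Dune</title></book>
  <book id="b2"><title>Emma</title></book>
</catalog>"""


def split_xml(parent_doc, child_path, copy_fields=()):
    """Return one child document per element matching child_path.

    The parent is not part of the result, mirroring the default
    behavior of dropping the parent after the split.
    """
    root = ET.fromstring(parent_doc["xml"])
    children = []
    for index, element in enumerate(root.findall(child_path)):
        child = {
            "id": f"{parent_doc['id']}-{index}",  # invented ID scheme
            "xml": ET.tostring(element, encoding="unicode").strip(),
        }
        # Copy the selected parent field values onto each child.
        for name in copy_fields:
            if name in parent_doc:
                child[name] = parent_doc[name]
        children.append(child)
    return children


parent = {"id": "catalog.xml", "xml": CATALOG, "source": "books-feed"}
children = split_xml(parent, "book", copy_fields=["source"])
for child in children:
    print(child["id"], child["source"], child["xml"])
```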

ExtractXPaths

The ExtractXPaths stage uses a series of XPath-to-field-name mappings to copy content from XML elements into IngestDocument fields.
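
As a rough illustration of XPath-to-field-name mappings, the sketch below copies element text into a dictionary of multi-valued fields. It uses Python's xml.etree.ElementTree, which supports only a subset of XPath, so it is an approximation of the idea rather than of the stage itself. The `extract_xpaths` function, the mapping, and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical child record, for example one produced by a split step.
BOOK = (
    "<book><title>Dune</title><author>Frank Herbert</author>"
    "<tag>sci-fi</tag><tag>classic</tag></book>"
)

# XPath expression -> destination field name (both invented).
XPATH_TO_FIELD = {
    "./title": "title",
    "./author": "author",
    "./tag": "tags",
}


def extract_xpaths(xml_text, mappings):
    """Copy the text of every node matched by each XPath into its field."""
    root = ET.fromstring(xml_text)
    fields = {}
    for xpath, field_name in mappings.items():
        values = [node.text for node in root.findall(xpath) if node.text]
        if values:  # leave the field unset when nothing matched
            fields[field_name] = values
    return fields


fields = extract_xpaths(BOOK, XPATH_TO_FIELD)
print(fields)
```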

DropDom

It is considered best practice to include a dropDom stage at the end of most XML workflows. Once XML parsing and extraction are complete, deleting the DOM object from the IngestDocument reduces memory footprint and network bandwidth across the system.

Incremental Updating

This connector supports the features described on the Activating Incremental Updating page. A tutorial example of incremental updating is also available.

After running the connector to ingest documents with Incremental Mode activated, be careful with any future configuration changes to the connector, as such changes can cause one or more of the following issues:

  • Some incremental changes might not be properly identified, and hence, not get ingested into AIE in future runs.
  • Some documents can remain in your index that are no longer managed by any connector. These documents can eventually become out of date and contain outdated content security permissions.

If you must change the connector configuration after running it, follow these steps to keep your system fully up to date:
1. Delete any previous documents the connector created in your AIE index.
2. Select your connector from the AIE Administrator's Connectors tab, and Reset the connector.

Rowset XML

AIE contains a default workflow for handling content in RowSet XML format. Rowset XML files are generally large, so using the XmlScanner is recommended. The default rowsetIngest workflow ingests the resulting IngestDocuments, mapping each rowset XML field to a new AttivioDocument field.
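
The sketch below shows, in general terms, why a streaming scanner suits large rowset files and how row fields can map to document fields. It is not the XmlScanner or the rowsetIngest workflow. It assumes an Oracle-style layout with a ROWSET root and ROW children, which this page does not specify, and the `iter_rowset_docs` function and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Assumed rowset layout: <ROWSET> root, one <ROW> per record.
ROWSET = b"""<ROWSET>
  <ROW><ID>1</ID><NAME>Ada</NAME></ROW>
  <ROW><ID>2</ID><NAME>Linus</NAME></ROW>
</ROWSET>"""


def iter_rowset_docs(stream, row_tag="ROW"):
    """Yield one field dictionary per row, reading the input incrementally."""
    for _event, element in ET.iterparse(stream, events=("end",)):
        if element.tag == row_tag:
            # Each child element of the row becomes a field of the same name.
            yield {child.tag: (child.text or "") for child in element}
            # Release the row's children once the consumer asks for the next
            # row; the emptied row element itself stays attached to the root.
            element.clear()


docs = list(iter_rowset_docs(io.BytesIO(ROWSET)))
print(docs)
```

For a real file, the stream would be an open file object, so the whole document never has to be held in memory as a single tree of populated rows.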