Overview
The Quick Start Tutorial makes use of default configuration files supplied with the AIE factbook module to ingest World Factbook XML content.
The World Factbook content consists of one XML file per country, and the tutorial is configured to create a single AIE document for each country. The standard FileConnector loads each file separately, and an ExtractXPaths transformer extracts specific XML elements into AIE document fields. Some of these fields require that we extend the AIE schema with additional field definitions. We also have to define a new ingestion workflow to route incoming documents from the connector to the XPath extractor, and then to the index.
The factbook demo customization files are located in AIE's installation directory tree. This is not typical. In most AIE applications, ingestion customizations are made in the project's configuration files only.
Creating a File Connector
Main article: Loading File Content
The first step in ingesting country documents is to open each file and pull the text out of it. The text is then inserted into an AttivioDocument which is routed to an ingestion workflow for further processing.
Open the Factbook demo configuration file, <install_dir>\conf\factbook\factbook.xml, and search for countryConnector. It is an instance of the FileScanner class.
<!-- This is the connector for reading country files. It is a FileScanner.
     It looks in the proper directory for XML files and then feeds the content
     to the countryXML workflow. -->
<connector name="countryConnector">
  <scanner class="com.attivio.connector.FileScanner">
    <properties>
      <property name="startDirectory" value="${factbook.content.dir}/countries" />
      <list name="wildcardFilter">
        <entry value="*.xml" />
      </list>
    </properties>
  </scanner>
  <feeder>
    <properties>
      <property name="ingestWorkflowName" value="countryXml" />
      <map name="staticFields">
        <property name="table" value="country" />
      </map>
    </properties>
  </feeder>
</connector>
The scanner section specifies where to look for the input files (startDirectory) and restricts the scan to files with the extension .xml (wildcardFilter). The content of each incoming file is packaged as an AttivioDocument.
The feeder section directs the new AttivioDocuments to the countryXml workflow, and instructs AIE to label each document by placing "country" in the table field.
If you were to create a connector like this one, you would put it in your <project-dir>\conf\connectors\ area.
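As a rough sketch, a project-level connector modeled on countryConnector might look like the following. The element and property names are copied from the factbook example above; the connector name (cityConnector), the ${myproject.content.dir} property, the workflow name (cityXml), and the table value are all hypothetical placeholders.

```xml
<!-- Hypothetical connector modeled on countryConnector.
     All names, paths, and the ${myproject.content.dir} property are placeholders. -->
<connector name="cityConnector">
  <scanner class="com.attivio.connector.FileScanner">
    <properties>
      <!-- Directory to scan; assumes a project-defined property -->
      <property name="startDirectory" value="${myproject.content.dir}/cities" />
      <!-- Only pick up XML files -->
      <list name="wildcardFilter">
        <entry value="*.xml" />
      </list>
    </properties>
  </scanner>
  <feeder>
    <properties>
      <!-- Hypothetical ingestion workflow that would receive the documents -->
      <property name="ingestWorkflowName" value="cityXml" />
      <!-- Label every document from this connector -->
      <map name="staticFields">
        <property name="table" value="city" />
      </map>
    </properties>
  </feeder>
</connector>
```

The workflow named in ingestWorkflowName would have to be defined elsewhere in the project configuration for this connector to be usable.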
Customizing an xPathExtractor
Main article: Loading XML Content
Once the file content enters the ingestion workflow, the next step is to map some of the XML elements into AIE schema fields (index fields). We do this with an instance of ExtractXPaths used as an early stage in the ingestion path.
Return to the Factbook demo configuration file. Search for the countryXPathExtractor component.
<!-- This component maps XML elements into AIE schema fields. It goes in an
     ingestion workflow between an XML File Connector and the beginning of the
     language analysis stages. -->
<component name="countryXPathExtractor" class="com.attivio.platform.transformer.ingest.xml.ExtractXPaths">
  <properties>
    <!-- NAME is an AIE schema field. VALUE is an xPath to an XML element.
         @ denotes an element property rather than an element value.-->
    <map name="xpaths">
      <property name="title" value="/country/@name" />
      <property name="country" value="/country/@name" />
      <property name="thumbnailImageUri" value="concat('/factbook_resources/flags/', /country/@abbrev, '-flag.gif')" />
      <property name="previewImageUri" value="concat('/factbook_resources/maps/', /country/@abbrev, '-map.gif')" />
      <property name="uri" value="/country/@uri" />
      <property name="teaser" value="/country/background" />
      <property name="text" value="/country/background" />
      <property name="economy" value="/country/economy/overview" />
      <property name="location" value="/country/geography/locations/location" />
      <property name="map" value="/country/geography/maps/location/map" />
      <property name="climate" value="/country/geography/climates/location/climate" />
      <property name="terrain" value="/country/geography/terrains/location/terrain" />
      <property name="resource" value="/country/geography/resources/location/resources/@name" />
      <property name="spokenLanguage" value="/country/people/languages/language" />
      <property name="religion" value="/country/people/religions/religion" />
      <property name="ethnicity" value="/country/people/ethnicities/ethnicity" />
      <property name="agriprod" value="/country/economy/agriculturalProducts/product" />
      <property name="industry" value="/country/economy/industries/industry" />
      <property name="laborForce" value="/country/economy/@laborForce" />
      <property name="inflationRate" value="/country/economy/@inflationRate" />
      <property name="unemploymentRate" value="/country/economy/@unemploymentRate" />
      <property name="publicDebt" value="/country/economy/@publicDebt" />
      <property name="gdp.purchasePowerParity" value="/country/economy/gdp/@purchasePowerParity" />
      <property name="gdp.officialExchangeRate" value="/country/economy/gdp/@officialExchangeRate" />
      <property name="gdp.growthRate" value="/country/economy/gdp/@growthRate" />
      <property name="gdp.growthRatePerCapita" value="/country/economy/gdp/@growthRatePerCapita" />
      <property name="latitude" value="/country/geography/coordinates/coord/@latitude" />
      <property name="longitude" value="/country/geography/coordinates/coord/@longitude" />
    </map>
  </properties>
</component>
Each <property> element in the "xpaths" map consists of a name (the name of an AIE index field) and a value (the XPath leading to the value to put in this field).
The component issues modified AttivioDocuments containing many individual fields, each with its appropriate snippet of content. These are put back into the workflow, where they find their way to the language-analysis stages and eventually to the index.
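To make the mapping concrete, here is a hypothetical, heavily abbreviated input file. Its element and attribute names are inferred from the XPaths in countryXPathExtractor; the values are placeholders and do not come from the real Factbook content.

```xml
<!-- Hypothetical country file; structure inferred from the XPaths above,
     all values are placeholders. -->
<country name="Exampleland" abbrev="ex" uri="http://example.com/exampleland">
  <!-- /country/background is mapped to both the teaser and text fields -->
  <background>Placeholder background text.</background>
  <!-- /country/economy/@laborForce and @inflationRate are attribute XPaths -->
  <economy laborForce="1000000" inflationRate="2.5">
    <!-- /country/economy/overview is mapped to the economy field -->
    <overview>Placeholder economic overview.</overview>
    <!-- /country/economy/gdp/@growthRate is mapped to gdp.growthRate -->
    <gdp growthRate="1.8" />
  </economy>
  <people>
    <languages>
      <!-- /country/people/languages/language selects both elements below -->
      <language>Examplish</language>
      <language>Samplese</language>
    </languages>
  </people>
</country>
```

For this input, /country/@name yields "Exampleland" for both title and country, and the concat() expression for thumbnailImageUri evaluates to /factbook_resources/flags/ex-flag.gif. How the extractor stores an XPath that matches several elements, such as the two language entries, is not covered here.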
Schema Additions
Main article: Configure the Attivio Schema
Open <install_dir>\conf\factbook\schema.xml and examine its contents. Since countryXPathExtractor uses fields that are not defined in the default schema, it is necessary to extend the schema. This can be done by defining new schema fields under a <schema> element with the merge attribute set to "true". The fields defined are added to the existing schema definition.
<schema name="default" merge="true">
  <fields>
    <field name="country" type="string" indexed="true" stored="true" facet="true" joinable="true"/>
    <field name="economy" type="string"/>
    <field name="locationDesc" type="string"/>
    <field name="map" type="string"/>
    <field name="climate" type="string"/>
    <field name="terrain" type="string"/>
    <field name="resource" type="string"/>
    <field name="spokenLanguage" type="string"/>
    <field name="religion" type="string"/>
    <field name="ethnicity" type="string"/>
    <field name="agriprod" type="string"/>
    <field name="industry" type="string"/>
    <field name="laborForce" type="long"/>
    <field name="unemploymentRate" type="float"/>
    <field name="inflationRate" type="float"/>
    <field name="publicDebt" type="float"/>
    <field name="gdp.purchasePowerParity" type="float"/>
    <field name="gdp.officialExchangeRate" type="float"/>
    <field name="gdp.growthRate" type="float"/>
    <field name="gdp.growthRatePerCapita" type="float"/>
    <!-- include all fields in the content field for keyword indexing -->
    <field name="content" type="string" indexed="true" facet="false" stored="false">
      <include-field name="title"/>
      <include-field name="author"/>
      <include-field name="text"/>
      <include-field name="*_s"/>  <!-- include all the dynamic fields in the content field -->
      <include-field name="*_nl"/> <!-- include all the natural language fields in the content field -->
      <include-field name="map"/>
      <include-field name="locationDesc"/>
      <include-field name="economy"/>
      <include-field name="climate"/>
      <include-field name="terrain"/>
      <include-field name="resource"/>
      <include-field name="spokenLanguage"/>
      <include-field name="religion"/>
      <include-field name="ethnicity"/>
      <include-field name="agriprod"/>
      <include-field name="industry"/>
      <include-field name="country"/>
    </field>
  </fields>
</schema>
It is a common practice to create a "content" field for indexing the combined content of all other fields, rather than indexing every field separately.
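The merge technique can be reduced to a minimal sketch that adds a single field. The field name capital is hypothetical and is not part of the factbook schema; the element and attribute names follow the factbook schema.xml.

```xml
<!-- Hypothetical extension: adds one field to the existing "default" schema.
     merge="true" adds to the existing definition rather than replacing it. -->
<schema name="default" merge="true">
  <fields>
    <!-- "capital" is a placeholder field name -->
    <field name="capital" type="string" indexed="true" stored="true"/>
  </fields>
</schema>
```

If the new field should also be searchable through the combined content field, it would presumably need its own include-field entry there, following the pattern in the factbook schema.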
Creating an Ingestion Workflow
Main article: Workflow Configuration
For the Factbook demo, we need a customized workflow that will connect the countryConnector to the countryXPathExtractor and then to the AIE standard ingest workflow. The ingest workflow takes the incoming AttivioDocuments through the standard suite of language-analysis stages before sending them to the index.
Return to the Factbook demo configuration file. Search for the countryXml workflow.
<workflow name="countryXml" type="ingest">
  <documentTransformer name="parseXml" />
  <documentTransformer name="countryXPathExtractor" />
  <documentTransformer name="convertRawCoord" />
  <documentTransformer name="dropDom" />
  <subflow name="ingest" />
</workflow>
This workflow directs incoming AttivioDocuments through four stages before handing them to the default ingest workflow. The parseXml stage reads the XML and creates a Document Object Model (DOM) for subsequent stages to process. The countryXPathExtractor stage, described above, populates the separate indexable fields of the document. The convertRawCoord stage converts raw latitude and longitude data into a form more suitable for AIE's geographic search feature. The dropDom stage deletes the DOM now that it is no longer needed.
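A workflow for a different XML source could follow the same shape. The sketch below assumes a hypothetical workflow named cityXml and a hypothetical extractor component named cityXPathExtractor, neither of which exists in the factbook module. It omits convertRawCoord on the assumption that the documents carry no coordinate data.

```xml
<!-- Hypothetical workflow modeled on countryXml; cityXml and
     cityXPathExtractor are placeholder names. -->
<workflow name="cityXml" type="ingest">
  <!-- Build the DOM from the raw XML -->
  <documentTransformer name="parseXml" />
  <!-- Map XML elements into schema fields (hypothetical ExtractXPaths instance) -->
  <documentTransformer name="cityXPathExtractor" />
  <!-- Discard the DOM once extraction is done -->
  <documentTransformer name="dropDom" />
  <!-- Hand off to the standard ingest workflow -->
  <subflow name="ingest" />
</workflow>
```

The stage order matters here: the extractor depends on the DOM built by parseXml, so dropDom has to come after it.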
For a broader view of AIE configuration, refer to the Configuration Guide.