Overview
This document describes the components of the Attivio platform's Advanced Text Extraction Module (ATEM) and walks through its out-of-the-box configuration.
The configuration files for the Advanced Text Extraction Module are found in <install-dir>\conf\advancedtextextraction\.
Before You Begin
Refer to the ATEM summary page for an overview of the architecture and capabilities of the Advanced Text Extraction Module. Refer to Loading File Content for the specifics of data loading using ATEM.
ATEM Workflow
The ATEM workflow definition can be found in the <install-dir>\conf\advancedtextextraction\advancedtextextraction.xml file. The workflow separates documents into two processing paths. Plain-text documents (and those whose type cannot be determined) are handled by legacy text-extraction methods. All other documents are routed to the advteConverter and advteExtractor components.
- advteConverter extracts the content of the document into a monolithic block of XML containing the original document and all included child documents, if any.
- advteExtractor mines the XML, generating IngestDocuments for each document and child document.
Legacy Document Processing
The advteConvert workflow first determines whether the document type is "Unrecognized." If so, the document is routed to the attivioTextExtraction workflow for "legacy" (simple) text extraction. Otherwise, the document is converted by advteConverter and then routed to advteExtractor in the advteExtract workflow.
If the legacy logic fails to process the document, an error is logged and the document is passed on to the next stage in the workflow with the doctype value of "Unrecognized" and mimetype set to "undetected".
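The routing decision described above can be sketched as follows. This is an illustrative Python sketch, not Attivio code: the `Doc` class, its `is_plain_text` flag, and the `route_document` function are hypothetical names introduced here; only the workflow names and the "Unrecognized" doctype come from the text.

```python
from dataclasses import dataclass

# Workflow names taken from the text above.
LEGACY_WORKFLOW = "attivioTextExtraction"
CONVERT_WORKFLOW = "advteConvert"
EXTRACT_WORKFLOW = "advteExtract"


@dataclass
class Doc:
    """Hypothetical stand-in for an IngestDocument; only the fields used here."""
    doctype: str
    is_plain_text: bool = False


def route_document(doc: Doc) -> list[str]:
    """Return the ordered list of workflows a document would visit."""
    # Plain-text documents and documents of undetermined type take the legacy path.
    if doc.is_plain_text or doc.doctype == "Unrecognized":
        return [LEGACY_WORKFLOW]
    # Everything else is converted to XML, then mined for IngestDocuments.
    return [CONVERT_WORKFLOW, EXTRACT_WORKFLOW]


print(route_document(Doc("Unrecognized")))
print(route_document(Doc("Microsoft Word 2000")))
```

In the real platform this decision is made by the workflow definition in advancedtextextraction.xml; the sketch only mirrors the two paths for readability.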
Document Conversion (advteConvert)
The advteConverter transformer component converts the content of the document into XML. This includes all "child" documents that are encountered, such as the files within a .zip archive.
The converted XML is then routed to the advteExtractor component of the advteExtract workflow.
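To make the idea of a single XML block containing a parent and its children concrete, here is a minimal Python sketch that nests the members of a .zip archive under their parent. It is not the ATEM converter: the element names `document` and `content`, the `name` attribute, and the function `convert_to_xml` are assumptions made for illustration, and real conversion handles many more formats.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def convert_to_xml(name: str, data: bytes) -> ET.Element:
    """Build one XML element for a document, nesting archive members as children."""
    node = ET.Element("document", {"name": name})
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue  # directories carry no content of their own
                # Recurse so that archives inside archives are nested as well.
                node.append(convert_to_xml(info.filename, archive.read(info)))
    else:
        # Non-archive content: decode leniently; this sketch treats it as text.
        ET.SubElement(node, "content").text = data.decode("utf-8", errors="replace")
    return node
```

Calling `ET.tostring(convert_to_xml("bundle.zip", archive_bytes), encoding="unicode")` on the result would yield one XML string holding the parent and every child, which is the shape of input the extraction stage expects in this sketch.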
The following parameters are available in the advteConverter Component Editor. To view the editor, open the AIE Administrator and navigate to System Management > Palette. Click on advteConverter.
Tab/Parameter | Description | Default value | |
---|---|---|---|
Platform Component Tab | |||
Name | Name of the component. | advteConverter | |
Type: | The underlying Java class. | Advanced Text Extraction Document Converter | |
Workflow reference in: | The workflow that hosts the component. | advteConvert | |
Source URI | Configuration file for this component. | <install-dir>/conf/advancedtextextraction/advancedtextextraction.xml | |
Number of instances | Number of parallel instances of the component to be created. | Default is blank (interpreted as equaling the number of CPUs on the computer). | |
Document Type Configuration | The document type configuration list to use. The list enumerates many document types and how Attivio should process each type. | advancedDocTypeConfig Do not change this setting. Expert feature. Most users should never change this setting. | |
Sink Field | The IngestDocument field that contains the path of the conversion results XML file | convertedfilepath | |
Annotate with Errors | Annotates the converted XML document to reflect any errors in the conversion. | false | |
Error Code Field | IngestDocument field that contains conversion error codes, if any. | conversionErrorCode | |
Error field | IngestDocument field that contains conversion error message, if any. | conversionError | |
Doc Type Field | IngestDocument field that contains the extracted document type value | doctype | |
Parent Doc Type Field | IngestDocument field that contains the extracted parent document type value | parentdoctype | |
Mime Type Field | IngestDocument field that contains the MIME type value | mimetype | |
Parent Mime Type Field | IngestDocument field that contains the extracted parent MIME type value | parentmimetype | |
Child Data Field | IngestDocument field for child document content (or contentPointer). | bytes | |
Child Document Filename Field | IngestDocument field that contains the short file name of the child document | sourcepath | |
Children Batch Size | Batch size to use with "native child" extraction. See "Store Native Children," below. | 0 (no batching) Do not change this setting. Expert feature. Most users should never change this setting. | |
Save Failed Children | If true, does not drop child documents that failed conversion. | false | |
Store Native Children | If true, stores native children in the contentStore. | false Do not change this setting. Expert feature. Most users should never change this setting. | |
Converter Location | Location of the text-to-XML converter executable. | ${attivio.home}/lib/aieadvte Do not change this setting. Expert feature. Most users should never change this setting. | |
Converter Library Path | Library path for the text-to-XML converter. | ${attivio.home}/lib Do not change this setting. Expert feature. Most users should never change this setting. | |
Option File Path | Path to the conversion options file. | advancedtextextraction\ Relative to <install-dir>\conf\. Do not change this setting. Expert feature. Most users should never change this setting. | |
Advanced Tab | |||
Document Timeout | The document conversion request timeout, in milliseconds. | ${advancedtextextraction.documentTimeout} (from <install-dir>\conf\advancedtextextraction\advancedtextextraction.properties), which defaults to 120000. Use -1 to disable timeout. | |
Drop Document on Exception | Drops the document from the ingestion process if it causes an AttivioException. | false | |
Fire Event on Timeout | Generates a system event (a log entry) when a document times out. | true | |
documentBatchSize | Number of child documents to be batched (included in a single message). | -1 (no batching) | |
Other Tab | |||
Child Document Post Processor | Optional copier of parent document fields to child documents. | defaultTeChildPostProcessor |
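
The Document Timeout default is a property placeholder rather than a literal number. The following Python sketch shows one way such a value could be resolved; it assumes the properties file uses plain `key=value` lines with `#` or `!` comments, and the function names are hypothetical. The platform performs this resolution itself, so the code is only meant to illustrate the documented semantics (default of 120000 ms, -1 disables the timeout).

```python
import re
from typing import Optional

DEFAULT_TIMEOUT_MS = 120000  # default stated in the table above


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def resolve_timeout_ms(setting: str, props: dict[str, str]) -> Optional[int]:
    """Resolve a ${name} placeholder or a literal; None means the timeout is disabled."""
    match = re.fullmatch(r"\$\{(.+)\}", setting.strip())
    if match:
        # Fall back to the documented default when the property is not set.
        raw = props.get(match.group(1), str(DEFAULT_TIMEOUT_MS))
    else:
        raw = setting
    value = int(raw)
    return None if value == -1 else value
```

For example, `resolve_timeout_ms("${advancedtextextraction.documentTimeout}", {})` falls back to the documented default, while a literal `"-1"` disables the timeout.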
Document Extraction (advteExtract)
Once the document has been converted to XML, execution flows to advteExtractor, where the XML is converted into IngestDocuments.
During this stage, advteExtractor extracts document metadata, content, hyperlinks, and any data related to child documents from the XML file. This component prepares this data for use by subsequent stage(s), typically (though not exclusively) the ingestion stage.
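As a rough illustration of mining one XML block into separate documents, the Python sketch below walks a nested XML tree and emits one record per document and child document. The XML layout (`document` elements with a `name` attribute and a `content` child), the record keys, and the function `extract_documents` are all assumptions for this example; the actual conversion XML and IngestDocument fields are defined by ATEM.

```python
import xml.etree.ElementTree as ET


def extract_documents(xml_text: str) -> list[dict]:
    """Flatten a nested <document> tree into one record per document."""
    docs: list[dict] = []

    def visit(node: ET.Element, parent_id) -> None:
        name = node.get("name", "")
        # Child IDs are derived from the parent ID so that they stay unique.
        doc_id = f"{parent_id}/{name}" if parent_id else name
        docs.append({
            "id": doc_id,
            "parent": parent_id,
            "text": node.findtext("content", default=""),
        })
        for child in node.findall("document"):
            visit(child, doc_id)

    visit(ET.fromstring(xml_text), None)
    return docs
```

Each record here stands in for an IngestDocument; the parent appears first, followed by its children in document order.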
The following parameters are available in the advteExtractor Component Editor. To view the editor, open the AIE Administrator and navigate to System Management > Palette. Click on advteExtractor.
Tab/Parameter | Description | Default value |
---|---|---|
Platform Component Tab | ||
Name | Name of the component. | advteExtractor |
Type: | The underlying Java class. | Advanced Text Extraction Text Extractor |
Workflow reference in: | The workflow that hosts the component. | advteExtract |
Source URI | Configuration file for this component. | <install-dir>/conf/advancedtextextraction/advancedtextextraction.xml |
Number of instances | Number of parallel instances of the component to be created. | Default is blank (interpreted as equaling the number of CPUs on the computer). |
Document Type Configuration | The document type configuration list to use. The list enumerates many document types and how Attivio should process each type. | advancedDocTypeConfig Do not change this setting. Attivio normally uses advancedDocTypeConfig, but falls back on legacyDocTypeConfig for plain-text documents and documents whose type cannot be determined. There is no need for the user to change this setting. |
Sink Field | The name of the field that contains the path of the conversion results XML file. | convertedfilepath |
Compress Whitespace | Compresses whitespace down to a single space before tokenization. | true |
Omit Tokens On Doc Type | Specifies whether to include structural tokens, such as paragraph boundaries, on the text field of all documents passing through the component. When set to true, a doc type lookup is executed to determine whether that option is configured. The lookup checks the DocTypeInfo configuration option includeStructureTokensOnTextField, which defaults to false out of the box for all spreadsheet doctypes to avoid adding paragraph tokens for each data cell. The option can be configured per doctype in XML by editing the installation file advancedtextextraction-doctypes.xml. Custom configuration of doc types in this manner is considered advanced. | false |
Delete SearchML Files | If true, the generated XML conversion files are deleted. | true |
Demarcate Text Elements | Boolean flag for whether units of text should be terminated with the boundary character. The boundary character clearly delineates units of text such as paragraphs, cells, and slides from each other. The ASCII 0x1F ("unit separator") character is used to demarcate these boundaries. | true |
Generate DeleteByQuery | Boolean flag for whether DeleteByQuery requests should be generated so that any stale child documents are removed before new ones are ingested. | false |
Metadata Configuration File Path | Location of the ATEM document metadata configuration file | conf\advancedtextextraction\ Do not change this setting. Expert feature. Most users should never change this setting. |
Output Content Field | Field that holds the extracted text (content) | text |
Output File Name Field | Field that contains the short file name of the document | filename |
Output File Extension Field | Field that contains the file extension, if any, of the document | fileext |
Doc Type Field | Field that contains the document type information | doctype |
Parent Doc Type Field | Field that contains the parent document type information. | parentdoctype |
Mime Type Field | Field that contains the MIME type information | mimetype |
Parent Mime Type Field | Field that contains the parent MIME type information | parentmimetype |
Advanced Tab | ||
Document Timeout | Document conversion request timeout, in milliseconds. There is no timeout if this value is not specified or is set to -1. | -1 (no timeout) |
Drop Document On Exception | Drops the document from the ingestion process if it causes an AttivioException. | false |
Fire Event on Timeout | Generates a system event (a log entry) when a document times out. | true |
Incremental Output Mode | If set to true, as child documents are added to processing results, the children will be emitted to the default workflow destination rather than being part of the result message. This prevents components which add large numbers of children from having all of those children in the same document list and potentially exhausting memory. | false |
Document Batch Size | Dictates the size of the batch at the DocumentList level, as it is being sent over by the IngestClient. | -1 (no batching) |
Other Tab | ||
Child Document Post Processor | Optional copier of parent document fields to child documents | defaultTeChildPostProcessor |
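
The Compress Whitespace and Demarcate Text Elements options in the table above can be pictured with a short Python sketch. The function names are hypothetical, and applying compression before demarcation is an assumption; the sketch only shows what "collapse whitespace to a single space" and "terminate each unit with ASCII 0x1F" mean for a list of text units.

```python
import re

UNIT_SEPARATOR = "\x1f"  # ASCII 0x1F, the boundary character named in the table

# An explicit character class is used instead of \s, because Python's \s
# also matches 0x1F and would swallow boundary characters already present.
_WHITESPACE_RUN = re.compile(r"[ \t\r\n\f\v]+")


def compress_whitespace(text: str) -> str:
    """Collapse every run of whitespace to a single space and trim the ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip(" ")


def demarcate(units: list[str], compress: bool = True) -> str:
    """Join text units, terminating each one with the boundary character."""
    if compress:
        units = [compress_whitespace(u) for u in units]
    return "".join(u + UNIT_SEPARATOR for u in units)


def split_units(text: str) -> list[str]:
    """Recover the individual units from demarcated text."""
    return [u for u in text.split(UNIT_SEPARATOR) if u]
```

Because the separator is a control character that rarely occurs in real content, paragraphs, cells, and slides can be told apart later without ambiguity.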
Configuration File Notes
The following ATEM configuration files are located in the <install-dir>\conf\advancedtextextraction\ directory.
advancedtextextraction-search-export.cfg
This configuration file contains various options that are passed to ATEM. In general, the parameters in this file should not be modified without expert advice from Professional Services, but we call your attention to the documentmemorymode setting:
    # Determines the maximum amount of memory that the chunker may use to
    # store the document's data, from 4 MB to 1 GB. The more memory the chunker has
    # available to it, the less often it needs to re-read data from the document.
    #
    # Use:
    # smallest: 4mb
    # small: 16mb
    # medium: 64mb
    # large: 256mb
    # largest: 1gb
    documentmemorymode small
Changes to this value may affect execution speed. The default is "small" (16mb), but you should experiment to find the optimum value for your document types and for the trade-off you want between memory consumption and execution speed.
Note that this value is a maximum. It allows ATEM to use up to the given amount of memory. It is not a preallocated amount.
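When experimenting with this setting, it can help to see the ceiling each mode implies. The Python sketch below reads the documentmemorymode value from configuration text and converts it to bytes. The parsing rules (one `name value` pair per line, `#` starts a comment) are inferred from the excerpt above, treating 1gb as 1024 MB is an assumption, and the function names are hypothetical.

```python
# Memory ceilings per mode in MB, as listed in the configuration file comments.
MEMORY_MODES_MB = {
    "smallest": 4,
    "small": 16,
    "medium": 64,
    "large": 256,
    "largest": 1024,
}


def read_memory_mode(cfg_text: str, default: str = "small") -> str:
    """Return the documentmemorymode value from cfg text, ignoring # comments."""
    mode = default
    for line in cfg_text.splitlines():
        parts = line.split("#", 1)[0].split()
        if len(parts) == 2 and parts[0] == "documentmemorymode":
            mode = parts[1].lower()  # the last occurrence wins in this sketch
    if mode not in MEMORY_MODES_MB:
        raise ValueError(f"unknown documentmemorymode: {mode}")
    return mode


def max_memory_bytes(mode: str) -> int:
    """Upper bound on chunker memory for a mode; a ceiling, not a preallocation."""
    return MEMORY_MODES_MB[mode] * 1024 * 1024
```

For instance, `max_memory_bytes(read_memory_mode("documentmemorymode medium"))` gives the 64 MB ceiling in bytes.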
advancedtextextraction-doctypes.xml
This configuration file acts as a registry of all the document types supported. Each document type definition contains the following pieces related to a particular document type:
- parenttype - this is typically an "umbrella", user-friendly type name that spans multiple types. For example, "Word" spans the various versions of Microsoft Word, such as Word for DOS 4.x and Microsoft Word (MAC). It does not, however, uniquely identify any specific type.
- type - this is a user-friendly, descriptive type name such as, for example, "Microsoft Word 2000". This name uniquely identifies a given doctype.
- parentmimetype - this is an "umbrella" MIME type that spans multiple doctypes. For example, "application/vnd.ms-office" is applicable to all the various Microsoft Office types. It does not, however, uniquely identify any specific type.
- mimetype - this is an industry-standard MIME type that is assigned to a given doctype, for example, "application/vnd.ms-word". Notice, however, that this value does not necessarily uniquely identify a given doctype, as in this case, "application/vnd.ms-word" applies to a number of Word doctypes.
- compound-doc-strategy - see the section on compound document handling for more details.
It is strongly recommended that you not modify any of these values except the compound document strategy. This is especially important for the type names, because they are linked to the doctype definitions and any change to them may break functionality.
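The uniqueness rules described above (type is unique, mimetype is not) can be modeled with a small Python sketch. The `DocTypeDef` class and `build_registry` function are hypothetical, and the two example entries are illustrative only: the names and MIME types are taken from the examples in the text, but they are not copied from the shipped advancedtextextraction-doctypes.xml file, and the compound-doc-strategy value is left as an empty placeholder.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DocTypeDef:
    type: str            # unique, descriptive name
    parenttype: str      # umbrella name, not unique
    mimetype: str        # industry-standard MIME type, not necessarily unique
    parentmimetype: str  # umbrella MIME type, not unique
    compound_doc_strategy: str = ""  # placeholder; real values are not listed here


# Illustrative entries assembled from the examples in the text.
EXAMPLE_DEFS = [
    DocTypeDef("Microsoft Word 2000", "Word",
               "application/vnd.ms-word", "application/vnd.ms-office"),
    DocTypeDef("Word for DOS 4.x", "Word",
               "application/vnd.ms-word", "application/vnd.ms-office"),
]


def build_registry(defs):
    """Index definitions by unique type name and group type names by MIME type."""
    by_type = {}
    by_mimetype = defaultdict(list)
    for d in defs:
        if d.type in by_type:
            raise ValueError(f"duplicate doctype name: {d.type}")
        by_type[d.type] = d
        by_mimetype[d.mimetype].append(d.type)
    return by_type, dict(by_mimetype)
```

Looking up a MIME type returns a list of type names, while looking up a type name returns exactly one definition, which is why renaming a type can break lookups that depend on it.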
advancedtextextraction-metadata.xml
This configuration file acts as a registry of all the metadata fields returned for various document types. It normalizes the various metadata property names into a consistent set of Schema field names.
It is strongly recommended that this file not be modified.
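The normalization this file performs can be pictured as a lookup from source-specific property names to a common set of field names. In the Python sketch below, every property name and field name in `FIELD_MAP` is hypothetical and chosen only for illustration; the real mappings live in advancedtextextraction-metadata.xml, and dropping unmapped properties is an assumption of this sketch.

```python
# Hypothetical mapping from raw metadata property names (lowercased)
# to normalized schema field names. Not taken from the shipped file.
FIELD_MAP = {
    "author": "author",
    "dc:creator": "author",
    "title": "title",
    "dc:title": "title",
}


def normalize_metadata(raw: dict[str, str]) -> dict[str, list[str]]:
    """Map raw property names onto schema fields, collecting values per field."""
    normalized: dict[str, list[str]] = {}
    for name, value in raw.items():
        field = FIELD_MAP.get(name.strip().lower())
        if field is None:
            continue  # unmapped properties are dropped in this sketch
        normalized.setdefault(field, []).append(value)
    return normalized
```

Two documents that report the same information under different property names end up with the same field name after this step.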
Logging and Debugging
Advanced Text Extraction log messages are written to the "Text Extraction" logging files named <data-agent>\projects\<project-name>\default\logs\logs-local\attivio.te.<node>.log.
To enable DEBUG or TRACE levels for all Advanced Text Extraction components, open the Attivio Administrator and navigate to Logging > Logging Level Settings. Set the desired level on the package com.attivio.advancedtextextraction.
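For scripts that collect the Text Extraction log, the file location can be assembled from the naming pattern given in this section. The Python sketch below assumes a Windows-style path, as used throughout this page; the function name and the example values are hypothetical.

```python
from pathlib import PureWindowsPath


def te_log_path(data_agent: str, project: str, node: str) -> PureWindowsPath:
    """Build the Text Extraction log path for a project and node."""
    return PureWindowsPath(
        data_agent, "projects", project, "default",
        "logs", "logs-local", f"attivio.te.{node}.log",
    )


# Example values are placeholders, not defaults of the platform.
print(te_log_path(r"C:\attivio-data", "demo", "local"))
```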