Overview

This document describes the components of the Attivio platform's Advanced Text Extraction Module (ATEM) and walks through its out-of-the-box configuration. 

The configuration files for the Advanced Text Extraction Module are found in <install-dir>\conf\advancedtextextraction\.

Before You Begin

Refer to the ATEM summary page for an overview of the architecture and capabilities of the Advanced Text Extraction module. Refer to Loading File Content for the specifics of data loading using ATEM.

ATEM Workflow

The ATEM workflow definition can be found in the <install-dir>\conf\advancedtextextraction\advancedtextextraction.xml file. The workflow separates documents into two processing paths. Plain-text documents (and those whose type cannot be determined) are handled by legacy text-extraction methods. All other documents are routed to the advteConverter and advteExtractor components.

  • advteConverter extracts the content of the document into a monolithic block of XML containing the original document and all included child documents, if any.
  • advteExtractor mines the XML, generating IngestDocuments for each document and child document.

[Figure: advteWorkflowDiagram]

Legacy Document Processing

The advteConvert workflow first determines whether the document type is "Unrecognized."  If so, the document is routed to the attivioTextExtraction workflow for "legacy" (simple) text extraction.  Otherwise, the document is routed to advteExtractor in the advteExtract workflow. 

If the legacy logic fails to process the document, an error is logged and the document is passed on to the next stage in the workflow with the doctype value of "Unrecognized" and mimetype set to "undetected".
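The routing decision and the failure fallback described above can be sketched as follows. This is a minimal illustration, not Attivio code: the Doc class, the choose_path and run_legacy functions, and the "legacy"/"advanced" labels are hypothetical names introduced here. Only the field names (doctype, mimetype) and the values "Unrecognized" and "undetected" come from the text.

```python
from dataclasses import dataclass, field

UNRECOGNIZED = "Unrecognized"  # doctype value named in the text


@dataclass
class Doc:
    """Hypothetical stand-in for an IngestDocument: an id plus named fields."""
    doc_id: str
    fields: dict = field(default_factory=dict)


def choose_path(doc):
    """Pick the processing path from the detected document type."""
    if doc.fields.get("doctype", UNRECOGNIZED) == UNRECOGNIZED:
        return "legacy"    # simple text extraction
    return "advanced"      # conversion to XML, then extraction


def run_legacy(doc, extract_text):
    """Run legacy extraction; on failure, log and pass the document on."""
    try:
        doc.fields["text"] = extract_text(doc)
    except Exception as exc:
        # Assumed behavior per the text: log the error, keep the document,
        # and mark it as unrecognized with an undetected MIME type.
        print(f"legacy extraction failed for {doc.doc_id}: {exc}")
        doc.fields["doctype"] = UNRECOGNIZED
        doc.fields["mimetype"] = "undetected"
    return doc
```

Note that the failing document is returned rather than dropped, which matches the statement that it is passed on to the next stage in the workflow.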

Document Conversion (advteConvert)

The advteConverter transformer component converts the content of the document into XML.  This includes all "child" documents that are encountered, such as the files within a .zip archive.

The converted XML is then routed to the advteExtractor component of the advteExtract workflow.

The following parameters are available in the advteConverter Component Editor. To view the editor, open the AIE Administrator and navigate to System Management > Palette. Click on advteConverter.

Tab/Parameter | Description | Default value

Platform Component Tab
Name | Name of the component. | advteConverter
Type | The underlying Java class. | Advanced Text Extraction Document Converter
Workflow reference in | The workflow that hosts the component. | advteConvert
Source URI | Configuration file for this component. | <install-dir>/conf/advancedtextextraction/advancedtextextraction.xml
Number of instances | Number of parallel instances of the component to be created. | Blank (interpreted as equaling the number of CPUs on the computer)
Document Type Configuration | The document type configuration list to use. The list enumerates many document types and how Attivio should process each type. | advancedDocTypeConfig. Do not change this setting (expert feature; most users should never change it).
Sink Field | The IngestDocument field that contains the path of the conversion results XML file. | convertedfilepath
Annotate with Errors | Annotates the converted XML document to reflect any errors in the conversion. | false
Error Code Field | IngestDocument field that contains conversion error codes, if any. | conversionErrorCode
Error Field | IngestDocument field that contains the conversion error message, if any. | conversionError
Doc Type Field | IngestDocument field that contains the extracted document type value. | doctype
Parent Doc Type Field | IngestDocument field that contains the extracted parent document type value. | parentdoctype
Mime Type Field | IngestDocument field that contains the MIME type value. | mimetype
Parent Mime Type Field | IngestDocument field that contains the extracted parent MIME type value. | parentmimetype
Child Data Field | IngestDocument field for child document content (or contentPointer). | bytes
Child Document Filename Field | IngestDocument field that contains the short file name of the child document. | sourcepath
Children Batch Size | Batch size to use with "native child" extraction. See "Store Native Children," below. | 0 (no batching). Do not change this setting (expert feature; most users should never change it).
Save Failed Children | If true, does not drop child documents that failed conversion. | false
Store Native Children | If true, stores native children in the contentStore. | false. Do not change this setting (expert feature; most users should never change it).
Converter Location | Location of the text-to-XML converter executable. | ${attivio.home}/lib/aieadvte. Do not change this setting (expert feature; most users should never change it).
Converter Library Path | Library path for the text-to-XML converter. | ${attivio.home}/lib. Do not change this setting (expert feature; most users should never change it).
Option File Path | Path to the conversion options file. | advancedtextextraction\advancedtextextraction-search-export.cfg (relative to <install-dir>\conf\). Do not change this setting (expert feature; most users should never change it).

Advanced Tab
Document Timeout | The document conversion request timeout, in milliseconds. | ${advancedtextextraction.documentTimeout} (from <install-dir>\conf\advancedtextextraction\advancedtextextraction.properties), which defaults to 120000. Use -1 to disable the timeout.
Drop Document on Exception | Drops the document from the ingestion process if it causes an AttivioException. | false
Fire Event on Timeout | Generates a system event (a log entry) when a document times out. | true
documentBatchSize | Number of child documents to be batched (included in a single message). | -1 (no batching)

Other Tab
Child Document Post Processor | Optional copier of parent document fields to child documents. | defaultTeChildPostProcessor
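The Document Timeout default is a ${...} placeholder resolved from a properties file, with 120000 ms as the fallback and -1 meaning "no timeout". The sketch below shows one way such a value could be resolved. It is illustrative only: parse_properties, resolve, and timeout_seconds are hypothetical helpers, and the key=value properties syntax is an assumption. Only the property name and the two special values come from the table.

```python
import re

DEFAULT_TIMEOUT_MS = 120000  # default stated for Document Timeout
TIMEOUT_PROPERTY = "advancedtextextraction.documentTimeout"


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def resolve(value, props, defaults):
    """Replace each ${name} with its property value, else its default."""
    def lookup(match):
        name = match.group(1)
        # Leave the placeholder untouched if nothing is known about it.
        return props.get(name, defaults.get(name, match.group(0)))
    return re.sub(r"\$\{([^}]+)\}", lookup, value)


def timeout_seconds(raw, props):
    """Return the timeout in seconds, or None when -1 disables it."""
    defaults = {TIMEOUT_PROPERTY: str(DEFAULT_TIMEOUT_MS)}
    millis = int(resolve(raw, props, defaults))
    return None if millis == -1 else millis / 1000.0
```

For example, timeout_seconds("${advancedtextextraction.documentTimeout}", {}) falls back to the 120000 ms default and returns 120.0, while timeout_seconds("-1", {}) returns None.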

 

Document Extraction - advteExtractor

Once the document has been converted to XML, execution flows to advteExtractor, where the XML is converted into IngestDocuments.

During this stage, advteExtractor extracts document metadata, content, hyperlinks, and any data related to child documents from the XML file. This component prepares this data for use by subsequent stage(s), typically (though not exclusively) the ingestion stage.
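The idea of mining one XML block into one record per document and child document can be sketched as follows. The XML layout here (document, meta, text, and link elements) is entirely hypothetical and is not the format that advteConverter actually produces. The record keys are likewise illustrative, and plain dictionaries stand in for IngestDocuments.

```python
import xml.etree.ElementTree as ET

# Hypothetical converted XML: an archive containing one child document.
SAMPLE = """<document name="archive.zip">
  <meta name="author">jdoe</meta>
  <text></text>
  <document name="notes.doc">
    <meta name="title">Notes</meta>
    <text>See http://example.com for details.</text>
    <link href="http://example.com"/>
  </document>
</document>"""


def extract_records(xml_text):
    """Walk the nested documents and emit one flat record per document."""
    root = ET.fromstring(xml_text)
    records = []

    def walk(node, parent_name):
        records.append({
            "filename": node.get("name"),
            "parent": parent_name,
            # findall() without a path prefix matches direct children only,
            # so a parent never picks up its children's metadata or links.
            "metadata": {m.get("name"): (m.text or "")
                         for m in node.findall("meta")},
            "text": (node.findtext("text") or "").strip(),
            "links": [link.get("href") for link in node.findall("link")],
        })
        for child in node.findall("document"):
            walk(child, node.get("name"))

    walk(root, None)
    return records
```

Each record keeps a reference to its parent, so later stages can still relate a child document to the container it came from.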

The following parameters are available in the advteExtractor Component Editor. To view the editor, open the AIE Administrator and navigate to System Management > Palette. Click on advteExtractor.

Tab/Parameter | Description | Default value

Platform Component Tab
Name | Name of the component. | advteExtractor
Type | The underlying Java class. | Advanced Text Extraction Text Extractor
Workflow reference in | The workflow that hosts the component. | advteExtract
Source URI | Configuration file for this component. | <install-dir>/conf/advancedtextextraction/advancedtextextraction.xml
Number of instances | Number of parallel instances of the component to be created. | Blank (interpreted as equaling the number of CPUs on the computer)
Document Type Configuration | The document type configuration list to use. The list enumerates many document types and how Attivio should process each type. | advancedDocTypeConfig. Do not change this setting. Attivio normally uses advancedDocTypeConfig, but falls back on legacyDocTypeConfig for plain-text documents and documents whose type cannot be determined. There is no need for the user to change this setting.
Sink Field | The name of the field that contains the path of the conversion results XML file. | convertedfilepath
Compress Whitespace | Compresses whitespace down to a single space before tokenization. | true
Omit Tokens On Doc Type | Specifies whether to include structural tokens, such as paragraph boundaries, on the text field of all documents passing through the component. When set to true, a doc type lookup determines whether the includeStructureTokensOnTextField option in DocTypeInfo is configured for that type. Out of the box, that option defaults to false for all spreadsheet doctypes, to avoid adding paragraph tokens for each data cell. It can be configured per doctype by editing the installation file advancedtextextraction-doctypes.xml. Custom configuration of doc types in this manner is considered advanced. | false
Delete SearchML Files | If true, the generated XML conversion files are deleted. | true
Demarcate Text Elements | Boolean flag for whether units of text should be terminated with the boundary character. The boundary character clearly delineates units of text such as paragraphs, cells, and slides from each other. The ASCII 0x1F ("unit separator") character is used to demarcate these boundaries. | true
Generate DeleteByQuery | Boolean flag for whether a DeleteByQuery should be generated so that any stale child documents are removed before new ones are ingested. | false
Metadata Configuration File Path | Location of the ATEM document metadata configuration file. | conf\advancedtextextraction\advancedtextextraction-metadata.xml. Do not change this setting (expert feature; most users should never change it).
Output Content Field | Field that holds the extracted text (content). | text
Output File Name Field | Field that contains the short file name of the document. | filename
Output File Extension Field | Field that contains the file extension, if any, of the document. | fileext
Doc Type Field | Field that contains the document type information. | doctype
Parent Doc Type Field | Field that contains the parent document type information. | parentdoctype
Mime Type Field | The name of the field that contains the MIME type information. | mimetype
Parent Mime Type Field | The name of the field that contains the parent MIME type information. | parentmimetype

Advanced Tab
Document Timeout | Document conversion request timeout, in milliseconds. There is no timeout if this value is not specified or is set to -1. | -1 (no timeout)
Drop Document On Exception | Drops the document from the ingestion process if it causes an AttivioException. | false
Fire Event on Timeout | Generates a system event (a log entry) when a document times out. | true
Incremental Output Mode | If set to true, as child documents are added to processing results, the children are emitted to the default workflow destination rather than being part of the result message. This prevents components that add large numbers of children from holding all of those children in the same document list and potentially exhausting memory. | false
Document Batch Size | Dictates the size of the batch at the DocumentList level, as it is being sent over by the IngestClient. | -1 (no batching)

Other Tab
Child Document Post Processor | Optional copier of parent document fields to child documents. | defaultTeChildPostProcessor
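To illustrate how the Compress Whitespace and Demarcate Text Elements settings interact, the sketch below splits extracted text on the ASCII 0x1F unit separator and collapses whitespace within each unit. The function names are hypothetical and this is not the component's actual implementation. One Python-specific detail matters: the regex class \s and str.split() with no arguments both treat 0x1F as whitespace, so the sketch uses an explicit character class to leave the boundary character intact.

```python
import re

UNIT_SEPARATOR = "\x1f"  # ASCII "unit separator", per the table above

# Explicit whitespace class: \s would also match 0x1F in Python.
_WHITESPACE_RUN = re.compile(r"[ \t\r\n\f\v]+")


def compress_whitespace(text):
    """Collapse each run of whitespace to a single space."""
    return _WHITESPACE_RUN.sub(" ", text).strip(" ")


def split_units(text):
    """Split demarcated text into units (paragraphs, cells, slides)."""
    units = (compress_whitespace(u) for u in text.split(UNIT_SEPARATOR))
    return [u for u in units if u]
```

Because each unit is terminated (rather than separated) by the boundary character, the final split yields an empty trailing piece, which the filter discards.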

Configuration File Notes

The following ATEM configuration files are located in the <install-dir>\conf\advancedtextextraction\ directory.

advancedtextextraction-search-export.cfg

This configuration file contains various options that are passed into the ATEM.  In general, the parameters in this file should not be modified without expert advice from Professional Services, but we call your attention to the documentmemorymode setting:

# Determines the maximum amount of memory that the chunker may use to
# store the document's data, from 4 MB to 1 GB. The more memory the chunker has
# available to it, the less often it needs to re-read data from the document.
#
#   Use:
#   smallest:   4mb
#   small:      16mb
#   medium:     64mb
#   large:      256mb
#   largest:    1gb

documentmemorymode small

Changes to this value may affect execution speed. The default is "small" (16mb), but you should experiment to find the best value for your document types and the trade-off you want between memory consumption and execution speed.

Note that this value is a maximum. It allows ATEM to use up to the given amount of memory. It is not a preallocated amount.
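As a rough illustration of the documentmemorymode setting, the sketch below reads the mode from configuration text and maps it to a byte ceiling. The mode names and sizes come from the excerpt above. The parsing rules (whitespace-separated "key value" lines, # comments) are inferred from that excerpt and may not cover the full file syntax, and read_memory_mode is a hypothetical helper.

```python
# Mode names and sizes as listed in the configuration file comments.
MEMORY_MODES_MB = {
    "smallest": 4,
    "small": 16,
    "medium": 64,
    "large": 256,
    "largest": 1024,
}


def read_memory_mode(cfg_text, default="small"):
    """Return (mode, max_bytes) for the documentmemorymode setting."""
    mode = default
    for line in cfg_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        parts = line.split()
        if len(parts) == 2 and parts[0] == "documentmemorymode":
            mode = parts[1].lower()
    if mode not in MEMORY_MODES_MB:
        raise ValueError(f"unknown documentmemorymode: {mode}")
    # The value is a ceiling on memory use, not a preallocated amount.
    return mode, MEMORY_MODES_MB[mode] * 1024 * 1024
```

A quick check such as read_memory_mode("documentmemorymode medium") returns ("medium", 67108864), which makes it easy to compare candidate settings against the memory available on a node.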

advancedtextextraction-doctypes.xml

This configuration file acts as a registry of all supported document types. Each document type definition contains the following elements:

  • parenttype - this is typically an "umbrella", user-friendly type name that spans multiple types. For example, "Word" spans the various versions of Microsoft Word, such as Word for DOS 4.x, Microsoft Word (MAC), etc. It does not, however, uniquely identify any specific type.
  • type - this is a user-friendly, descriptive type name, for example, "Microsoft Word 2000". This name uniquely identifies a given doctype.
  • parentmimetype - this is an "umbrella" MIME type that spans multiple doctypes. For example, "application/vnd.ms-office" is applicable to all the various Microsoft Office types. It does not, however, uniquely identify any specific type.
  • mimetype - this is an industry-standard MIME type that is assigned to a given doctype, for example, "application/vnd.ms-word". Notice, however, that this value does not necessarily uniquely identify a given doctype; in this case, "application/vnd.ms-word" applies to a number of Word doctypes.
  • compound-doc-strategy - see the section on compound document handling for more details.
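The uniqueness rules in the list above (type is unique; parenttype, mimetype, and parentmimetype are shared) can be modeled with a small in-memory registry. This is a sketch only: DocTypeEntry and DocTypeRegistry are hypothetical names, the "default" strategy value is a placeholder, and nothing here reflects the actual layout of advancedtextextraction-doctypes.xml. The sample type and MIME names are the ones quoted in the list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DocTypeEntry:
    """Hypothetical in-memory form of one document type definition."""
    type: str
    parenttype: str
    mimetype: str
    parentmimetype: str
    compound_doc_strategy: str = "default"  # placeholder value


class DocTypeRegistry:
    def __init__(self, entries):
        self._by_type = {}
        self._by_mimetype = defaultdict(list)
        for entry in entries:
            # type must be unique; mimetype may repeat across entries.
            if entry.type in self._by_type:
                raise ValueError(f"duplicate type name: {entry.type}")
            self._by_type[entry.type] = entry
            self._by_mimetype[entry.mimetype].append(entry)

    def by_type(self, name):
        """Exact lookup: a type name identifies one entry."""
        return self._by_type[name]

    def by_mimetype(self, mimetype):
        """A MIME type may map to several entries."""
        return list(self._by_mimetype.get(mimetype, []))


ENTRIES = [
    DocTypeEntry("Microsoft Word 2000", "Word",
                 "application/vnd.ms-word", "application/vnd.ms-office"),
    DocTypeEntry("Word for DOS 4.x", "Word",
                 "application/vnd.ms-word", "application/vnd.ms-office"),
]
```

Keying the primary index on type is what makes renaming a type risky: any lookup that still uses the old name would fail.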

It is strongly recommended that you not modify any of these values except the compound document strategy. This is especially important with regard to the type names, because they are linked to the doctype definitions and any modification to these names may break functionality.

advancedtextextraction-metadata.xml

This configuration file acts as a registry of all the metadata fields returned for various document types. It normalizes the various metadata property names into a consistent set of Schema field names.

It is strongly recommended that this file not be modified.
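The normalization that the metadata registry performs can be pictured as a lookup from source property names to schema field names. The sketch below is purely illustrative: every property name and field name in FIELD_MAP is invented for the example, and normalize_metadata is a hypothetical helper, not part of ATEM.

```python
# Hypothetical mapping: several source property names, one schema field.
FIELD_MAP = {
    "Author": "author",
    "Creator": "author",
    "dc:creator": "author",
    "Title": "title",
    "dc:title": "title",
}


def normalize_metadata(raw):
    """Map raw metadata properties onto a consistent set of field names."""
    normalized = {}
    for name, value in raw.items():
        schema_field = FIELD_MAP.get(name)
        if schema_field is None:
            continue  # property not in the registry: ignored in this sketch
        normalized.setdefault(schema_field, []).append(value)
    return normalized
```

Values are collected in lists because two differently named source properties can map to the same schema field.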

Logging and Debugging

Advanced Text Extraction log messages are written to the "Text Extraction" log file, <data-agent>\projects\<project-name>\default\logs\logs-local\attivio.te.<node>.log.

To enable DEBUG or TRACE levels for all Advanced Text Extraction components, open the Attivio Administrator and navigate to Logging > Logging Level Settings.  Set the desired level on the package com.attivio.advancedtextextraction.

