The out-of-the-box installation of the Attivio Intelligence Engine (AIE) includes the Advanced Text Extraction Module (ATEM), which extracts text from over 500 document types. These range from popular types such as Microsoft Office documents (Word, Excel, PowerPoint, etc.), through plain text and markup types (TXT, XML, HTML, XHTML) and email formats (message/rfc822), to compound types such as Outlook PST, ZIP, and Gzip. ATEM also extracts metadata from various image and multimedia types.
The following sections explain in detail the architecture and usage of the text extraction module.
In Linux environments, you must add libstdc++.so.5 to your system. Other environments, such as CentOS, may require g++-multilib, ia32-libs, and libstdc++5.
To run AIE on Red Hat Enterprise Linux 7, you must first install a 32-bit zlib library:
sudo yum install zlib.i686
These features require that the advancedtextextraction module be included in your project when you run createproject to create the project directories. The createproject tool adds this module to all projects by default.
The notion of text extraction, as it is supported by the ATEM, encompasses the following aspects:
- document type identification (with a user-friendly type as well as the MIME type);
- metadata extraction;
- textual content extraction;
- child document extraction (for compound types);
- header/footer extraction (where applicable);
- hyperlink extraction (where available).
Advanced Text Extraction Core Architecture
The ATEM uses Java's runtime execution (ProcessBuilder) capability to invoke an underlying document conversion executable, called aieadvte.exe (on Windows) and aieadvte (on Linux). The executable produces XML representations of its input documents. From there, ATEM takes over and extracts metadata and the text. This architecture allows for efficient, high-performance processing of a great variety of document types in a normalized, unified way.
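The invocation pattern described above can be sketched in plain Java. This is a minimal illustration, not ATEM's actual code: the executable names come from the text, but the command-line arguments of aieadvte are not documented here, so passing the input path as the only argument and reading the XML from standard output are both assumptions. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ConverterInvocationSketch {

    // Picks the executable name given in the text; the OS check itself is illustrative.
    static String executableName(String osName) {
        return osName.toLowerCase(Locale.ROOT).contains("windows") ? "aieadvte.exe" : "aieadvte";
    }

    // ASSUMPTION: the real argument list is not documented here; a single input path is a placeholder.
    static List<String> buildCommand(String osName, String inputPath) {
        return Arrays.asList(executableName(osName), inputPath);
    }

    // Runs the command and returns everything it wrote.
    // ASSUMPTION: the converter writes its XML to standard output.
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout so a single reader drains both
        Process process = builder.start();
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        try (InputStream stdout = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = stdout.read(buffer)) != -1) {
                captured.write(buffer, 0, read);
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("converter exited with code " + exitCode);
        }
        return new String(captured.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Only prints the command that would be run; no process is started here.
        System.out.println(buildCommand(System.getProperty("os.name"), "sample.doc"));
    }
}
```

Running the converter in a separate process keeps a crash or hang in the native code from taking down the JVM, which is one plausible reason for this design.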
"Legacy" text extraction handles plain-text documents and documents in unrecognizable formats. Documents recognized by ATEM are converted into XML. ATEM breaks down compound documents (such as zip files) into parent and child documents. The child documents are sent back to the beginning of the workflow as new input documents. Eventually, ATEM extracts metadata and other content from each XML document, storing the information in the fields of an IngestDocument object. The IngestDocument is submitted to AIE's input workflow for linguistic analysis and indexing.
For more information, see Configure the Advanced Text Extraction Module.
Supported Document Types
The Supported Document Types page contains the list of document types supported by the ATEM and information related to file type identification.
Recording Document Type
Document type recognition consists of analyzing the content and metadata of a given document to determine its type. The type is then recorded in multiple fields of the IngestDocument.
The following fields in the IngestDocument object being processed are populated based on the recognized document type:
- the user-friendly name for the document type (for example, "Word 6.0 or 7.0");
- the common, umbrella name for document types that belong together in one group;
- a standard, industry-adopted MIME type value;
- the "parent", umbrella MIME type which applies across several document types (for example, "application/vnd.ms-office" for Microsoft Office types);
- the file extension (for example, ".doc" for Word).
When documents are processed, the file extension may be missing, or the file path may not be available at all for document type recognition. AIE therefore uses what are known as magic numbers to detect the file type. Magic numbers are the first few bytes of a document that help uniquely identify its type. This technique is mostly relevant to binary document formats such as Word, Excel, and PDF, but it is also applied to some plain text file types such as XML.
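As a rough illustration of magic-number detection, the sketch below compares the leading bytes of a document against a few widely known signatures (PDF, ZIP, the OLE2 container used by legacy Microsoft Office files, and the XML declaration). It is not ATEM's detection logic; the class name, the method names, and the choice of returned MIME strings are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

final class MagicNumberSketch {

    // Well-known leading byte signatures.
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04}; // "PK\003\004"
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] XML = "<?xml".getBytes(StandardCharsets.US_ASCII);

    // True when content begins with exactly the bytes in signature.
    static boolean startsWith(byte[] content, byte[] signature) {
        if (content.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (content[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns a MIME type for a recognized signature, or "undetected" otherwise.
    // ASSUMPTION: the mapping to these particular strings is illustrative only.
    static String detectMimeType(byte[] content) {
        if (startsWith(content, PDF)) return "application/pdf";
        if (startsWith(content, ZIP)) return "application/zip";
        if (startsWith(content, OLE2)) return "application/vnd.ms-office";
        if (startsWith(content, XML)) return "text/xml";
        return "undetected";
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectMimeType(sample));
    }
}
```

A signature check like this needs only the first few bytes of the content, so it works even when no file name or extension is available.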
If the Advanced Text Extraction process is unable to recognize the format of a given input document, a fallback set of extraction techniques ("legacy" extraction) is used to process the document. If the fallback logic also fails to process the input document, an error is logged and the document is passed on to the next stage in the workflow with the doctype value of "Unrecognized" and mimetype set to "undetected".
It is easy to augment the workflow with a Drop stage that checks whether the document doctype is "Unrecognized" and, if so, drops the document from further ingestion processing.
By default, no attempt is made to salvage any textual content from documents that are marked with an Unrecognized filetype; however, AIE can be configured to perform basic string parsing on Unrecognized file type files by editing one of the following files:
Specifically, replace the NoOpTextExtractor in the doctype definition of the Unrecognized doctype with the DefaultTextExtractor class. The DefaultTextExtractor attempts to extract any content it deems "resembling text" from the unrecognized documents.
Using the DefaultTextExtractor may have adverse effects on performance if large binary files are marked as Unrecognized file types. This configuration change may result in ingesting large amounts of "garbage" data out of such files.
Support for International Characters
The Advanced Text Extraction module extracts the text from each source file and converts it into Unicode strings for use by other components of AIE. This Unicode conversion is usually quite straightforward.
Under certain circumstances, however, the extraction does not faithfully reconstruct the source text. Some legacy file formats predate the Unicode standard, which supports a very large number of characters and languages. Some file formats only support ANSI single-byte character set encodings, which are limited to at most 256 characters per encoding. Some legacy PDFs contain embedded fonts, and the Advanced Text Extraction module is not always able to determine the correspondence between the embedded font characters and the Unicode characters.
Hebrew and Arabic text extracted from some older PDF files may contain pieces of reversed text. The ReverseHebrew transformer (which is not configured by default) may be used to repair some Hebrew text.
The Advanced Text Extraction Document Converter transformer may be configured to convert character encoding before text extraction. See Character Encoding Conversion for more details.
If you need additional information on international character support, please contact Attivio customer support.
Supported Metadata Properties
AIE maps all of the metadata properties extracted during the text extraction process to their respective AIE schema field names. Because different document types have different metadata properties, but also share some overlapping properties (which are sometimes named differently), these properties must be normalized. The normalization is configured in the <install_dir>\conf\advancedtextextraction\advancedtextextraction-metadata.xml configuration file, which maps the native properties, specified by the 'propname' attribute, to AIE schema field names, specified by the 'fieldname' attribute.
The AIE schema fields are listed in the text extraction schema file <install_dir>\conf\advancedtextextraction\advancedtextextraction-schema.xml. These fields are automatically included in the project's schema file, <project_dir>\conf\schema\default.xml.
See Supported Fields for more information on the supported metadata properties.
Compound Document Handling
The ATEM automatically unpacks compound documents (like .zip files) and processes each child document as if it were a newly-ingested document. It unpacks nested compound documents to any depth, while recording all discovered parent-of and child-of relationships.
Types of Child Documents
The <install_dir>\conf\advancedtextextraction\advancedtextextraction-doctypes.xml configuration file allows for specifying how the ATEM processes child documents within compound documents. Compound documents are defined as documents that may contain other documents.
Three kinds of parent/child relationship are distinguished:
- entries (such as documents within an archive file);
- embeddings (such as an Excel spreadsheet embedded within a Word document);
- attachments (such as a Word document attached to an EML (message/rfc822) file).
The ATEM supports identification and extraction of all three types of child documents. By default, each child document is represented as a separate document, linked to its parent through the parentid field. Nested compound documents are also supported: a ZIP within a ZIP, for example, or an attachment to an EML file that is itself an entry in a ZIP. Each child records not only the ID of its immediate parent but also the IDs of all of its ancestor documents (the ancestorids field). This allows AIE to maintain the full chain of parentage for complex document structures.
Therefore, upon ingestion into the AIE index, each child is represented by its own record and the records are connected with the parent/child links that can be queried.
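The bookkeeping behind parentid and ancestorids can be sketched as follows. This is a simplified model, not the IngestDocument API: the Doc class, its methods, the document IDs, and the ordering of the ancestor list (outermost ancestor first, immediate parent last) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AncestryChainSketch {

    // Hypothetical stand-in for a document record; not the real IngestDocument class.
    static final class Doc {
        final String id;
        final String parentId;          // null for a top-level document
        final List<String> ancestorIds; // ASSUMPTION: outermost ancestor first, parent last

        private Doc(String id, String parentId, List<String> ancestorIds) {
            this.id = id;
            this.parentId = parentId;
            this.ancestorIds = Collections.unmodifiableList(ancestorIds);
        }

        static Doc root(String id) {
            return new Doc(id, null, new ArrayList<String>());
        }

        // A child inherits its parent's ancestor chain and appends the parent itself.
        Doc child(String childId) {
            List<String> chain = new ArrayList<String>(ancestorIds);
            chain.add(id);
            return new Doc(childId, id, chain);
        }
    }

    public static void main(String[] args) {
        Doc zip = Doc.root("archive.zip");
        Doc eml = zip.child("archive.zip/message.eml");
        Doc attachment = eml.child("archive.zip/message.eml/report.doc");
        System.out.println(attachment.parentId + " " + attachment.ancestorIds);
    }
}
```

Because every child carries the whole chain, a query can find all descendants of a given archive by matching on the ancestor list alone, without walking parent links one level at a time.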
Compound Document Processing Strategies
Fully unpacking compound documents and representing each child document with its own record may be desirable for some document types but not for others. The ATEM configuration lets you specify, on a per-doctype basis, which strategy to use for processing child documents.
For example, the definition of the ZIP File doctype is below.
<bean class="com.attivio.textextraction.api.DocTypeInfo">
  <property name="typeName" value=".ZIP/JAR File"/>
  <property name="mimeType" value="application/zip"/>
  <property name="compoundDocumentStrategy">
    <bean class="com.attivio.textextraction.api.CompoundDocumentStrategy">
      <property name="includeChildDocuments" value="true"/>
      <property name="aggregateChildDocuments" value="false"/>
      <property name="excludedChildMimeTypes">
        <list>
          <value>application/unknown</value>
        </list>
      </property>
    </bean>
  </property>
</bean>
To change how AIE handles child documents within ZIP files, modify the CompoundDocumentStrategy bean in <install_dir>\conf\advancedtextextraction\advancedtextextraction-doctypes.xml. At the project level, these settings are found in <project_dir>\conf\bean\advancedDocTypeConfig.xml.
By default, all the entry documents are extracted and processed by the system; each sub-document is extracted and processed separately and the system maintains parent/child relationship links between the parent (the ZIP) and the children (the entries). This logic is applied recursively for cases when there are ZIP files within other ZIP files.
This default processing can be altered by changing the configuration. If the includeChildDocuments property is set to false, the child documents are neither extracted nor processed; however, the parent document still contains the list of its child documents' names in its metadata extraction results.
If the includeChildDocuments property is set to true, the aggregateChildDocuments property specifies whether the content of the processed child documents is aggregated (rolled up) into the textual content extraction results of the parent document.
In addition, there is an option to specify if sub-documents should be excluded (based on their MIME type) from processing by "blacklisting" the types via the excludedChildMimeTypes list:
<property name="compoundDocumentStrategy">
  <bean class="com.attivio.textextraction.api.CompoundDocumentStrategy">
    <property name="includeChildDocuments" value="true"/>
    <property name="aggregateChildDocuments" value="false"/>
    <property name="excludedChildMimeTypes">
      <list>
        <value>application/vnd.ms-word</value>
      </list>
    </property>
  </bean>
</property>
In this case, any sub-documents that happen to be Word documents are excluded from processing altogether.
In the absence of a configured strategy on a given document type, ATEM uses the default compound document processing strategy defined at the top of the doctypes configuration XML file. This strategy is to include all child documents and aggregate their textual content into the textual content extraction results of their parent documents.
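The way the three strategy properties interact can be summarized in a small decision function. The sketch below mirrors the property names from the configuration, but the Strategy class, the Action enum, and the decide method are hypothetical; this is not the com.attivio.textextraction.api.CompoundDocumentStrategy class, and the order in which the real module evaluates these settings is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class CompoundStrategySketch {

    // What to do with one child document.
    enum Action { SKIP, PROCESS, PROCESS_AND_AGGREGATE }

    // Hypothetical holder for the three configured properties.
    static final class Strategy {
        final boolean includeChildDocuments;
        final boolean aggregateChildDocuments;
        final Set<String> excludedChildMimeTypes;

        Strategy(boolean include, boolean aggregate, String... excludedMimeTypes) {
            this.includeChildDocuments = include;
            this.aggregateChildDocuments = aggregate;
            this.excludedChildMimeTypes = Collections.unmodifiableSet(
                new HashSet<String>(Arrays.asList(excludedMimeTypes)));
        }
    }

    // Default described in the text: include all children and aggregate their content.
    static final Strategy DEFAULT = new Strategy(true, true);

    // Falls back to DEFAULT when a doctype has no configured strategy (null).
    static Action decide(Strategy configured, String childMimeType) {
        Strategy strategy = (configured != null) ? configured : DEFAULT;
        if (!strategy.includeChildDocuments) {
            return Action.SKIP;
        }
        if (strategy.excludedChildMimeTypes.contains(childMimeType)) {
            return Action.SKIP;
        }
        return strategy.aggregateChildDocuments ? Action.PROCESS_AND_AGGREGATE : Action.PROCESS;
    }

    public static void main(String[] args) {
        Strategy noWord = new Strategy(true, false, "application/vnd.ms-word");
        System.out.println(decide(noWord, "application/vnd.ms-word"));
        System.out.println(decide(noWord, "application/pdf"));
        System.out.println(decide(null, "application/pdf"));
    }
}
```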
The ATEM extracts any hyperlinks found in documents as follows:
- Any generic hyperlinks are extracted into the (multi-valued) field called "links".
- Any "mailto" links are extracted into the (multi-valued) field called "mailto".
Each extracted link is represented by an object of type Link. Each instance of the object has at least the URI field filled in but may also have the link title and/or link tooltip data, as available.
Since the Link object cannot be directly added to the AIE index, various schemes may be used for hyperlink data ingestion. ATEM includes the collapseHyperlinkPOJOsIntoStrings component, which is included in the fileIngest workflow. This transformer converts each Link object into a concatenated string value that contains the URI and, if available, the title and/or tip, separated by a single space character. The resulting string can be added to the index.
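The concatenation described above might look like the following. The Link class here is a hypothetical stand-in with three string fields, not ATEM's own Link type, and the collapse method only illustrates the "URI, then title, then tip, separated by single spaces" rule from the text; treating empty strings the same as missing values is an assumption.

```java
final class LinkCollapseSketch {

    // Hypothetical stand-in for an extracted link; title and tip may be null.
    static final class Link {
        final String uri;
        final String title;
        final String tip;

        Link(String uri, String title, String tip) {
            this.uri = uri;
            this.title = title;
            this.tip = tip;
        }
    }

    // Joins the URI with whichever of title and tip are present, using single spaces.
    static String collapse(Link link) {
        StringBuilder joined = new StringBuilder(link.uri);
        if (link.title != null && !link.title.isEmpty()) {
            joined.append(' ').append(link.title);
        }
        if (link.tip != null && !link.tip.isEmpty()) {
            joined.append(' ').append(link.tip);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse(new Link("http://example.com/", "Example", null)));
    }
}
```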
The ATEM workflow extracts any header/footer data from the documents it processes. Any extracted header values are stored in the (multi-valued) headers field. Any extracted footer values are stored in the (multi-valued) footers field.
The ATEM recognizes specific text patterns as dates and timestamps, and automatically converts them into Java date objects. The patterns are expressed in the Java SimpleDateFormat syntax. The supported patterns are:
EEE, d MMM yyyy HH:mm:ss Z
MM/dd/yyyy hh:mm aaa
EEE MM/dd/yyyy hh:mm:ss aaa
EEE, MMM dd, yyyy hh:mm a
EEE MMM dd HH:mm:ss yyyy
EEE MM/dd/yyyy hh:mm a
EEE, MMM dd, yyyyhh:mm a
EEE, MMM dd, yyyyhh:mma
EEE, dd MMM yyyy HH:mm:ss
EEE, dd MMM yyyy HH:mm:ss z
dd MMM yyyy HH:mm:ss Z
dd MMM yyyy HH:mm
MM/dd/yyyy HH:mm:ss
MM/dd/yyyy
MM/dd/yy
EEE dd-MM-yy hh:mm:ss a
These patterns are not user-extendable.
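To show how a fixed list of SimpleDateFormat patterns can be tried against a text value, here is a minimal sketch using a handful of the patterns above. It is not ATEM's implementation: the class and method names, the strict (non-lenient) parsing, the requirement that the whole string match, the English locale, and the UTC default time zone are all assumptions chosen to make the example deterministic.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.Optional;
import java.util.TimeZone;

final class DatePatternSketch {

    // A subset of the supported patterns, tried in this order.
    private static final String[] PATTERNS = {
        "EEE, dd MMM yyyy HH:mm:ss z",
        "MM/dd/yyyy HH:mm:ss",
        "dd MMM yyyy HH:mm",
        "MM/dd/yyyy"
    };

    // Returns the first successful parse, or empty when no pattern matches the whole text.
    static Optional<Date> parse(String text) {
        for (String pattern : PATTERNS) {
            // SimpleDateFormat is not thread-safe, so a fresh instance is created per attempt.
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setLenient(false);
            format.setTimeZone(TimeZone.getTimeZone("UTC")); // used when the text has no zone
            ParsePosition position = new ParsePosition(0);
            Date date = format.parse(text, position);
            // Require the pattern to consume the entire string, not just a prefix.
            if (date != null && position.getIndex() == text.length()) {
                return Optional.of(date);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("03/15/2014 10:30:00").isPresent());
        System.out.println(parse("not a date").isPresent());
    }
}
```

Checking the parse position matters because SimpleDateFormat otherwise accepts a matching prefix, so a date-only pattern would silently drop a trailing time.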
The following page provides a how-to guide for loading file content using the Advanced Text Extraction Module.
Main Article: Loading File Content