Overview
The out-of-the-box installation of the Attivio Intelligence Engine (AIE) includes the Structure Extraction Module (SEM), which extracts structural document elements from over 500 document types. Supported types range from popular formats such as Microsoft Office documents (Word, Excel, PowerPoint, etc.), to plain text and markup types (TXT, XML, HTML, XHTML), to email formats (message/rfc822), to compound types such as Outlook PST, ZIP, and Gzip. SEM also extracts metadata from various image and multimedia types.
The following sections explain in detail the architecture and usage of the module.
Required Modules
These features require that the structureextraction module be included when you run createproject to create the project directories.
Extraction Capabilities
The notion of structure extraction, as it is supported by the SEM, encompasses the following aspects:
- identification and extraction of bold text;
- identification and extraction of paragraphs;
- identification and extraction of document sections;
- identification and extraction of document headings;
- identification and extraction of document titles.
Here, a "section" means a section as typically found in a word processing document such as Microsoft Word, where sections of text are defined by applying various text breaks.
A "heading" means a heading as typically found in a word processing document: a text segment that is highlighted, bolded, increased in size, or otherwise made to stand out from the rest of the text. SEM uses a number of heuristics (described in the following sections) to identify headings.
SEM also uses a number of heuristics to detect and extract document titles which show up in search results. The following sections cover this topic in more detail.
Structure Extraction Core Architecture
The SEM uses Java's runtime execution (ProcessBuilder) capability to invoke an underlying document conversion executable, called aieadvte.exe (on Windows) and aieadvte (on Linux). The executable accepts input arguments via the standard input stream. The executable produces XML representations of its input documents. From there, SEM takes over and extracts a variety of structural document elements. This architecture allows for efficient, high-performance processing of a great variety of document types in a normalized, unified way.
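The launch pattern described above can be sketched roughly as follows. Only the executable names and the use of ProcessBuilder come from the text. The class name ConverterLauncher, its method names, and the idea that arguments are written one per line to standard input are illustrative assumptions, not SEM's actual implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the AIE API.
class ConverterLauncher {

    /** Picks the executable name for an os.name value, using the names given in the text. */
    static String executableName(String osName) {
        return osName.toLowerCase(Locale.ROOT).startsWith("windows") ? "aieadvte.exe" : "aieadvte";
    }

    /** Builds (but does not start) the converter process located in binDir. */
    static ProcessBuilder newConverterProcess(Path binDir, String osName) {
        String exe = binDir.resolve(executableName(osName)).toString();
        // Let the child's error output go to the parent's stderr so it cannot block.
        return new ProcessBuilder(exe).redirectError(ProcessBuilder.Redirect.INHERIT);
    }

    /**
     * Starts the converter, writes the arguments to its standard input
     * (ASSUMPTION: one argument per line), and returns everything it prints
     * on standard output, which the text describes as an XML representation.
     * Suitable for small argument lists only: all input is written before output is read.
     */
    static String convert(ProcessBuilder pb, String... args) throws IOException, InterruptedException {
        Process p = pb.start();
        try (OutputStream stdin = p.getOutputStream()) {
            for (String a : args) {
                stdin.write((a + System.lineSeparator()).getBytes(StandardCharsets.UTF_8));
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream stdout = p.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = stdout.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("converter exited with code " + exit);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Running the converter in a separate process keeps the conversion code isolated from the JVM, which is one plausible reason for this design.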
Structure Extraction Workflows
AIE provides a set of customizable, out-of-the-box workflows which contain all of the logic for passing input documents, extracting the results, and passing the results on to additional ingestion processing workflow stages.
Which SEM workflow you use depends on the task at hand.
Refer to the <install-dir>\conf\structureextraction\structureextraction.xml configuration file to see the definitions of the various SEM workflows.
The fileIngestWithTitles Workflow
The fileIngestWithTitles workflow comes as part of the structure extraction module. Use this workflow when you intend to perform both text extraction and structure extraction. The goal of this workflow is to extract text, metadata, and child documents, and also to provide titles for the ingested documents. The text extraction subflow achieves the first goal, and SEM takes care of the structural part (the titles).
The structureExtractionWithXmlExport-IncludeTitle Workflow
The structureExtractionWithXmlExport-IncludeTitle workflow comes as part of the structure extraction module. Use this workflow when you intend to only perform structure extraction with title detection/generation (without the text extraction). This subflow, combined with the ingest subflow, is exposed as the structureExtractionWithXmlExport-IncludeTitle-Ingest workflow.
The xmlExportBasedExtraction Workflow
The xmlExportBasedExtraction workflow comes as part of the structure extraction module. Use this workflow when you intend to only perform structure extraction, without title detection/generation (and without text extraction). This subflow, combined with the ingest subflow, is exposed as the structureExtractionWithXmlExport workflow. The goal of the xmlExportBasedExtraction workflow is to provide the ability to extract such structural elements as: bold text, paragraphs, sections, and document headings.
How Structure Extraction Works
The structural elements identified and extracted by SEM are output as an AIE token list. An AIE token list is a Java object that contains a list of tokens, with additional markup indicating the position increments and ordering of the tokens. The structural data is represented as a token list so that various stages can be applied to the same data. For instance, you could implement a stage that analyzes the extracted token list and detects and annotates a table of contents within it. The token list is output by the flexionDocExtractor transformer, whose definition can be found in the conf\structureextraction\structureextraction.xml file. The tokenListField property contains the name of the field into which the output TokenList instance is inserted once data extraction is done.
The sections below elaborate in more detail on how to work with token lists.
Identification and Extraction of Bold Text
Bold text is detected in word processing documents, spreadsheets, presentations, and PDFs. The code snippet below provides an example of how to extract segments of bold text from the output token list:
// Requires java.util.ArrayList and java.util.List, plus the AIE token classes used below.
public static List<String> getBoldText(TokenList tl, boolean compressWhitespace) {
    List<String> boldText = new ArrayList<String>();
    boolean inBold = false;
    StringBuilder bold = null;
    for (TokenIterator iter = tl.iterator(); iter.hasNext();) {
        Token tok = iter.next();
        String strTok = tok.toString();
        if (StructureExtractionUtils.isBoldStart(tok)) {
            // Start collecting a new bold segment
            bold = new StringBuilder();
            inBold = true;
        } else if (inBold && StructureExtractionUtils.isBoldEnd(tok)) {
            // End of the bold segment; the inBold guard skips an end token with no matching start
            String str = bold.toString();
            if (compressWhitespace) {
                str = StringUtils.compressWhitespace(str);
            }
            boldText.add(str);
            inBold = false;
        } else if (inBold && StructureExtractionUtils.isCharacters(tok)) {
            // Text inside the current bold segment
            bold.append(strTok);
        }
    }
    return boldText;
}
Identification and Extraction of Paragraphs
SEM provides detection and identification of document paragraphs. Paragraphs are identified for word processing documents, presentations, and PDF documents. The following snippet provides an example of how to extract paragraphs from the output token list:
public static List<String> getParagraphs(TokenList tl, boolean compressWhitespace) {
    List<String> segments = new ArrayList<String>();
    boolean inSegment = false;
    StringBuilder buff = new StringBuilder();
    for (TokenIterator iter = tl.iterator(); iter.hasNext();) {
        Token tok = iter.next();
        String strTok = tok.toString();
        if (strTok.equals(SeConstants.SE_PARAGRAPH)) {
            // Start of segment
            if (tok.containsAnnotation(TokenAnnotation.START_ELEMENT)) {
                inSegment = true;
            }
            // End of segment
            else if (tok.containsAnnotation(TokenAnnotation.END_ELEMENT)) {
                inSegment = false;
                if (buff.length() > 0) {
                    String str = buff.toString().trim();
                    if (compressWhitespace) {
                        str = StringUtils.compressWhitespace(str);
                    }
                    if (StringUtils.isNotBlank(str)) {
                        segments.add(str);
                    }
                    // Prepare for a new segment
                    buff = new StringBuilder();
                }
            }
        } else if (inSegment) {
            if (strTok.equals(SeConstants.SE_TEXTRUN) && tok.containsAnnotation(TokenAnnotation.END_ELEMENT)) {
                buff.append(' ');
            } else if (tok.containsAnnotation(TokenAnnotation.CHARACTERS)) {
                buff.append(strTok);
            }
        }
    }
    return segments;
}
Identification and Extraction of Sections
Sections are typically defined in word processing documents such as Microsoft Word, where sections of text can be split off from each other by applying various text breaks.
In SEM, sections are currently only supported for the word processing formats (e.g. MS Word). No notion of a section is supported for PDF files. Even though Excel supports insertion of a page break and PowerPoint supports the notion of a "section" of slides, neither one of these items is supported by SEM. The following section break types are supported:
- Page break;
- Column break;
- Text wrapping;
- Next page;
- Continuous;
- Even page;
- Odd page.
Microsoft Word's help documentation summarizes its support for sections and section breaks.
The following snippet provides an example of how to extract sections from the output token list:
public static List<String> getSections(TokenList tl, boolean compressWhitespace) {
    List<String> segments = new ArrayList<String>();
    boolean inSegment = false;
    StringBuilder buff = new StringBuilder();
    for (TokenIterator iter = tl.iterator(); iter.hasNext();) {
        Token tok = iter.next();
        String strTok = tok.toString();
        if (strTok.equals(SeConstants.SE_WP_SECTION)) {
            // Start of segment
            if (tok.containsAnnotation(TokenAnnotation.START_ELEMENT)) {
                inSegment = true;
            }
            // End of segment
            else if (tok.containsAnnotation(TokenAnnotation.END_ELEMENT)) {
                inSegment = false;
                if (buff.length() > 0) {
                    String str = buff.toString().trim();
                    if (compressWhitespace) {
                        str = StringUtils.compressWhitespace(str);
                    }
                    if (StringUtils.isNotBlank(str)) {
                        segments.add(str);
                    }
                    // Prepare for a new segment
                    buff = new StringBuilder();
                }
            }
        } else if (inSegment) {
            if (strTok.equals(SeConstants.SE_TEXTRUN) && tok.containsAnnotation(TokenAnnotation.END_ELEMENT)) {
                buff.append(' ');
            } else if (tok.containsAnnotation(TokenAnnotation.CHARACTERS)) {
                buff.append(strTok);
            }
        }
    }
    return segments;
}
Identification and Extraction of Document Headings
Headings, in essence, are contiguous text segments which stand out from the surrounding text and are used as visual and sometimes hyperlinked aids in structuring a document. Headings are defined in word processing documents, presentations, and PDF documents. Headings tend to be relatively short, possibly bold, and usually larger than the normal text font size.
In SEM, headings are extracted as simple Java objects. Their attributes include the actual text of the heading, the "level" of the heading (an integer value, where 1 is equivalent to H1, 2 is equivalent to H2, etc.), and an integer score value from 0 to 100. The higher the score, the more confident SEM is that a particular text segment is actually a heading.
In some cases, SEM is able to extract headings with a high degree of confidence, typically due to the fact that they are clearly marked up natively in the source documents. However, in the majority of cases, headings are extracted based on a set of heuristics. Most of this functionality is provided by the headingDetector bean which is an integral part of the SEM; its configuration can be found in the conf\structureextraction\structureextraction-beans.xml file.
How the Heading Detector Works
The numLevels parameter in the Heading Detector determines how many levels of headings to detect. By default this value is set to 9, which in essence means that headings of level/size 1 through 9 are extracted.
The initialScore parameter determines the initial score value to assign to candidate headings before they are processed with the heading detection heuristics.
The thresholdScore parameter works as follows: if the final score for a candidate heading is greater than or equal to this value then the candidate is included as an extracted heading.
The majorityOfTextPercentage parameter tells the Heading Detector how to recognize the "normal" (non-heading) font of a document. When heading candidates are evaluated, their font is taken into account. If a given font is used for at least this percentage of all text in a document, that font is assumed to be the normal font, and a candidate set in that font is therefore not treated as a heading. For example, with the parameter set to 50, a font used for 52% of the text in the document is assumed to be "normal".
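The normal-font check can be illustrated with a small sketch. It assumes that usage is measured in characters per font and that a font is identified by a simple string key. The class and method names are hypothetical, and the threshold values in the usage note are examples rather than documented defaults.

```java
import java.util.Map;

// Hypothetical illustration; the Heading Detector's internals are not shown in this document.
class NormalFontCheck {

    /**
     * Returns the key of the first font that covers at least
     * majorityOfTextPercentage percent of all characters,
     * or null if no font reaches that share.
     */
    static String findNormalFont(Map<String, Integer> charsPerFont, int majorityOfTextPercentage) {
        long total = 0;
        for (int count : charsPerFont.values()) {
            total += count;
        }
        if (total == 0) {
            return null;
        }
        for (Map.Entry<String, Integer> e : charsPerFont.entrySet()) {
            // Integer form of: count / total >= percentage / 100
            if ((long) e.getValue() * 100 >= (long) majorityOfTextPercentage * total) {
                return e.getKey();
            }
        }
        return null;
    }

    /** A candidate set in the normal font is ruled out as a heading. */
    static boolean isRuledOut(String candidateFont, String normalFont) {
        return normalFont != null && normalFont.equals(candidateFont);
    }
}
```

With 520 of 1000 characters in one font, a threshold of 50 marks that font as normal, while a threshold of 60 leaves the document without a normal font.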
The scorers parameter is the most essential parameter on the bean, as it defines the set of pluggable heading detection heuristics for the bean to run. Out of the box, SEM provides the following four scorer heuristics:
- textLengthBasedHeadingScorer;
- boldBasedHeadingScorer;
- positionBasedHeadingScorer;
- regexBasedHeadingScorer.
The textLengthBasedHeadingScorer evaluates the chances of a text segment being a heading based on the length of the segment. The heading candidate's score is decreased if it is "too short" or "too long", and the score is increased if the candidate's length is within the preconfigured boundaries.
The boldBasedHeadingScorer evaluates the chances of a text segment being a heading based on whether the segment is bold or not. The score is increased if the candidate is bold and decreased if it is not.
The positionBasedHeadingScorer evaluates the chances of a text segment being a heading based on the position of the segment within its parent, typically a text run within its parent paragraph. The candidate's score is increased if the text segment is the first in a group of segments and decreased otherwise.
The regexBasedHeadingScorer performs heading candidate scoring based on regular expressions. If the candidate matches the expression, the score is increased; otherwise it is decreased. This allows you to give a higher score, for example, to headings which start with a capital letter.
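The interplay of initialScore, thresholdScore, and the scorers can be sketched as follows. This is a minimal model under stated assumptions: the Scorer interface is a hypothetical stand-in (the methods of the real TextSegmentScorer interface are not shown in this document), every scorer adjusts the score by a fixed assumed amount of 10, and the final score is clamped to the 0 to 100 range.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-ins for illustration; not the shipped SEM classes.
class HeadingScoringSketch {

    /** A candidate text segment with the properties the scorers look at. */
    static final class Candidate {
        final String text;
        final boolean bold;
        final boolean firstInParent;

        Candidate(String text, boolean bold, boolean firstInParent) {
            this.text = text;
            this.bold = bold;
            this.firstInParent = firstInParent;
        }
    }

    /** Returns a positive or negative adjustment to the candidate's score. */
    interface Scorer {
        int adjust(Candidate c);
    }

    static final int DELTA = 10; // ASSUMED adjustment size

    static Scorer textLength(int min, int max) {
        return c -> (c.text.length() >= min && c.text.length() <= max) ? DELTA : -DELTA;
    }

    static Scorer bold() {
        return c -> c.bold ? DELTA : -DELTA;
    }

    static Scorer position() {
        return c -> c.firstInParent ? DELTA : -DELTA;
    }

    static Scorer regex(String expr) {
        Pattern p = Pattern.compile(expr);
        return c -> p.matcher(c.text).matches() ? DELTA : -DELTA;
    }

    /** Starts from initialScore, applies every scorer, and clamps the result to 0..100. */
    static int score(Candidate c, int initialScore, List<Scorer> scorers) {
        int s = initialScore;
        for (Scorer scorer : scorers) {
            s += scorer.adjust(c);
        }
        return Math.max(0, Math.min(100, s));
    }

    /** A candidate is kept when its final score is greater than or equal to thresholdScore. */
    static boolean isHeading(Candidate c, int initialScore, int thresholdScore, List<Scorer> scorers) {
        return score(c, initialScore, scorers) >= thresholdScore;
    }
}
```

In this model, a short bold segment that opens its paragraph and starts with a capital letter gains on all four scorers, whereas a long lower-case body sentence loses on most of them and falls below the threshold.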
Note that the configuration of the Heading Detector allows you to:
- pick and choose various scorers;
- adjust their individual configuration parameters;
- add your own scorers (as long as their implementation classes implement the com.attivio.structureextraction.model.TextSegmentScorer interface).
Identification and Extraction of Document Titles
The following diagram illustrates the inner workings of the title detection workflow.
The flow is as follows:
- If the input document is plain text, then title detection is based on a number of configurable document field values, in order. The following sections elaborate in more detail on this.
- Otherwise, if the document is not plain text, then structural data is extracted from the document into an AIE token list.
- If it is determined that the document is an email doc (e.g. EML), then its subject is examined. If there is a subject value, then it is given a high score as a title candidate, and the title detection process is done at that point.
- If the document is not an email doc, or it is an email doc with no subject, then document headings-based heuristics are employed. The following sections elaborate in more detail on this.
- If the headings-based heuristics generate a title with a high enough score, then the title detection process is done at that point.
- Otherwise, SEM examines the document for the presence of any natively maintained 'title' metadata property.
- If the title metadata property yields a high enough score, that value is used as the title. Otherwise, field value-based title detection is performed. The following sections elaborate in more detail on this.
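The decision flow listed above can be modeled in a few lines. This is only a sketch: the Doc and Scored types, their field names, and the single minScore cutoff used for "high enough" are assumptions made for illustration, and the results of the individual detectors are passed in as plain values rather than computed.

```java
// Hypothetical model of the decision flow; none of these types exist in AIE.
class TitleFlowSketch {

    /** A title candidate and its confidence score (0-100). */
    static final class Scored {
        final String title;
        final int score;

        Scored(String title, int score) {
            this.title = title;
            this.score = score;
        }
    }

    /** The inputs each branch of the flow looks at. */
    static final class Doc {
        boolean plainText;
        boolean email;
        String subject;          // email subject, may be null
        Scored headingsTitle;    // best headings-based candidate, may be null
        Scored metadataTitle;    // native 'title' property with its score, may be null
        String fieldValueTitle;  // fallback derived from filename, sourcepath, uri, or document ID
    }

    static String detectTitle(Doc d, int minScore) {
        // Plain text: go straight to field value-based detection.
        if (d.plainText) {
            return d.fieldValueTitle;
        }
        // Email with a subject: the subject is used and detection stops.
        if (d.email && d.subject != null && !d.subject.trim().isEmpty()) {
            return d.subject;
        }
        // Headings-based heuristics, accepted only with a high enough score.
        if (d.headingsTitle != null && d.headingsTitle.score >= minScore) {
            return d.headingsTitle.title;
        }
        // Natively maintained 'title' metadata property.
        if (d.metadataTitle != null && d.metadataTitle.score >= minScore) {
            return d.metadataTitle.title;
        }
        // Last resort: field value-based detection.
        return d.fieldValueTitle;
    }
}
```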
How the Headings-Based Title Detection Works
The preceding sections provided a walk-through of how headings are detected and extracted. Headings are a natural foundation for implementing title detection strategies.
The main element of the headings-based title extraction subflow is the headingsBasedTitleDetector stage (com.attivio.structureextraction.platform.transformer.ingest.DetectTitleFromHeadings). This stage employs a pluggable 'strategy' bean for how to pick the best title candidate from the set of detected headings (if any are present). The following three strategies are supported out of the box:
Strategy | Implementation Class | Description |
---|---|---|
biggestOfFirstNHeadings | com.attivio.structureextraction.title.BiggestOfFirstNHeadingsAsTitle | The out-of-the-box default. Picks the title as the biggest heading (in terms of character size) out of the first N (configurable value) headings in the document. If there are multiple such headings of the same size, the first such heading is picked. |
headingNumberN | com.attivio.structureextraction.title.HeadingNumberNAsTitle | Picks N-th heading within the document as the title, starting from the beginning, where N is configurable. |
highestScoredOfFirstNHeadings | com.attivio.structureextraction.title.HighestScoredOfFirstNHeadingsAsTitle | Picks the title as the heading with the highest heading detection score, out of the first N (configurable value) headings in the document. If there are multiple such headings with the same score, the first such heading is picked. |
Another important element of the headings-based detection is blacklisting. The headingsBasedTitleDetector stage allows you to blacklist (exclude) certain textual patterns of headings as title candidates. The list of 'blacklist' regular expressions is maintained in the following file: conf\structureextraction\structureextraction-titles-blacklist.txt. Expressions can be added to the list (or removed from it) as necessary, with one expression per line.
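A blacklist check of this kind might look like the sketch below. It assumes that an expression must match the whole candidate text and that blank lines are ignored. Neither behavior is confirmed by this document, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; shows one way a one-expression-per-line blacklist could be applied.
class TitleBlacklistSketch {

    /** Compiles one regular expression per non-blank line. */
    static List<Pattern> compile(List<String> lines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : lines) {
            String expr = line.trim();
            if (!expr.isEmpty()) {
                patterns.add(Pattern.compile(expr));
            }
        }
        return patterns;
    }

    /** ASSUMPTION: a candidate is excluded when any expression matches its entire text. */
    static boolean isBlacklisted(String candidate, List<Pattern> blacklist) {
        for (Pattern p : blacklist) {
            if (p.matcher(candidate).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

The lines could be loaded with java.nio.file.Files.readAllLines from the blacklist file named above and passed to compile once at startup.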
The following is the definition of the headings-based title detector component:
<!-- Detect document title based on any headings detected by flexionDocExtractor. -->
<component name="headingsBasedTitleDetector" class="com.attivio.structureextraction.platform.transformer.ingest.DetectTitleFromHeadings">
  <properties>
    <!-- This field contains the input token list with document structure data. -->
    <property name="tokenListField" value="tokenlist" />
    <!-- The name of the output title field. -->
    <property name="titleField" value="title" />
    <!-- The name of the output title score field. This field contains the score of the candidate title (integer 0-100). -->
    <property name="titleScoreField" value="title-score" />
    <!-- If the title score so far is below or equal to this value, then headings-based detection takes place, otherwise the stage is a no-op. -->
    <property name="thresholdScore" value="60" />
    <!-- The title detection strategy. This is defaulted to the "Biggest of first N headings" strategy. Also available: headingNumberN, highestScoredOfFirstNHeadings. -->
    <container-property name="titleDetectionStrategy" reference="biggestOfFirstNHeadings" />
    <!-- <container-property name="titleDetectionStrategy" reference="headingNumberN" /> -->
    <!-- <container-property name="titleDetectionStrategy" reference="highestScoredOfFirstNHeadings" /> -->
    <property name="titleBlacklistLocation" value="conf/structureextraction/structureextraction-titles-blacklist.txt" />
  </properties>
</component>
You can choose the title detection strategy by setting the titleDetectionStrategy property value.
The following are the definitions of the available strategies, which can be seen in the <install-dir>\conf\structureextraction\structureextraction-beans.xml file:
<!-- Pick the title by the ordinal number (from 1 to N) of the candidate title within the list of pre-detected headings. -->
<bean id="headingNumberN" class="com.attivio.structureextraction.title.HeadingNumberNAsTitle">
  <!-- This picks the first heading. -->
  <property name="headingNum" value="1" />
</bean>

<!-- Pick the title as the first heading with the highest score out of the first N headings, starting from the beginning of the document. -->
<bean id="highestScoredOfFirstNHeadings" class="com.attivio.structureextraction.title.HighestScoredOfFirstNHeadingsAsTitle">
  <!-- The number of first N headings from which to pick the title. -->
  <property name="firstN" value="5" />
</bean>

<!-- Pick the title as the first biggest heading out of the first N headings, starting from the beginning of the document. -->
<bean id="biggestOfFirstNHeadings" class="com.attivio.structureextraction.title.BiggestOfFirstNHeadingsAsTitle">
  <!-- The number of first N headings from which to pick the title. -->
  <property name="firstN" value="5" />
</bean>
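The selection logic of the three strategies can be approximated as shown below. This is an illustrative re-implementation, not the shipped classes: the Heading type with its size and score fields is hypothetical, and "size" is assumed to be a numeric font size where larger means bigger.

```java
import java.util.List;

// Hypothetical re-implementation for illustration; not the shipped strategy classes.
class TitleStrategySketch {

    static final class Heading {
        final String text;
        final int size;   // ASSUMED: font size, larger means bigger
        final int score;  // heading detection score, 0-100

        Heading(String text, int size, int score) {
            this.text = text;
            this.size = size;
            this.score = score;
        }
    }

    private static List<Heading> firstN(List<Heading> headings, int n) {
        return headings.subList(0, Math.max(0, Math.min(n, headings.size())));
    }

    /** biggestOfFirstNHeadings: largest size among the first N; the earliest wins ties. */
    static Heading biggestOfFirstN(List<Heading> headings, int n) {
        Heading best = null;
        for (Heading h : firstN(headings, n)) {
            if (best == null || h.size > best.size) {
                best = h;
            }
        }
        return best;
    }

    /** headingNumberN: the N-th heading, counting from 1; null if there are fewer headings. */
    static Heading headingNumberN(List<Heading> headings, int headingNum) {
        return (headingNum >= 1 && headingNum <= headings.size()) ? headings.get(headingNum - 1) : null;
    }

    /** highestScoredOfFirstNHeadings: highest score among the first N; the earliest wins ties. */
    static Heading highestScoredOfFirstN(List<Heading> headings, int n) {
        Heading best = null;
        for (Heading h : firstN(headings, n)) {
            if (best == null || h.score > best.score) {
                best = h;
            }
        }
        return best;
    }
}
```

Using a strict greater-than comparison is what makes the first of several equally ranked headings win, matching the tie-breaking rule described in the table.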
How the Metadata-Based Title Detection Works
The metadata-based title detection subflow is based on the metadataTitleBasedTitleExtractor stage (com.attivio.structureextraction.platform.transformer.ingest.DetectTitleFromMetadata). Its job is to examine the natively maintained metadata of a given document (if any) for the presence of a 'title' metadata property. If such a property is present, the stage assigns the value of the scoreToAssign parameter to the title score. Note that this stage also applies the blacklist check described earlier in the section on headings-based title detection. This prevents SEM from using values such as "PowerPoint Presentation" as titles; Microsoft PowerPoint assigns this value automatically when the user has not defined a value for the title property.
It's important to note that if you are using the Advanced Text Extraction Module (ATE) or the Basic Text Extraction Module (BTE), it is possible that for a given document, they extract a particular value to be the value of the 'title' property. In the scenario where the Structure Extraction Module is applied next, it is quite likely that the value of the title property may get overwritten by the SEM. To make sure that the title value extracted by ATE or BTE is preserved, this value is copied into a separate field called 'metadata.title'.
How the Field Value-Based Title Detection Works
The subflow which performs field value-based title detection is typically the last means employed by the title detection process. In some cases, such as with archive documents, it is the only technique applied. The idea of this subflow is to assign a title based on field values that a particular IngestDocument may have, examined in sequence. By default, the following field values, if present, are examined:
- filename;
- sourcepath;
- uri;
- document ID.
By default, the fieldValueBasedTitleExtractor stage attempts to get the value of each such field, in sequence, and to extract a short filename from it. Since a document always has at least its unique ID, even in the absence of all the other fields, a title is always extracted.
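The fallback described above can be sketched as follows. The field names follow the list in this section, but the way a "short filename" is derived (everything after the last path separator) is an assumption, as are the class and method names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the fallback; not the shipped stage.
class FieldValueTitleSketch {

    /** Fields examined in order before falling back to the document ID. */
    static final List<String> FIELDS = Arrays.asList("filename", "sourcepath", "uri");

    /** ASSUMPTION: the short name is the part after the last '/' or '\' separator. */
    static String shortName(String value) {
        String v = value.trim();
        while (v.endsWith("/") || v.endsWith("\\")) {
            v = v.substring(0, v.length() - 1);
        }
        int cut = Math.max(v.lastIndexOf('/'), v.lastIndexOf('\\'));
        return v.substring(cut + 1);
    }

    /** Returns the first usable short name, falling back to the document ID. */
    static String titleFor(String docId, Map<String, String> fields) {
        for (String name : FIELDS) {
            String value = fields.get(name);
            if (value != null) {
                String candidate = shortName(value);
                if (!candidate.isEmpty()) {
                    return candidate;
                }
            }
        }
        String fromId = shortName(docId);
        return fromId.isEmpty() ? docId : fromId;
    }
}
```

Because the document ID is always available, titleFor never returns null for a non-null ID, which mirrors the guarantee stated above.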
Example
The following steps define a procedure for ingesting some documents with titles. Note that although the File Scanner is used in this example, you can utilize the same technique with any other AIE connector.
The steps:
- Copy some documents into a particular folder on your machine, e.g. under c:\documents.
- Make sure that you have an AIE project created with the following modules: structureextraction, advancedtextextraction, and textextraction. You can also include the SAIL module in order to have the visual means of inspecting the ingested content. Refer to the Create a New Project page for details on how to create a new project.
- Run the File Scanner, pointing it to your test document folder. Refer to the Loading File Content page for more details on how to do that.
- Open SAIL and observe that the titles are displayed for your ingested documents.