Overview

The Attivio Intelligence Engine (AIE) supports feature-rich HTML text extraction when using the Advanced Text Extraction Module.

Supported Document Types

The following document types related to HTML are supported: 

| Parent document type | Document type | Parent MIME type | MIME type | Extension |
|---|---|---|---|---|
| HTML | HTML Document | N/A | text/html | .htm or .html |

Note that many files without a traditional HTML file extension (.htm or .html) may still be valid HTML documents.  For example, a file produced by a PHP web application may have a .php extension yet contain a valid HTML document.

Expected Metadata Properties

The following metadata properties are extracted, subject to availability in the document:

| Property name | Property type | Description |
|---|---|---|
| author | String | The author (as specified by the associated <meta> element). |
| description | String | The description (as specified by the associated <meta> element). |
| generator | String | The application that generated this document (as specified by the associated <meta> element). |
| doctype | String | The document type. |
| fileext | String | The extension of the original filename. |
| filename | String | The short filename of the document. |
| keywords | StringList | The keywords (as specified by the associated <meta> element). |
| mimetype | String | The MIME type of the document. |
| parentdoctype | String | The parent document type of the document. |
| parentmimetype | String | The parent MIME type of the document. |
| title | String or StringList | The title (as specified by the <title> element and/or the associated <meta> element).  If both are specified and they are not equal, this is a multi-valued field. |
| h1 | StringList | Level 1 headings |
| h2 | StringList | Level 2 headings |
| h3 | StringList | Level 3 headings |
| h4 | StringList | Level 4 headings |
| h5 | StringList | Level 5 headings |
| h6 | StringList | Level 6 headings |
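
To make the single-valued versus multi-valued behavior of the title property concrete, here is a minimal Python sketch. It is not AIE's implementation: the function name extract_metadata is hypothetical, the input is assumed to be well-formed XHTML so that the standard library XML parser can read it, keywords are assumed to be comma-separated, and the title <meta> element is assumed to be <meta name="title">.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Union

FieldValue = Union[str, List[str]]

def extract_metadata(xhtml: str) -> Dict[str, FieldValue]:
    """Collect some of the <meta>/<title> properties from well-formed XHTML."""
    root = ET.fromstring(xhtml)
    # Map each <meta name="..."> to its content attribute.
    meta = {
        (m.get("name") or "").lower(): (m.get("content") or "").strip()
        for m in root.iter("meta")
    }
    fields: Dict[str, FieldValue] = {}
    for name in ("author", "description", "generator"):
        if meta.get(name):
            fields[name] = meta[name]
    if meta.get("keywords"):
        # Assumption: keywords are separated by commas.
        fields["keywords"] = [k.strip() for k in meta["keywords"].split(",") if k.strip()]
    # title stays single-valued unless <title> and <meta name="title"> disagree.
    titles: List[str] = []
    title_element = root.find("head/title")
    if title_element is not None:
        titles.append("".join(title_element.itertext()).strip())
    titles.append(meta.get("title", ""))
    distinct = list(dict.fromkeys(t for t in titles if t))
    if distinct:
        fields["title"] = distinct[0] if len(distinct) == 1 else distinct
    return fields
```

With a <title> of "Release Notes" and a <meta name="title"> of "AIE Release Notes", the sketch yields a two-element list for title; when both carry the same text, it yields a plain string.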

 

Capabilities

When using the Advanced Text Extraction Module, documents sent to the fileIngest workflow that are identified as valid HTML documents are routed on to the htmlTextExtraction subflow.

The next few sections describe some of the capabilities of the htmlTextExtraction subflow.

Intelligent Character Encoding Detection

Before AIE extracts textual content or metadata from an HTML document, it must determine the correct character encoding of the bytes representing the HTML document.

The character encoding of an HTML document is determined from one of the following sources, in order of preference:

  1. The IngestDocument field named by the ProcessHtml transformer's inputCharsetField property ("encoding" by default).  The most common use case is a web server providing the charset as part of its HTTP response headers (e.g. Content-Type: text/html; charset=utf-8).
  2. A meta tag within the HTML document specifying the charset, such as <meta charset="character_set"> or one of its earlier variants like <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">.
  3. Heuristic algorithms using techniques like character frequency distribution scoring.

At each stage in the above preferential ordering, AIE tests the detected encoding to determine if it appears correct.  If it is not considered correct, or if it is missing, AIE moves to the next stage in the order described above.  For example, if the encoding is not specified in the IngestDocument, and the meta tag appears incorrect, then heuristic algorithms are applied to determine the correct character encoding of the HTML document.
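
The fallback order above can be sketched in Python as follows. This is an illustration under stated assumptions, not AIE's code: the function names are hypothetical, "appears correct" is approximated by checking that the bytes decode without errors, and the heuristic stage is replaced by a simple stand-in (try UTF-8, then windows-1252) rather than character frequency scoring.

```python
import codecs
import re
from typing import Optional

# Matches both <meta charset="..."> and the older http-equiv form.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def looks_valid(raw: bytes, encoding: Optional[str]) -> bool:
    """True if `encoding` names a known codec that decodes `raw` without errors."""
    if not encoding:
        return False
    try:
        codecs.lookup(encoding)
        raw.decode(encoding)
        return True
    except (LookupError, UnicodeDecodeError):
        return False

def meta_charset(raw: bytes) -> Optional[str]:
    """Charset declared in a <meta> element near the start of the document, if any."""
    match = META_CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii") if match else None

def guess_charset(raw: bytes) -> str:
    """Stand-in for the heuristic stage (NOT frequency scoring)."""
    for candidate in ("utf-8", "windows-1252"):
        if looks_valid(raw, candidate):
            return candidate
    return "latin-1"  # decodes any byte sequence

def detect_encoding(raw: bytes, declared: Optional[str] = None) -> str:
    """Stage 1: declared field; stage 2: <meta> element; stage 3: heuristics."""
    for candidate in (declared, meta_charset(raw)):
        if looks_valid(raw, candidate):
            return candidate
    return guess_charset(raw)
```

Here `declared` plays the role of the value found in the field named by inputCharsetField. A declared value that fails the check is skipped rather than trusted, which mirrors the stage-by-stage testing described above.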

Boilerplate Removal

By default, AIE applies boilerplate removal, which uses heuristic algorithms to strip text that is unrelated to the main textual content.  Examples of boilerplate include, but are not limited to, advertisements, sidebars, navigation menus, and footers.  Note that the heuristic algorithms are not 100% accurate.  They tend to work best on HTML documents whose text is organized like a news article and are less effective on other HTML documents, such as forum conversations.  To override the default behavior and have AIE extract all the text in the body element of the HTML document, you may disable boilerplate removal.

To disable boilerplate removal in your project, edit <project-dir>\conf\components\processHtml.xml to set its removeBoilerplate property to false:

<project-dir>\conf\components\processHtml.xml
<component xmlns="http://www.attivio.com/configuration/type/componentType" 
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
           name="processHtml" 
           class="com.attivio.platform.transformer.ingest.textextraction.ProcessHtml" 
           xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType 
                               http://www.attivio.com/configuration/type/componentType.xsd ">
  <!--Generated configuration-->
  <properties>
    <property name="removeBoilerplate" value="false"/>
  </properties>
</component>
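
The page does not describe which heuristics AIE uses for boilerplate removal. As a rough illustration of the general idea only, the following Python sketch applies a link-density and word-count filter, a technique commonly used by boilerplate removal tools. The function names and thresholds are hypothetical, and the input is assumed to be well-formed XHTML.

```python
import xml.etree.ElementTree as ET
from typing import List

def text_of(element: ET.Element) -> str:
    """All text inside an element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())

def link_density(block: ET.Element) -> float:
    """Share of a block's characters that sit inside <a> elements."""
    total = len(text_of(block))
    linked = sum(len(text_of(a)) for a in block.iter("a"))
    return linked / total if total else 1.0

def main_text(xhtml: str, max_link_density: float = 0.5, min_words: int = 5) -> List[str]:
    """Keep top-level <body> blocks that look like running prose."""
    body = ET.fromstring(xhtml).find("body")
    blocks = list(body) if body is not None else []
    kept = []
    for block in blocks:
        text = text_of(block)
        # Drop short blocks and blocks that consist mostly of link text.
        if len(text.split()) >= min_words and link_density(block) <= max_link_density:
            kept.append(text)
    return kept
```

In this sketch, a navigation list or a "related stories" teaser is dropped because most of its text is link text, while an article paragraph is kept. Short blocks are dropped as well, so a page made up of brief posts would lose real content under these thresholds.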

Optional HTML Sanitization

By default, AIE tries to parse the HTML document as is and is quite permissive about which HTML tags it accepts.  If, however, your HTML content has many invalid tags or unescaped special characters such as '<', '&', and '>', processing may use excessive resources and produce incorrect results.  To tell AIE to clean up non-conforming tags and sanitize your HTML on a best-effort basis, turn on the sanitize flag of the ProcessHtml transformer.  Use this option only if you think it is needed, since it may affect the content of your indexed documents and slow down ingestion.

To activate HTML sanitization in your project, edit <project-dir>\conf\components\processHtml.xml to set its sanitize property to true:

<project-dir>\conf\components\processHtml.xml
<component xmlns="http://www.attivio.com/configuration/type/componentType" 
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
           name="processHtml" 
           class="com.attivio.platform.transformer.ingest.textextraction.ProcessHtml" 
           xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType 
                               http://www.attivio.com/configuration/type/componentType.xsd ">
  <!--Generated configuration-->
  <properties>
    <property name="removeBoilerplate" value="false"/>
    <property name="sanitize" value="true" />
  </properties>
</component>

Optional HTML Class Exclusion

HTML class exclusion allows you to selectively exclude certain HTML elements from text extraction, specifically those with certain class attributes like "noindex". 

To use HTML class exclusion in your project, edit <project-dir>\conf\components\processHtml.xml and set its classesToExclude property to a list of class names.  Any HTML element whose 'class' attribute has one of these values is excluded from text extraction:

<project-dir>\conf\components\processHtml.xml
<component xmlns="http://www.attivio.com/configuration/type/componentType" 
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
           name="processHtml" 
           class="com.attivio.platform.transformer.ingest.textextraction.ProcessHtml" 
           xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType 
                               http://www.attivio.com/configuration/type/componentType.xsd ">

  <!--Generated configuration-->
  <properties>
    <property name="removeBoilerplate" value="false"/>
    <list name="classesToExclude">
      <entry value="noindex" />
      <entry value="hidden" />
      <entry value="ms-hidden" />
    </list>
  </properties>
</component>
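
The effect of class exclusion on the extracted text can be sketched in Python as follows. This is not AIE's implementation: the function names are hypothetical, the input is assumed to be well-formed XHTML, and matching against each whitespace-separated token of the 'class' attribute is an assumption about how values are compared.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Set

def visible_parts(element: ET.Element, excluded: Set[str]) -> List[str]:
    """Text fragments of an element, skipping excluded elements and their children."""
    classes = (element.get("class") or "").split()
    if excluded.intersection(classes):
        return []
    parts = [element.text or ""]
    for child in element:
        parts.extend(visible_parts(child, excluded))
        # Text that follows a child belongs to the parent, so it is kept
        # even when the child itself was excluded.
        parts.append(child.tail or "")
    return parts

def extract_text(xhtml: str, classes_to_exclude: Iterable[str]) -> str:
    """Extract text from well-formed XHTML, leaving out elements with excluded classes."""
    root = ET.fromstring(xhtml)
    parts = visible_parts(root, set(classes_to_exclude))
    return " ".join(" ".join(parts).split())
```

Called with the classes from the configuration above ("noindex", "hidden", "ms-hidden"), the sketch drops an excluded element together with everything nested inside it and keeps the surrounding text.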

Configurable Field Extraction From the HTML Document Object Model (DOM)

In addition to extracting text and metadata from HTML documents, AIE can extract text from specific elements within the HTML document's Document Object Model (DOM) and map it to fields in the resulting IngestDocument.

You configure this via the SelectFieldsFromHtmlDom transformer:

<project-dir>\conf\components\selectFieldsFromHtmlDom.xml
<!-- Customizable HTML extraction using selector syntax. -->
<component xmlns="http://www.attivio.com/configuration/type/componentType" 
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
           name="selectFieldsFromHtmlDom" 
           class="com.attivio.platform.transformer.ingest.textextraction.SelectFieldsFromHtmlDom" 
           xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType 
                               http://www.attivio.com/configuration/type/componentType.xsd ">
  <properties>
       
    <property name="htmlDomFieldName" value="html_dom" />
 
    <!-- map where 
                 key is a selector in a CSS and jquery like language 
                    documentation here:
                       http://jsoup.org/cookbook/extracting-data/selector-syntax 
                 value is the name of the field in which to populate the selected text
     -->
    <map name="selectorToFieldNameMap">
      <property name="h1" value="h1" />
      <property name="h2" value="h2" />
      <property name="h3" value="h3" />
      <property name="h4" value="h4" />
      <property name="h5" value="h5" />
      <property name="h6" value="h6" />
    </map>
 
  </properties>
</component>

By default, only h1-h6 elements are extracted.  You can add selector-to-field mappings by editing <project-dir>\conf\components\selectFieldsFromHtmlDom.xml:

<project-dir>\conf\components\selectFieldsFromHtmlDom.xml
<!-- Customizable HTML extraction using selector syntax. -->
<component xmlns="http://www.attivio.com/configuration/type/componentType" 
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
           name="selectFieldsFromHtmlDom" 
           class="com.attivio.platform.transformer.ingest.textextraction.SelectFieldsFromHtmlDom" 
           xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType 
                               http://www.attivio.com/configuration/type/componentType.xsd ">
  <properties>
       
    <property name="htmlDomFieldName" value="html_dom" />
 
    <!-- map where 
                 key is a selector in a CSS and jquery like language 
                    documentation here:
                       http://jsoup.org/cookbook/extracting-data/selector-syntax 
                 value is the name of the field in which to populate the selected text
     -->
    <map name="selectorToFieldNameMap">
      <property name="h1" value="h1" />
      <property name="h2" value="h2" />
      <property name="h3" value="h3" />
      <property name="h4" value="h4" />
      <property name="h5" value="h5" />
      <property name="h6" value="h6" />
      <!-- Extracts text in "a" elements with "href" attributes into a multi-valued field named "anchortext" -->
      <property name="a[href]" value="anchortext" />
      <!-- Extracts the "href" attributes in "a" elements into a multi-valued field named "links" -->
      <property name="a@href" value="links" />            
    </map>
 
  </properties>
</component>

The key in the selectorToFieldNameMap property is a selector written in jsoup's selector syntax; the value is the name of the field that receives the selected text.

Note that you must add any additional fields to your AIE schema if you want them indexed.
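
To show what such a mapping produces, here is a small Python sketch that emulates a tiny subset of the selectors used above. It does not use jsoup and is not AIE's implementation: the function name select_fields is hypothetical, only the forms "tag", "tag[attr]", and "tag@attr" are handled, the "tag@attr" form is read as "take the attribute value" in line with the comment in the configuration, and the input is assumed to be well-formed XHTML.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, List

# Forms supported by this sketch: "tag", "tag[attr]", "tag@attr".
SELECTOR = re.compile(r"^(\w+)(?:\[(\w+)\]|@(\w+))?$")

def select_fields(xhtml: str, selector_to_field: Dict[str, str]) -> Dict[str, List[str]]:
    """Populate multi-valued fields from elements matched by each selector."""
    root = ET.fromstring(xhtml)
    fields: Dict[str, List[str]] = {}
    for selector, field in selector_to_field.items():
        match = SELECTOR.match(selector)
        if match is None:
            raise ValueError(f"selector not supported by this sketch: {selector!r}")
        tag, required_attr, wanted_attr = match.groups()
        attr = required_attr or wanted_attr
        path = f".//{tag}[@{attr}]" if attr else f".//{tag}"
        for element in root.findall(path):
            if wanted_attr:
                value = element.get(wanted_attr, "")  # attribute value
            else:
                value = " ".join("".join(element.itertext()).split())  # element text
            if value:
                fields.setdefault(field, []).append(value)
    return fields
```

Given a page with one h1, two h2 elements, and one link, the mapping {"h1": "h1", "h2": "h2", "a[href]": "anchortext", "a@href": "links"} yields one value for h1, two for h2, the link text in anchortext, and the link target in links.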
