
Overview 

Attivio provides optical character recognition (OCR) capabilities via the OCR add-on module. OCR is the mechanical or electronic translation of images of handwritten, typewritten, or printed text (usually captured by a scanner) into machine-editable text. The module also includes intelligent character recognition (ICR) and optical mark reading (OMR). ICR extracts handwritten data from documents such as scanned forms. OMR captures human-marked data from forms such as surveys and tests.

The OCR module extracts text from scanned documents and image files. Attivio can then linguistically analyze and index that text so that the documents and files can be found by keyword search. The module can also convert scanned documents and image files into other file formats, such as Microsoft Word documents or PDFs.

Required Modules

These features require that the ocr module be included when you run createproject to create the project directories.

You will need to download the OCR module. If you cannot access the download page, contact sales@attivio.com or your Attivio Account Representative for access.

OCR Module Installation

To install the OCR module please follow the add-on module installation procedure.

OCR Licensing

To use the OCR module, you need a valid OCR module license serial number. To request a trial or production serial number, contact your sales or account representative (sales@attivio.com). Once you have a license serial number, follow the directions below.

OCR Module License Serial Numbers can only be used on one physical machine and cannot be reassigned to new machines once they are activated.

Windows

License Activation for Windows

Linux

License Activation for Linux

OCR Configuration

Once the OCR module is installed, its XML configuration is automatically included when you create a project. The OCR configuration files are:

  • <project-dir>\conf\properties\ocr\ocr.properties - the OCR module properties file. The file prompts you to set the ocr.developerSerialNumber property, but leave this value blank; Attivio sets it internally. (Note that the developer serial number is not the same as your OCR user license number.)
  • <project-dir>\conf\components\recognizeText.xml - the component that uses the recognizeText document transformer. See the Sample XML Configuration below on this page.
  • <project-dir>\conf\workflow\ingest\ocrIngest.xml - the ocrIngest workflow, which includes the recognizeText component. If needed, you can edit this workflow in the Attivio Administrator (System Management > Workflows > Ingest > ocrIngest).

    Important note on the Document Batch Size

    Set the Document Batch Size in the connector carefully. OCR processing depends on it, and finding the right value may take some initial testing. OCR is a resource-intensive operation, and an appropriate batch size helps ensure proper memory allocation and prevent OCR request timeouts. The right documentBatchSize value depends on the average size of your input documents. See the following for more information.

RecognizeText Transformer

The RecognizeText transformer performs all OCR module operations. It is instantiated by the recognizeText component in the ocrIngest workflow.


Image Formats

The OCR Module can "read" the following image formats.

| Format | Extension | Open | Save |
|---|---|---|---|
| BMP: uncompressed black and white; uncompressed gray; uncompressed color | bmp | + | + |
| DCX: 2-bit - black and white; 4- and 8-bit - gray; TrueColor | dcx | + | + |
| PCX: 2-bit - black and white; 4- and 8-bit - gray | pcx | + | + |
| PCX: TrueColor | pcx | + | - |
| PNG: black and white, gray, color | png | + | + |
| JPEG 2000: gray, color | jp2, j2c | + | + |
| JPEG: gray, color | jpg, jpeg, jfif | + | + |
| PDF (Version 1.6 or earlier) | pdf | + | + |
| TIFF: black and white - uncompressed, CCITT3, CCITT4, Packbits, ZIP; gray - uncompressed, Packbits, JPEG, ZIP; TrueColor - uncompressed, JPEG, ZIP; multi image TIFF | tif, tiff | + | + |
| TIFF: black and white - CCITT3FAX, LZW; gray - LZW; palette - uncompressed, Packbits, ZIP; TrueColor - LZW | tif, tiff | + | - |
| GIF: black and white - LZW-compressed; gray - LZW-compressed; TrueColor - LZW-compressed | gif | + | - |
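As a rough illustration of how the extension list above might be used, the sketch below pre-filters file names before they are sent for OCR. OcrImageExtensions and canOpen are hypothetical names, not part of the Attivio API, and the check looks only at the extension. It cannot tell whether a particular compression scheme or PDF version is supported.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not an Attivio class.
class OcrImageExtensions {
    // Extensions taken from the "Extension" column of the image formats table.
    static final Set<String> OPENABLE = Set.of(
            "bmp", "dcx", "pcx", "png", "jp2", "j2c",
            "jpg", "jpeg", "jfif", "pdf", "tif", "tiff", "gif");

    // Returns true when the file name ends in one of the listed extensions.
    static boolean canOpen(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return OPENABLE.contains(ext);
    }
}
```

With this sketch, canOpen("scan.TIFF") returns true and canOpen("notes.docx") returns false.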

Output Formats

The Attivio OCR module supports the following output formats for the OCR conversions:

| Format | Description |
|---|---|
| RTF | Rich Text Format document containing textual output |
| DOC | Microsoft Word 97 Document format |
| DOCX | Microsoft Word 2007 XML Document format |
| HTML | HTML format |
| XLS | Microsoft Excel (great for scanned tables) |
| PDF | Adobe PDF format (with image over text) |
| TEXT | Raw text output |
| CSV | Columnar data in CSV format (great for scanned tables) |
| XML | XML format (preserves the most information about the scanned image) |

Output Modes

| Output mode | Description |
|---|---|
| CONTENT_POINTER | Output text to a content store and store the content pointer in the field |
| STRING | Output as string to the field (requires output format TEXT to be set) |
| FILE | Output text to a file and store the filename in the field |
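The one documented constraint between these settings is that the STRING output mode requires the TEXT output format. A minimal sketch of that rule is shown below. The OcrOutputSettings class and its enums are hypothetical and only mirror the documented values. They are not Attivio types.

```java
// Hypothetical validator for illustration only; not an Attivio class.
class OcrOutputSettings {
    // Values mirror the documented output formats and output modes.
    enum OutputFormat { RTF, DOC, DOCX, HTML, XLS, PDF, TEXT, CSV, XML }
    enum OutputMode { CONTENT_POINTER, STRING, FILE }

    // STRING mode is valid only with the TEXT format; other modes accept any format.
    static boolean isValid(OutputMode mode, OutputFormat format) {
        return mode != OutputMode.STRING || format == OutputFormat.TEXT;
    }
}
```

A check like this could run before writing outputMode and outputFormat into recognizeText.xml, so that a combination such as STRING with PDF is caught early.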

International Language Support

By default, the RecognizeText transformer assumes input documents are in English. If a document is not in English, the locale set on the input field or input document is used to determine the language during processing. To change the default language from English, set the "defaultLanguage" property in the RecognizeText configuration to a value accepted as the language parameter of the java.util.Locale constructor.

When specifying the locale on a field or document, use the ISO 639 two-letter codes. For example, for Danish, in Java, use 'new Locale("da")'.
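To make the two-letter code requirement concrete, here is a small sketch that checks a candidate language code against the ISO 639 codes known to the JDK. LanguageCodeCheck is a hypothetical helper, not an Attivio class. It uses only java.util.Locale and does not verify that the OCR engine actually supports the language.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper for illustration only; not an Attivio class.
class LanguageCodeCheck {
    // Returns true when the code is one of the JDK's ISO 639 two-letter language codes.
    static boolean isIso639TwoLetter(String code) {
        if (code == null) {
            return false;
        }
        String normalized = code.toLowerCase(Locale.ROOT);
        return Arrays.asList(Locale.getISOLanguages()).contains(normalized);
    }
}
```

For example, isIso639TwoLetter("da") returns true, while isIso639TwoLetter("danish") returns false.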

Sample XML Configuration

The following snippet illustrates how the transformer can be configured by editing the recognizeText.xml configuration file. It should not be necessary to modify these configurations unless instructed to do so by Attivio:

<project-dir>\conf\components\recognizeText.xml
<component xmlns="http://www.attivio.com/configuration/type/componentType" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="recognizeText" class="com.attivio.ocr.transformer.ingest.document.RecognizeText" xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType http://www.attivio.com/configuration/type/componentType.xsd ">
  <!--Generated configuration-->
  <performance maxInstances="1"/>
  <properties>
    <property name="developerSerialNumber" value="${ocr.developerSerialNumber}"/>
    <map name="environment">
        <property name="TMP" value="C:\Users\username\AppData\Local\Temp\" />
        <property name="SystemDrive" value="C:" />
        <property name="PUBLIC" value="C:\Users\Public" />
    </map>
  </properties>
</component>

The environment map cannot be edited directly through the Attivio Administrator, and must be added to recognizeText.xml in a text editor.  Note that the TMP property must point to a directory where Attivio has permission to write files.
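Because the TMP directory must be writable, it can help to verify that before starting the system. The following is a minimal sketch using standard java.nio calls. TmpDirCheck is a hypothetical helper, not an Attivio class, and the result reflects the permissions of the account that runs the check. Run it under the same account that Attivio uses.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not an Attivio class.
class TmpDirCheck {
    // Returns true when the path exists, is a directory, and is writable
    // by the account running this check.
    static boolean isUsableTmpDir(String dir) {
        Path path = Paths.get(dir);
        return Files.isDirectory(path) && Files.isWritable(path);
    }
}
```

Pass the same value that is configured for the TMP property in recognizeText.xml.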

Feeding Content for OCR Processing

To feed content for OCR processing, have a connector or API client send documents to the ocrIngest workflow.

Open the Attivio Administrator and navigate to System Management > Connectors. Click the New button and select a FileConnector. This opens a FileScanner editor.

You must set the Connector Name, Start Directory, and Ingest Workflow values before saving the connector.  The Ingest Workflow must be set to ocrIngest.  No further configuration is required.

Use the Attivio Administrator > Query > Debug Search page to check the results.  Use the Legacy XML output mode and search for *:*.  You should find OCR-generated text in the document's text field.

Forms Processing with OCR/ICR

The OCR module supports extracting text from handwritten or typed forms.

OCR forms processing is supported only on Windows.

Template Development

Form processing templates are required for Forms Processing and must be created in the ABBYY FormReader and/or FlexiCapture Studio applications.

ABBYY FlexiCapture Studio

ABBYY FlexiCapture Studio is a software application that allows you to create formalized descriptions (FlexiLayouts) of documents with variable layouts (so-called semi-structured forms). The created FlexiLayouts are then used by forms processing applications to capture data from the documents which they describe. FlexiCapture Studio is built on groundbreaking IPA technology which imitates the way living beings recognize objects.

Semi-structured forms are different from structured forms in that the exact location of the fields on such documents is not known in advance. For this reason, a data capture application needs additional information about the data fields in order to locate them on the forms and read the information they contain. ABBYY's FlexiCapture technology allows you to create formalized descriptions which tell the data capture program how and where to find particular fields. Used in conjunction with a data capture application, e.g. ABBYY FormReader, ABBYY FlexiCapture Studio will allow you to automate capturing data from such semi-structured forms as invoices, order forms, and many others.

Note that FlexiLayouts created using FlexiCapture Studio are imported into the ABBYY FormReader application. FormReader is also used for developing templates for the more static, non-flexible forms.

ABBYY FormReader

ABBYY FormReader is a powerful data capturing application for extracting information from printed forms and exporting it to databases and information systems. ABBYY FormReader is based on OCR, ICR, and OMR technologies.

The data collection process consists of two stages:

  • Preparation stage (creating, distributing, and collecting forms);
  • Form processing stage.

ABBYY FormReader takes care of the most taxing and labor-intensive part of the process, namely the extraction of data from filled-out forms, thereby freeing the user from hours of manual input.

Both template authoring applications are used to create templates. Once created, the templates can be referenced in the RecognizeText transformer's configuration and are loaded by the OCR engine at runtime to perform form recognition and data extraction.

You can create a single form processing batch containing one or more templates. This gives you one packaged solution for recognizing and extracting data from related forms.

Forms Processing Example

The following XML snippet provides an illustration of how forms processing can be achieved by using the RecognizeText transformer:

<project-dir>\conf\components\recognizeText.xml
<component name="recognizeText"
    class="com.attivio.ocr.transformer.ingest.document.RecognizeText">
    <!-- Note: maxInstances defaults to 1 based upon licensing requirements, but can be increased depending on your individual license terms. -->
    <performance maxInstances="1" />
    <properties>
        <!-- The name of the field for storing extracted text -->
        <property name="output" value="text" />

        <!-- See section on output modes -->
        <property name="outputMode" value="STRING" />

        <!-- See section on output formats -->
        <property name="outputFormat" value="TEXT" />

        <!-- Set these environment variables as appropriate for each system -->
        <map name="environment">
          <property name="TMP" value="C:\Users\userMachineName\AppData\Local\Temp\tmp" />
          <property name="SystemDrive" value="C:" />
          <property name="PUBLIC" value="C:\Users\Public" />
        </map>

        <!-- The location of the form batch to load -->
        <property name="formTemplate" value="formbatch/Acme/InsuranceClaimsBatch.frm" />
    </properties>
</component>

Forms Processing Output

By default, the form processing output is put into the formXml field in the AttivioDocument as a parsed org.dom4j.Document object.

For a given processed form, the status value may be one of the following: OK, NOTFORM, ERROR. The following is a sample XML output for one particular form. Note that custom Attivio transformers can be developed to process this output, with each form field name becoming the name of an Attivio field.

<?xml version="1.0" encoding="utf-8" ?>
<pages>
  <page no="0" status="OK">
    <text name="Day"><para quality="0.50" suspicious="0">10</para></text>
    <text name="Month"><para quality="1.00">06</para></text>
    <text name="Year"><para quality="1.00">2002</para></text>
    <text name="LastName"><para quality="1.00">TROUT</para></text>
    <text name="FirstName"><para quality="1.00">HELEN</para></text>
    <text name="Patronymic"/>
    <checkMarkGroup name="Status">
      <checkmark name="Married" checked="true"/>
      <checkmark name="Single" checked="false"/>
      <checkmark name="Divorced" checked="false"/>
    </checkMarkGroup>
    <text name="Age"><para quality="1.00">30</para></text>
    <text name="CityCode"><para quality="1.00">095</para></text>
    <text name="Phone"><para quality="0.86" suspicious="6">7980324</para></text>
    <text name="E-mail"><para quality="0.96" suspicious="19">TR_HELEN@boydline.Co.uk</para></text>
    <text name="Other"><para quality="0.86" suspicious="1">AUSTRIA</para></text>
    <checkMarkGroup name="HowOftenDoYouBuyGas">
      <checkmark name="Once a week or more often" checked="true"/>
      <checkmark name="Less often than once a week" checked="false"/>
    </checkMarkGroup>
  </page>
</pages>
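The sketch below shows one way such output could be flattened into field name and value pairs, which is the kind of mapping a custom transformer might perform. It is an illustration under assumptions: FormXmlFields is a hypothetical helper, not an Attivio class, and it parses an XML string with the JDK's built-in DOM parser rather than reading the org.dom4j.Document stored in the formXml field. Pages whose status is not OK are skipped, and each checkMarkGroup is mapped to the name of its checked option.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not an Attivio class.
class FormXmlFields {
    // Maps each form field name to its recognized value for pages with status="OK".
    static Map<String, String> extract(String formXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(formXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse form XML", e);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        NodeList pages = doc.getElementsByTagName("page");
        for (int p = 0; p < pages.getLength(); p++) {
            Element page = (Element) pages.item(p);
            if (!"OK".equals(page.getAttribute("status"))) {
                continue; // skip NOTFORM and ERROR pages
            }
            // Text fields: the value is the text of the nested <para> elements.
            NodeList texts = page.getElementsByTagName("text");
            for (int i = 0; i < texts.getLength(); i++) {
                Element text = (Element) texts.item(i);
                fields.put(text.getAttribute("name"), text.getTextContent().trim());
            }
            // Check mark groups: the value is the name of the checked option.
            NodeList groups = page.getElementsByTagName("checkMarkGroup");
            for (int g = 0; g < groups.getLength(); g++) {
                Element group = (Element) groups.item(g);
                NodeList marks = group.getElementsByTagName("checkmark");
                for (int m = 0; m < marks.getLength(); m++) {
                    Element mark = (Element) marks.item(m);
                    if ("true".equals(mark.getAttribute("checked"))) {
                        fields.put(group.getAttribute("name"), mark.getAttribute("name"));
                    }
                }
            }
        }
        return fields;
    }
}
```

In this sketch, a field that appears on several pages keeps the value from the last page, and a group with several checked options keeps the last checked one. A real transformer would likely also inspect the quality and suspicious attributes before trusting a value.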