Overview
Attivio provides a rich set of optical character recognition (OCR) capabilities via the OCR add-on module. OCR is the mechanical or electronic translation of images of handwritten, typewritten, or printed text (usually captured by a scanner) into machine-editable text. Attivio also includes advanced functionality such as intelligent character recognition (ICR) and optical mark reading (OMR). ICR extracts handwritten data from documents such as scanned forms. OMR captures human-marked data from document forms such as surveys and tests.
The OCR module extracts textual data from scanned documents and image files. That text can then be linguistically analyzed and indexed within Attivio so that the documents and files can be found by keyword search. The module also provides options for converting scanned documents and image files into other file formats, such as Microsoft Word documents or PDFs.
Required Modules
These features require that the ocr module be included when you run createproject to create the project directories.
You will have to download the OCR Module. If you are unable to access that page, contact sales@attivio.com or your Attivio Account Representative for access.
OCR Module Installation
To install the OCR module please follow the add-on module installation procedure.
OCR Licensing
To use the OCR module, a valid OCR module license serial number is required. To request a trial or production serial number, please contact your sales/account representative (sales@attivio.com). Once you have a license serial number, follow the directions below.
OCR Module License Serial Numbers can only be used on one physical machine and cannot be reassigned to new machines once they are activated.
Windows
License Activation for Windows
Linux
OCR Configuration
Once the OCR module is installed, its XML configuration is automatically included when you create a project. The OCR configuration files are:
- <project-dir>\conf\properties\ocr\ocr.properties - the OCR module properties file. The file invites you to set the value of the ocr.developerSerialNumber property, but this value should be left blank. Attivio sets this value internally. (Note that the developer serial number is not the same as your OCR user license number.)
- <project-dir>\conf\components\recognizeText.xml - the component that uses the recognizeText document transformer. See the Sample XML Configuration below on this page.
- <project-dir>\conf\workflow\ingest\ocrIngest.xml - the ocrIngest workflow, which includes the recognizeText component. If needed, you can edit this workflow in the Attivio Administrator (System Management > Workflows > Ingest > ocrIngest).
Important note on the Document Batch Size
The Document Batch Size must be set correctly in the connector for OCR processing to work properly, and finding the right value may require some initial testing. OCR is a resource-intensive operation, and setting the batch size correctly ensures proper memory allocation and prevents OCR request timeouts. The right documentBatchSize value depends on the average size of your input documents. See the following for more information.
RecognizeText Transformer
The RecognizeText transformer performs all OCR module operations. It is instantiated by the recognizeText document transformer in the ocrIngest workflow.
Image Formats
The OCR Module can "read" the following image formats.
Format | Extension | Open | Save |
---|---|---|---|
BMP: | bmp | + | + |
DCX: | dcx | + | + |
PCX: | pcx | + | + |
PCX: | pcx | + | - |
PNG: | png | + | + |
JPEG 2000: | jp2, j2c | + | + |
JPEG: | jpg, jpeg, jfif | + | + |
PDF (Version 1.6 or earlier) | pdf | + | + |
TIFF: | tif, tiff | + | + |
TIFF: | tif, tiff | + | - |
GIF: | gif | + | - |
Output Formats
The Attivio OCR module supports the following output formats for the OCR conversions:
Format | Description |
---|---|
RTF | Rich Text Format document containing textual output |
DOC | Microsoft Word 97 Document format |
DOCX | Microsoft Word 2007 XML Document format |
HTML | HTML format |
XLS | Microsoft Excel (great for scanned tables) |
PDF | Adobe PDF format (with image over text) |
TEXT | Raw text output |
CSV | Columnar data in CSV format (great for scanned tables) |
XML | XML format (preserves the most information about the scanned image) |
Output Modes
Output mode | Description |
---|---|
CONTENT_POINTER | Output text to a content store and store the content pointer in the field |
STRING | Output as string to the field (requires output format TEXT to be set) |
FILE | Output text to a file and store the filename in the field |
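As an illustration of how an output format and an output mode combine, the fragment below is a minimal sketch of a recognizeText component that converts each input to DOCX and stores a content pointer in a field. The property names output, outputMode, and outputFormat are taken from the Forms Processing Example on this page; the field name ocrDocx is hypothetical, and the sketch assumes the format and mode values are spelled exactly as in the tables above.

```xml
<component name="recognizeText" class="com.attivio.ocr.transformer.ingest.document.RecognizeText">
  <performance maxInstances="1" />
  <properties>
    <!-- Field that receives the content pointer; "ocrDocx" is a hypothetical field name -->
    <property name="output" value="ocrDocx" />
    <!-- Write the converted document to a content store and keep only the pointer -->
    <property name="outputMode" value="CONTENT_POINTER" />
    <!-- Assumed spelling, copied from the Output Formats table -->
    <property name="outputFormat" value="DOCX" />
  </properties>
</component>
```

Note that the STRING mode would not fit here, since it requires the TEXT output format.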
International Language Support
By default, the RecognizeText transformer assumes input documents are in English. If a document is not in English, the locale set on the input field or input document is used to determine the language during processing. The default language can be changed from English in the RecognizeText configuration by specifying the "defaultLanguage" property with a value accepted as the language parameter of the java.util.Locale constructor.
When specifying locale on a field or document, use the ISO-639 2-letter codes. For example, for Danish, in Java, use 'new Locale("da")'.
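The default-language setting described above might look like the following in recognizeText.xml. This is a sketch under assumptions: the property name defaultLanguage comes from the text above, but its placement as a plain string property next to the other properties, and the choice of Danish ("da"), are illustrative only.

```xml
<component name="recognizeText" class="com.attivio.ocr.transformer.ingest.document.RecognizeText">
  <performance maxInstances="1" />
  <properties>
    <!-- Assumed placement: ISO-639 two-letter code used when no locale is set on the field or document -->
    <property name="defaultLanguage" value="da" />
  </properties>
</component>
```

A locale set on an individual field or document still takes precedence for that document; the default applies only when no locale is present.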
Sample XML Configuration
The following snippet illustrates how the transformer can be configured by editing the recognizeText.xml configuration file. It should not be necessary to modify these configurations unless instructed to do so by Attivio:
```xml
<component xmlns="http://www.attivio.com/configuration/type/componentType"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           name="recognizeText"
           class="com.attivio.ocr.transformer.ingest.document.RecognizeText"
           xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType http://www.attivio.com/configuration/type/componentType.xsd">
  <!--Generated configuration-->
  <performance maxInstances="1"/>
  <properties>
    <property name="developerSerialNumber" value="${ocr.developerSerialNumber}"/>
    <map name="environment">
      <property name="TMP" value="C:\Users\username\AppData\Local\Temp\" />
      <property name="SystemDrive" value="C:" />
      <property name="PUBLIC" value="C:\Users\Public" />
    </map>
  </properties>
</component>
```
The environment map cannot be edited directly through the Attivio Administrator, and must be added to recognizeText.xml in a text editor. Note that the TMP property must point to a directory where Attivio has permission to write files.
Feeding Content for OCR Processing
To feed content for OCR processing, have a connector or API client send documents to the ocrIngest workflow.
Open the Attivio Administrator and navigate to System Management > Connectors. Click the New button and select a FileConnector. This opens a FileScanner editor.
You must set the Connector Name, Start Directory, and Ingest Workflow values before saving the connector. The Ingest Workflow must be set to ocrIngest. No further configuration is required.
Use the Attivio Administrator > Query > Debug Search page to check the results. Use the Legacy XML output mode and search for *:*. You should find OCR-generated text in the document's text field.
Forms Processing with OCR/ICR
The OCR module supports extracting text from handwritten or typed forms.
The OCR forms processing is only supported on Windows.
Template Development
Form processing templates are required for Forms Processing and must be created in ABBYY FormReader and/or FlexiCapture Studio applications.
ABBYY FlexiCapture Studio is a software application that allows you to create formalized descriptions (FlexiLayouts) of documents with variable layouts (so-called semi-structured forms). The created FlexiLayouts are then used by forms processing applications to capture data from the documents which they describe. FlexiCapture Studio is built on groundbreaking IPA technology which imitates the way living beings recognize objects.
Semi-structured forms are different from structured forms in that the exact location of the fields on such documents is not known in advance. For this reason, a data capture application needs additional information about the data fields in order to locate them on the forms and read the information they contain. ABBYY's FlexiCapture technology allows you to create formalized descriptions which tell the data capture program how and where to find particular fields. Used in conjunction with a data capture application, e.g. ABBYY FormReader, ABBYY FlexiCapture Studio will allow you to automate capturing data from such semi-structured forms as invoices, order forms, and many others.
Note that FlexiLayouts created using FlexiCapture Studio are imported into the ABBYY FormReader application. FormReader is also used for developing templates for the more static, non-flexible forms.
ABBYY FormReader is a powerful data capturing application for extracting information from printed forms and exporting it to databases and information systems. ABBYY FormReader is based on OCR, ICR, and OMR technologies.
The data collection process consists of two stages:
- Preparation stage (creating, distributing, and collecting forms);
- Form processing stage.
ABBYY FormReader takes care of the most taxing and labor-intensive part of the process, namely the extraction of data from filled-out forms, thereby freeing the user from hours of manual input.
Both template authoring applications are used for creating templates. Once the templates are created, the RecognizeText transformer's configuration can point to them, and the OCR engine loads them at runtime to perform form recognition and data extraction.
You can create a single form processing batch containing one or more templates. This gives you a single packaged solution for recognizing and extracting data from related forms.
Forms Processing Example
The following XML snippet provides an illustration of how forms processing can be achieved by using the RecognizeText transformer:
```xml
<component name="recognizeText" class="com.attivio.ocr.transformer.ingest.document.RecognizeText">
  <!-- Note: maxInstances defaults to 1 based upon licensing requirements,
       but can be increased depending on your individual license terms. -->
  <performance maxInstances="1" />
  <properties>
    <!-- The name of the field for storing extracted text -->
    <property name="output" value="text" />
    <!-- See section on output modes -->
    <property name="outputMode" value="STRING" />
    <!-- See section on output formats -->
    <property name="outputFormat" value="TEXT" />
    <!-- Set these environment variables accordingly per system -->
    <map name="environment">
      <property name="TMP" value="C:\Users\userMachineName\AppData\Local\Temp\tmp" />
      <property name="SystemDrive" value="C:" />
      <property name="PUBLIC" value="C:\Users\Public" />
    </map>
    <!-- The location of the form batch to load -->
    <property name="formTemplate" value="formbatch/Acme/InsuranceClaimsBatch.frm" />
  </properties>
</component>
```
Forms Processing Output
By default, the form processing output is put into the formXml field in the AttivioDocument as a parsed org.dom4j.Document object.
For a given processed form, its status value may be one of the following: OK, NOTFORM, ERROR. The following is a sample XML output for one particular form. Note that custom Attivio transformers can be developed to process this output, with each form field name becoming the name of an Attivio field.
```xml
<?xml version="1.0" encoding="utf-8" ?>
<pages>
  <page no="0" status="OK">
    <text name="Day"><para quality="0.50" suspicious="0">10</para></text>
    <text name="Month"><para quality="1.00">06</para></text>
    <text name="Year"><para quality="1.00">2002</para></text>
    <text name="LastName"><para quality="1.00">TROUT</para></text>
    <text name="FirstName"><para quality="1.00">HELEN</para></text>
    <text name="Patronymic"/>
    <checkMarkGroup name="Status">
      <checkmark name="Married" checked="true"/>
      <checkmark name="Single" checked="false"/>
      <checkmark name="Divorced" checked="false"/>
    </checkMarkGroup>
    <text name="Age"><para quality="1.00">30</para></text>
    <text name="CityCode"><para quality="1.00">095</para></text>
    <text name="Phone"><para quality="0.86" suspicious="6">7980324</para></text>
    <text name="E-mail"><para quality="0.96" suspicious="19">TR_HELEN@boydline.Co.uk</para></text>
    <text name="Other"><para quality="0.86" suspicious="1">AUSTRIA</para></text>
    <checkMarkGroup name="HowOftenDoYouBuyGas">
      <checkmark name="Once a week or more often" checked="true"/>
      <checkmark name="Less often than once a week" checked="false"/>
    </checkMarkGroup>
  </page>
</pages>
```
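To show how form field names in this output map to name/value pairs, the following XSLT 1.0 stylesheet is a minimal sketch. It is not part of the Attivio OCR module and does not use any Attivio API. It assumes the output has exactly the shape of the sample above (pages/page/text/para and checkMarkGroup/checkmark) and prints one name=value line per form field for pages whose status is OK.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="/pages">
    <!-- Skip pages reported as NOTFORM or ERROR -->
    <xsl:for-each select="page[@status='OK']">
      <!-- Text fields: the field name, then the recognized text of its first para (empty if none) -->
      <xsl:for-each select="text">
        <xsl:value-of select="@name"/>
        <xsl:text>=</xsl:text>
        <xsl:value-of select="normalize-space(para)"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
      <!-- Check mark groups: the group name, then the name of the first checked option -->
      <xsl:for-each select="checkMarkGroup">
        <xsl:value-of select="@name"/>
        <xsl:text>=</xsl:text>
        <xsl:value-of select="checkmark[@checked='true']/@name"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Any XSLT 1.0 processor can apply the stylesheet to a saved copy of the form output. A custom transformer would do the equivalent mapping in code and could additionally use the quality and suspicious attributes to flag fields that need manual review.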