Attivio provides a rich set of optical character recognition (OCR) capabilities via the OCR add-on module. OCR is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text. Attivio also includes advanced functionality such as intelligent character recognition (ICR) and optical mark reading (OMR). ICR extracts handwritten data from documents such as scanned forms. OMR captures human-marked data from document forms such as surveys and tests.
The OCR module extracts textual data from scanned documents and image files. This text can then be linguistically analyzed and indexed within Attivio so that the documents and files can be found by keyword search. The module provides additional options for converting scanned documents and image files into other file formats, such as Microsoft Word documents or PDFs.
OCR Module Installation
To install the OCR module, follow the add-on module installation procedure.
To use the OCR module, a valid OCR module license serial number is required. To request a trial or production serial number, please contact your sales/account representative (firstname.lastname@example.org). Once you have a license serial number, follow the directions below.
OCR Module License Serial Numbers can only be used on one physical machine and cannot be reassigned to new machines once they are activated.
Once the OCR module is installed, its XML configuration is automatically included when you create a project. The OCR configuration files are:
<project-dir>\conf\properties\ocr\ocr.properties - the OCR module properties file. The file invites you to set the value of the ocr.developerSerialNumber property, but this value should be left blank. Attivio sets this value internally. (Note that the developer serial number is not the same as your OCR user license number.)
<project-dir>\conf\components\recognizeText.xml - the component that uses the recognizeText document transformer. See the Example Configuration below on this page.
<project-dir>\conf\workflow\ingest\ocrIngest.xml - the ocrIngest workflow, which includes the recognizeText component. If needed, you can edit this workflow in the Attivio Administrator (System Management > Workflows > Ingest > ocrIngest).
Important note on the Document Batch Size
The Document Batch Size must be set correctly in the connector for OCR processing to work properly, and finding the right value may require some initial testing. OCR is a resource-intensive operation, and an appropriate batch size helps ensure proper memory allocation and prevents OCR request timeouts. The right documentBatchSize value depends on the average size of your input documents. See the following for more information.
The RecognizeText transformer performs all OCR module operations. It is instantiated by the recognizeText document transformer in the ocrIngest workflow.
The OCR module can "read" the following image formats:
jpg, jpeg, jfif
PDF (Version 1.6 or earlier)
The Attivio OCR module supports the following output formats for the OCR conversions:
Rich Text Format document containing textual output
Microsoft Word 97 Document format
Microsoft Word 2007 XML Document format
Microsoft Excel (great for scanned tables)
Adobe PDF format (with image over text)
Raw text output
Columnar data in CSV format (great for scanned tables)
XML format (preserves the most information about the scanned image)
Output text to a content store and store the content pointer in the field
Output as string to the field (requires output format TEXT to be set)
Output text to a file and store the filename in the field
International Language Support
By default, the RecognizeText transformer assumes input documents are in English. If a document is not in English, the locale set on the input field or input document is used to determine the language during processing. The default language can be changed from English in the RecognizeText configuration by setting the "defaultLanguage" property to a value accepted as the language parameter of the java.util.Locale constructor.
When specifying a locale on a field or document, use the ISO 639 two-letter codes. For example, for Danish, in Java, use 'new Locale("da")'.
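As a small illustration of the language codes described above, the following Java sketch builds a locale from a two-letter code and reads the language back. The class and method names (LocaleCodes, languageOf) are made up for this example and are not part of Attivio. Note that the Locale(String) constructor is deprecated in recent Java releases, although it still works.

```java
import java.util.Locale;

class LocaleCodes {
    // Builds a Locale from a two-letter language code and returns
    // the language code that the resulting Locale reports.
    static String languageOf(String code) {
        return new Locale(code).getLanguage();
    }

    public static void main(String[] args) {
        // "da" is the two-letter code for Danish, as in the example above.
        Locale danish = new Locale("da");
        System.out.println(danish.getLanguage());
        // Human-readable language name, rendered in English.
        System.out.println(danish.getDisplayLanguage(Locale.ENGLISH));
    }
}
```

The string passed to the constructor is the kind of value that the "defaultLanguage" property is described as accepting.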
Sample XML Configuration
The following snippet illustrates how the transformer can be configured by editing the configuration file. It should not be necessary to modify these configurations unless instructed to do so by Attivio:
The environment map cannot be edited directly through the Attivio Administrator, and must be added to the configuration file in a text editor. Note that the TMP property must point to a directory where Attivio has permission to write files.
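Before setting the TMP property, it can help to confirm that the candidate directory exists and is writable. The Java sketch below is one way to do that with the standard java.nio.file API. The class and method names (TmpDirCheck, isUsableTmpDir) are hypothetical, and the check is only meaningful when run as the same operating system user that runs Attivio.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TmpDirCheck {
    // True only if the path is an existing directory that the
    // current OS user is allowed to write to.
    static boolean isUsableTmpDir(Path dir) {
        return Files.isDirectory(dir) && Files.isWritable(dir);
    }

    public static void main(String[] args) {
        // Use the first argument if given; otherwise fall back to the JVM's temp dir.
        String location = args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir");
        Path candidate = Paths.get(location);
        System.out.println(candidate + " usable: " + isUsableTmpDir(candidate));
    }
}
```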
Feeding Content for OCR Processing
To feed content for OCR processing, have a connector or API client send documents to the ocrIngest workflow.
Open the Attivio Administrator and navigate to System Management > Connectors. Click the New button and select a FileConnector. This opens a FileScanner editor:
You must set the Connector Name, Start Directory, and Ingest Workflow values before saving the connector. The Ingest Workflow must be set to ocrIngest. No further configuration is required.
Use the Attivio Administrator > Query > Debug Search page to check the results. Use the Legacy XML output mode and search for *:*. You should find OCR-generated text in the document's text field.
Forms Processing with OCR/ICR
The OCR module supports extracting text from handwritten or typed forms.
OCR forms processing is supported only on Windows.
Form processing templates are required for forms processing and must be created in the ABBYY FormReader and/or FlexiCapture Studio applications.
Note that flexilayouts created using FlexiCapture Studio are imported into the ABBYY FormReader application. FormReader is also used for developing templates for the more static, non-flexible forms.
Both template authoring applications are used for creating templates. Once the templates are created, they can be referenced in the RecognizeText transformer's configuration and loaded by the OCR engine at runtime to perform form recognition and data extraction.
You can create a single form processing batch which may contain one or more templates to provide for a single packaged solution for recognition and data extraction of forms that are related.
Forms Processing Example
The following XML snippet provides an illustration of how forms processing can be achieved by using the RecognizeText transformer:
Forms Processing Output
By default, the form processing output is put into the
field in the AttivioDocument as a parsed
For a given processed form, its status value may be one of the following:
The following is a sample XML output for one particular form. Note that custom Attivio transformers can be developed to process this output, with each form field name becoming the name of an Attivio field.
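As a rough illustration of mapping form field names to values, the Java sketch below parses a small XML string into a name-to-value map using only the JDK's built-in DOM parser. The XML shape used here (a form element containing field elements with a name attribute) is a hypothetical stand-in and is not the module's actual output schema. The class and method names are likewise made up, and no Attivio transformer API is shown.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class FormOutputSketch {
    // Maps each <field name="..."> element to its trimmed text content.
    // The <field>/name shape is an assumption made for this example only.
    static Map<String, String> fieldsOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("field");
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element field = (Element) nodes.item(i);
            fields.put(field.getAttribute("name"), field.getTextContent().trim());
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical form output with two recognized fields.
        String xml = "<form><field name=\"lastName\">Smith</field>"
                   + "<field name=\"zip\">02134</field></form>";
        fieldsOf(xml).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

In a real transformer, each map entry would become a field on the document, and the parsing would need to follow the actual structure of the form processing output.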