
Overview

The Text Extraction Statistics page shows text extraction statistics for each MIME type, broken down by Advanced or Basic text extraction processing subflow in AIE.

Required Modules

These features require that you include the advancedtextextraction module when you run createproject to create the project directories.

Note that, by default, data is recorded from the fileIngest workflow. If your ingestion does not use that workflow, the table shows only zeros.


Viewing Text Extraction Statistics

To view text extraction statistics, select Diagnostics -> Text Extraction Statistics in the AIE Administrator. The Text Extraction Statistics page appears.

From here you can:

  • Click Reset Statistics to reset system diagnostics.
  • Sort or re-sort the table by clicking the heading of the column you want to sort by.

Text Extraction Throughput by MIME Type and Workflow Type

This table shows statistics aggregated in two stages. First, the component-level statistics collected for each MIME type are aggregated for each workflow to which the components belong. Then, the workflow-level statistics are aggregated according to whether each workflow is designated Basic or Advanced. See the Configuration section below for more details on how to designate an individual workflow as Basic or Advanced.
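
The two-stage roll-up described above can be sketched in Python as follows. This is an illustration only, not AIE code: the record layout, the WORKFLOW_TYPE mapping, and the workflow and component names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping of workflow name -> designation (Basic or Advanced).
WORKFLOW_TYPE = {
    "basicExtract": "Basic",
    "advancedExtract": "Advanced",
    "advancedOcr": "Advanced",
}

# Hypothetical component-level samples:
# (mime_type, workflow, component, seconds_spent, errors)
COMPONENT_STATS = [
    ("application/pdf", "basicExtract", "parser", 2.0, 1),
    ("application/pdf", "advancedExtract", "converter", 3.0, 0),
    ("application/pdf", "advancedExtract", "cleanup", 1.0, 2),
    ("application/pdf", "advancedOcr", "ocr", 4.0, 1),
    ("text/html", "basicExtract", "parser", 0.5, 0),
]


def aggregate_by_workflow(component_stats):
    """Stage 1: sum component statistics per (MIME type, workflow)."""
    totals = defaultdict(lambda: {"seconds": 0.0, "errors": 0})
    for mime, workflow, _component, seconds, errors in component_stats:
        row = totals[(mime, workflow)]
        row["seconds"] += seconds
        row["errors"] += errors
    return dict(totals)


def aggregate_by_type(workflow_totals, workflow_type):
    """Stage 2: sum workflow totals per (MIME type, Basic/Advanced)."""
    totals = defaultdict(lambda: {"seconds": 0.0, "errors": 0})
    for (mime, workflow), stats in workflow_totals.items():
        row = totals[(mime, workflow_type[workflow])]
        row["seconds"] += stats["seconds"]
        row["errors"] += stats["errors"]
    return dict(totals)


by_workflow = aggregate_by_workflow(COMPONENT_STATS)
by_type = aggregate_by_type(by_workflow, WORKFLOW_TYPE)
```

In this sketch, the two hypothetical Advanced workflows for application/pdf collapse into a single (MIME type, Advanced) row, which mirrors how one table row can summarize several workflows.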

Columns

  • MIME Type: The MIME type from <install-dir>\conf\advancedtextextraction\advancedtextextraction-doctypes.xml
  • Workflow Type: The designation of the workflow containing the component that contributed statistics to this row.
  • Number of Documents: The number of documents as reported by the designated primary component of the aggregated workflow.
  • Documents per Second: The number of documents as reported by the designated primary component of the aggregated workflow, divided by the total time spent in all of the components in the aggregated workflow.
  • Megabytes per Second: The number of megabytes as reported by the designated primary component of the aggregated workflow, divided by the total time spent in all of the components in the aggregated workflow.
  • Number of Errors: The total number of errors as reported by all of the components in the aggregated workflow.
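
As a rough illustration of how the Documents per Second and Megabytes per Second values are derived, the following Python sketch divides the primary component's counts by the time summed over every component. The function name, its parameters, the zero-time handling, and the 1 MB = 1,048,576 bytes convention are assumptions made for this example, not documented AIE behavior.

```python
def throughput(primary_docs, primary_bytes, component_seconds):
    """Return (documents per second, megabytes per second) for one table row.

    primary_docs, primary_bytes: counts reported by the primary component.
    component_seconds: time spent in each component of the aggregated workflow.
    """
    total_seconds = sum(component_seconds)
    if total_seconds <= 0:
        # Assumption: report zero throughput when no time was recorded.
        return 0.0, 0.0
    # Assumption: one megabyte is 1024 * 1024 bytes.
    megabytes = primary_bytes / (1024 * 1024)
    return primary_docs / total_seconds, megabytes / total_seconds


# Hypothetical row: 100 documents totalling 50 MB, with three components
# that together spent 5 seconds on them.
docs_per_sec, mb_per_sec = throughput(
    primary_docs=100,
    primary_bytes=50 * 1024 * 1024,
    component_seconds=[3.0, 1.0, 1.0],
)
```

Because the denominator includes the time spent in every component, adding a slow component to a workflow lowers both throughput values even when the primary component's document count is unchanged.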

Important Notes

  • The statistics show the number of documents processed by each of the constituent ATEM workflows. The number of documents processed by a workflow might not match the number of documents eventually indexed. For example, a document may be processed by several stages of a workflow, but then dropped before indexing.
  • Documents may count toward both Basic and Advanced, depending upon the components that process them. For example, if Basic tries to extract text and fails, and Advanced then runs the advteConvert.advteConverter component, the same document is counted in both aggregate workflows.
  • Unusually high throughput numbers can be misleading and don't always equate to high performance. When the value in the Number of Documents column is small (for example, fewer than 10), exercise caution when drawing conclusions or making configuration decisions based on the overall throughput values for documents of that MIME type.
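
The caution about small document counts can be illustrated with a short Python sketch that flags rows whose sample is too small to trust. The row values and the is_reliable and summarize helpers are hypothetical; the threshold of 10 comes from the example in the note above.

```python
MIN_DOCUMENTS = 10  # threshold taken from the example in the note above


def is_reliable(num_documents, min_documents=MIN_DOCUMENTS):
    """Return True when enough documents were seen to trust the throughput."""
    return num_documents >= min_documents


def summarize(rows):
    """Return (mime_type, documents_per_second, reliable) for each row."""
    summary = []
    for mime, docs, seconds in rows:
        rate = docs / seconds if seconds > 0 else 0.0
        summary.append((mime, rate, is_reliable(docs)))
    return summary


# Hypothetical rows: (mime_type, number_of_documents, total_seconds)
rows = [
    ("application/pdf", 5000, 250.0),
    ("image/tiff", 2, 0.004),
]

summary = summarize(rows)
for mime, rate, reliable in summary:
    note = "" if reliable else " (too few documents to draw conclusions)"
    print(f"{mime}: {rate:.0f} docs/sec{note}")
```

Here the hypothetical image/tiff row reports a far higher rate than the application/pdf row, yet it rests on only two documents, so it says little about sustained performance.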

Configuration

Advanced Users Only

Text Extraction Statistics configuration changes are typically not required unless you change the composition of the ATEM workflow.

For details on configuring what is displayed, please see Advanced Configuration For Text Extraction Statistics.
