The Text Extraction Statistics page shows text extraction statistics for each MIME type and for each Advanced or Basic text extraction processing subflow in AIE.
These features require that you include the advancedtextextraction module when you run createproject to create the project directories.
Note that the default behavior is to record data from the fileIngest workflow. If you do not use that workflow for ingestion, the table shows only zeros.
Viewing Text Extraction Statistics
To view text extraction statistics, select Diagnostics -> Text Extraction Statistics in the AIE Administrator. The Text Extraction Statistics page appears.
From here you can:
- Click Reset Statistics to reset system diagnostics.
- Sort or re-sort the table by clicking the heading of the column you want to sort on.
Text Extraction Throughput by MIME Type and Workflow Type
This table shows statistics aggregated in two steps. First, component-level statistics collected for each MIME type are aggregated for each workflow to which the components belong. Then, workflow-level statistics are aggregated according to whether the workflow is considered Basic or Advanced. See the Configuration section below for details on how to designate individual workflows as Basic or Advanced.
MIME Type
The MIME type from: <install-dir>\conf\advancedtextextraction\advancedtextextraction-doctypes.xml
Workflow Type
The designation (Basic or Advanced) of the workflow containing the component that contributed statistics to this row.
Number of Documents
The number of documents as reported by the designated primary component of the aggregated workflow.
Documents per Second
The number of documents as reported by the designated primary component of the aggregated workflow, divided by the total time spent in all of the components in the aggregated workflow.
Megabytes per Second
The number of megabytes as reported by the designated primary component of the aggregated workflow, divided by the total time spent in all of the components in the aggregated workflow.
Number of Errors
The total number of errors as reported by all of the components in the aggregated workflow.
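The column definitions above can be illustrated with a minimal sketch. This is not AIE code: the ComponentStats record, its field names, and the aggregate function are hypothetical and exist only to show the arithmetic. The sketch also assumes that, when several workflows share a designation, the counts from their primary components are summed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ComponentStats:
    """Hypothetical per-component record; not an AIE data structure."""
    mime_type: str
    workflow_type: str   # "Basic" or "Advanced"
    is_primary: bool     # designated primary component of its workflow
    documents: int
    megabytes: float
    seconds: float       # time spent in this component
    errors: int


def aggregate(stats):
    """Build one row per (MIME type, workflow type) from component records."""
    rows = defaultdict(lambda: {"documents": 0, "megabytes": 0.0,
                                "seconds": 0.0, "errors": 0})
    for s in stats:
        row = rows[(s.mime_type, s.workflow_type)]
        # Document and megabyte counts come only from primary components.
        if s.is_primary:
            row["documents"] += s.documents
            row["megabytes"] += s.megabytes
        # Time and errors are summed over every component.
        row["seconds"] += s.seconds
        row["errors"] += s.errors

    table = {}
    for key, row in rows.items():
        secs = row["seconds"]
        table[key] = {
            "documents": row["documents"],
            "docs_per_sec": row["documents"] / secs if secs else 0.0,
            "mb_per_sec": row["megabytes"] / secs if secs else 0.0,
            "errors": row["errors"],
        }
    return table


# Illustrative input: two components of one Advanced workflow handling PDFs.
sample = [
    ComponentStats("application/pdf", "Advanced", True, 100, 50.0, 8.0, 1),
    ComponentStats("application/pdf", "Advanced", False, 100, 50.0, 2.0, 2),
]
result = aggregate(sample)
```

In this sample, the primary component reports 100 documents and 50 MB, and the two components together spend 10 seconds, so the row shows 10 documents per second, 5 megabytes per second, and 3 errors.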
- The statistics show the number of documents processed by each of the constituent ATEM workflows. The number of documents processed by a workflow might not match the number of documents eventually indexed. For example, a document may be processed by several stages of a workflow, but then dropped before indexing.
- Documents may count toward both Basic and Advanced, depending upon the components processing them. For example, in the configuration above, if Basic tries to extract text, fails, and then Advanced tries to run the advteConvert.advteConverter, the same document is counted in both aggregate workflows.
- Unusually high throughput numbers can be misleading and don't always equate to high performance. When the value in the Number of Documents column is small (for example, less than 10), exercise caution when drawing conclusions or making configuration decisions based on the overall throughput values for documents of that MIME Type.
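As a hedged illustration of the last note, the sketch below flags rows whose document count is too small for the throughput figures to be trusted. The row dictionaries, the low_sample_rows helper, and the sample values are hypothetical. The threshold of 10 comes from the example in the note and is not a documented AIE setting.

```python
MIN_DOCUMENTS = 10  # assumption: taken from the "less than 10" example above


def low_sample_rows(rows, min_documents=MIN_DOCUMENTS):
    """Return the rows whose document count is below the threshold."""
    return [r for r in rows if r["documents"] < min_documents]


# Illustrative values only; not real measurements.
rows = [
    {"mime_type": "application/pdf", "workflow": "Advanced",
     "documents": 3, "docs_per_sec": 450.0},
    {"mime_type": "text/html", "workflow": "Basic",
     "documents": 12000, "docs_per_sec": 85.0},
]
flagged = low_sample_rows(rows)
```

Here the application/pdf row is flagged: its high documents-per-second figure rests on only three documents, so it should not drive configuration decisions.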
Advanced Users Only
Text Extraction Statistics configuration changes are typically not required unless you change the composition of the ATEM workflow.
For details on configuring what is displayed, please see Advanced Configuration For Text Extraction Statistics.