Overview
Advanced text extraction may be unable to convert some character encodings to Unicode, resulting in unreadable text. You can overcome this by converting character encodings to UTF-8 before the text extraction process is performed. This document details how to configure character encoding conversion in Advanced text extraction.
Configuration
The advteConverter (AdvancedTextExtractionConvertDocument) component performs text extraction on a Content Pointer.
You can configure character encoding conversion with the following properties under the Converter section.
Properties
Convert Character Encodings to UTF-8
One or more character encoding names to convert to UTF-8. See this list for character encoding names supported on Java 8.
Conversion to UTF-8 occurs when the Content Pointer is detected as one of these character encodings.
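As a rough illustration of what this property controls, the Python sketch below re-encodes a byte string to UTF-8 only when its detected encoding appears in a configured list. This is not the advteConverter implementation (which is a Java component). The function name convert_if_listed and the idea that the detected encoding arrives as a plain string are assumptions made for the example.

```python
import codecs


def convert_if_listed(data: bytes, detected: str, encodings_to_convert: list) -> bytes:
    """Return data re-encoded as UTF-8 if `detected` is in the list, else unchanged.

    Hypothetical helper for illustration only.
    """
    # Normalize names so that "ISO-2022-KR" and "iso2022_kr" compare equal.
    wanted = {codecs.lookup(name).name for name in encodings_to_convert}
    if codecs.lookup(detected).name not in wanted:
        return data  # encoding is not configured for conversion
    # Decode from the source encoding, then encode as UTF-8.
    return data.decode(detected).encode("utf-8")


# Sample ISO-2022-KR content, standing in for the bytes of a Content Pointer.
raw = "한국어".encode("iso2022_kr")
converted = convert_if_listed(raw, "ISO-2022-KR", ["ISO-2022-KR"])
untouched = convert_if_listed(raw, "ISO-2022-KR", ["Shift_JIS"])
```

Content whose detected encoding is not in the list is passed through unchanged, which mirrors the behavior described above.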
Convert Content Pointer Max Bytes
The maximum size, in bytes, of a Content Pointer that is eligible for character encoding conversion. The default value is 4194304 bytes (4 MB).
This limits which Content Pointers will be converted to UTF-8 within advteConverter.
Performance consideration
The advteConverter component reads the entire Content Pointer into memory to perform conversion of character encoding. For this reason the default value for Convert Content Pointer Max Bytes is 4194304 (4 MB).
If your Content Pointers are larger, increase the value to allow conversion. Similarly, if your Content Pointers are smaller, you can decrease the value to reduce I/O and memory impact.
For example, if you are extracting text for email content less than 4194304 bytes then the default setting will allow character encoding conversion. Email content larger than 4194304 bytes will not be converted.
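The size limit can be pictured as a simple eligibility check, sketched below in Python. The function name eligible_for_conversion is hypothetical, and treating a Content Pointer of exactly the maximum size as eligible is an assumption, since the text above only describes sizes below and above the limit.

```python
DEFAULT_MAX_BYTES = 4194304  # 4 MB, the documented default


def eligible_for_conversion(size_in_bytes: int, max_bytes: int = DEFAULT_MAX_BYTES) -> bool:
    """Return True if content of this size would be considered for conversion.

    Hypothetical helper; the inclusive boundary is an assumption.
    """
    return size_in_bytes <= max_bytes


small_email = eligible_for_conversion(200 * 1024)            # 200 KB email
large_email = eligible_for_conversion(10 * 1024 * 1024)      # 10 MB email, default limit
large_email_raised = eligible_for_conversion(10 * 1024 * 1024, max_bytes=16 * 1024 * 1024)
```

Raising the limit makes larger content eligible, at the cost of holding that content in memory during conversion.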
Detect Max Bytes Read
The maximum number of bytes to read from a Content Pointer to detect its character encoding. The default value is -1, which reads the entire Content Pointer for character encoding detection.
This limits the number of Content Pointer bytes that need to be read to detect the character encoding.
Performance consideration
By default, the advteConverter component reads the entire Content Pointer into memory to detect the character encoding. You can limit how many bytes are read, starting from the beginning of the Content Pointer, to perform this detection. Reading only a small number of bytes to decide whether conversion is needed can reduce I/O and memory impact. If conversion is required, the entire Content Pointer must still be read into memory to convert the character encoding.
For example, if you are extracting text from email content, reading the first 4096 bytes (4 KB) may be enough for detection.
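The idea of detecting from a prefix can be sketched in Python as follows. This is not how advteConverter detects encodings. The check used here is a deliberately simplistic stand-in that looks only for the ISO-2022-KR designator sequence (ESC $ ) C, defined in RFC 1557), and the function names are hypothetical.

```python
import io

# ISO-2022-KR text announces itself with this designator sequence (RFC 1557).
ISO_2022_KR_DESIGNATOR = b"\x1b$)C"


def read_for_detection(stream, detect_max_bytes: int = -1) -> bytes:
    """Read up to detect_max_bytes from the start of the stream; -1 reads everything."""
    if detect_max_bytes < 0:
        return stream.read()
    return stream.read(detect_max_bytes)


def looks_like_iso_2022_kr(sample: bytes) -> bool:
    """Simplistic stand-in for real detection: look for the designator sequence."""
    return ISO_2022_KR_DESIGNATOR in sample


# A body larger than 4 KB, standing in for email content.
body = ("안녕하세요 " * 1000).encode("iso2022_kr")

sample = read_for_detection(io.BytesIO(body), 4096)  # read only the first 4 KB
detected = looks_like_iso_2022_kr(sample)
```

Only when detection succeeds does the full content need to be read again for conversion, which is where the savings for non-matching content come from.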
Example
The following configuration sets advteConverter to convert ISO-2022-KR to UTF-8 for Content Pointers that are smaller than 4194304 bytes (4 MB), the default limit. Because Detect Max Bytes Read is left at its default of -1, the entire Content Pointer is read into memory to detect the ISO-2022-KR character encoding.
User Interface (AIE Admin)
XML
<?xml version="1.0" encoding="UTF-8"?>
<component xmlns="http://www.attivio.com/configuration/type/componentType"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           name="advteConverter"
           class="com.attivio.advancedtextextraction.transformer.ingest.textextraction.AdvancedTextExtractionConvertDocument"
           xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType http://www.attivio.com/configuration/type/componentType.xsd">
  <properties>
    <property name="optionFilePath" value="advancedtextextraction/advancedtextextraction-search-export.cfg"/>
    <container-property name="docTypeConfig" reference="advancedDocTypeConfig"/>
    <container-property name="childDocumentPostProcessor" reference="defaultAdvTeChildPostProcessor"/>
    <property name="documentTimeout" value="${advancedtextextraction.documentTimeout}"/>
    <list name="characterEncodingsToConvertToUtf8">
      <entry value="ISO-2022-KR"/>
    </list>
  </properties>
</component>