AIE provides the ability to perform phonetic (sounds-like) matching in keyword searches. This feature assists the matching of proper names in documents. Names sometimes have irregular spelling, particularly when transliterated from another language or when recorded in immigration documents. For instance, the name of Libyan leader Muammar Gaddafi is spelled in more than thirty different ways by European and American government offices and news outlets.
Phonetic matching does not require any special modules. It is part of the core AIE feature set.
- Phonetic Matching is currently only available for English.
What is Phonetic Matching?
The EncodeSoundsLike ingest transformer analyzes each text token and transforms it into a phonetic code equivalent. This phonetic equivalent is then stacked with the original token in a tokenList, and is eventually added to the index.
The EncodeSoundsLike query transformer analyzes each query token and transforms it into a phonetic equivalent. The query is then modified to perform an OR query on both versions of the token.
For instance, EncodeSoundsLike encodes the name Gaddafi as KTF. Variant spellings, such as Qadhafi or Khadafi, also encode to KTF when they appear in documents, and queries mentioning any similar name are transformed into KTF as well. This produces more useful matches than strict matching or spelling correction can.
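The stacking and OR-expansion described above can be sketched in a few lines. This is only an illustration: `toy_code` is a hypothetical, deliberately tiny stand-in for a phonetic encoder, tuned to the names on this page. It is not Double Metaphone and not the code AIE runs. The `stack_tokens` and `expand_query` helpers and the query syntax they print are likewise assumptions made for the example.

```python
import re

def toy_code(word):
    """Hypothetical stand-in for a phonetic encoder; NOT Double Metaphone."""
    s = word.upper()
    for pair, code in (("KH", "K"), ("GH", "K"), ("CH", "K"), ("DH", "T")):
        s = s.replace(pair, code)              # fold a few digraphs
    s = re.sub(r"(.)\1+", r"\1", s)            # collapse doubled letters
    s = re.sub(r"[AEIOUYHW]", "", s)           # drop vowels and weak consonants
    return s.translate(str.maketrans("GQCD", "KKKT"))

def stack_tokens(text):
    """Ingest side: pair each original token with its phonetic code."""
    return [(tok.lower(), toy_code(tok)) for tok in re.findall(r"[A-Za-z]+", text)]

def expand_query(term):
    """Query side: OR the original term with its phonetic code (illustrative syntax)."""
    return f"({term.lower()} OR {toy_code(term)})"

for name in ("Gaddafi", "Qadhafi", "Khadafi"):
    print(name, "->", toy_code(name))          # each prints KTF

print(stack_tokens("Muammar Gaddafi"))
print(expand_query("Khadafi"))                 # (khadafi OR KTF)
```

Because both the indexed token and the query term carry the same code, the three spellings find each other even though none of the original strings are equal.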
The EncodeSoundsLike transformers use an encoder, which is officially described as a subclass of StringEncoder.
At first glance there appear to be many encoders, but most of the listed encoders are either obsolete or narrowly specialized. The Caverphone family of encoders, for instance, was created for matching names found in late 19th-century Dunedin, New Zealand. The ColognePhonetic encoder is for German names. The BeiderMorseEncoder matches names of Ashkenazic Jews.
The best general-purpose encoder for English text is the DoubleMetaphone encoder, which is the default for the AIE transformers. It is optimized for English text that contains names of Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origins.
Implementing Phonetic Matching
The next few sections demonstrate how to set up phonetic matching in AIE. You can build the Factbook example from the Quick Start Tutorial and then perform these steps to add phonetic matching to it.
Changes in Schema.xml
You can perform phonetic matching on the title and text fields without making any changes to your <project-dir>\conf\schema.xml file. However, to apply phonetic matching to new text fields, you must add these fields to the AIE Schema or they will not be indexed.
Define the new fields to be very similar to the default text field, especially with respect to the highlighting properties. Since these are inexact matches, you want the highlighting feature to show you exactly what was matched.
Also, for the new fields to work with fieldless keywords ("Michael" rather than "text:Michael"), modify the content field in schema.xml to include the names of the new fields.
Create a new document transformer and a new query transformer, and add them to the appropriate workflows.
Create Document Transformer
To create a document transformer, create a new component from the EncodeSoundsLike document transformer. This component adds phonetic codes to IngestDocuments at ingestion.
1. Access the Palette page from the System Management node in the Attivio Administrator.
2. Click the New link and select the EncodeSoundsLike document transformer. The component editor opens.
3. Name the component soundsLike.
4. Fill out the Field Mappings section of the form.
The component receives input text from the IngestDocument's title field, and writes the transformed output back into the same title field. Similarly, it looks for input in the text field, and writes the transformed text back into the text field.
5. Save the component.
Add to AttivioLinguistics Workflow
After saving the soundsLike component, you can add it to the end of the attivioLinguistics workflow to minimize the risk of it interacting with other linguistic transformations.
1. Access the Workflows page from the System Management node in the Attivio Administrator.
2. Open the Document section, which lists the ingest transformers.
3. Select the attivioLinguistics workflow. This opens an editor.
4. Click the Add Existing Component button to add the soundsLike component to the workflow.
5. Save the workflow.
Both document and query transformers are required. This section defines the query transformer.
Create Query Transformer
A query transformer component based on the EncodeSoundsLike query transformer is required to add phonetic codes to queries.
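1. Access the Palette page from the System Management node in the Attivio Administrator.
2. Click the New link and select the EncodeSoundsLike query transformer.
3. The component editor opens.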
4. Name the component querySoundsLike.
5. Fill out the Field Mappings section of the form.
The component takes input from title and writes it back to title. It takes input from text and writes it back to text. Finally, it takes input from the content field and writes it back to the content field.
6. Save the component.
Why does the query transformer operate on the content field when the document transformer does not? Recall that the content field is a concatenation of the document's title, text, and certain other fields. This concatenation occurs as the document is processed into the index, so the content field does not actually exist in the IngestDocument. The concatenation pulls all the phonetic tokens from title and text into the new content field before indexing.
On the query side, no concatenation occurs. To test phonetic tokens against the content field, you must tell the transformer explicitly to do so.
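The asymmetry between the two sides can be illustrated with a small simulation. Everything here is a sketch under stated assumptions: `toy_code` is a hypothetical stand-in encoder (not Double Metaphone), and `ingest` and `matches` are made-up helpers that only mimic the behavior described above. They are not AIE APIs.

```python
import re

def toy_code(word):
    """Hypothetical stand-in for a phonetic encoder; NOT Double Metaphone."""
    s = word.upper()
    for pair, code in (("KH", "K"), ("GH", "K"), ("CH", "K"), ("DH", "T")):
        s = s.replace(pair, code)              # fold a few digraphs
    s = re.sub(r"(.)\1+", r"\1", s)            # collapse doubled letters
    s = re.sub(r"[AEIOUYHW]", "", s)           # drop vowels and weak consonants
    return s.translate(str.maketrans("GQCD", "KKKT"))

def ingest(doc):
    """Document side: encode title and text, then build content by concatenation."""
    indexed = {}
    for field in ("title", "text"):
        tokens = re.findall(r"[A-Za-z]+", doc[field])
        indexed[field] = [t.lower() for t in tokens] + [toy_code(t) for t in tokens]
    # content is assembled from the already-encoded fields, so it inherits their codes
    indexed["content"] = indexed["title"] + indexed["text"]
    return indexed

def matches(indexed, field, term, encode_fields):
    """Query side: the phonetic code is tried only for fields mapped in the transformer."""
    candidates = {term.lower()}
    if field in encode_fields:
        candidates.add(toy_code(term))
    return any(tok in candidates for tok in indexed[field])

idx = ingest({"title": "Libya", "text": "Muammar Abu Minyar al-Qadhafi"})
print(matches(idx, "content", "Kadafi", {"title", "text"}))             # False
print(matches(idx, "content", "Kadafi", {"title", "text", "content"}))  # True
```

The index already holds the phonetic code in content, but the query only reaches it once content is listed among the fields the query transformer encodes.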
Add to Workflow
In this step you add the querySoundsLike component to the end of the queryAttivioLinguistics workflow to minimize the risk of interaction with other linguistic transformations.
1. Access the Workflows page from the System Management node in the Attivio Administrator.
2. Open the Query section, which lists the query transformers.
3. Select the queryAttivioLinguistics workflow. This opens an editor.
4. Click the Add Existing Component button to add the querySoundsLike component to the workflow.
5. Save the workflow.
There are two ways to verify that phonetic matching is in operation.
Using the FactBook Demo
If you add the components described above to the Factbook Demo, you can then search for "Michael" and find matches in the following country records:
- "Mikheil" in the article about Georgia.
- "Mughal" (Mongol) in the article about Pakistan.
- "Mikhail" in the article about Russia.
Or you can search for variants of "Kadafi" and find "Muammar Abu Minyar al-QADHAFI" in the article about Libya.
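To see why these particular names land in the same result set, it can help to group them by phonetic code. The sketch below uses a hypothetical toy encoder, `toy_code`, that was tuned to the names on this page. It is not Double Metaphone, so the codes for other words may differ from what AIE actually indexes, and `group_by_code` is an illustrative helper rather than part of AIE.

```python
import re
from collections import defaultdict

def toy_code(word):
    """Hypothetical stand-in for a phonetic encoder; NOT Double Metaphone."""
    s = word.upper()
    for pair, code in (("KH", "K"), ("GH", "K"), ("CH", "K"), ("DH", "T")):
        s = s.replace(pair, code)              # fold a few digraphs
    s = re.sub(r"(.)\1+", r"\1", s)            # collapse doubled letters
    s = re.sub(r"[AEIOUYHW]", "", s)           # drop vowels and weak consonants
    return s.translate(str.maketrans("GQCD", "KKKT"))

def group_by_code(names):
    """Bucket names that share a phonetic code; each bucket would match as a group."""
    groups = defaultdict(list)
    for name in names:
        groups[toy_code(name)].append(name)
    return dict(groups)

names = ["Michael", "Mikheil", "Mughal", "Mikhail", "Kadafi", "Qadhafi", "Moscow"]
for code, members in group_by_code(names).items():
    print(code, members)
```

Under this toy encoder the four "Michael" variants share one bucket, the two "Kadafi" variants share another, and "Moscow" stands alone, which mirrors the matches listed above.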
By Inspecting Tokens
The Field Guide to Tokens demonstrates how to capture tokenized text from IngestDocuments before indexing, and how to use the Debug Search page to capture tokenized queries. These techniques let you examine the phonetic tokens on both sides of the process, to help diagnose any puzzling mismatching that occurs. The Field Guide is illustrated in part with phonetic tokens.