Overview
Dictionary-based entity extraction can be configured to detect interesting terms in a document and set them aside for later querying or faceting. The following images show entities extracted from country records of the Factbook example:
The default English name dictionary consists of over 860,000 names which were extracted from Wikipedia (English). This dictionary can be used to extract proper names from English text with high precision (almost everything identified is a name), at the expense of low recall (not all names are identified). If higher recall is required, then First+Last Name Extraction or Statistical Entity Extraction may be used instead.
Required Modules
These features require that the entityextraction module be included when you run createproject to create the project directories. (This module is part of the demo group.)
Dictionary Entity Extraction
This page refers to AIE's Dictionary Entity Extraction feature. The dictionaries supplied by Attivio are most pertinent to names, locations, and corporations as seen in English, but the dictionaries can be extended for other languages by the user.
AIE offers other forms of entity extraction in addition to this one.
The Advanced Linguistics Module offers its own entity-extraction features that can be applied to English and to several other languages.
Sentence Boundaries
Dictionary entity matching does not cross sentence boundaries. If your entity dictionary contains an Angela James entry, the dictionary entity extractor will identify these words as an entity in the single-sentence text "I spoke to Angela James" but will not identify them in the two-sentence text "I spoke to Angela. James was also present."
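The sentence-boundary rule can be illustrated with a minimal Python sketch. This is not AIE's implementation: the naive sentence splitter, the whitespace tokenization, and the function names split_sentences and find_entities are all assumptions made for illustration.

```python
import re

def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def find_entities(text, dictionary):
    """Return dictionary entries whose words all occur, in order, inside one sentence."""
    found = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"\w+", sentence)
        for entry in dictionary:
            entry_tokens = entry.split()
            n = len(entry_tokens)
            # Slide a window of n tokens across this sentence only.
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == entry_tokens:
                    found.append(entry)
    return found

dictionary = ["Angela James"]
print(find_entities("I spoke to Angela James", dictionary))
print(find_entities("I spoke to Angela. James was also present.", dictionary))
```

The first call finds the entry; the second finds nothing, because "Angela" and "James" fall in different sentences and each sentence is matched separately.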
Changes are not retroactive!
Changes to dictionary entity extraction, including any modification to the dictionaries, are not retroactive. To apply the changes to documents that have already been ingested, you will have to load the documents again.
Reliability of Location Data
The Location dictionary is derived from data at geonames.org, which is licensed under a Creative Commons Attribution 3.0 License. See http://creativecommons.org/licenses/by/3.0/. The Data is provided "as is" without warranty or any representation of accuracy, timeliness or completeness.
Entity Dictionary Formats
The format of CSV file entity dictionaries is described here.
Entity dictionaries may have one or two columns.
- Single-column dictionaries (like the default people and location dictionaries) simply contain one long list of entities.
- The company dictionary has two columns. The first value on a line is used to match an entity in text. The second value is a "normalized" entity name. This lets us index a single entity even though its name appears in many different forms in the documents. To use this feature, the extractDictionaryEntities component (such as peopleFinder in the extractBaseEntities workflow) must be modified to use the customized dictionary.
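To make the two-column idea concrete, here is a small Python sketch of how a match form can map to a normalized name. The sample rows use a fictitious company and are not taken from the shipped dictionary; the function names are hypothetical, and the fallback for one-column rows is an assumption of this sketch rather than documented AIE behavior.

```python
import csv
import io

# Hypothetical two-column content: match form, then normalized name.
SAMPLE = (
    '"Acme Corp","Acme Corporation"\n'
    '"ACME","Acme Corporation"\n'
    '"Acme Corporation","Acme Corporation"\n'
)

def load_normalization_map(stream):
    """Map each match form to its normalized name.

    Rows with a single column map to themselves (an assumption of this sketch).
    """
    mapping = {}
    for row in csv.reader(stream, skipinitialspace=True):
        if not row:
            continue
        surface = row[0].strip()
        has_second = len(row) > 1 and row[1].strip()
        mapping[surface] = row[1].strip() if has_second else surface
    return mapping

def normalize(mentions, mapping):
    """Collapse the mentions found in text to their distinct normalized names."""
    return sorted({mapping[m] for m in mentions if m in mapping})

mapping = load_normalization_map(io.StringIO(SAMPLE))
print(normalize(["ACME", "Acme Corp"], mapping))  # prints ['Acme Corporation']
```

Two different forms in the text end up as a single indexed value, which is the purpose of the second column.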
Load Entity Dictionaries
While Attivio supplies dictionaries to extract persons, locations and companies (when not using the statistical entity extraction provided by the Advanced Linguistics Module), these dictionaries must be imported into the Dictionary Manager in order to be accessible by the pre-configured entity-extraction transformers.
Follow the directions in the Using a Managed Dictionary section below to load and publish the provided dictionaries.
- <install_dir>\resources\entityextraction\dictionaries\entities_en_company.csv.gz
- <install_dir>\resources\entityextraction\dictionaries\entities_en_location.csv.gz
- <install_dir>\resources\entityextraction\dictionaries\entities_en_people.csv.gz
Customizing Dictionary Entity Extraction
To customize AIE's Dictionary Entity Extraction behavior you must perform three tasks:
- Copy and adapt the default entity dictionary to suit your project.
- Modify the appropriate document transformer to use the customized dictionary.
- Distribute the dictionary and the modified transformer across your AIE nodes.
These tasks can be accomplished by manipulating dictionary CSV files, or by importing the CSV files into the Dictionary Manager. The two strategies differ in detail but work equally well.
Copy/Adapt Entity Dictionaries
Using CSV files
Included with the Attivio Intelligence Engine (AIE) (in the entityextraction module) are entity dictionary files for persons, locations and companies. Note that these dictionaries are intended for use with English text. The dictionaries are:
- <install_dir>\resources\entityextraction\dictionaries\entities_en_company.csv.gz
- <install_dir>\resources\entityextraction\dictionaries\entities_en_location.csv.gz
- <install_dir>\resources\entityextraction\dictionaries\entities_en_people.csv.gz
If you intend to modify a dictionary, start by unzipping the .gz file. The customized dictionary does not need to be zipped in order to work.
Copy the dictionary file to the <project-dir>\resources\ directory of your project. You may rename the file if you wish.
The dictionaries are simple .CSV text files that can be edited in a text editor. It is best to enclose each entity in double-quotes.
It is a best practice to keep the entries in alphabetical order within the file. This makes it easier to detect duplicate entries.
You will need to configure a workflow component to use the dictionary (below).
Remember that any update to the dictionary files will require that you use the AIE Agent CLI to deploy the files to all AIE indexing nodes (discussed below on this page).
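The unzip, quote, and sort steps above could be scripted along the following lines. This is a sketch only: the function name normalize_dictionary is hypothetical, UTF-8 encoding is an assumption about the dictionary files, and the sort is a plain case-sensitive string sort.

```python
import csv
import gzip

def normalize_dictionary(src_gz, dest_csv):
    """Unzip a dictionary, drop duplicate rows, sort, and double-quote every value.

    Returns the number of rows written.
    """
    # Assumes the dictionary is UTF-8 text; adjust if your files differ.
    with gzip.open(src_gz, "rt", encoding="utf-8", newline="") as src:
        rows = {
            tuple(cell.strip() for cell in row)
            for row in csv.reader(src, skipinitialspace=True)
            if row
        }
    ordered = sorted(rows)  # case-sensitive alphabetical order
    with open(dest_csv, "w", encoding="utf-8", newline="") as dest:
        writer = csv.writer(dest, quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerows(ordered)
    return len(ordered)

# Example call (paths are placeholders for your own install and project):
# normalize_dictionary("entities_en_location.csv.gz", "entities_en_location.csv")
```

Using a set removes exact duplicates before sorting, so the output file is already in the alphabetical, one-entry-per-line form recommended above.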
Using a Managed Dictionary
To customize one of AIE's standard entity dictionaries in the Dictionary Manager, follow these steps.
Unzip the .gz file and save the CSV file to your desktop or other convenient location.
Navigate to the Dictionary Manager in the AIE Administrator. You might need to provide credentials when you open the Dictionary Manager. The default credentials are user aieadmin, password attivio.
Click the Import Dictionary button. Supply the location of the CSV file. Import the dictionary.
After import, make the desired term edits in the dictionary.
You will have to approve and publish the dictionary before it will be available for use.
You will need to configure a workflow component to use the dictionary (below).
Modify Entity-Extraction Transformers
Dictionary entity extraction is performed by the companyFinder, peopleFinder, and locationFinder components of the extractBaseEntities workflow.
This workflow performs AIE's core linguistics entity-extraction operations (but not statistical entity extraction, which usually requires the Advanced Linguistics Module). The companyFinder, locationFinder and peopleFinder stages are based on the ExtractDictionaryEntities transformer.
Using CSV files
To use a customized dictionary CSV file, place the dictionary file in your <project-dir>\resources\ directory and edit the Entity Dictionaries map on the component. This is the locationFinder component, showing the new path to the location dictionary.
AIE tries to be smart about interpreting the path to the dictionary. If you provide only the file name, AIE will automatically look for the file in the <project-dir>\resources\ directory. You can also use the form ${attivio.project}\resources\entities_en_location.csv, which makes the location more explicit for a human viewer. It is not a good practice to use a literal machine-specific path, because it may not be correct when the transformer is distributed to other servers.
Using a Managed Dictionary
AIE will look in the AIE Store for a managed dictionary that matches the name in the component's Dictionary Name field. If it finds a published dictionary with the same name, it will use that dictionary and ignore the settings in the Entity Dictionaries map:
Don't use the dictionary's URI
Note that the managed dictionary's URI string from the Dictionary Manager should not be entered in the Dictionary URI field on the component editor. A managed dictionary is linked to the component by its name, not by its URI.
In this example we emptied the Entity Dictionaries map to remove any confusion about which kind of dictionary is in use.
Just be sure that the Dictionary Name matches the name of the published dictionary.
Distribute Changes across Nodes
When you make changes in a CSV file in the <project-dir>\resources\ directory, you must use the Agent CLI to deploy the project again before the changes become available for testing.
Changes made to a managed dictionary through the Dictionary Manager are automatically available to all nodes. The changes to the workflow component, however, must be deployed through the Agent CLI.
Adding a New Entity Type
What if you want to use Dictionary Entity Extraction to find a new type of entity that has special relevance to your application? The general approach is as follows:
- Add the new entity type as a field in the AIE Schema.
- Create a new entity dictionary.
- Include the dictionary and schema update in the deployed project through AIE Agent.
- Create a new dictionary entity-extraction component that references the new entity type.
- Add the component to the extractBaseEntities workflow (or to some other appropriate workflow location).
- Consider re-ingesting the documents you have already added to the index.
Modify the AIE Schema
If you want the new entities to be listed in search results or to be shown as a facet (like the people, location and company fields in the FactBook demo), you will need to add the entity type to the AIE schema as a new field. In this example the new EuropeanCapital entity type is a special kind of location, so the easy approach is to copy the location field definition and modify it. Here is the original location field definition, followed by the modified copy:
<field name="location" type="string" indexed="true" stored="true" displayName="Locations"/>
<field name="EuropeanCapital" type="string" indexed="true" stored="true" displayName="European Capitals"/>
Edit both the name and the displayName attributes. The displayName is the label that will appear on the facet display.
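If you prefer to generate the copied field definition rather than edit it by hand, the following Python sketch shows the idea using the standard library's ElementTree. It works on the single field element quoted above; the helper name derive_field is hypothetical, and the sketch does not touch the actual schema file.

```python
import copy
import xml.etree.ElementTree as ET

LOCATION_FIELD = (
    '<field name="location" type="string" indexed="true" '
    'stored="true" displayName="Locations"/>'
)

def derive_field(field_xml, name, display_name):
    """Copy a field definition and change only its name and displayName."""
    original = ET.fromstring(field_xml)
    new_field = copy.deepcopy(original)  # leave the original element untouched
    new_field.set("name", name)
    new_field.set("displayName", display_name)
    return ET.tostring(new_field, encoding="unicode")

print(derive_field(LOCATION_FIELD, "EuropeanCapital", "European Capitals"))
```

All other attributes (type, indexed, stored) carry over unchanged, which matches the copy-and-modify approach described above.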
Create a New Dictionary
To support this exercise, we created a CSV file of European capital cities. This file works well with the FactBook demo data, and represents a generic "new entity type" dictionary. We called the file en_europe_capitals.csv
Here are a few lines of content:
"Dublin", "London", "Amsterdam", "Berlin", "Warsaw", "Minsk", "Brussels", "Prague", ... many more ...
Save this file in the <project-dir>\resources\ directory.
Deploy the Dictionary and Schema
Open the Agent CLI for this project and deploy it. This copies the new dictionary and the updated schema file to the configuration server(s).
It is necessary to restart the AIE ingestion nodes. Restarting the nodes distributes the new dictionary and the schema file to each of the AIE nodes. (If you skip this step, you might have difficulty later when you try to modify the entityExtraction workflow.)
Create a New Entity-Extraction Component
Create a new ingest component based on ExtractDictionaryEntities.
- In the AIE Administrator, navigate to System Management > Palette > New > Platform Components > Document Transformers and select the ExtractDictionaryEntities component type.
- Give the new component a name. We used extractEuropeanCapitals.
- Set the Entity Type field of the new component to your new entity type, such as EuropeanCapital. This label will be used as a field name and as a scope token tag during the ingestion process. It must exactly match the new field that you added to the AIE Schema.
- Set the Entity Dictionaries map of the new component to the location of the new dictionary. In this example that would be ${attivio.project}\resources\en_europe_capitals.csv.
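Because the Entity Type must exactly match the schema field name, a quick pre-deployment check can catch spelling or case mismatches. The sketch below is an illustration only: the sample XML with its schema wrapper element is a made-up stand-in for the real schema file, whose full layout is not shown on this page, and schema_has_field is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Stand-in schema fragment (assumed layout, for illustration only).
SAMPLE_SCHEMA = """<schema>
  <field name="location" type="string" indexed="true" stored="true" displayName="Locations"/>
  <field name="EuropeanCapital" type="string" indexed="true" stored="true" displayName="European Capitals"/>
</schema>"""

def schema_has_field(schema_xml, entity_type):
    """Return True if some field element has exactly this name (case-sensitive)."""
    root = ET.fromstring(schema_xml)
    return any(f.get("name") == entity_type for f in root.iter("field"))

print(schema_has_field(SAMPLE_SCHEMA, "EuropeanCapital"))   # prints True
print(schema_has_field(SAMPLE_SCHEMA, "europeancapital"))   # prints False
```

The comparison is deliberately case-sensitive, so a component configured with europeancapital would be flagged before it is deployed.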
Modify entityExtraction Workflow
The next step is to add the new component to the entityExtraction workflow. (It is a best practice for users to modify entityExtraction instead of extractBaseEntities.)
- In AIE Administrator, navigate to System Management > Workflows > Ingest and open the entityExtraction workflow for editing.
- Click the Add Existing Component button.
- Select the extractEuropeanCapitals component.
- Use the Move Up button to move the component to the top of the list.
- Save the workflow. (Note: If you have difficulty saving the workflow, you might need to restart the AIE nodes before performing this step.)
Consider Re-Ingesting the Indexed Documents
Changes to entity extraction features are not retroactive. They are applied to new documents from this point forward. If you want to extract the entities from documents that you have already indexed, you will have to reload them.
Run the Example
We followed the steps described above, re-ingested the FactBook country documents, and tweaked Search UI to display the EuropeanCapital field and facet. Then we queried for text:scope(EuropeanCapital) using the Advanced Query Language. This query matches all documents that mention a EuropeanCapital entity in the text field.
We found nineteen documents that mentioned European capitals. Here are two of them:
This was the new facet list of matching EuropeanCapital entities. Note the "European Capitals" display label on the facet:
Technical Notes
To use an entity dictionary, do the following:
- Make sure that AIE is installed and that the entityextraction module is installed.
- Create a project which uses the entityextraction module. Or follow the instructions here.
Dictionary entity extraction processes all natural language fields. The default AIE schema specifies that the "title" and "text" fields are enabled for natural language processing. To process other fields, add the following property to the field properties declaration in <project-dir>\conf\schema\default.xml:
<property name="naturalLanguage" value="true"/>
See Field Properties for more details.
- Alternatively, an instance of the extracting component (such as locationFinder) can be configured to process an explicit list of fields by setting the input property on the field. In this case, the input fields must be declared to be tokenized.
- If a custom ingest workflow is being used, ensure that entity extraction occurs after tokenization (standardAnalyzer) in the workflow.
- Test to see if entities are extracted.
- Ingest (or re-ingest) the Factbook corpus. (Main article: Quick Start Tutorial.)
- Search for *:* in Search UI.
- Open the Search Options, and check the Debug checkbox. This temporarily shows all result fields.
Look for the fields people, company, and location in the detail section of the returned documents.
Empty entity fields never display in the detail section of the returned documents. For example, people: appears in the detail section of a document only if the entity extractor finds at least one person in the document that exists in the Persons dictionary.