Entity extraction is the process of finding the names of persons, locations, corporations, and other entities in unstructured text. The identified entities are then added to the document as metadata and are often used to generate facet lists. AIE offers several entity-extraction tools, which are listed on this page.
Entity-extraction tools use different modules, depending on the tool and the language to be analyzed. See the documentation for the individual tools for more specific information.
Entity Scope Highlighting in SAIL
If you are exploring entity extraction features, note that SAIL color-codes some of the more common entities in the search results:
- People: Green
- Locations: Blue
- Companies: Red
- Dates: Purple
These colors are assigned in <install-dir>\webapps\sail\resources\css\scopesearch.css.
AIE's default behavior is to extract persons, locations, corporations, dates, hashtags and mentions from unstructured text, but many other kinds of "entities" can be extracted, for example:
- biomedical: drug names, chemical names, gene names, etc.
- financial: currencies, stocks, bonds, banks
- identifiers: phone numbers, ISBNs, zip codes, addresses, URLs, timestamps, etc.
- military: weapons, vehicles, ships, etc.
Although some entity-extraction features are dedicated to a specific type of entity (such as First+Last Name Extraction), others can be extended to include any entities that are important to your project.
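As an illustration of extending extraction to project-specific entity types, the following is a minimal sketch of dictionary lookup in plain Java. It is not AIE code and does not use any AIE API: the class name `DictionaryExtractorSketch`, the `extract` method, and the sample dictionary entries are all hypothetical, and the whole-word, case-insensitive matching rule is an assumption chosen for simplicity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of dictionary-based entity lookup; not part of AIE.
class DictionaryExtractorSketch {

    /**
     * Returns "type:term" for every dictionary term that occurs in the text
     * as a whole word (case-insensitive). Each term is reported at most once.
     */
    static List<String> extract(String text, Map<String, String> dictionary) {
        List<String> found = new ArrayList<>();
        String lower = text.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            String term = entry.getKey().toLowerCase(Locale.ROOT);
            int idx = lower.indexOf(term);
            while (idx >= 0) {
                int end = idx + term.length();
                // Require a non-alphanumeric character (or text boundary) on both sides.
                boolean startOk = idx == 0 || !Character.isLetterOrDigit(lower.charAt(idx - 1));
                boolean endOk = end == lower.length() || !Character.isLetterOrDigit(lower.charAt(end));
                if (startOk && endOk) {
                    found.add(entry.getValue() + ":" + entry.getKey());
                    break;
                }
                idx = lower.indexOf(term, idx + 1);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample entries only; a real dictionary would be far larger.
        Map<String, String> dictionary = new LinkedHashMap<>();
        dictionary.put("euro", "currency");
        dictionary.put("aspirin", "drug");
        System.out.println(extract("The price of Aspirin is quoted in euro.", dictionary));
    }
}
```

The whole-word check keeps a term such as "euro" from matching inside "European"; a production dictionary matcher would normally work on tokens rather than raw substrings.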
Metadata Extraction Tools in AIE
This section presents a synoptic view of the metadata-extraction tools available in AIE and its add-on modules. Most, but not all, of these tools are examples of entity extraction. The others perform similar metadata-extraction functions but do not technically exploit "entities." The purpose of the table is to help the reader understand the options and trade-offs among these features, and to know where to find the specific tools.
|Feature||Module||Languages||Description|
|Hashtag, Mention, and Date Extraction||entityextraction||Any language.||Three independent document transformers in the extractBaseEntities workflow automatically extract #hashtags, @mentions, and dates.|
| ||entityextraction||Any language. AIE supplies English dictionaries.||Extracts persons, locations and companies. The dictionaries supplied with AIE are for English. Accepts custom dictionaries to support other entity types and other languages.|
|Conditional Entity Extraction||AIE core||Any language.||This technique routes individual documents to specialized workflows, which implement different kinds of entity extraction. That way "Mustang" is a car in some contexts and a horse in others. It is not an "entity extraction" tool, per se, but is a means of coordinating such tools.|
|First+Last Name Extraction||entityextraction||Any language. AIE supplies lists of names as they appear in English.||Recognizes names of persons based on first-name and last-name dictionaries, plus a series of rules that recognize a well-formed name. Can be extended by the user.|
|Pattern-Based Entity Extraction||entityextraction||Any language.||Uses regular-expression patterns to extract entities such as telephone numbers, license-plate numbers, or email addresses from free text.|
|Pattern-Based Categorization||entityextraction||Any language.||Uses regular-expression patterns to place a document into a category. Not technically an "entity extraction" or "classification" tool, although it superficially resembles both.|
A noun phrase consists of a noun plus its modifiers, as in "programming language." Depends on an English language model which is not user-extendable.
|Key-Phrase Extraction||keyphrase||Most languages.||Extracts words or phrases that are used unusually frequently in this document, compared to their overall frequency in the language. Uses a language model trained on a large corpus of text in each language (not user-extendable).|
|Entity Sentiment||entitysentiment||Most languages||Uses a custom language model (that you create) to assign sentiment scores to entities in text. The sentiment score is based on the tone (friendly, hostile) of the words that were near the entity in the text.|
|Statistical Entity Extraction||alm||Arabic, Chinese, Dutch, English, Farsi, French, German, Hebrew, Italian, Japanese, Korean, Portuguese, Spanish, Russian, Urdu.||This is entity extraction provided by the Advanced Linguistics Module. Returns persons, organizations, and locations, among others. Based on a statistical model of text tagged by hand. Not extendable by users.|
|Custom Entity Extraction in Java||AIE core||Any language||Using the Java Server API, it is relatively simple for a Java programmer to pull field values from an IngestDocument, manipulate them in Java, and write new values into another field. You can implement your own entity-extraction logic by this mechanism.|
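To make the pattern-based approach in the table more concrete, here is a minimal sketch of regular-expression extraction using only `java.util.regex` from the Java standard library. It does not use the AIE Java Server API or `IngestDocument`, and it does not show how AIE configures its own pattern-based extractor. The class name `PatternExtractorSketch`, the `extract` method, and both patterns are hypothetical; the patterns are deliberately simplified and would miss many valid email and phone formats.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of regex-based entity extraction; not part of AIE.
class PatternExtractorSketch {

    // Simplified illustrative patterns (assumptions, not AIE defaults).
    static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    static final Pattern US_PHONE =
        Pattern.compile("\\b\\d{3}-\\d{3}-\\d{4}\\b");

    /** Returns every substring of the text that matches the pattern, in order. */
    static List<String> extract(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Call 617-555-0100 or write to info@example.com.";
        System.out.println("phones: " + extract(US_PHONE, text));
        System.out.println("emails: " + extract(EMAIL, text));
    }
}
```

In a custom Java extractor, the returned values would then be written into a separate field of the document, which is the step the Java Server API row describes; that part is omitted here because it depends on the AIE API.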