Overview

Entity extraction is the process of finding the names of persons, locations, corporations, and other entities in unstructured text. The identified entities are then added to the document as metadata and are often used to generate facet lists. AIE offers several entity-extraction tools, which are listed on this page.
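
As an illustration of how entity metadata can feed a facet list, here is a minimal sketch in Python. It is not AIE code: the document structure, the field names people and location, and the function build_facets are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical documents after entity extraction. Each one carries its
# extracted entities as metadata fields (field names are illustrative).
docs = [
    {"id": "1", "people": ["Ada Lovelace"], "location": ["London"]},
    {"id": "2", "people": ["Ada Lovelace", "Charles Babbage"], "location": ["London"]},
    {"id": "3", "people": ["Alan Turing"], "location": ["Manchester", "London"]},
]


def build_facets(documents, field):
    """Return (value, document_count) pairs for one metadata field.

    Pairs are ordered by descending count, then alphabetically by value.
    """
    counts = Counter()
    for doc in documents:
        # set() so a value repeated inside one document is counted once
        counts.update(set(doc.get(field, [])))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

With the sample data, build_facets(docs, "location") yields London with a count of 3 followed by Manchester with a count of 1, which is the kind of list a search UI could render as a facet.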

Required Modules

Entity-extraction tools require different modules, depending on the tool and the language to be analyzed. See the documentation for each tool for specifics.

Entity Scope Highlighting in SAIL

If you are exploring entity extraction features, note that SAIL color-codes some of the more common entities in the search results:

  • People: Green
  • Locations: Blue
  • Companies: Red
  • Dates: Purple

These colors are assigned in <install-dir>\webapps\sail\resources\css\scopesearch.css.

Entity Classes

By default, AIE extracts persons, locations, corporations, dates, hashtags and mentions from unstructured text, but many other kinds of "entities" can be extracted, for example:

  • biomedical: drug names, chemical names, gene names, etc.
  • financial: currencies, stocks, bonds, banks
  • identifiers: phone numbers, ISBNs, zip codes, addresses, URLs, timestamps, etc.
  • military: weapons, vehicles, ships, etc.

Although some entity-extraction features are dedicated to a specific type of entity (such as First+Last Name Extraction), others can be extended to include any entities that are important to your project. 
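
The general idea of extending extraction to project-specific entity classes can be sketched with a simple dictionary lookup. The Python below is only an illustration under stated assumptions: it does not show how AIE implements or configures its dictionaries, and the names DICTIONARIES and extract_dictionary_entities, as well as the sample terms, are hypothetical.

```python
import re

# Hypothetical custom dictionaries: entity class -> known surface forms.
DICTIONARIES = {
    "drug": ["aspirin", "ibuprofen"],
    "currency": ["euro", "yen", "US dollar"],
}


def extract_dictionary_entities(text, dictionaries=DICTIONARIES):
    """Return sorted (entity_class, matched_text) pairs found in text."""
    found = set()
    for entity_class, terms in dictionaries.items():
        for term in terms:
            # Word boundaries keep "yen" from matching inside "doyenne".
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.add((entity_class, match.group(0)))
    return sorted(found)
```

Adding a new entity class in this sketch means adding one more key and term list to the dictionary; no matching logic changes.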

Metadata Extraction Tools in AIE

This section presents a synoptic view of the metadata-extraction tools available in AIE and its add-on modules.  Most, but not all, of these tools are examples of entity extraction.  The others perform similar metadata-extraction functions but do not technically exploit "entities."  The purpose of the table is to help the reader understand the options and trade-offs among these features, and to know where to find the specific tools.

| Feature | Module | Languages | Description |
|---|---|---|---|
| Hashtag, Mention, and Date Extraction | entityextraction | Any language. | Three independent document transformers in the extractBaseEntities workflow automatically extract #hashtags, @mentions, and dates. |
| Dictionary Entity Extraction | entityextraction | Any language. AIE supplies English dictionaries. | Extracts persons, locations and companies. The dictionaries supplied with AIE are for English. Accepts custom dictionaries to support other entity types and other languages. |
| Conditional Entity Extraction | AIE core | Any language | This technique routes individual documents to specialized workflows, which implement different kinds of entity extraction. That way "Mustang" is a car in some contexts and a horse in others. It is not an "entity extraction" tool, per se, but is a means of coordinating such tools. |
| First+Last Name Extraction | entityextraction | Any language. AIE supplies lists of names as they appear in English. | Recognizes names of persons based on first-name and last-name dictionaries, plus a series of rules that recognize a well-formed name. Can be extended by the user. |
| Pattern-Based Entity Extraction | entityextraction | Any language. | Uses regular-expression patterns to extract entities such as telephone numbers, license-plate numbers, or email addresses from free text. |
| Pattern-Based Categorization | entityextraction | Any language. | Uses regular-expression patterns to place a document into a category. Not technically an "entity extraction" or "classification" tool, although it superficially resembles both. |
| Noun-Phrase Extraction | entityextraction | English | A noun phrase consists of a noun plus its modifiers, as in "programming language." Depends on an English language model which is not user-extendable. |
| Key-Phrase Extraction | keyphrase | Most languages | Extracts words or phrases that are used unusually frequently in this document, compared to their overall frequency in the language. Uses a language model trained on a large corpus of text in each language (not user extendable). |
| Entity Sentiment | entitysentiment | Most languages | Uses a custom language model (that you create) to assign sentiment scores to entities in text. The sentiment score is based on the tone (friendly, hostile) of the words that were near the entity in the text. |
| Statistical Entity Extraction | alm | Arabic, Chinese, Dutch, English, Farsi, French, German, Hebrew, Italian, Japanese, Korean, Portuguese, Spanish, Russian, Urdu. | This is entity extraction provided by the Advanced Linguistics Module. Returns persons, organizations, and locations, among others. Based on a statistical model of text tagged by hand. Not extendable by users. See the Features by Language Table and individual language pages under Configuring Languages in Attivio for more details. |
| Custom Entity Extraction in Java | AIE core | Any language | Using the Java Server API, it is relatively simple for a Java programmer to pull field values from an IngestDocument, manipulate them in Java, and write new values into another field. You can implement your own entity-extraction logic by this mechanism. |
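
The pattern-based approach listed in the table can be illustrated with a short, generic sketch. This is plain Python using the standard re module, not AIE configuration; the pattern set, the entity-type names email and us_phone, and the function extract_pattern_entities are assumptions for this example, and the patterns are deliberately simplified.

```python
import re

# Illustrative patterns only; real-world formats vary far more than this.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def extract_pattern_entities(text):
    """Return a dict mapping each entity type to the matches found in text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}
```

Calling extract_pattern_entities("Call 617-555-0100 or mail help@example.com.") returns one phone match and one email match. A production pattern set would need broader coverage (international phone formats, for instance) and testing against real documents.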