Overview

One way to get a good summary of a document is to extract its most distinctive words and phrases. This process is called key-phrase extraction. People have done it by hand since the early days of academic conference proceedings; this component of AIE does it automatically during document ingestion. The following image shows key phrases that were extracted from country records in the Factbook example:

This page explains how to enable and set up key-phrase extraction, and how to generate facets from key phrases.

Required Modules

These features require that the keyphrases module be included when you run createproject to create the project directories. (The languagemodel module will be automatically added to the project along with the keyphrases module.) 

The keyphrases module is included in the demo module group.

Key-Phrase Extraction

This page refers to AIE's Key-Phrase Extraction feature. It depends on an English language model that is not user-extendable. Additional models are available for Dutch (nl), French (fr), German (de), Italian (it), Spanish (es), and Portuguese (pt); models for other languages are available on request.

Key-Phrase Highlighting in SAIL

If you are exploring key-phrase extraction, note that Search UI highlights key phrases in orange in the search results.

Important Concepts

| Term | Discussion |
| --- | --- |
| Language Model | A language model is a record of how probable words and sequences of words are, measured over a large amount of text. (Language models are large enough that they are packaged as separate downloads that we add to AIE as needed.) |
| Informativeness | How probable it is that you would encounter this word or phrase in ordinary text. The less probable the word or phrase, the more informative it is. |
| Phrasehood | How probable a phrase is compared to its constituent parts. |
| Keyword or Key Phrase | Words or phrases that are used unusually frequently in this document, compared to their overall frequency in that language. |

To extract key phrases in AIE, a language model and a stopword list are required. AIE ships with an English language model and an English stopword list. Language models and stopword lists are available for download for the following languages: Dutch (nl), French (fr), German (de), Italian (it), Spanish (es), Portuguese (pt). For other languages, please contact sales@attivio.com.

Key-Phrase Extraction Transformer

The StatisticalKeyPhraseExtractor extracts key phrases from documents as described above. It stores them in a field for later use. The most common use is to facet on the key phrases that are associated with query response documents. This allows users to have an overview of the contents of the query responses, and to quickly see the phrases that are related to their original query.

To enable key-phrase extraction, include the keyphrases module as one of the modules supplied to createproject. This adds two files to the project: <project_dir>/conf/components/extractKeyPhrases.xml and <project_dir>/conf/features/core/InsertComponent.extractKeyPhrases-ingestPostProcess-first.xml. These files define the extractKeyPhrases component and add it to the ingestPostProcess workflow.

<project_dir>/conf/features/core/InsertComponent.extractKeyPhrases-ingestPostProcess-first.xml
  <f:insertComponent component="extractKeyPhrases" enabled="true" featureNameSource="component,workflow,position,relativeComponent" position="first" skip-if-exists="false" workflow="ingestPostProcess"/>

Configuration Parameters

This is the default configuration of the extractKeyPhrases component:

<project_dir>/conf/components/extractKeyPhrases.xml
<?xml version="1.0" encoding="UTF-8"?>

<component xmlns="http://www.attivio.com/configuration/type/componentType" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="extractKeyPhrases" class="com.attivio.platform.transformer.ingest.linguistics.StatisticalKeyPhraseExtractor" xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType http://www.attivio.com/configuration/type/componentType.xsd ">
  <!--Generated configuration-->
  <properties>
    <property name="languageModelServiceName" value="languageModelService"/>
    <property name="stopWordDictionaryName" value="keyphraseStopwords"/>
    <map name="stopWordDictionaries">
      <property name="en" value="keyphrases/dictionaries/very_common_words_en.csv"/>
    </map>
    <property name="useAllNaturalLanguageFields" value="true"/>
    <property name="unigrams" value="false"/>
    <property name="bigrams" value="true"/>
    <property name="trigrams" value="false"/>
    <property name="allowDuplicates" value="false"/>
    <property name="output" value="keyphrases"/>
  </properties>
</component>

There are a number of configurable properties of the key-phrase extractor:

| property name | values | description | default |
| --- | --- | --- | --- |
| languageModelServiceName | string | The name of the language model service. | languageModelService |
| useAllNaturalLanguageFields | boolean | Assert whether or not the key-phrase extractor should look at all the natural language fields, as defined by the schema. | true |
| fields | list of field names | The list of field names to be processed when useAllNaturalLanguageFields is false. | |
| stopWordDictionaries | Language-to-filename property map | A key phrase cannot contain a word in the stopword dictionary for the appropriate locale. | |
| defaultLocale | locale string | The default locale if a field does not have a locale. | en-us |
| output | field name | The field in which the extracted key phrases are to be stored. The default "keyphrases" field is defined in the file <install_dir>/conf/keyphrases/schema.xml. | keyphrases |
| unigrams | boolean | Asserts whether or not to output keywords (i.e. one-word-long key phrases). | false |
| bigrams | boolean | Asserts whether or not to output two-word key phrases. | true |
| trigrams | boolean | Asserts whether or not to output three-word key phrases. | false |
| allowDuplicates | boolean | Asserts whether to allow key phrases composed of duplicate words (e.g. "Tech Tech"). | false |
| minKeyPhraseDocumentFrequency | int >= 1 | Specifies the minimum number of times a candidate key phrase must be in the document. | 2 |
| maxKeyPhrasesPerDocument | int >= 1 | No more than this many key phrases will be extracted per document. | 1000 |
| minSurprise | float >= 0.0 | The minimum amount of "surprise" (negative log likelihood) required of candidate key phrases. | 30.0 |
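
To illustrate the fields and useAllNaturalLanguageFields properties, the fragment below sketches how extraction might be limited to two named fields. It rests on assumptions: the field names title and text are placeholders, and the list/entry syntax for list-valued properties is assumed rather than taken from this page, so compare it with an existing component in your project before relying on it.

```xml
<!-- Sketch only: these lines would go inside <properties> in extractKeyPhrases.xml -->
<!-- Stop scanning every natural language field defined by the schema -->
<property name="useAllNaturalLanguageFields" value="false"/>
<!-- Assumed list syntax; "title" and "text" are placeholder field names -->
<list name="fields">
  <entry value="title"/>
  <entry value="text"/>
</list>
```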

Notes on these parameters:

  • If maxKeyPhrasesPerDocument limits the number of key phrases extracted for a document, those with the highest surprise value are chosen. Unigrams are extracted first, then bigrams, then trigrams.
  • By default, only bigrams are extracted. Many interesting unigrams are actually part of a longer phrase, so extraction of unigram key phrases is disabled by default. Also, in a short document, an interesting trigram phrase may occur fewer than minKeyPhraseDocumentFrequency times, in which case it is not included.
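
As an illustration of these notes, the following fragment sketches how the default configuration might be adjusted so that trigrams are extracted from short documents. The property names are those listed for the component; the values are illustrative assumptions, not recommendations.

```xml
<!-- Sketch only: add or replace the matching <property> lines inside <properties> -->
<property name="bigrams" value="true"/>
<!-- Also emit three-word key phrases (disabled by default) -->
<property name="trigrams" value="true"/>
<!-- Accept phrases that occur only once; the default of 2 can exclude trigrams in short documents -->
<property name="minKeyPhraseDocumentFrequency" value="1"/>
<!-- Cap the output per document; when the cap applies, the highest-surprise phrases are kept -->
<property name="maxKeyPhrasesPerDocument" value="50"/>
```

Lowering minKeyPhraseDocumentFrequency admits more candidates, so pairing it with a lower maxKeyPhrasesPerDocument is one way to keep the resulting facet manageable.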

The keyphrases module depends on the languageModelService component, which is defined in the file <project_dir>/conf/components/languagemodel/languageModelService.xml. The default languageModelService contains language information for English only. For more information on using language models, see the documentation for Language Models.

Using Key Phrases

  1. Ensure that the keyphrases and languagemodel modules are listed in the <project_dir>/conf/configuration.xml file.
  2. For languages besides English:
    • For each language, add the stopword dictionary to the stopWordDictionaries map. 
    • For each language, download the appropriate languagemodel jar and put it in the <install_dir>/lib directory.
    • For each language, add a map section to the models map for the languageModelService component. Copying the "en" map and replacing "en" with the appropriate language code should suffice.
  3. Ingest (or re-ingest) content in a workflow that includes the extractKeyPhrases component.
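
For the stopword part of step 2, the stopWordDictionaries map in extractKeyPhrases.xml might look like the sketch below after adding French. The en entry is the shipped default; the fr path is a hypothetical file name that follows the English naming pattern, so substitute the location of the stopword list you actually downloaded. The corresponding models entry for languageModelService is not shown, because its layout is not given on this page.

```xml
<map name="stopWordDictionaries">
  <!-- Shipped default -->
  <property name="en" value="keyphrases/dictionaries/very_common_words_en.csv"/>
  <!-- Hypothetical path: use the actual location of your French stopword list -->
  <property name="fr" value="keyphrases/dictionaries/very_common_words_fr.csv"/>
</map>
```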

Viewing Key Phrases in Search UI

Search UI is configured by default to display Key Phrases in a Tag Cloud widget.