One way to get a good summary of a document is to extract its most distinctive words and phrases. This process is called Key-Phrase Extraction. It's been done by humans since the early days of academic conference proceedings, but this component of AIE does it automatically during document ingestion. The following image shows key phrases that were extracted from country records in the Factbook example:
This page explains how to enable and set up key-phrase extraction, and how to generate facets from key phrases.
These features require that the keyphrases module be included when you run createproject to create the project directories. (The languagemodel module will be automatically added to the project along with the keyphrases module.)
The keyphrases module is included in the demo module group.
Key-Phrase Highlighting in SAIL
If you are exploring key-phrase extraction, note that Search UI colors key-phrases orange in the search results.
A language model is a record of how probable words and sequences of words are, measured over a large amount of text. (Language models are large enough that they are packaged as separate downloads that we add to AIE as needed.)
How probable it is that you would encounter this word or phrase in ordinary text.
How probable a phrase is compared to its constituent parts.
Keyword or Key Phrase
Words or phrases that are used unusually frequently in this document, compared to their overall frequency in that language.
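The definitions above can be made concrete with a small Python sketch. It is an illustration only, not AIE's implementation: the background probabilities are made-up stand-ins for a real language model, and the function names (surprise, distinctiveness, cohesion) are hypothetical labels chosen for this example.

```python
import math
from collections import Counter

# Hypothetical background probabilities standing in for a real language model.
BACKGROUND = {"the": 0.05, "economy": 0.0004, "archipelago": 0.000002}

def surprise(probability):
    """Negative log likelihood: the less probable, the more surprising."""
    return -math.log(probability)

def distinctiveness(word, doc_tokens, background=BACKGROUND):
    """In-document relative frequency divided by background probability."""
    doc_freq = Counter(doc_tokens)[word] / len(doc_tokens)
    return doc_freq / background[word]

def cohesion(phrase_probability, word_probabilities):
    """Phrase probability relative to the product of its words' probabilities."""
    return phrase_probability / math.prod(word_probabilities)

doc = "the archipelago economy depends on the archipelago fisheries".split()

# A rare word used repeatedly is far more distinctive than a common one.
print(distinctiveness("archipelago", doc) > distinctiveness("the", doc))
```

In this toy document both "the" and "archipelago" occur twice, but only "archipelago" is used much more often than its background probability predicts, which is what makes it a key-phrase candidate.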
To extract key phrases, AIE requires a language model and a stop word list. AIE ships with an English language model and an English stop word list. Language models and stop word lists for some other languages may be downloaded here: Dutch (nl), French (fr), German (de), Italian (it), Spanish (es), Portuguese (pt). For other languages, please contact firstname.lastname@example.org.
Key-Phrase Extraction Transformer
The StatisticalKeyPhraseExtractor extracts key phrases from documents as described above. It stores them in a field for later use. The most common use is to facet on the key phrases that are associated with query response documents. This allows users to have an overview of the contents of the query responses, and to quickly see the phrases that are related to their original query.
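As a rough illustration of faceting on key phrases, the following Python sketch counts how many response documents carry each phrase. It does not use the AIE query API: documents are modeled as plain dictionaries, the helper name key_phrase_facet is hypothetical, and only the "keyphrases" field name comes from the default configuration.

```python
from collections import Counter

def key_phrase_facet(documents, field="keyphrases", top_n=5):
    """Return (phrase, document count) pairs, most frequent first."""
    counts = Counter()
    for doc in documents:
        # Count each phrase at most once per document.
        counts.update(set(doc.get(field, [])))
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

# Made-up query response documents for illustration.
responses = [
    {"keyphrases": ["oil exports", "natural gas"]},
    {"keyphrases": ["oil exports"]},
    {},  # a document with no extracted key phrases
]
print(key_phrase_facet(responses))
```

The resulting buckets give the kind of overview described above: the phrases that occur in the most response documents appear first.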
To enable key-phrase extraction, include the keyphrases module as one of the modules supplied to createproject. This will add two files to the project: <project_dir>/conf/components/extractKeyPhrases.xml and <project_dir>/conf/features/core/InsertComponent.extractKeyPhrases-ingestPostProcess-first.xml. These files define the extractKeyPhrases component and add it to the ingestPostProcess workflow.
This is the default configuration of the extractKeyPhrases component:
There are a number of configurable properties of the key-phrase extractor:
The name of the language model service.
Asserts whether or not the key-phrase extractor should look at all the natural language fields, as defined by the schema.
list of field names
The list of field names to be processed when useAllNaturalLanguageFields is false.
Language-to-filename property map
A key phrase cannot contain a word in the stopword dictionary for the appropriate locale.
The default locale if a field does not have a locale.
The field in which the extracted key phrases are to be stored. The default "keyphrases" field is defined in the file <install_dir>/conf/keyphrases/schema.xml.
Asserts whether or not to output keywords (i.e. one-word-long key phrases).
Asserts whether or not to output two-word key phrases.
Asserts whether or not to output three-word key phrases.
Asserts whether to allow key phrases composed of duplicate words (e.g. "Tech Tech").
int >= 1
Specifies the minimum number of times a candidate key phrase must be in the document.
int >= 1
No more than this many key phrases will be extracted per document.
float >= 0.0
The minimum amount of "surprise" (negative log likelihood) required of candidate key phrases.
Notes on these parameters:
- maxKeyPhrasesPerDocument limits the number of key phrases extracted for a document; those with the highest surprise value are chosen. Unigrams are extracted first, then bigrams, then trigrams.
- By default, only bigrams are extracted. Many interesting unigrams are actually part of a longer phrase, so extraction of unigram key phrases is disabled by default. Also, in a short document, an interesting trigram may occur fewer than minKeyPhraseDocumentFrequency times, in which case it is not included.
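The way these properties interact can be sketched in Python. This is one possible reading of the descriptions above, not the StatisticalKeyPhraseExtractor's actual code: the Candidate class, the function name, the parameter names, and all default values other than "bigrams only" are assumptions made for the example. It also assumes that "duplicate words" means any repeated word within a phrase, and that "unigrams first" means shorter phrases take priority when the per-document limit is applied.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    words: tuple      # the phrase, one word per element
    count: int        # occurrences in the document
    surprise: float   # negative log likelihood from the language model

def select_key_phrases(candidates, stop_words, *, allowed_lengths=(2,),
                       allow_duplicate_words=False, min_count=2,
                       min_surprise=0.0, max_phrases=10):
    """Filter candidates, then keep the most surprising ones up to the limit."""
    kept = []
    for c in candidates:
        lowered = [w.lower() for w in c.words]
        if len(c.words) not in allowed_lengths:
            continue  # this n-gram length is switched off
        if any(w in stop_words for w in lowered):
            continue  # a key phrase cannot contain a stop word
        if not allow_duplicate_words and len(set(lowered)) < len(lowered):
            continue  # e.g. "Tech Tech"
        if c.count < min_count or c.surprise < min_surprise:
            continue  # too rare in the document, or not surprising enough
        kept.append(c)
    # Assumed ordering: shorter phrases first, then highest surprise.
    kept.sort(key=lambda c: (len(c.words), -c.surprise))
    return [" ".join(c.words) for c in kept[:max_phrases]]

# Made-up candidates for illustration.
candidates = [
    Candidate(("natural", "gas"), 3, 12.0),
    Candidate(("of", "the"), 9, 2.0),
    Candidate(("Tech", "Tech"), 4, 15.0),
    Candidate(("oil",), 7, 9.0),
    Candidate(("crude", "oil"), 1, 14.0),
    Candidate(("oil", "exports"), 2, 13.5),
]
print(select_key_phrases(candidates, {"of", "the"}, max_phrases=2))
```

With the bigram-only default, the unigram "oil" is dropped even though it is frequent, and "crude oil" is dropped because it occurs only once, mirroring the trade-offs described in the notes above.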
The keyphrases module depends on the languageModelService component, which is defined in the file <project_dir>/conf/components/languagemodel/languageModelService.xml. The default languageModelService contains language information for English only. For more information on using language models, see the documentation for Language Models.
Using Key Phrases
- Ensure that the keyphrases and languagemodel modules are listed in the <project_dir>/conf/configuration.xml file.
- For languages besides English:
  - For each language, add the stopword dictionary to the stopWordDictionaries map.
  - For each language, download the appropriate languagemodel jar and put it in the <install_dir>/lib directory.
  - For each language, add a map section to the models map for the languageModelService component. Copying the "en" map and replacing "en" with the appropriate language code should suffice.
- Ingest (or re-ingest) content in a workflow that includes the extractKeyPhrases component.
Viewing Key Phrases in Search UI
Search UI is configured by default to display Key Phrases in a Tag Cloud widget.