
Overview

AIE processes English text by default using AIE Core Linguistics. This capability is transparently part of every AIE project. 

Note that the ISO-639-1 2-letter code for the English language is "en".

ALM analysis of English

The Advanced Linguistics Module (ALM) can also process English, but this capability is normally turned off. Consider turning it on if you need the ALM's advanced statistical entity extraction on English text. See English (using ALM).

 


Prerequisites

There are no prerequisites for processing English language text. AIE processes English language text by default.

English Language Features in AIE

AIE supports the following linguistic analysis features for the English language.

  • Segmentation: The English language does not require segmentation.
  • Lemmatization: The ALM supports lemmatization for the English language, which occurs during tokenization. AIE may be configured to use stemming instead; the two are mutually exclusive.
  • Decompounding: Decompounding is not performed on English text.
  • Classification Model: Classification for the English language is supported, but additional training data or dictionaries may be required. Requires license for the Classifier module.
  • Sentiment Analysis Model: Sentiment Analysis for the English language is supported, but additional training data or dictionaries may be required. Requires license for the Sentiment module.
  • Entity-Sentiment Analysis: Entity-sentiment analysis is available for the English language, but a professional-services engagement is required. Requires licenses for the Classifier, Sentiment, and Entity-Sentiment modules.
  • Key Phrase Extraction: Key-phrase extraction is available in the English language.
  • More-like-this: More-like-this querying is available in the English language.
  • OCR Module: Optical character recognition is supported for the English language. Requires license for the OCR module.
  • Dictionary-Based Entity Extraction: Dictionary-based entity extraction in the English language is supported, but additional training data or dictionaries may be required.
  • First+Last Name Extraction: First+Last Name Extraction is available for the English language. AIE provides extensive dictionary files of first and last names, which can be extended by the user.
  • Statistical Entity Extraction: Statistical entity extraction is supported for the English language. Requires that the Advanced Linguistics Module (ALM) license be upgraded to activate the ALM for this language.
  • Spelling Correction: Spelling correction is supported for the English language.

Language Model

Language models are used for extracting key phrases (main article: Key Phrase Extraction) and extracting queries for related documents (main article: "More-Like-This" Support).

The languagemodel module included in the default AIE installation includes an English language model.

Stopwords

Stopwords are typically common words (such as "the" or "a" in English) or words that are not desirable in a certain query context, such as the phrase: "I need information about." Several AIE components use a stopword list. 

Main article: Stopword Removal

The following stopword lists are available for download below, and they are also installed by default in <install-dir>/conf/dictionaries/:

  • A small stopword list, containing approximately 50 words: 50_stopwords_en.txt.
  • A stopword list containing approximately 100 words: 100_stopwords_en.txt.
  • A large stopword list, containing approximately 500 words: 500_stopwords_en.txt. Reviewing this list before use is recommended as it contains common words such as committee, trade, and europe.
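As a rough illustration of what stopword removal does to a query, the following Java sketch filters query tokens against a small in-memory list. The class name, the method, and the five-word list are hypothetical stand-ins chosen for this example; this is not AIE's implementation, and the real lists are the files named above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of the AIE API.
final class StopwordSketch {
    // Tiny stand-in for a stopword list such as 50_stopwords_en.txt.
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("the", "a", "of", "in", "about"));

    // Returns the lower-cased query tokens that are not stopwords, in order.
    static List<String> removeStopwords(String query) {
        List<String> kept = new ArrayList<>();
        for (String token : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopwords("The history of the maple"));
    }
}
```

With a larger list, more tokens are dropped, which is why a list containing words such as "trade" or "europe" should be reviewed before use.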

 

Lemmatization

Lemmatization is the process of augmenting documents and/or queries using a lemmatization dictionary, which maps words to lemmas, or root forms. AIE supports lemmatization for English by default, using lemmatization by reduction during ingestion and during query processing.
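The idea of lemmatization by reduction can be sketched in a few lines of Java: document tokens and query tokens are both reduced to a lemma, so inflected forms match one another. The class, its methods, and the three-entry dictionary are hypothetical and exist only for this illustration; the actual work in AIE is done by the tokenizer described below.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the AIE tokenizer.
final class LemmaReductionSketch {
    // Toy lemmatization dictionary: inflected form -> lemma.
    static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("mice", "mouse");
        LEMMAS.put("ran", "run");
        LEMMAS.put("running", "run");
    }

    // Reduces a token to its lemma; unknown tokens pass through lower-cased.
    static String reduce(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(lower, lower);
    }

    // A document token and a query token match when both reduce to the same lemma.
    static boolean matches(String documentToken, String queryToken) {
        return reduce(documentToken).equals(reduce(queryToken));
    }

    public static void main(String[] args) {
        System.out.println(matches("Mice", "mouse"));
        System.out.println(matches("running", "ran"));
    }
}
```

Because the reduction is applied on both sides, a document indexed without it would no longer line up with a reduced query term.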

AIE core linguistics lemmatization of English is performed by the EnglishTokenizer. If you want to turn off lemmatization, edit the <project-dir>/conf/features/core/TokenizerModel.english.xml file, adding an f:property element with name "lemmas" and value "off":

<project-dir>/conf/features/core/TokenizerModel.english.xml
<?xml version="1.0" encoding="UTF-8"?>
<ff:features xmlns:ff="http://www.attivio.com/configuration/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:fbase="http://www.attivio.com/configuration/features/base" xmlns:f="http://www.attivio.com/configuration/features/core" xsi:schemaLocation="http://www.attivio.com/configuration/config http://www.attivio.com/configuration/config.xsd http://www.attivio.com/configuration/features/base http://www.attivio.com/configuration/features/baseFeatures.xsd http://www.attivio.com/configuration/features/core http://www.attivio.com/configuration/features/coreFeatures.xsd">
 <f:tokenizer class="com.attivio.platform.tokenizer.EnglishTokenizer" enabled="true" fallbackLocale="en" name="english">
 	<f:property name="lemmas" value="off"/>
 </f:tokenizer>
</ff:features>

Lemmatization modifies incoming text, so you'll have to re-index any documents that have been ingested up to this point. A mix of lemmatized and unlemmatized documents in an index produces incomplete search results.

AIE can be configured to perform algorithmic stemming instead of lemmatization (see below), but this practice is not encouraged.

Stemming

Stemming is not recommended. It is better to use the default lemmatization feature instead.

To enable stemming for core AIE linguistics (of English), you can modify the EnglishTokenizer to perform stemming instead of lemmatization. Edit the <project-dir>/conf/features/core/TokenizerModel.english.xml file, adding an f:property element with name "lemmas" and value "stem":

<project-dir>/conf/features/core/TokenizerModel.english.xml
<?xml version="1.0" encoding="UTF-8"?>
<ff:features xmlns:ff="http://www.attivio.com/configuration/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:fbase="http://www.attivio.com/configuration/features/base" xmlns:f="http://www.attivio.com/configuration/features/core" xsi:schemaLocation="http://www.attivio.com/configuration/config http://www.attivio.com/configuration/config.xsd http://www.attivio.com/configuration/features/base http://www.attivio.com/configuration/features/baseFeatures.xsd http://www.attivio.com/configuration/features/core http://www.attivio.com/configuration/features/coreFeatures.xsd">
 <f:tokenizer class="com.attivio.platform.tokenizer.EnglishTokenizer" enabled="true" fallbackLocale="en" name="english">
 	<f:property name="lemmas" value="stem"/>
 </f:tokenizer>
</ff:features>

Stemming modifies incoming text, so you'll have to re-index any documents that have been ingested up to this point. A mix of stemmed and unstemmed documents in an index produces incomplete search results.

Synonym Expansion

Synonym expansion is the process of augmenting queries using a synonym dictionary to map words to synonym sets.

The procedures below support two paths of synonym usage:

  • One is a short path to testing/demonstrating synonym behavior in the Debug Search page. Use this for initial testing of a new synonym dictionary.
  • The second path enables synonym behavior from any independent search client (such as SAIL). It involves creating a new query transformer in Java.

Direct use of CSV files is deprecated

Synonym expansion by loading CSV files directly into the querySynonymizer component is a deprecated practice in 4.3 and will not be available in the next major release. Use the procedure shown below to load the same CSV files into managed dictionaries instead. 

 

Synonym Dictionaries

AIE ships with an example synonym dictionary file for demonstration purposes only. It is configured in the querySynonymizer transformer, but is non-functional in its default state. It has one synonym:

<install-dir>\conf\dictionaries\synonyms_en.csv
cars,automobiles

There is a very large synonym file (synonyms_wordnet.zip) attached to this page that contains English language synonyms derived from WordNet. It is complete, but some of its synsets contain rare or unusual words that can degrade query results. Some of the synsets are extremely large (over 70 synsets have more than 1000 items), which can make some queries very slow. For instance, the word "bag" has many synonyms, including every kind of bag or pocket. If a query mentions "bag," AIE launches an ORed query that looks for every synonym separately and then combines the results.

A second synonym dictionary is also available (cleanedSynonyms_1m_10_en.csv). This dictionary is truncated to remove very rare words and to reduce synsets to 10 items (only the 10 most common words or phrases are kept). While less complete than the file mentioned above, this synonym dictionary minimizes the slow query issues.
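The effect of synset size on the expanded query can be sketched as follows. This Java example builds a generic boolean OR string from a term and its synonyms, keeping at most a given number of synonyms. The class, the method names, and the output notation are assumptions made for illustration; the output is not AIE query syntax, and the sketch simply keeps the first items of the list rather than ranking them by frequency.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; the OR notation is generic, not AIE syntax.
final class OrExpansionSketch {
    // Builds "term OR syn1 OR syn2 ..." keeping at most maxSynonyms synonyms.
    static String expand(String term, List<String> synonyms, int maxSynonyms) {
        List<String> clauses = new ArrayList<>();
        clauses.add(quoteIfPhrase(term));
        int limit = Math.max(0, Math.min(maxSynonyms, synonyms.size()));
        for (String synonym : synonyms.subList(0, limit)) {
            clauses.add(quoteIfPhrase(synonym));
        }
        return String.join(" OR ", clauses);
    }

    // Multi-word synonyms are quoted so that they read as phrases.
    private static String quoteIfPhrase(String text) {
        return text.contains(" ") ? "\"" + text + "\"" : text;
    }

    public static void main(String[] args) {
        List<String> synonyms = Arrays.asList("sack", "pocket", "duffel bag");
        System.out.println(expand("bag", synonyms, 2));
    }
}
```

Every retained synonym adds one clause to the query, so a synset of more than 1000 items produces a very large OR, while a cap of 10 keeps the query small.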

Each entry of the synonym file uses the following format:

maple,california box elder|box elder|silver maple|mountain maple|full moon maple|hedge maple|great maple|red maple|sugar maple|field maple

The entry above lists synonyms for "maple" delimited by a pipe character (|). Note that this dictionary is not bidirectional; there is no entry for "sugar maple" -> "maple". To achieve bidirectionality, use the Dictionary Manager's Expansion Mode setting on the dictionary or on any of its individual terms.
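A minimal Java sketch of reading this format is shown here, assuming that the term never contains a comma and that synonyms never contain a pipe character. The class and method names are hypothetical; AIE's Dictionary Manager performs the real import.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Dictionary Manager import.
final class SynonymDictionarySketch {
    // Parses lines of the form "term,syn1|syn2|..." into term -> synonyms.
    static Map<String, List<String>> load(List<String> lines) {
        Map<String, List<String>> dictionary = new LinkedHashMap<>();
        for (String line : lines) {
            int comma = line.indexOf(',');
            if (comma < 0) {
                continue; // no term/synonym separator on this line
            }
            String term = line.substring(0, comma).trim();
            List<String> synonyms = new ArrayList<>();
            for (String synonym : line.substring(comma + 1).split("\\|")) {
                if (!synonym.trim().isEmpty()) {
                    synonyms.add(synonym.trim());
                }
            }
            dictionary.put(term, synonyms);
        }
        return dictionary;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dictionary =
            load(Arrays.asList("maple,box elder|sugar maple|red maple"));
        System.out.println(dictionary.get("maple"));
        // The mapping is one-directional: no entry exists for "sugar maple".
        System.out.println(dictionary.containsKey("sugar maple"));
    }
}
```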

Once you have downloaded and unpacked the synonym file, you must import it into the AIE Administrator's Dictionary Manager. Use this procedure:

  1. Open the AIE Administrator to the Dictionary Management page. (The default user/password is aieadmin/attivio.)
  2. Create a new synonym dictionary. Give it a unique name.
  3. Import the synonym file into the dictionary.
    1. Download the file to a location on the computer where you run the AIE Administrator.
    2. Import the terms into the new dictionary on the Dictionary Management page.
  4. Approve and Publish the dictionary.

The new dictionary will be automatically saved to the Store and will be available to all AIE nodes for querying as needed.

Configure querySynonymizer

The next step is to edit the querySynonymizer component to refer to the new dictionary. The component definition is in <project-dir>\conf\components\querySynonymizer.xml. You can edit querySynonymizer from the AIE Administrator (and then update and re-deploy the project). Enter the name of the dictionary in the Default Dictionary Name field (defaultDictionaryName property), and the dictionary locale in the Default Dictionary Locale field (the defaultDictionaryLocale property).

<project-dir>\conf\components\querySynonymizer.xml
<component xmlns="http://www.attivio.com/configuration/type/componentType" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="querySynonymizer" class="com.attivio.platform.transformer.query.ExpandSynonyms" xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType http://www.attivio.com/configuration/type/componentType.xsd ">
 <properties>
  <map name="dictionaries">
   <property name="en" value="dictionaries/synonyms_en.csv"/>
  </map>
  <property name="defaultDictionaryLocale" value="en"/>
  <property name="defaultDictionaryName" value="Synonyms"/>
  <list name="fields">
   <entry value="title"/>
   <entry value="content"/>
  </list>
 </properties>
</component>

The default fields for synonym substitution are "title" and "content". If you want to extend synonyms to other fields, you must add the field names here.

You may ignore the "dictionaries" map. AIE ignores this property once the defaultDictionaryName field is populated.

Testing Synonym Expansion

The easy way to see if your synonyms are working correctly is to launch a query from the Debug Search page. This page has a checkbox that enables the synonym feature. Check this box and use the "Legacy XML" view of the results. If synonyms are operating, you will see a BooleanOrQuery element containing synonyms listed as phrase elements.

Note that the Debug Search page sends a REST URL to the AIE server. This URL encodes all of the features that are set by the Debug Search page. The synonyms feature is enabled by the following parameters:

Partial REST URL from Debug Search page
&l.synonyms.mode=ON&l.synonymBoost=25

If your search client application uses the REST API to communicate with AIE, you can add these parameters to the URL to enable synonyms.
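For a client that assembles its own request URL, appending the parameters might look like the Java sketch below. The class, the method, and the base URL are hypothetical placeholders; only the two parameter names come from the Debug Search URL shown above.

```java
// Hypothetical illustration only; the base URL below is a placeholder.
final class SynonymParamsSketch {
    // Appends the synonym parameters to an existing query URL.
    static String withSynonyms(String queryUrl, int boost) {
        String separator = queryUrl.contains("?") ? "&" : "?";
        return queryUrl + separator + "l.synonyms.mode=ON&l.synonymBoost=" + boost;
    }

    public static void main(String[] args) {
        // The host and path are invented for this example.
        System.out.println(withSynonyms("http://aie.example.com/search?q=cars", 25));
    }
}
```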

Synonyms from the Java API

You can enable synonym expansion from the Java Server API or the Java Client API by Creating a Custom Query Transformer or by creating a Java-based Search Application. In either case, the key is to set QueryRequest.setSynonymsMode(SynonymsMode.ON). This enables synonym expansion on all queries, as shown in the following example:

SampleQueryTransformer.java
public List<QueryFeedback> processQuery(QueryRequest query) throws AttivioException {
    ...
    // the three calls below enable synonym expansion;
    // dictionary and boost are defined elsewhere in the transformer (not shown)
    query.setSynonymsMode(SynonymsMode.ON);
    query.setSynonymDictionaryName(dictionary);
    query.setSynonymBoost(boost);
    ...
}

This example is from the Creating Custom Query Transformers page. Insert the new query transformer into the queryAttivioLinguistics workflow ahead of the querySynonymizer component. If you take this route, there is no need to modify the querySynonymizer at all, unless you need to expand the list of fields where synonyms should be applied.

This method turns on synonym expansion for all queries from any source.

Acronym Expansion

Acronym expansion is the process of augmenting queries using an acronym dictionary to map common acronyms to their expanded form.

The procedures below support two paths of acronym usage:

  • One is a short path to testing/demonstrating acronym behavior in the Debug Search page. Use this for initial testing of a new acronym dictionary.
  • The second path enables acronym behavior from any independent search client (such as SAIL). It involves creating a new query transformer in Java.

Direct use of CSV files is deprecated

Acronym expansion by loading CSV files directly into the queryAcronymExpander component is a deprecated practice in 4.3 and will not be available in the next major release. Use the procedure shown below to load the same CSV files into managed dictionaries instead. 

Acronym Dictionaries

AIE ships with an example acronym dictionary file for demonstration purposes only. It is configured in the queryAcronymExpander transformer, but is non-functional in its default state. It has one acronym:

<install-dir>\conf\dictionaries\acronyms_en.csv
irs,internal revenue service

There is a file of 126,000 acronyms attached to this page (acronyms_wikipedia_en.zip). It contains English language acronyms derived from Wikipedia. Each acronym expands to a single phrase.

The acronym file uses the following format:

GUPS,"General Union of Palestinian Students"
GURC,"Georgetown University Rugby Football Club"
GURFC,"Georgetown University Rugby Football Club","Glasgow University Rugby Football Club"
GUSD,"Goleta Union School District"
GUUG,"German Unix User Group"

Note that most entries expand to a single phrase, but it is possible for an acronym to expand to multiple phrases, as shown by GURFC.
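The following Java sketch shows one way to read such a line, collecting every quoted phrase after the acronym. It assumes that phrases contain no escaped quotation marks, and it treats an unquoted remainder (as in the shipped acronyms_en.csv example) as a single phrase. The class and method names are hypothetical; this is not the Dictionary Manager's parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Dictionary Manager import.
final class AcronymEntrySketch {
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns the acronym, which is the text before the first comma.
    static String acronym(String line) {
        int comma = line.indexOf(',');
        return (comma < 0 ? line : line.substring(0, comma)).trim();
    }

    // Returns every expansion listed after the acronym.
    static List<String> expansions(String line) {
        List<String> phrases = new ArrayList<>();
        int comma = line.indexOf(',');
        if (comma < 0) {
            return phrases; // acronym with no expansion
        }
        String rest = line.substring(comma + 1);
        Matcher matcher = QUOTED.matcher(rest);
        while (matcher.find()) {
            phrases.add(matcher.group(1));
        }
        // Unquoted remainder: treat it as one phrase.
        if (phrases.isEmpty() && !rest.trim().isEmpty()) {
            phrases.add(rest.trim());
        }
        return phrases;
    }

    public static void main(String[] args) {
        String line = "GURFC,\"Georgetown University Rugby Football Club\","
            + "\"Glasgow University Rugby Football Club\"";
        System.out.println(acronym(line) + " has " + expansions(line).size() + " expansions");
    }
}
```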

Once you have downloaded and unpacked the acronym file, you must import it into the AIE Administrator's Dictionary Manager. Use this procedure:

  1. Open the AIE Administrator to the Dictionary Management page. (The default user/password is aieadmin/attivio.)
  2. Create a new acronym dictionary. Give it a unique name.
  3. Import the acronym file into the dictionary.
    1. Download the file to a location on the computer where you run the AIE Administrator.
    2. Import the terms into the new dictionary on the Dictionary Management page.
  4. Approve and Publish the dictionary.

The new dictionary will be automatically saved to the Store and will be available to all AIE nodes for querying as needed.

Configure queryAcronymExpander

The next step is to edit the queryAcronymExpander component to refer to the new dictionary. The component definition is in <project-dir>\conf\components\queryAcronymExpander.xml. You can edit queryAcronymExpander from the AIE Administrator (and then update and re-deploy the project). Enter the name of the dictionary in the Default Dictionary Name field (defaultDictionaryName property), and the dictionary locale in the Default Dictionary Locale field (the defaultDictionaryLocale property).

<project-dir>\conf\components\queryAcronymExpander.xml
<?xml version="1.0" encoding="UTF-8"?>

<component xmlns="http://www.attivio.com/configuration/type/componentType" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="queryAcronymExpander" class="com.attivio.platform.transformer.query.ExpandAcronyms" xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType http://www.attivio.com/configuration/type/componentType.xsd ">
 <properties>
  <map name="dictionaries">
   <property name="en" value="dictionaries/acronyms_en.csv"/>
  </map>
  <list name="fields">
   <entry value="title"/>
   <entry value="content"/>
  </list>
  <property name="defaultDictionaryLocale" value="en"/>
  <property name="defaultDictionaryName" value="Acronyms"/>
 </properties>
</component>

The default fields for acronym substitution are "title" and "content". If you want to extend acronyms to other fields, you must add the field names here.

You may ignore the "dictionaries" map. AIE ignores this property once the defaultDictionaryName field is populated.

 

Testing Acronym Expansion

The easy way to see if your acronyms are working correctly is to launch a query from the Debug Search page. This page has a checkbox that enables the acronym feature. Check this box and use the "Legacy XML" view of the results. If acronyms are operating, you will see a BooleanOrQuery element containing acronyms listed as phrase elements.

Note that the Debug Search page sends a REST URL to the AIE server. This URL encodes all of the features that are set by the Debug Search page. The acronyms feature is enabled by the following parameters:

Partial REST URL from Debug Search page
&l.acronyms.mode=ON&l.acronymBoost=25

If your search client application uses the REST API to communicate with AIE, you can add these parameters to the URL to enable acronyms.

Acronyms from the Java API

You can enable acronym expansion from the Java Server API or the Java Client API by Creating a Custom Query Transformer or by creating a Java-based Search Application. In either case, the key is to set QueryRequest.setAcronymsMode(AcronymsMode.ON). This enables acronym expansion on all queries, as shown in the following example:

SampleQueryTransformer.java
public List<QueryFeedback> processQuery(QueryRequest query) throws AttivioException {
    ...
    // the three calls below enable acronym expansion;
    // dictionary and boost are defined elsewhere in the transformer (not shown)
    query.setAcronymsMode(AcronymsMode.ON);
    query.setAcronymDictionaryName(dictionary);
    query.setAcronymBoost(boost);
    ...
}

This example is from the Creating Custom Query Transformers page. Insert the new query transformer into the queryAttivioLinguistics workflow ahead of the queryAcronymExpander component. If you take this route, there is no need to modify the queryAcronymExpander at all, unless you need to expand the list of fields where acronyms should be applied.

This method turns on acronym expansion for all queries from any source.

 

Entity Resources

For Dictionary Entity Extraction, AIE provides three entity dictionaries to use for extracting entities from English text:

  • <install-dir>\conf\entityextraction\dictionaries\entities_en_company.csv. Approximately 70,000 corporations.
  • <install-dir>\conf\entityextraction\dictionaries\entities_en_location.csv. Approximately 127,000 locations.
  • <install-dir>\conf\entityextraction\dictionaries\entities_en_people.csv. Approximately 853,000 persons.

Using ALM with English

AIE does not need the Advanced Linguistics Module (ALM) for most linguistic processing of English. However, the ALM's Statistical Entity Extraction and its Lemmatization for English are quite powerful. To switch over to ALM as the default tokenizer for English, see English (using ALM).

 

 
