
Overview

Stopwords are typically common words (such as "the" or "a" in English), or words and phrases that are not desirable in a particular query context, such as "I need information about." Several AIE components use a stopword list. You can remove or annotate stopwords at index time using the fieldStopwords ingest transformer, which is an instance of ApplyStopwords. You can remove or downweight stopwords at query time using the queryStopwords query transformer, which is an instance of RemoveStopwords.



Stopwords in AIE

AIE supports several processing modes for stopwords. They are not all equally useful:

  • No special processing.  Stopwords are treated like any other word. This is the default configuration of AIE.
  • Stopword removal at index time.  There are significant drawbacks to this approach.
    • Changing the stopword list requires you to reload everything that is already in the index.
    • This approach produces a smaller index, but suffers the same poor precision as legacy systems.  For example, such a system cannot distinguish between "turkey or stuffing", "turkey and stuffing", and "turkey without stuffing" (assuming that "or", "and", and "without" are removed stopwords). Phrase and "near" queries are typically affected most significantly.
    • If you don't remove the same stopwords during indexing and during querying, many queries will fail to match anything. (See warning below.)
  • Stopword removal at query time.  Removing a small number of words at query time can improve the search experience, especially when wordy queries fail to match relevant documents because AIE interprets Simple Query Language queries as a logical AND of their terms.
  • Stopword downweighting at query time.   Most stopwords are very common words that do not influence document scores significantly.  This approach further reduces how much stopwords contribute to the overall score of each document.  
  • Custom query rewriting.  Some AIE users have found significant benefit from removing common words or phrases from the beginning of queries (this is sometimes called anti-phrasing).  It is easy to add custom query transformers to the AIE query workflow.
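
As a rough illustration of two of the modes above, the following sketch uses plain Python rather than any AIE API. It shows how removing "or", "and", and "without" collapses the three turkey queries into the same token sequence, and how a leading anti-phrase might be stripped from a query. The function names and the anti-phrase list are hypothetical, chosen only for this example.

```python
# Illustrative only: plain Python, not AIE code.
STOPWORDS = {"or", "and", "without"}  # the stopwords assumed in the example above

def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop stopword tokens."""
    return [tok for tok in text.lower().split() if tok not in stopwords]

# Hypothetical anti-phrases; a real list would be tuned to actual queries.
ANTI_PHRASES = ("i need information about", "tell me about")

def strip_anti_phrase(query, phrases=ANTI_PHRASES):
    """Remove one leading anti-phrase if present; otherwise return the query unchanged."""
    lowered = query.lower()
    for phrase in phrases:
        if lowered.startswith(phrase + " "):
            return query[len(phrase):].lstrip()
    return query

queries = ["turkey or stuffing", "turkey and stuffing", "turkey without stuffing"]
collapsed = {tuple(remove_stopwords(q)) for q in queries}
print(collapsed)  # {('turkey', 'stuffing')}
print(strip_anti_phrase("I need information about stopword lists"))  # stopword lists
```

All three queries reduce to the same tokens, which is why a system that removes these words at index time cannot tell them apart.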

Best Practice

The best practice is to remove or downweight a small number of stopwords at query time.

Ingest stopwords require query stopwords

Note that the set of stopwords removed at query time must include all stopwords removed at index time. If a stopword is removed at index time  but not at query time then a query containing that term will never find any matching documents (assuming that the query is a logical AND query, which is the default).
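
The failure described in this warning can be reproduced with a toy inverted index. This is a minimal sketch in plain Python, not AIE's indexing or query code; the documents, the stopword set, and the function names are invented for the example.

```python
# Illustrative only: a toy inverted index, not AIE's implementation.
STOPWORDS = {"the", "of"}

def tokenize(text):
    return text.lower().split()

def build_index(docs, stopwords):
    """Map each non-stopword token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            if tok not in stopwords:
                index.setdefault(tok, set()).add(doc_id)
    return index

def and_search(index, query, stopwords):
    """Return ids of documents that contain every remaining query term."""
    terms = [t for t in tokenize(query) if t not in stopwords]
    if not terms:
        return set()
    result = None
    for term in terms:
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return result

docs = {1: "the history of turkey", 2: "turkey recipes"}
index = build_index(docs, STOPWORDS)  # stopwords removed at index time

# Query-time removal disabled: "the" and "of" have no postings, so the AND fails.
mismatched = and_search(index, "the history of turkey", stopwords=set())
# Query-time removal uses the same list: the query matches document 1.
matched = and_search(index, "the history of turkey", stopwords=STOPWORDS)
print(mismatched)  # set()
print(matched)     # {1}
```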

Create a Stopwords Dictionary

Many of the languages supported by AIE come with stopword files that you can download and import into your project. For instance, the English (en) page has three stopword files as attachments. The full set of 137 stopword files is located in <install_dir>/conf/dictionaries/

To import one of these files as a stopword dictionary, follow this general procedure:

  1. Open the AIE Administrator to the Dictionary Management page. (The default user/password is aieadmin/attivio.)
  2. Create a new stopword dictionary. Give it a unique name.
  3. Import the stopwords file into the dictionary.
    1. Download the file to a location on the computer where you run the AIE Administrator.
    2. Change the filename to use a .CSV extension.
    3. Edit the file to remove duplicate entries.
    4. Import the terms into the new dictionary on the Dictionary Management page.
  4. Approve and Publish the dictionary.

The new dictionary is automatically saved to the Store and is available to all AIE nodes for both ingestion and querying as needed.
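
The file preparation in step 3 (renaming to a .CSV extension and removing duplicate entries) can be scripted. The sketch below assumes the stopword file is UTF-8 text with one term per line, which may not hold for every file, and it treats only exact matches as duplicates. The function name and the sample file name are hypothetical.

```python
import tempfile
from pathlib import Path

def prepare_stopword_csv(source, dest_dir):
    """Write a copy of `source` into `dest_dir` with a .csv extension,
    keeping the first occurrence of each term and dropping blank lines.
    Assumes one term per line, UTF-8 encoded."""
    source = Path(source)
    dest = Path(dest_dir) / (source.stem + ".csv")
    seen = set()
    unique = []
    for line in source.read_text(encoding="utf-8").splitlines():
        term = line.strip()
        if term and term not in seen:
            seen.add(term)
            unique.append(term)
    dest.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return dest

# Small demonstration with a throwaway file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "stopwords_en.txt"
    src.write_text("the\na\nthe\n\nan\na\n", encoding="utf-8")
    out = prepare_stopword_csv(src, tmp)
    demo_name = out.name
    demo_terms = out.read_text(encoding="utf-8").split()
    print(demo_name, demo_terms)  # stopwords_en.csv ['the', 'a', 'an']
```

The resulting file is what you would then import on the Dictionary Management page.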

Stopwords for Ingestion and Querying

Ingestion-side stopword removal has drawbacks.

Each time you change the ingest stopword list, you will have to re-ingest all the documents that are already in the index.

If you remove stopwords during ingestion, you must remove the same stopwords during querying.

To apply stopwords during ingestion and querying:

  1. Create a stopwords dictionary in the Dictionary Manager. (See the procedure above.)
  2. In the AIE Administrator, navigate to the System Management > Palette > fieldStopwords component. Configure the fieldStopwords component to use the new dictionary.
    1. Put the name of the new dictionary in the Default Dictionary Name field.
    2. Set the input fields where stopwords should be applied.
  3. In the AIE Administrator, navigate to the System Management > Palette > queryStopwords component. Configure the queryStopwords component to use the same stopword dictionary.
    1. Put the name of the new dictionary in the Default Dictionary Name field.
    2. Set the Default Locale of the new dictionary.
    3. Set the input fields where stopwords should be applied.
  4. Edit the AIE schema file, <project-dir>\conf\schema\default.xml.
    1. Add <property name="stopwords.mode" value="index"/> to the fields where you want ingest or query stopwords to be applied.
  5. In the AIE CLI:
    1. First update the project (to copy the modified components into the collection of source files).
    2. Then deploy the project (to copy the modified schema file to the configuration servers). Use the force option in the AIE CLI.
  6. When testing, remember that the stopwords will still appear in the search results. They are removed from the index, not from the display fields.

Stopwords for Queries Only

Best Practice

This is the recommended way to use stopwords in AIE.

 

To apply stopwords during querying only:

  1. In the AIE Administrator, navigate to System Management > Palette > queryStopwords component.
  2. Configure queryStopwords to use your stopword dictionary. (To create one, see Create a Stopwords Dictionary above.)
    1. Put the name of the new dictionary in the Default Dictionary Name field.
    2. Set the Default Locale of the new dictionary.
    3. Set the input fields where stopwords should be applied.
  3. In a text editor, edit the AIE schema file: <project-dir>\conf\schema\default.xml.
    1. Add <property name="stopwords.mode" value="query"/> to the fields where you want query stopwords to be applied.
  4. In the AIE CLI:
    1. First update the project (to copy the modified queryStopwords component into the collection of source files).
    2. Then deploy the project (to copy the modified schema file to the configuration servers). Use the force option in the AIE CLI.
  5. To see the query stopwords feature in action, pose an Advanced Query Language query with stopwords turned on, like this one:
    OR(title:London,title:Paris,stopwords=on)

 

Blacklists

Apart from index-time and query-time stopwords, many AIE components support their own dictionary of words and/or phrases that the component ignores in some way. For example, the StatisticalKeyPhraseExtractor uses a list of stopwords (but not phrases) to suppress common words from being extracted as parts of key phrases.  These stopword lists, or blacklists, are independent of the lists of stopwords used at index and query time.

 
