
Overview

"More Like This" (or "Query by Example") is a capability that allows users to find documents that are similar to a document of their choice. This process could involve creating a query that reflects the content of the document, and then retrieving the documents that match most closely. However, in order to do this efficiently, AIE supports the pre-computation of the "more like this" query for a document during ingestion based on its contents.

This page explains how to enable and set up more-like-this query extraction, and how to generate queries from the more-like-this query field. 

Required Modules

These features require that the morelikethis module be included when you run createproject to create the project directories. In addition, language-specific files must be downloaded in order to process languages other than English.


Important Concepts

Term

Discussion

Language Model

A language model is a record of how probable words and sequences of words are, measured over a large amount of text.  (Language models are large enough that they are packaged as separate downloads.)

Informativeness

A measure based on how probable it is that you would encounter this word or phrase in ordinary text: the less probable the word or phrase, the more informative it is.

Keyword or Keyphrase

Words or phrases that appear unusually frequently in this document, compared to their overall frequency in the language.

To extract a more-like-this query in AIE, a language model and a stop word list are required. AIE ships with an English language model and an English stop word list. Language models and stop word lists for the following languages may be downloaded here: Dutch (nl), French (fr), German (de), Italian (it), Spanish (es), Portuguese (pt). For other languages, please contact sales@attivio.com.
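To make the Keyword or Keyphrase definition above more concrete, here is a minimal Python sketch of one common way to compare a term's frequency in a document with its frequency in the language. This is not AIE's implementation: the function name keyword_scores, the log-ratio formula, the unseen_prob fallback, and the sample probabilities are all assumptions made for illustration.

```python
import math
from collections import Counter


def keyword_scores(tokens, language_probs, unseen_prob=1e-6):
    """Score each term by how much more frequent it is in the document
    than in the language overall (log2 of the frequency ratio).

    language_probs stands in for a language model: a hypothetical mapping
    from term to its probability in ordinary text.
    """
    counts = Counter(tokens)
    total = len(tokens)
    scores = {}
    for term, count in counts.items():
        doc_prob = count / total
        # Terms missing from the model get a small assumed probability.
        lang_prob = language_probs.get(term, unseen_prob)
        scores[term] = math.log2(doc_prob / lang_prob)
    return scores


tokens = "the turbine drives the rotor and the turbine".split()
language_probs = {  # illustrative values only
    "the": 0.05,
    "and": 0.03,
    "drives": 0.0001,
    "turbine": 0.00001,
    "rotor": 0.00001,
}
scores = keyword_scores(tokens, language_probs)
best = max(scores, key=scores.get)
print(best)
```

In this toy document "the" occurs most often, but it is also very common in the language, so a rare word such as "turbine" receives the higher score.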

More-Like-This Query Extraction Transformer

The ExtractSimilarQuery transformer extracts "more-like-this" queries from documents as described above and stores them in a field for later use. The most common use is to provide a button next to each document that matches a query. Users can then review all documents that match a query and "drill down" to the documents that are similar to one of the result documents.

To enable more-like-this query extraction, include the morelikethis module as one of the modules supplied to createproject. This loads the module's configuration files: <install_dir>\conf\morelikethis\module.xml defines the extractMoreLikeThisQuery component, and <install_dir>\conf\morelikethis\features.xml adds it to the ingestPostProcess workflow.

<install_dir>\conf\morelikethis\features.xml
<beans>
  <f:insertComponent workflow="ingestPostProcess" position="first" component="extractMoreLikeThisQuery" />
</beans>

Configuration Parameters

This is the default configuration of the extractMoreLikeThisQuery component:

<install_dir>\conf\morelikethis\module.xml
<component name="extractMoreLikeThisQuery"
           class="com.attivio.platform.transformer.ingest.linguistics.ExtractSimilarQuery">
  <properties>
    <property name="languageModelServiceName" value="languageModelService" />
    <map name="stopWordDictionaries">
      <property name="en" value="morelikethis/dictionaries/big_stopwords_en.csv" />
    </map>

    <property name="useAllNaturalLanguageFields" value="true" />

    <property name="minUnigramCount" value="1" />
    <property name="minBigramCount" value="1" />
    <property name="minTrigramCount" value="2" />
    <property name="minTermsForQuery" value="3" />
    <property name="minTokenLength" value="3" />

    <property name="maxUnigramTerms" value="20" />
    <property name="maxBigramTerms" value="20" />
    <property name="maxTrigramTerms" value="0" />

    <property name="output" value="morelikethisquery" />
    <property name="validTokenRegex" value="[^0-9]*" />
 </properties>
</component>

There are a number of configurable properties of the more-like-this query extractor:

property name

values

description

default

languageModelServiceName

string

The name of the language model service.

languageModelService

stopWordDictionaries

Language-to-filename property map

A more-like-this query cannot contain a word in the stopword dictionary for the appropriate locale.

 

defaultLocale

locale string

The default locale if a field does not have a locale.

en-us

useAllNaturalLanguageFields

boolean

Whether the more-like-this query extractor should process all of the natural language fields, as defined by the schema.

true

input

list of field names

The list of field names to be processed when useAllNaturalLanguageFields is false.

 

minUnigramCount

integer

Limits the unigrams to those terms which occur at least this many times

2

minBigramCount

integer

Limits the bigrams to those terms which occur at least this many times

2

minTrigramCount

integer

Limits the trigrams to those terms which occur at least this many times

2

minTermsForQuery

integer

Omits queries that contain fewer than this many terms

3

minTokenLength

integer

Tokens shorter than minTokenLength are not included in the morelikethis query.

 

maxUnigramTerms

integer

Limits the number of unigrams used to build the more-like-this query

20

maxBigramTerms

integer

Limits the number of bigrams used to build the more-like-this query

20

maxTrigramTerms

integer

Limits the number of trigrams used to build the more-like-this query

20

output

field name

The field in which the extracted more-like-this queries are to be stored. The default field, "morelikethisquery", is defined in the file <install_dir>\conf\morelikethis\schema.xml.

morelikethisquery

validTokenRegex

string

A regular expression that all tokens must match. Tokens that do not match this regular expression are not included in the morelikethis query.

"[^0-9]*" (which rejects tokens that contain digits)

Notes on these parameters:

  • If maxUnigramTerms, maxBigramTerms, or maxTrigramTerms limits the number of terms extracted for a document, those with the highest surprise value are chosen.
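As a rough illustration of how the stop word list, minTokenLength, validTokenRegex, and the max*Terms limits interact, the following Python sketch filters a set of candidate terms and keeps the highest-scoring ones. It is not the ExtractSimilarQuery implementation. The function select_terms and the sample surprise scores are hypothetical, and the sketch assumes the regular expression must match the entire token.

```python
import re


def select_terms(scored_terms, stop_words, min_token_length=3,
                 valid_token_regex=r"[^0-9]*", max_terms=20):
    """Filter candidate terms, then keep the max_terms highest-scoring ones.

    scored_terms maps each candidate term to a hypothetical surprise score.
    """
    pattern = re.compile(valid_token_regex)
    kept = {}
    for term, score in scored_terms.items():
        if term in stop_words:
            continue  # stop words never appear in the query
        if len(term) < min_token_length:
            continue  # too short
        if pattern.fullmatch(term) is None:
            continue  # assumption: the regex must match the whole token
        kept[term] = score
    # Highest surprise first, truncated to the configured limit.
    ranked = sorted(kept, key=lambda t: kept[t], reverse=True)
    return ranked[:max_terms]


candidates = {
    "the": 0.1,      # stop word
    "ai": 5.0,       # shorter than min_token_length
    "route66": 7.5,  # contains digits
    "turbine": 6.2,
    "rotor": 4.8,
    "blade": 3.9,
}
selected = select_terms(candidates, stop_words={"the"}, max_terms=2)
print(selected)
```

Under these assumptions a limit of 0, as in the default maxTrigramTerms setting shown in module.xml, selects no terms at all.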

The morelikethis module depends on the languageModelService component, which is defined in the file <install_dir>\conf\languagemodel\module.xml. By default, only the English language model is loaded. For more information on using language models, see the documentation for Language Models.

The more-like-this query looks something like this: OR( unigram1, unigram2, ..., bigram1, bigram2, ..., trigram1, trigram2, ..., minimum=3 ). The minimum number of terms to match is always 1/3 of the total number of unigrams, bigrams, and trigrams. The more-like-this query is not saved if fewer than minTermsForQuery terms are found when processing the document.
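The query shape described above can be sketched in Python as follows. This is an illustration only, under stated assumptions: the function build_mlt_query is hypothetical, one third of the term count is rounded down (with a floor of 1) because the rounding rule is not documented here, and multi-word terms are wrapped in double quotes purely for readability.

```python
def build_mlt_query(unigrams, bigrams, trigrams, min_terms_for_query=3):
    """Assemble an OR(...) query string from the selected terms.

    Returns None when fewer than min_terms_for_query terms are available,
    mirroring the rule that such queries are not saved.
    """
    terms = list(unigrams) + list(bigrams) + list(trigrams)
    if len(terms) < min_terms_for_query:
        return None
    # Assumption: one third of the terms, rounded down, but at least 1.
    minimum = max(1, len(terms) // 3)
    # Assumption: quote multi-word terms so they read as phrases.
    rendered = ['"%s"' % t if " " in t else t for t in terms]
    return "OR( " + ", ".join(rendered) + ", minimum=%d )" % minimum


query = build_mlt_query(
    unigrams=["turbine", "rotor", "blade", "wind"],
    bigrams=["wind turbine", "rotor blade"],
    trigrams=[],
)
print(query)
```

With six terms in total, the sketch requires two of them to match. A document that yields only two terms produces no query at all.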

Using More-Like-This Queries

  1. Be sure that the morelikethis and languagemodel modules are loaded by including them in the list of modules supplied to createproject.
  2. For languages besides English:
    • For each language, download and add the stopwords dictionary to the stopWordDictionaries map.
    • For each language, download the appropriate languagemodel jar and put it in the <install_dir>/lib directory.
    • For each language, add a map section to the models map for the languageModelService component. Copying the "en" map and replacing "en" with the appropriate language code should suffice.
  3. Ingest (or re-ingest) content in a workflow that includes the extractMoreLikeThisQuery component.

Notes

The query computed by the morelikethis module for a document does not take into account query workflows that may change the query's meaning. For example, if the morelikethis query for a document is submitted to a query workflow that includes automatic spelling correction, the query may be mutated into one that does not return that document.
