"More Like This" (or "Query by Example") is a capability that allows users to find documents that are similar to a document of their choice. This process could involve creating a query that reflects the content of the document, and then retrieving the documents that match most closely. However, in order to do this efficiently, AIE supports the pre-computation of the "more like this" query for a document during ingestion based on its contents.
This page explains how to enable and set up more-like-this query extraction, and how to generate queries from the more-like-this query field.
These features require that the morelikethis module be included when you run createproject to create the project directories. To process languages other than English, additional files must also be downloaded.
A language model is a record of how probable words and sequences of words are, measured over a large amount of text. (Language models are large enough that they are packaged as separate downloads.)
How probable it is that you would encounter this word or phrase in ordinary text.
Keyword or Keyphrase
Words or phrases that appear unusually frequently in this document, compared to their overall frequency in the language.
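The idea behind these definitions can be illustrated with a minimal sketch. It is not AIE's implementation: the tiny `LANGUAGE_MODEL` table, the `UNSEEN_PROBABILITY` floor, and the log-ratio score are all assumptions chosen only to show how a word's frequency in a document can be compared with its probability in ordinary text.

```python
import math
from collections import Counter

# Hypothetical language model: word -> probability of seeing it in ordinary text.
# A real language model is far larger; these numbers are made up for illustration.
LANGUAGE_MODEL = {"the": 0.05, "of": 0.03, "search": 0.0005, "engine": 0.0003}

# Assumed floor probability for words that are missing from the model.
UNSEEN_PROBABILITY = 1e-6


def keyword_scores(tokens):
    """Score each word by how much more frequent it is in the document
    than the language model predicts (log ratio; higher = more unusual)."""
    counts = Counter(tokens)
    total = len(tokens)
    scores = {}
    for word, count in counts.items():
        document_probability = count / total
        language_probability = LANGUAGE_MODEL.get(word, UNSEEN_PROBABILITY)
        scores[word] = math.log(document_probability / language_probability)
    return scores


tokens = "the search engine ranks the search results of the engine".split()
scores = keyword_scores(tokens)
# Words ordered from most to least unusual for this document.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this sketch, common words such as "the" receive low scores even though they occur often, while words that are rare in the language but repeated in the document rise to the top of the ranking.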
To extract a more-like-this query in AIE, a language model and a stop word list are required. AIE ships with an English language model and an English stop word list. Language models and stop word lists for some languages may be downloaded here: Dutch (nl), French (fr), German (de), Italian (it), Spanish (es), Portuguese (pt). For other languages, please contact firstname.lastname@example.org.
More-Like-This Query Extraction Transformer
The ExtractSimilarQuery transformer extracts "more-like-this" queries from documents as described above and stores them in a field for later use. The most common use is to provide a button associated with each document that matches a query. This allows users to review all documents that match a query, and to "drill down" to the documents that are similar to one of the result documents.
To enable more-like-this query extraction, include the morelikethis module as one of the modules supplied to createproject. This will load the file <install_dir>\conf\morelikethis\module.xml, which defines the extractMoreLikeThisQuery component and adds it to the ingestPostProcess workflow.
This is the default configuration of the extractMoreLikeThisQuery component:
There are a number of configurable properties of the more-like-this query extractor:
The name of the language model service.
Language-to-filename property map
A more-like-this query cannot contain a word in the stopword dictionary for the appropriate locale.
The default locale if a field does not have a locale.
Whether the more-like-this query extractor should process all of the natural language fields, as defined by the schema.
list of field names
The list of field names to be processed when useAllNaturalLanguageFields is false.
Limits the unigrams to those terms which occur at least this many times.
Limits the bigrams to those terms which occur at least this many times.
Limits the trigrams to those terms which occur at least this many times.
Queries with fewer than this many terms are not saved.
|minTokenLength||integer||Tokens shorter than minTokenLength are not included in the morelikethis query.|
Limits the number of unigrams used to build the more-like-this query.
Limits the number of bigrams used to build the more-like-this query.
Limits the number of trigrams used to build the more-like-this query.
The field in which the extracted more-like-this queries are to be stored. The default field, "morelikethisquery", is defined in the file <install_dir>\conf\morelikethis\schema.xml.
|validTokenRegex||string||A regular expression that all tokens must match. Tokens that do not match this regular expression are not included in the morelikethis query.||"[^0-9]*" (rejects tokens that contain digits)|
Notes on these parameters:
maxTrigramTerms limits the number of terms extracted for a document; those with the highest surprise value are chosen.
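The filtering and limiting parameters can be pictured with the following sketch. It is an illustration under stated assumptions rather than the transformer's actual code: the function names are hypothetical, the default `min_token_length` of 3 is invented for the example, and the regular expression is assumed to be applied to the whole token. Only the `"[^0-9]*"` pattern comes from the table above.

```python
import re


def filter_tokens(tokens, stop_words, min_token_length=3,
                  valid_token_regex=r"[^0-9]*"):
    """Drop tokens that are too short, fail the regex, or are stop words.

    min_token_length=3 is an assumed value for illustration only.
    """
    pattern = re.compile(valid_token_regex)
    kept = []
    for token in tokens:
        if len(token) < min_token_length:
            continue  # shorter than minTokenLength
        if not pattern.fullmatch(token):
            continue  # whole token must match validTokenRegex
        if token.lower() in stop_words:
            continue  # stop words never appear in the query
        kept.append(token)
    return kept


def top_terms(surprise_by_term, max_terms):
    """Keep the max_terms terms with the highest surprise value."""
    ranked = sorted(surprise_by_term, key=surprise_by_term.get, reverse=True)
    return ranked[:max_terms]


filtered = filter_tokens(
    ["the", "attivio", "v2", "ai", "http2x", "search"],
    stop_words={"the"},
)
best = top_terms({"index": 1.0, "attivio": 3.0, "search": 2.0}, max_terms=2)
```

Here "v2" and "ai" are rejected for length, "http2x" for containing a digit, and "the" as a stop word, leaving "attivio" and "search".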
The morelikethis module depends on the languageModelService component, which is defined in the file <install_dir>\conf\languagemodel\module.xml. By default, only the English language model is loaded. For more information on using language models, see the documentation for Language Models.
The more-like-this query looks something like this: OR( unigram1, unigram2, ..., bigram1, bigram2, ..., trigram1, trigram2, ..., minimum=3 ). The minimum number of terms to match is always 1/3 of the total number of unigrams, bigrams, and trigrams. The more-like-this query is not saved if fewer than minTermsForQuery terms are found when processing the document.
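As a rough sketch of how such a query string could be assembled, consider the function below. The function name, the quoting of multi-word terms, the rounding rule (integer division with a floor of 1), and the default `min_terms_for_query` of 3 are assumptions made for this example; the source only states that the minimum is 1/3 of the total number of terms and that short queries are not saved.

```python
def build_more_like_this_query(unigrams, bigrams, trigrams,
                               min_terms_for_query=3):
    """Build an OR(...) query string, or return None if too few terms.

    min_terms_for_query=3 is an assumed value for illustration only.
    """
    terms = list(unigrams) + list(bigrams) + list(trigrams)
    if len(terms) < min_terms_for_query:
        return None  # the query is not saved for this document

    # Assumption: 1/3 of the terms, rounded down, but never less than 1.
    minimum = max(1, len(terms) // 3)

    # Assumption: multi-word terms are quoted so they read as phrases.
    rendered = [f'"{term}"' if " " in term else term for term in terms]
    return f"OR({', '.join(rendered)}, minimum={minimum})"


query = build_more_like_this_query(
    unigrams=["search", "engine", "index"],
    bigrams=["search engine", "query language"],
    trigrams=["full text search"],
)
too_short = build_more_like_this_query(["search"], [], [])
```

With six terms in total, the sketch sets `minimum=2`, so a document must match at least two of the extracted terms to be considered similar.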
Using More-Like-This Queries
- Make sure that the morelikethis and languagemodel modules are loaded by including them in the list of modules supplied to createproject.
- For languages besides English:
- For each language, download and add the stopwords dictionary to the stopWordDictionaries map.
- For each language, download the appropriate languagemodel jar and put it in the <install_dir>/lib directory.
- For each language, add a map section to the models map for the languageModelService component. Copying the "en" map and replacing "en" with the appropriate language code should suffice.
- Ingest (or re-ingest) content in a workflow that includes the extractMoreLikeThisQuery component.
The query computed by the morelikethis module for a document does not take into account query workflows that may change the query's meaning. For example, if the morelikethis query for a document is submitted to a query workflow that includes automatic spelling correction, the query may be mutated into one that does not return that document.