Overview
The queryPreProcess and queryAttivioLinguistics workflows are specific to the analysis of unstructured data.
Note that while all of these components are enabled in the default workflow, their operation depends heavily on the properties set on the query and on the schema properties of the fields that appear in the query.
Spelling correction, when enabled, occurs after stopword extraction but before other natural-language transformations in the default query workflow. Spelling correction can sometimes change a search term in a way that triggers unexpected synonym expansion, etc. Disabling spelling correction while testing natural-language features is recommended.
Linguistics Workflow Transformers
Tokenization
Queries are tokenized by the queryAnalyzer component, which tokenizes each field using the query.tokenizer schema field property. The default tokenizer works as described in the section on the ingestion default tokenizer.
Tokenization also includes lemmatization (on by default) and stemming (off by default) for all languages supported by AIE.
Query Stopword Remover
Stopword configuration offers a complex array of features involving creating stopword dictionaries, filtering stopwords from some fields of the index, filtering stopwords from some fields of queries, and using different stopword dictionaries for different languages. See Stopword Removal for an overview of these features.
The Query Stopword Remover component examines certain fields of a query for tokens from a stopword dictionary. The Query Stopword Remover's further actions depend on the schema properties, the properties of the query, and the information in the stopword dictionary.
At query time, there are three modes of stopword removal, which are set by using the stopwords query parameter:
- off - Stopwords are not removed (unless the schema requires stopword removal at index time, in which case this becomes remove).
- remove - Stopwords are removed from the query.
- block - Stopwords are assigned boosts (presumably downboosts) from the query stopword dictionary.
A typical query stopword dictionary (from AIE's Dictionary Manager) looks like this:
"#TYPE","STOPWORD" "#NAME","Stopwords" "#GROUP","MyDictionaries" "#LOCALE","en" "and","5" "because" "the" "then"
The query stopword dictionary can be exported as a CSV file, where the first element on each line is a stopword. The second element, if present, is a boost weight. Boost weights below 100 are downboosts, used to reduce the match score of a document that contains the stopword. In block mode, if the stopword has no boost weight assigned in the dictionary, it is deleted from the query.
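The three query-time modes can be illustrated with a small standalone Python sketch. It is not AIE code: the names load_stopwords and apply_stopwords are hypothetical, lower-casing tokens before lookup is an assumption, and a boost of 100 is assumed to mean "unchanged" because the text only states that weights below 100 are downboosts. The dictionary text is the exported example above, written one row per line.

```python
import csv
import io

# The exported example dictionary, one CSV row per line.
DICTIONARY_CSV = '''"#TYPE","STOPWORD"
"#NAME","Stopwords"
"#GROUP","MyDictionaries"
"#LOCALE","en"
"and","5"
"because"
"the"
"then"
'''

NEUTRAL_BOOST = 100  # assumption: 100 leaves the match score unchanged


def load_stopwords(text):
    """Return {stopword: boost or None}, skipping '#'-prefixed header rows."""
    entries = {}
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].startswith("#"):
            continue
        boost = int(row[1]) if len(row) > 1 and row[1] else None
        entries[row[0].lower()] = boost
    return entries


def apply_stopwords(tokens, stopwords, mode):
    """Return (token, boost) pairs that survive the given stopwords mode."""
    if mode not in ("off", "remove", "block"):
        raise ValueError(f"unknown stopwords mode: {mode!r}")
    kept = []
    for token in tokens:
        key = token.lower()
        if mode == "off" or key not in stopwords:
            kept.append((token, NEUTRAL_BOOST))
        elif mode == "block" and stopwords[key] is not None:
            # block mode keeps a stopword only if the dictionary gives it a weight
            kept.append((token, stopwords[key]))
        # otherwise the stopword is dropped (remove mode, or block mode without a weight)
    return kept
```

With the tokens salt, and, the, sea, remove mode keeps only salt and sea, while block mode also keeps and with its dictionary weight of 5 and drops the, which has no weight.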
Stopword removal can also be configured in the AIE Schema. The schema field property stopwords.mode can have the following values:
- index - specifies that all stopwords are removed at index time.
- off - specifies no special stopword removal processing.
- query - specifies that stopword removal is under the fine-grained control of the query.
Here is an example showing how to use the schema to remove stopwords from queries against the text field:
    <field name="text" type="TEXT" multivalue="true" indexed="true" stored="true" sort="false" tokenize="yes">
      <properties>
        <property name="stopwords.mode" value="query" />
      </properties>
    </field>
Note that if stopwords are removed at index time (stopwords.mode = "index"), the Query Stopword Remover component also removes them at query time.
Note that stopword properties on the query override the ones in the schema (unless, as noted above, the schema specifies stopword removal at index time, which overrides everything else).
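The precedence rules in the two notes above can be summarized in a short Python sketch. This is an illustration only, not AIE code: the function name effective_stopword_mode is hypothetical, and falling back to off when the query does not set the stopwords parameter is an assumption, since the text does not state the default.

```python
def effective_stopword_mode(schema_mode, query_mode=None):
    """Combine the schema's stopwords.mode with the query's stopwords parameter."""
    if schema_mode not in ("index", "off", "query"):
        raise ValueError(f"unknown schema stopwords.mode: {schema_mode!r}")
    if query_mode not in (None, "off", "remove", "block"):
        raise ValueError(f"unknown query stopwords mode: {query_mode!r}")

    # Index-time removal overrides everything else: stopwords are absent from
    # the index, so they are removed from the query as well.
    if schema_mode == "index":
        return "remove"
    # Otherwise a setting on the query overrides the schema.
    if query_mode is not None:
        return query_mode
    # Assumption: with no query-level setting, nothing is removed.
    return "off"
```

For example, effective_stopword_mode("index", "off") yields remove, matching the exception noted for the off mode, while effective_stopword_mode("query", "block") yields block.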
The following XML configuration for the Query Stopword Remover component is taken from the file <project-dir>\conf\components\queryStopwords.xml after editing the field, locale, and dictionary name properties in the component editor of the AIE Administrator. (You must also update the project so that this file is written into the project source tree before you can view it.)
    <component xmlns="http://www.attivio.com/configuration/type/componentType"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               name="queryStopwords"
               class="com.attivio.platform.transformer.query.RemoveStopwords"
               xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType http://www.attivio.com/configuration/type/componentType.xsd ">
      <properties>
        <list name="fields">
          <entry value="text"/>
        </list>
        <property name="defaultDictionaryLocale" value="en"/>
        <property name="defaultDictionaryName" value="Stopwords"/>
      </properties>
    </component>
The ingestion and query stopword removers should both use the same stopword dictionary.
To compare this with ingestion-time stopword removal, see the section on the ingestion stopword remover.
Query Synonym Expander
The Query Synonym Expander expands the query using a synonym dictionary created in the Dictionary Manager. The dictionary can be exported as a .CSV file with the following format:
- The first item on each line is the term to expand.
- Second and subsequent items are synonyms in the form "synonym" or "synonym^boost".
The following is an example of an exported synonym dictionary:
"#TYPE","SYNONYM" "#NAME","Synonyms" "#GROUP","MyDictionaries" "#LOCALE","en" "Labrador Retriever","dog" "guitarist","musician" "pirate","sailor","buccaneer^200","privateer^300"
Note that by default, each query term is mapped to lower case before it is looked up in the synonym dictionary. The following Simple Query Language queries expand into advanced queries as shown:
original query | expanded query |
---|---|
pirate | or(pirate, sailor, buccaneer*200, privateer*300) |
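The lookup and expansion can be sketched in standalone Python. This is not the querySynonymizer implementation: load_synonyms and expand_synonyms are hypothetical names, lower-casing the dictionary keys is an assumption, only single-term queries are handled, and the rendering simply mirrors the example on this page, where a boost written with ^ in the dictionary appears with * in the expanded query.

```python
import csv
import io

# The exported example synonym dictionary, one CSV row per line.
SYNONYM_CSV = '''"#TYPE","SYNONYM"
"#NAME","Synonyms"
"#GROUP","MyDictionaries"
"#LOCALE","en"
"Labrador Retriever","dog"
"guitarist","musician"
"pirate","sailor","buccaneer^200","privateer^300"
'''


def load_synonyms(text):
    """Return {lower-cased term: [synonym entries]}, skipping '#' header rows."""
    table = {}
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].startswith("#"):
            continue
        table[row[0].lower()] = row[1:]
    return table


def expand_synonyms(term, table):
    """Render a single-term query as an or(...) expression if it has synonyms."""
    key = term.lower()  # terms are lower-cased before the dictionary lookup
    synonyms = table.get(key)
    if not synonyms:
        return term  # no entry: the term is left as it was
    # Mirror the page's example: "buccaneer^200" is shown as buccaneer*200.
    parts = [key] + [entry.replace("^", "*") for entry in synonyms]
    return "or(" + ", ".join(parts) + ")"
```

Because of the lower-casing step, Pirate and pirate expand identically, while a term with no dictionary entry passes through unchanged.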
Query Synonym Expansion is enabled on a per-query basis in AIE. The default is to not perform query synonym expansion. To enable query synonym expansion, do one of the following:
- Select Synonyms:ON or AUTO in the debug search interface.
- When using Search UI, select a Profile that enables synonyms. Profiles are managed through the Attivio Business Center (ABC).
- From Java, use QueryRequest.setSynonymsMode().
To configure the querySynonymizer component, use the AIE Administrator > System Management > Palette and filter for querySynonymizer. Click on the component to open the component editor.
Use the name of the synonym dictionary as displayed in the Dictionary Manager.
Note the list of Configured Fields that should be synonymized. The content field will be used when you submit a fieldless query. It contains the concatenated values of the author, title and text fields.
You can set a synonym mode for a field in the AIE Schema. For example, the following definition of field s1 explicitly sets the synonyms.mode property to off:
    <field name="s1" type="string" indexed="true" stored="true">
      <properties>
        <property name="synonyms.mode" value="off"/>
      </properties>
    </field>
If a schema field sets synonyms.mode to on by default, then setting synonyms=false in the query language will override that setting for that specific query.
Query Acronym Expander
The Query Acronym Expander expands the query using an acronym dictionary created in the Dictionary Manager. The dictionary can be exported as a .CSV file in which each line has the following fields:
- Acronym - the term to expand
- Expansions - the phrase(s) to expand to, with optional boost values. Default boost is 100.
For example, consider the following small dictionary:
"#TYPE","ACRONYM" "#NAME","Acronyms" "#GROUP","MyDictionaries" "#LOCALE","en" "AFAICT","as far as I can tell" "ASAP","as soon as possible^130" "IRS","internal revenue service","interest rate swap" "VIP","very important person"
Queries containing acronyms are rewritten as follows:
original query | expanded query |
---|---|
irs | or(irs, "internal revenue service", "interest rate swap") |
This query looks for "Internal Revenue Service" and "interest rate swap" as well as "IRS".
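A rough Python sketch of this rewrite follows. It is illustrative only, not the queryAcronymExpander implementation: the function names are hypothetical, the lookup is made case-insensitive by lower-casing both the dictionary keys and the query term, and expansions without an explicit boost get the default of 100. How a non-default boost is rendered in the expanded query is an assumption borrowed from the synonym example (phrase followed by *boost).

```python
import csv
import io

# The example acronym dictionary, one CSV row per line.
ACRONYM_CSV = '''"#TYPE","ACRONYM"
"#NAME","Acronyms"
"#GROUP","MyDictionaries"
"#LOCALE","en"
"AFAICT","as far as I can tell"
"ASAP","as soon as possible^130"
"IRS","internal revenue service","interest rate swap"
"VIP","very important person"
'''

DEFAULT_BOOST = 100


def load_acronyms(text):
    """Return {lower-cased acronym: [(phrase, boost), ...]}."""
    table = {}
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].startswith("#"):
            continue
        expansions = []
        for entry in row[1:]:
            phrase, sep, boost = entry.partition("^")
            expansions.append((phrase, int(boost) if sep else DEFAULT_BOOST))
        table[row[0].lower()] = expansions
    return table


def expand_acronym(term, table):
    """Render a single-term query as or(acronym, "phrase", ...) if it is known."""
    key = term.lower()
    expansions = table.get(key)
    if not expansions:
        return term
    parts = [key]
    for phrase, boost in expansions:
        rendered = '"' + phrase + '"'
        if boost != DEFAULT_BOOST:
            rendered += "*" + str(boost)  # assumed notation, see the lead-in
        parts.append(rendered)
    return "or(" + ", ".join(parts) + ")"
```

Calling expand_acronym with IRS or irs produces the same or(...) expression as in the table above.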
Query Acronym Expansion is a field-based property declared in the file <project_dir>\conf\schema\default.xml. Enable the Query Acronym Expansion at query time by adding the following to the schema field properties:
<property name="acronyms.mode" value="query" />
By default, acronyms are not expanded.
Note that acronym expansion is case-insensitive, so both cod and COD may be expanded to cash on demand, even when cod refers to the type of fish.
If there are multiple entries in an acronym dictionary with the same TERM (the part before the first comma), AIE only processes the one that occurs last in the dictionary. In all such cases, ensure that your acronym TERM isn't already in the dictionary. If it is, append the entries to the pre-existing entry line, as described above.
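One way to guard against this is to merge duplicate lines in the exported CSV before loading it back into the Dictionary Manager. The following Python helper is a hypothetical sketch, not an AIE tool: the name merge_duplicate_terms is invented, and treating terms that differ only in case as duplicates is an assumption based on expansion being case-insensitive.

```python
import csv
import io


def merge_duplicate_terms(text):
    """Return CSV text where rows sharing a TERM are merged into the first such row."""
    header_rows = []
    merged = {}  # lower-cased term -> merged row; dicts keep insertion order
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        if row[0].startswith("#"):
            header_rows.append(row)
            continue
        key = row[0].lower()
        if key not in merged:
            merged[key] = list(row)
            continue
        # Append expansions from the later row unless they are already present.
        for expansion in row[1:]:
            if expansion not in merged[key][1:]:
                merged[key].append(expansion)
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(header_rows)
    writer.writerows(merged.values())
    return out.getvalue()
```

Given one line for IRS with "internal revenue service" and a later line for irs with "interest rate swap", the helper emits a single IRS line carrying both expansions, so neither is lost to the last-entry-wins rule.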
To customize the queryAcronymExpander, navigate to the AIE Administrator > System Manager > Palette and filter for queryAcronymExpander. Click on the component to open the component editor.
Enter the name of the dictionary as shown in the Dictionary Manager.
Also, note the list of fields that is part of the configuration. This list contains the fields that are typically acronymized.
Also, if a schema field sets acronyms.mode to query by default, then setting acronyms=false in the query will override that setting for that specific query.
Linguistics Query Workflow Transformer Dependencies Table
Transformer | Dependencies |
---|---|
Query Tokenizer | None |
Query Stopword Remover | Must be turned off or on at both ingestion and query time for search to work properly. |
Query Synonym Expander | If stemming is enabled and precedes the Query Synonym Expander (as in the default workflow), the stem is used to look up any synonyms. To avoid unexpected results, stemming all terms and synonyms is recommended. |
Query Acronym Expander | If stemming is enabled and precedes the Query Acronym Expander (as in the default workflow), the stem is used to look up any acronyms. To avoid unexpected results, stemming all acronyms and expansions is recommended. |