In Content Ingestion - Concepts and Tools we examined the process of adding documents to the AIE index. This page examines the other side of AIE: querying.
AIE offers many useful query-related features and mechanisms. This page is a survey of these topics, intended to provide perspective and access through links to more-specific documentation.
Concepts and Vocabulary
This section presents the most basic concepts relating to querying the AIE index.
Query and Response
Terms such as "query" and "querying" refer to the process of submitting a search request to the Index Engine. This is the first half of the query/response cycle. In the illustration below, a user application (such as SAIL) sends a query to the AIE Index Engine.
To complete the cycle, the Index Engine must compose a reply and send it back to the user application. This process is the "response" half of the cycle.
You should think of "query" and "response" as related but very different processes. "Query" relates to every part of passing the request into AIE for searching. "Response" concerns passing the results back to the user application. Both support many interesting and powerful features.
QueryString and Query Object
Every query begins as a string in the syntax of one of AIE's two query languages. This QueryString is parsed into a Query object, which represents the query as a tree-structured model that can be operated on programmatically.
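The tree-structured model can be pictured with a small standalone sketch. The Term and BooleanNode classes below are hypothetical stand-ins written for illustration; they are not the classes of the AIE Java Query Model, and the example query shape is assumed. The point is only that a tree is easy to operate on programmatically.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Term:
    """Leaf node: one search term, optionally restricted to a field."""
    text: str
    field_name: str = ""


@dataclass
class BooleanNode:
    """Inner node: an operator (AND / OR) applied to child nodes."""
    operator: str
    children: List["Node"] = field(default_factory=list)


Node = Union[Term, BooleanNode]


def collect_terms(node: Node) -> List[str]:
    """Walk the tree depth-first and return the text of every term."""
    if isinstance(node, Term):
        return [node.text]
    terms: List[str] = []
    for child in node.children:
        terms.extend(collect_terms(child))
    return terms


# Hand-built tree for a query shaped like AND(title:economy, OR(growth, trade)).
tree = BooleanNode("AND", [
    Term("economy", field_name="title"),
    BooleanNode("OR", [Term("growth"), Term("trade")]),
])
print(collect_terms(tree))
```

A later workflow stage could use the same kind of traversal to add, remove, or rewrite nodes before the query is executed.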
In addition to the elements shown above, the parsed query may contain other operators, boosts, and so on, applied by the engine as part of the query workflow and execution. You can view the parsed query tree in the "Legacy XML" output format of the Debug Search tool in the AIE Administrator interface. (The query tree is at the end of the output, after all the matching documents.)
QueryRequest and QueryResponse
A QueryRequest message encapsulates everything that is known about a specific query. The message originates in the user application and is passed through a number of query-side workflow stages on its way to the Index Engine.
The Index Engine executes the query and then records the results in a QueryResponse message. This message is passed through response workflows on its way back to the user application.
Query and Response Workflows
The Query and Response workflows fall within the domain of the Java Server API. The API can be used to create QueryRequest messages and Query objects, and to create new query or response transformers as needed.
The distinction between Query and Response appears at the workflow level as the defaultQuery workflow and the defaultResponse workflow.
Note that both default workflows are subflows of the search workflow. The searcher component sends QueryRequests to the Index Engine, and receives QueryResponses from it.
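The shape of this cycle can be sketched as a chain of functions: query-side stages, then the engine, then response-side stages. Everything in the sketch is hypothetical (the dictionary messages, the stage names, and the canned engine); it does not use the AIE Java Server API and only illustrates the order in which a message is handled.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]
Stage = Callable[[Message], Message]


def run_stages(message: Message, stages: List[Stage]) -> Message:
    """Pass a message through each stage in order."""
    for stage in stages:
        message = stage(message)
    return message


def lowercase_query(request: Message) -> Message:
    # Hypothetical query-side stage: normalize the query string.
    return {**request, "query": str(request["query"]).lower()}


def fake_engine(request: Message) -> Message:
    # Stand-in for the Index Engine: returns a canned response.
    return {"request": request, "documents": ["doc1", "doc2"]}


def count_documents(response: Message) -> Message:
    # Hypothetical response-side stage: annotate the response.
    return {**response, "total": len(response["documents"])}


def search(request: Message) -> Message:
    """Run query stages, then the engine, then response stages."""
    prepared = run_stages(request, [lowercase_query])
    response = fake_engine(prepared)
    return run_stages(response, [count_documents])


result = search({"query": "Economy"})
print(result["total"], result["request"]["query"])
```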
The Attivio Platform exposes a rich set of query interfaces. There is a Google-like keyword search interface (Simple Query Language) and a sophisticated query language that supports relational search (Advanced Query Language).
AIE supports many query-related features through a variety of mechanisms. The subtopics in this section provide a survey of these features with links out to more specific documentation.
Facets and Facet Finding
A facet is a list of distinct values for a specific field within an Attivio Intelligence Engine (AIE) schema. The individual values within a facet are referred to as facet values. For example, the Factbook demonstration offers a facet list of the tables found in the index after ingestion. When the user clicks on one of these items, AIE returns search results from that table only.
AIE locates and constructs facets in two ways:
- Many search clients present the user with a "landing page" that shows facets covering the entire corpus. These facets are generated by AIE's FacetFinder feature.
- Once the user starts to "drill down" to results, facet lists are usually based on the current result set, and can be controlled in various ways by making modifications to the QueryRequest object. See Facets for more information.
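As a rough illustration of the second case, the sketch below computes a facet from a result set and then drills down on one facet value. The documents are plain dictionaries and the function names are invented for this example; AIE computes facets inside the engine, not in client code like this.

```python
from collections import Counter
from typing import Dict, List, Tuple


def facet_values(results: List[Dict[str, str]], field_name: str) -> List[Tuple[str, int]]:
    """Return the distinct values of a field with their counts, most frequent first."""
    counts = Counter(doc[field_name] for doc in results if field_name in doc)
    return counts.most_common()


def drill_down(results: List[Dict[str, str]], field_name: str, value: str) -> List[Dict[str, str]]:
    """Keep only the results whose field has the selected facet value."""
    return [doc for doc in results if doc.get(field_name) == value]


# Invented result set with a "table" field, as in the Factbook example.
results = [
    {"table": "city"}, {"table": "city"}, {"table": "news"},
    {"table": "city"}, {"table": "country"}, {"table": "news"},
]
print(facet_values(results, "table"))
print(len(drill_down(results, "table", "news")))
```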
AIE offers highlighting of text segments that match the current query, as seen in this example of a search for "nation":
Note that the highlight accurately encloses variants of the original word, which implies that there is quite a lot going on behind the scenes. For more information see Highlighting Results.
If a query produces zero results, it can be automatically resubmitted in a modified form. There are two versions of this feature. Both may be enabled at the same time if desired.
Spell checking is an important part of any effective full-text search system, as users who misspell a query term are easily frustrated when the query returns poor results.
AIE provides the following spell checking and correction features:
- Spelling suggestions for misspelled query terms, providing multiple suggestions ordered by likelihood.
- Automatic correction of misspelled query terms, using:
- Only the most likely suggestion for each misspelled term.
- All suggestions (including the original terms) up to some configurable number.
- Automatic correction of misspelled query terms when a query results in zero hits, using either or both correction modes.
For more information, please see Manage Dictionaries and Configure Spell Checking.
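The idea of ordered suggestions and automatic correction can be sketched with Python's standard difflib module. This is only an approximation under stated assumptions: the word list is invented, and string similarity stands in for whatever likelihood measure AIE's dictionaries actually use.

```python
from difflib import get_close_matches
from typing import List

# Invented dictionary of correctly spelled words.
DICTIONARY = ["economy", "nation", "trade", "tariff"]


def suggest(term: str, max_suggestions: int = 3) -> List[str]:
    """Return suggestions for a misspelled term, most similar first."""
    if term in DICTIONARY:
        return []  # correctly spelled, nothing to suggest
    return get_close_matches(term, DICTIONARY, n=max_suggestions, cutoff=0.6)


def correct(terms: List[str]) -> List[str]:
    """Replace each misspelled term with its most likely suggestion."""
    corrected = []
    for term in terms:
        suggestions = suggest(term, max_suggestions=1)
        corrected.append(suggestions[0] if suggestions else term)
    return corrected


print(suggest("econmy"))
print(correct(["econmy", "trade"]))
```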
The query is resubmitted after converting the ANDed terms into ORed terms. See Query Resubmission for more information.
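The AND-to-OR fallback can be sketched as follows. The tiny in-memory index and both function names are assumptions made for this example; in AIE the resubmission is handled by the platform rather than by client code.

```python
from typing import Dict, List, Set

# Invented index: document id -> set of words in the document.
INDEX: Dict[str, Set[str]] = {
    "doc1": {"economy", "growth"},
    "doc2": {"trade", "tariff"},
}


def search(terms: List[str], operator: str) -> List[str]:
    """Return ids of documents matching all (AND) or any (OR) of the terms."""
    match = all if operator == "AND" else any
    return sorted(
        doc_id for doc_id, words in INDEX.items()
        if match(term in words for term in terms)
    )


def search_with_resubmit(terms: List[str]) -> List[str]:
    """Try AND first; if nothing matches, resubmit the same terms as OR."""
    hits = search(terms, "AND")
    if not hits:
        hits = search(terms, "OR")
    return hits


print(search_with_resubmit(["economy", "tariff"]))
```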
Relevancy Model (Scoring)
Relevancy is a measure of how well a particular document matches a user query. AIE sorts a query result set according to the relevancy score of each document. AIE provides excellent search results out of the box; however, it is possible to improve the relevancy of search results for specific scenarios. The primary mechanism for tuning relevancy is the Relevancy Feature configuration.
Relevancy can be adjusted to provide score boosts for matches that fulfill certain criteria:
- Boost for matching a term in a specific field (the title field, usually).
- Boost when a field contains an implicit phrase.
- Boost when the field value exactly matches the query (for example, "Tom Sawyer").
- Boost when multiple terms occur in close proximity to one another.
- Boost matches that occur within a scope.
- Boost the score of newer documents over older ones.
- Allow documents to carry a static boost value (always boost this document by this much).
- Boost the document based on geographic distance from a point (for example, the nearest pizza parlor).
It is also possible to substitute your own scoring formula. For more information, see Machine Learning Relevancy.
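To make the idea of boosts concrete, here is a toy scoring function that adds a title boost, a freshness boost, and a static boost to a simple term count. The formula, the weights, and the Doc class are all invented for illustration; they are not AIE's relevancy model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    title: str
    body: str
    age_days: int
    static_boost: float = 0.0  # "always boost this document by this much"


def score(doc: Doc, terms: List[str], title_boost: float = 2.0,
          freshness_weight: float = 1.0) -> float:
    """Toy score: term counts, plus extra weight for title matches and new documents."""
    title_words = doc.title.lower().split()
    body_words = doc.body.lower().split()
    total = doc.static_boost
    for term in terms:
        total += body_words.count(term)                 # base match in the body
        total += title_boost * title_words.count(term)  # boost for a title match
    total += freshness_weight / (1 + doc.age_days)      # newer documents score higher
    return total


docs = [
    Doc("American novels", "tom sawyer is a novel", age_days=100),
    Doc("Tom Sawyer", "a novel by mark twain", age_days=100),
]
ranked = sorted(docs, key=lambda d: score(d, ["tom", "sawyer"]), reverse=True)
print([d.title for d in ranked])
```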
As mentioned above, search results in a QueryResponse object are usually sorted by relevancy score. There are other options, however, including the following ones:
- Sort based on a field value.
- Sort based on freshness (a field expression).
- Sort based on document order in the index.
- Sort randomly (so each user sees the results in a different order).
For more information, please see Sorting Results.
AIE is "schema-neutral" in that it allows you to extend your system with new tables of data and content at any time, without having to change the underlying data model or re-ingest information. This is in contrast to conventional databases, where creating new tables is a major inconvenience.
AIE queries can combine SQL-like JOIN processing of these index tables with keyword search (or more advanced search features) on unstructured text. For instance, in the Factbook demo we showed this example of a JOIN query from the Advanced Query Language:
JOIN(AND(table:city, size:>5000000), INNER(AND(table:news, economy), on=country))
This query lists all cities that have over five million people, provided there is at least one current news article for the same country that contains the word "economy".
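The semantics of that query can be mimicked over plain dictionaries. The records below are invented, and the "title" and "text" field names are assumptions; only "table", "size", and "country" come from the query itself. The sketch shows the logic of an inner join on country, not how the Index Engine evaluates it.

```python
from typing import Dict, List

# Invented sample records; "title" and "text" are assumed field names.
RECORDS: List[Dict[str, object]] = [
    {"table": "city", "title": "Alpha City", "size": 8_000_000, "country": "Northland"},
    {"table": "city", "title": "Beta Town", "size": 9_000_000, "country": "Southland"},
    {"table": "city", "title": "Gamma Village", "size": 40_000, "country": "Northland"},
    {"table": "news", "text": "Economy shrinks in third quarter", "country": "Northland"},
    {"table": "news", "text": "Team wins championship", "country": "Southland"},
]


def big_cities_with_news(records, min_size=5_000_000, keyword="economy"):
    """Rough equivalent of the JOIN query above, evaluated over plain dicts."""
    # Right side of the join: countries with a news record containing the keyword.
    countries = {
        r["country"] for r in records
        if r["table"] == "news" and keyword in str(r["text"]).lower().split()
    }
    # Left side: large cities, kept only when the join key has a match (INNER).
    return [
        r["title"] for r in records
        if r["table"] == "city" and r["size"] > min_size and r["country"] in countries
    ]


print(big_cities_with_news(RECORDS))
```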
For more information, please see Relational Querying JOIN.
Field Expressions are added to the QueryRequest to tell the Index Engine which document fields to return, and how to perform useful manipulations on some of the field values. These manipulations occur inside the Index Engine during the creation of the QueryResponse message, and therefore are not part of either the defaultQuery or defaultResponse workflows.
Field expressions can perform the following manipulations, among others:
- Controlling which fields will be returned from matching documents.
- Transforming returned values in various interesting ways:
- Arithmetic operations
- Trig functions and conversions
- Simple descriptive statistics
- Tests of equality and inequality
- Adding metadata (such as a mean value or geographic distance from a point) to the returned document.
- Sorting returned documents in customized ways.
- Boosting match scores.
Many of the search features described on this page are implemented through field expressions. See Field Expressions for more information.
AIE can group or "collapse" duplicate results, such as results with the same headline, into a single query result. This feature is called Field Collapsing. When a FieldCollapse specification is set on the QueryRequest, Field Collapsing returns only one of the results that share the same value for the specified field.
See Field Collapsing for more information.
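The effect of collapsing can be sketched in a few lines. This is a client-side imitation with invented data and an invented function name; it keeps the first result for each value of the chosen field, which may differ from how AIE chooses the surviving result.

```python
from typing import Dict, List


def collapse(results: List[Dict[str, str]], field_name: str) -> List[Dict[str, str]]:
    """Keep the first result for each distinct value of field_name."""
    seen = set()
    collapsed = []
    for doc in results:
        value = doc.get(field_name)
        if value is None:
            collapsed.append(doc)  # nothing to collapse on, keep the result
            continue
        if value in seen:
            continue               # duplicate value, drop this result
        seen.add(value)
        collapsed.append(doc)
    return collapsed


results = [
    {"id": "1", "headline": "Markets rally"},
    {"id": "2", "headline": "Markets rally"},
    {"id": "3", "headline": "Storm warning"},
]
print([doc["id"] for doc in collapse(results, "headline")])
```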
Scope search is a special feature of the Advanced Query Language that lets us match query terms within specific contexts, such as "find both of these keywords within the same sentence."
It also supports matching on different types of entities, such as "find two persons and one company in the same sentence."
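The "same sentence" idea can be approximated outside the engine with a naive sentence splitter. The splitting rule and function names here are assumptions for illustration only; AIE relies on scopes recorded in the index at ingestion time, not on regular expressions at query time.

```python
import re
from typing import List


def sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def same_sentence(text: str, first: str, second: str) -> List[str]:
    """Return the sentences that contain both keywords."""
    wanted = {first.lower(), second.lower()}
    return [
        s for s in sentences(text)
        if wanted <= set(re.findall(r"\w+", s.lower()))
    ]


text = "The economy grew. Trade slowed while the economy stalled. Trade rose."
print(same_sentence(text, "trade", "economy"))
```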
For more information, see Scope Search.
Caching and Autowarming
When AIE first encounters a particular query, it sends the query to the Index Engine and pulls matching records from the disk. It then stores optimized portions of the query and the response data structures in memory (caching). For subsequent evaluations of the same or similar queries, AIE is able to make use of the cache to speed up query execution.
For more information, see Configuring Query Caches and Autowarm Search Caches.
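The benefit of caching can be illustrated with Python's functools.lru_cache. This sketch caches only identical query strings over an invented corpus; AIE's caches are more fine-grained, storing portions of queries and response structures, so treat this as an analogy rather than a description of the implementation.

```python
from functools import lru_cache
from typing import Tuple

# Invented corpus: document id -> text.
CORPUS = {"doc1": "economy growth", "doc2": "trade economy", "doc3": "sports"}

engine_calls = 0  # counts how often the "slow" lookup really runs


@lru_cache(maxsize=128)
def execute(query: str) -> Tuple[str, ...]:
    """Pretend engine lookup; results are cached per query string."""
    global engine_calls
    engine_calls += 1
    return tuple(sorted(doc for doc, text in CORPUS.items() if query in text.split()))


first = execute("economy")   # computed and stored in the cache
second = execute("economy")  # served from the cache
print(first, engine_calls)
```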
By default, AIE uses a standard paging model for searching the index. The paging model allows developers and/or end users to request a specific number of results starting at a specified offset. These queries were designed for the typical search use case, where the user will review relatively few returned documents.
However, some applications need to retrieve many or even all matching documents, no matter how many there might be. For these applications, AIE offers "streaming" queries. A streaming query allows developers to request all results, facets, or document IDs back from a query.
Streaming is set up through the Java Client API. For more information, see Streaming Query API.
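From the caller's point of view, the difference between the two models looks roughly like this. The function names and the list of matches are invented, and the generator here simply walks through pages internally, which is an assumption made to keep the sketch short and not a description of how the Streaming Query API works.

```python
from typing import Iterator, List

# 25 pretend matching document ids.
MATCHES = [f"doc{i}" for i in range(1, 26)]


def fetch_page(offset: int, rows: int) -> List[str]:
    """Paging model: return `rows` results starting at `offset`."""
    return MATCHES[offset:offset + rows]


def stream_all(batch_size: int = 10) -> Iterator[str]:
    """Streaming model: yield every match without the caller tracking offsets."""
    offset = 0
    while True:
        batch = fetch_page(offset, batch_size)
        if not batch:
            return
        yield from batch
        offset += batch_size


print(fetch_page(10, 5))
print(sum(1 for _ in stream_all()))
```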
Geographic Search Features
When documents contain latitude and longitude fields, AIE can offer various search-related features based on location or distance. These include:
- Filtering results based on distance from a geographic location (for example, terrorist attacks within 50 miles of London).
- Sorting results based on distance from a point.
- Adding a boost to a document's relevancy score based on distance to a point.
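Distance filtering and sorting can be sketched with the haversine great-circle formula. The documents, field names, and function names are invented for this example, and the haversine formula is a common choice assumed here; the page does not say which distance calculation AIE uses.

```python
from math import asin, cos, radians, sin, sqrt
from typing import Dict, List, Tuple

EARTH_RADIUS_MILES = 3958.8
LONDON = (51.5074, -0.1278)  # latitude, longitude


def distance_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, using the haversine formula."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def within(docs: List[Dict], center: Tuple[float, float], max_miles: float) -> List[str]:
    """Filter documents by distance from center and sort nearest first."""
    measured = [
        (distance_miles(center[0], center[1], d["latitude"], d["longitude"]), d)
        for d in docs
    ]
    nearby = [(dist, d) for dist, d in measured if dist <= max_miles]
    return [d["title"] for dist, d in sorted(nearby, key=lambda pair: pair[0])]


# Invented documents with latitude and longitude fields.
DOCS = [
    {"title": "Reading report", "latitude": 51.4543, "longitude": -0.9781},
    {"title": "Paris report", "latitude": 48.8566, "longitude": 2.3522},
    {"title": "Greenwich report", "latitude": 51.4826, "longitude": 0.0077},
]
print(within(DOCS, LONDON, 50))
```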
AIE can ingest documents in more than fifty languages. It automatically identifies the "locale" (the language) of each document. This requires a substantial sample of text from each incoming document.
Queries are typically very short in comparison, and often consist of only one word. It is not possible to perform automatic locale detection with such a small sample. When dealing with non-English documents, therefore, it is essential to set the "locale" value in the QueryRequest, so that AIE can process the query terms to better match the index.
The examples that follow illustrate differences in Lemmatization for the same query in several different languages.
- If we search for the ubiquitous word "domino" using the default English query processing, AIE looks for documents that contain "domino."
- However, if we tell AIE that "domino" is written in Spanish or Portuguese, AIE searches for "domino" and also for its lemma "dominar."
- If we identify "domino" as Italian, AIE searches for "domino" and "domare."
- In Hungarian, AIE searches for "domino" and "dominó."
- For Polish, AIE matches documents containing "domino" and "domina."
From these simple examples you can imagine the impact this can have on complex queries. The default English version of the query could be significantly different from the non-English version, with a corresponding difference in matching documents.
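The examples above can be restated as a small lookup table keyed by locale. The two-letter locale codes and the function name are assumptions made for this sketch, and the table contains only the lemmas quoted on this page; real lemmatization in AIE is driven by its linguistic components, not by a hand-written dictionary.

```python
from typing import Dict, List

# Lemmas for "domino" as described in the examples above, keyed by assumed locale codes.
LEMMAS: Dict[str, Dict[str, List[str]]] = {
    "es": {"domino": ["dominar"]},
    "pt": {"domino": ["dominar"]},
    "it": {"domino": ["domare"]},
    "hu": {"domino": ["dominó"]},
    "pl": {"domino": ["domina"]},
}


def expand(term: str, locale: str = "en") -> List[str]:
    """Return the term plus any lemmas known for the given locale."""
    return [term] + LEMMAS.get(locale, {}).get(term, [])


for locale in ["en", "es", "it", "hu", "pl"]:
    print(locale, expand("domino", locale))
```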
During ingestion, AIE processes the text of a document by applying Text Analytics and Linguistic Analysis. This transforms the text into standardized tokens for indexing, and augments the content and metadata to provide a richer and more accurate set of results.
Incoming queries must be treated similarly or they won't match the index very well.
See Workflow Configuration for information on configuring and customizing workflows in AIE.
This section presents a high-level summary of the query-processing steps provided by the defaultQuery workflow. The object is to surface a list of AIE features that process and enhance queries.
The queryInit workflow is the first subflow of the defaultQuery workflow.
It performs some validity checking before converting the query string into a Query object. Then it applies Attivio Business Center settings to the Query object.
These are the components of the workflow, along with notations of the tasks that they perform:
- optimizeGeoSearch: Adds filters to the query to optimize Geographic Searching calculations, if necessary.
- queryValidator: Checks to be sure the query's locale setting is recognized. This is critical because it tells the tokenization step what language-specific tokenizer to use.
- queryParser: Parses the query string in a QueryRequest into a Query object, using the Java Query Model. Methods on this object let us extend and modify the query in various ways in subsequent workflow stages.
- applyBusinessCenter: This component applies an Attivio Business Center search profile to the query. A search profile can include controlled queries that reorder results and add specific documents to search results. Controlled queries can also block specific results from appearing to a user. The profile can apply dictionaries for synonyms and acronyms. A profile also sets up spelling correction and stop-word processing, if desired.
The queryPreProcess workflow has one component and one mission.
The queryAnalyzer component applies tokenization to the query string. "Tokenization" means to chop the query up into individual words while also noting the position of the words in the original string. The words become "tokens," which are keys to records in the index.
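A minimal sketch of tokenization with positions follows. The regular expression and the lowercasing step are simplifying assumptions; AIE's tokenizers are language-specific and considerably more capable.

```python
import re
from typing import List, Tuple


def tokenize(query: str) -> List[Tuple[str, int, int]]:
    """Split a query into (token, word position, character offset) triples."""
    return [
        (match.group().lower(), position, match.start())
        for position, match in enumerate(re.finditer(r"\w+", query))
    ]


print(tokenize("Nearest pizza parlor"))
```

Keeping the positions alongside the tokens is what makes later features such as phrase and proximity matching possible.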
The queryAttivioLinguistics workflow provides further augmentation of the query by performing stopword removal, spell checking, synonym expansion and acronym expansion, presuming that these features were enabled in the search profile. It also removes diacritical marks from query text and annotates the query with a relevancy model.
The components and their contributions are:
- queryUnicodeNormalizer: This component examines the tokens of the query for diacritical (accent) marks, such as those on ñ and ó in cañón. It creates an unaccented copy of the token (canon) and adds it to the query to facilitate matching records in the index.
- queryStopwords: Applies a stopword dictionary to the query, if required by the search profile.
- querySpellCheck: This is a subflow to the querySpellCheck workflow, which implements spell checking if enabled by the search profile.
- default-queryAttivioLinguistics-qt: Applies a relevancy profile to the query. A relevancy profile describes how to weight the match score based on which fields of the document matched terms in the query. Keyword matches in the title of the document usually carry more weight.
- querySynonymizer: Applies a synonym dictionary if required by the search profile. This adds synonym tokens to the query. The token "search" might trigger a synonym "query".
- queryAcronymExpander: Applies an acronym dictionary if required by the search profile. This expands acronyms in the query by adding tokens. "ACL" expands to "Access Control List".
The net effect of this workflow is to massage the query into a form that is likely to bring back a larger set of relevant matches.
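Several of these steps can be imitated with the Python standard library. The sketch below strips diacritical marks with unicodedata and then adds synonym and acronym tokens from two tiny dictionaries that contain only the examples given above. The function names are invented, and flattening the acronym expansion into separate tokens is a simplification.

```python
import unicodedata
from typing import Dict, List

# Tiny dictionaries holding only the examples mentioned above.
SYNONYMS: Dict[str, List[str]] = {"search": ["query"]}
ACRONYMS: Dict[str, List[str]] = {"acl": ["access", "control", "list"]}


def strip_accents(token: str) -> str:
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def augment(tokens: List[str]) -> List[str]:
    """Add unaccented copies, synonyms, and acronym expansions to the tokens."""
    augmented: List[str] = []
    for token in tokens:
        augmented.append(token)
        plain = strip_accents(token)
        if plain != token:
            augmented.append(plain)              # unaccented copy
        augmented.extend(SYNONYMS.get(token, []))
        augmented.extend(ACRONYMS.get(token, []))
    return augmented


print(augment(["cañón", "search", "acl"]))
```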
The queryPostProcess workflow is concerned with requesting facet lists from the Index Engine.
The queryFacetFinder-index component adds Facet Finder settings to the Query object. A "facet" is a list of potential query terms to use for "drilling down" to a more-specific set of results.
The queryFinalize workflow, as seen in out-of-the-box AIE, is a location where add-on modules can install additional components. It performs no function in the default workflow.
Other Query Transformers
See the AIE javadoc page on Query Transformers for a list of all query transformer classes. Some are used in the default query workflows described above, while others are reserved for customized query flows.
Custom Query Transformers
AIE makes it possible to create custom query transformers and to insert them in the defaultQuery workflow. See Creating Custom Query Transformers and Query Modification for more information.
Processing of QueryResponse messages occurs in the defaultResponse workflow.
The components of this workflow, and their purposes, are:
- customResponse: This is a subflow to the customResponse workflow.
- queryLog: Logs all queries and responses.
- removeMetadataFromResponse: A QueryResponse message normally includes various metadata fields, such as a copy of the QueryRequest. This component optionally strips out the metadata, as controlled by the setIncludeMetadataInResponse() method of the QueryRequest object.
Other Response Transformers
See the AIE javadoc page on Response Transformers for a list of all response transformer classes. Some are used in the default response workflows described above, while others are reserved for customized query flows.
Custom Response Transformers
It is possible to create your own response transformers to operate on the QueryResponse message before returning it to the client application. See Creating Custom Response Transformers for an example.
This kind of manipulation is usually easier to implement in the client application itself, so it is unusual to customize the response workflow.