
Overview

Traditional approaches to relevancy often rank search results by how many words from the query appear in each document, sometimes discounting or ignoring more common words. One such method is to sort results by the cosine similarity between the Term Frequency-Inverse Document Frequency (TF-IDF) vector representation of the search query and that of each document.
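
As a concrete illustration, the sketch below ranks a handful of made-up documents against a query using TF-IDF vectors and cosine similarity. It uses scikit-learn and is not part of Attivio; the documents and query are invented for the example.

```python
# Minimal illustration of TF-IDF ranking with cosine similarity.
# The documents and query are made up for demonstration purposes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Our vacation policy allows 20 days of paid time off per year.",
    "Holiday schedule and vacation request form for all employees.",
    "Cafeteria menu for the week of the company picnic.",
]
query = "vacation policy"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)   # one TF-IDF vector per document
query_vector = vectorizer.transform([query])        # TF-IDF vector for the query

# Rank documents by cosine similarity to the query vector.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.3f}  {documents[i]}")
```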

TF-IDF is good enough for search ranking in some scenarios, but in many cases it performs poorly and degrades the user experience. For example, consider two documents that match the search query "vacation policy", one of which has backlinks from various important locations such as the company's HR homepage. Since TF-IDF is not aware of a document's page rank, it may not rank the more relevant document higher.

To fix this, we could score each document by a weighted sum of its TF-IDF score, a page rank score, and possibly other features as well, and then rank documents according to their total score. This solves the issue of not being aware of page rank, but it presents a new problem: how do we choose the weight for each ranking feature?
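
A minimal sketch of such a linear scorer follows. The feature names, values, and weights are hypothetical and chosen only to illustrate combining per-document feature scores into a total.

```python
# Hypothetical linear scorer: each document gets a weighted sum of its
# feature scores (here TF-IDF similarity and page rank). The names,
# values, and weights are illustrative, not Attivio configuration.
def score_document(features: dict, weights: dict) -> float:
    return sum(weights[name] * features.get(name, 0.0) for name in weights)

weights = {"tfidf": 1.0, "page_rank": 0.3}

documents = [
    {"title": "Old vacation policy draft",   "tfidf": 0.80, "page_rank": 0.05},
    {"title": "Vacation policy (official)",  "tfidf": 0.75, "page_rank": 0.90},
]

# With the page rank weight applied, the document linked from important
# pages outranks the one with a slightly higher TF-IDF score.
for doc in sorted(documents, key=lambda d: score_document(d, weights), reverse=True):
    print(f"{score_document(doc, weights):.2f}  {doc['title']}")
```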

If we weight page rank too high, we may get popular pages that have nothing to do with our query at the top of our search results; conversely, if we weight page rank too low, we may not see any noticeable improvement over TF-IDF scoring. We might consider adjusting the weights manually while observing the results of some test search queries, but this is a fragile approach: we can easily overfit our small test set, leading to poor performance on queries we have not considered.

In the example above we considered a ranking model with only two features, and the situation becomes even harder to manage as more features are added. This is where Machine Learning Relevancy can provide a much better solution. Instead of trying to guess which model weights will provide the best search user experience, we can use user feedback about which results are and are not relevant to their search queries to fit the best possible model weights across our entire data set.

User feedback can be explicit, such as a like or a rating for a particular search result, or it can be implicit, such as a click on a particular result. Given a set of candidate ranking features (e.g., TF-IDF title match score, TF-IDF content match score, document page rank, document views, document type, document age) and a sufficient amount of user data, we can calculate the optimal weighting of our features to best describe how relevant a document is to a given query. Machine Learning does this automatically and efficiently by increasing the weights that cause query-document pairs users have liked or clicked on to score higher, and decreasing the weights that cause query-document pairs users have disliked or ignored to score lower.
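
One common way to fit such weights, sketched below, is logistic regression trained by gradient descent over labeled query-document pairs. The feature vectors and click labels are fabricated for illustration, and Attivio's actual training procedure may differ.

```python
import numpy as np

# Sketch: learn feature weights from implicit feedback (clicks) with
# logistic regression. Each row holds the feature scores for one
# (query, document) pair; label 1 = clicked, 0 = ignored. Data is made up.
feature_names = ["tfidf_title", "tfidf_content", "page_rank", "doc_views"]
X = np.array([
    [0.9, 0.7, 0.2, 0.4],
    [0.2, 0.3, 0.9, 0.8],
    [0.8, 0.6, 0.7, 0.5],
    [0.1, 0.2, 0.1, 0.3],
])
y = np.array([1, 0, 1, 0])

w = np.zeros(X.shape[1])
learning_rate = 0.5
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted probability of relevance
    gradient = X.T @ (p - y) / len(y)   # pushes weights up for clicked pairs,
    w -= learning_rate * gradient       # down for ignored pairs

print(dict(zip(feature_names, w.round(2))))
```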

Since we are using a relatively small number of features to learn what makes a search result relevant, given a relatively large amount of user data, and since we are otherwise taking steps to avoid overfitting, we can expect, in line with Occam's Razor, that our model will generalize well to future, previously unseen search queries. In other words, we will have learned only the most essential features of a document that determine how relevant it is to a search query.

We can also test and monitor our model on new user searches as they come in to verify that it performs as well as expected, and we can periodically retrain our model to make sure it stays aligned with any major shifts in our documents or in user behavior.
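
A simple way to monitor a deployed model, sketched below, is to compute a ranking metric such as mean reciprocal rank over a held-out sample of recent searches and clicks. The click log format here is hypothetical.

```python
# Sketch of monitoring: mean reciprocal rank of the clicked document over a
# held-out sample of new searches. The click log format is hypothetical.
def mean_reciprocal_rank(click_log):
    """click_log: list of (ranked_doc_ids, clicked_doc_id) per query."""
    total = 0.0
    for ranked_doc_ids, clicked_doc_id in click_log:
        if clicked_doc_id in ranked_doc_ids:
            total += 1.0 / (ranked_doc_ids.index(clicked_doc_id) + 1)
    return total / len(click_log)

held_out = [
    (["doc7", "doc2", "doc9"], "doc2"),   # clicked result was ranked 2nd
    (["doc4", "doc1", "doc8"], "doc4"),   # clicked result was ranked 1st
]
print(f"MRR = {mean_reciprocal_rank(held_out):.2f}")  # 0.75
```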

 

Attivio's Relevancy feature leverages Machine Learning to provide:

  • Accurate results: Higher quality results and outcomes
  • Personalized results: User and cohort behavior tracking
  • Adaptive experience: Learns as content and queries change
  • Simple experience: No need for manual tuning

Relevancy in Attivio is split into two components:

  • Relevancy Features: Define the input scores used in relevancy models.
  • Relevancy Models: Define which features are used to score a document, and how the feature scores are combined into a final score for that document.

Relevancy Features

Relevancy features are the inputs used to compute relevancy. Each feature produces a per-document floating-point score that serves as an input to the relevancy computation. Relevancy models reference these features and specify a weight for each one in order to produce a final relevancy score for a document. Relevancy features are configured globally so that multiple relevancy models can reference the same feature.
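
The sketch below illustrates this split in plain Python: features are defined once and produce per-document scores, while each model selects features and assigns its own weights. The names and structure are illustrative only and do not represent Attivio's actual configuration format.

```python
# Globally defined features produce per-document float scores; several
# models reference the same features with different weights. Hypothetical
# names and structure, for illustration only.
relevancy_features = {
    "title_match": lambda doc, query: float(query.lower() in doc["title"].lower()),
    "page_rank":   lambda doc, query: doc["page_rank"],
}

relevancy_models = {
    "default":      {"title_match": 1.0, "page_rank": 0.2},
    "hr_employees": {"title_match": 0.8, "page_rank": 0.6},
}

def relevancy_score(model_name, doc, query):
    weights = relevancy_models[model_name]
    return sum(w * relevancy_features[name](doc, query) for name, w in weights.items())

doc = {"title": "Vacation policy", "page_rank": 0.4}
print(relevancy_score("default", doc, "vacation policy"))       # 1.0*1.0 + 0.2*0.4
print(relevancy_score("hr_employees", doc, "vacation policy"))  # 0.8*1.0 + 0.6*0.4
```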

The following relevancy features are available:

  • Phrase Boost - Compute a score based on proximity matches for search terms.
  • Term Boost - Compute a score based on matches for search terms.
  • Field Boost - Use a field expression to produce a score.
  • Category Boost - A dynamic relevancy feature that creates a per-category feature based on a field expression defining the categories.

See Managing Relevancy Features for general documentation on working with relevancy features via the Administration UI.

Relevancy Models

Different users can ask the same question and expect different results to come back, based on, among other things, their location, organizational unit, and interests.

Relevancy models can be created to provide a tailored search experience based on each user's profile.

A relevancy model consists of a list of features to use for relevancy, along with the weight to apply to each feature. Based on users' behavior, we can learn which results should receive the highest relevancy scores, ensuring we provide accurate and personalized results.

 

Next steps

This page provides an overview of Attivio's Machine Learning Relevancy capabilities. Next you will want to configure and use Relevancy Models.

 
