The information on this page assumes your Attivio Platform and project have Search Analytics installed. If your system doesn't have Search Analytics installed, or if you are upgrading the version you have installed, see Download and install for details on obtaining the latest version of the Search Analytics external module and how to add it to your Attivio Platform installation. You will need to add it to your project using the createproject command-line tool, either when creating a new project (using the -m searchanalytics option) or by incrementally adding it to an existing project (using the -m searchanalytics option in conjunction with the -i (incremental) option).


This diagram shows the components that the Search Analytics module adds to your system and how data flows between them. Each component is described in more detail below.

Search Analytics Architecture


The Search Analytics module consists of three pieces that:

  1. monitor users' queries,
  2. pull the query history into the Attivio Index, and
  3. query that data to provide insights into your users' behavior.

Monitoring Users' Queries

The Search Analytics module hooks into two parts of the Attivio Platform in order to collect information about your users' behavior. To access details about the queries your users make, it inserts a component into the query response workflow and uses this to save information about the query being returned, such as the query string, the user who made it, the time it was made, etc. The full list of data collected about queries is shown below. This data is then handed off to the Search Analytics Archiver, which writes it to an archive file in the file system.

In parallel, any time a user clicks on a search result or otherwise causes signals to be added to the Attivio Platform, the Search Analytics module is notified and data about the signal is sent to the Search Analytics Archiver. This data includes the query that produced the document that was clicked, the document's position in the search results, the time the signal was recorded, etc. The Archiver writes this data to the file system as well.

In an unclustered deployment of the Attivio Platform, the query and signal data is written to CSV files on the node's local file system. In a clustered deployment, the data is written to CSV files in the shared Hadoop file system (HDFS) so they are available to all nodes in the system.

See Ingesting Data into the Attivio Index, below, for details on how this data is added to the Attivio Index to make it available for the Search Analytics application to use.

See Configuring Archiving, below, for details on how you can control the way Search Analytics archives your query and signal data.

Ingesting Data into the Attivio Index

The Search Analytics application takes advantage of the power of the Attivio Index to analyze the data it has about your users' behavior. In order to do this, the data must be ingested into the index from the archive files described above. This is done the same way all data is ingested—using a connector. You can set the connector up to run on a schedule, or otherwise customize it according to your needs. The connector is of type "Search Analytics Connector" but can have any name you want to give it, e.g., "search-analytics." You should generally keep the default parameters, although you can customize its behavior if needed, according to the details in Configuring Ingestion, below.

Analyzing Search Analytics Data

When you run the Search Analytics application (available as a link on the Attivio Business Center's main page or from the menu on the left-hand side of the Admin UI), it makes queries against the Attivio Index to retrieve the data that the connector ingested, and manipulates it to create the charts and tables shown on the dashboard.

See Search Analytics Dashboard for more details on the types of analysis it provides.

Configuring Search Analytics

The sections below provide details on how you can configure the various component pieces of the Search Analytics system. You can change:

  • Which data is archived as queries are made and signals are submitted to the system,
  • How often and when query and signal archive files are written,
  • How long archive files are kept,
  • How often archive files are ingested into the index,
  • How long ingested search analytics data is kept in the index, and
  • Which users have access to the Search Analytics application and its data.

Filtering Archived Data

You can customize the data that gets written to the archive files by writing a custom Java class that implements the interface com.attivio.sdk.server.searchanalytics.SearchAnalyticsArchiverFilter in the Attivio SDK. By default, an instance of the class com.attivio.searchanalytics.archiver.DefaultArchiveFilter is used, which passes all data through to the archive files. Generally, we recommend leaving this configuration as-is and instead filtering data on ingestion; this allows you to re-ingest your click and signal data with different filtering settings (see Configuring Ingestion, below) as long as you still have the archive files available.

Here is the definition of the SearchAnalyticsArchiverFilter interface:
/**
 * Filters out queries that shouldn't be archived by the searchanalytics module.
 * The default implementation defined in conf/searchanalytics/module.xml archives all query responses.
 */
public interface SearchAnalyticsArchiverFilter {

  /**
   * Decide whether or not to archive a particular query response.
   * @param response
   * @return true if this response should be archived
   */
  boolean accept(QueryResponse response);

  /**
   * Decide whether or not to archive a particular signal.
   * @param signal
   * @return true if this signal should be archived
   */
  boolean accept(Signal signal);
}
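
As an illustration, here is a sketch of a custom filter that archives only non-blank, non-health-check queries and drops a hypothetical "heartbeat" signal type. The QueryResponse and Signal classes here are minimal stand-in stubs (not the Attivio SDK types) so the sketch compiles on its own; a real filter would implement com.attivio.sdk.server.searchanalytics.SearchAnalyticsArchiverFilter against the real SDK classes, and the "healthcheck" and "heartbeat" values are purely illustrative.

```java
// Minimal stand-ins for the Attivio SDK types, so this sketch is self-contained.
class QueryResponse {
  private final String queryString;
  QueryResponse(String queryString) { this.queryString = queryString; }
  String getQueryString() { return queryString; }
}

class Signal {
  private final String type;
  Signal(String type) { this.type = type; }
  String getType() { return type; }
}

public class HealthCheckExcludingFilter {
  /** Archive only non-blank queries that aren't our hypothetical health-check probe. */
  public boolean accept(QueryResponse response) {
    String q = response.getQueryString();
    return q != null && !q.trim().isEmpty() && !q.equals("healthcheck");
  }

  /** Archive every signal except a hypothetical "heartbeat" type. */
  public boolean accept(Signal signal) {
    return !"heartbeat".equals(signal.getType());
  }
}
```

As the text above recommends, prefer filtering at ingestion time over a filter like this one, so the raw archives remain complete and re-ingestable.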

See the Attivio SDK documentation on the Attivio Developer Community for details on implementing custom code in your installation.

Configuring Archiving

Archiving of Query Data

You can control which query data is archived by configuring the component that Search Analytics inserts into the "defaultResponse" workflow, queryResponseArchiver. This is configured in the file conf/components/queryResponseArchiver.xml in your project. This component has the following configurable properties:

Property | Type | Default Value | Description
enabled | Boolean | true | This controls whether the query data is archived at all. You can temporarily disable collection of query data by setting this property to "false."
 | Boolean | false | If you set this to "true," the entire contents of each query response will be archived, including all of the documents and any metadata returned. Enabling this will greatly increase the size of the data used by Search Analytics. This additional data is not used by the Search Analytics application but might be useful for your own, custom analysis.
writeBatchSize | Integer | 1000 | This controls how many lines of CSV data (i.e., how many queries) will be batched in memory before being written to the archive files. You might want to lower this value if you have a system with very few/infrequent queries so they're written to disk more often.
batchesInFile | Integer | 100 | This controls how many batches are written to a given CSV file before switching to a new file. Together with the writeBatchSize property, this can help you manage the size and number of the archive files in your system. Most deployments will not need to change this property.
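
The two batching properties combine: each archive file holds writeBatchSize × batchesInFile CSV lines before the archiver rolls over to a new file. A small sketch of that arithmetic (the class and method names here are illustrative, not part of the product API):

```java
// Back-of-the-envelope sizing for the archiver's batching properties.
// Each archive file holds writeBatchSize * batchesInFile CSV lines.
public class ArchiveSizing {
  static int linesPerFile(int writeBatchSize, int batchesInFile) {
    return writeBatchSize * batchesInFile;
  }

  /** How many archive files a given number of queries would produce (ceiling division). */
  static long filesNeeded(long totalQueries, int writeBatchSize, int batchesInFile) {
    long perFile = linesPerFile(writeBatchSize, batchesInFile);
    return (totalQueries + perFile - 1) / perFile;
  }
}
```

With the defaults above (1000 × 100), each query archive file holds 100,000 lines, so 250,000 archived queries would span three files.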

Archiving of Signal Data

Similarly, you can control which signal data is archived by configuring the SignalArchiveApi component. It has the following configurable properties:

Property | Type | Default Value | Description
enabled | Boolean | true | This controls whether the signal data is archived at all. You can temporarily disable collection of signal data by setting this property to "false."
writeBatchSize | Integer | 2000 | This controls how many lines of CSV data (i.e., how many signals) will be batched in memory before being written to the archive files. You might want to lower this value if you have a system with very few/infrequent signals so they're written to disk more often.
batchesInFile | Integer | 100 | This controls how many batches are written to a given CSV file before switching to a new file. Together with the writeBatchSize property, this can help you manage the size and number of the archive files in your system. Most deployments will not need to change this property.

Configuring Retention of Archives

By default, in order to save disk space, the archive files created by the archiver are deleted once they have been ingested into the Attivio Index by the Search Analytics connector. You can change this behavior if you would like to be able to re-ingest them for whatever reason. For example, if you're still determining exactly how you want to configure ingest filtering (see Configuring Ingestion, below), you can leave the archive files on your system longer and then re-run the Search Analytics connector with different settings to see how they affect the data.

Archive retention is configured by changing properties of the bean ArchiveFilesRetentionPolicy:

Property | Type | Default Value | Description
autoCleanupEnabled | Boolean | true | If set, then already ingested archive files will be deleted periodically. The interval at which they are cleaned up is specified by the property cleanupFrequencyInHours. You should generally only ever disable this on a temporary basis, to avoid filling your system's file system with archive data.
cleanupFrequencyInHours | Integer | 3 | The number of hours between cleanups. Only applicable if autoCleanupEnabled is set to true.
deleteScannedFiles | Boolean | true | If set, then already scanned files will be deleted when the clean-up procedure is run.
deleteOldFiles | Boolean | true | If set, then files beyond a certain age will be deleted when the clean-up procedure is run.
oldestFileInDays | Integer | 90 | The age in days beyond which archive files will be removed, regardless of whether they have been scanned.

Configuring Ingestion

There are a few places where you can control which documents are ingested into the Attivio Index and how they are (or are not) transformed along the way.


The Search Analytics connector uses the default ingest workflows and the components contained in these to process Search Analytics data. It is important that these not be modified in a way that would make the connector fail to properly ingest this data. As a general principle, if you need to customize workflows for particular connectors, you should duplicate them and modify the copy instead of the original, to avoid breaking system functionality such as Search Analytics.

Configuring the Connector

You will generally want to define a Search Analytics connector and set it up to run on a regular schedule. How frequently it runs will depend on how active your system is and how up-to-date you need your search analytics data to be.

You can change certain parameters of the Search Analytics connector from the defaults to suit your installation's requirements. See Connectors for more details on configuring connectors in general.

Property Name | Default Value | Description
Node Set | * (all nodes) | The set of nodes on which to run the Search Analytics connector. If you have a multi-node installation and want to limit the connector to run on only certain nodes, you can change this to a more restrictive node set (defined in your project's topology).
Incremental Mode Activated | false | If you enable this, the connector will only add new data to the index when it runs, lightening the load on your system. You can always cause the connector to do a full ingestion by resetting it and re-running it (or choosing "Full Run" in the Connector Admin UI).
Search Analytics Index Zone | searchanalytics | This is the name of the zone in the index where the Search Analytics data is put, to separate it from the default zone that contains your "real" data. If you changed the zone that Search Analytics uses in the features.xml file, you'll need to change it here as well. See The Search Analytics Zone, below.
Days to Retain | 90 | This determines how long the Search Analytics data is kept. When the Search Analytics connector runs, it deletes any data from the index that is older than this many days. This helps prevent the Search Analytics data from growing too large for your index, but also limits the time ranges over which you can perform analysis. (You can set this value to a negative number to prevent the connector from ever deleting the data, though you should make sure you understand the impact on the capacity of your Attivio Index before doing so.)
Document ID Prefix | (blank) | This string is prepended to all document IDs generated by the Search Analytics connector. You should generally leave this blank.
Ingest Workflow | ingest | This is the workflow that processes all documents ingested by the Search Analytics connector. Do not change this. It must be set to the "ingest" workflow in order for the components that filter and find terms in your users' queries to work properly.

Configuring the Term Extractor

The SearchAnalyticsTermExtractor component is used to pull out the terms contained in each query so they can be identified separately in the analysis done by the application. For example, if a Simple Query Language query is executed with the string "Solar Eclipse," two terms will be pulled out: "Solar" and "Eclipse." Similarly, if an Advanced Query Language query is executed with the query string "AND(watermelon,orange,"dragon fruit")," three terms will be pulled out: "watermelon," "orange," and "dragon fruit."
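
As an illustration of the splitting behavior described above, here is a simplified, self-contained model. This is not the SearchAnalyticsTermExtractor implementation itself, just a sketch that reproduces the two documented examples: whitespace splitting for Simple Query Language, and comma splitting with quote awareness for a single Advanced Query Language AND(...) clause.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of term splitting; the real SearchAnalyticsTermExtractor
// handles the full query languages, which this toy model does not.
public class TermSplitSketch {
  /** Split a Simple Query Language string on whitespace. */
  static List<String> simpleTerms(String query) {
    List<String> terms = new ArrayList<>();
    for (String t : query.trim().split("\\s+")) {
      if (!t.isEmpty()) terms.add(t);
    }
    return terms;
  }

  /** Pull the comma-separated arguments out of a single AND(...) clause,
   *  treating a double-quoted argument such as "dragon fruit" as one term. */
  static List<String> advancedAndTerms(String query) {
    int open = query.indexOf('('), close = query.lastIndexOf(')');
    String args = query.substring(open + 1, close);
    List<String> terms = new ArrayList<>();
    StringBuilder cur = new StringBuilder();
    boolean inQuotes = false;
    for (char c : args.toCharArray()) {
      if (c == '"') inQuotes = !inQuotes;                    // toggle quoted state
      else if (c == ',' && !inQuotes) {                       // term boundary
        terms.add(cur.toString());
        cur.setLength(0);
      } else cur.append(c);
    }
    terms.add(cur.toString());
    return terms;
  }
}
```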

In addition to extracting terms from queries, this component can prevent certain queries from being added to the index. You can use the following configuration properties to determine which queries the extractor will ignore (and which will not subsequently be ingested into the index):

Property | Type | Default Value | Description
ignoreAPIgeneratedQueries | Boolean | false | If this is set to true, any queries that were generated by the Java API instead of via the REST API will be dropped from the index and will not be included in the Search Analytics application's analysis.

If this is set to true, only queries that have the flag abc.enabled on them will be indexed. All other queries will be dropped from the Index and will not be included in the Search Analytics application's analysis.

You can ensure that queries have this flag set by doing one of the following:

  • including the following in your post, if submitting a query via the REST API endpoint /rest/searchApi/search:

    "restParams": {
        "abcEnabled": ["true"]
    }
  • passing abc.enabled=true if submitting a query via the REST API endpoint /rest/searchApi/simpleCgi,
  • configuring a businessCenterProfile if using Search UI, or
  • explicitly setting a Query Parameter if using the Java SDK,

    QueryRequest.getQuery().setParameter("abc.enabled", true);

If this is set to true, all queries and terms will be stored by Search Analytics in lowercase format. This will ensure that multiple queries that differ only in their case will be analyzed as if they were identical. For example, if this is set to true, then Search Analytics will conflate queries for "France," "france," and "FRANCE" into a single statistic; if it is false, then these three will be treated as separate queries with their own statistics.


If you enable lowercase queries, you will no longer be able to see the queries exactly as users typed them. If this is important to you, leave this property set to false (e.g., if it is important to you to know whether your users are typing "MyCompany" or "mycompany" when searching for your company's name, leave this property set to false).
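
The conflation described above can be sketched with a simple counting model (the class and method names here are illustrative, not part of the product API): with lowercasing enabled, "France," "france," and "FRANCE" roll up into one statistic; with it disabled, they are counted separately.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of how lowercasing conflates query statistics.
public class CaseConflationSketch {
  static Map<String, Integer> countQueries(String[] queries, boolean lowercase) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String q : queries) {
      String key = lowercase ? q.toLowerCase() : q;  // conflate case variants if enabled
      counts.merge(key, 1, Integer::sum);
    }
    return counts;
  }
}
```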

You can use the component editor in the Admin UI to modify these properties.

Configuring Additional Query Exclusion

You can use the properties of the SearchAnalyticsIngestFilter component, which is added to the "ingestInit" workflow by the module, to control the filtering of system queries and queries containing particular terms. By default, the component is set to exclude all system queries from Search Analytics. By default, this filter also excludes any queries which are made against the Search Analytics tables in the index: attivio.searchanalytics.querylog and attivio.searchanalytics.signal (i.e. those made by the Search Analytics application itself).

You can stop the exclusion of system queries by changing the value of the property excludeSystemProperties from "true" to "false." You can also change the terms that will cause a query to be excluded by editing the list-type property, excludeList.

You should not need to modify this component directly; use the other filtering techniques described here instead. If you do modify this configuration, you must do so directly in the file conf/searchanalytics/module.xml instead of in the Admin UI.

Configuring Custom Ingestion Filtering

If you need to perform your own custom filtering while data is being ingested, you can create an implementation of the interface com.attivio.sdk.server.searchanalytics.SearchAnalyticsFeederFilter. Your implementation can look at any relevant properties of the individual documents or the query response and decide to allow or disallow each document. By default, the Search Analytics scanner simply lets all queries through. If you implement this interface and instantiate a bean using your class, you can configure the Search Analytics scanner to use that bean, giving you the opportunity to modify documents as they're being ingested or to filter them out. For example, your class could add fields to (or delete them from) the document being indexed. (Note that if you add your own custom fields, you will need to add them to the Attivio schema, defined in conf/searchanalytics/schema.xml, for them to be persisted in your index.)

Here is the definition of the SearchAnalyticsFeederFilter interface:
/**
 * A bean that implements this interface can be used to add fields to the SearchAnalyticsScanner
 * document based on the QueryResponse object this document is based on. It can also be used to
 * filter out documents based on the QueryResponse object or document field values. If the
 * document represents signal info then the QueryResponse parameter will be null. QueryResponse
 * will also be null if the archiver was not configured to save the entire query request/response.
 * The document is dropped altogether and not fed if false is returned.
 * The default bean is an instance of DefaultSearchAnalyticsFeederFilter, defined in conf/searchanalytics/module.xml,
 * and allows all documents through, unmodified.
 */
public interface SearchAnalyticsFeederFilter {

  /**
   * Fields can be added to the document based on the information in the QueryResponse object.
   * The document will not be fed if false is returned.
   * @param response will be set for query documents and null for signal documents
   * @param doc the document to be ingested; you may modify its fields if desired
   * @return true if the document should continue being fed; false to filter it out
   */
  boolean processDocument(QueryResponse response, IngestDocument doc);
}
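
As an illustration, here is a sketch of a feeder filter that adds a hypothetical "querylength" field to each query document and drops documents for blank queries. The FeederQueryResponse and FeederIngestDocument classes are minimal stand-in stubs (not the Attivio SDK types) so the sketch compiles on its own; a real implementation would implement com.attivio.sdk.server.searchanalytics.SearchAnalyticsFeederFilter against the real QueryResponse and IngestDocument classes, and the "querylength" field is purely illustrative (remember that custom fields must be added to the schema to be persisted).

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-ins for the Attivio SDK types, so this sketch is self-contained.
class FeederQueryResponse {
  private final String queryString;
  FeederQueryResponse(String queryString) { this.queryString = queryString; }
  String getQueryString() { return queryString; }
}

class FeederIngestDocument {
  private final Map<String, Object> fields = new HashMap<>();
  void setField(String name, Object value) { fields.put(name, value); }
  Object getField(String name) { return fields.get(name); }
}

public class QueryLengthFeederFilter {
  /** Tag query documents with a hypothetical "querylength" field; drop blank
   *  queries; pass signal documents (null response) through unmodified. */
  public boolean processDocument(FeederQueryResponse response, FeederIngestDocument doc) {
    if (response == null) return true;                      // signal document: keep as-is
    String q = response.getQueryString();
    if (q == null || q.trim().isEmpty()) return false;      // drop blank queries
    doc.setField("querylength", q.length());                // hypothetical custom field
    return true;
  }
}
```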

See the Attivio SDK documentation on the Attivio Developer Community for details on implementing custom code in your installation.

Configuring Users

You can configure which users are able to see the Search Analytics data. Note that there are two places where users need to be added—if not added in both places, they won't have full access to the Search Analytics application.

Adding Principals for Content Security

The documents created by the Search Analytics Scanner are associated with ACLs to allow only certain users to access the data. By default, only members of the administrators group can do so. In addition to the administrators group, members of groups named <domain>_SearchAnalyticsGroup are given access. At any time (even after ingestion) such a group can be defined and populated with user and group principals, allowing the administrator to extend the list of Search Analytics users. These ACLs are created by the searchAnalyticsSecurity component in the ingestInit workflow; see conf/searchanalytics/features.xml. A common way to facilitate granting and revoking Search Analytics access to users is to include a group from your external directory service in this special group and then use that service to add users to and remove users from that group, outside of the Attivio Platform.

The simplest way to grant Search Analytics access to a user or group is to add them to a users.xml file and use the XML Connector to ingest them into the system as principals. See XML Principal Scanner for details on how to do this. If adding an external group, such as one defined in Active Directory, make sure to use the correct principal ID from that system (and make sure you've ingested the principals from that system) in order for the indirect granting of privileges to work. As an example, you might have the following group definition in your users.xml file:

XML Principal Scanner Definitions in users.xml
  <group id="aie_searchanalyticsgroup@aie" name="aie_searchanalyticsgroup">
     <membership principal="<group_principal_id1>@<realm>" group="aie_searchanalyticsgroup@aie" />
     <membership principal="<group_principal_id2>@<realm>" group="aie_searchanalyticsgroup@aie" />
  </group>

Adding the "SA User" Role

You will need to configure the role-based security for the users and groups who should be able to see the Search Analytics application in the UI. You can do this in the Role Assignments application accessible from the "Tools" section in the left-hand column of the Admin UI. See Managing Roles in the Attivio Business Center for details. You can also add the role to an entire group of users. For example, if your system is synchronized with an Active Directory server, you can create a group for Search Analytics users in AD and add the SA User role to that group; you can then add and remove users from the group in Active Directory without needing to modify the Attivio Platform.

The Search Analytics Zone

By default Search Analytics documents are stored in a hidden zone in the index, called "searchanalytics." This is defined in conf/searchanalytics/features.xml:

Hidden Zone Definition in features.xml
<f:addIndexZone index="index">
  <f:zone name="searchanalytics">
    <f:route field="table" value="attivio.searchanalytics.querylog"/>
    <f:route field="table" value="attivio.searchanalytics.signal"/>
    <f:property name="hidden" value="true"/>
  </f:zone>
</f:addIndexZone>

You should never need to change this, but note that if you do, you must also change the zone configured in the Search Analytics connector. See Configuring the Connector, above.

Maintaining the Archive from the CLI

You can use commands in the aie-cli command-line tool to view and manipulate the archive files written by the Search Analytics module's archiver component. CLI commands can be used to view the state of the archive and to manually clean it up. While running the CLI, type searchanalytics help archive to see a full description of the available commands. They are:

searchanalytics archive status—Displays the current status of the archive, including the location of the archive files and how many archive files of each type exist.

searchanalytics archive flush—Flushes the archive buffers into the file system. (Note that the Search Analytics scanner will automatically flush the buffers when it starts, so flushing manually is mainly useful if you want to examine the archive files before the next connector run.)

searchanalytics archive purgeAll—Deletes all archive files. They are deleted regardless of their age or whether they have already been scanned, so be careful when using this command.

searchanalytics archive purgeScanned—Deletes already scanned archive files. Use in case automatic deletion is not configured or you want to delete the files before the policy would.

searchanalytics archive purgeOld <older than in days>—Deletes old files. Use in case automatic deletion is not configured or you want to delete the files before the policy would.

What's Not Supported

The following are not supported by the Search Analytics module:

  • Archiving/analysis of information for streaming queries.
  • Configuration/use of more than one Search Analytics scanner.