
Overview

Attivio can group, or "collapse", results that share the same value for a field (for example, duplicate results with the same headline) into a single query result. This feature is called Field Collapsing. It filters out all but one of the results that share a value for a specified field, and is enabled by setting a FieldCollapse specification on the QueryRequest.


JoinRollupMode TREE

For Field Collapsing, the JoinRollupMode should be set to TREE.


Not compatible with Streaming Queries

Note that Field Collapsing cannot be used with Streaming Queries.

 


Field Collapse Specification

Field collapsing can be applied to a QueryRequest by using a FieldCollapse specification.


REST Syntax

collapse=FIELDNAME(mode=MODE, sort=SORT, facet=FACET, rows=ROWS)

Supported Parameters

Parameter | Default | Description
FIELDNAME | required | Specify the field to use for collapsing (must be joinable, i.e., not a TEXT or SHAPE field)
MODE | DEFAULT | Specify the mode for field collapsing
SORT | mode specific | Specify the sort order for documents in a group
FACET | true | See Faceting
ROWS | 1 | Number of rows per group (must be >= 1)

Example

collapse=cat(mode=DEFAULT, sort=$score:desc, facet=true, rows=2)
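
The REST syntax can also be assembled programmatically. The following is a minimal sketch in plain Java and is not part of the Attivio API: the class name CollapseParamSketch and its build method are hypothetical, URL encoding is left out, and the sketch does not treat an empty option list specially.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of the Attivio API.
class CollapseParamSketch {

    // Builds "collapse=FIELDNAME(name=value, ...)" from the given options.
    // Options are emitted in the iteration order of the map.
    static String build(String fieldName, Map<String, String> options) {
        if (fieldName == null || fieldName.isEmpty()) {
            throw new IllegalArgumentException("FIELDNAME is required");
        }
        StringJoiner params = new StringJoiner(", ", "(", ")");
        for (Map.Entry<String, String> option : options.entrySet()) {
            params.add(option.getKey() + "=" + option.getValue());
        }
        return "collapse=" + fieldName + params;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in insertion order.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("mode", "DEFAULT");
        options.put("sort", "$score:desc");
        options.put("facet", "true");
        options.put("rows", "2");
        // Prints the same string as the REST example above.
        System.out.println(build("cat", options));
    }
}
```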

Field Collapsing Modes

Two field collapsing modes are currently supported:

  • DEFAULT - provides standard field collapsing, scaled to support a large number of unique field values.
  • TWO_DIMENSIONAL - provides a two-dimensional view of the index by returning a result set for each unique field value.

Default Field Collapsing

Sorting

The sort specification given for field collapsing in default mode determines the ordering of documents in each group, and therefore which document is returned as the root document for each group. Default field collapsing supports only a single-level sort; specifying a multi-level sort specification will result in an exception.

Default Order

The default ordering for field collapsing is natural index order. Use this ordering unless business requirements call for more explicit selection of a group's root document.

// Explicitly specify default order
FieldCollapse collapse = new FieldCollapse("site");
collapse.setSort( SortSpecification.DOCUMENT_ORDER_SORT );

QueryRequest request = new QueryRequest("*:*");
request.setFieldCollapse(collapse);
request.setJoinRollupMode(JoinRollupMode.TREE); 

This ordering will perform the best and has the lowest memory requirements.
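
To see why natural index order is the cheapest option, it may help to model root selection outside the engine. The following is an illustrative sketch in plain Java, not Attivio's implementation: the Doc type and collapse method are hypothetical, and the input list is assumed to already be in index order, so the first document seen for each field value becomes the root without any sorting or scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only; these types are not part of the Attivio API.
class IndexOrderCollapseSketch {

    // A toy document: its position in the index and its collapse field value.
    static final class Doc {
        final int indexPosition;
        final String site;

        Doc(int indexPosition, String site) {
            this.indexPosition = indexPosition;
            this.site = site;
        }
    }

    // Keeps the first document seen for each site value. One pass, and only
    // one retained document per group.
    static List<Doc> collapse(List<Doc> docsInIndexOrder) {
        Map<String, Doc> roots = new LinkedHashMap<>();
        for (Doc d : docsInIndexOrder) {
            roots.putIfAbsent(d.site, d);
        }
        return new ArrayList<>(roots.values());
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc(0, "a.example"),
                new Doc(1, "b.example"),
                new Doc(2, "a.example"));
        for (Doc root : collapse(docs)) {
            System.out.println(root.site + " -> root at index position " + root.indexPosition);
        }
    }
}
```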

Relevancy Order

This root document selection method will result in the document with the highest relevancy score for a group being selected as the root document. This method requires more memory and CPU than selecting by natural ordering.

// Rest syntax: collapse=site(sort=$score:desc)
FieldCollapse collapse = new FieldCollapse("site");
collapse.setSort( SortSpecification.RELEVANCY_SORT );

QueryRequest request = new QueryRequest("*:*");
request.setFieldCollapse(collapse);
request.setJoinRollupMode(JoinRollupMode.TREE); 

Arbitrary Sort Order

Any valid single-level sort specification can be used to order groups as desired. This method requires more memory and CPU than selecting by natural ordering.

// Rest syntax: collapse=site(sort=date:desc)
FieldCollapse collapse = new FieldCollapse("site");
collapse.setSort("date", Sort.SortOrder.DESCENDING);

QueryRequest request = new QueryRequest("*:*");
request.setFieldCollapse(collapse);
request.setJoinRollupMode(JoinRollupMode.TREE); 
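
Relevancy order and arbitrary sort order both amount to keeping, for each group, the document that sorts first under a single-level ordering. The sketch below models that idea in plain Java; it is not Attivio's implementation, and the Doc type, its fields, and selectRoots are hypothetical names. Dates are stored as yyyymmdd integers purely for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only; these types are not part of the Attivio API.
class SortedRootSketch {

    static final class Doc {
        final String id;
        final String site;
        final double score;
        final int date; // yyyymmdd, an assumption made for this sketch

        Doc(String id, String site, double score, int date) {
            this.id = id;
            this.site = site;
            this.score = score;
            this.date = date;
        }
    }

    // Counterpart of sort=$score:desc
    static final Comparator<Doc> BY_SCORE_DESC =
            Comparator.comparingDouble((Doc d) -> d.score).reversed();

    // Counterpart of sort=date:desc
    static final Comparator<Doc> BY_DATE_DESC =
            Comparator.comparingInt((Doc d) -> d.date).reversed();

    // For each site, keeps the document that sorts first under the given order.
    // On a tie, the document seen earlier is kept.
    static Map<String, Doc> selectRoots(List<Doc> docs, Comparator<Doc> order) {
        Map<String, Doc> roots = new LinkedHashMap<>();
        for (Doc d : docs) {
            roots.merge(d.site, d, (best, candidate) ->
                    order.compare(candidate, best) < 0 ? candidate : best);
        }
        return roots;
    }

    static List<Doc> sampleDocs() {
        return Arrays.asList(
                new Doc("d1", "a.example", 0.4, 20200101),
                new Doc("d2", "a.example", 0.9, 20190101),
                new Doc("d3", "b.example", 0.5, 20180101));
    }

    public static void main(String[] args) {
        System.out.println(selectRoots(sampleDocs(), BY_SCORE_DESC).get("a.example").id);
        System.out.println(selectRoots(sampleDocs(), BY_DATE_DESC).get("a.example").id);
    }
}
```

Unlike selection by natural index order, every document in a group has to be compared against the current best candidate, which is consistent with the higher CPU and memory cost noted for these orderings.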

Faceting

Inclusion of collapsed documents in facet calculations can be disabled for default field collapsing if desired. When disabled, facet counts will only reflect the first document for each group.

// Rest syntax: collapse=site(facet=false)
FieldCollapse collapse = new FieldCollapse("site");
collapse.setFacet(false);

QueryRequest request = new QueryRequest("*:*");
request.setFieldCollapse(collapse);
request.setJoinRollupMode(JoinRollupMode.TREE); 
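
The effect of the facet setting on counts can be illustrated with a small model. This is a sketch in plain Java under stated assumptions, not Attivio's facet implementation: the Doc type and facetCounts method are hypothetical, documents are assumed to arrive in group order so that the first document seen for a site is that group's first document, and the facet is computed on a category field.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative model only; these types are not part of the Attivio API.
class CollapseFacetSketch {

    static final class Doc {
        final String site;     // collapse field
        final String category; // facet field

        Doc(String site, String category) {
            this.site = site;
            this.category = category;
        }
    }

    // includeCollapsed = true  models facet=true: every matching document counts.
    // includeCollapsed = false models facet=false: only the first document of
    // each group counts.
    static Map<String, Integer> facetCounts(List<Doc> docs, boolean includeCollapsed) {
        Map<String, Integer> counts = new TreeMap<>();
        Set<String> seenSites = new HashSet<>();
        for (Doc d : docs) {
            boolean isFirstInGroup = seenSites.add(d.site);
            if (includeCollapsed || isFirstInGroup) {
                counts.merge(d.category, 1, Integer::sum);
            }
        }
        return counts;
    }

    static List<Doc> sampleDocs() {
        return Arrays.asList(
                new Doc("a.example", "news"),
                new Doc("a.example", "news"),
                new Doc("b.example", "blog"));
    }

    public static void main(String[] args) {
        System.out.println("facet=true:  " + facetCounts(sampleDocs(), true));
        System.out.println("facet=false: " + facetCounts(sampleDocs(), false));
    }
}
```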

Result Format

When more than one row is requested for each group, additional rows will be returned as children of the root document for each group.

Java API

Child documents can be retrieved in the Java API via SearchDocument.getChildDocuments().


SearchDocument root = results.getDocument(0); // get root document
SearchDocument[] children = root.getChildren();
// Process child documents.

REST API

Child documents will be attached to a <SearchDocument> element in the <children> sub element.

NULL values

Each document that does not populate the field collapsing field will be returned in its own "group" for default field collapsing.

Multi-partition limitations

Default field collapsing has limitations when used with an AIE index which has multiple partitions.


  • Result count is approximate (and may be larger than final result when paging through full result set).
  • Facet counts are approximate.
  • Child documents will only be provided from one partition.

See the following sections for more information.

Total result count may not be accurate

The number of matching rows reported for a collapsed result set may not be fully accurate. The total row count accounts only for documents collapsed within each partition and for documents in the first N results returned from each partition (where N is the number of requested hits), because only the top N results from each partition are collapsed together.

Total row count can be guaranteed for a number of search result hits up to a configurable point. By default, this point is the number of rows requested in the response. For example, when requesting hits=10 and offset=0 (i.e., results 1-10), the total number of hits will be accurate; however, when the user moves to the next page of results (i.e., results 11-20), the reported total may differ from the total reported for the first page.

The point up to which the total result count is accurate can be raised by setting the QueryRequest's Search Depth parameter. If a search has a total result count less than or equal to the search depth, the total result count will be accurate. If there are more results than the search depth, the total result count will be accurate for paged results requested where hits + offset * hits is less than or equal to the search depth. For example, setting the search depth to 30 would be useful for maintaining consistent total result counts for the first 3 pages of 10 results.
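
The paging condition can be checked with simple arithmetic. The sketch below is plain Java with hypothetical names and reads "offset" in the formula as a zero-based page index; that reading is an assumption, chosen because it is the one under which a search depth of 30 covers exactly the first 3 pages of 10 results.

```java
// Illustrative arithmetic only; not part of the Attivio API.
class SearchDepthSketch {

    // Returns true when hits + pageIndex * hits <= searchDepth.
    // Assumption: pageIndex is zero-based (0 for results 1-10, 1 for 11-20, ...).
    static boolean countIsConsistent(int hits, int pageIndex, int searchDepth) {
        return hits + pageIndex * hits <= searchDepth;
    }

    public static void main(String[] args) {
        int hits = 10;
        int searchDepth = 30;
        for (int page = 0; page < 4; page++) {
            System.out.println("page " + (page + 1) + ": "
                    + countIsConsistent(hits, page, searchDepth));
        }
    }
}
```

With hits=10 and a search depth of 30, the condition holds for pages 1 through 3 (10, 20, and 30 are all at most 30) and fails for page 4 (40 exceeds 30).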


Facet counts may not be accurate

When merging facet results from each index partition, the facets from each node are calculated before collapsing is performed for the results returned from each index partition. Since collapsing reduces the number of documents in the result set, facet counts for buckets may be higher than the actual number of results returned when a particular facet bucket filter is applied to the original query.

Child documents

If the same "group" of documents is found on multiple nodes, then child documents will only be returned from one of the nodes. The rest of the child documents will be discarded.

For example, if the documents in the group "1" are indexed in multiple partitions, the documents for all but one partition will be discarded in the final result set.

From first partition:

"1" -> Document 7 (group leader on first partition)
 |-> Document 2
 |-> Document 5

From second partition:

"1" -> Document 9 (group leader on second partition)
 |-> Document 6
 |-> Document 10

If "Document 7" is selected as the primary document for the group during query dispatching, then "Document 9", "Document 6" and "Document 10" will be discarded and not included in the group.
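
The example can be modeled as a merge in which only one partition's copy of a group survives. This is a sketch in plain Java, not Attivio's dispatcher logic: the Group type and merge method are hypothetical, and the rule that the first partition in the list wins is an assumption standing in for whatever selection happens during query dispatching.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only; these types are not part of the Attivio API.
class PartitionMergeSketch {

    static final class Group {
        final String key;            // collapse field value
        final String leader;         // group leader on its partition
        final List<String> children; // other rows of the group on that partition

        Group(String key, String leader, List<String> children) {
            this.key = key;
            this.leader = leader;
            this.children = children;
        }
    }

    // Keeps one partition's copy of each group and drops the others.
    // Assumption: the first partition that supplies a group key wins.
    static Map<String, Group> merge(List<List<Group>> partitions) {
        Map<String, Group> merged = new LinkedHashMap<>();
        for (List<Group> partition : partitions) {
            for (Group g : partition) {
                merged.putIfAbsent(g.key, g);
            }
        }
        return merged;
    }

    // The two-partition example from the text.
    static Map<String, Group> mergeDocumentExample() {
        List<Group> first = Arrays.asList(
                new Group("1", "Document 7", Arrays.asList("Document 2", "Document 5")));
        List<Group> second = Arrays.asList(
                new Group("1", "Document 9", Arrays.asList("Document 6", "Document 10")));
        return merge(Arrays.asList(first, second));
    }

    public static void main(String[] args) {
        Group g = mergeDocumentExample().get("1");
        System.out.println(g.leader + " with children " + g.children);
    }
}
```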

2-D Search Field Collapsing

2-D Search field collapsing will segment the result set using the specified field. This will result in returning the top N documents for each unique field value. This method of field collapsing can be enabled by setting the mode for field collapsing to FieldCollapse.Mode.TWO_DIMENSIONAL.

Rows Per Group

The number of rows returned per group can be specified with FieldCollapse.setRows(). By default, this value is 1, which will result in a single document per group.


// Rest syntax: collapse=cat(mode=2D, rows=10)
FieldCollapse collapse = new FieldCollapse("cat");
collapse.setMode(FieldCollapse.Mode.TWO_DIMENSIONAL);
collapse.setRows(10);

QueryRequest request = new QueryRequest("*:*");
request.setFieldCollapse(collapse);
request.setJoinRollupMode(JoinRollupMode.TREE); 

Sorting

Any arbitrary sort can be used to order rows in each group. By default, a group's rows will be ordered by score descending.

Single Level Sort

// Order rows in each group by title ascending
// Rest Syntax: collapse=cat(mode=2D, sort=title:asc)
FieldCollapse collapse = new FieldCollapse("cat");
collapse.setMode(FieldCollapse.Mode.TWO_DIMENSIONAL);
collapse.setSort( "title", Sort.SortOrder.ASCENDING );

QueryRequest request = new QueryRequest("*:*");
request.setFieldCollapse(collapse);
request.setJoinRollupMode(JoinRollupMode.TREE); 

Multi-Level Sort

// Order rows in each group by title ascending, then by author descending
// Rest Syntax: collapse=cat(mode=2D, sort=title:asc, sort=author:desc)
FieldCollapse collapse = new FieldCollapse("cat");
collapse.setMode(FieldCollapse.Mode.TWO_DIMENSIONAL);
SortSpecification sort = new SortSpecification();
sort.add( new Sort("title", Sort.SortOrder.ASCENDING) );
sort.add( new Sort("author", Sort.SortOrder.DESCENDING) );
collapse.setSort(sort);

QueryRequest request = new QueryRequest("*:*");
request.setFieldCollapse(collapse);
request.setJoinRollupMode(JoinRollupMode.TREE); 

Result Format

When using 2-Dimensional search, one document will be returned for each group. This document will contain no fields. It will contain the top N rows for the group as child documents. The total number of rows in the group will be retrievable via SearchDocument.getTotalChildren().
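
The shape of a 2-Dimensional result can be sketched with ordinary collections. The code below is an illustrative model in plain Java, not the Attivio API: the Doc and Group types and the collapse method are hypothetical, rows are ordered by score descending to mirror the stated default, and totalChildren plays the role of the value that SearchDocument.getTotalChildren() is described as returning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only; these types are not part of the Attivio API.
class TwoDimensionalSketch {

    static final class Doc {
        final String id;
        final String cat;
        final double score;

        Doc(String id, String cat, double score) {
            this.id = id;
            this.cat = cat;
            this.score = score;
        }
    }

    // One entry per unique field value: a field-less placeholder that carries
    // the top rows as children plus the size of the whole group.
    static final class Group {
        final String value;
        final List<Doc> topRows;
        final int totalChildren;

        Group(String value, List<Doc> topRows, int totalChildren) {
            this.value = value;
            this.topRows = topRows;
            this.totalChildren = totalChildren;
        }
    }

    static List<Group> collapse(List<Doc> docs, int rows) {
        // Bucket documents by field value, keeping first-seen order of values.
        Map<String, List<Doc>> byValue = new LinkedHashMap<>();
        for (Doc d : docs) {
            byValue.computeIfAbsent(d.cat, k -> new ArrayList<>()).add(d);
        }
        List<Group> groups = new ArrayList<>();
        for (Map.Entry<String, List<Doc>> e : byValue.entrySet()) {
            List<Doc> members = e.getValue();
            members.sort(Comparator.comparingDouble((Doc d) -> d.score).reversed());
            int keep = Math.min(rows, members.size());
            groups.add(new Group(e.getKey(),
                    new ArrayList<>(members.subList(0, keep)), members.size()));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("d1", "books", 0.2), new Doc("d2", "books", 0.9),
                new Doc("d3", "books", 0.5), new Doc("d4", "music", 0.1));
        for (Group g : collapse(docs, 2)) {
            System.out.println(g.value + ": " + g.topRows.size()
                    + " rows returned of " + g.totalChildren);
        }
    }
}
```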


NULL Values

All documents that do not populate the field collapsing field will be grouped together for two-dimensional search.
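
The two modes treat missing values differently: default field collapsing puts each such document in its own group, while two-dimensional search groups them together. The difference in group counts can be sketched as follows. This is a plain Java illustration with hypothetical names, not Attivio code, and a null list element stands for a document that does not populate the field.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative model only; not part of the Attivio API.
class NullGroupSketch {

    // Default mode: every document without a value forms its own group.
    static int defaultModeGroups(List<String> fieldValues) {
        Set<String> distinct = new HashSet<>();
        int missing = 0;
        for (String v : fieldValues) {
            if (v == null) {
                missing++;
            } else {
                distinct.add(v);
            }
        }
        return distinct.size() + missing;
    }

    // Two-dimensional mode: all documents without a value share one group.
    static int twoDimensionalGroups(List<String> fieldValues) {
        Set<String> distinct = new HashSet<>();
        boolean anyMissing = false;
        for (String v : fieldValues) {
            if (v == null) {
                anyMissing = true;
            } else {
                distinct.add(v);
            }
        }
        return distinct.size() + (anyMissing ? 1 : 0);
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a", null, "b", null, "a");
        System.out.println("default groups: " + defaultModeGroups(values));
        System.out.println("2-D groups:     " + twoDimensionalGroups(values));
    }
}
```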

Limitations

The field being collapsed on for 2-Dimensional search must contain at most 1024 unique values. In general, this feature should only be used when the number of possible groups is small. If the number of unique values for a field is large, default field collapsing should be used instead.