The contents of this document are proprietary and confidential and covered under non-disclosure.
Overview
The Attivio Intelligence Engine (AIE) supports Relational Querying, a unique capability that combines full-text searching with structured, relational querying as is typically performed with SQL. (Attivio has applied for a patent for this capability.)
To prepare data for relational searching, all tables in a relational model are indexed into a single index with a unified schema representing all fields in all tables. The schema includes fields for each table, one or more common fields, and one or more keys. A query-time JOIN operator is then used to identify the relationship at query time. This article describes how to ingest relational data and perform JOIN operations in combination with full-text searches.
JOIN key limited to 4096 characters
JOIN keys cannot be larger than 4096 characters. For this reason, the use of text fields as JOIN keys is discouraged.
View incoming links.
Schema Configuration
To use the JOIN operator the content must be ingested correctly into a schema that provides for a common key field shared by all tables. In the factbook example this is the "country" field. It is profiled in <project-dir>\conf\schema\default.xml.
<field name="country" type="STRING" indexed="true" stored="true" sort="false"/>
Ingestion
This example assumes that tables are loaded from CSV (comma-delimited) files (i.e. one table per file). Tables can also be ingested from databases directly by using the DBConnector.
The following component definitions should be placed in custom configuration files.
An important part of configuring the ingestion of data into a joinable schema is to populate the Document Id fields for ingested table rows correctly. The Attivio CSVConnector makes it easy to manage key mappings at ingestion time. For example, in the Factbook example, the Medal connector is configured in <install-dir>\conf\factbook\factbook.xml. Note the idFieldFormat property and the idFields list.
<connector name="medalConnector"> <scanner class="com.attivio.connector.CsvScanner"> <properties> <property name="startDirectory" value="${factbook.content.dir}/medals.csv" /> <property name="idFieldFormat" value="MEDAL-%s" /> <property name="firstRowAreFieldNames" value="true" /> <list name="idFields"> <entry value="country" /> </list> </properties> </scanner> <feeder> <properties> <property name="ingestWorkflowName" value="medalIngest" /> <map name="staticFields"> <property name="table" value="medal" /> </map> </properties> </feeder> </connector>
The idFieldFormat ensures that there is a unique ID for each document/record ingested into AIE.
Querying
See Join Query Syntax for syntax of JOIN queries. See the Quick Start Tutorial for examples of JOIN queries run from Search UI.
Using HTTP/REST
The following query parameters are used for performing JOIN searches:
Parameter | Description | Notes |
---|---|---|
q | The primary query | The JOIN query |
join.rollup | Specifies how child documents are returned. | Modes: CHILDREN - (Deprecated.) Use TREE. |
For example:
JOIN(table:country, INNER(table:city, on=country), INNER(table:medal, on=country), INNER(table:news,on=country))
To duplicate this query using HTTP/REST, use the following query string:
Java API
It is possible to issue a JOIN query via the Java Client API. See the Search Application Example for details.
Multi-Field Join
It is possible to use two separate fields for performing the equi-join.
This can be done using the following query:
JOIN(field1:value1, INNER(field2:value2, on="key=key"))
The above query is equivalent to the following SQL:
SELECT * FROM table1 INNER JOIN table2 ON table1.key = table2.key WHERE table1.field1 = value1 AND table2.field1 = value2;
In the query above table1.key and table2.key must be the same schema field type.
Performance Considerations
It takes longer to prepare the join cache for joining between two fields like this. Once the cache is prepared, the execution speed of the query itself should be identical to a standard single field join. It is recommended that join fields be merged at ingest time and a single field join is used at query time.
Multi-Value Fields
Join now fully supports multi-value fields.
For efficiency, it is recommended that you explicitly configure the "multivalue" setting for joinable fields. If the field's multivalue setting is not configured, this will be automatically detected during cache loading.
Example
<schema name="default"> ... <field name="multivalue_joinid" type="string" multivalue="true" indexed="true" stored="true"/> <field name="singlevalue_joinid" type="string" multivalue="false" indexed="true" stored="true"/> ...
Joining on a multi-value field works similar to using join on single-value fields using a link table.
For example, given the two tables:
table | doc | joinid |
---|---|---|
a | a.1 | 1, 2, 3 |
a | a.2 | 1 |
a | a.3 | 4, 5 |
table | doc | joinid |
---|---|---|
b | b.1 | 2, 3 |
b | b.2 | 1, 4 |
b | b.3 | 2, 5 |
The query below will produce the following result:
JOIN(table:a, INNER(table:b, on="joinid"))
- Document 1: a.1
- Child 1: b.1 (from joinid=2)
- Child 2: b.1 (from joinid=3)
- Child 3: b.2 (from joinid=1)
- Child 4: b.3 (from joinid=2)
- Document 2: a.2
- Child 1: b.2 (from joinid=1)
- Document 3: a.3
- Child 1: b.2 (from joinid=4)
- Child 2: b.3 (from joinid=5)
b.1 appears as a child for a.1 twice because the documents share 2 values for joinid
Relevancy Considerations
In a JOIN() query a record/document's score equals the score for the parent document plus the score for all child documents.
Attivio is very interested in learning how you use or would like to use the JOIN operator. Feel free to let us know what you're doing...