
Overview

An ontology is a structured framework of information that describes a domain. It describes the individuals (instances), classes (concepts), attributes, and relations of a field such as banking, securities trading, genetics, publishing, or any other realm composed of many related object types and described by many terms.

In AIE, ontologies enrich AIE's ability to generate metadata, follow relationships, and otherwise provide structure to initially unstructured textual information. Effective ontologies improve AIE's handling of hierarchical relationships, synonym expansion, document tagging, and entity extraction, and so help AIE execute more precise and versatile queries.



What is in an Ontology?

Ontologies are defined using terms, relationships, properties, and values.

Terms

An ontology is constructed from multiple ontology terms. Each term represents an object having descriptive properties and relationships with other objects.

Each term has a unique ID and a name. The name is not necessarily unique, although normally it is.

A term is classified as either a preferred term (PT) or a non-preferred term (NPT):

  • Preferred Term (PT) – This is the primary or official name of the term, as used in AIE. Each preferred term must have a class type associated with it. For example, Country, Region, Company, etc.
  • Non-preferred Term (NPT) – Non-preferred terms are typically synonyms of a preferred term.

Term Properties and Relationships (Triples)

In the world of ontology, modeling is based on a three-part statement known as a triple. A triple is a statement similar to a sentence.  It represents a fact. 

A triple contains information in three positions: the subject, predicate and object.

Triples can occur in three forms:

  • Term - Relationship - Term:  The triple establishes a relationship between two terms (two objects).  For example, Delaware is a part of the United States.  In this case the predicate is referred to as a relationship.  Relationships are defined by name and by type.
  • Term - Property - Value:  The triple describes a property of the term, and includes a value. For example, Delaware's population is 907,135.  In this case the predicate is referred to as a property.
  • Term - Attribute - <empty>:  The triple describes an attribute of the term.  This is essentially a property without a value.
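
The three triple forms can be illustrated with a small Python sketch. This is not AIE code: the Triple type, the form_of helper, and the "coastal" attribute are hypothetical names chosen for illustration, and the rule "the object is a known term" is an assumed way of telling a relationship from a property.

```python
from typing import NamedTuple, Optional, Set


class Triple(NamedTuple):
    """One subject-predicate-object statement (hypothetical type, for illustration)."""
    subject: str
    predicate: str
    obj: Optional[str] = None  # None models the attribute form, which has no value


def form_of(triple: Triple, term_names: Set[str]) -> str:
    """Classify a triple into one of the three forms described above."""
    if triple.obj is None:
        return "attribute"
    if triple.obj in term_names:
        return "relationship"  # the object is another term
    return "property"          # the object is a plain value


terms = {"Delaware", "United States"}
facts = [
    Triple("Delaware", "is part of", "United States"),
    Triple("Delaware", "population", "907,135"),
    Triple("Delaware", "coastal"),  # hypothetical attribute
]
forms = [form_of(t, terms) for t in facts]
print(forms)
```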

Relationship Types

There are three types of relationships: hierarchical, equivalence, and associative. Hierarchical and equivalence relationships each have two predefined named relationships:

  • hierarchical – Special type of relationship used to define the hierarchy structure within the ontology.
    • Broader Term – Relates a term to another term that can represent a parent.
    • Narrower Term – Relates a term to another term that can represent a child.
  • equivalence – In AIE, equivalence relationships are used to define synonym relationships. See also Synonym Expansion.
    • Use For – Relates a preferred term to one of its synonym non-preferred terms.
    • Use – Relates a non-preferred (synonym) term to its preferred term.
  • associative – Represents a relationship that does not have hierarchical or equivalence semantics. For example, the relationship associating a person entity with their employer entity. Relationships that are not hierarchical or equivalence types are considered associative.

The direction of the Use For and Use equivalence relationships is important for synonym expansion. Only Use relationships starting from a non-preferred term are considered when the synonym dictionary is generated.
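
The direction rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the AIE implementation: the tuple layout, the synonyms_from_use function, and the sample terms are all hypothetical.

```python
from collections import defaultdict

# (source term, relationship name, target term); hypothetical sample data
relations = [
    ("USA", "Use", "United States"),       # NPT -> PT: considered
    ("United States", "Use For", "USA"),   # PT -> NPT: not considered
    ("America", "Use", "United States"),   # NPT -> PT: considered
    ("United States", "Use For", "U.S."),  # no matching Use, so not picked up
]
preferred = {"United States"}


def synonyms_from_use(relations, preferred):
    """Group synonyms by preferred term, using only Use relations that start at a non-preferred term."""
    groups = defaultdict(list)
    for source, name, target in relations:
        if name == "Use" and source not in preferred:
            groups[target].append(source)
    return dict(groups)


synonym_groups = synonyms_from_use(relations, preferred)
print(synonym_groups)
```

In this sketch "U.S." is left out because it is reachable only through a Use For relationship, which mirrors the rule that only Use relationships are read.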

A typical commercial ontology can contain data sourced from a number of locations (see below):

Building and Using Ontologies with AIE

The Ontology Module combines features that exploit ontologies for more powerful information retrieval with features for building, maintaining, and exporting ontologies.

Benefits of Using an Ontology

An ontology augments AIE's features in a number of ways:

  • Entity Extraction – Ontologies can drive the entity-extraction process.
  • Synonym Expansion – Ontologies can drive query synonym expansion based on object relationships in the ontology.
  • Metadata Tagging – You can augment concepts or entities identified in a document with information from linked ontology data.
  • Term Lookups – High-performance prefix lookup of ontology terms can be used in UI generation and auto-complete processing.
  • Topic Tagging – Automatic document tagging can suggest tags or estimate a document's aboutness in publishing systems or during ingestion.
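
To give a feel for what a prefix lookup of term names involves, here is a small Python sketch based on a sorted list and binary search. It is a stand-in for illustration only and does not describe how AIE's Prefix Matcher is built; the PrefixIndex class and its lookup method are hypothetical.

```python
import bisect


class PrefixIndex:
    """Sorted-list prefix lookup (hypothetical, for illustration)."""

    def __init__(self, names):
        # Keep (lowercased key, original name) pairs sorted by key.
        self._entries = sorted((n.lower(), n) for n in names)
        self._keys = [key for key, _ in self._entries]

    def lookup(self, prefix, limit=10):
        """Return up to `limit` names that start with `prefix`, ignoring case."""
        p = prefix.lower()
        start = bisect.bisect_left(self._keys, p)  # first key >= prefix
        matches = []
        for key, name in self._entries[start:]:
            if not key.startswith(p) or len(matches) >= limit:
                break
            matches.append(name)
        return matches


index = PrefixIndex(["Chile", "China", "Chad", "Canada", "Albania"])
matches = index.lookup("chi")
print(matches)
```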

Tools for Creating an Ontology

AIE's ontology module lets you load pre-existing ontologies, and build, capture, and distribute new ontologies. 

Required Modules

These features require the inclusion of the Ontology Module when you run createproject to create the project directories.

If you cannot access the download page, contact sales@attivio.com or your Attivio Account Representative for access.

Example Data

The Ontology Module loads an example geographic ontology as an aid to exploring and learning the module features.

To override the example ontology data and substitute your own, edit <project-dir>\conf\bean\ontologyLoader.xml and modify the ontologyLoader bean.

The AIE Ontology Module contains several components to help you create and maintain complex ontologies.

  • Defining an Ontology – Ability to load/access arbitrary numbers of terms/properties/etc., and access using published APIs.
  • Input Formats – Ability to import ontology data from zThes, SmartLogic, MeSH, and YAML formats.
  • Ontology Validation – User-extendable validation rulesets for validating loaded ontologies.
  • Ontology UI – Ontology tools are provided to test various ontology retrieval, look-up, topic tagging and validation APIs.
  • Extending the Ontology Module – Various components in the ontology module are user-extendable.

Installing the Ontology Module

  1. Download the ontology module from the Ontology Download page.
  2. Unzip the module archive on top of your AIE installation.
  3. Add the ontology module to a new project using the createproject command with the -m ontology option:

     createproject -n ont1 -o c:\attivio-projects -m ontology

     Alternatively, add the ontology module to an existing project using the --incremental -m ontology option:

     createproject -n proj1 -o c:\attivio-projects --incremental -m ontology

See Create a New Project for more details of running the createproject command.

Ontology Data Loading

Ontology data can load into memory either from a file or from an AIE index.

  • file-based: In this case, the ontology resides in files that load during AIE start-up. This is the more common practice. Updating the ontology requires re-loading all ontology data, although it is not necessary to restart AIE. Use the JMX interface for the OntologyManagementService to reload the ontology.
  • index-based: In this case, the ontology resides in the AIE index, and is retrieved by an AIE query during start-up. Index-based loading allows for incremental updates, which is preferable for very large ontologies (over 1 million terms) where reloading all of the data is undesirable.

In either case, ontology data is read at AIE start-up and is stored in memory data structures that can be called on by ontology-aware components in the system.

In addition, the index-based ontology has these characteristics:

  • In a cluster of ontology nodes, the data resides in a single place - the index.
  • You can make incremental updates to the ontology without restarting the ontology nodes.
  • You can roll ontology versions backwards and forwards.

In the case of index-based ontology, the data is fed into the index beforehand as described in Ontology Loading and Versioning.

Ontology Input Formats

The supported ontology input formats are listed below. For details and examples of each format, see Ontology Input Formats.

  • zThes – zThes taxonomy format. See also http://zthes.z3950.org/.
  • SmartLogic(TM) – SmartLogic OntologyManager report export format.
  • CSV – CSV format.
  • YAML – Attivio internal format.
  • OWL/RDFS – OWL/RDFS format ontology data. See here for loader details.

Loading an Ontology

The default installation loads a sample ontology containing ISO country and UN region information.

To load an ontology, specify an ontology definition in the <project-dir>\conf\features\ontology\OntologyModel.ontology.xml file in the project. The ontology definition references a loader that is specific to the type of ontology.

<project-dir>\conf\features\ontology\OntologyModel.ontology.xml
<ont:ontology name="ontology" locale="en" tokenizer="default" nodeset="*">
   <ont:loader ref="ontologyLoader" />
</ont:ontology>

Ontology bean parameters are defined below:

  • name – Name of the ontology.
  • tokenizer – This tokenizer is used to generate the Prefix Matcher data structure. Only the default tokenizer is supported at this time.
  • locale – As the Ontology module does not currently tokenize term names in a language-specific way, leave this parameter set to English (en).
  • nodeset – Specifies the nodeset that this ontology is active on.
  • loader – Specifies a bean reference to an instance of an ontology loader. Different loader types are described in more detail below.

Loading an Ontology from Index

To load an ontology from the index, define an ontology loader bean. Syntax for synonyms and relationships is visible in the following examples in this section.

<bean id="ontologyLoader" class="com.attivio.ontology.loader.IndexOntoLoader">
  <property name="noteSynonyms">
  ...
  </property>
  <property name="classRelationships">
  ...
  </property>
</bean>

Standard parameters for ontology loader beans are defined below.

  • id – Loader bean name.
  • noteSynonyms – Optional. List of names of term property/notes that can be considered synonyms of a term, for example, the 3-digit ISO code of a country.
  • classRelationships – Optional. Map of relationship names that are considered parent relationships per class type. See here for more details.

Loading an Ontology from a File

To load an ontology from file, define an ontology loader bean based on a specific loader class. Each ontology format has its own loader. Examples are in the following sections.

Standard parameters for ontology loader beans are defined below.

  • id – Loader bean name.
  • class – Name of loader class.
  • ontologyFile – Location of the ontology file.
  • modelFile – ZthesStreamingLoader only. Location of the ontology model file.
  • noteSynonyms – Optional. List of names of term property/notes that can be considered synonyms of a term, for example, the 3-digit ISO code of a country.
  • classRelationships – Optional. Map of relationship names that are considered parent relationships per class type. See here for more details.

Loading a Zthes Ontology from a File

To load a Zthes format ontology, use the ZthesStreamingLoader.

<bean id="ontologyLoader" class="com.attivio.ontology.loader.ZthesStreamingLoader">
  <property name="ontologyFile" value="conf/ontology/ontologies/ontology.xml"/>
  <property name="modelFile" value="conf/ontology/ontologies/ontology-model.xml"/>
  <property name="noteSynonyms">
    <list>
      <value>ISO_3</value>
    </list>
  </property>
  <property name="classRelationships">
    <map>
      <entry key="*">
        <list>
          <value>Broader Term</value>
        </list>
      </entry>
      <entry key="Country">
        <list>
          <value>-Broader Term</value>
          <value>Part of Region</value>
        </list>
      </entry>
    </map>
  </property>
</bean>

Loading a SmartLogic Ontology from File

To load a SmartLogic XML report format ontology, use the SmartLogicStreamingLoader. This loader does not require a model file.

<bean id="ontologyLoader" class="com.attivio.ontology.loader.SmartLogicStreamingLoader">
  <property name="ontologyFile" value="conf/ontology/ontologies/ontology.xml"/>
  <property name="noteSynonyms">
    <list>
      <value>ISO_3</value>
    </list>
  </property>
  <property name="classRelationships">
    <map>
      <entry key="*">
        <list>
          <value>Broader Term</value>
        </list>
      </entry>
      <entry key="Country">
        <list>
          <value>-Broader Term</value>
          <value>Part of Region</value>
        </list>
      </entry>
    </map>
  </property>
</bean>

Loading an Ontology from CSV Files

To load an ontology from CSV file(s), use the CSVLoader. This loader takes a list of DelimitedLoaderConfig bean references for its configuration.

CSVLoader Example 1

In the example below, the following CSV file is read:

ID,Name,ISO_2,ISO_3
248,Aaland Islands,AX,ALA
4,Afghanistan,AF,AFG
8,Albania,AL,ALB
12,Algeria,DZ,DZA
16,American Samoa,AS,ASM
20,Andorra,AD,AND
...

The configuration below tells us:

  • Data is stored in a file called country-2.csv.
  • The first row contains column headers.
  • Term IDs are in column 1, names are in column 2.
  • The class type is hard-coded to Country for all terms defined in the file.
  • No attribute or relation columns are defined.
  • Remaining columns ISO_2 and ISO_3 are used to define properties.

<bean id="ontologyLoader" class="com.attivio.ontology.loader.CSVLoader">
  <property name="configs">
    <list>
      <ref bean="loaderConfig1"/>
    </list>
  </property>
</bean>

<bean id="loaderConfig1" class="com.attivio.ontology.loader.config.DelimitedLoaderConfig">
  <property name="inputFile" value="test/data/delimited/country-2.csv"/>
  <property name="firstRowHeaders" value="true"/>
  <property name="idColumn" value="id"/>
  <property name="nameColumn" value="name"/>
  <property name="className" value="Country"/>
</bean>

CSVLoader Example 2

In the example below, the following CSV file is read:

ID,Name,Type,Synonyms,PT,level,Code,parent,use,hasThing,attribute
0,Term 0,Thing 1,,TRUE,,,,,1,ATTR_1
1,Term 1,Thing 1,Term 1_1|Term 1_2|Term 1_3,Yes,1,AB,,,2,ATTR_2
2,Term 2,Thing 1,,No,2,ABC,,1,3,ATTR_2
3,Term 3,Thing 1,,Yes,2,DE,1,,,ATTR_2
4,Term 4,Thing 1,,No,3,DEF,2,,,ATTR_1
5,Term 5,Thing 2,Term 5_1|Term 5_2,,1,FDR,,,,ATTR_2
...

The configuration below tells us:

  • Data is stored in a file called ontology-2.csv.
  • The first row contains column headers.
  • Term IDs are in column 1, names are in column 2 and class types are in column 3.
  • Column 4 contains a multi-value field synonyms, split on the "|" character.
  • Column 5 contains preferred term information, with Terms 0, 1, 3 and 5 being set to true.
  • Attributes are defined in the column with the attribute header.
  • Broader Term relations are included in the parent column.
  • Use relations are defined in the use column.
  • The column named hasThing is used to define a custom relation.
  • Remaining columns level and Code are used to define properties.

<bean id="ontologyLoader" class="com.attivio.ontology.loader.CSVLoader">
  <property name="configs">
    <list>
      <ref bean="loaderConfig1"/>
    </list>
  </property>
</bean>

<bean id="loaderConfig1" class="com.attivio.ontology.loader.config.DelimitedLoaderConfig">
  <property name="inputFile" value="test/data/delimited/ontology-2.csv"/>
  <property name="firstRowHeaders" value="true"/>
  <property name="idColumn" value="id"/>
  <property name="nameColumn" value="name"/>
  <property name="preferredColumn" value="pt"/>
  <property name="classColumn" value="type"/>
  <property name="multiValueDelimiter" value="|"/>
  <property name="allowMultiValueColumns" value="true"/>
  <property name="attributeColumns">
    <list>
      <value>attribute</value>
    </list>
  </property>
  <property name="broaderTermColumns">
    <list>
      <value>parent</value>
    </list>
  </property>
  <property name="useColumns">
    <list>
      <value>use</value>
    </list>
  </property>
  <property name="relationColumns">
    <list>
      <value>hasthing</value>
    </list>
  </property>
</bean>
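
The way the preferred-term flag and the multi-value synonyms column of Example 2 are interpreted can be sketched in Python. This is a rough illustration, not the CSVLoader itself: is_preferred and split_multi are hypothetical helpers, and treating any unlisted flag value as preferred is an assumption of this sketch. The sample rows are taken from the CSV file shown above.

```python
import csv
import io

SAMPLE = """ID,Name,Type,Synonyms,PT,level,Code,parent,use,hasThing,attribute
1,Term 1,Thing 1,Term 1_1|Term 1_2|Term 1_3,Yes,1,AB,,,2,ATTR_2
2,Term 2,Thing 1,,No,2,ABC,,1,3,ATTR_2
5,Term 5,Thing 2,Term 5_1|Term 5_2,,1,FDR,,,,ATTR_2
"""


def is_preferred(flag):
    """'no' or 'false' (any case) means non-preferred; blank, 'true' and 'yes' mean preferred."""
    # Assumption of this sketch: any other value is also treated as preferred.
    return flag.strip().lower() not in ("no", "false")


def split_multi(value, delimiter="|"):
    """Split a multi-value cell on the whole delimiter string; a blank cell yields []."""
    return [v for v in value.split(delimiter) if v]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
parsed = {
    row["ID"]: (is_preferred(row["PT"]), split_multi(row["Synonyms"]))
    for row in rows
}
print(parsed)
```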

DelimitedLoaderConfig Properties

  • inputFile (String) – Relative (or absolute) reference to the file containing ontology information. Required: Yes. Default: N/A.
  • firstRowHeaders (boolean) – Flag denoting whether the first row contains header information. Required: No. Default: false.
  • startRow (int) – Row number to start reading at (rows are numbered starting with 0). Headers are also read at this row number. Required: No. Default: 0.
  • idColumn (String) – Name of the column containing term ID data. Required: No. Default: "id".
  • idPrefix (String) – Prefix to prepend to term IDs during read. If used, relation columns must use the prefixed ID value. Required: No. Default: null.
  • nameColumn (String) – Name of the column containing term names. Required: No. Default: "name".
  • preferredColumn (String) – Name of the column containing the preferred term flag. A term is a preferred term if the column contains "true" or "yes" (case-insensitive) and a non-preferred term if the column contains "no" or "false" (case-insensitive). A term with a blank value in this column is a preferred term. Required: No. Default: all terms are preferred terms.
  • classColumn (String) – Name of the column to use for term class types. Required: No. Default: "type".
  • className (String) – Name of class to use. Overrides the value of classColumn. Required: No. Default: N/A.
  • allowMultiValueColumns (boolean) – Flag denoting whether data in fields is split into separate values using the multiValueDelimiter. All relationship columns (broaderTermColumns, narrowerTermColumns, useColumns, useForColumns, relationColumns) are treated as multi-value regardless of the setting of this parameter. Required: No. Default: true.
  • multiValueDelimiter (String) – Character(s) used to split columns into multiple values (see allowMultiValueColumns). If this is set to a multi-character value, the entire string is the delimiter (as opposed to any of the single characters being treated as a delimiter). Required: No. Default: "|".
  • columnNames (String[]) – The column names to associate with the read data. Ignored if firstRowHeaders is true. Required: No. Default: N/A.
  • broaderTermColumns (String[]) – Names of columns that represent Broader Term relationships. The columns must contain a term ID to link to. Inverse Narrower Term relations are automatically added. Required: No. Default: null.
  • narrowerTermColumns (String[]) – Names of columns that represent Narrower Term relationships. The columns must contain a term ID to link to. Inverse Broader Term relations are automatically added. Required: No. Default: null.
  • useColumns (String[]) – Names of columns that represent Use relationships. The columns must contain a term ID to link to. Inverse Use For relations are automatically added if not specified in the file. Required: No. Default: null.
  • useForColumns (String[]) – Names of columns that represent Use For relationships. The columns must contain a term ID to link to. Inverse Use relations are automatically added if not specified in the file. Required: No. Default: null.
  • relationColumns (String[]) – Names of columns that represent custom relationships. The columns must contain a term ID to link to. Required: No. Default: N/A.
  • attributeColumns (String[]) – Names of columns containing attributes. Required: No. Default: null.

Notes on Loading Ontologies from CSV Files

  • If a data row contains too many entries, the extra(s) are ignored.
  • If a data row contains too few entries, the last missing column(s) are assumed to have a null value.
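
The two rules above can be expressed as a short Python sketch. The normalize_row helper is hypothetical and only mirrors the described behavior; it is not taken from the CSVLoader source.

```python
def normalize_row(header, row):
    """Fit a data row to the header: drop extra cells, pad missing trailing cells with None."""
    cells = list(row[:len(header)])                  # extra entries are ignored
    cells += [None] * (len(header) - len(cells))     # missing trailing entries become null
    return dict(zip(header, cells))


header = ["ID", "Name", "ISO_2", "ISO_3"]
too_many = normalize_row(header, ["4", "Afghanistan", "AF", "AFG", "extra"])
too_few = normalize_row(header, ["8", "Albania"])
print(too_many)
print(too_few)
```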

Loading a YAML Ontology from File

To load a YAML format ontology, use the YamlLoader. This loader does not require a model file.

<bean id="ontologyLoader" class="com.attivio.ontology.loader.YamlLoader">
  <property name="ontologyFile" value="conf/ontology/ontologies/ontology.yaml"/>
  <property name="noteSynonyms">
    <list>
      <value>ISO_3</value>
    </list>
  </property>
  <property name="classRelationships">
    <map>
      <entry key="*">
        <list><value>Broader Term</value></list>
      </entry>
      <entry key="Country">
        <list>
          <value>-Broader Term</value>
          <value>Part of Region</value>
        </list>
      </entry>
    </map>
  </property>
</bean>

Loading from OWL/RDFS/RDF Files

OWL/RDFS/RDF data is loaded by the OWLLoader and the RDFSLoader. These loaders have various configuration parameters that control how the ontology is interpreted into the internal Attivio ontology model format.

These configuration parameters are defined as follows:

  • id – Loader bean name.
  • class – Name of loader class. Options are:
    • OWLLoader – Load OWL ontology models and instance data.
    • RDFSLoader – Load RDFS ontology models and instance data.
  • noteSynonyms – Optional. List of names of term property/notes that are considered to be a synonym of a term, for example, the 3-digit ISO code of a country.
  • classRelationships – Optional. Map of relationship names that are considered parent relationships per class type. See here for more details.
  • base – Base URI of the ontology.
  • lang – The language used in the ontology files listed in the modelFiles and instanceFiles parameters. Unless these files are loaded by import alone, they must be in the format specified by this parameter. Options are:
    • N3 – Notation 3 (N3) or Turtle format files.
    • RDF/XML – RDF expressed as XML.
  • spec – The ontology model spec used to read and interpret the ontology. Options are case-insensitive and are defined as follows:
    • OWL_DL_MEM – OWL DL language profile with no reasoner. Default for OWLLoader when this parameter is unset.
    • OWL_DL_MEM_RDFS_INF – OWL DL language profile with rule reasoner with RDFS-level entailment-rules.
    • OWL_MEM – OWL full language profile with no reasoner.
    • OWL_MEM_RDFS_INF – OWL full language profile with rule reasoner with RDFS-level entailment-rules.
    • RDFS_MEM – RDFS language profile with no reasoner. Default for RDFSLoader.
    • RDFS_MEM_RDFS_INF – RDFS language profile with rule reasoner with RDFS-level entailment-rules.
  • modelFiles – Locally resolvable file names containing ontology model data. Optionally, these files can also contain the ontology instance data if the instanceDataInModel parameter is set to true.
  • instanceFiles – Locally resolvable file names containing ontology instance data. If the instanceDataInModel parameter is set to true, files pointed to by this parameter are ignored.
  • documentMap – Defines a map of ontology import URIs onto locally resolvable file names. It is used during ontology load to resolve the location of imported ontologies to local files. If imports are present and this parameter is not supplied, the ontology load process attempts to resolve imported ontology URIs by going out to the internet. This is slow and not recommended.
  • instanceDataInModel – Flag denoting whether the files listed in modelFiles also contain the ontology instance data. Default false.
  • localNameForIDs – Flag denoting whether the local name portion of the resource URI is used for the term ID. If set to false, the full URI is used. Default true.
  • localNameForProperties – Flag denoting whether the local name portion of the property URI is used for the term property names. If set to false, the full URI is used for property names. Default true.
  • localNameForTypes – Flag denoting whether the local name portion of the property URI is used for the term class types. If set to false, the full URI is used for class types. Default true.
  • foafNameForName – Defines whether the foaf:name property is used for a resource name if present. Default true.
  • dcTitleForName – Defines whether the dc:title property is used for a resource name if present. Default true.
  • rdfsLabelForName – Defines whether the rdfs:label property is used for a resource name if present. When used, the default label is checked first; if it does not exist, the English language label is read. Default true.
  • useNSPrefixForTypes – Defines whether the namespace prefix is prefixed onto the class type local name to differentiate class types in different ontology URI scopes. Default true.

These loaders use the Jena API to handle model loading. See here for more information.
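
The effect of the localNameForIDs flag can be approximated with a short Python sketch. This is a simplification for illustration: it takes the local name to be everything after the last '#' or '/', whereas RDF libraries such as Jena apply stricter splitting rules. The local_name and term_id helpers and the sample URI are hypothetical.

```python
def local_name(uri):
    """Return the text after the last '#' or '/' (simplified splitting rule)."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[cut + 1:]


def term_id(uri, local_name_for_ids=True):
    """Pick the term ID the way the localNameForIDs flag is described: local name or full URI."""
    return local_name(uri) if local_name_for_ids else uri


uri = "http://attivio.com/ont#Mars"  # hypothetical resource URI
print(term_id(uri))
print(term_id(uri, local_name_for_ids=False))
```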

Example 1 - Loading an RDFS ontology in RDF/XML format with model and instance data combined.

  <bean id="ontologyLoader" class="com.attivio.ontology.loader.RDFSLoader">
    <property name="base" value="http://attivio.com/ont"/>
    <property name="lang" value="RDF/XML"/>
    <property name="spec" value="RDFS_MEM"/>
    <property name="modelFiles">
      <list>
        <value>test\ontologies\rdf\rdfs\astronomy.rdfs</value>
      </list>
    </property>
    <property name="instanceFiles">
      <list/>
    </property>
    <property name="instanceDataInModel" value="true"/>
    <property name="localNameForIDs" value="false"/>
    <property name="foafNameForName" value="true"/>
    <property name="dcTitleForName" value="true"/>
    <property name="rdfsLabelForName" value="true"/>
    <property name="localNameForProperties" value="true"/>
    <property name="localNameForTypes" value="true"/>
  </bean>

Example 2 - Loading OWL ontology in N3 format with separate model and instance data and an import (time.owl).

  <bean id="ontologyLoader" class="com.attivio.ontology.loader.OWLLoader">
    <property name="base" value="http://attivio.com/ontology/test/"/>
    <property name="lang" value="N3"/>
    <property name="spec" value="OWL_DL_MEM"/>
    <property name="modelFiles">
      <list>
        <value>test\ontologies\rdf\test\test.n3</value>
      </list>
    </property>
    <property name="instanceFiles">
      <list>
        <value>test\ontologies\rdf\test\data.n3</value>
      </list>
    </property>
    <property name="documentMap">
      <map>
        <entry key="http://www.w3.org/2006/time#" value="test\ontologies\rdf\test\time.owl"/>
      </map>
    </property>
    <property name="instanceDataInModel" value="false"/>
    <property name="localNameForIDs" value="false"/>
    <property name="foafNameForName" value="true"/>
    <property name="dcTitleForName" value="true"/>
    <property name="rdfsLabelForName" value="true"/>
    <property name="localNameForProperties" value="true"/>
    <property name="localNameForTypes" value="true"/>
  </bean>

Defining Class Relationships

Class relationships are specified as a map of entries keyed by class name from the ontology. Each class type then lists the types of relationship to add and return in the look-up service as suggested "parents". For example, an auto-complete search by Country on the string chil returns Chile and lists its parents/links as the regions South America and Americas, because the Part of Region relationship was included in the classRelationships configuration for this class type.

The special list name * is used to denote relationships to be added to any class type in the ontology. Generally, the Broader Term relationship is added here, as this hierarchical relationship represents a link that can generally be considered a parent.

To override and remove a relationship from a particular class type that was added under the wildcard list *, prefix the relationship name with a dash, for example: -Broader Term.
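
The wildcard and dash-prefix rules can be sketched in Python as follows. This is an interpretation of the rules described in this section, not the loader's actual code; the parent_relationships function is hypothetical, and the map mirrors the classRelationships values used in the loader examples.

```python
def parent_relationships(class_relationships, class_type):
    """Resolve the effective parent relationship names for one class type."""
    # Start with the relationships listed under the wildcard entry.
    effective = list(class_relationships.get("*", []))
    for name in class_relationships.get(class_type, []):
        if name.startswith("-"):
            # A leading dash removes a relationship inherited from "*".
            removed = name[1:]
            effective = [r for r in effective if r != removed]
        elif name not in effective:
            effective.append(name)
    return effective


config = {
    "*": ["Broader Term"],
    "Country": ["-Broader Term", "Part of Region"],
}
print(parent_relationships(config, "Country"))
print(parent_relationships(config, "Region"))
```

With this configuration, Country terms report only Part of Region parents, while every other class type falls back to Broader Term.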

Startup Operations

Ontology load operations appear in the Attivio log file and console on the INFO channel.

Typical log output seen when loading an ontology is shown below.

...
2015-01-13 14:25:54,344 INFO  Attivio [main_local_10.1.5.64_17000] - System Ready (startupTime=56s)
2015-01-13 14:25:54,504 INFO  OntologyManagementService [pool-9-thread-1] - Starting to load ontology 'ontology' version '0' ...
2015-01-13 14:25:54,505 INFO  ReloadingOntology [Thread-178] - Loading ontology 'ontology' to version '0'
2015-01-13 14:25:54,531 INFO  ZthesStreamParser [Thread-178] - Processing zthes model file 'C:\unscanned\attivio-projects-2\data-agent\projects\otest1\default\resources\ontology-model.xml'
2015-01-13 14:25:54,672 INFO  ZthesStreamParser [Thread-178] - Processing zthes ontology file 'C:\unscanned\attivio-projects-2\data-agent\projects\otest1\default\resources\ontology.xml'
2015-01-13 14:25:54,875 INFO  ZthesStreamParser [Thread-178] - File 'ontology.xml' completed in 202ms (432 terms)
2015-01-13 14:25:54,955 INFO  ReloadingOntology [Thread-178] - Loaded base ontology 'ontology' version '0' in 447ms
2015-01-13 14:25:54,955 INFO  OntologyManagementService [pool-9-thread-1] - Ontology 'ontology' version '0' loaded (base)
2015-01-13 14:25:54,957 INFO  OntologyEntityDictionaryBean [pool-9-thread-1] - Generating entity dictionary 'ontology/ontologyEntityDictionary' in file:/c:/unscanned/attivio-projects-2/data-agent/projects/otest1/default/data/data-local/ontology/ontology_entities_ee.csv (typed=false)
2015-01-13 14:25:54,982 INFO  OntologySynonymDictionaryBean [pool-9-thread-1] - Generating synonym dictionary 'ontology/ontologySynonymDictionary' in file:/c:/unscanned/attivio-projects-2/data-agent/projects/otest1/default/data/data-local/ontology/ontology_synonyms.csv
2015-01-13 14:25:54,990 INFO  OntologySynonymDictionaryBean [pool-9-thread-1] - Synonym dictionary 'ontology/ontologySynonymDictionary' generated in file:/c:/unscanned/attivio-projects-2/data-agent/projects/otest1/default/data/data-local/ontology/ontology_synonyms.csv
2015-01-13 14:25:58,716 INFO  ReplicaConfig [abc-index-indexer.abc-index-content-dispatcher-dispatcher-1] - [admin.abc-index-local-cd] Engine abc-index-local setting global version for index abc-index-part0 to [default=14ae052ae65.37]
2015-01-13 14:25:59,738 INFO  ReplicaConfig [abc-index-indexer.abc-index-content-dispatcher-dispatcher-2] - [admin.abc-index-local-cd] Engine abc-index-local setting global version for index abc-index-part0 to [default=14ae052ae65.3b]
2015-01-13 14:26:00,417 INFO  ReplicaConfig [abc-index-indexer.abc-index-content-dispatcher-dispatcher-3] - [admin.abc-index-local-cd] Engine abc-index-local setting global version for index abc-index-part0 to [default=14ae052ae65.3f]
2015-01-13 14:26:00,540 INFO  BusinessCenterEngine [abc-index-indexer.abc-index-content-dispatcher-dispatcher-3] - [admin.abc-index-local-cd] Publishing Dictionary To Content Store: acs://contentStore/dictionaries/SYNONYM/ontologySynonymDictionary/ontology
2015-01-13 14:26:00,671 INFO  ManagedDictionaryRepository [ZooKeeperClientCnxn--EventThread-14ae4b933150009] - Resource Modified: acs://contentStore/dictionaries/SYNONYM/ontologySynonymDictionary/ontology (af626903-1840-4264-b8a9-7256ec0af14f name=ontologySynonymDictionary group=ontology type=SYNONYM fst locale=en)
2015-01-13 14:26:00,794 INFO  OntologySynonymDictionaryBean [pool-9-thread-1] - Synonym dictionary 'ontology/ontologySynonymDictionary' published 
2015-01-13 14:26:00,831 INFO  OntologyUtils [pool-9-thread-1] - Ontology listeners notified for ontology 'ontology' - notify endpoints requested
2015-01-13 14:26:00,831 INFO  OntologyManagementService [pool-9-thread-1] - Finished loading ontology 'ontology' version '0'


...

In addition to the main ontology load operations (the Ontology loadOntology stage, token provider (tokenized term name cache) generation, and the Ontology postLoad stage), additional steps that derive data from the loaded ontology may run, depending on the active configuration.

These additional steps are:

  • Dictionary Generation – Generation of the synonym and entity extraction dictionaries and notification of components that use them. See here for details.
  • Prefix Matcher Generation – Generation of the term name prefix data structures used in look-up by name operations.
  • Topic Tagger Generation – Generation of the data structures used in the tagging function.
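
When checking start-up behavior, it can be handy to pull the ontology load time out of the log. The Python sketch below does this with a regular expression matched against the "Loaded base ontology" message shown in the log excerpt above. The load_times helper is hypothetical, and the pattern assumes the message format stays as shown in that excerpt.

```python
import re

# Matches messages like: Loaded base ontology 'ontology' version '0' in 447ms
LOADED = re.compile(r"Loaded base ontology '([^']+)' version '([^']+)' in (\d+)ms")


def load_times(lines):
    """Return (ontology name, version, milliseconds) for each matching log line."""
    return [
        (m.group(1), m.group(2), int(m.group(3)))
        for m in map(LOADED.search, lines)
        if m
    ]


log = [
    "2015-01-13 14:25:54,875 INFO  ZthesStreamParser [Thread-178] - File 'ontology.xml' completed in 202ms (432 terms)",
    "2015-01-13 14:25:54,955 INFO  ReloadingOntology [Thread-178] - Loaded base ontology 'ontology' version '0' in 447ms",
]
times = load_times(log)
print(times)
```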

Defining an Ontology

The ontology is read into in-memory data structures that serve the various purposes outlined below.

The data-structure building blocks are OntologyTerm and TermRelation objects.

OntologyTerm

Abstract class that represents a single term in the ontology.

  • id (String) – Unique identifier string for the term. This ID must be unique across the ontology.
  • name (String) – Name of the term.
  • version (int) – Ontology version of which this term is a part.
  • preferred (boolean) – Flag denoting whether the term is preferred (true) or non-preferred (false). Non-preferred terms represent synonyms and should always link to a preferred term via a Use relationship.
  • classes (String[]) – Class types associated with the term.
  • attributes (String[]) – Attributes associated with the term.
  • relations (TermRelation[]) – Array of relations associated with the term. See below for details.
  • properties (Map<String,List<String>>) – Named property map associated with the term.

TermRelation

Represents a directional relationship between one term and another.

Property | Type | Description
type | String | Type of relationship.
name | String | Name of relationship.
term | OntologyTerm | Term to which this relationship is directed. Known as the object of the relationship/triple.
termId | String | ID of the term that the relationship points to. Known as the object of the relationship/triple.
parent | boolean | Flag denoting a special type of relationship used for optimizing the amount of data passed over the wire in some circumstances. To configure a related term as a parent, use the optional classRelationships configuration parameter on the ontology loader bean.

Ontology APIs

Once loaded, the ontology object contains a set of data structures that facilitate high-performance retrieval of the following:

  • ontology-term data by term ID/name
  • term attribute
  • term note/property name/value
  • relationship type/name to other terms, etc.

See the Ontology Java API for more details.

Dictionary Generation

Automatic ontology-driven dictionary generation produces dictionaries suitable for use with standard AIE functions. Dictionaries are generated in standard CSV format.

You can generate two dictionary types from ontology data:

  • synonym-expansion dictionaries
  • entity-extraction dictionaries

Synonym-Expansion Dictionaries

Bi-directional synonym dictionaries are generated from the equivalence-type relationships in the ontology. Optionally, additional synonyms are generated from the values of note fields that can be considered to uniquely identify a term, for example a country ISO code, company ticker, or FIPS code. These note field names are configured using the noteSynonyms property in the ontology. You can configure multiple synonym dictionaries that use different relationships and notes.

Synonym dictionary generation is defined using the synonymDictionary feature. When the ontology module is added to a project, the following default configuration is added to the AIE project's conf\features\ontology\SynonymDictionaryModel.ontologySynonymDictionary.xml file:

conf\features\ontology\SynonymDictionaryModel.ontologySynonymDictionary.xml
<ont:synonymDictionary name="ontologySynonymDictionary" ontology="ontology" nodeset="*">
  <ont:resource location="${data.directory}/ontology/ontology_synonyms.csv" />
</ont:synonymDictionary>

The synonymDictionary feature rests on the OntologySynonymDictionaryBean.

Attributes of synonymDictionary:

Attribute | Description
name | Uniquely identifies the feature by name. This name appears in log output as dictionaries are generated.
ontology | The name of the ontology used to generate the dictionary.
nodeset | The nodeset that loads the dictionary generator feature.
useNoteSynonyms | Denotes whether the synonym dictionary is generated using the values of the term notes named in the noteSynonyms ontology configuration parameter. Defaults to true.

Sub-elements of synonymDictionary:

Element

Description

resource

The location (expressed as a URI) of the dictionary resource to generate.

notifications

List of components to be notified after the dictionary is generated. On notification, these components reload their dictionary resources to stay current with any ontology changes.

transitiveClasses

SynonymDictionary bean only. Defines a set of classes to add to the dictionary in a transitive manner.
For example, assume that term A is related to terms B, C, and D by relationships of type equivalence. When the dictionary is generated with the transitiveClasses option unset, the dictionary entries are as follows:

A,B|C|D
B,A
C,A
D,A


But if term A matches an entry in the transitiveClasses class list, the dictionary entries are generated in a transitive manner as follows:

A,B|C|D
B,C|D|A
C,D|A|B
D,A|B|C
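
The two output shapes above can be reproduced with a short sketch. This is only an illustration of the entry layout, written in Python and not taken from the ontology module; the function name synonym_entries and the transitive flag are hypothetical stand-ins for the check against the transitiveClasses list.

```python
def synonym_entries(group, transitive=False):
    """Build synonym-dictionary lines for one equivalence group.

    group[0] is the term that holds the equivalence relationships (term A);
    the remaining items are the terms it is related to.
    """
    head, related = group[0], group[1:]
    if not transitive:
        # The head term expands to every related term;
        # each related term maps back to the head term only.
        lines = [f"{head},{'|'.join(related)}"]
        lines += [f"{term},{head}" for term in related]
        return lines
    # Transitive: every member expands to all other members.
    # Rotating the group keeps the ordering used in the example above.
    lines = []
    for index, term in enumerate(group):
        others = group[index + 1:] + group[:index]
        lines.append(f"{term},{'|'.join(others)}")
    return lines


group = ["A", "B", "C", "D"]
print("\n".join(synonym_entries(group)))
print("\n".join(synonym_entries(group, transitive=True)))
```

In this sketch the caller decides whether the group is transitive; in the module that decision is described as depending on whether the term matches an entry in transitiveClasses.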

Entity-Extraction Dictionaries

Generated entity-extraction dictionaries contain mappings between ontology terms, their term IDs, and the IDs of any linked preferred terms. As with the synonym dictionary, you can map the values of configured named note fields to term IDs.

Entity-extraction dictionary generation is defined using the entityDictionary feature. When the ontology module is added to a project, the following default configuration is added to the AIE project's conf\features\ontology\EntityDictionaryModel.ontologyEntityDictionary.xml file:

conf\features\ontology\EntityDictionaryModel.ontologyEntityDictionary.xml
<ont:entityDictionary name="ontologyEntityDictionary" ontology="ontology" nodeset="*">
  <ont:resource location="${data.directory}/ontology/ontology_entities_ee.csv" />
  <ont:types>
    <ont:include class="Country" />
    <ont:include class="Region" />
  </ont:types>
</ont:entityDictionary>

The entityDictionary feature rests on the OntologyEntityDictionaryBean.

Attributes of entityDictionary:

Attribute | Description
name | Uniquely identifies the feature by name. This name appears in log output as dictionaries are generated.
ontology | The name of the ontology used to generate the dictionary.
nodeset | The nodeset that loads the dictionary generator feature.

Sub-elements of entityDictionary:

Element | Description
types | Optional parameter for the entity-extraction dictionary. This parameter lets you limit the terms included in the dictionary to only those with the class types listed. This includes non-preferred terms that reference a class of a listed type. If this parameter is not specified, ALL ontology class types are included in the dictionary.

Entity Extraction

The ontology is used to generate entity-extraction dictionaries at startup from ontology terms and synonyms. The entity dictionary is used during ingestion to tag documents with entity IDs. A later transformer stage then completes the process by augmenting the entity IDs with the ontology preferred term names in typed fields.

The standard entity extraction technique uses the entity extraction dictionary that is automatically generated from ontology data. See here for details. This facilitates a more controlled search result facet display using only preferred terms from the ontology.

At ingest time, the dictionary drives entity extraction: the unique ontology preferred term IDs of all entities located in the document text are stored in a single document field (entity*ids).

A transformer (ApplyOntology) converts these entity IDs to their textual equivalents (term names) in fields matching the ontology class types, for example, country_ent, region_ent. The field names are based on the name of the class in the ontology.
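
The ID-to-name step described above can be pictured with a minimal Python sketch. It is not the ApplyOntology implementation: the term IDs (T1, T2, T3), the ONTOLOGY lookup table, and the exact field naming rule (lower-cased class name plus an _ent suffix) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory ontology: term ID -> (preferred term name, class type).
ONTOLOGY = {
    "T1": ("Afghanistan", "Country"),
    "T2": ("Asia", "Region"),
    "T3": ("Southern Asia", "Region"),
}


def field_name_for(class_type):
    # Assumed naming rule: lower-cased class name, spaces to underscores, "_ent" suffix.
    return class_type.lower().replace(" ", "_") + "_ent"


def apply_ontology(entity_ids):
    """Map extracted entity IDs to preferred term names, grouped by typed field."""
    fields = defaultdict(list)
    for term_id in entity_ids:
        if term_id not in ONTOLOGY:
            continue  # unknown IDs are skipped in this sketch
        name, class_type = ONTOLOGY[term_id]
        target = fields[field_name_for(class_type)]
        if name not in target:  # avoid duplicate names in a field
            target.append(name)
    return dict(fields)


print(apply_ontology(["T1", "T3", "T2", "T1"]))
```

Because the field name is derived from the class type, a class name with non-ASCII characters would flow straight into the field name in a scheme like this one.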

The following schematic outlines the entity extraction process:

Use ASCII-compatible Class Types in Ontology

Class types are used to generate field names during ontology-based entity extraction and must be in ASCII format to avoid field name issues in AIE.

Avoid Words in Common Usage as Ontology Term Names

Entity extraction functions by text matching. When populating an entity extraction dictionary, be careful not to use term names that equate to common words. For example, when using an ontology with country or currency terms that include short identifying codes, the codes themselves often introduce false positives in the matching process.

Excluding Terms from Entity Extraction

You can apply a special note to a term to exclude it from the entity extraction process (inclusion in the entity extraction dictionary and matching). The note name is "Attivio Note" and the value should be "Attivio do not index".

Synonym Expansion

The synonym-expansion feature automatically enhances a query by adding any known synonyms of its terms to the query. This increases document recall. The feature is driven from a dictionary, which is generated automatically from term data in the ontology.

Synonyms of a preferred term are defined in an ontology in two ways:

  • Non-preferred Terms – can be considered to be synonyms of a single preferred term. The non-preferred term is related to the preferred term via an equivalence relationship. A reciprocal equivalence relationship links the preferred term to a related non-preferred term.
  • Unique Named Properties – a configurable set of named properties you can use to uniquely define a term to augment the synonym dictionary. For example, for a company term, the company's FIPS code could be included as a synonym for the company's name. For a country term, the country's ISO_3 code could be used as a synonym for the country's name.

The following diagram demonstrates the generation of dictionary entries for the country Algeria in the sample ontology.

The ontology Use For relationships provide some synonyms, and the noteSynonyms ontology setting includes the property ISO_3, which contributes the final synonym DZA.

Be careful not to use property values that do not uniquely identify a term. Similarly, do not use note values that could easily contain words in common usage. For example, ISO two-letter country codes are often common words, so avoid using them.

When installing the ontology module, the new ontology-based dictionary overrides the existing querySynonymizer component.

Ingestion Metadata Tagging

Generic mapping support is provided via ontology term relationships, relationship types, attributes, and property/note name/values.

For example, locating a Country term could result in the geographical region and top level region being added to the ingested document. This can improve recall by allowing search on metadata not in the text of the document, or allow faceting based on this new metadata.

Complex, multi-step relationships can be followed and used to accrue metadata.

Documents can be tagged/expanded based upon existing/pre-tagged metadata.

Metadata from existing document field data can be cleaned via ontology-based mappings.

Relationships between terms defined in the ontology can be used to add or accrue additional metadata to the document using input from either:

  • data in existing fields, or
  • terms identified as part of the ontology-driven entity extraction process.

Tagging by Ontology Relationship

The MetadataTagging transformer can:

  • read values in a specified input field
  • use those values to look up data in the ontology by name
  • return term names linked to the identified term by relationship name or type
  • store this new data in a specified output field

Transformer properties are defined as follows:

Property | Description | Required
ontologyName | Name of ontology to use to complete the look-up. | Yes
input | Input field name. This field contains text that matches a term name in the ontology. | Yes
output | Output field name. | Yes
relationshipName | Name of relationship to retrieve. Note: One of relationshipName or relationshipType must be specified. | No
relationshipType | Type of relationship to retrieve. Note: One of relationshipName or relationshipType must be specified. | No

Sample configuration is shown below:

<components>
  <component name="metadataTagging" class="com.attivio.ontology.transformer.ingest.field.MetadataTagging">
    <properties>
      <property name="ontologyName" value="ontology" />
      <property name="input" value="country_s" />
      <property name="output" value="region_s" />
      <property name="relationshipName" value="Includes Country" />
    </properties>
  </component>
</components>

<beans>
  <f:insertComponent workflow="ingestPostProcess" position="first" component="metadataTagging" name="insertTagging"/>
</beans>

In the sample configuration above, the input field country_s is read as the document is processed. Each value in this field is processed in turn. The ontology API retrieves ontology terms that are related to terms of that name by the relationship name Includes Country. Any terms retrieved are then added to the output field region_s.

For example, using the supplied sample ontology, with the input field country_s containing the text Afghanistan, the transformer uses this text to locate a term of that name in the ontology. It then looks for terms related to Afghanistan via the Includes Country relationship and finds the terms Asia and Southern Asia (both of type Region). The names of these located terms are added to the output field region_s.

Tagging by Ontology Relationship (Advanced)

The OntologyMetadataTagging transformer allows multiple tagging operations to occur at once by providing a configuration taggingConfigs that is a list of TaggingConfiguration beans as shown below.

Defining an Advanced Relationship

The syntax of an advanced relationship is shown below:

[<qualifier_prefix>]<relation>[<limit_results>]

Any number of these relationship expressions can be chained together, separated by a | character.

Phrase

Function

<qualifier_prefix>

[*|^|^^] Optional qualifier to control situations where a broader term hierarchy is pointed to by the relation. Options are:

  • * - Return all terms in the hierarchy pointed to by the relation. For example, *Part of Region returns all regions in the broader term hierarchy after finding the directly linked region terms by following the Part of Region relation.
  • ^ - Return the term nearest the top of its own broader term hierarchy if multiple terms are returned.
  • ^^ - Return the highest term in any broader term hierarchy pointed to by the relation.

<relation>

Name of the relation to follow from subject to object.

<limit_results>

Optionally limit linked terms by class type or name. Options are:

  • [type=<class_type>] – Limit terms matched by following a relation to only terms of the specified class type. This is useful when the same relation name is used to link to terms of multiple class types.
  • [name=<term_name>] – Limit terms matched by following a relation to only terms of the specified name.
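
To make the syntax concrete, here is a small Python sketch that splits a chained expression into its steps. It is not the module's parser. In particular, it assumes that the limit clause is written with literal square brackets (for example, Part of Region[type=Region]), which the table above does not state outright; RelationStep and parse_relationship are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class RelationStep(NamedTuple):
    qualifier: str               # "", "*", "^" or "^^"
    relation: str                # name of the relation to follow
    limit_key: Optional[str]     # "type", "name" or None
    limit_value: Optional[str]


# Assumption: the limit clause uses literal brackets, e.g. "Part of Region[type=Region]".
_STEP = re.compile(r"^(\^\^|\^|\*)?([^\[\]|]+?)(?:\[(type|name)=([^\]]+)\])?$")


def parse_relationship(expression):
    """Split a chained expression on '|' and parse each step in order."""
    steps = []
    for raw in expression.split("|"):
        match = _STEP.match(raw.strip())
        if match is None:
            raise ValueError(f"Cannot parse relationship step: {raw!r}")
        qualifier, relation, key, value = match.groups()
        steps.append(RelationStep(qualifier or "", relation.strip(), key, value))
    return steps


for step in parse_relationship("^^Part of Region|Includes Country"):
    print(step)
```

Each parsed step would then be followed in order, with the terms located by one step serving as the starting terms for the next.
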
<components>
  <component name="metadataTaggingAdvanced" class="com.attivio.ontology.transformer.ingest.field.OntologyMetadataTagging">
    <properties>
      <property name="ontologyName" value="ontology" />
      <container-property name="taggingConfigs">
        <util:list id="taggingConfigs">
          <spring:bean class="com.attivio.ontology.transformer.ingest.field.TaggingConfiguration">
            <spring:property name="input" value="country_s" />
            <spring:property name="output" value="metadata1_s" />
            <spring:property name="mode" value="name" />
            <spring:property name="relationship" value="Use" />
          </spring:bean>
          <spring:bean class="com.attivio.ontology.transformer.ingest.field.TaggingConfiguration">
            <spring:property name="input" value="country_s" />
            <spring:property name="output" value="metadata2_s" />
            <spring:property name="mode" value="type" />
            <spring:property name="relationship" value="equivalence" />
          </spring:bean>
          <spring:bean class="com.attivio.ontology.transformer.ingest.field.TaggingConfiguration">
            <spring:property name="input" value="country_s" />
            <spring:property name="output" value="metadata3_s" />
            <spring:property name="mode" value="name" />
            <spring:property name="relationship" value="Part of Region" />
          </spring:bean>
          <spring:bean class="com.attivio.ontology.transformer.ingest.field.TaggingConfiguration">
            <spring:property name="input" value="country_s" />
            <spring:property name="output" value="metadata4_s" />
            <spring:property name="mode" value="name" />
            <spring:property name="relationship" value="^^Part of Region" />
          </spring:bean>
          <spring:bean class="com.attivio.ontology.transformer.ingest.field.TaggingConfiguration">
            <spring:property name="input" value="country_s" />
            <spring:property name="output" value="metadata5_s" />
            <spring:property name="mode" value="name" />
            <spring:property name="relationship" value="Part of Region|Includes Country" />
          </spring:bean>
        </util:list>
      </container-property>
    </properties>
  </component>
</components>

<beans>
  <f:insertComponent workflow="ingestPostProcess" position="first" component="metadataTaggingAdvanced" name="insertAdvancedTagging"/>
</beans>

In the configuration above, the following tagging operations are configured to work on data in the country_s field:

Config # | Description | Input | Output
1 | Follow the Use named relationship from term in country_s, and tag document with located terms. | Holland | Netherlands
  |  | Afghanistan | N/A
2 | Follow the equivalence relationship type from term in country_s field, and tag document with located terms. | Holland | Netherlands
  |  | Afghanistan | N/A
3 | Follow the Part of Region relationship from term in country_s, and tag document with located terms. | Holland | N/A
  |  | Afghanistan | Asia, Southern Asia
4 | Follow the Part of Region relationship to the top of its hierarchy tree (broadest term), and tag document with located terms. | Holland | N/A
  |  | Afghanistan | Asia
5 | Follow the Part of Region relationship from term in country_s, and from any Region type terms located, follow the Includes Country relationship and tag document with the final set of located terms. | Holland | N/A
  |  | Afghanistan | Armenia, Azerbaijan, Bahrain, Bangladesh, Bhutan... (tags with all terms in Asia)

Tagging by Ontology Attribute

Tagging by ontology attribute involves setting the transformer to read the values in specified fields and use them for name look-ups in the ontology. The transformer checks each returned term for the specified attribute. The results of these checks are combined in AND or OR fashion, and the outcome is placed in a result field.

Transformer properties are defined as follows:

Property | Description | Required
ontologyName | Name of ontology to use to complete the look-up. | Yes
fieldNames | Names of the fields to read. Each value found in these fields is used to look up an ontology term. The list must contain at least one entry. | Yes
attributeName | Name of attribute to test for. | Yes
tagFieldName | Field name to contain the result. | Yes
testMode | Mode of test operation. Options are: AND, OR. For AND, all field value terms must have the specified attribute. For OR, at least one field value term must have the specified attribute. If not specified, the default is AND. | No
resultForSatisfied | The value to add to the tagFieldName field if the test IS satisfied. | Yes
resultForNotSatisfied | The value to add to the tagFieldName field if the test IS NOT satisfied. | Yes

Sample configuration is shown below:

<components>
<component name="attributeTagging" class="com.attivio.ontology.transformer.ingest.field.TagBasedOnOntologyAttributeValue">
  <properties>
    <property name="ontologyName" value="ontology" />
    <list name="fieldNames">
      <entry value="country_s" />
      <entry value="currency_s" />
    </list>
    <property name="attributeName" value="G11" />
    <property name="tagFieldName" value="result_s" />
    <property name="testMode" value="AND" />
    <property name="resultForSatisfied" value="G11" />
    <property name="resultForNotSatisfied" value="Emerging Market" />
  </properties>
</component>
</components>

<beans>
  <f:insertComponent workflow="ingestPostProcess" position="first" component="attributeTagging" name="insertAttributeTagging"/>
</beans>

In the sample above, the fields country_s and currency_s are read and their values are used to look up ontology terms. Each term is then checked for the attribute G11. Because the testMode parameter is AND, the value G11 is added to the result_s field only if every term has the attribute. Otherwise, the value Emerging Market is added to the result_s field.

For example:

Case 1:
Country = "United States", Currency = "USD - United States Of America - Dollars"
Country "United States" has attribute "G11" = true
Currency "USD - United States Of America - Dollars" has attribute "G11" = true
Result applied: Field "result_s" content set to "G11".

Case 2:
Country = "Dominican Republic", Currency = "DOP"
Country "Dominican Republic" has attribute "G11" = false
Currency "DOP" has attribute "G11" = false
Result applied: Field "result_s" content set to "Emerging Market".

Case 3:
Country = "Dominican Republic", Currency = "USD - United States Of America - Dollars"
Country "Dominican Republic" has attribute "G11" = false
Currency "USD - United States Of America - Dollars" has attribute "G11" = true
Result applied: Field "result_s" content set to "Emerging Market".

Case 4:
Country = "United States", Currency = "DOP"
Country "United States" has attribute "G11" = true
Currency "DOP" has attribute "G11" = false
Result applied: Field "result_s" content set to "Emerging Market".
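
The four cases can be checked with a small Python sketch of the AND/OR test. This is an illustration only, not the TagBasedOnOntologyAttributeValue code: the TERM_ATTRIBUTES table stands in for the ontology look-up, tag_by_attribute is a hypothetical name, and the choice to treat a document with no values as not satisfied is an assumption.

```python
# Hypothetical attribute data standing in for the ontology look-up.
TERM_ATTRIBUTES = {
    "United States": {"G11"},
    "USD - United States Of America - Dollars": {"G11"},
    "Dominican Republic": set(),
    "DOP": set(),
}


def tag_by_attribute(document, field_names, attribute, test_mode="AND",
                     satisfied="G11", not_satisfied="Emerging Market"):
    """Return the tag value for a document given as {field name: [values]}."""
    # One check per field value: does the matching term carry the attribute?
    checks = [
        attribute in TERM_ATTRIBUTES.get(value, set())
        for field in field_names
        for value in document.get(field, [])
    ]
    if test_mode == "AND":
        # Assumption: a document with no values at all does not satisfy the test.
        passed = bool(checks) and all(checks)
    elif test_mode == "OR":
        passed = any(checks)
    else:
        raise ValueError(f"Unsupported test mode: {test_mode}")
    return satisfied if passed else not_satisfied


case_3 = {
    "country_s": ["Dominican Republic"],
    "currency_s": ["USD - United States Of America - Dollars"],
}
print(tag_by_attribute(case_3, ["country_s", "currency_s"], "G11"))
print(tag_by_attribute(case_3, ["country_s", "currency_s"], "G11", test_mode="OR"))
```

With testMode set to OR, cases 3 and 4 would be tagged G11 in this sketch, because one of the two terms carries the attribute.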

Term Look-ups

A high-performance memory-based term name look-up capability is provided as an API for use in auto-complete drop downs and similar functions.

For example, an ontology containing terms (synonyms): United Kingdom (UK, U.K.), Europe, Eastern Europe, Northern Europe, United Arab Emirates (UAE), would allow the following lookups:

Input | Result
"eur" | Europe, Eastern Europe, Northern Europe
"u" | United Kingdom, UK, U.K., United Arab Emirates, UAE (if matching preferred and non-preferred terms)
"u" | United Kingdom, United Arab Emirates (if matching only preferred terms)
"ua" | United Arab Emirates, UAE

Result ordering is controlled as follows:

  • Preferred terms rank higher than non-preferred terms.
  • Exact match on the whole term name.
  • Exact match on the first token in the term name.
  • Starts-with match on the first token.
  • An exact match of any token ranks higher than a match where no token fully matches.
  • Starts-with token match closest to the beginning of the term name.
  • Shorter matched term names rank higher than longer term names.
  • Alphabetical ordering by term name.

The look-up uses a prefix matching store of the full ontology, or of a pre-defined subset of ontology term types.

Multiple prefix matchers can be defined and accessed by name within the provided Java APIs.

Prefix matchers are populated using tokenized term names (with punctuation removed) and, optionally, the values of configured named notes to provide synonym lookup capabilities. The noteSynonyms parameter defines the names of the notes whose values can be used to look up terms.

Different matcher collection techniques can be defined, for example to return any combination of preferred and non-preferred term matches.

Look-up Configuration

Installing the ontology module adds a prefixMatcher bean definition, which by default includes all terms in the ontology in the in-memory lookup data structure.

<ont:prefixMatcher name="ontologyPrefixMatcher" ontology="ontology" nodeset="*" />

To limit a prefix-matcher to a subset of ontology terms, add a types definition. For example, in the following configuration only terms of type Region are included in data returned by the prefix look-up.

<ont:prefixMatcher name="lookupRegions" ontology="ontology" nodeset="*">
   <ont:types>
      <ont:include class="Region" />
   </ont:types>
</ont:prefixMatcher>

Multiple typed prefix matchers can be defined in this way.

You can achieve finer control over the data returned by the prefix look-up using a collection configuration. A collection configuration is an instance of a com.attivio.ontology.model.CollectorConfiguration bean. A typical configuration is shown below:

<bean id="collector-1" class="com.attivio.ontology.model.CollectorConfiguration">
   <property name="matchMappings">
      <map>
         <entry key="Country" value="PT/NPT"/>
         <entry key="GICS Sector" value="PT"/>
         <entry key="Region" value="PT"/>
      </map>
   </property>
   <property name="replaceMappings">
      <map>
         <entry key="*" value="true"/>
      </map>
   </property>
</bean>

This configuration matches country terms based on their preferred term and non-preferred term names, and matches GICS Sector and Region classes based on their preferred term names only (non-preferred term names are not considered matches for these types). After matching based on a non-preferred term, all classes (denoted by the wildcard *) are replaced with their preferred term.
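
The effect of the two maps can be sketched in Python as a filter over candidate matches. This is not the CollectorConfiguration implementation: the collect function, the dictionary layout of a match, and the sample term Far East (a made-up non-preferred term for Asia) are assumptions, as is the choice to drop classes that have no matchMappings entry.

```python
# Mirrors the bean above: class type -> which term kinds may match.
MATCH_MAPPINGS = {"Country": "PT/NPT", "GICS Sector": "PT", "Region": "PT"}
# Class type (or "*" for all classes) -> replace a matched NPT with its PT.
REPLACE_MAPPINGS = {"*": True}


def collect(matches):
    """Filter prefix matches by class mapping and apply preferred-term replacement."""
    results = []
    for match in matches:
        allowed = MATCH_MAPPINGS.get(match["class"])
        if allowed is None:
            continue  # assumption: classes without a mapping are not returned
        kind = "PT" if match["preferred"] else "NPT"
        if kind not in allowed.split("/"):
            continue  # this class does not accept matches of this kind
        replace = REPLACE_MAPPINGS.get(match["class"], REPLACE_MAPPINGS.get("*", False))
        name = match["preferred_name"] if (kind == "NPT" and replace) else match["name"]
        if name not in results:  # a PT reached twice is reported once
            results.append(name)
    return results


matches = [
    {"name": "Holland", "class": "Country", "preferred": False, "preferred_name": "Netherlands"},
    {"name": "Netherlands", "class": "Country", "preferred": True, "preferred_name": "Netherlands"},
    {"name": "Far East", "class": "Region", "preferred": False, "preferred_name": "Asia"},
    {"name": "Asia", "class": "Region", "preferred": True, "preferred_name": "Asia"},
]
print(collect(matches))
```

Here Holland is accepted because Country allows non-preferred matches and is then replaced by Netherlands, while the non-preferred Region match is dropped.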

Using the Prefix Matcher

See below for details of how to test the prefix matcher in the supplied Ontology UI.
See Ontology Java API for details of how to make lookup calls using the Java API.
See below for details of how to integrate the prefix matcher in the Attivio AutoComplete module.

Ontology Validation

A rules-based ontology validation mechanism checks for common problem areas in a loaded ontology.

Built-in rules are provided for validating the following common errors:

  • Leading/Trailing Spaces – additional spaces in key term parameters cause unexpected behavior. Rule class is RuleSpacesLeadingTrailing.

  • Multiple Spaces – additional spaces in key term parameters cause unexpected behavior. Rule class is RuleSpacesMultiple.

  • Duplicate Terms – terms that have the same name and the same class are highlighted. Rule class is RuleDuplicateTerms.

  • Orphaned Terms – non-preferred terms that do not relate to a preferred term are highlighted. Rule class is RuleOrphanedNonPreferredTerms.

  • Multiple Use Non-Preferred Terms – non-preferred terms that relate to more than one preferred term are highlighted. Rule class is RuleMultipleUseNonPreferredTerms.

  • Relationship Cardinality – ensure expected related term counts are adhered to. For example, terms of type Country should relate to terms of type Currency via a single Has Currency named relationship. Rule class is RuleCardinalityRelation.

  • Note Cardinality – ensure the expected number of notes are present for a given class type. For example, terms of type Country should have an ISO_2 and ISO_3 note. Rule class is RuleCardinalityNote.
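
As an illustration of what the two spacing rules look for, the following Python sketch flags leading/trailing and repeated whitespace in selected term fields. It does not reproduce the Rule interface or the built-in rule classes; check_spaces and the dictionary form of a term are hypothetical, and the field keys simply borrow names such as term_id and term_name from the rule configuration.

```python
import re


def check_spaces(term, includes=("term_id", "term_name")):
    """Return (field, problem) pairs for every spacing problem found in a term."""
    problems = []
    for field in includes:
        value = term.get(field)
        if not isinstance(value, str):
            continue  # only string-valued fields are checked in this sketch
        if value != value.strip():
            problems.append((field, "leading/trailing spaces"))
        # Look for repeated whitespace inside the value, ignoring the ends.
        if re.search(r"\s{2,}", value.strip()):
            problems.append((field, "multiple spaces"))
    return problems


print(check_spaces({"term_id": "T1", "term_name": " Southern  Asia"}))
print(check_spaces({"term_id": "T2", "term_name": "Asia"}))
```

A name such as " Southern  Asia" would not match the document text "Southern Asia" exactly, which is the kind of unexpected behavior these rules guard against.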

To define additional rules, create a Java bean that implements the com.attivio.ontology.checker.Rule interface.

Ontology validation is accessible from the Ontology UI - see here for more details.

Configuring Validation Rules

<bean id="checkMultipleSpaces" class="com.attivio.ontology.checker.RuleSpacesMultiple">
  <property name="includes">
    <list>
      <value>term_id</value>
      <value>term_name</value>
      <value>term_attribute</value>
      <value>term_note_name</value>
      <value>term_relation_type</value>
      <value>term_relation_name</value>
    </list>
  </property>
</bean>

<bean id="checkLeadingTrailingSpaces" class="com.attivio.ontology.checker.RuleSpacesLeadingTrailing">
  <property name="includes">
    <list>
      <value>term_id</value>
      <value>term_name</value>
      <value>term_attribute</value>
      <value>term_note_name</value>
      <value>term_relation_type</value>
      <value>term_relation_name</value>
    </list>
  </property>
</bean>

<bean id="checkOrphanedNonPreferredTerms" class="com.attivio.ontology.checker.RuleOrphanedNonPreferredTerms" />

<bean id="checkMultipleUseNonPreferredTerms" class="com.attivio.ontology.checker.RuleMultipleUseNonPreferredTerms" />

<bean id="checkDuplicateTerms" class="com.attivio.ontology.checker.RuleDuplicateTerms" />

<bean id="defaultValidationRules" class="com.attivio.ontology.checker.ValidationRules">
  <property name="rules">
    <list>
      <ref bean="checkMultipleSpaces" />
      <ref bean="checkLeadingTrailingSpaces" />
      <ref bean="checkOrphanedNonPreferredTerms" />
      <ref bean="checkMultipleUseNonPreferredTerms" />
      <ref bean="checkDuplicateTerms" />
    </list>
  </property>
</bean>

Contributing Validation Rules

See Extending the Ontology Module for details on how to contribute additional validation rules.

Topic Tagging

The ontology module combines several features to provide automatic tagging of document content. Tagging utilizes entity extraction, defined ontology relationships between terms, and the text in a document to suggest a set of document tags. This data can then be used as part of a document authoring workflow to automatically tag a document being authored or provide suggestions to an author for manual confirmation. It can also be incorporated into the ingest workflow to provide document aboutness information.

  • Tag Ontology Terms – Provides the ability to tag documents based upon the content of the ontology using ontology-driven entity extraction.
  • Calculate Aboutness Score – Defines the Aboutness score of the document through a set of pre-defined tagger rules. These rules are configured as Spring beans in the system configuration. They can be augmented by user-defined rules. Suggested tags can be returned in aboutness score order. In addition to providing aboutness scoring, the rules can also augment located terms with additional, possibly disconnected, ontology terms.
  • Accrue Metadata – Provides the ability to trace relationships through the ontology to provide higher level tagging (for example, find company MyCompany, and automatically add Stock Exchange, Country and Political or Geographical Region).
  • Define Sets of Accrual Metadata – Provides the ability to configure multiple sets of relationship tracing (for example, primary, secondary, etc.). Relationship sets can contain sets of pre-defined relationships to accrue separate linked ontology data.
  • User-Extendable API – Provides capability via a generic API to allow for custom interaction with identified entries (for example, notifications/alerting).
  • Uses Workflow API – Provides access to the tagging function through an Attivio document workflow called suggestTags. The suggestTags workflow is accessible through the standard AIE Java API.

Tagging Configuration

Installing the ontology module automatically creates a set of components and a workflow used by the tagging process.
A new document workflow called suggestTags is defined. The suggestTags workflow adds some existing and new components as shown below.

suggestTags

Component | Description
localeEnglish | Sets the document locale to English, overriding any previously set locale.
standardAnalyzer | Standard tokenization stage. This is a prerequisite for entity extraction.
unicodeNormalizer | Processes standard unicode normalization and transformation steps. This is a prerequisite for entity extraction.
ontologyEntityExtractor | Defines a new ExtractEntities component using a standard DictionaryEntityFinder configured to point to the entity extraction dictionary that is automatically generated by the Ontology feature at startup. See here for more details of configuring dictionary generation.
applyTaggerScore | Calculates the tagger aboutness score and makes any additional modifications as defined by the score configuration.
applyNotifications | Hook for a user-defined notification stage. Write a Java bean that implements the interface com.attivio.ontology.tagging.TermNotifier and configure it as the referenced bean in the applyNotifications component. The bean is called with the input IngestDocument and a Collection of the terms located in the document. It returns a Collection of Strings, which is associated with the tagging result object for use by the process calling the suggestTags workflow.

Excluding Terms from Tagging

You can apply a special note to a term to exclude it from the tagging process (inclusion in the tagging entity extraction dictionary and matching or accruing via relationship). The note name is "Attivio Note Classify" and the value is "Attivio do not classify".

You can configure tagging behavior through the Tagger feature.

The Tagger feature references beans to define:

  • termCache – the ontology term cache, which contains the terms related to a given term. When a term is located, these related terms can be associated with the result as accrued terms.
  • scoreConfig – the tagger score configuration, which defines the scoring and modification rules used to score located terms and to modify the retrieved term lists based on the terms located.

Additional configuration options are defined as follows:

  • types – The class types to include in tagging operations. Omit this parameter to include all class types.
  • relationships – The relationships to follow for a given class type in order to add further term entities to the result payload when a term is identified in the input text. Relationships are assigned to named result buckets using the name attribute. Primary and secondary buckets are created by default. The follow element can nest multiple relationship steps to follow from a located term. The relationship element supports the following attributes:
    • name – Name of the relationship to follow from the located term.
    • qualifier – Optional. Options are:
      • DIRECT – Default. Return directly linked terms.
      • TOP – Return the top-most (broadest) term.
      • HIGHEST – Return the highest (broadest) term of multiple terms that may be linked by the named relationship.
      • ALL – Return all terms in the term hierarchy from the linked term.
    • type – Optional. Limit terms linked via the named relationship to only terms of the specified class type.
    • term – Optional. Limit terms linked via the named relationship to only terms of the specified term name.

The default configuration for the tagger feature bean is defined as follows:

<ont:tagger name="ontologyTagger" ontology="ontology" nodeset="*" workflow="suggestTags">
  <ont:scoreConfig ref="taggerScoreConfig"/>
  <ont:termCache ref="ontologyTermCache"/>
  <ont:types>
    <ont:include class="Country"/>
    <ont:include class="Region"/>
  </ont:types>
  <ont:relationships name="primary">
    <ont:class name="Country">
      <ont:follow>
        <ont:relationship qualifier="HIGHEST" name="Part of Region" />
      </ont:follow>
    </ont:class>
  </ont:relationships>
  <ont:relationships name="secondary">
    <ont:class name="Region">
      <ont:follow>
        <ont:relationship qualifier="ALL" name="Broader Term" />
      </ont:follow>
    </ont:class>
    <ont:class name="Country">
      <ont:follow>
        <ont:relationship name="Part of Region" />
      </ont:follow>
    </ont:class>
  </ont:relationships>
</ont:tagger>
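
As a variation on the default configuration, the following sketch shows the optional qualifier and type attributes on a relationship element. It reuses the Country class type, the Part of Region relationship, and the bean references from the default above. Whether TOP is the right qualifier depends on how a given ontology models its hierarchy, so treat the values as illustrative rather than recommended.

```xml
<!-- Illustrative sketch only: reuses names from the default configuration above. -->
<ont:tagger name="ontologyTagger" ontology="ontology" nodeset="*" workflow="suggestTags">
  <ont:scoreConfig ref="taggerScoreConfig"/>
  <ont:termCache ref="ontologyTermCache"/>
  <ont:types>
    <ont:include class="Country"/>
    <ont:include class="Region"/>
  </ont:types>
  <ont:relationships name="primary">
    <ont:class name="Country">
      <ont:follow>
        <!-- TOP returns the top-most (broadest) term reached via "Part of Region";
             type limits the linked terms to the Region class type. -->
        <ont:relationship name="Part of Region" qualifier="TOP" type="Region"/>
      </ont:follow>
    </ont:class>
  </ont:relationships>
</ont:tagger>
```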

The components available to use in tagging are defined below:

Tagger Score Configuration

<bean id="taggerScoreConfig" class="com.attivio.ontology.tagging.score.ScoreConfiguration">
   <property name="global">
      <util:list>
         <ref bean="globalScore" />
      </util:list>
   </property>
   <property name="field">
      <util:list>
         <ref bean="titleScore" />
         <ref bean="synopsisScore" />
         <ref bean="bodyScore" />
      </util:list>
   </property>
</bean>

<bean id="globalScore" class="com.attivio.ontology.tagging.score.GlobalScoreConfiguration">
   <property name="rules">
      <list>
         <ref bean="scoreThreshold" />
      </list>
   </property>
</bean>

<bean id="scoreThreshold" class="com.attivio.ontology.tagging.score.ThresholdRule">
   <property name="score" value="200" />
</bean>

<bean id="titleScore" class="com.attivio.ontology.tagging.score.FieldScoreConfiguration">
   <property name="field" value="tagger.title" />
   <property name="rules">
      <list>
         <ref bean="titleScoreBase" />
         <ref bean="scoreWordLocation" />
      </list>
   </property>
</bean>

<bean id="synopsisScore" class="com.attivio.ontology.tagging.score.FieldScoreConfiguration">
   <property name="field" value="tagger.synopsis" />
   <property name="rules">
      <list>
         <ref bean="synopsisScoreBase" />
         <ref bean="scoreWordLocation" />
      </list>
   </property>
</bean>

<bean id="bodyScore" class="com.attivio.ontology.tagging.score.FieldScoreConfiguration">
   <property name="field" value="tagger.text" />
   <property name="rules">
      <list>
         <ref bean="bodyScoreBase" />
         <ref bean="scoreWordLocation" />
      </list>
   </property>
</bean>

<bean id="titleScoreBase" class="com.attivio.ontology.tagging.score.BaseScoreRule">
   <property name="score" value="300" />
</bean>

<bean id="synopsisScoreBase" class="com.attivio.ontology.tagging.score.BaseScoreRule">
   <property name="score" value="200" />
</bean>

<bean id="bodyScoreBase" class="com.attivio.ontology.tagging.score.BaseScoreRule">
   <property name="score" value="100" />
</bean>

<bean id="scoreWordLocation" class="com.attivio.ontology.tagging.score.WordStartRule">
   <property name="within" value="0" />
   <property name="score" value="100" />
</bean>

The tagger scoring scheme is defined as a ScoreConfiguration bean with the following properties:

  • global – a list of GlobalScoreConfiguration global score bean references.
  • field – a list of FieldScoreConfiguration field score bean references.
  • accrual – a map of scores for each accrual bucket.
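
The example configuration above sets only the global and field properties. The following is a minimal sketch of how the accrual map might be added. It assumes that the map keys are the bucket names (primary and secondary) and that the values are integer scores. The source does not spell out the key and value types, and the scores 50 and 25 are placeholders.

```xml
<!-- Sketch under assumptions: key/value types of the accrual map are not confirmed. -->
<bean id="taggerScoreConfig" class="com.attivio.ontology.tagging.score.ScoreConfiguration">
   <property name="global">
      <util:list>
         <ref bean="globalScore" />
      </util:list>
   </property>
   <property name="field">
      <util:list>
         <ref bean="titleScore" />
         <ref bean="synopsisScore" />
         <ref bean="bodyScore" />
      </util:list>
   </property>
   <property name="accrual">
      <util:map>
         <!-- Assumed: one entry per accrual bucket; values are placeholder scores. -->
         <entry key="primary" value="50" />
         <entry key="secondary" value="25" />
      </util:map>
   </property>
</bean>
```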

Global Rules

Global rules are listed in the rules property of a GlobalScoreConfiguration bean. A rule can be a scoring rule, a modifying rule, or both.

Available rules are shown below.

LinkedOccurrencesRule

The LinkedOccurrencesRule is a modifying rule that specifies:

When a specified number (minOccurs) of terms of a given type (found) that are related to other terms by the supplied relation (relatedTo) are located in the tagger fields, a specified named term (addName) of type (addType) is added to the term list in the specified location (addLocation).

| Parameter | Type | Description |
|---|---|---|
| found | String | Type of class found by tagger entity extraction. |
| relatedTo | String | How the located entity is related to another named term (via relationship name and term name or type) so that the rule is satisfied. |
| addName | String | Name of the term to add when the rule is satisfied. |
| addType | String | Type of the term to add when the rule is satisfied. Allows differentiation between multiple terms of the same name. |
| minOccurs | Integer | Minimum number of occurrences of a term of this type to locate to satisfy the rule. |
| addLocation | String | Location in the tagger result to which the accrued entity is added. Default: primary. |

In the example below, when at least 2 terms of type Product that link directly to a term named Commodities via the businessArea relationship are located, the term named Commodities of type Division is added to the primary terms list.

<bean id="accrueCommodities" class="com.attivio.ontology.tagging.score.LinkedOccurrencesRule">
   <property name="found" value="Product" />
   <property name="relatedTo" value="businessArea[name=Commodities]" />
   <property name="minOccurs" value="2" />
   <property name="addName" value="Commodities" />
   <property name="addType" value="Division" />
   <property name="addLocation" value="primary" />
</bean>

| Parameter | Type | Description |
|---|---|---|
| score | Integer | Score to add if rule is satisfied. |

WhenFoundRule

The WhenFoundRule is a modifying rule that specifies:

When a specified number (minOccurs) of terms of a given type (found) is located in the tagger fields, a specified named term (addName) of type (addType) is added to the term list in the specified location (addLocation).

| Parameter | Type | Description |
|---|---|---|
| found | String | Type of class found by tagger entity extraction. |
| minOccurs | Integer | Minimum number of occurrences of a term of this type to locate to satisfy the rule. |
| addName | String | Name of the term to add when the rule is satisfied. |
| addType | String | Type of the term to add when the rule is satisfied. Allows differentiation between multiple terms of the same name. |
| addLocation | String | Location in the tagger result to which the accrued entity is added. Default: primary. |

For example, in the configuration below, when one or more terms of type Company are located, the term Equities of type Division is added to the primary term list.

<bean id="accrueEquities" class="com.attivio.ontology.tagging.score.WhenFoundRule">
   <property name="found" value="Company" />
   <property name="minOccurs" value="1" />
   <property name="addName" value="Equities" />
   <property name="addType" value="Division" />
   <property name="addLocation" value="primary" />
</bean>
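
Neither rule example shows where the rule bean is referenced. Since both rules are described under Global Rules, one plausible arrangement is sketched below. It assumes that modifying rules are listed alongside scoring rules in the rules property of the GlobalScoreConfiguration bean. The bean ids come from the examples in this section; the ordering of the rules is an assumption.

```xml
<!-- Sketch under assumptions: modifying rules share the global rules list. -->
<bean id="globalScore" class="com.attivio.ontology.tagging.score.GlobalScoreConfiguration">
   <property name="rules">
      <list>
         <ref bean="scoreThreshold" />
         <!-- Adds Commodities (Division) when 2 or more linked Product terms are found. -->
         <ref bean="accrueCommodities" />
         <!-- Adds Equities (Division) when 1 or more Company terms are found. -->
         <ref bean="accrueEquities" />
      </list>
   </property>
</bean>
```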

Field Rules

Field rules are listed in the rules property of a FieldScoreConfiguration bean. Available rules are:

BaseScoreRule

Class: com.attivio.ontology.tagging.score.BaseScoreRule

If an entity is found in a field, the defined score is added to the score for the entity. This type of rule can provide higher scores for certain field locations, for example, a title or synopsis as opposed to the body text.

| Parameter | Type | Description |
|---|---|---|
| score | Integer | Score to add if rule is satisfied. |

WordStartRule

Class: com.attivio.ontology.tagging.score.WordStartRule

If an entity is located within a specified number of characters of the start of the field, then the defined score is added to the score for the entity.

| Parameter | Type | Description |
|---|---|---|
| score | Integer | Score to add if rule is satisfied. |
| within | Integer | Number of characters from the start of the specified field within which the entity's start position must fall for the score to be added to the entity score. |
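
To illustrate how the two field rules combine, the sketch below scores a field with a base score plus a bonus for entities that start near the beginning of the field. The bean ids bodyScoreNearStart and nearStartBonus, and the values 50 and 25, are hypothetical. The classes, property names, the tagger.text field, and the bodyScoreBase bean are taken from the default score configuration.

```xml
<!-- Hypothetical bean: bonus for entities near the start of the field. -->
<bean id="nearStartBonus" class="com.attivio.ontology.tagging.score.WordStartRule">
   <!-- Entity must start within the first 50 characters of the field. -->
   <property name="within" value="50" />
   <property name="score" value="25" />
</bean>

<!-- Hypothetical bean: applies the base score and the bonus to tagger.text. -->
<bean id="bodyScoreNearStart" class="com.attivio.ontology.tagging.score.FieldScoreConfiguration">
   <property name="field" value="tagger.text" />
   <property name="rules">
      <list>
         <ref bean="bodyScoreBase" />
         <ref bean="nearStartBonus" />
      </list>
   </property>
</bean>
```

With these values, an entity that starts within the first 50 characters of tagger.text would receive 100 + 25 = 125 from this field, and 100 otherwise, assuming rule scores are simply summed as the rule descriptions suggest.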

Contributing Tagger Rules

See Extending the Ontology Module for details of contributing tagger rules.

AutoComplete Module Integration

The AutoComplete module provides drop-down support in the Search UI search box. This section outlines how to implement the AutoComplete module using the ontology module prefix matcher to provide the suggestions.

File Changes

Make the changes described below to the conf\bean\ontologyAutoComplete.xml file.

 

Edit the ontologyAutoComplete provider definition to set the prefix matcher to use, the default number of responses, and the default values of the term match and replace flags.

If it is not already present, add an autocomplete feature (<fac:autocomplete>) in conf\features\autocomplete\AutoComplete.xml, and then add the provider contributed by the ontology module to it.

The JSON response to the autocomplete system includes the term name as both the value and the display name, the term's unique identifier as the property "id", and the term's class type as the property "category".

<project-dir>/conf/bean/ontologyAutoComplete.xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns="http://www.springframework.org/schema/beans"
       xmlns:util="http://www.springframework.org/schema/util"
       xmlns:sec="http://www.springframework.org/schema/security"
       xsi:schemaLocation="
         http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
         http://www.springframework.org/schema/util http://www.springframework.org/schema/util/spring-util.xsd
         http://www.springframework.org/schema/security http://www.springframework.org/schema/security/spring-security-3.1.xsd">
  <bean name="ontologyAutoComplete" class="com.attivio.ontology.beans.OntologyAutoCompleteProvider">
    <property name="matcherName" value="ontologyPrefixMatcher"/>
    <property name="size" value="5"/>
    <property name="defaultMatch" value="true"/>
    <property name="defaultReplace" value="true"/>
  </bean>
</beans>
<project-dir>/conf/features/autocomplete/AutoComplete.xml
<?xml version="1.0" encoding="UTF-8"?>
<ff:features xmlns:ff="http://www.attivio.com/configuration/config"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xmlns:fac="http://www.attivio.com/configuration/features/autocomplete"
             xsi:schemaLocation="
               http://www.attivio.com/configuration/config http://www.attivio.com/configuration/config.xsd
               http://www.attivio.com/configuration/features/autocomplete http://www.attivio.com/configuration/features/autocompleteFeatures.xsd">
  <fac:autocomplete enabled="true">
    <fac:provider name="facetProvider" ref="fvp"/>
    <fac:provider name="dictionaryProvider" ref="dvp"/>
    <fac:provider name="ontologyProvider" ref="ontologyAutoComplete"/>
  </fac:autocomplete>
</ff:features>
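
As a variation, the provider definition below returns up to 10 suggestions and matches non-preferred terms without replacing them with their preferred synonyms. This is a sketch that assumes defaultMatch and defaultReplace correspond to the Match Non-preferred Terms and Replace Non-preferred Terms behaviors of the prefix matcher. Only the property values differ from the definition above.

```xml
<!-- Sketch under assumptions: flag semantics inferred from the Lookup tab controls. -->
<bean name="ontologyAutoComplete" class="com.attivio.ontology.beans.OntologyAutoCompleteProvider">
  <property name="matcherName" value="ontologyPrefixMatcher"/>
  <!-- Default number of suggestions returned. -->
  <property name="size" value="10"/>
  <!-- Match synonyms (non-preferred terms) as well as preferred terms. -->
  <property name="defaultMatch" value="true"/>
  <!-- Assumed: false keeps the matched synonym instead of substituting the preferred term. -->
  <property name="defaultReplace" value="false"/>
</bean>
```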

Ontology User Interface

The Ontology User Interface is accessible from the Modules section of the Attivio Administrator. It provides an interface for browsing, searching, and testing the currently loaded ontology.

The interface lets you:

  • view ontology info, terms by type, and term hierarchies
  • visualize the ontology structure
  • retrieve and inspect terms by id, name, attribute, note name/value, relationship type/name, and by term name look-up.

Setting the Default Ontology

The Ontology User Interface connects to only one ontology at a time. By default, it connects to the ontology named ontology.

To connect to a different ontology, set the default.ontology.name property in the <project-dir>\conf\<project-name>.properties file.

default.ontology.name=myOntology

Info Tab

The Info tab offers general statistics describing the ontology, plus browsing tools for exploring the classes and hierarchy of the ontology.

Info Subtab

The Info subtab summarizes descriptive statistics about the loaded ontology.

InfoInfoTab

Classes Subtab

The Classes subtab provides an alphabetical list of classes and instances in the ontology. Select a class (country or region in the image), and then browse to individual instances of that class.

InfoClassesTab

Hierarchy Subtab

The Hierarchy subtab lets you browse a hierarchy of classes down to the leaf-node instances.

InfoHierarchyTab

By ID Tab

The By ID tab lets you look up a term by its ID number. Type the ID number into the search field and click the Search button.

This must be the literal ID number. Wildcards are not supported.

Results Subtab

The Results subtab shows the list of terms that match the search criteria. In this case, there is never more than one matching term.

ByIDResultsTab

Details Subtab

The Details subtab displays the details of the selected term.

ByIDDetailsTab

Source Subtab

The Source subtab shows the source code for the term, which often includes more details than are provided by the Details display.

ByIDSourceTab

By Name Tab

The By Name tab lets you search for a term by name. Enter the name in the field and click the Search button. The name must match exactly.

The Results, Details, and Source subtabs are similar to those of the By ID tab.

ByNameResultsTab

By Attribute Tab

The By Attribute tab lets you look up a term by one of its attributes. Type the name of the attribute in the search field and click the Search button. It must match exactly.

ByAttriuteDetailsTab

The Results, Details, and Source subtabs are similar to those of the By ID tab.

By Note Tab

The By Note tab lets you look up a term by note name and value, for example, the note REGION_NAME with the value "Africa".

ByNoteResultsTab

The Results, Details, and Source subtabs are similar to those of the By ID tab, except that the Results list can have multiple entries.

By Relationship Tab

The By Relationship tab lets you search for terms that have a specific relationship to a specific term. For example, the illustration shows terms for which a "broader term" would be "regions."

You can search by relationship name or by relationship type.

ByRelationshipResultsTab

The Results, Details, and Source subtabs are similar to those of the By ID tab, except that the Results list can have multiple entries.

Lookup Tab

The Lookup tab lets you locate all terms that begin with a specific prefix.

LookupResultsTab

| Control | Description |
|---|---|
| Term | Type in the search string. It matches all terms that contain a word beginning with that string. For example, searching for "G" matches both "Germany" and "French Guiana." |
| Types | You can narrow results by selecting a class of terms. |
| Filter | You can narrow results by specifying the value of a relationship. Generally this is a "part of" or "broader term" value. In this example, specifying "Europe" as a filter narrows the list of results to terms that are part of the "Europe" region. |
| Max | Maximum number of matching results to display. The default (-1) returns 100 matches. |
| Match Non-preferred Terms | Include non-preferred terms in the search. This broadens the search to include synonyms. |
| Replace Non-preferred Terms | If AIE matches a non-preferred term, it substitutes the preferred synonym in the search results. In the example shown above, the search for "G" matched "Gudija," which is a non-preferred synonym of "Belarus." AIE substituted Belarus in the results list. |

Topic Tagging Tab

The Topic Tagging tab sends data to the suggestTags workflow to test the currently configured Tagger settings and to return data (entities) found in the text, or accrued from the tagger configuration. See here for details of the Tagging capability.

ClassificationTab

Validation Tab

The Validation tab runs sets of validation rules against the currently loaded ontology. These rules scan the ontology for common errors that break relationship links between terms.

ValidatorResultsTab

| Control | Description |
|---|---|
| Ruleset | Choose a set of validation rules to run against the loaded ontology. |
| checkMultipleSpaces | This rule checks the terms of the ontology for values that contain two or more adjacent spaces. |
| checkLeadingTrailingSpaces | This rule looks for terms containing leading or trailing white space in the term name. |
| checkOrphanedNonPreferredTerms | Non-preferred terms are synonyms of preferred terms. This rule locates non-preferred terms that have no preferred synonym. |
| checkDuplicateTerms | This rule locates terms that have the same name and belong to the same class. |
| Check Ontology | Click this button to run the ruleset against the ontology. |
| Reset | Clear the input fields of this page. |
| Results | This is the list of terms that were found by the validation rules. |

Ontology Manager

The Ontology Manager lets you load, reload, delete and monitor ontologies in AIE.

OntologyManager

| Control | Description |
|---|---|
| Reload All Ontologies | Reload all ontologies to their latest version. |
| Cleanup Ontologies | Remove data for orphaned ontologies. |
| Get Current Version | Retrieve the current version of a named ontology. |
| List All Versions | List all versions available for a named ontology. |
| Load Derived Artifacts | Reload the derived artifacts for a named ontology. |
| Get Load Statistics | Retrieve ontology load statistics. |
| Load to Version | Load a named ontology to a specified version. |
| Update to Version | Update a named ontology to a specified version. |
| Delete Ontology Versions | Delete a range of versions of a named ontology, from a start version to an end version, inclusive. |

Extending the Ontology Module

You can extend various components of the ontology module through user-defined code contributions. Details for extending the ontology module are available here.

Copyright (c) 2012 Attivio Inc. All rights reserved.
