Overview
The Attivio Intelligence Engine (AIE) supports document classification and sentiment analysis using a statistical classification engine. Statistical classification technology is an application of machine learning. Statistical models are trained using training data which has been manually labeled. See Building Classification Models for instructions on building models. A classification model is a black box which accepts a document as input and computes a score for each possible output label. The set of output labels is flat, not hierarchical. Unless the training set explicitly includes a miscellaneous category, there is no "none-of-the-above" label – the classifier always chooses the category which best matches each document, even when no category is a good match at all.
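The absence of a "none-of-the-above" label can be illustrated with a toy scorer. This is a minimal sketch, not the AIE engine: the word weights in TOY_MODEL, the function names, and the additive scoring rule are all hypothetical stand-ins for whatever a trained model actually computes. It only shows that picking the best-scoring label always yields some label, even when every score is zero.

```python
# Hypothetical per-label word weights. A real model learns its statistics
# from manually labeled training data; these numbers are made up.
TOY_MODEL = {
    "ForSale": {"sale": 2.0, "price": 1.5, "used": 1.0},
    "Jobs": {"hiring": 2.0, "salary": 1.5, "resume": 1.0},
}


def score_labels(words, model=TOY_MODEL):
    """Return a score for every label: the sum of the weights of matching words."""
    return {
        label: sum(weights.get(word, 0.0) for word in words)
        for label, weights in model.items()
    }


def best_label(words, model=TOY_MODEL):
    """Pick the highest-scoring label. There is no way to answer 'none'."""
    scores = score_labels(words, model)
    return max(scores, key=scores.get)


on_topic = ["used", "bike", "for", "sale"]
off_topic = ["recipe", "for", "lentil", "soup"]

print(score_labels(on_topic), best_label(on_topic))
print(score_labels(off_topic), best_label(off_topic))
```

The second document matches no category at all, yet it still receives a label. Adding a miscellaneous category to the training set is the way to give such documents somewhere to go.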
Required Modules
These features require that the classifier module be included when you run createproject to create the project directories. This is an Add-on Module that requires special permission to acquire.
See also: Pattern-Based Categorization
AIE also offers Pattern-Based Categorization, which can be regarded as a simple form of classification. Pattern-based categorization uses regular expressions (regex matches) to assign a document to a category.
Configuring ClassifyDocument
Document classification is performed in AIE by the ClassifyDocument ingest transformer component. When a document is received by the ClassifyDocument transformer, the transformer converts some fields in the document into a "bag of words". The "bag of words" for the document is compared against the word statistics which were created during the training process for each label in the classification model. Depending on the transformer settings, the category label and score for zero or more labels are output to a configurable output field.
Example Configuration
This is the default configuration for the classifyDocument transformer:
<component name="classifyDocument" class="com.attivio.platform.transformer.ingest.document.ClassifyDocument">
  <properties>
    <property name="modelName" value="${classification.defaultModelFilename}" />
    <map name="stopWordDictionaries">
      <property name="en" value="dictionaries/big_stopwords_en.csv" />
    </map>
    <property name="lowerCase" value="true" />
    <property name="output" value="${classification.outputField}"/>
    <property name="outputScore" value="${classification.outputScoreField}"/>
    <property name="outputMode" value="topN"/>
    <property name="outputNum" value="1"/>
    <property name="explaining" value="false" />
    <property name="explanationFieldname" value="cat.explain" />
    <property name="explanationScoreFieldname" value="cat.explain.score" />
    <property name="explanationLength" value="5" />
  </properties>
</component>
There are three parts to this configuration. The first part specifies the URI of the classification model to be loaded. The second part specifies how the "bag of words" is computed, and the third part specifies which category labels should be output.
Model Parameters
Parameter | Value | Description |
---|---|---|
modelName | String (no default) | URI of the model to be loaded, either acs://contentStoreName/path or a filename. |
Bag-of-Words Parameters
Parameter | Value | Description |
---|---|---|
stopWordDictionaries | List | Specifies the list of stopword dictionaries (by language). Stop words are ignored when building the bag of words. |
lowerCase | Boolean (default: false) | If true, all words are forced to lower case before being added to the bag of words. |
useAllNaturalLanguageFields | Boolean (default: true) | When true, all words in all natural language fields in the document are included in the bag of words. When false, only the words found in fields explicitly selected by the fields parameter are included. |
fields | List | Specifies the list of fields to use when building the bag of words when useAllNaturalLanguageFields is false. |
If the bag-of-words parameters do not have exactly the same settings as they did when the model was trained, the classification models will silently do a poor job of classifying documents.
All fields used by the classifier need to be tokenized. If any of the fields used in computing the "bag of words" is added after tokenization, a warning will be printed during classification. Main article: Ingestion Tokenization.
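To see why mismatched bag-of-words settings degrade results without raising any error, consider the following sketch. It is not AIE's tokenizer: the regular-expression word splitter, the function name bag_of_words, and the sample document are assumptions made for illustration. The parameters loosely mirror fields, stopWordDictionaries, and lowerCase from the table above.

```python
import re
from collections import Counter


def bag_of_words(doc, fields, stop_words, lower_case):
    """Count the words in the selected fields, skipping stop words.

    A plain regex split stands in for real ingestion tokenization.
    """
    bag = Counter()
    for name in fields:
        for word in re.findall(r"\w+", doc.get(name, "")):
            if lower_case:
                word = word.lower()
            if word not in stop_words:
                bag[word] += 1
    return bag


doc = {"title": "The Senate Vote", "text": "the vote passed in the Senate"}
stop_words = {"the", "in"}

# Settings that match a model trained with lowerCase = true.
trained_like = bag_of_words(doc, ["title", "text"], stop_words, lower_case=True)

# Same document, but lowerCase = false: "Senate" and "Vote" become different
# words from the lower-cased ones the model saw in training, and "The"
# slips past the lower-case stop word list.
mismatched = bag_of_words(doc, ["title", "text"], stop_words, lower_case=False)

print(trained_like)
print(mismatched)
```

Both calls succeed, but the second bag shares few words with the statistics collected at training time, so the scores computed from it would be unreliable.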
Output Parameters for ClassifyDocument
Parameter | Value | Description |
---|---|---|
output | String (default: "cat") | When a category label is output, it is written to this field. |
outputScore | String (default: "cat.score") | When a category score is output, it is written to this field. |
outputMode | String | One of "topN", "byThreshold", or "byTotalScore". |
outputNum | Integer (default: 1) | The maximum number of labels which will be output when outputMode = "topN". |
outputScoreThreshold | Float (default: 0.5) | The minimum score for a label to be output when outputMode = "byThreshold". |
outputScoreSum | Float (default: 0.5) | When outputMode = "byTotalScore", labels are output in order of decreasing score until the sum of the scores exceeds this value. |
explaining | Boolean (default: false) | If true, turns on the "explanation" feature of the classifier, so the document keeps a record of the words and phrases that contributed most to its label. |
explanationFieldname | String (default: "cat.explain") | The explanatory words and phrases (up to the configured explanation length) are written to this field. |
explanationScoreFieldname | String (default: "cat.explain.score") | For each explanatory word or phrase, its contribution to the raw output score is written to this field. |
explanationLength | Integer (default: 5) | The maximum number of top-weighted contributors to add to the explanation. |
The explanation feature can be turned on at any time, with any classification model. You don't need to train a special explanation model.
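The three outputMode values described in the output parameter table can be sketched as a selection over per-label scores. This is an illustrative reading of the parameter descriptions, not AIE's implementation. The function name select_labels and the sample scores are hypothetical, and two details are assumptions: that "byThreshold" keeps scores greater than or equal to the threshold, and that "byTotalScore" includes the label that pushes the running sum past outputScoreSum.

```python
def select_labels(scores, output_mode="topN", output_num=1,
                  output_score_threshold=0.5, output_score_sum=0.5):
    """Return the (label, score) pairs to output, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

    if output_mode == "topN":
        # At most output_num labels.
        return ranked[:output_num]

    if output_mode == "byThreshold":
        # Assumption: a score equal to the threshold is kept.
        return [(label, s) for label, s in ranked if s >= output_score_threshold]

    if output_mode == "byTotalScore":
        # Assumption: the label that crosses the sum is itself output.
        selected, total = [], 0.0
        for label, s in ranked:
            selected.append((label, s))
            total += s
            if total > output_score_sum:
                break
        return selected

    raise ValueError(f"unknown outputMode: {output_mode!r}")


scores = {"Politics": 0.6, "Finance": 0.3, "Sports": 0.1}

print(select_labels(scores, "topN", output_num=1))
print(select_labels(scores, "byThreshold", output_score_threshold=0.25))
print(select_labels(scores, "byTotalScore", output_score_sum=0.8))
print(select_labels(scores, "byThreshold", output_score_threshold=0.9))
```

The last call returns an empty list, which is how a threshold-based mode can output zero labels for a document even though the model itself always ranks every category.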
Installing Classification Models
Attivio recommends creating models which are trained using the same kind of data as the data which you are going to feed to the classifyDocument transformer. See Building Classification Models for instructions on building customized models.
For demonstration and rapid prototyping purposes, Attivio provides several sample classification models. These models may be downloaded from our site, but are unlikely to be appropriate for use in production.
Sample Classification Models
- classifieds-classification.model classifies classified ads into one of 4 categories: "ForSale", "Housing", "Jobs", and "Services".
- news-classification.model classifies news articles into one of 7 categories: "Computing", "Finance", "Lifestyle", "Media", "Politics", "Science", and "Sports".
- empty-classification.model should never output any categories. Sometimes a "no-op" model is useful for testing configurations before a final model has been built.
Notes on all classification models...
- The sample models were trained on texts written in (modern) English. The ability of the models to classify text decreases rapidly as the text diverges from the training data. The existing models will not work well on politics, general business content, or Twitter. Texts written by British authors, or written more than 20 years ago, can be expected to use different terms. Obviously, the existing classification models will not work at all on content in languages other than English.
Installing a Sample Classification Model
To install and use a sample classification model:
1. Download the model using one of the links above.
2. Copy the model to <install_dir>\conf\classifier\models.
3. Override the definition of classification.defaultModelFileName in the attivio.classification.properties file in <install_dir>\conf\classifier to point to the model installed in step 2. (The file is set up so that you can just comment and uncomment two lines.)
4. Restart AIE on your project. Make sure the project uses the classification module.
5. Ingest some documents (Main article: Loading File Content) using the FileConnector and pointing to the textFileIngest workflow. Sample documents may be found in the <install_dir>\example\classification\classifiedsCategorizationDemo\conf\sampleInput and <install_dir>\example\classification\newsCategorizationDemo\conf\sampleInput directories. Since the classifier module has been added to the ingestion workflow, it isn't necessary to use a different workflow to get classification added to ingested documents.
Once the documents are ingested, search for documents using queries such as "cat:ForSale" (with the classifieds model) or "cat:Politics" (with the news model). When using SAIL for these searches, it may be helpful to open the Search Options dialog and check the Debug checkbox. This will display all document fields for inspection.