Overview
The Attivio Intelligence Engine (AIE) supports document classification and sentiment analysis using a statistical classification engine. Statistical classification is an application of machine learning: a statistical model is trained on training data that has been manually labeled. In general, the quality of a statistical model is highly dependent on the quality of its training data.
This guide requires installation of the sdk and classifier modules.
Main Article: Create a New Project
Building a Classification Model
Building a classification model is accomplished by feeding labeled documents (the training data) to an AIE workflow that includes an instance of the BuildClassificationModel transformer. When a document is received by the model-building component, the component extracts the expected label for the document and saves selected fields from the document. When the model-building component receives a "commit" message, all of the saved fields for each label are analyzed, and a statistical summary of the inputs is written out to the model URI. The result of this operation is the classification model. It is not possible to look inside a model; it is a black box.
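For orientation, the sketch below shows roughly what a model-building ingest workflow can look like. The element names follow common AIE workflow conventions, but the exact structure in your project may differ; the trainClassificationModel component itself is configured later in this guide.

```xml
<!-- Hedged sketch only: a minimal ingest workflow that routes labeled training
     documents through the model-building component. Check your project's actual
     workflow configuration for the exact element names and attributes. -->
<workflow name="trainClassificationModel" type="ingest">
  <!-- The component is defined separately (see the Example Configuration section). -->
  <documentTransformer name="trainClassificationModel" />
</workflow>
```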
If cross-validation is enabled, the training data is also divided into n bins; n different models are built, each time leaving out one of the bins, and the model accuracy is measured on the data in the held-out bin. The accuracy is reported in the AIE log, and may also be obtained through JMX inspection of the model-building component. Although this is computationally expensive, the result is a good estimate of the accuracy of the model which is built from the entire training set.
Cross-validation avoids the need for separate “training” and “test” sets by splitting up the training data internally. Some researchers used to report separate accuracy measurements for the training data and the test data, but we now understand that the accuracy on the training data is always higher than on “held-out” data. Note that if you have duplicates in the training set (or near-duplicate documents – or even if you feed in the same data twice without restarting), you can create a situation in which our system estimates a higher cross-validation accuracy than you will actually observe on novel data.
AIE also reports (via log messages) what is known as a “confusion matrix." Suppose you have categories such as “Science”, “Biology”, and “Chemistry” (as well as “English” and “Math”). It is likely that a large number of “Biology” documents will be incorrectly classified as “Science." The confusion matrix reports how many documents of each class are mis-categorized as each other class. This can be a useful guide when you have 20 or more categories and you are trying to reduce the error rate by merging categories.
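As a purely hypothetical illustration (the counts below are invented for this example, not taken from any real run), a confusion matrix for the categories above might look like this, with rows giving the true category and columns the predicted category:

True \ Predicted | Science | Biology | Chemistry |
---|---|---|---|
Science | 80 | 12 | 8 |
Biology | 25 | 70 | 5 |
Chemistry | 10 | 6 | 84 |

In this invented example, a large share of "Biology" documents are predicted as "Science," which suggests that merging those two categories would reduce the error rate.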
It is common to build a model in one workflow and to use it in another workflow. The BuildClassificationModel component can notify other components whenever a model is successfully built. See Using Classification Models for information on the components which use these models.
How to Create Training Data
A training set is a set of documents along with a label for each document. Creating a data set can be challenging, but paying attention to details can make a dramatic difference in the accuracy of the resulting model.
Like most statistical classifiers, the classification engine in AIE works best when given a large amount of training data, when the data assigned to different labels are as dissimilar as possible, and when the documents to be classified bear a strong resemblance to the documents used for training.
Since classification is usually more accurate when trained with more documents, Attivio recommends training models using at least 100 documents per category, and using 1000 documents per category if at all possible. It is generally a good idea to have approximately the same number of documents in each category in the training set.
Similarly, classification is usually more accurate when trained with a small number of easily-distinguished categories. Attivio recommends using a small number (fewer than 20) of dissimilar categories, and combining similar categories (such as "Business" and "Finance") if possible.
Finally, although it may seem obvious that the documents to be classified need to be similar to the documents used for training, there are many ways in which the target documents can be subtly unlike the documents used for training. For example, if the training data consists of news articles (in categories such as "Finance" or "Sports") gathered in the wintertime, classification will probably be more accurate on "winter" sports articles (such as ice hockey) than on "summer" sports articles such as baseball or football.
Ideally, the data for training can be created by taking a random sample from a website, CMS, or other repository of documents which have been previously manually classified. If classification is to be performed on email, the training set should ideally be created from similar email.
Recap: Training Data Rules of Thumb
- Choose a small number (no more than 20) of categories.
- Merge similar categories whenever possible.
- Choose at least 100, and ideally 1000 or more, training documents per category.
- Make sure the training documents are similar to the documents which will be classified automatically.
- Train a model with cross-validation enabled. Check the cross-validation accuracy and error statistics in the log file.
Example Configuration
The optional configuration file train-classification-demo.xml contains the following configuration for the trainClassificationModel component:
```xml
<component name="trainClassificationModel" class="com.attivio.platform.transformer.ingest.document.BuildClassificationModel">
  <properties>
    <property name="modelName" value="${classification.defaultModelFileName}" />
    <property name="categoryFieldName" value="cat" />
    <map name="stopWordDictionaries">
      <property name="en" value="dictionaries/big_stopwords_en.csv" />
    </map>
    <property name="lowerCase" value="true" />
    <property name="crossValidationBins" value="10" />
    <!-- <property name="explaining" value="true" /> -->
    <!-- <property name="modelExplanationName" value="classifier/models/model-explanation.txt" /> -->
  </properties>
</component>
```
There are two parts to this configuration. The first part specifies how to find the label for each training document, and how the saved fields are analyzed. The second part specifies where the classification model is to be saved (the model name), the number of cross-validation bins, and the endpoints to be notified, if any.
Category Field Name and Field Analysis Parameters
Parameter | Value | Description |
---|---|---|
categoryFieldName | String (default: "cat") | The category field name. Each document in the training set must be labeled with the appropriate category (a string value) in this field. |
useAllNaturalLanguageFields | Boolean (default: true) | When true, all words in all natural language fields in the document are included in the saved fields. When false, only the words found in fields explicitly selected by the fields parameter are saved. |
fields | List | Specifies the list of fields to use when building the model when useAllNaturalLanguageFields is false. |
stopWordDictionaries | List | Specifies the stopword dictionaries (by language). Stop words are ignored when building the model. |
lowerCase | Boolean (default: false) | If true, all words are forced to lower case before being added to the model. |
explaining | Boolean (default: false) | If true, enables a human-readable version of the classification model to be saved. |
modelExplanationName | String | If explaining is true, this is the location the model explanation is saved to. |
If the ClassifyDocument transformer that uses this model does not have exactly the same field analysis settings as the ones used to build it, classification will silently perform poorly.
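As a hedged illustration of what "the same settings" means in practice, a ClassifyDocument configuration that pairs with the example above could repeat the field analysis properties verbatim. The component name and the assumption that ClassifyDocument accepts these exact parameters are not confirmed here; see Using Classification Models for the authoritative configuration.

```xml
<!-- Hedged sketch: the field analysis settings mirror those used to build the model.
     The component name, class, and parameter set for ClassifyDocument are assumptions;
     see Using Classification Models for the real configuration. -->
<component name="classifyDocument" class="...">
  <properties>
    <map name="stopWordDictionaries">
      <property name="en" value="dictionaries/big_stopwords_en.csv" />
    </map>
    <property name="lowerCase" value="true" />
  </properties>
</component>
```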
All fields used when building a classification model need to be tokenized. If any of the fields used in field analysis is added after tokenization, a warning will be printed while building the classification model. Main article: Ingestion Tokenization.
Model Parameters
Parameter | Value | Description |
---|---|---|
modelName | String (no default) | URI of the model to be saved, either acs://contentStoreName/name (required for a multi-node project) or a filename. |
numExecutionThreads | int (default: -1) | Number of parallel threads used to build models. |
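For reference, the two accepted forms of modelName look like this; the store and path names below are placeholders for illustration, not values shipped with the product:

```xml
<!-- Placeholder values only: illustrating the two modelName forms described in the table above. -->
<property name="modelName" value="acs://myContentStore/classifier/my-model" /> <!-- content store URI (required for multi-node projects) -->
<property name="modelName" value="classifier/models/my-model" />               <!-- plain filename (single-node projects) -->
```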
Building a Sample Classification Model
The main classifier configuration file conf/classifier/module.xml should not be loaded for this example. One way to accomplish this is to create a project which does not include the classifier module.
To build a classification model:
- Download the sample training data from classifier-tiny-training-data.zip. This data is a subset of the popular 20 newsgroups data set. This data consists of 100 cleaned articles (all headers have been removed) from each of the "sci.crypt" and "alt.atheism" newsgroups.
- Unpack the training data into a new directory, such as example/classifier-tiny-training-data. This will create two directories, sci.crypt and alt.atheism.
- Copy the file conf/classifier/train-classification-demo.xml into your project directory, and edit it so that the extractCategoryFromZipFile document transformer is commented out and the extractCategoryNameFromDirectory document transformer is uncommented.
- Enable (remove the comment on) the last definition for the classification.defaultModelFileName property in the attivio.classification.properties file (labeled "Enable this line when building new models"); an illustrative version of this line is shown after these steps.
- Restart AIE, making sure to include train-classification-demo.xml in the set of configuration files.
- Ingest the training data (Main article: Loading File Content), sending it to the trainClassificationModel workflow.
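The "Enable this line when building new models" entry mentioned above supplies the model name that the trainClassificationModel component resolves through ${classification.defaultModelFileName}. It might look something like the following; the exact line and value are defined in attivio.classification.properties, so treat this purely as an illustration:

```properties
# Illustration only - uncomment the corresponding line in attivio.classification.properties.
classification.defaultModelFileName=new-classification.model
```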
At the end of processing, the BuildClassificationModel component will perform cross-validation, build a final classification model, and write the model out to the file specified in the modelName parameter (the default value of which is "new-classification.model"). You should observe that the model file was created, and that the cross-validation accuracy reported in the log file is over 90%.