Pattern-based categorization (formerly known as Rule-Based Categorization) uses regex rules to assign incoming documents to categories. These categories are then often used to build facet lists. This is not technically an "entity extraction" or "classification" tool, although it superficially resembles both.
This feature requires that the entityextraction module be included when you run createproject to create the project directories.
This page refers to AIE's Rule-Based Categorization feature. This feature uses tools from the entityextraction module, but it is not a form of Entity Extraction per se. It is also not a form of Classification, although it bears a superficial similarity.
Rule-based categorization assigns a document to one or more categories based on regular-expression matches against the document's content. The regular-expression patterns are written by the user and are language-independent.
Before You Begin
Ensure that the environment is prepared as follows:
1. Add a new line to the <install-dir>\conf\factbook\module.xml file to import sampleCategorization.xml to the project. The imported file contains the XML configuration of the component, workflow, and connector that are used in this example.
2. Create a new project based on the Quick Start Tutorial that includes the demo module group, the entityextraction module, and the factbook module. For Windows, the createproject command looks like this:
3. Start AIE using the AIE Agent and its Command-Line Interface (CLI):
Run the Agent in a Command Window:
Run the Command-Line Interface in a second Command Window. Note that the CLI is invoked for a specific project.
- To run the project use the start all command in the Command-Line Interface:
Setting Up the Target Files
This example reads three text files and categorizes them. The files are provided in your <install-dir>\conf\factbook\content\sampleCategorization directory. They are ready to use. You don't have to set anything up, but you might want to open one of the files and examine the contents.
A fictional software company, Acme Inc., employs a custom document review "workflow" where a document can have one of four status values:
All Acme documents are categorized by review status.
The status is simply a text notation within the body of the file, like this:
This example uses the IngestDocument's cat and text fields as defined in the default AIE Schema, which is present in the example project as <project-dir>\conf\schema\default.xml. We will run a regex rule against the document's text field, looking for the document's status. Then we'll copy the status into the cat field. Since the cat field is preconfigured in the AIE Schema as a facetable field, the result will be a facet list of status values on the search page.
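The flow described above (scan the text field, find the status, copy it into the cat field) can be sketched outside of AIE in a few lines of Python. This is only an illustration of the idea, not AIE code: the documents are plain dictionaries, and the "Status: <word>" notation, the status values, and the categorize function are all assumptions made for the example.

```python
import re

# Hypothetical status notation; the real sample files may format it differently.
STATUS_RULE = re.compile(r"Status:\s*(\w*)")


def categorize(doc):
    """Copy the first status found in doc['text'] into the multi-valued doc['cat']."""
    match = STATUS_RULE.search(doc.get("text", ""))
    # Skip documents with no status label, or a label followed by no word.
    if match and match.group(1):
        doc.setdefault("cat", []).append(match.group(1))
    return doc


# Stand-ins for IngestDocuments; contents and status values are invented.
docs = [
    {"text": "Release notes.\nStatus: Draft\n"},
    {"text": "Install guide.\nStatus: Approved\n"},
    {"text": "No status line here."},
]

categorized = [categorize(d) for d in docs]
for d in categorized:
    print(d.get("cat", []))
```

A document without a status label is left without a cat value, so it would simply not appear under any category in the facet list.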
Define a Custom File Connector
The connector is preconfigured for you in the sampleCategorization.xml file. You can view it in the Connector Editor of the AIE Administrator. Navigate to System Management > Connectors, and click on the sampleCategorization connector.
As you can see from the editor, this is a File Connector that will read text files from the appropriate directory, and will then route the resulting IngestDocuments to the categorizeVersions workflow.
Define a Custom Workflow
The categorizeVersions workflow is also preconfigured for you in the sampleCategorization.xml file. You can view it in the Workflow Editor of the AIE Administrator. Navigate to System Management > Workflows > Ingest, and click on the categorizeVersions workflow.
As you can see, the categorizeVersions workflow calls the ruleBasedCategorizer component before sending the IngestDocuments to the ingest workflow for indexing.
Define a Custom Component
AIE offers two field transformers that can be used as categorization components. These are ExtractPatterns and ExtractRegexPatterns. Examples of both are demonstrated in this section.
- ExtractPatterns is the more powerful and more complex tool, but we plan to deprecate it.
- ExtractRegexPatterns is the newer transformer. It does most of the things that users need while being much easier to configure.
For this simple example we can use either transformer interchangeably.
This is where the story gets interesting. The ruleBasedCategorizer component is also preconfigured for you in the sampleCategorization.xml file.
Note that createproject flattens out this definition into two project files: <project-dir>\conf\components\ruleBaseCategorizer.xml and <project-dir>\conf\bean\mySampleRuleBasedCategorizer.xml.
The component is based on the ExtractPatterns ingestion transformer from the entityextraction module. It uses an ExtractPatternsInfo bean that is configured to scan the text field of the IngestDocument, find the status label, and then copy the status string into the cat field.
The ExtractPatternsInfo bean within the mappings property serves as a configurable rule that determines the outcome of the categorization. The first regular expression "group" found will automatically be copied to the output field. In this case this group is represented by (\w*) in the regex pattern. (Note that the parentheses delimit the "group" and are required.)
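To see why the parentheses matter, here is a small Python sketch of how a capturing group separates the status word from the rest of the match. The "Status:" prefix and the sample text are assumptions for illustration; only the (\w*) group comes from the rule described above. AIE itself evaluates its patterns in Java, but group numbering works the same way for a simple pattern like this one.

```python
import re

# Invented sample text; the real files may use a different notation.
text = "Reviewed by QA.\nStatus: Draft\n"

# With parentheses: group 1 holds only the status word.
with_group = re.search(r"Status:\s*(\w*)", text)
print(with_group.group(0))  # the whole match, including the "Status:" label
print(with_group.group(1))  # only the captured status word

# Without parentheses the pattern still matches, but there is no group 1
# to copy into the output field.
without_group = re.search(r"Status:\s*\w*", text)
print(without_group.groups())  # empty tuple: no capturing groups
```

Because \w* also matches zero characters, a label with nothing after it would yield an empty group; a stricter rule could use \w+ instead.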
The ExtractPatternsInfo bean takes the following configuration options:
Specifies the name of the document field to input to the Regex Extractor. Input fields can also be specified as a list of subelements in the form <entry value="input1" />.
Specifies the field into which the category values are saved.
Holds an individual regular expression to categorize the input text. The first regex "group" in the pattern will be automatically copied to the output field.
We have also provided a component based on the ExtractRegexPatterns transformer, called ruleBasedCategorizer2. This is a simpler tool that can be configured directly in the AIE Administrator.
It is also preconfigured for you in the sampleCategorization.xml file.
Navigate to System Management > Palette and search for ruleBasedCategorizer2. Click on it to open the Component Editor.
You may specify as many input fields as you like. Regex patterns are paired with output fields. All patterns will be applied to all input fields. Matching text will be written to the various output fields. If a pattern matches against multiple input fields, it will generate multiple values in its output field.
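The pairing of patterns with output fields can be sketched as follows. This is a rough Python model of the behavior described above, not the transformer's implementation: the extract_regex_patterns function, the title and owner fields, and the sample values are hypothetical, and the sketch assumes only the first match per input field is taken, which the real transformer may not do.

```python
import re


def extract_regex_patterns(doc, input_fields, rules):
    """Apply every (pattern, output_field) rule to every input field."""
    for pattern, output_field in rules:
        regex = re.compile(pattern)
        for field in input_fields:
            match = regex.search(doc.get(field, ""))
            if match:
                # Group 1 of each match becomes one value in the output field.
                doc.setdefault(output_field, []).append(match.group(1))
    return doc


# Hypothetical document with two input fields.
doc = {
    "title": "Status: Draft - Installation Guide",
    "text": "Status: Draft\nOwner: jsmith\n",
}

# Each regex pattern is paired with the output field it writes to.
rules = [
    (r"Status:\s*(\w+)", "cat"),
    (r"Owner:\s*(\w+)", "owner"),
]

result = extract_regex_patterns(doc, ["title", "text"], rules)
print(result["cat"])
print(result["owner"])
```

Here the status pattern matches in both title and text, so cat receives two values, while the owner pattern matches only in text and produces one.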
You can edit the categorizeVersions workflow to use either ruleBasedCategorizer or ruleBasedCategorizer2. They both do exactly the same thing.
ExtractPatterns vs. ExtractRegexPatterns
Note that the older document transformer, ExtractPatterns, is more complicated to use than ExtractRegexPatterns, but has more powerful capabilities:
- ExtractRegexPatterns doesn't allow the user to restrict individual regex patterns to certain fields, like ExtractPatterns does. However, this can be accomplished with multiple ExtractRegexPatterns transformers used in series, each applying regex patterns to different input fields.
- ExtractRegexPatterns doesn't allow the user to create or extract from multiple capturing groups within a single regular expression. It only allows extracting from group 1. ExtractPatterns, on the other hand, allows referencing any number of capturing groups. This behavior can be simulated in ExtractRegexPatterns by specifying multiple regex patterns, one for each capturing group. For more information on capturing groups, please see Pattern.
- ExtractRegexPatterns doesn't provide support for integrated entity validation and handling of extracted patterns, unlike ExtractPatterns. If you choose to use ExtractRegexPatterns, you must handle entity validation and extraction in a separate transformer.
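The workaround for multiple capturing groups mentioned in the list above can be illustrated in Python. The version string and the patterns are invented for the example; the point is only that one two-group pattern can be split into two patterns that each expose a single group 1.

```python
import re

# Invented sample text with a two-part value.
text = "Document version 4.2, Status: Draft"

# ExtractPatterns style: one pattern with two capturing groups.
both = re.search(r"version (\d+)\.(\d+)", text)
major_minor = (both.group(1), both.group(2))

# ExtractRegexPatterns style: one pattern per value, each using only group 1.
major = re.search(r"version (\d+)\.\d+", text).group(1)
minor = re.search(r"version \d+\.(\d+)", text).group(1)

print(major_minor)
print((major, minor))
```

Both approaches recover the same two values; the second simply repeats the surrounding context in each pattern and moves the parentheses.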
Please keep in mind that we plan to deprecate ExtractPatterns at some point in the future. At that point, we will either have enhanced ExtractRegexPatterns to be more like ExtractPatterns, or will have clear recommendations on replacement technologies. Please try to use ExtractRegexPatterns for all new code.
Feed the sample documents into AIE using the sampleCategorization connector.
Return to the Connectors page of the AIE Admin UI. Navigate to System Management > Connectors, and check the box next to the sampleCategorization connector. Then click the Start button in the table header.
There are only three files to load, so this will take only a few seconds to complete.
Now we will run a query to verify that the documents were loaded in the index, and that they were categorized during ingestion. The categorization, if successful, will have created a facet list.
Navigate to the Query > SAIL query interface. Click the "gear" icon in the upper right to open the SAIL Preferences dialog. Navigate to the Facets tab. Set the Facet Finder Mode to Results. Click the Save button.
Look for the Category display on the SAIL landing page:
The three bars represent the three known Category values. Click on any one of them to narrow the display to just that document.
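Conceptually, the facet list is a count of documents per cat value, and clicking a bar filters the results to documents carrying that value. The Python sketch below models this with plain dictionaries; the document ids, the status values, and the narrow function are assumptions for illustration and say nothing about how AIE computes facets internally.

```python
from collections import Counter

# Stand-ins for the three indexed documents; ids and values are invented.
docs = [
    {"id": "doc1", "cat": ["Draft"]},
    {"id": "doc2", "cat": ["Reviewed"]},
    {"id": "doc3", "cat": ["Approved"]},
]

# One facet bucket per distinct cat value, with its document count.
facets = Counter(value for d in docs for value in d.get("cat", []))


def narrow(documents, value):
    """Keep only the documents whose cat field contains the given value."""
    return [d for d in documents if value in d.get("cat", [])]


print(dict(facets))
print([d["id"] for d in narrow(docs, "Draft")])
```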