Overview
Pattern-based entity extraction (formerly known as Rule-Based Entity Extraction) locates entities that match predictable patterns of text, such as telephone numbers, email addresses, URLs, and similar well-formed strings. This page demonstrates how to set up pattern-based entity extraction using a regex rule in a workflow component fed by a File Connector.
Required Modules
This feature requires that the entityextraction module be included when you run createproject to create the project directories.
Writing Regular Expressions
This exercise uses a very simple regular expression to find well-behaved phone numbers of the form XXX-XXX-XXXX. AIE lets you add further regex patterns, including more complex ones, to capture other phone-number formats.
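As an illustration of how additional patterns could cover other phone-number formats, here is a minimal Python sketch. The two extra patterns, the function name, and the sample string are assumptions made for this example; they are not patterns shipped with AIE, and Python's re module may differ in detail from the regex engine AIE uses.

```python
import re

# Hypothetical patterns for illustration only.
# Each one wraps the part to extract in parentheses.
PHONE_PATTERNS = [
    re.compile(r"\b([0-9]{3}-[0-9]{3}-[0-9]{4})\b"),    # XXX-XXX-XXXX
    re.compile(r"(\([0-9]{3}\) [0-9]{3}-[0-9]{4})\b"),  # (XXX) XXX-XXXX
    re.compile(r"\b([0-9]{3}\.[0-9]{3}\.[0-9]{4})\b"),  # XXX.XXX.XXXX
]


def find_phone_numbers(text):
    """Return the matches of every pattern, grouped in pattern order."""
    found = []
    for pattern in PHONE_PATTERNS:
        found.extend(pattern.findall(text))
    return found


sample = "Call 652-945-4681, (558) 548-2154 or 301.555.0147."
print(find_phone_numbers(sample))
```

Keeping one pattern per format, rather than one large alternation, makes each rule easier to read and to test on its own.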
Before You Begin
Ensure that the environment is prepared as follows:
1. Install Attivio.
2. Create a new project based on the Quick Start Tutorial that includes the demo module group, the entityextraction module, and the factbook module. For Windows, the createproject command looks like this:
<install-dir>\bin\createproject -n ruleextraction -g demo -m entityextraction -o <project-dir>
3. Start AIE using the AIE Agent and its Command-Line Interface (CLI):
Run the Agent in a Command Window:
<install-dir>\bin\aie-agent.exe -d <data-agent-dir>
Run the Command-Line Interface in a second Command Window. Note that the CLI is invoked for a specific project.
<install-dir>\bin\aie-cli -p <project-dir>
- To run the project, use the start all command in the Command-Line Interface.
Setting Up the Target Files
This example reads text files and examines them for phone numbers. You can use your own files, or download the three test files that are attached to this page. Unzip them in c:\documents.
Each test file is a fictional description of a person, similar to this:
Matthew Brown 652-945-4681.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque eu felis odio. Fusce augue velit, sollicitudin eget lacus nec, mattis tempus tellus. Cras mattis ultricies condimentum. Sed a urna nunc (558-548-2154). Aenean id sagittis lectus, vel tempus odio. Donec gravida ultrices orci, vel lacinia nulla euismod sed. Cras vestibulum lectus id erat eleifend mollis. Mauris tristique tincidunt erat nec accumsan. Quisque a molestie neque. Phasellus quis lacinia magna. Vivamus fermentum tristique sapien at malesuada. Nunc mi justo, aliquam vel urna ac, imperdiet volutpat sapien.
Note the embedded telephone numbers.
We will run a regex rule against the document's text field, looking for phone numbers. Then we'll copy the numbers into the phone_s field.
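If you prefer to generate your own target files instead of downloading the attachments, a small Python sketch like the following could do it. The file names and the second and third biographies are hypothetical stand-ins; only the first entry is adapted from the sample shown above.

```python
from pathlib import Path

# Hypothetical file names and contents.
# Only the first biography is adapted from the sample on this page.
SAMPLE_BIOS = {
    "bio1.txt": (
        "Matthew Brown 652-945-4681.\n"
        "Lorem ipsum dolor sit amet. Sed a urna nunc (558-548-2154).\n"
    ),
    "bio2.txt": "Jane Doe 301-555-0123.\nLorem ipsum dolor sit amet.\n",
    "bio3.txt": "John Roe 301-555-0147.\nLorem ipsum dolor sit amet.\n",
}


def write_sample_files(target_dir):
    """Write the sample biographies into target_dir and return the created paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, body in SAMPLE_BIOS.items():
        path = target / name
        path.write_text(body, encoding="utf-8")
        paths.append(path)
    return paths
```

Calling write_sample_files(r"c:\documents") would populate the directory used in the rest of this example.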
Define a Custom File Connector
We'll use a File Connector to load the three target files.
1. Start your AIE project, and navigate to the System Management > Connectors page.
2. Click New, and select the File Connector.
3. Name the connector bioConnector (for "biography connector").
4. Set the Start Directory to c:\documents (or wherever you put the target files).
5. Use a Wildcard Include Filter of *.txt. The other default filters are not necessary for this example, so we deleted them.
6. The example does not require Wildcard Exclude Filters, so we deleted them as well.
7. Ensure that the Ingest Workflow is fileIngest.
8. On the Field Mappings tab, make a dynamic mapping from the filename field to the title field.
9. Save the connector.
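Before starting the connector, it can help to confirm which files the Start Directory and the *.txt include filter would select. The following Python sketch approximates that check outside AIE. The function name is hypothetical, and it assumes the filter is applied to file names directly under the Start Directory; the connector's real matching and directory traversal may differ.

```python
from fnmatch import fnmatch
from pathlib import Path


def preview_connector_files(start_dir, include="*.txt"):
    """List files directly under start_dir whose names match the include wildcard."""
    return sorted(
        entry.name
        for entry in Path(start_dir).iterdir()
        if entry.is_file() and fnmatch(entry.name, include)
    )
```

For example, preview_connector_files(r"c:\documents") should list the three target files if they were unzipped there.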
Define a Custom Workflow and Component
A File Connector like bioConnector simply picks up files from a directory. It is the fileIngest workflow that unpacks the files and processes content into IngestDocument fields. For this reason a File Connector must always send its output to fileIngest.
The fileIngest workflow, in turn, sends its output to the ingest workflow. In this exercise, we will create a new workflow, called phoneNumberExtraction, to insert between fileIngest and ingest. We'll also create a new component based on the ExtractRegexPatterns transformer.
- Navigate to System Management > Workflows > Ingest, and click on the New button.
- Give the workflow the name phoneNumberExtraction.
- Click the Add New Component button. Ask for an ExtractRegexPatterns component. This opens a Component Editor.
- Name the component PhoneNumberExtractor.
- Set the input field to text.
- Set up the following rule:
- Regex pattern is \b([0-9]{3}\b-[0-9]{3}-[0-9]{4})\b
Note: Place parentheses () around the part of the pattern that you wish to extract.
- Output Field is phone_s.
- Save the component. This puts you back in the Workflow Editor. Note that the new component has been added to the workflow.
- Click the Add Subflow button. Select the Ingest workflow.
- Save the workflow.
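To see what the rule above captures, you can try the same pattern outside AIE. The following Python sketch models a document as a plain dictionary with a text field and writes the matches to phone_s. The dictionary, the function name, and the multi-valued phone_s list are assumptions for illustration; this is not the ExtractRegexPatterns implementation, and Python's re module may behave slightly differently from AIE's regex engine.

```python
import re

# Pattern copied from the rule above; the parentheses mark the part that is extracted.
PHONE_RULE = re.compile(r"\b([0-9]{3}\b-[0-9]{3}-[0-9]{4})\b")


def extract_phone_numbers(document):
    """Copy every captured phone number from the text field into phone_s."""
    document["phone_s"] = PHONE_RULE.findall(document.get("text", ""))
    return document


# Stand-in for an ingested document (hypothetical structure).
doc = {
    "text": "Matthew Brown 652-945-4681.\n"
            "Lorem ipsum dolor sit amet. Sed a urna nunc (558-548-2154).",
}
print(extract_phone_numbers(doc)["phone_s"])  # ['652-945-4681', '558-548-2154']

# Moving the parentheses changes what is captured: here, only the area code.
AREA_CODE_RULE = re.compile(r"\b([0-9]{3})-[0-9]{3}-[0-9]{4}\b")
print(AREA_CODE_RULE.findall(doc["text"]))  # ['652', '558']
```

The second pattern shows why the placement of the parentheses matters: the full number must still match, but only the parenthesized part is returned.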
Modify FileIngest Workflow
Edit the fileIngest workflow and change its final subflow from ingest to phoneNumberExtraction. Without this change, fileIngest continues to send documents straight to ingest, and the new component never runs.
Ingest Content
Navigate to System Management > Connectors, and check the box next to the bioConnector connector. Then click the Start button in the table header.
There are only three files to load, so this will take only a few seconds to complete.
Query
Now we will run a query to verify that the documents were loaded into the index and that phone numbers were extracted during ingestion.
Navigate to the Query > SAIL query interface. Set the search string to "*:*" (without quotes). Click on the Search button.
The search page displays entries for all three ingested documents. If you click Search Options and select Debug, you'll see all available fields, including phone_s, which should hold the extracted phone numbers.