
Overview

The ReplacePatterns transformer is AIE's find-and-replace tool for incoming documents.  It lets us search the incoming text for strings that match a regular expression, and then replace them with manipulated strings of our choosing.  One transformer, configured as a component in a workflow, can perform multiple string replacements on multiple fields of each incoming document.  


Configure RegexInfo Mappings

A component encapsulates a transformer so that it can be included in a workflow.  In this example we'll create a component called EnforceEditorialPolicy and wrap it around an instance of the ReplacePatterns transformer.  The use case is that some incoming documents contain words or phrases that are objectionable to the application's audience, so certain substitutions must be made.  The editor and the publisher each have a set of substitutions to perform under differing circumstances.

The ReplacePatterns.RegexInfo bean contains all the necessary parameters for a single regex mapping in a ReplacePatterns component.  The RegexInfo beans are packaged in a named list.  The list can be referenced by name from the component.  We will define two lists (one from the editor and one from the publisher).  Our ReplacePatterns transformer can be set to use either of these lists (or any other).

This example is a list of three beans, defined in a new file, <project-dir>\conf\bean\editorMappings.xml. Together, these beans convert historical dates from the B.C./A.D. notation to the more international B.C.E./C.E. notation.  Note that the first two patterns capture groups of characters (in parentheses) and then insert the captured text into the replacement string ($1 and $2), combined with other text.

<project-dir>\conf\bean\editorMappings.xml
<beans xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xmlns="http://www.springframework.org/schema/beans" 
xmlns:util="http://www.springframework.org/schema/util" 
xmlns:p="http://www.springframework.org/schema/p"
xmlns:sec="http://www.springframework.org/schema/security" 
xsi:schemaLocation=" http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd http://www.springframework.org/schema/util http://www.springframework.org/schema/util/spring-util.xsd http://www.springframework.org/schema/security http://www.springframework.org/schema/security/spring-security-3.1.xsd">
  <util:list id="editorMappings">
    <bean
      class="com.attivio.platform.transformer.ingest.field.ReplacePatterns.RegexInfo"
      p:input="text" p:output="text" p:pattern="(\b\d{1,4}\b)\sB\.C\.(\W)" p:replace="$1 B.C.E.$2"
      p:dotall="true" p:multiline="true" p:replaceFirstOnly="false" />
    <bean
      class="com.attivio.platform.transformer.ingest.field.ReplacePatterns.RegexInfo"
      p:input="text" p:output="text" p:pattern="century\sB\.C\.(\s|\,|\.)" p:replace="century B.C.E.$1"
      p:dotall="true" p:multiline="true" p:replaceFirstOnly="false" />
    <bean
      class="com.attivio.platform.transformer.ingest.field.ReplacePatterns.RegexInfo"
      p:input="text" p:output="text" p:pattern="\sA\.D\.\s" p:replace=" C.E. "
      p:dotall="true" p:multiline="true" p:replaceFirstOnly="false" /> 
  </util:list>
</beans>
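
As a rough illustration of how these three mappings behave, the sketch below applies the same patterns and replacement strings with plain java.util.regex. It assumes that ReplacePatterns follows standard Java regex semantics with the DOTALL and MULTILINE flags set; the class EditorMappingsSketch and its apply method are hypothetical and are not part of AIE.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not AIE code.
class EditorMappingsSketch {
    // Assumed equivalent of p:dotall="true" and p:multiline="true".
    private static final int FLAGS = Pattern.DOTALL | Pattern.MULTILINE;

    // Each row is {pattern, replacement}, copied from the editorMappings list.
    private static final String[][] MAPPINGS = {
        {"(\\b\\d{1,4}\\b)\\sB\\.C\\.(\\W)", "$1 B.C.E.$2"},
        {"century\\sB\\.C\\.(\\s|\\,|\\.)", "century B.C.E.$1"},
        {"\\sA\\.D\\.\\s", " C.E. "},
    };

    // Applies every mapping in order, replacing all matches
    // (the assumed meaning of p:replaceFirstOnly="false").
    static String apply(String text) {
        String result = text;
        for (String[] mapping : MAPPINGS) {
            result = Pattern.compile(mapping[0], FLAGS)
                            .matcher(result)
                            .replaceAll(mapping[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "Founded in 543 B.C.E., the kingdom lasted until C.E. 1017."
        System.out.println(apply("Founded in 543 B.C., the kingdom lasted until A.D. 1017."));
    }
}
```

Under these assumptions, the first pattern requires a non-word character after "B.C.", so a date that ends the text with nothing following it is left unchanged.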

We will need a second mapping to accommodate the wishes of the publisher. This will be a new file, <project-dir>\conf\bean\publisherMappings.xml.  This mapping looks for "Ireland" and substitutes "Shamrock".  Similarly, it finds "Britain" and substitutes "Bulldog".

<project-dir>\conf\bean\publisherMappings.xml
<beans xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xmlns="http://www.springframework.org/schema/beans" 
xmlns:util="http://www.springframework.org/schema/util" 
xmlns:p="http://www.springframework.org/schema/p"
xmlns:sec="http://www.springframework.org/schema/security" 
xsi:schemaLocation=" http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd http://www.springframework.org/schema/util http://www.springframework.org/schema/util/spring-util.xsd http://www.springframework.org/schema/security http://www.springframework.org/schema/security/spring-security-3.1.xsd">
  <util:list id="publisherMappings">
    <bean 
       class="com.attivio.platform.transformer.ingest.field.ReplacePatterns.RegexInfo"
       p:input="text" p:output="text" p:pattern="\bIreland\b"
       p:replace="Shamrock" p:dotall="true" p:multiline="true" />
    <bean
       class="com.attivio.platform.transformer.ingest.field.ReplacePatterns.RegexInfo"
       p:input="text" p:output="text" p:pattern="\bBritain\b"
       p:replace="Bulldog" p:dotall="true" p:multiline="true" />
  </util:list>
</beans>
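
To see what the \b word boundaries in these two patterns do and do not match, here is a minimal sketch using plain java.util.regex. It assumes standard Java regex semantics; the class PublisherMappingsSketch is hypothetical and is not part of AIE.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not AIE code.
class PublisherMappingsSketch {
    // Assumed equivalent of p:dotall="true" and p:multiline="true".
    private static final int FLAGS = Pattern.DOTALL | Pattern.MULTILINE;

    // Applies both publisher substitutions to every match in the text.
    static String apply(String text) {
        String result = Pattern.compile("\\bIreland\\b", FLAGS)
                               .matcher(text)
                               .replaceAll("Shamrock");
        return Pattern.compile("\\bBritain\\b", FLAGS)
                      .matcher(result)
                      .replaceAll("Bulldog");
    }

    public static void main(String[] args) {
        // Whole words are replaced.
        System.out.println(apply("Talks between Ireland and Britain resumed."));
        // An apostrophe is a word boundary, so the possessive is affected too.
        System.out.println(apply("Ireland's economy"));
        // No match: "Britains" continues with a word character,
        // and the patterns are case-sensitive.
        System.out.println(apply("Britains and ireland"));
    }
}
```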

Save these files in the <project-dir>\conf\bean\ configuration directory and use Agent-CLI to deploy them to the configuration server.

RegexInfo Properties

The RegexInfo properties are as follows:

RegexInfo Property   Description
input                The IngestDocument field that contains the text to process.
output               The IngestDocument field that receives the processed text.  The altered text overwrites the previous value.
pattern              The regex search pattern.  For example, "\bIreland\b" in the publisherMappings list matches the word "Ireland" between word boundaries.
replace              The regex output pattern.  It can be a simple string ("Shamrock") or a replacement expression that assembles a new value from parts of the matched string ("$1 B.C.E.$2").
dotall               Set to true to enable the Pattern.DOTALL flag on the regular expression Pattern (the . metacharacter then also matches line terminators).
multiline            Set to true to enable the Pattern.MULTILINE flag on the regular expression Pattern (^ and $ then match at the start and end of each line).
replaceFirstOnly     If set to true, replaces only the first match in the input string.
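
The effect of the dotall, multiline, and replaceFirstOnly properties can be sketched with plain java.util.regex. This is an illustration under the assumption that the properties map directly onto Pattern.DOTALL, Pattern.MULTILINE, and a choice between replaceFirst and replaceAll; the RegexFlagsSketch class and its replace helper are hypothetical and do not reflect AIE's actual implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not AIE code.
class RegexFlagsSketch {
    // Mirrors the RegexInfo properties as plain method parameters.
    static String replace(String input, String pattern, String replace,
                          boolean dotall, boolean multiline, boolean replaceFirstOnly) {
        int flags = 0;
        if (dotall) {
            flags |= Pattern.DOTALL;      // "." also matches line terminators
        }
        if (multiline) {
            flags |= Pattern.MULTILINE;   // "^" and "$" match at each line
        }
        Matcher matcher = Pattern.compile(pattern, flags).matcher(input);
        return replaceFirstOnly ? matcher.replaceFirst(replace)
                                : matcher.replaceAll(replace);
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";

        // multiline=true: every line gets the prefix.
        System.out.println(replace(text, "^", "> ", false, true, false));
        // multiline=false: only the start of the whole input matches.
        System.out.println(replace(text, "^", "> ", false, false, false));

        // dotall=true: "." can span the line break between "one" and "two".
        System.out.println(replace(text, "one.two", "X", true, false, false));
        // dotall=false: no match, so the text is unchanged.
        System.out.println(replace(text, "one.two", "X", false, false, false));

        // replaceFirstOnly=true changes only the first match.
        System.out.println(replace("a-a-a", "a", "b", false, false, true));
        System.out.println(replace("a-a-a", "a", "b", false, false, false));
    }
}
```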

Configure ReplacePatterns Component

We can create a ReplacePatterns-based component called EnforceEditorialPolicy through the AIE Administrator.

In the AIE Administrator, navigate to the Palette. Click on New and search for ReplacePatterns.  Select ReplacePatterns and open a New ReplacePatterns dialog box.

Give the component a name (EnforceEditorialPolicy) and link it to a set of regex mappings (publisherMappings in this case). Save the component.

Since this item was composed in the AIE Administrator, there is no need to deploy it.  However, to store the text version of the component configuration in your project, you'll need to use the Agent-CLI's update command.

 

Configure Workflow

To employ the new ReplacePatterns component we have to insert it into an appropriate workflow. 

The ingest workflow is AIE's default ingestion path.  It branches to all the subflows that implement AIE's linguistic analysis features, and it ultimately directs processed IngestDocuments to the indexer.  As a rule of thumb, a new ReplacePatterns component that changes the text of the document should be inserted at the beginning of the ingest workflow. 

IngestWorkFlow

We can make this change in the AIE Administrator Workflow Editor.  Navigate to the Workflow > Ingest list and click on the ingest workflow. In the editor, click the Add Existing Component button and select the EnforceEditorialPolicy component.  Use the Move Up button to position this component at the top of the list.

WorkflowEditor

Example Run

When we run the news connector (from the Factbook demo) with the EnforceEditorialPolicy component and the publisherMappings list, all references to "Ireland" and "Britain" are replaced by "Shamrock" and "Bulldog," respectively. This passage is from a BBC news article:

Shamrock

If we run the country connector and change EnforceEditorialPolicy to use the editorMappings list, we see the historical B.C./A.D. dates converted to B.C.E. and C.E. notation:

SriHistoryDates

 

 
