Overview

"Scope search" extends the Advanced Query Language by letting us search for terms that occur within a specific context. 

A "scope" is a region of tokenized text delimited by scope tokens.  For instance, AIE can place "begin sentence" and "end sentence" tokens around each sentence in a field of tokenized text.  This lets us query for multiple terms within the same sentence. Alternately, we can query for terms that must be found in different sentences.  

Scope search also lets us test for the presence of a specific scope in a field.  Using entity extraction, AIE can put "begin people" and "end people" tokens around each person identified in a field.  We can then retrieve all documents that mention any person in that field.  

For instance, the following example query:

Advanced Query Language, Scope Query
text:sentence:AND(scope(date), scope(person), president)

was used to query the text field for the appearance of a date, a person, and the keyword "president", all in the same sentence. The query matched this description from a BBC news item:

"Afghan President Hamid Karzai is in India to press for more military aid, ahead of the planned withdrawal of Nato-led forces from Afghanistan in 2014."

SCOPE vs ENTITY Operators

The SCOPE operator extends and replaces the ENTITY operator, which is deprecated.  Both are operators of the Advanced Query Language.

Entity Extraction Module

Requires Entity Extraction Module

Many of the examples in this section require that the entityextraction module be included when you run createProject.

Scope Highlighting in SAIL

If you are exploring scope search, note that SAIL color-codes some of the more common scopes in the search results. Hovering over each underlined term will show which entity type was matched at the given location.

The Entity Extraction module adds scope tokens to the text field of the incoming document.  It embeds scope tokens for sentences, people, companies, locations, and dates.

Target Sentences

The examples in this section refer to the two following sentences, presumed to be in the text fields of two documents.

A. Mark Smith, founder of Absent Technologies, visited New York eight times in 2005.

B. The teacher marked him absent eight times in 2005.

Sentence A features a person, a company, a location and a date.  Sentence B contains some of the same words, but the only scoped content is the date.  We'll use these similar sentences to explore various examples of scoped queries.

The following schematic shows how Sentence A might be divided into scopes by the Entity Extraction module:

Scopes of Sentence A
[begin sentence]
   [begin people]
     Mark Smith
   [end people]
   , founder of 
   [begin company]
     Absent Technologies
   [end company]
   , visited 
   [begin location]
     New York
   [end location]
   eight times in 
   [begin date]
     2005
   [end date]
   . 
[end sentence]

Note that the scopes are neatly nested.  AIE scopes never overlap.

In contrast, sentence B has only two scopes, the sentence and a date:

Scopes of Sentence B
[begin sentence]
  The teacher marked him absent eight times in 
  [begin date]
    2005
  [end date]
  . 
[end sentence]
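
The nesting property noted above can be illustrated with a small sketch. The token model here (tuples of "begin", "end", or "term") is a hypothetical simplification invented for this page, not AIE's actual token encoding. The sketch checks that every "end" token closes the most recently opened scope, which is what "never overlap" means in practice.

```python
# Hypothetical, simplified token model: ("begin", scope), ("end", scope), ("term", word).
# This is an illustration only; real AIE tokens are encoded differently.
SENTENCE_A = [
    ("begin", "sentence"),
    ("begin", "people"), ("term", "mark"), ("term", "smith"), ("end", "people"),
    ("term", "founder"), ("term", "of"),
    ("begin", "company"), ("term", "absent"), ("term", "technologies"), ("end", "company"),
    ("term", "visited"),
    ("begin", "location"), ("term", "new"), ("term", "york"), ("end", "location"),
    ("term", "eight"), ("term", "times"), ("term", "in"),
    ("begin", "date"), ("term", "2005"), ("end", "date"),
    ("end", "sentence"),
]


def is_properly_nested(tokens):
    """Return True if every "end" token closes the most recently opened scope."""
    open_scopes = []
    for kind, value in tokens:
        if kind == "begin":
            open_scopes.append(value)
        elif kind == "end":
            if not open_scopes or open_scopes.pop() != value:
                return False
    return not open_scopes


def scope_spans(tokens):
    """Return (scope_name, start_index, end_index) tuples in the order scopes close."""
    spans, stack = [], []
    for index, (kind, value) in enumerate(tokens):
        if kind == "begin":
            stack.append((value, index))
        elif kind == "end":
            name, start = stack.pop()
            spans.append((name, start, index))
    return spans


print(is_properly_nested(SENTENCE_A))
print([name for name, _, _ in scope_spans(SENTENCE_A)])
```

A token list such as begin a, begin b, end a, end b would fail the check, because scope a closes while scope b is still open.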

Viewing Scope Tokens

Scopes do have "begin" and "end" tokens, but not quite as depicted in this simplified example. We are using a simplified "schematic" view of tokens on this page because the actual tokens are complex and are subject to change as new features are added to AIE. There is no need for the AIE user to study the details of token encoding, but if you need to check a token list for token presence, absence, or location, you can use the procedure described in Field Guide to Tokens.

Sentence Scope

The Entity Extraction module embeds sentence tokens at the beginning and end of each section of text that it recognizes as a sentence.

Here's an example query that would match sentence A:

Advanced Query Language
text:sentence:AND(Smith,New York) 

This query will find all documents that have "Smith" and "New York" in the same sentence in the text field. Note that fields and scopes share a similar syntax (field text: contains scope sentence:). 
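
To make the "same sentence" requirement concrete, here is a minimal Python sketch of what a sentence-scoped AND has to check. It assumes a hypothetical token model of ("begin" | "end" | "term", value) tuples and single-word terms; it is not how AIE evaluates queries internally.

```python
def sentence(*words):
    """Wrap words in hypothetical begin/end sentence tokens."""
    return [("begin", "sentence")] + [("term", w) for w in words] + [("end", "sentence")]


def sentences(tokens):
    """Yield the set of lowercase terms found inside each sentence scope."""
    current = None
    for kind, value in tokens:
        if (kind, value) == ("begin", "sentence"):
            current = set()
        elif (kind, value) == ("end", "sentence"):
            yield current
            current = None
        elif kind == "term" and current is not None:
            current.add(value.lower())


def sentence_and(tokens, *terms):
    """Rough analogue of sentence:AND(t1, t2, ...): all terms occur in one sentence."""
    wanted = {t.lower() for t in terms}
    return any(wanted <= found for found in sentences(tokens))


same_sentence = sentence("Mark", "Smith", "visited", "New", "York")
split_sentences = sentence("Smith", "left") + sentence("New", "York", "is", "large")

print(sentence_and(same_sentence, "Smith", "York"))
print(sentence_and(split_sentences, "Smith", "York"))
```

The second document contains both terms, but in different sentences, so the sentence-scoped check rejects it even though a plain field-level AND would accept it.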

People Scope

The Entity Extraction module also embeds people tokens at the beginning and end of each section of text that it recognizes as the name of a person.

Is there a person named "Mark" in the text field? 

Advanced Query Language
text:people:Mark 

This query matches sentence A (which contains a person named Mark) but not sentence B (which contains the word "mark" but not in a people scope).  (The original word "marked" has been shortened to "mark" by stemming.)

Or we could request all documents that mention any person at all in the text field:

Advanced Query Language
text:scope(people) 

The scope() syntax detects the presence or absence of that type of scope anywhere in the current context.
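
The difference between a scoped term (people:Mark) and a scope presence test (scope(people)) can be sketched as follows. The token tuples and the function names scoped_term and has_scope are hypothetical, chosen for illustration; terms are shown already lowercased and stemmed.

```python
# Hypothetical simplified tokens for the two target sentences.
SENTENCE_A = [
    ("begin", "sentence"),
    ("begin", "people"), ("term", "mark"), ("term", "smith"), ("end", "people"),
    ("term", "visited"),
    ("begin", "location"), ("term", "new"), ("term", "york"), ("end", "location"),
    ("end", "sentence"),
]
SENTENCE_B = [
    ("begin", "sentence"),
    ("term", "the"), ("term", "teacher"), ("term", "mark"), ("term", "him"),
    ("term", "absent"), ("term", "eight"), ("term", "times"), ("term", "in"),
    ("begin", "date"), ("term", "2005"), ("end", "date"),
    ("end", "sentence"),
]


def terms_in_scope(tokens, scope):
    """Collect the terms that occur inside any occurrence of the named scope."""
    depth, found = 0, []
    for kind, value in tokens:
        if kind == "begin" and value == scope:
            depth += 1
        elif kind == "end" and value == scope:
            depth -= 1
        elif kind == "term" and depth > 0:
            found.append(value)
    return found


def scoped_term(tokens, scope, term):
    """Rough analogue of scope:term, e.g. people:Mark."""
    return term in terms_in_scope(tokens, scope)


def has_scope(tokens, scope):
    """Rough analogue of scope(name): is the scope present at all?"""
    return ("begin", scope) in tokens


print(scoped_term(SENTENCE_A, "people", "mark"), scoped_term(SENTENCE_B, "people", "mark"))
print(has_scope(SENTENCE_A, "people"), has_scope(SENTENCE_B, "people"))
```

Sentence B contains the term "mark", but only outside any people scope, so both checks reject it.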

Company Scope

The Entity Extraction module also embeds company tokens at the beginning and end of each phrase that it recognizes as the name of a company. 

Return all documents where a person, Mark Smith, appears in a sentence with any company.

Advanced Query Language
sentence:AND(people:"Mark Smith",scope(company))

default-search-field

Note that we left off the text: field restriction in this example.  This query matches the default-search field of the document, which concatenates the title, author, and text fields, among others. For details see the content field of your schema file: <project-dir>\conf\schema\default.xml.

We could also write a query asking for a person named "Mark" in the same sentence as a company named "Absent":

Advanced Query Language
text:sentence:AND(people:Mark,company:Absent)

This query would match sentence A, but would not match the similar sentence B.  This is the power of scope queries.

Date Scope

The Entity Extraction module uses a variety of regex-based rules to recognize dates in text.  It inserts date scope tokens at the beginning and end of each recognized date. 

Date Scope Datatype is "String"

Note that the value of a date scope is a string, not a date.  The Advanced Query Language features for matching dates and date ranges do not work in this context.

Find all documents that mention the Absent Technologies company and any date, in the text field:

Advanced Query Language
text:AND(company:Absent,scope(date))

That query locates sentence A.

Find a sentence that mentions marking a person absent on some date:

Advanced Query Language
text:sentence:AND(mark,scope(people),absent,scope(date)) 

This query is a bit defective, however: it locates sentence A, where "Mark" and "Absent" are parts of names, rather than sentence B, which has no people scope.  We can tune it to get better discrimination:

Advanced Query Language
text:sentence:AND(mark,NOT(people:Mark),absent,NOT(company:Absent),scope(date))

The tuned-up query looks for "mark" in a sentence that does not contain a person named "Mark," and also contains "absent" but not a company called "Absent", plus any date.  This matches sentence B but not sentence A.

Scope Negation

Note that NOT(people:Mark) and people:NOT(Mark) are two very different things.  The first says, "There is no person named Mark." The second says, "There is a person, but it is not Mark."  NOT(scope(people)) says, "There is no person here at all."
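
The three forms of negation described above can be compared side by side in a short sketch. The token tuples, the third document (a person who is not Mark), and the function names are all hypothetical; the predicates follow the plain-English readings given in the note, not AIE's actual query evaluation.

```python
def scope_contents(tokens, scope):
    """Return one list of terms per occurrence of the named scope."""
    contents, current = [], None
    for kind, value in tokens:
        if (kind, value) == ("begin", scope):
            current = []
        elif (kind, value) == ("end", scope):
            contents.append(current)
            current = None
        elif kind == "term" and current is not None:
            current.append(value)
    return contents


def not_people_mark(tokens):
    """NOT(people:Mark): there is no person named Mark."""
    return not any("mark" in names for names in scope_contents(tokens, "people"))


def people_not_mark(tokens):
    """people:NOT(Mark): there is a person, but it is not Mark."""
    return any("mark" not in names for names in scope_contents(tokens, "people"))


def not_scope_people(tokens):
    """NOT(scope(people)): there is no person at all."""
    return not scope_contents(tokens, "people")


doc_mark = [("begin", "people"), ("term", "mark"), ("term", "smith"), ("end", "people")]
doc_no_person = [("term", "teacher"), ("term", "mark"), ("term", "absent")]
doc_other_person = [("begin", "people"), ("term", "jane"), ("term", "doe"), ("end", "people")]

for doc in (doc_mark, doc_no_person, doc_other_person):
    print(not_people_mark(doc), people_not_mark(doc), not_scope_people(doc))
```

Only the document with a different person satisfies people:NOT(Mark), while the document with no person at all satisfies both NOT(people:Mark) and NOT(scope(people)).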

XML Scopes Ingest Workflow

XML document ingestion typically uses the xmlIngest workflow with its xPathExtractor component to map XML element values into IngestDocument fields. However, AIE also provides the xmlScopesIngest workflow, which substitutes the extractScopes component for xPathExtractor.

XMLWorkflows

The extractScopes transformer takes all of the elements of the XML document and concatenates them into the text field of the IngestDocument.  It wraps scope tokens around each value, naming the scope after the XML element that supplied the value.

For instance, this is a simple document description in XML:

XML Document Description
   <book id="1">
    <title>Oliver Twist</title>
    <author>
      <firstname>Charles</firstname>
      <lastname>Dickens</lastname>
    </author>
    <yearpublished>1837</yearpublished>
    <description>Oliver Twist, Fagin, Bill Sykes, and the
     Artful Dodger live by their wits in this
     dark tale of pre-Victorian England.
    </description>
    <location>London</location>
    <location>England</location>
  </book>

Note the XML elements in this book description: title, author, yearpublished, description and location.

After processing by the extractScopes component, the text field of the IngestDocument contains scopes that are based on the XML elements in the book description. (The Entity Extraction module has been active here also.)

Text field of Book #1
[begin book]
  [begin title]
      [begin people]
        Oliver Twist
      [end people]
  [end title]
  [begin author]
    [begin people]
      [begin firstname]
        Charles
      [end firstname]
      [begin lastname]
        Dickens
      [end lastname]
    [end people]
  [end author]
  [begin yearpublished]
    1837
  [end yearpublished]
  [begin description]
      [begin people]
        Oliver Twist
      [end people]
      , 
      [begin people]
        Fagin
      [end people]
      , 
      [begin people]
        Bill Sykes
      [end people]
      , and the 
      [begin people]
        Artful Dodger
      [end people]
      live by their wits in this dark tale of pre-Victorian 
      [begin location]
        England
      [end location]
      . 
  [end description]
[end book] 

In combination with the Entity Extraction module, extractScopes produces an extremely rich panoply of scopes, even in this tiny example.
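
As a rough illustration of the idea behind extractScopes, the following sketch flattens an XML element tree into a single list of scope-delimited tokens using Python's standard xml.etree.ElementTree module. The function name to_scope_tokens and the bracketed token strings are assumptions made for this example; they mirror the schematic above, not the real component or its token format. Attributes are ignored here, and no entity extraction is applied.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """
<book id="1">
  <title>Oliver Twist</title>
  <author>
    <firstname>Charles</firstname>
    <lastname>Dickens</lastname>
  </author>
  <yearpublished>1837</yearpublished>
</book>
"""


def to_scope_tokens(element):
    """Flatten an element into scope tokens named after each XML element."""
    tokens = [f"[begin {element.tag}]"]
    if element.text and element.text.strip():
        tokens.extend(element.text.split())
    for child in element:
        tokens.extend(to_scope_tokens(child))
        # Text that follows a child element still belongs to the parent scope.
        if child.tail and child.tail.strip():
            tokens.extend(child.tail.split())
    tokens.append(f"[end {element.tag}]")
    return tokens


tokens = to_scope_tokens(ET.fromstring(BOOK_XML))
print(" ".join(tokens))
```

Because every value ends up in one token stream, a query can still address an individual element by using its name as a scope, which is the trade-off the workflow is built around.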

Small Confession

In setting up this example, we had to tweak the Entity Extraction module to get it to recognize Dickensian character names.  AIE does not normally know that "Fagin" and "the Artful Dodger" are people. 

Here is a sample query:

Advanced Query Language
text:book:AND(title:"Oliver Twist",description:AND(people:"Oliver Twist",scope(people),scope(people)))

The query seeks a book called "Oliver Twist," where the description mentions the title character plus at least two additional people.

The xmlScopesIngest workflow offers two advantages over the more versatile xmlIngest workflow:

  • With the xmlScopesIngest workflow, there is no need to create a map of XPath formulas to pull values out of the XML DOM and insert them into fields of the IngestDocument.  The extractScopes component takes all of the XML elements and copies them into the IngestDocument's text field.
  • By storing all of the values in a single, scoped field, extractScopes greatly reduces the size of the AIE index while still preserving the ability to query the content on a field-by-field basis.  There is a slight increase in query time, but the reduction in index size can outweigh it.

XML Scopes vs Sentence Scopes

Sentence scoping does not always interact intuitively with XML scoping.  One sentence can contain multiple XML elements, for instance.  This can produce unintuitive query behavior.  

The best practice is to disable (or just ignore) sentence scoping when using XML scoping.  To disable sentence scoping, comment out the SentenceFinder bean in <project-dir>\conf\beans\finderList.xml before ingesting documents.  

 

Scoping XML Attributes

The book XML used in the previous example used a book element attribute, id, as the unique identification of the document. (An attribute is a modifier on an element.)

XML Document Description
  <book id="1">
    <title>Oliver Twist</title>
    <author> 
    ... etc. 

The previous example missed this vital piece of information because the extractScopes component does not process attributes by default.

To enable attribute extraction and scoping, edit <project-dir>\conf\schema\default.xml and modify the text field.  Add a scope.xmlAttributes property set to "true".

<project-dir>\conf\schema\default.xml
      <field name="text" type="text" tokenize="true" indexed="true" stored="true">
        <properties>
          <property name="scope.xmlAttributes" value="true"/>
          <property name="highlight.scopeMode" value="xml"/>
          <property name="highlight.fallbackField" value="teaser"/>
          ... etc. 

Restart AIE and reload the feed.  This time the attributes will be included in the text field, in scopes marked by a leading at-sign (@).

[begin book]
  [begin @id]
     1
  [end @id]
  [begin title]
    ... etc.  

To query for the document with id=1, set up the query like this:

text:@id:1
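
The effect of attribute scoping can be mimicked in a few lines of Python. This is a hypothetical sketch: the include_attributes flag stands in for the scope.xmlAttributes property, and the bracketed tokens follow the schematic on this page rather than AIE's real token format.

```python
import xml.etree.ElementTree as ET


def to_scope_tokens(element, include_attributes=False):
    """Flatten an element into scope tokens; attributes become "@name" scopes."""
    tokens = [f"[begin {element.tag}]"]
    if include_attributes:
        for name, value in element.attrib.items():
            tokens += [f"[begin @{name}]", value, f"[end @{name}]"]
    if element.text and element.text.strip():
        tokens += element.text.split()
    for child in element:
        tokens += to_scope_tokens(child, include_attributes)
    tokens.append(f"[end {element.tag}]")
    return tokens


book = ET.fromstring('<book id="1"><title>Oliver Twist</title></book>')
with_attrs = to_scope_tokens(book, include_attributes=True)
without_attrs = to_scope_tokens(book)

print(" ".join(with_attrs))
print(" ".join(without_attrs))
```

With the flag off, the id value never reaches the token stream, which is why the earlier example could not be queried by id.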

 

Entity Sentiment Module

Requires Entity Sentiment Module

The features in this section require that the entitysentiment module be included when you run createProject to create the project directories. In addition, the classifier and sentiment modules must be installed in AIE, but do not need to be included in the project.

Entity Sentiment Highlighting in SAIL

If you are exploring entity sentiment scope search, note that SAIL uses arrow icons to indicate entity sentiment in search results:

  • Positive: Green up-arrow
  • Negative: Red down-arrow

The Entity Sentiment module adds scope tokens to the text field of the incoming document. The tokens indicate various degrees of positive or negative sentiment associated with individual entities.

The entity-sentiment scopes are as follows:

  • entsentpos: This scope contains an entity with some degree of positive sentiment. It is "stacked" with one of the three following scopes:
    • entsentpos_1: The entity has a modest positive sentiment score.
    • entsentpos_2: The entity has a medium positive sentiment score.
    • entsentpos_3: The entity has an extremely positive sentiment score.
  • entsentneg: This scope contains an entity with some degree of negative sentiment. It is "stacked" with one of the three following scopes:
    • entsentneg_1: The entity has a modest negative sentiment score.
    • entsentneg_2: The entity has a medium negative sentiment score.
    • entsentneg_3: The entity has an extremely negative sentiment score.

The "stacking" of a general and a specific token lets us search for "any positive sentiment" or for a specific level of positive sentiment. 

The following query looks for a company that has any level of positive sentiment associated with it:

Advanced Query Language
text:entsentpos:scope(company) 

To find a rave review (extremely positive) of a person, we might use this query:

Advanced Query Language
text:entsentpos_3:scope(people) 
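
The "stacking" of a general and a level-specific scope can be sketched like this. The scope names come from the list above; the function names and the sample entities with their sentiment levels are invented for illustration, and nothing here reflects how the module actually scores sentiment.

```python
def sentiment_scopes(polarity, level):
    """Return the general scope plus the level-specific scope for one entity."""
    if polarity not in ("pos", "neg"):
        raise ValueError("polarity must be 'pos' or 'neg'")
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    general = f"entsent{polarity}"
    return [general, f"{general}_{level}"]


# Hypothetical entities and the scopes that would be stacked around each one.
entities = {
    "Absent Technologies": sentiment_scopes("pos", 1),
    "Mark Smith": sentiment_scopes("pos", 3),
    "New York": sentiment_scopes("neg", 2),
}


def entities_in(scope):
    """List the entities wrapped in the given sentiment scope."""
    return sorted(name for name, scopes in entities.items() if scope in scopes)


print(entities_in("entsentpos"))
print(entities_in("entsentpos_3"))
```

Querying the general scope finds every positive entity regardless of level, while the level-specific scope narrows the match to the strongest sentiment only.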

 

Keyphrase Module

Requires Keyphrase Module

The features in this section require that the keyphrases module be included when you run createProject to create the project directories.

Scope Highlighting in SAIL

If you are exploring key-phrase scope search, note that SAIL colors key-phrases orange in the search results.

This color is assigned in <install-dir>\webapps\sail\resources\css\scopesearch.css.

 

The Keyphrases module adds scope tokens to the text field of the incoming document. The tokens bracket phrases that have been determined to be statistically unlikely and therefore probably have special meaning. 

For instance, this is the title of an actual BBC news article:

Title of RSS News Article
VIDEO: Sri Lanka is making great progress

After tokenization, this becomes:

Text Field of IngestDocument
[begin sentence]
  VIDEO:
  [begin location]
     [begin keyphrase]
        Sri Lanka
     [end keyphrase]
  [end location]
  is making great progress
[end sentence] 

Let's query for any article that has a keyphrase in the title:

Advanced Query Language
title:scope(keyphrase) 

Find articles where the title contains a keyphrase that happens to contain a location.

title:keyphrase:scope(location)

Scope Facets

A scope query, such as "search title:scope(location)", can match many entities.  The Scope Facets feature lets us compile a facet list of the matching entities; for location entities, that is a list of the locations that matched.

See the Scope Facets page for more information. 

Scope Query Operators

The following query operators are legal subqueries for scope filtering:

  • AND
  • OR
  • TERM
  • REGEX
  • FUZZY
  • CONTEXT
  • PHRASE
  • NEAR
  • ONEAR
  • ENTITY
  • SCOPE
  • RANGE
  • NOT
  • *

See the Advanced Query Language page for more information.

Result Highlighting by Scope

The ScopeTeaser field expression lets you view search results that have been trimmed down to the size of a certain type of scope (typically the sentence scope).  You see the whole sentence, with the matching term(s) highlighted, but the rest of the field value is suppressed for a neater display. 

See ScopeTeaser for more information.
