"Scope search" extends the Advanced Query Language by letting us search for terms that occur within a specific context.
A "scope" is a region of tokenized text delimited by scope tokens. For instance, AIE can place "begin sentence" and "end sentence" tokens around each sentence in a field of tokenized text. This lets us query for multiple terms within the same sentence. Alternatively, we can query for terms that must be found in different sentences.
Scope search also lets us test for the presence of a specific scope in a field. Using entity extraction, AIE can put "begin people" and "end people" tokens around each person identified in a field. We can then retrieve all documents that mention any person in that field.
For instance, one example query searched the text field for a date, a person, and the keyword "president", all in the same sentence. The query matched this description from a BBC news item:
"Afghan President Hamid Karzai is in India to press for more military aid, ahead of the planned withdrawal of Nato-led forces from Afghanistan in 2014."
SCOPE vs ENTITY Operators
The SCOPE operator extends and replaces the ENTITY operator, which is deprecated. Both are operators of the Advanced Query Language.
Entity Extraction Module
Requires Entity Extraction Module
Many of the examples in this section require that the Entity Extraction module be included when you run createProject.
Scope Highlighting in SAIL
If you are exploring scope search, note that SAIL color-codes some of the more common scopes in the search results. Hovering over each underlined term will show which entity type was matched at the given location.
The Entity Extraction module adds scope tokens to the text field of the incoming document. It embeds scope tokens for sentences, people, companies, locations, and dates.
The examples in this section refer to the two following sentences, presumed to be in the text fields of two documents.
A. Mark Smith, founder of Absent Technologies, visited New York eight times in 2005.
B. The teacher marked him absent eight times in 2005.
Sentence A features a person, a company, a location and a date. Sentence B contains some of the same words, but the only scoped content is the date. We'll use these similar sentences to explore various examples of scoped queries.
The following schematic shows how Sentence A might be divided into scopes by the Entity Extraction module:
Note that the scopes are neatly nested. AIE scopes never overlap.
In contrast, sentence B has only two scopes, the sentence and a date:
Viewing Scope Tokens
Scopes do have "begin" and "end" tokens, but not quite as depicted in this simplified example. We are using a simplified "schematic" view of tokens on this page because the actual tokens are complex and are subject to change as new features are added to AIE. There is no need for the AIE user to study the details of token encoding, but if you need to check a token list for token presence, absence, or location, you can use the procedure described in Field Guide to Tokens.
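To make the nesting rule concrete, here is a minimal Python sketch of the two example sentences. It is a toy model, not AIE's real token encoding: each scope is a hypothetical `(name, start, end)` span over token positions, and the exact span boundaries (for example, treating only "2005" as the date) are assumptions made for illustration.

```python
# Toy model of scopes as (name, start, end) spans over token positions.
# 'end' is exclusive. This is illustrative only, not AIE's token format.

tokens_a = ["mark", "smith", "founder", "of", "absent", "technologies",
            "visited", "new", "york", "eight", "times", "in", "2005"]
scopes_a = [
    ("sentence", 0, 13),
    ("people", 0, 2),     # Mark Smith
    ("company", 4, 6),    # Absent Technologies
    ("location", 7, 9),   # New York
    ("date", 12, 13),     # 2005 (assumed boundary)
]

tokens_b = ["the", "teacher", "mark", "him", "absent",
            "eight", "times", "in", "2005"]
scopes_b = [("sentence", 0, 9), ("date", 8, 9)]


def properly_nested(scopes):
    """Return True if every pair of scopes is either disjoint or nested."""
    for i, (_, s1, e1) in enumerate(scopes):
        for _, s2, e2 in scopes[i + 1:]:
            disjoint = e1 <= s2 or e2 <= s1
            nested = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
            if not (disjoint or nested):
                return False
    return True
```

Both `scopes_a` and `scopes_b` pass this check, while a pair such as `("a", 0, 3)` and `("b", 2, 5)` fails it because the spans partially overlap.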
The Entity Extraction module embeds sentence tokens at the beginning and end of each section of text that it recognizes as a sentence.
Here's an example query that would match sentence A:
This query will find all documents that have "Smith" and "New York" in the same sentence in the text field. Note that fields and scopes share a similar syntax (field text: contains scope sentence:).
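The "same sentence" semantics can be sketched in Python as follows. This is a simplified model under stated assumptions, not AIE's query engine: `phrase_in_span` and `all_in_same_scope` are hypothetical helper names, scopes are `(name, start, end)` spans with an exclusive end, and the two-sentence document is invented to show a non-match.

```python
def phrase_in_span(tokens, phrase, start, end):
    """True if the word sequence `phrase` occurs entirely within tokens[start:end]."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(start, end - n + 1))


def all_in_same_scope(tokens, scopes, scope_name, phrases):
    """True if a single scope named `scope_name` contains every phrase."""
    return any(
        all(phrase_in_span(tokens, p, s, e) for p in phrases)
        for name, s, e in scopes
        if name == scope_name
    )


# Sentence A as one sentence scope.
tokens_a = ("mark smith founder of absent technologies "
            "visited new york eight times in 2005").split()
scopes_a = [("sentence", 0, 13)]

# Hypothetical document: both terms are present, but in different sentences.
tokens_split = "smith resigned he moved to new york".split()
scopes_split = [("sentence", 0, 2), ("sentence", 2, 7)]

query = [["smith"], ["new", "york"]]
match_a = all_in_same_scope(tokens_a, scopes_a, "sentence", query)
match_split = all_in_same_scope(tokens_split, scopes_split, "sentence", query)
```

Here `match_a` is `True` and `match_split` is `False`: an unscoped query for both terms would match both documents, while the sentence-scoped one matches only the first.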
The Entity Extraction module also embeds people tokens at the beginning and end of each section of text that it recognizes as the name of a person.
Is there a person named "Mark" in the text field?
This query matches sentence A (which contains a person named Mark) but not sentence B (which contains the word "mark", but not in a people scope). (The original word "marked" has been shortened to "mark" by stemming.)
Or we could request all documents that mention any person at all in the text field:
The scope() syntax detects the presence or absence of that type of scope anywhere in the current context.
The Entity Extraction module also embeds company tokens at the beginning and end of each phrase that it recognizes as the name of a company.
Return all documents where a person, Mark Smith, appears in a sentence with any company.
Note that we left off the text: field restriction in this example. This query matches the default-search field of the document, which concatenates the title, author, and text fields, among others. For details see the content field of your schema file: <project-dir>\conf\schema\default.xml.
We could also write a query asking for a person named "Mark" in the same sentence as a company named "Absent".
This query would match sentence A, but would not match the similar sentence B. This is the power of scope queries.
The Entity Extraction module uses a variety of regex-based rules to recognize dates in text. It inserts date scope tokens at the beginning and end of each recognized date.
Date Scope Datatype is "String"
Note that the value of a date scope is a string, not a date. The Advanced Query Language features for matching dates and date ranges do not work in this context.
Find all documents that mention the Absent Technologies company and any date, in the text field:
That query locates sentence A.
Find a sentence that mentions marking a person absent on some date:
This query is a bit defective, however, in that it locates both sentence A and sentence B. We can tune it to get better discrimination:
The tuned-up query looks for "mark" in a sentence that does not contain a person named "Mark," and also contains "absent" but not a company called "Absent", plus any date. This matches sentence B but not sentence A.
Note that NOT(people:Mark) and people:NOT(Mark) are two very different things. NOT(people:Mark) says, "There is no person named Mark." people:NOT(Mark) says, "There is a person, but it is not Mark." A third form, NOT(scope(people)), says, "There is no person here at all."
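The difference between the three negations is easier to see when each is written as a predicate. The sketch below is a toy Python model, not AIE's implementation: documents are hypothetical dictionaries of tokens and `(name, start, end)` scope spans, and the third document ("Jane Doe") is invented for contrast.

```python
def people_spans(doc):
    """Return the token lists covered by each people scope in the document."""
    return [doc["tokens"][s:e] for name, s, e in doc["scopes"] if name == "people"]


def not_people_mark(doc):
    """Models NOT(people:Mark): no people scope contains 'mark'."""
    return not any("mark" in span for span in people_spans(doc))


def people_not_mark(doc):
    """Models people:NOT(Mark): some people scope exists that lacks 'mark'."""
    return any("mark" not in span for span in people_spans(doc))


def not_scope_people(doc):
    """Models NOT(scope(people)): the document has no people scope at all."""
    return not people_spans(doc)


doc_a = {"tokens": "mark smith founder of absent technologies".split(),
         "scopes": [("people", 0, 2), ("company", 4, 6)]}
doc_b = {"tokens": "the teacher mark him absent".split(),
         "scopes": []}
doc_c = {"tokens": "jane doe visited new york".split(),  # hypothetical
         "scopes": [("people", 0, 2), ("location", 3, 5)]}

results = {
    name: (not_people_mark(d), people_not_mark(d), not_scope_people(d))
    for name, d in [("A", doc_a), ("B", doc_b), ("C", doc_c)]
}
```

In this model, sentence B satisfies the first and third predicates but not the second, because it contains no person at all; the "Jane Doe" document satisfies the first two but not the third.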
XML Scopes Ingest Workflow
XML document ingestion typically uses the xmlIngest workflow with its xPathExtractor component to map XML element values into IngestDocument fields. However, AIE also provides the xmlScopesIngest workflow, which substitutes the extractScopes component for xPathExtractor.
The extractScopes transformer takes all of the elements of the XML document and concatenates them into the text field of the IngestDocument. It wraps scope tokens around each value, naming the scope after the XML element that supplied the value.
For instance, this is a simple document description in XML:
Note the XML elements in this book description: title, author, yearpublished, description and location.
After processing by the extractScopes component, the text field of the IngestDocument contains scopes that are based on the XML elements in the book description. (The Entity Extraction module has been active here also.)
In combination with the Entity Extraction module, extractScopes produces an extremely rich panoply of scopes, even in this tiny example.
In setting up this example, we had to tweak the Entity Extraction module to get it to recognize Dickensian character names. AIE does not normally know that "Fagin" and "the Artful Dodger" are people.
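As a rough illustration of what an element-to-scope transformation does, the following Python sketch concatenates the leaf elements of an XML document into one token list and records a scope span per element. It is not the extractScopes component itself: the XML below is a hypothetical stand-in for the book description (with the location element omitted), and `extract_scopes` and `scope_contains` are invented names.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the book description; values are illustrative.
BOOK_XML = """
<book id="1">
  <title>Oliver Twist</title>
  <author>Charles Dickens</author>
  <yearpublished>1838</yearpublished>
  <description>Oliver meets Fagin and the Artful Dodger</description>
</book>
"""


def extract_scopes(xml_text):
    """Concatenate leaf element values into one token list and
    record a (scope_name, start, end) span for each element."""
    tokens, scopes = [], []
    for elem in ET.fromstring(xml_text.strip()).iter():
        # Only leaf elements carry values; skip containers such as <book>.
        if len(elem) == 0 and elem.text and elem.text.strip():
            words = elem.text.lower().split()
            scopes.append((elem.tag, len(tokens), len(tokens) + len(words)))
            tokens.extend(words)
    return tokens, scopes


def scope_contains(tokens, scopes, scope_name, term):
    """True if `term` occurs inside any scope named `scope_name`."""
    return any(term in tokens[s:e] for name, s, e in scopes if name == scope_name)


tokens, scopes = extract_scopes(BOOK_XML)
```

Even though every value now lives in a single token list, `scope_contains(tokens, scopes, "title", "oliver")` is true while `scope_contains(tokens, scopes, "author", "oliver")` is false, which is the field-by-field distinction that scoping preserves.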
Here is a sample query:
The query seeks a book called "Oliver Twist," where the description mentions the title character plus at least two additional people.
The xmlScopesIngest workflow offers two advantages over the more versatile xmlIngest workflow:
- With the xmlScopesIngest workflow, there is no need to create a map of XPath expressions to pull values out of the xmldom and insert them into fields of the IngestDocument. The extractScopes component takes all of the XML elements and copies them into the IngestDocument's text field.
- By storing all of the values in a single, scoped field, extractScopes greatly reduces the size of the AIE index while still preserving the ability to query the content on a field-by-field basis. There is some slight increase in query time, but sometimes the advantages of reducing the index size outweigh this.
XML Scopes vs Sentence Scopes
Sentence scoping does not always interact intuitively with XML scoping. For instance, one sentence can contain multiple XML elements, so a sentence-scoped query may match across element boundaries in ways you did not intend.
The best practice is to disable (or just ignore) sentence scoping when using XML scoping. To disable sentence scoping, comment out the SentenceFinder bean in <project-dir>\conf\beans\finderList.xml before ingesting documents.
Scoping XML Attributes
The book XML used in the previous example used a book element attribute, id, as the unique identification of the document. (An attribute is a modifier on an element.)
The previous example missed this vital piece of information because the extractScopes component does not process attributes by default.
To enable attribute extraction and scoping, edit <project-dir>\conf\schema\default.xml and modify the text field. Add a scope.xmlAttributes property set to "true".
Restart AIE and reload the feed. This time the attributes will be included in the text field, in scopes marked by a leading at-sign (@).
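The effect of the attribute switch can be pictured with a short Python sketch. This is an assumption-laden model, not AIE code: `attribute_scopes` and its `xml_attributes` flag are hypothetical names that mirror the behavior described above, where attributes are ignored by default and, when enabled, appear in scopes whose names start with an at-sign.

```python
import xml.etree.ElementTree as ET


def attribute_scopes(xml_text, xml_attributes=False):
    """Return (scope_name, value) pairs for XML attributes.

    Attributes are skipped unless `xml_attributes` is True; when enabled,
    each scope is named '@' + attribute name (modeling the leading at-sign).
    """
    if not xml_attributes:
        return []
    root = ET.fromstring(xml_text)
    return [
        ("@" + attr, value)
        for elem in root.iter()
        for attr, value in elem.attrib.items()
    ]


book = '<book id="1"><title>Oliver Twist</title></book>'
default_result = attribute_scopes(book)
enabled_result = attribute_scopes(book, xml_attributes=True)
```

With the flag off, `default_result` is empty, which corresponds to the earlier example missing the id. With the flag on, `enabled_result` holds a single `@id` scope with the value `"1"`.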
To query for the document with id=1, set up the query like this:
Entity Sentiment Module
The features in this section require that the Entity Sentiment module be included when you run createProject to create the project directories. In addition, the classifier and sentiment modules must be installed in AIE, but do not need to be included in the project.
Entity Sentiment Highlighting in SAIL
If you are exploring entity sentiment scope search, note that SAIL uses arrow icons to indicate entity sentiment in search results:
- Positive: Green up-arrow
- Negative: Red down-arrow
The Entity Sentiment module adds scope tokens to the text field of the incoming document. The tokens indicate various degrees of positive or negative sentiment associated with individual entities.
The entity-sentiment scopes are as follows:
- entsentpos: This scope contains an entity with some degree of positive sentiment. It is "stacked" with one of the three following scopes:
- entsentpos_1: The entity has a modest positive sentiment score.
- entsentpos_2: The entity has a medium positive sentiment score.
- entsentpos_3: The entity has an extremely positive sentiment score.
- entsentneg: This scope contains an entity with some degree of negative sentiment. It is "stacked" with one of the three following scopes:
- entsentneg_1: The entity has a modest negative sentiment score.
- entsentneg_2: The entity has a medium negative sentiment score.
- entsentneg_3: The entity has an extremely negative sentiment score.
The "stacking" of a general and a specific token lets us search for "any positive sentiment" or for a specific level of positive sentiment.
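The stacking idea can be sketched in a few lines of Python. This is only a model of the naming scheme listed above: `sentiment_scopes` is a hypothetical helper, and because the score thresholds that separate the three levels are not stated here, the level is passed in directly rather than computed from a score.

```python
def sentiment_scopes(polarity, level):
    """Return the stacked scope names for one entity:
    the general scope plus the level-specific scope."""
    if polarity not in ("pos", "neg"):
        raise ValueError("polarity must be 'pos' or 'neg'")
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    general = "entsent" + polarity
    return [general, f"{general}_{level}"]


def has_scope(entity_scopes, wanted):
    """True if the entity carries the requested scope name."""
    return wanted in entity_scopes


rave = sentiment_scopes("pos", 3)   # ['entsentpos', 'entsentpos_3']
mild = sentiment_scopes("pos", 1)   # ['entsentpos', 'entsentpos_1']
```

A search for the general `entsentpos` scope matches both `rave` and `mild`, while a search for `entsentpos_3` matches only `rave`.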
The following query looks for a company that has any level of positive sentiment associated with it:
To find a rave review (extremely positive) of a person, we might use this query:
The features in this section require that the Keyphrases module be included when you run createProject to create the project directories.
Scope Highlighting in SAIL
If you are exploring key-phrase scope search, note that SAIL colors key-phrases orange in the search results.
This color is assigned in <install-dir>\webapps\sail\resources\css\scopesearch.css.
The Keyphrases module adds scope tokens to the text field of the incoming document. The tokens bracket phrases that have been determined to be statistically unlikely and therefore probably have special meaning.
For instance, this is the title of an actual BBC news article:
After tokenization, this becomes:
Let's query for any article that has a keyphrase in the title:
Find articles where the title contains a keyphrase that happens to contain a location.
A scope query, such as "search title:scope(location)", can match many entities. The Scope Facets feature lets us compile a facet list of the matching entities. In the case of location entities, the result looks something like this:
See the Scope Facets page for more information.
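Conceptually, a scope facet is a count of the distinct values found inside scopes of one type across the matching documents. The Python sketch below illustrates that idea only; it does not use the Scope Facets feature or its API. The `scope_facets` function and the three sample titles are hypothetical.

```python
from collections import Counter


def scope_facets(docs, scope_name):
    """Count the distinct token sequences found in scopes of the given type."""
    counts = Counter()
    for doc in docs:
        for name, start, end in doc["scopes"]:
            if name == scope_name:
                counts[" ".join(doc["tokens"][start:end])] += 1
    return counts


# Hypothetical titles with location scopes as (name, start, end) spans.
docs = [
    {"tokens": "floods hit new york".split(), "scopes": [("location", 2, 4)]},
    {"tokens": "new york votes today".split(), "scopes": [("location", 0, 2)]},
    {"tokens": "talks resume in india".split(), "scopes": [("location", 3, 4)]},
]

facets = scope_facets(docs, "location")
```

For these sample titles, `facets` counts "new york" twice and "india" once.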
Scope Query Operators
The following query operators are legal sub queries for scope filtering:
See the Advanced Query Language page for more information.
Result Highlighting by Scope
The ScopeTeaser field expression lets you display search results that have been trimmed down to the size of a certain type of scope (typically the sentence scope). You see the whole sentence, with the matching term(s) highlighted, but the rest of the field value is suppressed for a neater display.
See ScopeTeaser for more information.
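The trimming behavior can be approximated with a short Python sketch. It is a conceptual illustration, not the ScopeTeaser expression: `scope_teaser` is a hypothetical function, square brackets stand in for highlighting, and the two-sentence field is invented.

```python
def scope_teaser(tokens, scopes, scope_name, term):
    """Return the first scope of the given type that contains `term`,
    with each matching token wrapped in brackets; None if nothing matches."""
    for name, start, end in scopes:
        span = tokens[start:end]
        if name == scope_name and term in span:
            return " ".join(f"[{t}]" if t == term else t for t in span)
    return None


# Hypothetical field value with two sentence scopes.
tokens = "smith resigned he moved to new york".split()
scopes = [("sentence", 0, 2), ("sentence", 2, 7)]

teaser = scope_teaser(tokens, scopes, "sentence", "york")
```

Here `teaser` is `"he moved to new [york]"`: only the sentence containing the match is kept, and the first sentence is suppressed.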