Overview
AIE users ingest documents with publication dates in many different formats. Users issue queries over ranges of dates. They want to display dates in formats that are familiar to their viewers.
There are several situations in AIE where date formatting becomes important. This page summarizes those issues and places them in perspective.
AIE Architecture Diagram
This diagram summarizes date formats mapped into AIE's general architecture. These topics are explored in detail below.
Normalized Dates
AIE normalizes all incoming dates into a numeric format that makes range matching efficient. In practice, this means that AIE converts dates into java.util.Date objects during ingestion. A java.util.Date object is a thin wrapper around a 64-bit precision integer representing elapsed milliseconds since January 1, 1970, 00:00:00 GMT.
The index contains the numeric value only, not the entire java.util.Date object.
A query result wraps the numeric value in a java.util.Date object again. AIE's policy is to return this Java object to the client program unless it simply isn't possible or appropriate to do so. (You can't send a Java object over the JSON REST API, for instance.)
AIE serializes dates using the ISO 8601 format.
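The round trip described above can be sketched with the JDK alone. This is a minimal illustration, not AIE code: the class and method names are hypothetical, and the ISO 8601 pattern shown is only one common rendering chosen for the example.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class NormalizedDateSketch {

    // Reduce a Date to its numeric core: milliseconds since 1970-01-01T00:00:00 GMT.
    static long toEpochMillis(Date date) {
        return date.getTime();
    }

    // Wrap a numeric value in a Date again, as happens for a query result.
    static Date fromEpochMillis(long millis) {
        return new Date(millis);
    }

    // Render a Date as an ISO 8601 string in UTC (pattern is an assumption for illustration).
    static String toIso8601(Date date) {
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US);
        iso.setTimeZone(TimeZone.getTimeZone("UTC"));
        return iso.format(date);
    }

    public static void main(String[] args) {
        Date epoch = fromEpochMillis(0L);
        System.out.println(toEpochMillis(epoch)); // 0
        System.out.println(toIso8601(epoch));     // 1970-01-01T00:00:00Z
    }
}
```

Because only the long value is compared, range matching reduces to integer comparison regardless of how the date was originally written.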
Connectors
Connectors encapsulate Scanners, which read in documents from a wide variety of repositories, such as file systems, databases, Web sites, email servers, and content-management tools. A connector reads in a raw source document and processes it into an IngestDocument. The connector then sends the IngestDocument to a workflow for ingestion processing.
The raw documents usually contain date and timestamp metadata, such as publication dates and expiration dates. Some AIE scanners can automatically normalize some of these dates into java.util.Date objects, but this does not happen in all cases. Some dates are copied into IngestDocument fields as strings or numeric values for conversion later.
Connectors write an ingestion timestamp in each IngestDocument. This is automatically created as a java.util.Date.
Ingestion Workflow
Field transformers are stages in an ingestion workflow that process field values in IngestDocuments. In this context, a transformer might read an integer date value from an IngestDocument, process it into the equivalent java.util.Date object, and attach the date object to the document in place of, or in addition to, the original value.
AIE's default ingest processing applies date transformation at three locations:
- The advteExtractor transformer, which is part of the Advanced Text Extraction Module (ATEM).
- The dateParser transformer of the standard ingestInit workflow.
- The indexer itself.
advteExtractor
The Advanced Text Extraction Module (ATEM) is invoked when you use the File Connector to ingest files of mixed file types. The ATEM can identify and convert hundreds of types of files. Part of this process is recognizing incoming field values that describe a date or timestamp. The advteExtractor transformer performs this step.
The advteExtractor looks for document fields identified as dates in the file <install-dir>\conf\advancedtextextraction\advancedtextextraction-metadata.xml.
It then applies various patterns to the field values, and automatically converts them into java.util.Date objects when a value matches a known date format. The patterns are expressed in the Java SimpleDateFormat syntax. The supported patterns are:
- EEE, d MMM yyyy HH:mm:ss Z
- MM/dd/yyyy hh:mm aaa
- EEE MM/dd/yyyy hh:mm:ss aaa
- EEE, MMM dd, yyyy hh:mm a
- EEE MMM dd HH:mm:ss yyyy
- EEE MM/dd/yyyy hh:mm a
- EEE, MMM dd, yyyyhh:mm a
- EEE, MMM dd, yyyyhh:mma
- EEE, MMM dd, yyyy hh:mm:ss a
- EEE, dd MMM yyyy HH:mm:ss
- EEE, dd MMM yyyy HH:mm:ss z
- dd MMM yyyy HH:mm:ss Z
- dd MMM yyyy HH:mm
- MM/dd/yyyy HH:mm:ss
- MM/dd/yyyy
- MM/dd/yy
- EEE dd-MM-yy hh:mm:ss a
These ATEM date patterns cannot be customized by the user.
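Trying a fixed list of SimpleDateFormat patterns against a value can be sketched as follows. This is not the ATEM implementation: the class name, the three-pattern subset, the UTC default zone, and the strict (non-lenient) parsing are all assumptions made for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

final class PatternMatchSketch {

    // A small subset of the patterns listed above, longest first so that a
    // short pattern does not match only the prefix of a longer value.
    static final List<String> PATTERNS = Arrays.asList(
            "EEE, d MMM yyyy HH:mm:ss Z",
            "MM/dd/yyyy HH:mm:ss",
            "MM/dd/yyyy");

    // Return the first successful parse, or null when no pattern matches.
    static Date parseFirstMatch(String value) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumed zone when the value has none
            format.setLenient(false);                        // reject values such as 13/45/2020
            try {
                return format.parse(value);
            } catch (ParseException e) {
                // Not this pattern; fall through to the next one.
            }
        }
        return null;
    }
}
```

A value that matches none of the patterns stays unparsed here, which mirrors the case where a date is carried forward as a string for conversion later.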
dateParser
The default AIE ingest workflow sends IngestDocuments to the ingestInit workflow. There the dateParser component examines the document for fields identified as dates in the AIE Schema. (Your project's schema is <project-dir>\conf\schema\default.xml.) It compares the field values to a series of date formats. When a value matches a date format, dateParser converts the value into a java.util.Date object.
The dateParser recognizes the following date patterns by default:
- yyyy-MM-dd'T'HH:mm:ss
- yyyy-MM-dd
- yyyy-MM-dd HH:mm:ss.
- yyyy-MM-dd'T'HH:mm:ss.
- yyyy-MM-dd'T'HH:mm:ss.SSSZ.
- yyyy-MM-dd'T'HH:mm:ssZ.
By editing the project schema, you can control which fields are processed into dates, and you can provide AIE with additional date templates to fit your specific date format.
This is the definition of the date field from <project-dir>\conf\schema\default.xml:
<field name="date" type="date" indexed="true" stored="true" sort="true" default="NOW">
  <properties>
    <property name="workflow.date.format" value="MM/dd/yyyy"/>
  </properties>
</field>
The workflow.date.format property in the AIE Schema lets you add and extend the set of date patterns that AIE recognizes.
Property | Type | Default | Description |
---|---|---|---|
workflow.date.format | string | "" | Date and time pattern suitable for java.text.SimpleDateFormat. Adds this format to the list of default date formats that AIE can parse into java.util.Date objects during ingestion. |
workflow.date.defaultTimezone | string | "UTC" | Default timezone for dates if not specified in the date/time itself. |
Date fields in the default AIE Schema recognize dates that fit the "MM/dd/yyyy" pattern.
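The effect of the two schema properties can be approximated with SimpleDateFormat directly. The sketch below is an illustration under assumptions, not AIE's parser: the class and method names are hypothetical, and it assumes the default timezone behaves like the zone set on a SimpleDateFormat, which applies only when the value carries no zone of its own.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class SchemaDateFormatSketch {

    // pattern plays the role of workflow.date.format;
    // defaultTimezone plays the role of workflow.date.defaultTimezone.
    static Date parse(String value, String pattern, String defaultTimezone) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
        format.setTimeZone(TimeZone.getTimeZone(defaultTimezone));
        format.setLenient(false);
        try {
            return format.parse(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException(
                    "Value does not match " + pattern + ": " + value, e);
        }
    }

    public static void main(String[] args) {
        Date utc = parse("07/17/2014", "MM/dd/yyyy", "UTC");
        Date offset = parse("07/17/2014", "MM/dd/yyyy", "GMT-05:00");
        // Midnight at GMT-05:00 falls five hours after midnight UTC.
        System.out.println(offset.getTime() - utc.getTime()); // 18000000
    }
}
```

The same string therefore normalizes to different numeric values depending on the default timezone, which is worth checking when source dates are written in local time.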
Indexer
"The Indexer"
AIE documentation often refers to "the indexer." This usage does not correspond to a specific component of the software, but to the process of creating index records from IngestDocuments. This task is performed by the Attivio Index Engine.
The indexer workflow is the last stop before AIE sends an IngestDocument to the Index Engine for processing into the Universal Index.
At that point, the indexer examines each IngestDocument for fields that are defined in the AIE Schema. If a date field contains a java.util.Date object, the numeric core of this object passes into the index. If not, the indexer attempts to cast the field's value into a Date. If this attempt does not succeed, the document is dropped and is not added to the index.
Index
The records in the Universal Index contain normalized numeric date values in the precision required by the field definitions in the schema.
Querying Dates and Times
Date formats used in queries must conform to the expectations of AIE's three distinct query languages.
Date queries require field names
To search for a date, you must use a field name, as in date:"2014-07-17". This tells AIE to convert the date string into a 64-bit number that will match that field in the index. If you omit the field name, AIE will try to match the string against the document's content field. The content field concatenates the document's title, author and text values, plus the values of all *_s and *_text dynamic fields, into a single field value. It does not contain any dates. Therefore, unfielded date queries won't match any documents.
Date Matching Conventions
AIE observes certain conventions when matching query date/time requests with indexed documents.
- Leaving off the time portion of the date in the query string defaults the time to midnight. For example, searching for date "2009-03-12" to "2009-03-13" is equivalent to searching for date "2009-03-12T00:00:00" to "2009-03-13T00:00:00".
- When you query a date field that has a precision of days using a date string with higher precision (hours or finer), the date string is truncated, removing the hours, minutes, seconds and milliseconds.
  - For instance, a query for "2012-03-29 05:12:31" will match a document containing only "2012-03-29".
  - In the same situation, a range query for "2012-01-30 14:20:00" to "2012-02-15 19:00:00" is interpreted as "2012-01-30" to "2012-02-15".
- The document date stored in the index has no time zone, and is presumed to be GMT.
- In the Simple Search Language, the query string does not support time zones. Times in simple queries are assumed to be GMT. The Advanced Query Language does allow specifying a timezone.
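The truncation convention for day-precision fields can be pictured as follows. This is a sketch of the idea only, with hypothetical names, and it assumes truncation happens on GMT day boundaries because indexed dates are presumed to be GMT.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class DayPrecisionSketch {

    // Drop hours, minutes, seconds and milliseconds, keeping the GMT calendar day.
    static Date truncateToDay(Date date) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("GMT"), Locale.US);
        cal.setTime(date);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        return cal.getTime();
    }

    // Two instants match at day precision when they truncate to the same value.
    static boolean sameDay(Date a, Date b) {
        return truncateToDay(a).equals(truncateToDay(b));
    }
}
```

Under this reading, a query instant of 05:12:31 and a stored instant of 00:00:00 on the same GMT day compare as equal.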
What to do when a date query does not work as expected:
- By default, Java prints dates as the equivalent local time, but does not print the time zone. Be sure you are clear about what data is in the record.
- Retrieve the document using another query (for example, by document ID).
- Verify that the time in the stored document is what you expect.
- Remember that the date string used for queries does NOT support time zones, and is interpreted as GMT time. To specify the time based on the local time zone, use the advanced query language date syntax (DATE(date_string[, date_format[, date_timezone]])) or convert it yourself in the code underneath the UI.
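Converting a local time to GMT in client code, before it goes into a query string, might look like this. The helper is hypothetical and uses only the JDK; the input and output patterns are assumptions chosen to match the query formats described on this page.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

final class LocalToGmtSketch {

    // Interpret localDateTime in the given zone and re-express it in GMT.
    static String toGmtQueryDate(String localDateTime, String zoneId) {
        SimpleDateFormat local = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US);
        local.setTimeZone(TimeZone.getTimeZone(zoneId));
        SimpleDateFormat gmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.US);
        gmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        try {
            return gmt.format(local.parse(localDateTime));
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + localDateTime, e);
        }
    }

    public static void main(String[] args) {
        // 18:06:26 Pacific Daylight Time is 01:06:26 GMT on the following day.
        System.out.println(toGmtQueryDate("2012-06-08 18:06:26", "America/Los_Angeles"));
        // 2012-06-09T01:06:26
    }
}
```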
Simple Query Language
Simple Query Language is designed to support keyword queries generated by untrained users, similar to the query features of Google or Yahoo. It also supports a selection of more sophisticated wildcard, field-specific, and range queries.
You must specify a date value in a Simple Query Language query in UTC format, wrapped in double quotes, because a fully-qualified date contains special characters. Valid formats are "YYYY-MM-DDThh:mm:ss" and "YYYY-MM-DD hh:mm:ss".
Ranges over dates are expressed with this syntax (where "date:" designates the date field):
Example Query | Explanation |
---|---|
date:["2007-01-01" TO "2007-01-04"] | Match documents with date values between midnight Jan 1 and midnight Jan 4, 2007. |
date:["2007-01-01 00:00:00" TO "2007-01-01 00:00:00"] | Match date values with full precision. |
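A client that already holds java.util.Date values can assemble a range query string like the ones in the table. This is a sketch with hypothetical names; it only builds the string and does not call any AIE API.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class SimpleRangeQuerySketch {

    // Format a Date in GMT using the "YYYY-MM-DD hh:mm:ss" layout.
    static String format(Date date) {
        SimpleDateFormat gmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US);
        gmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return gmt.format(date);
    }

    // Build field:["from" TO "to"], quoting each date because it contains special characters.
    static String dateRange(String field, Date from, Date to) {
        return field + ":[\"" + format(from) + "\" TO \"" + format(to) + "\"]";
    }

    public static void main(String[] args) {
        System.out.println(dateRange("date", new Date(0L), new Date(86400000L)));
        // date:["1970-01-01 00:00:00" TO "1970-01-02 00:00:00"]
    }
}
```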
Advanced Query Language
The Advanced Query Language provides sophisticated tools for use from a client program where the query can be assembled by software. The Advanced Query Language is therefore capable of much more powerful and precise queries than is the case with the Simple Query Language.
The Advanced Query Language provides a DATE operator for specifying date values in custom formats. The syntax is as follows:
DATE(date_string[, date_format[, date_timezone]])
If not specified, date_format is the standard ISO-8601 format ("yyyy-MM-dd'T'HH:mm:ss"). If not specified, date_timezone is assumed to be UTC.
Examples:
Date Expression |
---|
DATE("1983-12-01T00:00:00") |
DATE("12/01/1983", "MM/dd/yyyy") |
DATE("12/01/1983", "MM/dd/yyyy", EST) |
Query Workflow
Although an AIE user could create a component to alter the format of dates in a query, this is not normally necessary. AIE expects you to formulate your date requests correctly before submitting them to the query workflow.
Response Workflow
Matching documents return through the response workflow as SearchDocuments. Dates in SearchDocuments are encapsulated as java.util.Date objects. These are not the same java.util.Date objects as were used in the ingestion workflow. A default toString() rendering of a date value from a SearchDocument resembles "Fri Jun 08 18:06:26 PDT 2012".
It is not normally necessary to modify date formats in the response workflow, because that transformation is usually done in the client program that displays the results to the user. If this becomes necessary, however, it is easy to write a response transformer in Java and insert it into the response workflow. See the example here.
API Response Formats
The various AIE API channels sometimes serialize dates slightly differently.
Java Server API
The Java Server API passes the java.util.Date object directly to its client programs. This gives the client the maximum freedom and convenience to work with dates in their native format.
HTTP REST API
The JSON REST API cannot pass Java objects back to the HTTP browser, so it serializes the Date into a JSON String using date format yyyy-MM-dd'T'HH:mm:ss.SSSZ.
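On the client side, a string in that format can be turned back into a java.util.Date and then rendered for display. The sketch below uses only the JDK; the class and method names are hypothetical, and the sample values assume the offset is written in the RFC 822 style (for example +0000) that the Z pattern letter produces.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class RestDateSketch {

    static final String REST_PATTERN = "yyyy-MM-dd'T'HH:mm:ss.SSSZ";

    // Parse a date string as serialized in a JSON response.
    static Date parseRestDate(String value) {
        try {
            return new SimpleDateFormat(REST_PATTERN, Locale.US).parse(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + value, e);
        }
    }

    // Render a Date for display with a caller-chosen pattern and time zone.
    static String display(Date date, String pattern, String zoneId) {
        SimpleDateFormat out = new SimpleDateFormat(pattern, Locale.US);
        out.setTimeZone(TimeZone.getTimeZone(zoneId));
        return out.format(date);
    }

    public static void main(String[] args) {
        Date parsed = parseRestDate("1970-01-01T00:00:00.000+0000");
        System.out.println(display(parsed, "MM/dd/yyyy", "UTC")); // 01/01/1970
    }
}
```

Because the offset is part of the serialized string, the parsed instant does not depend on the client's default time zone; only the display step does.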
Reformatting Response Data
The DateFormat Field Expression lets us reformat a date field in the search results. It is intended for use by search applications using the Java Client API or the JSON REST API.
We can also reformat dates in a SearchDocument by using a ResponseTransformer in the defaultResponse workflow. See Creating Custom Response Transformers.