Overview
The Advanced Text Extraction Workflow extracts various metadata properties from input documents. This page describes those properties and how to configure them.
The standard processing feedback fields are also populated, as described in Configure the AIE Schema.
Native Metadata Properties
The native metadata properties are listed in the following configuration file: <installation_dir>\conf\advancedtextextraction\advancedtextextraction-metadata.xml. In this configuration file, a mapping between native metadata properties and valid AIE field names is maintained.
This mapping is necessary because the native metadata properties do not follow any particular naming convention, so their names may be invalid under the AIE field naming scheme.
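As a rough illustration of why such a mapping is needed, the following Python sketch renames native properties to field names. The validity rule (a lowercase letter followed by lowercase letters, digits, or underscores) and the native property names are assumptions made for this example; they are not taken from advancedtextextraction-metadata.xml or from the AIE field naming scheme.

```python
import re

# Assumed rule, for illustration only: a lowercase letter followed by
# lowercase letters, digits, or underscores.
VALID_FIELD_NAME = re.compile(r"[a-z][a-z0-9_]*")

# Hypothetical native property names mapped to field names listed on this page.
NATIVE_TO_FIELD = {
    "Last Saved By": "lastmodifier",
    "Business Address City": "businessaddresscity",
    "Number of Pages": "numpages",
}


def is_valid_field_name(name: str) -> bool:
    """Return True if the name matches the assumed field-name rule."""
    return VALID_FIELD_NAME.fullmatch(name) is not None


def map_properties(native: dict) -> dict:
    """Rename native properties to their mapped field names; drop unmapped ones."""
    return {NATIVE_TO_FIELD[k]: v for k, v in native.items() if k in NATIVE_TO_FIELD}
```

Under these assumptions, a native name such as "Last Saved By" fails the check because of its spaces and capital letters, while its mapped name `lastmodifier` passes.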
Many of the fields to which the metadata properties are mapped are included in the AIE schema defined for the Advanced Text Extraction module. This schema can be found in <installation_dir>\conf\advancedtextextraction\advancedtextextraction-schema.xml.
When working on your AIE project, you can specify exactly which fields you want in your own schema, under <project_dir>\conf\advancedtextextraction\advancedtextextraction-schema.xml.
It is highly recommended that you identify and use only the subset of fields that your application needs, to minimize the effect of a large schema on AIE performance.
Supported Metadata Properties
The sections below elaborate on two sets of metadata:
- The "common" set of field names that apply to all document types. These are system-level fields which should not be removed from your schema.
- Metadata properties (and their respective fields) that are specific to particular supported document types. Choose your own set of fields from these by editing your schema.
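The split between the two sets can be sketched in Python as follows. The common field names come from the table on this page; the selected fields and the helper names are hypothetical and only stand in for whatever subset an application actually needs.

```python
# Common, system-level fields listed on this page; these should stay in the schema.
COMMON_FIELDS = {
    "ancestorids", "childpath", "childtype", "doctype", "fileext", "filename",
    "mimetype", "parentdoctype", "parentid", "parentmimetype", "sourcepath",
    "sourceuri", "text",
}

# Hypothetical application-specific selection from the document-type-specific fields.
SELECTED_FIELDS = {"author", "title", "creationdate", "numpages"}


def fields_to_keep() -> set:
    """Return the common fields plus the application's own selection."""
    return COMMON_FIELDS | SELECTED_FIELDS


def unused_fields(extracted: dict) -> set:
    """Return extracted field names that the trimmed schema would not hold."""
    return set(extracted) - fields_to_keep()
```

Running `unused_fields` over a sample of extracted documents is one way to see which properties a trimmed schema would leave out before committing to it.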
Common Metadata
The following table summarizes the "common" properties extracted by the Advanced Text Extraction Workflow of the Attivio Intelligence Engine (AIE):
Field name | Data type | Description | Notes |
---|---|---|---|
ancestorids | String | The list of ancestor document IDs (stored on a child document, the outermost ancestor first and the immediate parent last) | This is useful when tracking the full parent/child lineage of document hierarchies, for example, for a Word document within a zip archive that is attached to an email message. The ancestorids field for the Word document starts with the ID of the email message document and ends with the ID of the zip archive document, thus preserving the hierarchy of relationships. |
childpath | String | The name or path of a child document (by which it is referred to in the parent document). | See also childtype. |
childtype | String | The type of child document: attachment, entry, embedding. | See also childpath. |
doctype | String | The document type | The supported document types are summarized in the advancedtextextraction-doctypes.xml file. The 'type' attribute value is stored as the doctype field value. |
fileext | String | The extension of the original filename (if any) | |
filename | String | The short filename of the document | |
mimetype | String | The MIME type of the document | The supported document types are summarized in the doc-types-config.xml file. The 'mimetype' attribute value is stored as the mimetype field value. |
parentdoctype | String | The parent document type (if any) | The supported document types are summarized in the doc-types-config.xml file. The 'parenttype' attribute value is stored as the parentdoctype field value. |
parentid | String | The ID of the parent document (if any) | In contrast to ancestorids, this allows for efficient querying of parent/child relationships on a single level of a document hierarchy. For a given sub-document, it allows you to quickly find its parent, and for a parent document, it allows you to quickly find all of its child documents. |
parentmimetype | String | The parent MIME type (if any) | The supported document types are summarized in the doc-types-config.xml file. The 'parentmimetype' attribute value is stored as the parentmimetype field value. |
sourcepath | String | The original filepath of the document (if any) | |
sourceuri | String | The original URI of the document (if any) | |
text | Text | The text content extracted from the document | |
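The relationship between ancestorids and parentid can be sketched in Python. The document IDs and the `parent_of` lookup below are hypothetical; the sketch only mirrors the ordering described in the table (outermost ancestor first, immediate parent last) and assumes the hierarchy contains no cycles.

```python
def build_ancestorids(doc_id, parent_of):
    """Return ancestor IDs for doc_id, outermost ancestor first, immediate parent last."""
    chain = []
    current = parent_of.get(doc_id)
    while current is not None:  # walk upward; assumes an acyclic hierarchy
        chain.append(current)
        current = parent_of.get(current)
    chain.reverse()  # collected parent-first, so flip to outermost-first
    return chain


# Hypothetical IDs: a Word document inside a zip archive attached to an email message.
parent_of = {"report.doc": "files.zip", "files.zip": "message.eml"}

ancestorids = build_ancestorids("report.doc", parent_of)
# The immediate parent is the last entry, which is what parentid holds.
parentid = ancestorids[-1] if ancestorids else None
# Single-level lookup in the other direction: all children of one parent.
children_of_zip = [doc for doc, parent in parent_of.items() if parent == "files.zip"]
```

Here the Word document's ancestor list begins with the email message and ends with the zip archive, and its parentid is the zip archive.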
The Specific Properties
The following table summarizes all of the field names that are supported by AIE for documents going through the Advanced Text Extraction Workflow.
Note that AIE maps the extracted metadata properties to a consistent, normalized set of Attivio fields, because the native metadata properties do not follow any particular pattern or naming scheme.
Field name | Data type | Description | Version added in |
---|---|---|---|
abstract | String | the document abstract | |
acceptlanguage | String | MIME/email-related property | AIE 3.0 |
account | String | the account information | |
actualwork | Long | the actual work (typically, for an Outlook task) | |
address | String | the address (e.g. for an Outlook contact) | |
albumtitle | String | the album title (e.g. for an MP3 file) | |
alternaterecipientallowed | Boolean | the alternate recipient allowed value (for an email message) | |
anniversary | String | the anniversary (typically for an Outlook contact) | |
application | String | the name of the application that was used to create the document | |
appversion | String | the version of the application that was used to create the document | |
assistant | String | the assistant's name | |
assistantphonenum | String | the person's assistant's telephone number (typically for an Outlook contact) | |
attachment | String | attachment | AIE 3.0 |
attachments | String | the list of attachments in an email message | |
attendees | String | the list of attendees (e.g. for an Outlook meeting request) | |
attrhidden | Boolean | the 'attr hidden' header value (for a MIME message) | |
attrreadonly | Boolean | the 'attr read-only' header value (for a MIME message) | |
attrsystem | Boolean | the 'attr system' header value (for a MIME message) | |
author | String | the document's author or authors | |
authorization | String | the authorization | |
autoforwarded | String | the 'auto-forwarded' header value (for an email message) | |
backupdate | Date | the document backup date | |
basefilelocation | String | the document base file location | |
bcc | String | the blind carbon copy field in an email message | |
billinginfo | String | the billing information (typically, for an Outlook task) | |
billto | String | bill to | |
birthday | String | the birthday (typically for an Outlook contact) | |
businessaddress | String | the business address (typically for an Outlook contact) | |
businessaddresscity | String | the contact's business address city (typically for an Outlook contact) | |
businessaddresscountry | String | the contact's business address country (typically for an Outlook contact) | |
businessaddresspobox | String | the contact's business address P.O. Box (typically for an Outlook contact) | |
businessaddresspostalcode | String | the contact's business address postal code (typically for an Outlook contact) | |
businessaddressstate | String | the contact's business address state (typically for an Outlook contact) | |
businessaddressstreet | String | the contact's business address street (typically for an Outlook contact) | |
businesscity | String | the person's business city (typically for an Outlook contact) | AIE 3.0 |
businesscountry | String | the person's business country (typically for an Outlook contact) | AIE 3.0 |
businessfaxnum | String | the person's business fax number (typically for an Outlook contact) | |
businessphonenum | String | the person's business telephone number (typically for an Outlook contact) | |
businessphonenum2 | String | the person's alternative telephone number (typically for an Outlook contact) | |
businesspostalcode | String | the person's business postal code (typically for an Outlook contact) | AIE 3.0 |
businessstate | String | the person's business state (typically for an Outlook contact) | AIE 3.0 |
businessstreet | String | the person's business street (typically for an Outlook contact) | AIE 3.0 |
businessstreet2 | String | the person's alternative business street (typically for an Outlook contact) | AIE 3.0 |
callbackphonenum | String | the person's callback telephone number (typically for an Outlook contact) | |
carphonenum | String | the person's car telephone number (typically for an Outlook contact) | |
cat | String | the document category | |
cc | String | the carbon copy field in an email message | |
ccme | Boolean | the 'CC me' property (for a MIME message) | |
checkedby | String | the name of the person that checked the document | |
client | String | the client | |
clientsubmittime | Date | the email client message submit time | |
comments | Text | comments | |
company | String | the company name | |
companyphonenum | String | the company telephone number (typically for an Outlook contact) | |
completeddate | Date | the completion date | |
contacts | String | the list of contacts (e.g. for an Outlook task) | |
contentbase | String | MIME/email-related property | AIE 3.0 |
contentlanguage | String | MIME/email-related property | AIE 3.0 |
contentlocation | String | MIME/email-related property | AIE 3.0 |
contenttransferencoding | String | MIME/email-related property | AIE 3.0 |
contenttype | String | the content type | |
conversationindex | String | the conversation index for a MIME message | |
conversationtopic | String | the conversation topic for a MIME message | |
creationdate | Date | the date the document was created | |
creatorentryid | String | the 'creator entry ID' header field (for an email message) | |
date | Date | the date on which the document was last modified | |
days | Integer | the number of days assigned (typically, for an Outlook task) | |
deferreddeliverytime | String | the 'deferred delivery time' header value | |
deleteaftersubmit | Boolean | the 'delete after submit' header value (for an email message) | |
department | String | the department | |
description | Text | the description | |
destination | String | the destination | |
displayasemail | String | the 'display as' value for a person's email address (typically for an Outlook contact) | |
displayasemail2 | String | the 'display as' value for a person's alternative email address (typically for an Outlook contact) | |
displayasemail3 | String | the 'display as' value for a person's second alternative email address (typically for an Outlook contact) | |
disposition | String | the disposition | |
division | String | the division | |
docnumber | Integer | the document's number | |
docrevnumber | String | the document revision number (typically a number but may be a string literal, depending on the versioning scheme) | |
docsecurity | Integer | the document security value (typically on MS Office documents) | |
doctype | String | the document type | |
domainkeysignature | String | MIME/email-related property | AIE 3.0 |
duedate | Date | the due date (typically, for an Outlook task) | |
editminutes | Integer | the number of minutes the document was last edited for | |
editor | String | the document editor | |
email | String | the person's email (typically for an Outlook contact) | |
email2 | String | the person's alternative email address (typically for an Outlook contact) | |
email3 | String | the person's second alternative email address (typically for an Outlook contact) | |
entryid | String | entry ID | AIE 3.0 |
entrytype | String | the entry type | |
expirationdate | Date | the email message's expiration date | |
familyname | String | the person's family name (typically for an Outlook contact) | AIE 3.0 |
fileas | String | the 'file as' property | |
firstname | String | the person's first name (typically for an Outlook contact) | AIE 3.0 |
flagstatus | String | the flag status (for email messages) | |
flagsts | Long | the flag sts value | |
footers | Text | the document footers | |
forwardto | String | the 'forward to' value (e.g. in an email message) | |
fullname | String | the full name (typically for Outlook contacts) | |
gender | String | the gender (typically for an Outlook contact) | |
group | String | the group | |
headers | Text | the document headers | |
headingpairs | String | the heading pairs | |
homeaddress | String | the home address (typically for an Outlook contact) | |
homeaddresscity | String | the person's home address city (typically for an Outlook contact) | |
homeaddresscountry | String | the person's home address country (typically for an Outlook contact) | |
homeaddresspobox | String | the person's home address P.O. Box (typically for an Outlook contact) | |
homeaddresspostalcode | String | the person's home address postal code (typically for an Outlook contact) | |
homeaddressstate | String | the person's home address state (typically for an Outlook contact) | |
homeaddressstreet | String | the person's home address street (typically for an Outlook contact) | |
homefaxnum | String | the person's home fax number (typically for an Outlook contact) | |
homephone | String | the person's home phone number | AIE 3.0 |
homephonenum | String | the person's home telephone number (typically for an Outlook contact) | |
homephonenum2 | String | the person's alternative home telephone number (typically for an Outlook contact) | |
hours | Integer | the number of hours assigned (typically, for an Outlook task) | |
imaddress | String | the instant messenger address (typically for an Outlook contact) | |
importance | String | the importance (typically for Outlook tasks) | |
inetmailoverrideformat | Long | the 'Internet mail override format' header value (for a MIME message) | |
injectioninfo | String | MIME/email-related property | AIE 3.0 |
internetarticlenumber | Long | the Internet article number (for a MIME message) | |
internetcpid | Long | the 'Internet CPID' header value (for a MIME message) | |
internetfreebusyaddress | String | the Internet free busy address | |
internetmessageid | String | the 'Internet message ID' header value (for a MIME message) | |
isdnphonenum | String | the contact's ISDN telephone number (typically for an Outlook contact) | |
jobtitle | String | the job title (typically for Outlook contacts) | |
keywords | String | the keywords | |
language | String | the language | |
lastmodifier | String | the name of the person who last saved the document | |
lastmodifierentryid | String | the last modifier's entry ID (for an email message) | |
lastprinteddate | Date | the date the document was last printed | |
latestdeliverytime | Date | the latest delivery time (for an email message) | |
leadperformer | String | the lead performer (e.g. on an MP3 file) | |
lines | String | MIME/email-related property | AIE 3.0 |
linksdirty | String | the 'links dirty' flag, typically in MS Office documents | |
linksuptodate | String | the 'links are up-to-date' flag, typically in MS Office documents | |
location | String | the location | |
mailstatus | Long | the mail status | AIE 3.0 |
mailstop | String | the mail stop | |
manager | String | the manager | |
matter | String | the matter | |
messageflag | String | the message flag (for email messages) | |
messageid | String | MIME/email-related property | AIE 3.0 |
messagelocaleid | Long | the 'message locale ID' header value (for an email message) | |
middlename | String | the person's middle name (typically for an Outlook contact) | AIE 3.0 |
mileage | String | the mileage (typically, for an Outlook task) | |
mimeversion | String | MIME/email-related property | AIE 3.0 |
minutes | Integer | the number of minutes assigned (typically, for an Outlook task) | |
mobilephonenum | String | the person's mobile telephone number (typically for an Outlook contact) | |
msgclass | String | the message class (for a MIME message) | |
msgcodepage | Long | the 'message codepage' header value (for a MIME message) | |
msgeditorformat | Long | the 'message editor format' header value (for a MIME message) | |
msgflag | Long | the message flag value | |
name | String | the name value (e.g. for an Outlook task) | |
newsgroups | String | the list of newsgroups (for newsgroup postings) | |
nickname | String | the nickname (typically for an Outlook contact) | |
nntppostingdate | Date | MIME/email-related property | AIE 3.0 |
nntppostinghost | String | MIME/email-related property | AIE 3.0 |
normalizedsubject | String | normalized subject | AIE 3.0 |
ntsecuritydescriptor | String | the NT security descriptor (for an email message) | |
numchars | Integer | the number of characters in the document | |
numcharswithspaces | Integer | the number of characters in the document, including spaces | |
numhiddenslides | Integer | the number of hidden slides in the document (e.g. in a PowerPoint presentation) | |
numlines | Integer | the number of lines in the document | |
nummmclips | Integer | the number of multimedia clips in the document (e.g. in a PowerPoint presentation) | |
numnotes | Integer | the number of notes in the document | |
numpages | Integer | the number of pages in the document | |
numparagraphs | Integer | the number of paragraphs in the document | |
numslidenotes | Integer | the number of slide notes in the document (e.g. in a PowerPoint presentation) | |
numslides | Integer | the number of slides in the document (e.g. in a PowerPoint presentation) | |
numwords | Integer | the number of words in the document | |
office | String | the office | |
operator | String | the operator | |
optionalattendees | String | the list of optional attendees | |
organization | String | MIME/email-related property | AIE 3.0 |
originatordeliveryreportrequested | Boolean | the 'originator delivery report requested' header value | |
otheraddress | String | the other address (typically for an Outlook contact) | |
otheraddresscity | String | the person's other address city (typically for an Outlook contact) | |
otheraddresscountry | String | the person's other address country (typically for an Outlook contact) | |
otheraddresspobox | String | the person's other address P.O. Box (typically for an Outlook contact) | |
otheraddresspostalcode | String | the person's other postal code (typically for an Outlook contact) | |
otheraddressstate | String | the person's other state (typically for an Outlook contact) | |
otheraddressstreet | String | the person's other address street (typically for an Outlook contact) | |
otherfaxnum | String | the person's other fax number (typically for an Outlook contact) | |
otherphonenum | String | the person's other telephone number (typically for an Outlook contact) | |
owner | String | the owner | |
pagerphonenum | String | the person's pager telephone number (typically for an Outlook contact) | |
path | String | MIME/email-related property | AIE 3.0 |
percentcomplete | String | the percent complete (typically, for an Outlook task) | |
personalhomepage | String | the contact's personal home page (typically for an Outlook contact) | |
presentationformat | String | the presentation format (e.g. for a PowerPoint presentation) | |
primaryphonenum | String | the person's primary telephone number (typically for an Outlook contact) | |
priority | Long | the priority (for an email message) | |
profession | String | the person's profession (typically for an Outlook contact) | |
profileconnectflags | Long | the 'profile connect flags' header value (for a MIME message) | |
progid | String | MIME/email-related property | AIE 3.0 |
project | String | the project | |
purpose | String | the purpose | |
radiophonenum | String | the person's radio telephone number (typically for an Outlook contact) | |
rcvdbyflags | Long | the 'rcvd by flags' header value (for a MIME message) | |
rcvdrepresentingaddrtype | String | the 'rcvd representing addrtype' header value | |
rcvdrepresentingemailaddress | String | the 'rcvd representing email address' header value | |
rcvdrepresentingentryid | String | the 'rcvd representing entry ID' header value | |
rcvdrepresentingflags | Long | the 'rcvd representing flags' header value (for a MIME message) | |
rcvdrepresentingname | String | the 'rcvd representing name' header value | |
rcvdrepresentingsearchkey | String | the 'rcvd representing search key' header value | |
readreceiptrequested | Boolean | the 'mail read receipt requested' header value | |
received | String | the 'received' property | |
receivedbyaddrtype | String | the 'received by addrtype' header value | |
receivedbyemailaddress | String | the 'received by email address' header value | |
receivedbyentryid | String | the 'received by entry ID' header value | |
receivedbyname | String | the 'received by name' header value | |
receivedbysearchkey | String | the 'received by search key' header value | |
receiveddate | Date | the date on which the email message was received | |
receivedfrom | String | the value of the 'received from' header for email messages | |
recipientreassignmentprohibited | String | the 'recipient reassignment prohibited' header value | |
recordedby | String | the name of the person who recorded the information contained in the document | |
recordeddate | Date | the date on which the information contained in the document was recorded | |
reference | String | the reference | |
remindertopic | String | the reminder topic | |
replyrequested | String | the 'reply requested' header value (for an email message) | |
replytime | String | the 'reply time' header value (for an email message) | |
reporttag | String | the 'report tag' header value (for an email message) | |
requiredattendees | String | the list of required attendees | |
responserequested | String | the 'response requested' header value (for an email message) | |
returnpath | String | MIME/email-related property | AIE 3.0 |
revisionnotes | Text | the revision notes | |
rtfbody | String | rtf body | AIE 3.0 |
rtfembeddedbody | String | the 'RTF embedded body' property | |
rtfinsync | Boolean | the 'RTF in sync' header value | |
rtfsyncbodycount | String | the 'RTF sync body count' header value | |
rtfsyncbodycrc | String | the 'RTF sync body crc' header value | |
rtfsyncbodytag | String | the 'RTF sync body tag' header value | |
rtfsyncprefixcount | String | the 'RTF sync prefix count' header value | |
rtfsynctrailingcount | String | the 'RTF sync trailing count' header value | |
scalecrop | String | the scale crop | |
searchkey | String | the 'search key' header value | |
section | String | the section | |
senderaddrtype | String | the 'sender addrtype' header value | |
senderemailaddress | String | the 'sender email address' header value | |
senderentryid | String | the 'sender entry ID' header value | |
senderflags | Long | the 'sender flags' header value (for a MIME message) | |
sendername | String | the 'sender name' header value | |
sendersearchkey | String | the 'sender search key' header value | |
sensitivity | String | the sensitivity (typically for Outlook messages) | |
sentdate | Date | the date on which the message was sent (typically for email messages) | AIE 3.0 |
sentonbehalfof | String | the 'sent on behalf of' header value (for email messages) | |
sentrepresentingaddrtype | String | the 'sent representing addrtype' header value | |
sentrepresentingemailaddress | String | the 'sent representing email address' header value | |
sentrepresentingentryid | String | the 'sent representing entry ID' header value | |
sentrepresentingflags | Long | the 'sent representing flags' header value (for a MIME message) | |
sentrepresentingname | String | sent representing name | AIE 3.0 |
sentrepresentingsearchkey | String | the 'sent representing search key' header value | |
shareddoc | String | whether the document is shared | |
size | Long | the size of the document in bytes | |
sourcemodifieddate | Date | the source modified date (typically on PDF documents) | |
spouse | String | the spouse's name (typically for an Outlook contact) | |
startdate | Date | the start date (typically, for an Outlook task) | |
status | String | the status | |
submissionid | String | the submission ID (for a MIME message) | |
submittime | String | the submit time of an email message | |
telexphonenum | String | the person's telex telephone number (typically for an Outlook contact) | |
threadindex | String | MIME/email-related property | AIE 3.0 |
template | Text | the document template | |
threadtopic | String | MIME/email-related property | AIE 3.0 |
title | String | the document's title | |
titleofparts | String | the title of parts | |
to | String | the list of email message recipients ('to') | |
totalwork | Long | the total work (typically, for an Outlook task) | |
tracknumber | Integer | the track number (e.g. on an MP3 file) | |
transportmessageheaders | Text | the 'transport message headers' value | |
trustsender | Long | the 'trust sender' header value (for an email message) | |
ttyttdphonenum | String | the person's TTY/TTD telephone number (typically for an Outlook contact) | |
typist | String | the typist's name | |
useragent | String | MIME/email-related property | AIE 3.0 |
versiondate | Date | the version date | |
versionnotes | Text | the version notes | |
versionnumber | String | the version number (typically a number but may be a string literal, depending on the versioning scheme) | |
watermark_text | Text | the watermark text | |
webpage | String | the webpage | AIE 3.0 |
webpageaddress | String | the Web page address (typically for Outlook contacts) | |
weeks | Integer | the number of weeks assigned (typically, for an Outlook task) | |
workphone | String | the person's work phone number | AIE 3.0 |
xaccountkey | String | MIME/email-related property | AIE 3.0 |
xcomplaintsto | String | MIME/email-related property | AIE 3.0 |
xfolder | String | MIME/email-related property | AIE 3.0 |
xhttpuseragent | String | MIME/email-related property | AIE 3.0 |
xmimeole | String | MIME/email-related property | AIE 3.0 |
xmozillastatus | String | MIME/email-related property | AIE 3.0 |
xmozillastatus2 | String | MIME/email-related property | AIE 3.0 |
xref | String | MIME/email-related property | AIE 3.0 |
xtrace | String | MIME/email-related property | AIE 3.0 |
xuidl | String | MIME/email-related property | AIE 3.0 |
year | Integer | the year | |