Quick Start
The following exercise builds on the Attivio Quick Start Tutorial and takes a closer look at processing languages beyond English.
1. Deploy the Attivio Factbook project
Open the Attivio Quick Start Tutorial in a new tab in your browser, follow the instructions for Steps 1–6 to deploy and start the Factbook project and run its connectors, and return to this page when complete. If you already have the project deployed, just follow Step 5 to start it and Step 6 to run its connectors.
2. Review language and languages fields
When we created the Factbook project, we included the demo group of modules (-g demo), which includes alm (Advanced Linguistics Module). Without any further changes to our project, we are able to apply sophisticated text analytics to English content only. Below, we will outline the necessary modifications to allow Attivio to recognize over fifty languages and apply language-specific text analytics.
Locale vs. Language
Locale values are two-letter language-identification codes that are temporarily attached to documents, fields, and field values during ingestion. (A typical locale code would be "en", meaning English.) The locale properties are consulted during text analysis to determine which language-specific tools should be applied to a particular field value. The locale properties are normally invisible and ephemeral. A field's locale may be accessed programmatically, but is not visible when a field or field value is copied or displayed. Locale properties are not stored in the Attivio index.
The language and languages fields are string-valued fields with human-readable values ("English" instead of "en"). They are set as a byproduct of locale detection. These fields are indexed and stored, and can be queried or displayed as facets.
If you write a custom workflow component that sets locale or language values, the “best practice” is to set both values. This is what Attivio's language identifier component does.
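The distinction can be illustrated with a small sketch. This is not Attivio's API: the LOCALE_TO_LANGUAGE table, the tag_language and to_index functions, and the plain-dict document are hypothetical stand-ins, and treating languages as a multi-valued field is an assumption. The sketch only shows why a component should set the locale and the language fields together, and that only the language fields survive into the index.

```python
# Hypothetical illustration only; none of these names come from Attivio.
LOCALE_TO_LANGUAGE = {"en": "English", "es": "Spanish", "ja": "Japanese"}


def tag_language(doc, locale):
    """Set the ephemeral locale and the stored language fields together."""
    language = LOCALE_TO_LANGUAGE[locale]
    tagged = dict(doc)
    tagged["_locale"] = locale  # stand-in for the transient locale property
    tagged["language"] = language  # indexed, human-readable value
    # Assumed multi-valued field: keep existing values, avoid duplicates
    languages = list(tagged.get("languages", []))
    if language not in languages:
        languages.append(language)
    tagged["languages"] = languages
    return tagged


def to_index(doc):
    """Drop transient properties, mimicking that locale is not stored."""
    return {k: v for k, v in doc.items() if not k.startswith("_")}


doc = to_index(tag_language({"title": "Encierro"}, "es"))
print(doc)
```

The indexed document keeps language and languages but carries no locale, which matches the behavior described above.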
Let's examine the language and languages fields in our Factbook index before we enable the project for some additional languages.
Navigate to http://localhost:17000/searchui/ (or click the Administrator UI's Query > Search UI menu link) in your web browser to open the Search UI web application.
Log in with username aieadmin and password attivio when prompted.
Click the Go button to submit the default *:* query to view all documents in the index.
At the top right of the screen, click the Details > On button to display full details for the documents.
You can see that both the language and languages fields are populated for each document in the index.
Let's use some of the Simple Query Language we learned in the Attivio Quick Start to see if there are any documents in the index in any language besides English.
Enter -languages:English in the search box and click Go. This returns every document whose languages field does not contain English. Your search will likely return 0 documents unless you happened to ingest some news that contained no English text.
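The effect of the negated term can be sketched outside Attivio. The example below is an assumption about the semantics, not Attivio code: it treats -languages:English as "exclude every document whose languages field contains English". The docs list and the exclude_value function are hypothetical.

```python
# Hypothetical in-memory documents; field names mirror the tutorial.
docs = [
    {"id": "country-1", "languages": ["English"]},
    {"id": "news-7", "languages": ["English", "French"]},
    {"id": "news-9", "languages": ["Spanish"]},
]


def exclude_value(documents, field, value):
    """Keep documents whose multi-valued field does not contain value."""
    return [d for d in documents if value not in d.get(field, [])]


matches = exclude_value(docs, "languages", "English")
print([d["id"] for d in matches])  # prints ['news-9']
```

Under this reading, a document that mixes English and French is excluded as well, because English is one of its languages values.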
3. Ingest Spanish and Japanese documents
Download the attached 富士山.pdf and Encierro_(tauromaquia).pdf files and save them to C:\temp\. Execute the following steps to ingest these files.
Go to http://localhost:17000/admin/connectors to open the Attivio Admin UI.
Click the button to create a new connector. Search for and select Generic File System, then press OK.
The New Connector dialog will open up. Enter values for the following fields:
- Name: pdfs
- Start Directory: C:\temp
- Wildcard Include Filter: *.pdf (delete the other items in the list)
- Maximum directory depth: 0
Click on the Field Mappings tab.
Create the following Static Field Values entry:
- table: pdfs
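To make the effect of these settings concrete, here is a minimal sketch of what the connector configuration selects. It does not use any Attivio API: preview_documents is a hypothetical function, and using the file path as the document id is an assumption made for illustration.

```python
from pathlib import Path


def preview_documents(start_dir, include="*.pdf", static_fields=None):
    """List files directly under start_dir (depth 0) matching the filter."""
    static_fields = static_fields or {}
    previews = []
    # A plain glob pattern does not descend into subdirectories
    for path in sorted(Path(start_dir).glob(include)):
        if path.is_file():
            doc = {"id": str(path)}  # hypothetical id: the file path
            doc.update(static_fields)  # e.g. table=pdfs on every document
            previews.append(doc)
    return previews


previews = preview_documents(r"C:\temp", static_fields={"table": "pdfs"})
print(previews)
```

With the two tutorial PDFs saved in C:\temp, the list would hold two entries, each carrying table set to pdfs. Files in subdirectories and files with other extensions are skipped.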
Click Save & Test
The Save & Test feature allows you to preview documents that would be fed in by the connector. Click on either document ID in the left panel and that document's details will be displayed to the right. Notice that our static field value has been applied.
Click OK
Click Save
Run the connector by selecting it and clicking the button in the top toolbar. Click Start in the confirmation dialog window.
Once the connector finishes, you will see the number of documents it has ingested.
4. Review the ingested PDFs
Let's take a look at the fields of our PDFs now that they've been ingested. Return to the Search UI and execute a query for table:pdfs. If not still applied, click the Details > On button to display full details for the documents.
Even though our documents are in Japanese and Spanish respectively, they have not been identified as such. As mentioned above, the alm module included in our Factbook project supports only English out of the box. The next few steps modify our project so that it can identify non-English documents and apply language-specific text analytics for Japanese and Spanish.
5. Install the alm-languages module
Download the latest version of the alm-languages module from Advanced Linguistics Languages Download and follow the instructions to install the external module.
Validate that the installation was successful by executing the following command. You should see a version of alm-languages listed.
<install-dir>\bin\aie-exec.exe modulemanager -l
Name           Version  User       Installed On         Description
-----------------------------------------------------------------------------------------
alm-languages  1.0.0    localuser  2018-06-09T19:41:38  Provides non-english language advanced linguistics capabilities, including language detection and entity extraction
...
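If you want to check the listing from a script, the output can be parsed as plain text. The sketch below is an illustration under assumptions: installed_modules is a hypothetical helper, and it assumes each module occupies a single line whose first two whitespace-separated columns are the name and version, as in the sample above.

```python
def installed_modules(listing):
    """Map module name to version from `modulemanager -l` style output."""
    modules = {}
    for line in listing.splitlines():
        parts = line.split()
        # Skip blanks, the header row, the dashed rule, and "..." elisions
        if len(parts) < 2 or parts[0] == "Name" or set(parts[0]) <= {"-", "."}:
            continue
        modules[parts[0]] = parts[1]
    return modules


sample = """\
Name           Version  User       Installed On         Description
------------------------------------------------------------------
alm-languages  1.0.0    localuser  2018-06-09T19:41:38  Provides non-english language advanced linguistics capabilities
...
"""
print(installed_modules(sample).get("alm-languages"))  # prints 1.0.0
```

A result of None for alm-languages would indicate that the module is not listed.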
Unlike some other modules, the alm-languages module does not need to be added explicitly to a project. It simply contains libraries, models, dictionaries and other artifacts to assist Attivio in processing additional languages.
6. Install additional languages license
Additional languages beyond English are licensed individually. A license key for additional languages can be obtained from sales@attivio.com or your Attivio Sales Representative. Once you have a license that includes additional languages, install it by placing the rlp-license.xml file into the <install-dir>/lib/basisTech/licenses directory. To determine which features are enabled, view the license file contents.
For example, a license which includes Japanese and Spanish will include something like the following:
...
<license>
  <module>RLP</module>
  <language>Spanish</language>
  <function>Tokenizer, Base Noun Phrase Detector, Part of Speech, Sentence Boundary Detection, Morphological Analyzer, Named Entity Extractor</function>
  <license_key>...</license_key>
</license>
...
<license>
  <module>RLP</module>
  <language>Japanese</language>
  <function>Tokenizer, Base Noun Phrase Detector, Part of Speech, Sentence Boundary Detection, Morphological Analyzer, Named Entity Extractor</function>
  <license_key>...</license_key>
</license>
...
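Rather than reading the license file by eye, the enabled languages and functions can be summarized with a short script. This is a sketch under assumptions: licensed_languages is a hypothetical helper, the licenses wrapper element in the sample is invented because the real file's surrounding structure is elided above, and the elements are assumed to carry no XML namespace.

```python
import xml.etree.ElementTree as ET


def licensed_languages(xml_text):
    """Map each licensed language to its list of enabled functions."""
    root = ET.fromstring(xml_text)
    result = {}
    # iter() finds <license> elements at any depth, including the root
    for lic in root.iter("license"):
        language = lic.findtext("language")
        functions = lic.findtext("function", default="")
        if language:
            result[language] = [f.strip() for f in functions.split(",") if f.strip()]
    return result


# Hypothetical wrapper element; the real file's root is not shown here.
sample = """
<licenses>
  <license>
    <module>RLP</module>
    <language>Spanish</language>
    <function>Tokenizer, Named Entity Extractor</function>
  </license>
  <license>
    <module>RLP</module>
    <language>Japanese</language>
    <function>Tokenizer, Part of Speech</function>
  </license>
</licenses>
"""
print(sorted(licensed_languages(sample)))  # prints ['Japanese', 'Spanish']
```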
7. Download and install specific language model jars
Download the jar files for Spanish and Japanese and install them into Attivio.
Download the lm-0.2-ja-BT.jar model file from the Japanese (ja) page.
Download the lm-0.2-es-BT.jar model file from the Spanish (es) page.
Place the language model files in the <install-dir>/lib/ directory (they do not require unpacking).
8. Modify the LanguageModelService component
The final step to enable Japanese and Spanish is to modify the LanguageModelService component.
Open <project-dir>/conf/components/languageModelService.xml
in a text editor.
As instructed on the Japanese (ja) and Spanish (es) pages, add the language-specific mappings to the configuration so that it appears as follows:
<?xml version="1.0" encoding="UTF-8"?>
<component xmlns="http://www.attivio.com/configuration/type/componentType"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           name="languageModelService"
           class="com.attivio.platform.service.LanguageModelService"
           xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType classpath:/xsd/type/componentType.xsd">
  <properties>
    <map name="models">
      <map name="en">
        <property name="1" value="languagemodel/lm/en/1grams.bin"/>
        <property name="2" value="languagemodel/lm/en/2grams.bin"/>
        <property name="3" value="languagemodel/lm/en/3grams.bin"/>
        <property name="4" value="languagemodel/lm/en/4grams.bin"/>
        <property name="5" value="languagemodel/lm/en/5grams.bin"/>
      </map>
      <map name="es">
        <property name="1" value="languagemodel/lm/es-BT/1grams.bin" />
        <property name="2" value="languagemodel/lm/es-BT/2grams.bin" />
        <property name="3" value="languagemodel/lm/es-BT/3grams.bin" />
        <property name="4" value="languagemodel/lm/es-BT/4grams.bin" />
        <property name="5" value="languagemodel/lm/es-BT/5grams.bin" />
      </map>
      <map name="ja">
        <property name="1" value="languagemodel/lm/ja/1grams.bin" />
        <property name="2" value="languagemodel/lm/ja/2grams.bin" />
        <property name="3" value="languagemodel/lm/ja/3grams.bin" />
        <property name="4" value="languagemodel/lm/ja/4grams.bin" />
        <property name="5" value="languagemodel/lm/ja/5grams.bin" />
      </map>
    </map>
  </properties>
</component>
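Because a typo in this file is easy to miss, it can help to read the mappings back before deploying. The following sketch is not part of Attivio: model_paths and missing_orders are hypothetical helpers that parse the configuration with Python's standard library and report any n-gram order from 1 to 5 that a locale lacks. The shortened sample is deliberately incomplete to show the check.

```python
import xml.etree.ElementTree as ET

NS = {"c": "http://www.attivio.com/configuration/type/componentType"}


def model_paths(xml_text):
    """Return {locale: {ngram_order: path}} from the "models" map."""
    root = ET.fromstring(xml_text)
    models = root.find("c:properties/c:map[@name='models']", NS)
    if models is None:
        return {}
    result = {}
    for locale_map in models.findall("c:map", NS):
        result[locale_map.get("name")] = {
            int(p.get("name")): p.get("value")
            for p in locale_map.findall("c:property", NS)
        }
    return result


def missing_orders(paths, orders=range(1, 6)):
    """List the n-gram orders (1 through 5) absent for each locale."""
    return {loc: [n for n in orders if n not in m] for loc, m in paths.items()}


# Shortened, deliberately incomplete sample for illustration.
sample = """
<component xmlns="http://www.attivio.com/configuration/type/componentType"
           name="languageModelService">
  <properties>
    <map name="models">
      <map name="es">
        <property name="1" value="languagemodel/lm/es-BT/1grams.bin"/>
        <property name="2" value="languagemodel/lm/es-BT/2grams.bin"/>
      </map>
    </map>
  </properties>
</component>
"""
print(missing_orders(model_paths(sample)))  # prints {'es': [3, 4, 5]}
```

For the full configuration shown above, every locale would map to an empty list.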
Save and close the file.
9. Deploy project changes
We've made several changes to our Attivio installation and project configuration. To apply them, we must open the CLI, deploy the changes, and restart our project.
Open the CLI
<install-dir>\bin\aie-cli.exe -p C:\attivio\projects\Factbook
Preserve the changes we made via the Admin UI (the pdfs connector we created). Type the update command and hit Enter.
Deploy our changes and trigger a restart of the node. Type the deploy command and hit Enter.
Once the CLI reports the node as RUNNING again, move on to the next step. You can monitor the project by executing the status command.
10. Run the pdfs connector again
Return to http://localhost:17000/admin/connectors and run the pdfs connector once again.
11. Review the ingested PDFs again
Let's take another look at our PDFs now that they've been ingested with Japanese and Spanish enabled. Return to the Search UI and execute a query for table:pdfs. If not still applied, click the Details > On button to display full details for the documents.
You should see that the Spanish document has been identified as Spanish.
You should also see that the Japanese document has been identified as primarily Japanese, and that some English and French content in the document has been detected as well.
Summary
In this tutorial, we've observed how Attivio applies advanced linguistic processing and text analytics for English by using the alm module. We reviewed the steps required to enable additional languages: adding the alm-languages module, acquiring and installing a replacement license that includes the desired languages, and installing and configuring the language-specific models. This only scratches the surface of the linguistic and natural language processing possible with Attivio.
What Next?
Now that we've covered the basics of enabling advanced linguistics for English and other languages, you may be interested in the following:
- Learn more about the Advanced Linguistics Module.
- Learn more about Loading File Content like the PDF files we ingested in this tutorial.