
Overview

A language model gives the probability for sequences of tokens in a language, trained over a large corpus of text. Language models are used in a number of AIE's components, including Key-Phrase Extraction and "More-like-this" functionality. In this document, we will look at how to configure a language model when it is used in a component, and how to create customized language models for additional languages or domains.

Support for building and using language models is provided by the optional 'languagemodel' module. AIE ships with English language model files. Pre-built language models are available for download for the following languages: Dutch (nl), French (fr), German (de), Italian (it), Spanish (es), Portuguese (pt). For other languages, please contact sales@attivio.com.

There is a large literature on the construction and use of language models, and the many difficult and subtle issues that arise when dealing with them. We will take a more practical approach here, and describe the API and configuration of the language model module, without delving deeply into the details.

Required Modules

These features require that the languagemodel module be included when you run createproject to create the project directories.


Language Model Configuration

A language model is usually set up as a service. Language models contain large amounts of data, so loading that data more than once per AIE instance would be wasteful.

In the file <install_dir>\conf\languagemodel\module.xml, a language model is set up as a service in the following way:

<services>
  <service name="languageModelService"/>
</services>

The name of the service refers to a component defined elsewhere in the same file:

<component name="languageModelService"
           class="com.attivio.platform.service.LanguageModelService" >
  <properties>
    <map name="models">
      <map name="en">
        <property name="1" value="languagemodel/lm/en/1grams.bin" />
        <property name="2" value="languagemodel/lm/en/2grams.bin" />
        <property name="3" value="languagemodel/lm/en/3grams.bin" />
        <property name="4" value="languagemodel/lm/en/4grams.bin" />
        <property name="5" value="languagemodel/lm/en/5grams.bin" />
      </map>
      <!-- add other language model files here -->
    </map>
  </properties>
</component>

Note that the configuration of the languageModelService is a map from language names (specifically, their ISO 639-1 two-letter codes) to a map from n-gram order (the length of the token sequence) to a model file. For English, one file contains probabilities for one-token sequences (1grams.bin), another contains probabilities for two-token sequences (2grams.bin), and so on up to five-token sequences (5grams.bin). Not all of these files need to be loaded: only the files up to the maximum order used by any component that uses the languageModelService are required.
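As an illustration of loading only some of the files, the following sketch trims the English map to orders 1 through 3. It assumes, purely for the sake of example, that no component in the project uses sequences longer than three tokens; check the maximum order your components actually use before removing entries.

```xml
<component name="languageModelService"
           class="com.attivio.platform.service.LanguageModelService" >
  <properties>
    <map name="models">
      <map name="en">
        <!-- Assumption for this sketch: no component needs an order above 3,
             so the 4grams.bin and 5grams.bin entries are left out. -->
        <property name="1" value="languagemodel/lm/en/1grams.bin" />
        <property name="2" value="languagemodel/lm/en/2grams.bin" />
        <property name="3" value="languagemodel/lm/en/3grams.bin" />
      </map>
    </map>
  </properties>
</component>
```

Leaving out the higher-order files avoids loading data that no component will read.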

Each language served by the languageModelService needs its own map. To add the French language model to this languageModelService, add the following XML snippet to the configuration above, just above the comment "add other language model files here":

<map name="fr">
  <property name="1" value="languagemodel/lm/fr/1grams.bin" />
  <property name="2" value="languagemodel/lm/fr/2grams.bin" />
  <property name="3" value="languagemodel/lm/fr/3grams.bin" />
  <property name="4" value="languagemodel/lm/fr/4grams.bin" />
  <property name="5" value="languagemodel/lm/fr/5grams.bin" />
</map>

The file names must, of course, point to the correct French language model files. Since language models have no other configuration options, there is nothing else to set up.
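Put together, the component definition with both English and French configured might look like the sketch below. It assumes the downloaded French files were placed under languagemodel/lm/fr/, mirroring the English layout; adjust the paths to wherever the files actually reside.

```xml
<component name="languageModelService"
           class="com.attivio.platform.service.LanguageModelService" >
  <properties>
    <map name="models">
      <map name="en">
        <property name="1" value="languagemodel/lm/en/1grams.bin" />
        <property name="2" value="languagemodel/lm/en/2grams.bin" />
        <property name="3" value="languagemodel/lm/en/3grams.bin" />
        <property name="4" value="languagemodel/lm/en/4grams.bin" />
        <property name="5" value="languagemodel/lm/en/5grams.bin" />
      </map>
      <!-- French map: the paths below assume the files were copied
           to languagemodel/lm/fr/ -->
      <map name="fr">
        <property name="1" value="languagemodel/lm/fr/1grams.bin" />
        <property name="2" value="languagemodel/lm/fr/2grams.bin" />
        <property name="3" value="languagemodel/lm/fr/3grams.bin" />
        <property name="4" value="languagemodel/lm/fr/4grams.bin" />
        <property name="5" value="languagemodel/lm/fr/5grams.bin" />
      </map>
      <!-- add other language model files here -->
    </map>
  </properties>
</component>
```

Both language maps sit side by side inside the single models map, so one service instance serves every configured language.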

Using The Language Model Service

  1. Ensure that the languagemodel configuration files are loaded, preferably by including 'languagemodel' in the list of modules supplied to createproject.
  2. For each language (besides English), add a map section to the models map for the languageModelService component. Copying the "en" map and replacing "en" with the appropriate language code should suffice.

Note that the language model service provided by the languagemodel module cannot presently be used directly by end users. It is used internally by a number of AIE components, including Key-Phrase Extraction and "More-like-this" functionality.
