Overview

Attivio supports text analytics of the Finnish language through the Advanced Linguistics Module.

Required Modules

These features require that you include the alm module when you run createproject to create the project directories.

The Advanced Linguistics Module is installed as part of the default Attivio platform installer and requires a license key file, which determines which linguistics features are enabled. A license key can be obtained from sales@attivio.com or your Attivio Sales Representative.

Finnish Base and Advanced Modules

The Advanced Linguistics Module contains the Base Module for the Finnish language. Note that some additional configuration changes may be required.

Finnish: Base Module

The Finnish Base Module includes all Finnish linguistics tools (see list below) except for statistical entity extraction. This module must be explicitly authorized by your ALM license file.

Finnish: Advanced Module

Statistical entity extraction is not supported for Finnish text, so there is no Finnish Advanced Module.

Finnish Language Features in Attivio

Attivio supports the following linguistic analysis features for the Finnish language.

  • Segmentation: Segmentation is not required for Finnish text.
  • Lemmatization: The ALM does not support lemmatization for the Finnish language, but it does support stemming (see below).
  • Decompounding: Decompounding is not performed on Finnish text.
  • Trainable Classification and Sentiment Analysis: Classification and Sentiment Analysis are available for the Finnish language. Requires licenses for the Classifier and Sentiment modules.
  • Classification Model: Classification for the Finnish language is supported, but additional training data or dictionaries may be required. Requires license for the Classifier module.
  • Sentiment Analysis Model: Sentiment Analysis for the Finnish language is supported, but additional training data or dictionaries may be required. Requires license for the Sentiment module.
  • Entity-Sentiment Analysis: Entity-sentiment analysis is available for the Finnish language, but a professional-services engagement is required. Requires licenses for the Classifier, Sentiment, and Entity-Sentiment modules.
  • Key Phrase Extraction: Key-phrase extraction is supported for the Finnish language, but you must download the Finnish language model (see below).
  • More-like-this: More-like-this querying is supported for the Finnish language, but you must download the Finnish language model (see below).
  • OCR Module: Optical character recognition is supported for the Finnish language. Requires license for the OCR module.
  • Dictionary-Based Entity Extraction: Dictionary-based entity extraction in the Finnish language is supported, but additional training data or dictionaries may be required.
  • First+Last Name Extraction: First+Last Name Extraction is available for the Finnish language, but a professional-services engagement is required.
  • Statistical Entity Extraction: Statistical entity extraction is not supported for the Finnish language.
  • Spelling Correction: Spelling correction is supported for the Finnish language.

Additional Finnish Language Resources

Various linguistics features of the Attivio platform, including Dictionary-Based Entity Extraction, Synonym Expansion, Acronym Expansion, Lemmatization, Key-Phrase Extraction, and More-Like-This querying, require one or more language-specific resources.

To fully exploit the linguistics functionality on Finnish texts, download and unpack one or more of the resources described below. Language model (jar) files are typically placed in the <install_dir>\lib directory (without unpacking), while dictionaries typically unpack in the <install_dir>\conf\dictionaries directory. Use archive extraction programs such as jar, gzip, or winzip to unpack the files after download.

Note that the ISO-639-1 2-letter code for the Finnish language is "fi".

Main article: Configuring Languages in Attivio.

Language Model

Language models are used for extracting key phrases (main article: Key-Phrase Extraction) and extracting queries for related documents (main article: More-Like-This Support).

Place the language model in the <install_dir>/lib/ directory (it does not require unpacking):

lm-0.2-fi.jar

In addition, you must create a custom definition of the LanguageModelService component with a map entry for "fi":

<project_dir>/conf/components/languageModelService.xml
<component xmlns="http://www.attivio.com/configuration/type/componentType" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    name="languageModelService" 
    class="com.attivio.platform.service.LanguageModelService" 
    xsi:schemaLocation="http://www.attivio.com/configuration/type/componentType http://www.attivio.com/configuration/type/componentType.xsd ">
  <properties>
    <map name="models">
      ...
      <map name="fi">
        <property name="1" value="languagemodel/lm/fi/1grams.bin" />
        <property name="2" value="languagemodel/lm/fi/2grams.bin" />
        <property name="3" value="languagemodel/lm/fi/3grams.bin" />
        <property name="4" value="languagemodel/lm/fi/4grams.bin" />
        <property name="5" value="languagemodel/lm/fi/5grams.bin" />
      </map>
      ...
      <!-- add other language model files here -->
    </map>
  </properties>
</component>
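
A missing or misspelled "fi" entry in this file is easy to overlook, so it can help to check the file before restarting the project. The following is a minimal Python sketch, not an Attivio tool: the function name language_model_orders and the inline SAMPLE document are illustrative assumptions, and the check relies only on the XML layout shown above.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the component definition shown above.
NS = {"c": "http://www.attivio.com/configuration/type/componentType"}


def language_model_orders(xml_text, lang="fi"):
    """Return {n-gram order: resource path} configured for `lang` ({} if absent)."""
    root = ET.fromstring(xml_text)
    query = f".//c:map[@name='models']/c:map[@name='{lang}']/c:property"
    return {int(p.get("name")): p.get("value") for p in root.findall(query, NS)}


# Inline sample mirroring the layout above (hypothetical stand-in for the real file).
SAMPLE = """<component xmlns="http://www.attivio.com/configuration/type/componentType"
    name="languageModelService"
    class="com.attivio.platform.service.LanguageModelService">
  <properties>
    <map name="models">
      <map name="fi">
        <property name="1" value="languagemodel/lm/fi/1grams.bin" />
        <property name="2" value="languagemodel/lm/fi/2grams.bin" />
        <property name="3" value="languagemodel/lm/fi/3grams.bin" />
        <property name="4" value="languagemodel/lm/fi/4grams.bin" />
        <property name="5" value="languagemodel/lm/fi/5grams.bin" />
      </map>
    </map>
  </properties>
</component>"""

orders = language_model_orders(SAMPLE)
missing = [n for n in range(1, 6) if n not in orders]
print("configured orders:", sorted(orders), "missing:", missing)
```

To check a real project, read the contents of <project_dir>/conf/components/languageModelService.xml into a string and pass it in place of SAMPLE.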

Stopwords

Stopwords are typically common words (such as "the" or "a" in English) or words that are not desirable in a certain query context, such as the phrase: "I need information about." Several Attivio components use a stopword list. 

Main article: Stopword Removal

The following stopword lists are available for download below, and they are also installed by default in <install_dir>/conf/dictionaries/:

  • A small stopword list, containing approximately 50 words: 50_stopwords_fi.txt.
  • A stopword list containing approximately 100 words: 100_stopwords_fi.txt.
  • A large stopword list, containing approximately 500 words: 500_stopwords_fi.txt. Reviewing this list before use is recommended as it contains common words such as "seuran", "pelaa", and "Elokuva".
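
To see why the choice of list matters, the sketch below filters a query against a small and a larger stopword set. It is an illustration in Python, not Attivio's stopword component: the one-word-per-line file format, the case-insensitive matching, and the sample words "ja", "on", and "ei" are assumptions, while "Elokuva" is the entry from the large list mentioned above.

```python
def parse_stopwords(text):
    """Parse a stopword list, assuming one word per line; blank lines are skipped."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Drop every token whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


# Hypothetical list contents; the shipped files may differ.
small_list = parse_stopwords("ja\non\nei\n")
large_list = small_list | parse_stopwords("Elokuva\n")

query = ["paras", "elokuva", "ja", "kirja"]
kept_small = remove_stopwords(query, small_list)
kept_large = remove_stopwords(query, large_list)
print(kept_small)
print(kept_large)
```

With the small set only "ja" is dropped, whereas the larger set also drops "elokuva", which changes what the query can match. This is the kind of effect to look for when reviewing the 500-word list.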

Lemmatization

The Advanced Linguistics Module does not support lemmatization for Finnish.

Stemming

Stemming is the process of procedurally removing inflectional affixes from words, indexing the resulting stem at ingest time, and matching using the stem at query time. This process can significantly increase recall, especially in highly inflected languages such as Finnish.  
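
The effect on recall can be illustrated with a toy example. The Python sketch below is not the ALM stemmer: the suffix list, the min_stem guard, and the sample documents are hypothetical, chosen only to show how indexing stems at ingest time and stemming the query lets different inflected forms match each other.

```python
# Toy suffix list, checked in order; NOT the rules used by the ALM stemmer.
SUFFIXES = ("ssa", "sta", "lla", "lta", "n", "t")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Ingest time: map each stem to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index, query_term):
    """Query time: stem the query term and look up the stem."""
    return index.get(toy_stem(query_term), set())


docs = {1: "talossa on kissa", 2: "talon katto", 3: "auto"}
index = build_index(docs)
hits = search(index, "talot")
print(hits)
```

Here the query "talot" matches documents containing "talossa" and "talon" because all three reduce to the stem "talo". Without stemming, none of these forms would match each other. The same toy rules reduce "kaupungissa" to "kaupungi", which is not a dictionary word, so a stem should be treated as a matching key rather than as a correctly spelled term.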

Stemming and Spelling Correction

Since stemming is not guaranteed to produce a correctly spelled word, it can interact badly with spelling correction. Stemming is disabled by default.

The default RLPLinguisticsConfig for Finnish in <install_dir>\conf\basistech\module.xml directs Attivio to index (stack) stems for Finnish, but ALM stemming is disabled by default. To enable stemming in Finnish using the ALM, uncomment the following line in each of the files <install_dir>\conf\basistech\rlpSingleLangContext.xml, <install_dir>\conf\basistech\rlpSingleLangContext.xmlEE, and <install_dir>\conf\basistech\rlpSingleLangQueryContext.xml:

    ...
    <languageprocessor>Stemmer</languageprocessor>
    ...
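
Because the same change has to be made in three files, it may be convenient to script it. The following Python sketch assumes the line is disabled with an ordinary XML comment of the form <!-- <languageprocessor>Stemmer</languageprocessor> -->, which should be confirmed against the installed files first. The function name enable_stemmer is hypothetical, and the code works on a string so that it can be tried without touching the installation.

```python
import re

# Assumed commented-out form of the Stemmer line; verify it against the real files.
COMMENTED_STEMMER = re.compile(
    r"<!--\s*(<languageprocessor>Stemmer</languageprocessor>)\s*-->"
)


def enable_stemmer(xml_text):
    """Return (updated_text, count) where count is the number of lines uncommented."""
    return COMMENTED_STEMMER.subn(r"\1", xml_text)


# Hypothetical fragment standing in for one of the context files.
sample = (
    "<languageprocessors>\n"
    "    <!-- <languageprocessor>Stemmer</languageprocessor> -->\n"
    "</languageprocessors>\n"
)

updated, changed = enable_stemmer(sample)
print(changed)
print(updated)
```

Running the function a second time on its own output changes nothing, so it is safe to apply repeatedly. Back up each file before writing the updated text to it.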


Synonym Expansion

No synonym expansion resources are currently available for Finnish.

Acronym Expansion

No acronym expansion resources are currently available for Finnish.

Entity Resources

Entity Extraction for Finnish uses data files which are part of the Advanced Linguistics Module.

No additional lists of entities (names of people, companies, or locations) are currently available for Finnish.
