Overview
Attivio supports text analytics of the Turkish language through the Advanced Linguistics Module.
Required Modules
These features require that you include the alm module when you run createproject to create the project directories.
The Advanced Linguistics Module is installed as part of the default Attivio platform installer and requires a license key file, which determines which linguistics features are enabled. A license key can be obtained from sales@attivio.com or your Attivio sales representative.
Turkish Base and Advanced Modules
The Advanced Linguistics Module contains the Base Module for the Turkish language. Note that some additional configuration changes may be required.
Turkish: Base Module
The Turkish Base Module includes all Turkish linguistics tools (see list below). This module must be explicitly authorized by your ALM license file.
Turkish: Advanced Module
Statistical entity extraction is not available for Turkish, and therefore there is no Turkish Advanced Module.
Turkish Language Features in Attivio
Attivio supports the following linguistic analysis features for the Turkish language.
- Segmentation: Segmentation is not required for Turkish text.
- Lemmatization: The ALM does not support lemmatization for the Turkish language, but it does support stemming (see below).
- Decompounding: Decompounding is not performed on Turkish text.
- Trainable Classification and Sentiment Analysis: Classification and Sentiment Analysis are available for the Turkish language. Requires licenses for the Classifier and Sentiment modules.
- Classification Model: Classification for the Turkish language is supported, but additional training data or dictionaries may be required. Requires license for the Classifier module.
- Sentiment Analysis Model: Sentiment Analysis for the Turkish language is supported, but additional training data or dictionaries may be required. Requires license for the Sentiment module.
- Entity-Sentiment Analysis: Entity-sentiment analysis is available for the Turkish language, but a professional-services engagement is required. Requires licenses for the Classifier, Sentiment, and Entity-Sentiment modules.
- Key Phrase Extraction: Key-phrase extraction is supported in Turkish, and requires downloading and installing the Turkish language model (see below).
- More-like-this: More-like-this querying is supported in Turkish, and requires downloading and installing the Turkish language model (see below).
- OCR Module: Optical character recognition is supported for the Turkish language. Requires license for the OCR module.
- Dictionary-Based Entity Extraction: Dictionary-based entity extraction in the Turkish language is supported, but additional training data or dictionaries may be required.
- First+Last Name Extraction: First+Last Name Extraction is available for the Turkish language, but a professional-services engagement is required.
- Statistical Entity Extraction: Statistical entity extraction is not currently supported for the Turkish language.
- Spelling Correction: Spelling correction is supported for the Turkish language.
Additional Turkish Language Resources
Various linguistics features of the Attivio platform, including Dictionary-based Entity Extraction, Synonym Expansion, Acronym Expansion, Lemmatization, Keyphrase extraction, and More-Like-This extraction, require one or more language-specific resources.
To fully exploit the linguistics functionality on Turkish texts, download and unpack one or more of the resources described below. Language model (jar) files are typically placed in the <install_dir>\lib directory (without unpacking), while dictionaries are typically unpacked into the <install_dir>\conf\dictionaries directory. Use an archive extraction program such as jar, gzip, or WinZip to unpack the files after download.
Note that the ISO-639-1 2-letter code for the Turkish language is "tr".
Main article: Configuring Languages in Attivio.
Language Model
Language models are used for extracting key phrases (main article: Key-Phrase Extraction) and extracting queries for related documents (main article: More-Like-This Support).
Place the language model file in the <install_dir>/lib/ directory (it does not require unpacking).
In addition, you must create a custom definition of the LanguageModelService component with a map entry for "tr":
<component name="languageModelService" class="com.attivio.platform.service.LanguageModelService" override="true">
  <properties>
    <map name="models">
      ...
      <map name="tr">
        <property name="1" value="languagemodel/lm/tr/1grams.bin" />
        <property name="2" value="languagemodel/lm/tr/2grams.bin" />
        <property name="3" value="languagemodel/lm/tr/3grams.bin" />
        <property name="4" value="languagemodel/lm/tr/4grams.bin" />
        <property name="5" value="languagemodel/lm/tr/5grams.bin" />
      </map>
      ...
      <!-- add other language model files here -->
    </map>
  </properties>
</component>
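As a quick sanity check on a custom definition like the one above, the following minimal Python sketch parses a component definition and lists the n-gram files registered for a language. It is not part of Attivio: the function name language_model_paths and the SAMPLE string are illustrative, and the sketch assumes the override is available as a standalone, namespace-free XML fragment with the "..." placeholders removed.

```python
import xml.etree.ElementTree as ET

# Illustrative copy of the component definition, with the "..." placeholders removed.
SAMPLE = """
<component name="languageModelService"
           class="com.attivio.platform.service.LanguageModelService"
           override="true">
  <properties>
    <map name="models">
      <map name="tr">
        <property name="1" value="languagemodel/lm/tr/1grams.bin" />
        <property name="2" value="languagemodel/lm/tr/2grams.bin" />
        <property name="3" value="languagemodel/lm/tr/3grams.bin" />
        <property name="4" value="languagemodel/lm/tr/4grams.bin" />
        <property name="5" value="languagemodel/lm/tr/5grams.bin" />
      </map>
    </map>
  </properties>
</component>
"""


def language_model_paths(component_xml, lang):
    """Return {n-gram order: model path} for one language, or {} if it is not configured."""
    root = ET.fromstring(component_xml)
    models = root.find("./properties/map[@name='models']")
    if models is None:
        return {}
    lang_map = models.find("./map[@name='%s']" % lang)
    if lang_map is None:
        return {}
    # Each property maps an n-gram order (name) to a model file (value).
    return {int(p.get("name")): p.get("value") for p in lang_map.findall("property")}


for order, path in sorted(language_model_paths(SAMPLE, "tr").items()):
    print(order, path)
```

An empty result for "tr" would suggest that the map entry is missing or misspelled in the definition being checked.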
Stopwords
Stopwords are typically common words (such as "the" or "a" in English) or words that are not desirable in a certain query context, such as the phrase: "I need information about." Several Attivio components use a stopword list.
Main article: Stopword Removal
The following stopword lists are available for download below, and they are also installed by default in <install_dir>/conf/dictionaries/:
- A small stopword list, containing approximately 50 words: 50_stopwords_tr.txt
- A stopword list containing approximately 100 words: 100_stopwords_tr.txt
- A large stopword list, containing approximately 500 words: 500_stopwords_tr.txt. Reviewing this list before use is recommended as it contains common words such as "Fakültesi", "Futbol", and "teleskopla".
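One way to approach the recommended review of the large list is to flag entries that look like content words rather than function words. The sketch below is only a heuristic under stated assumptions: it assumes one entry per line, and the rule (capitalized or longer than eight characters) and the name flag_for_review are illustrative choices, not anything Attivio provides. The sample mixes the three words quoted above with a few short Turkish function words.

```python
def flag_for_review(words, max_len=8):
    """Return entries that are capitalized or unusually long for a stopword."""
    flagged = []
    for raw in words:
        word = raw.strip()
        if not word:
            continue  # skip blank lines
        if word[0].isupper() or len(word) > max_len:
            flagged.append(word)
    return flagged


# Illustrative sample, not the contents of any shipped list.
sample = ["ve", "bir", "Fakültesi", "Futbol", "teleskopla", "için"]
print(flag_for_review(sample))  # ['Fakültesi', 'Futbol', 'teleskopla']
```

Because the function accepts any iterable of lines, it can also be given an open file handle for a stopword list, assuming the file is UTF-8 encoded.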
Lemmatization
The Advanced Linguistics Module does not support lemmatization for Turkish.
Stemming
Stemming is the process of procedurally removing inflectional affixes from words, indexing the resulting stem at ingest time, and matching on the stem at query time. This process can significantly increase recall, especially in highly inflected languages such as Turkish.
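The effect on recall can be illustrated with a toy Python sketch. This is not the ALM stemmer: it strips only the Turkish plural suffixes -lar and -ler, assumes lowercased input, and all names (toy_stem, build_index, search) are hypothetical. It only shows the idea of indexing by stem at ingest time and stemming the query the same way.

```python
from collections import defaultdict

# Toy suffix list: only the Turkish plural suffixes -lar / -ler.
SUFFIXES = ("lar", "ler")


def toy_stem(word, min_stem=2):
    """Strip one known suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Ingest time: map each stem to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Query time: stem the query term and look up the stem."""
    return index.get(toy_stem(query), set())


docs = {1: "evler", 2: "kitaplar", 3: "ev"}  # "houses", "books", "house"
index = build_index(docs)
print(sorted(search(index, "ev")))     # [1, 3]
print(sorted(search(index, "kitap")))  # [2]
```

Without stemming, the query "ev" would match only document 3; with it, the inflected form "evler" in document 1 is found as well.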
Stemming and Spelling Correction
Since stemming is not guaranteed to produce a correctly spelled word, it can interact badly with spelling correction. Stemming is disabled by default.
The default RLPLinguisticsConfig for Turkish in <install_dir>\conf\basistech\module.xml directs Attivio to index (stack) stems for Turkish, but the ALM stemming module is disabled by default. To enable stemming using the ALM, uncomment the following line in each of the files <install_dir>\conf\basistech\rlpSingleLangContext.xml, <install_dir>\conf\basistech\rlpSingleLangContext.xmlEE, and <install_dir>\conf\basistech\rlpSingleLangQueryContext.xml:
... <languageprocessor>Stemmer</languageprocessor> ...
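To confirm that the line is active in each file after editing, a small check along the following lines may help. It is a sketch, not an Attivio tool: it assumes the disabled line is wrapped in a standard XML comment (<!-- ... -->), the function name stemmer_enabled is hypothetical, and the enclosing <languageprocessors> element in the samples is invented for illustration.

```python
import re

# Matches XML comments, including ones that span several lines.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Matches the Stemmer entry, allowing whitespace around the name.
STEMMER = re.compile(r"<languageprocessor>\s*Stemmer\s*</languageprocessor>")


def stemmer_enabled(xml_text):
    """True if the Stemmer line is present outside of any XML comment."""
    return bool(STEMMER.search(COMMENT.sub("", xml_text)))


# Hypothetical file fragments; the surrounding element name is an assumption.
disabled = """<languageprocessors>
  <!-- <languageprocessor>Stemmer</languageprocessor> -->
</languageprocessors>"""
enabled = """<languageprocessors>
  <languageprocessor>Stemmer</languageprocessor>
</languageprocessors>"""

print(stemmer_enabled(disabled))  # False
print(stemmer_enabled(enabled))   # True
```

Running the check on the text of each of the three context files named above should return True for all of them once stemming has been enabled consistently.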
Synonym Expansion
No synonym expansion resources are currently available for Turkish.
Acronym Expansion
No acronym expansion resources are currently available for Turkish.
Entity Resources
No additional lists of entities (names of people, companies, or locations) are currently available for Turkish.