Type of publication: Article in Clarivate Analytics Web of Science and/or Scopus (S1)
Field of Science: Informatics (N009)
Author(s): Vaičiūnas, Airenas; Kaminskas, Vytautas; Raškinis, Gailius
Title: Statistical language models of Lithuanian based on word clustering and morphological decomposition
Is part of: Informatica: international journal. Vilnius: Institute of Mathematics and Informatics, Vol. 15, no. 4 (2004)
Extent: p. 565-580
Date: 2004
Keywords: Language models; Morphology; N-grams; Class-based models; Inflections; Interpolation; Perplexity reduction; Out-of-vocabulary words
Abstract: This paper describes our research on statistical language modeling of Lithuanian. We investigated whether sparse n-gram models of the highly inflected Lithuanian language can be improved by interpolating them with more complex n-gram models based on word clustering and morphological word decomposition. Words, word base forms and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based language models were built and evaluated on a Lithuanian text corpus of 85 million words. Class-based models linearly interpolated with the 3-gram model reduced perplexity by up to 13% compared with the baseline 3-gram model. Morphological models decreased the out-of-vocabulary word rate from 1.5% to 1.02%.
Affiliation(s): Vytauto Didžiojo universitetas
Appears in Collections: University Research Publications
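The abstract reports results for linear interpolation and perplexity without showing how they are computed. The following is a minimal Python sketch of the standard definitions, not the authors' implementation. The function names, the interpolation weight `lam`, and the toy per-word probabilities are all assumptions made up for illustration; only the 1.5% and 1.02% out-of-vocabulary rates come from the abstract.

```python
import math


def interpolate(p_word, p_class, lam):
    """Linearly interpolate two probability estimates for the same word.

    lam is the weight of the word-based model (hypothetical value below).
    """
    return lam * p_word + (1.0 - lam) * p_class


def perplexity(probs):
    """Perplexity of a word sequence given the per-word model probabilities."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


def relative_reduction(before, after):
    """Relative reduction of a quantity, as a fraction of its original value."""
    return 1.0 - after / before


# Toy probabilities (made up): the word 3-gram model is confident on frequent
# words but assigns tiny probabilities to rare inflected forms, while the
# class-based model is smoother.
p_word_model = [0.10, 0.001, 0.05, 0.002]
p_class_model = [0.05, 0.02, 0.04, 0.03]
lam = 0.6  # assumed weight; in practice it would be tuned on held-out data

p_mixed = [interpolate(w, c, lam) for w, c in zip(p_word_model, p_class_model)]

baseline_ppl = perplexity(p_word_model)
mixed_ppl = perplexity(p_mixed)
ppl_reduction = relative_reduction(baseline_ppl, mixed_ppl)

# Out-of-vocabulary rates quoted in the abstract (in percent).
oov_reduction = relative_reduction(1.5, 1.02)

print(f"baseline perplexity: {baseline_ppl:.2f}")
print(f"interpolated perplexity: {mixed_ppl:.2f}")
print(f"relative perplexity reduction: {ppl_reduction:.1%}")
print(f"relative OOV reduction: {oov_reduction:.1%}")
```

The same `relative_reduction` helper applies to both figures in the abstract: the drop in out-of-vocabulary rate from 1.5% to 1.02% is a relative reduction of about 32%.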


Citations: 1 (checked on Sep 12, 2020)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.