Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12259/41608
Type of publication: Article in Clarivate Analytics Web of Science and/or Scopus (S1)
Field of Science: Informatics (N009)
Author(s): Vaičiūnas, Airenas; Kaminskas, Vytautas; Raškinis, Gailius
Title: Statistical language models of Lithuanian based on word clustering and morphological decomposition
Is part of: Informatica: international journal. Vilnius: Institute of Mathematics and Informatics, Vol. 15, no. 4 (2004)
Extent: p. 565-580
Date: 2004
Keywords: Language models; Morphology; N-grams; Class-based models; Inflections; Interpolation; Perplexity reduction; Out-of-vocabulary words
Abstract: This paper describes our research on statistical language modeling of Lithuanian. We investigated the idea of improving sparse n-gram models of the highly inflected Lithuanian language by interpolating them with complex n-gram models based on word clustering and morphological word decomposition. Words, word base forms, and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based language models were built and evaluated on a Lithuanian text corpus of 85 million words. Class-based models linearly interpolated with the 3-gram model yielded up to a 13% reduction in perplexity relative to the baseline 3-gram model. Morphological models decreased the out-of-vocabulary word rate from 1.5% to 1.02%.
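The quantities named in the abstract (linear interpolation of two models, perplexity, and out-of-vocabulary rate) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the interpolation weight `lam`, and all toy inputs are assumptions introduced here for illustration only, and the numbers are not taken from the paper.

```python
import math

def interpolate(p_word, p_class, lam):
    """Linearly interpolate two model probabilities for the same word in context.

    p_word  -- probability from a word-based n-gram model (hypothetical input)
    p_class -- probability from a class-based n-gram model (hypothetical input)
    lam     -- weight of the word-based model, assumed to lie in [0, 1]
    """
    return lam * p_word + (1.0 - lam) * p_class

def perplexity(probs):
    """Perplexity of a token sequence given the per-token model probabilities."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))

def oov_rate(tokens, vocabulary):
    """Fraction of tokens that do not appear in the model vocabulary."""
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)

# Toy per-token probabilities from two hypothetical models (made-up values).
word_probs = [0.10, 0.02, 0.30]
class_probs = [0.08, 0.06, 0.20]
mixed = [interpolate(pw, pc, lam=0.7) for pw, pc in zip(word_probs, class_probs)]

print("word-model perplexity:  ", perplexity(word_probs))
print("interpolated perplexity:", perplexity(mixed))
print("OOV rate:", oov_rate(["a", "b", "c", "d"], {"a", "b", "c"}))
```

In practice the interpolation weight would be tuned on held-out text; the fixed value 0.7 above is only a placeholder.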
Internet: https://www.mii.lt/informatica/pdf/INFO566.pdf
Affiliation(s): Vytauto Didžiojo universitetas
Appears in Collections: University Research Publications
