Statistical language models of Lithuanian based on word clustering and morphological decomposition

Vaičiūnas, Airenas; Kaminskas, Vytautas; Raškinis, Gailius

Use this url to cite publication: https://hdl.handle.net/20.500.12259/41608

Type of publication

Straipsnis Web of Science ir Scopus duomenų bazėje / Article in Web of Science and Scopus database (S1)

Author(s)

Author	Affiliation
Vaičiūnas, Airenas	Vytauto Didžiojo universitetas / Vytautas Magnus University	LT
Kaminskas, Vytautas	Vytauto Didžiojo universitetas / Vytautas Magnus University	LT
Raškinis, Gailius	Vytauto Didžiojo universitetas / Vytautas Magnus University	LT

Title

Statistical language models of Lithuanian based on word clustering and morphological decomposition

Is part of

Informatica: international journal. Vilnius : Institute of mathematics and informatics, Vol. 15, no.4 (2004)

Date Issued

2004

Publisher

Vilnius : Institute of mathematics and informatics

Is Referenced by

Science Citation Index Expanded (Web of Science)

INSPEC

Zentralblatt MATH (zbMATH)

Scopus

Extent

p. 565-580

URI

https://www.mii.lt/informatica/pdf/INFO566.pdf
https://hdl.handle.net/20.500.12259/41608

Abstract (en)

This paper describes our research on statistical language modeling of Lithuanian. We investigated whether sparse n-gram models of the highly inflected Lithuanian language can be improved by interpolating them with more complex n-gram models based on word clustering and morphological word decomposition. Words, word base forms and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based language models were built and evaluated on a Lithuanian text corpus of 85 million words. Class-based models linearly interpolated with the 3-gram model reduced perplexity by up to 13% relative to the baseline 3-gram model. Morphological models decreased the out-of-vocabulary word rate from 1.5% to 1.02%.
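
The linear interpolation and perplexity comparison mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-token probabilities, the function names `interpolate` and `perplexity`, and the weight `lam` are all hypothetical, chosen only to show how mixing a word-based estimate with a class-based one can lower perplexity on tokens the word model scores poorly.

```python
import math

def interpolate(p_word, p_class, lam):
    """Linearly interpolate two per-token probability estimates.

    lam is the weight of the word-based model (hypothetical value here;
    in practice it would be tuned on held-out data).
    """
    return [lam * pw + (1.0 - lam) * pc for pw, pc in zip(p_word, p_class)]

def perplexity(probs):
    """Perplexity: exp of the average negative log-probability per token."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Invented per-token probabilities for a four-token test sequence.
# The word model is very unsure about tokens 2 and 4 (sparse n-grams);
# the class model is smoother there.
p_word = [0.10, 0.001, 0.05, 0.002]
p_class = [0.08, 0.010, 0.04, 0.015]

baseline = perplexity(p_word)
mixed = perplexity(interpolate(p_word, p_class, lam=0.6))
reduction = 100.0 * (baseline - mixed) / baseline  # relative reduction in %

print(f"baseline perplexity:     {baseline:.2f}")
print(f"interpolated perplexity: {mixed:.2f}")
print(f"relative reduction:      {reduction:.1f}%")
```

The size of the reduction here reflects only the invented numbers and says nothing about the 13% figure reported in the paper.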

Type of document

type::text::journal::journal article::research article

Language

Anglų / English (en)

Coverage Spatial

Lietuva / Lithuania (LT)

Owning collection

Universiteto mokslo publikacijos / University Research Publications

ISSN (of the container)

0868-4952

WOS

WOS:000226037500009

Other Identifier(s)

VDU02-000002083

Vytauto Didžiojo universitetas / Vytautas Magnus University

Journal	IF	AIF	AIF (min)	AIF (max)	Cat	AV	Year	Quartile
INFORMATICA	0.26	0.847	0.665	1.029	2	0.346	2004	Q4
