Cache-based statistical language models of English and highly inflected Lithuanian

Vaičiūnas, Airenas; Raškinis, Gailius

Use this URL to cite this publication: https://hdl.handle.net/20.500.12259/55742

Type of publication

Article in Web of Science and Scopus database (S1)

Author(s)


Vaičiūnas, Airenas	Vytauto Didžiojo universitetas / Vytautas Magnus University	LT
Raškinis, Gailius	Vytauto Didžiojo universitetas / Vytautas Magnus University	LT

Title

Cache-based statistical language models of English and highly inflected Lithuanian

Other Title

Statistiniai kalbos modeliai, naudojantys trumpalaikę atmintį, anglų ir lietuvių kalboms

Is part of

Informatica: international journal. Vilnius : Institute of mathematics and informatics, 2006, Vol. 17, no. 1

Date Issued

2006

Publisher

Vilnius : Institute of mathematics and informatics

Is Referenced by

Science Citation Index Expanded (Web of Science)

INSPEC

Zentralblatt MATH (zbMATH)

Scopus

Extent

p. 111-124

URI

https://www.mii.lt/informatica/pdf/INFO617.pdf
https://hdl.handle.net/20.500.12259/55742

Field of Science

Informatika / Inform...

Keywords

Language models

N-grams

Cache models

Dynamic interpolation...

Perplexity reduction

Inflected language

Free word order langu...

Lithuanian

Abstract

This paper investigates a variety of statistical cache-based language models built on three corpora: English, Lithuanian, and Lithuanian base forms. It studies how the cache size, the type of decay function (including custom corpus-derived functions), and the interpolation technique (static vs. dynamic) affect the perplexity of a language model. The best results are achieved by models with three components, a standard 3-gram, a decaying cache 1-gram, and a decaying cache 2-gram, joined by linear interpolation with dynamic weight updates. Such a model improved perplexity by up to 36% for Lithuanian words and up to 43% for Lithuanian word base forms relative to the 3-gram baseline. The best English language model improved perplexity by up to 16%. This suggests that cache-based modeling is of greater utility for free-word-order, highly inflected languages.
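
The three-component model described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the exponential decay function, the EM-style weight update, the starting weights, and the uniform stand-in for a trained 3-gram are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import math

def cache_unigram(history, word, decay=0.99):
    """Decaying cache 1-gram: a word seen d positions ago gets weight decay**d.
    Exponential decay is an assumption; the paper compares several decay functions."""
    total = match = 0.0
    for distance, past in enumerate(reversed(history)):
        weight = decay ** distance
        total += weight
        if past == word:
            match += weight
    return match / total if total > 0 else 0.0

def cache_bigram(history, prev, word, decay=0.99):
    """Decaying cache 2-gram: P(word | prev) estimated from the cache only."""
    total = match = 0.0
    n = len(history)
    for i in range(1, n):
        if history[i - 1] == prev:
            weight = decay ** (n - 1 - i)
            total += weight
            if history[i] == word:
                match += weight
    return match / total if total > 0 else 0.0

def update_weights(weights, probs, rate=0.05):
    """Assumed dynamic update: move each weight toward the share of the
    mixture probability that its component contributed for the last word."""
    mix = sum(w * p for w, p in zip(weights, probs))
    posteriors = [w * p / mix for w, p in zip(weights, probs)]
    return [(1 - rate) * w + rate * q for w, q in zip(weights, posteriors)]

def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability per word."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

def evaluate(text, base_prob, weights, dynamic=True):
    """Score a text with linear interpolation of base model and two caches.
    base_prob must return a strictly positive probability for every word."""
    weights = list(weights)
    history, scored = [], []
    for word in text:
        prev = history[-1] if history else None
        probs = [
            base_prob(history, word),
            cache_unigram(history, word),
            cache_bigram(history, prev, word),
        ]
        scored.append(sum(w * p for w, p in zip(weights, probs)))
        if dynamic:
            weights = update_weights(weights, probs)
        history.append(word)
    return perplexity(scored), weights

# Toy usage: a uniform model over 10 words stands in for a trained 3-gram.
def uniform_base(history, word):
    return 1.0 / 10

text = "a b a b a b a b a b a b a b a b a b a b".split()
ppl_base, _ = evaluate(text, uniform_base, (1.0, 0.0, 0.0), dynamic=False)
ppl_cache, final_weights = evaluate(text, uniform_base, (0.8, 0.1, 0.1))
improvement = (ppl_base - ppl_cache) / ppl_base
print(f"baseline {ppl_base:.2f}, with cache {ppl_cache:.2f}, improvement {improvement:.0%}")
```

The relative improvement computed on the last lines has the same form as the percentages quoted in the abstract, but the toy numbers say nothing about the paper's results; the highly repetitive test text is chosen only to make the cache effect visible.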

Type of document

type::text::journal::journal article::research article

Language

English (en)

Coverage Spatial

Lithuania (LT)
