Feature exploration for authorship attribution of Lithuanian parliamentary speeches

Kapočiūtė-Dzikienė, Jurgita; Utka, Andrius; Šarkutė, Ligita

doi:10.1007/978-3-319-10816-2

Use this URL to cite this publication: https://hdl.handle.net/20.500.12259/56634

Type of publication

Article in conference proceedings in Scopus database (P1a2)

Author(s)

Author	Affiliation	Country
Kapočiūtė-Dzikienė, Jurgita	Department of Applied Informatics	LT
Utka, Andrius	Department of Lithuanian Studies	LT
Šarkutė, Ligita	Kaunas University of Technology	LT

Title

Feature exploration for authorship attribution of Lithuanian parliamentary speeches

Is part of

Text, speech and dialogue : 17th international conference, TSD 2014, Brno, Czech Republic, September 8-12, 2014 : proceedings. New York : Springer, 2014

Date Issued

2014

Publisher

New York : Springer, 2014

Extent

p. 93-100

URI

https://hdl.handle.net/20.500.12259/56634

DOI

10.1007/978-3-319-10816-2

Abstract (en)

This paper reports the first authorship attribution results for the Lithuanian language obtained with automatic computational methods. Using supervised machine learning techniques, we experimentally investigated the influence of different feature types (lexical, character, and syntactic), focusing on a few authors within three datasets containing transcripts of parliamentary speeches and debates. To keep interfering factors to a minimum, all datasets were composed by selecting candidates who hold the same political views (avoiding ideology-based classification) and served in overlapping parliamentary terms (avoiding topic classification). Experiments revealed that content-based features are more useful than function words or part-of-speech tags; moreover, lemma n-grams (sometimes used in concatenation with morphological information) outperform word or document-level character n-grams. Because Lithuanian is highly inflective and rich in both morphology and vocabulary, and because the speeches use normative language, morphological tools proved especially helpful.
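
The n-gram feature types named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names and the English toy input are hypothetical, and the lemma variant would additionally require a Lithuanian morphological analyzer, which is not shown here (the same function would simply receive lemmas instead of surface words).

```python
from collections import Counter


def token_ngrams(tokens, n):
    """Count contiguous n-grams over a list of tokens (words or lemmas)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n):
    """Count character n-grams over the whole document, spaces included."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Hypothetical toy input; real experiments would use speech transcripts.
speech = "the committee approved the budget"
word_bigrams = token_ngrams(speech.split(), 2)
char_trigrams = char_ngrams(speech, 3)
```

Counts like these would typically be turned into one feature vector per speech and passed to a supervised classifier; the abstract does not specify that step, so it is left out.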