Authorship attribution and author profiling of Lithuanian literary texts

Kapočiūtė-Dzikienė, Jurgita; Utka, Andrius; Šarkutė, Ligita

Use this URL to cite this publication: https://hdl.handle.net/20.500.12259/30687

Type of publication

Article in conference proceedings in Scopus database (P1a2)

Author(s)

Author	Affiliation
Kapočiūtė-Dzikienė, Jurgita	Department of Applied Informatics	LT
Utka, Andrius	Department of Lithuanian Studies	LT
Šarkutė, Ligita	Kaunas University of Technology	LT

Title

Authorship attribution and author profiling of Lithuanian literary texts

[en]

Is part of

RANLP 2015 : 10th international conference on recent advances in natural language processing, BSNLP 2015 : 5th workshop on Balto-Slavic natural language processing, 10–11 September 2015, Hissar, Bulgaria : proceedings. Shoumen, Bulgaria : INCOMA Ltd., 2015

Date Issued

2015

Publisher

Shoumen, Bulgaria : INCOMA Ltd., 2015

Is Referenced by

Scopus

Extent

p. 96-105

URI

http://bsnlp-2015.cs.helsinki.fi/bsnlp2015-book.pdf
https://eltalpykla.vdu.lt/1/30687
https://hdl.handle.net/20.500.12259/30687

Abstract (en)

In this work we address authorship attribution and author profiling (focusing on the age and gender dimensions) for the Lithuanian language. This paper reports the first results on literary texts, which we compare with results previously obtained on other functional styles and language types (i.e., parliamentary transcripts and forum posts). Using the Naïve Bayes Multinomial and Support Vector Machine methods, we investigated the impact of various stylistic, character, lexical, and morpho-syntactic features and their combinations; author set sizes of 3, 5, 10, 20, 50, and 100 candidate authors; and dataset sizes of 100, 300, 500, 1,000, 2,000, and 5,000 instances in each class. In the authorship attribution task with the maximum number of candidate authors, the highest accuracy, 89.2%, was achieved with the Naïve Bayes Multinomial method and document-level character tri-grams. In the author profiling task focusing on the age dimension, the highest accuracy, 78.3%, was achieved with the Support Vector Machine method and token lemmas. In the author profiling task focusing on the gender dimension, accuracy reached 100% with the Naïve Bayes Multinomial method and rather small datasets, where various lexical, morpho-syntactic, and character feature types performed very similarly.
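As a rough illustration of the best-performing authorship attribution setup named in the abstract (Naïve Bayes Multinomial over document-level character tri-grams), the following minimal sketch may help. It is not the authors' implementation: the class name `TrigramNaiveBayes`, the helper `char_trigrams`, the Laplace smoothing, and the toy training strings are all assumptions made for illustration, and the paper's preprocessing and evaluation protocol are not reproduced here.

```python
import math
from collections import Counter, defaultdict


def char_trigrams(text):
    """Return the overlapping character tri-grams of a document."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


class TrigramNaiveBayes:
    """Multinomial Naive Bayes over character tri-gram counts.

    Hypothetical helper class; add-one (Laplace) smoothing is an assumption.
    """

    def fit(self, texts, authors):
        self.counts = defaultdict(Counter)  # author -> tri-gram counts
        self.docs = Counter(authors)        # author -> number of documents
        for text, author in zip(texts, authors):
            self.counts[author].update(char_trigrams(text))
        # Vocabulary: every tri-gram seen in any training document.
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        grams = [g for g in char_trigrams(text) if g in self.vocab]
        total_docs = sum(self.docs.values())
        best_author, best_score = None, -math.inf
        for author, counts in self.counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.docs[author] / total_docs)
            for g in grams:
                score += math.log((counts[g] + 1) / denom)
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Toy usage with invented strings standing in for literary texts.
model = TrigramNaiveBayes().fit(
    ["aaaa aaaa aaab", "bbbb bbbb bbba"],
    ["author_A", "author_B"],
)
predicted = model.predict("aaaa")
```

Character tri-grams need no tokenizer or lemmatizer, which may be one reason they are attractive for a morphologically rich language such as Lithuanian; the token-lemma features reported for the age dimension would instead require a morphological analyzer.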