Feature exploration for authorship attribution of Lithuanian parliamentary speeches
Author | Affiliation | |
---|---|---|
LT | ||
LT | ||
Kauno technologijos universitetas | LT |
Date |
---|
2014 |
This paper reports the first authorship attribution results for the Lithuanian language obtained with automatic computational methods. Using supervised machine learning techniques, we experimentally investigated the influence of different feature types (lexical, character, and syntactic), focusing on a few authors within three datasets containing transcripts of parliamentary speeches and debates. To keep interfering factors to a minimum, all datasets were composed by selecting candidates who hold the same political views (avoiding ideology-based classification) and who served in overlapping parliamentary terms (avoiding a topic classification task). Experiments revealed that content-based features are more useful than function words or part-of-speech tags; moreover, lemma n-grams (sometimes used in concatenation with morphological information) outperform word n-grams and document-level character n-grams. Because Lithuanian is highly inflective and rich in both morphology and vocabulary, and because we were dealing with normative language, morphological tools were maximally helpful.
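To make the compared feature types concrete, the following is a minimal sketch of how word, lemma, and character n-grams might be counted for one document. It is an illustration only, not the authors' pipeline: the function names, the sample text, and the tiny `TOY_LEMMAS` lookup table are assumptions introduced here, and a real setup would obtain lemmas from a Lithuanian morphological analyzer.

```python
from collections import Counter

# Hypothetical toy lemma table (an assumption for illustration only);
# a real experiment would use a Lithuanian morphological analyzer.
TOY_LEMMAS = {
    "kalbos": "kalba",
    "kalbą": "kalba",
    "seimo": "seimas",
    "seime": "seimas",
}


def word_ngrams(tokens, n):
    """Count contiguous n-grams over a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n):
    """Count contiguous character n-grams over the whole document string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def lemma_ngrams(tokens, n, lemmas=TOY_LEMMAS):
    """Map each token to its lemma (if known), then count n-grams."""
    lemmatized = [lemmas.get(tok, tok) for tok in tokens]
    return word_ngrams(lemmatized, n)


text = "seimo kalbos seime kalbą"  # made-up sample, not from the datasets
tokens = text.split()

word_uni = word_ngrams(tokens, 1)    # four distinct inflected forms
lemma_uni = lemma_ngrams(tokens, 1)  # collapses to two lemmas
char_tri = char_ngrams(text, 3)      # document-level character trigrams
```

In this toy sample the four inflected word forms collapse into two lemmas, each counted twice. This suggests one plausible reason lemma-based features could be less sparse than raw word features in a highly inflective language.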
Journal | CiteScore | SNIP | SJR | Year | Quartile |
---|---|---|---|---|---|
Lecture Notes in Computer Science | 1.5 | 0.756 | 0.354 | 2014 | Q2 |