Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12259/51286
Type of publication: Conference theses in non-peer-reviewed publications (T2)
Field of Science: Informatics (N009)
Author(s): Ciganaitė, Greta; Mackutė-Varoneckienė, Aušra; Krilavičius, Tomas
Title: Text document clustering using language-independent techniques
Is part of: Data analysis methods for software systems : 7th international workshop, December 3-5, 2015, Druskininkai, Lithuania : [abstracts book]. Vilnius : Vilnius university, 2015
Extent: p. 15-16
Date: 2015
Keywords: Text document clustering; Data mining; Document clustering
ISBN: 9789986680581
Abstract: Clustering is a technique for grouping objects by their similarity. Document clustering is used for topic extraction, filtering and fast information retrieval. However, because of the high dimensionality of the data, clustering documents is slow and computationally intensive. For highly inflective languages, such as Lithuanian, it becomes even more problematic. We investigate language-independent document clustering for the Lithuanian and Azeri languages. A bag-of-words (BOW) model is used for document representation. We propose four feature selection models based on term frequencies in the corpora. The best results were achieved by the model in which features that occur fewer than a_min times (or more than a_max times) in the whole corpus are eliminated from the feature set as non-informative. The importance of the features in the resulting subset is evaluated by term frequency-inverse document frequency (TFIDF) weights. Results show that using only 2% to 5.6% of the features in the initial feature set is enough to obtain the best clustering results. Hierarchical and flat clustering algorithms based on document similarity were applied, and the precision of the results was evaluated. Cosine distance was selected as the best distance measure. Many experiments were also made with the well-known Euclidean distance, but this measure is inappropriate because of the sparsity of the feature matrix. The best clustering results were reached with the spherical k-means algorithm (F-score of approximately 0.8 for both languages).
Internet: http://www.mii.lt/datamss/files/liks_mii_drusk_2015_abstract_last_1.pdf
Affiliation(s): Baltijos pažangių technologijų institutas
Informatikos fakultetas
Taikomosios informatikos katedra
Vytauto Didžiojo universitetas
Appears in Collections: Universiteto mokslo publikacijos / University Research Publications
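
The pipeline outlined in the abstract (frequency-based feature selection, TFIDF weighting, and spherical k-means with cosine similarity) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy English corpus, the threshold values, the whitespace tokenization, the TFIDF variant (raw count times log(N/df)), and the naive seeding with the first k documents are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def select_features(docs, a_min, a_max):
    """Keep terms whose total corpus count lies in [a_min, a_max]."""
    totals = Counter(tok for doc in docs for tok in doc)
    return sorted(t for t, c in totals.items() if a_min <= c <= a_max)


def tfidf_unit_vectors(docs, vocab):
    """TFIDF weights over the selected vocabulary, scaled to unit length."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec))
        vectors.append([w / norm for w in vec] if norm else vec)
    return vectors


def spherical_kmeans(vectors, k, iterations=20):
    """Assign by cosine similarity; rescale each centroid to unit length."""
    # Assumption: seed with the first k documents (illustration only).
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # For unit vectors, the dot product equals the cosine similarity.
        labels = [
            max(range(k),
                key=lambda j: sum(a * b for a, b in zip(v, centroids[j])))
            for v in vectors
        ]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if not members:
                continue  # leave an empty cluster's centroid unchanged
            summed = [sum(col) for col in zip(*members)]
            norm = math.sqrt(sum(w * w for w in summed))
            if norm:
                centroids[j] = [w / norm for w in summed]
    return labels


# Toy corpus (hypothetical); tokens are obtained by whitespace splitting.
docs = [
    "the cat chased the mouse".split(),
    "the bank raised the loan rate".split(),
    "the cat and the mouse slept".split(),
    "the loan rate at the bank fell".split(),
]
vocab = select_features(docs, a_min=2, a_max=4)  # drops "the" and rare terms
vectors = tfidf_unit_vectors(docs, vocab)
labels = spherical_kmeans(vectors, k=2)
print(vocab)   # ['bank', 'cat', 'loan', 'mouse', 'rate']
print(labels)  # [0, 1, 0, 1]
```

Because every document vector is scaled to unit length, ranking by dot product is the same as ranking by cosine similarity, which is why the sketch never computes a Euclidean distance. No stemming or lemmatization is applied, in keeping with the language-independent setting the abstract describes.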
