Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12259/44592
Type of publication: Conference theses in non-peer-reviewed publications (T2)
Field of Science: Mathematics (N001)
Author(s): Mackutė-Varoneckienė, Aušra;Krilavičius, Tomas
Title: Informative feature selection for document clustering
Is part of: Data analysis methods for software systems : 6th international workshop, Druskininkai, Lithuania, December 4-6, 2014 : [abstracts book]. Vilnius : Vilnius University
Extent: p. 37-37
Date: 2014
ISBN: 9789986680505
Abstract: Document clustering combines a number of data mining techniques, and all of them must be well tuned to achieve good clustering results. Full-text document clustering is challenging mostly because of data dimensionality. A corpus may contain hundreds of thousands of terms, while a particular document may contain only hundreds. Document-term matrices are therefore usually very sparse, owing to the high diversity of terms across documents. One of the important tasks is to find the feature subset that includes the most informative features. Efficient feature selection helps to cope with data dimensionality. We investigate a number of thresholding feature selection methods based on document frequency (DF), term frequency based document frequency (TFDF), term frequency-inverse document frequency (TFIDF), term strength (TS) and term contribution (TC), and their performance for the Lithuanian, Russian and Azeri (Azerbaijani) languages. F-score, purity and entropy are used to evaluate the performance of the different feature selection methods. The results show that feature selection based on TFDF performs better than the other feature selection methods for all languages, and that selecting up to 7% of the full feature set as informative features is enough to obtain the best clustering results. In the future, we will investigate the effectiveness of the Principal Component Analysis (PCA) technique for feature selection, and possible combinations of the feature selection methods commonly used in document clustering with other feature reduction techniques. Acknowledgement: This research was funded by a grant (No. VP1-3.1-ŠMM-10-V-02-025) from the ESFA.
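Illustration: the thresholding idea and two of the evaluation measures named in the abstract can be sketched as below. This is only a minimal sketch under stated assumptions, not the authors' implementation: it covers the simplest criterion (DF), since the abstract does not define TFDF, TS or TC; the function names, the tie-breaking rule and the default keep_fraction of 0.07 (echoing the reported 7%) are hypothetical choices; purity and entropy follow their common textbook definitions.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): each document counts a term once
    return df


def select_by_df(docs, keep_fraction=0.07):
    """Keep the top `keep_fraction` of terms ranked by document frequency.

    Assumption: ties are broken alphabetically to keep the result deterministic.
    """
    df = document_frequency(docs)
    k = max(1, math.ceil(keep_fraction * len(df)))
    ranked = sorted(df, key=lambda term: (-df[term], term))
    return set(ranked[:k])


def _class_counts_per_cluster(clusters, labels):
    """Map each cluster id to a Counter of the true class labels inside it."""
    counts = {}
    for cluster, label in zip(clusters, labels):
        counts.setdefault(cluster, Counter())[label] += 1
    return counts


def purity(clusters, labels):
    """Fraction of documents belonging to the majority class of their cluster."""
    counts = _class_counts_per_cluster(clusters, labels)
    return sum(max(c.values()) for c in counts.values()) / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of the per-cluster class entropy, in bits."""
    counts = _class_counts_per_cluster(clusters, labels)
    n = len(labels)
    total = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((v / size) * math.log2(v / size) for v in c.values())
        total += (size / n) * h
    return total


if __name__ == "__main__":
    # Toy tokenized corpus; real corpora would be far larger and sparser.
    docs = [["a", "b"], ["a", "c"], ["a", "b", "d"]]
    print(select_by_df(docs, keep_fraction=0.5))
    print(purity([0, 0, 1, 1], ["x", "x", "x", "y"]))
    print(entropy([0, 0, 1, 1], ["x", "x", "x", "y"]))
```

Higher purity and lower entropy indicate clusters that agree better with the reference classes. Swapping the ranking key in select_by_df for another per-term score would give the other thresholding variants.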
Internet: https://hdl.handle.net/20.500.12259/44592
Affiliation(s): Baltijos pažangių technologijų institutas, Vilnius
Informatikos fakultetas
Taikomosios informatikos katedra
Vytauto Didžiojo universitetas
Appears in Collections: University Research Publications
