Informative feature selection for document clustering

Mackutė-Varoneckienė, Aušra; Krilavičius, Tomas

Use this URL to cite this publication: https://hdl.handle.net/20.500.12259/44592

Type of publication

Konferencijų tezės nerecenzuojamame leidinyje / Conference theses in non-peer-reviewed publication (T2)

Author(s)

Author	Affiliation	Country
Mackutė-Varoneckienė, Aušra	Taikomosios informatikos katedra / Department of Applied Informatics	LT

Title

Informative feature selection for document clustering

[en]

Is part of

Data analysis methods for software systems : 6th international workshop, Druskininkai, Lithuania, December 4-6, 2014 : [abstracts book]. Vilnius : Vilnius University

Date Issued

2014

Publisher

Vilnius : Vilnius University

Extent

p. 37-37

URI

https://hdl.handle.net/20.500.12259/44592

Abstract (en)

Document clustering combines a number of data mining techniques, and all of them must be well tuned to achieve good clustering results. Full-text document clustering is challenging mainly because of the dimensionality of the data: a corpus may contain hundreds of thousands of terms, while a particular document may contain only hundreds. Usually, document-term matrices are very sparse due to the high diversity of terms across documents. One important task is to find the feature subset that contains the most informative features; efficient feature selection helps to cope with the data dimensionality. We investigate a number of thresholding feature selection methods based on document frequency (DF), term frequency based document frequency (TFDF), term frequency-inverse document frequency (TFIDF), term strength (TS) and term contribution (TC), and compare their performance for the Lithuanian, Russian and Azeri (Azerbaijani) languages. F-score, purity and entropy are used to evaluate the performance of the different feature selection methods. The results show that feature selection based on TFDF performs better than the other feature selection methods for all three languages, and that selecting up to 7% of the full feature set as informative features is enough to obtain the best clustering results. In future work we will investigate the effectiveness of Principal Component Analysis (PCA) for feature selection, as well as possible combinations of the feature selection methods commonly used in document clustering with other feature reduction techniques.

Acknowledgement: This research was funded by a grant (No. VP1-3.1-ŠMM-10-V-02-025) from the ESFA.
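Illustrative note (not part of the original abstract): the sketch below shows what document-frequency (DF) thresholding and two of the named evaluation measures, purity and entropy, could look like on tokenized documents. It is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the "keep the top fraction of terms by score" rule, the alphabetical tie-breaking and the unnormalized base-2 entropy are all assumptions made for illustration; the abstract does not specify these details.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def select_top_fraction(scores, fraction):
    """Keep the highest-scoring terms (assumed rule; ties broken alphabetically)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return set(ranked[:k])


def _class_counts_per_cluster(clusters, labels):
    """Group the true class labels by the cluster each document was assigned to."""
    counts = {}
    for cluster, label in zip(clusters, labels):
        counts.setdefault(cluster, Counter())[label] += 1
    return counts


def purity(clusters, labels):
    """Share of documents that belong to the majority class of their cluster."""
    counts = _class_counts_per_cluster(clusters, labels)
    return sum(max(c.values()) for c in counts.values()) / len(labels)


def entropy(clusters, labels):
    """Size-weighted mean class entropy per cluster (base 2, unnormalized)."""
    counts = _class_counts_per_cluster(clusters, labels)
    total = len(labels)
    result = 0.0
    for c in counts.values():
        n = sum(c.values())
        h = -sum((v / n) * math.log2(v / n) for v in c.values())
        result += (n / total) * h
    return result


# Toy data, invented for illustration only.
docs = [["cat", "dog", "pet"], ["dog", "pet"], ["stock", "market"], ["market", "pet"]]
df = document_frequency(docs)
selected = select_top_fraction(df, 0.5)  # keeps 2 of the 5 distinct terms
```

The other scores named in the abstract (TFDF, TFIDF, TS, TC) could be passed to the same `select_top_fraction` helper once computed; their exact definitions are not given in this record, so they are not sketched here.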

Type of document

type::text::conference output::conference proceedings::conference paper

Language

Anglų / English (en)

Coverage Spatial

Lietuva / Lithuania (LT)

ISBN (of the container)

9789986680505

Other Identifier(s)

VDU02-000017069