Informative feature selection for document clustering
Date: 2014
Document clustering incorporates a number of data mining techniques, and all of them must be well attuned to achieve good clustering results. Full-text document clustering is challenging mainly because of data dimensionality: a corpus may contain hundreds of thousands of distinct terms, whereas a particular document may contain only hundreds. As a result, document-term matrices are usually very sparse, since the terms used differ widely from document to document. One of the important tasks is to find the feature subset that contains the most informative features; efficient feature selection helps to cope with the data dimensionality. We investigate a number of thresholding feature selection methods based on document frequency (DF), term frequency based document frequency (TFDF), term frequency-inverse document frequency (TFIDF), term strength (TS) and term contribution (TC), and compare their performance for the Lithuanian, Russian and Azeri (Azerbaijani) languages. F-score, purity and entropy are used to evaluate the performance of the different feature selection methods. The results show that feature selection based on TFDF performs better than the other methods for all three languages, and that selecting up to 7% of the full feature set as informative features is enough to obtain the best clustering results. In future work we will investigate the effectiveness of Principal Component Analysis (PCA) for feature selection, as well as possible combinations of the feature selection methods commonly used in document clustering with other feature reduction techniques.

Acknowledgement

This research was funded by a grant (No. VP1-3.1-ŠMM-10-V-02-025) from the ESFA.
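As an illustration of the thresholding idea described in the abstract, the following minimal Python sketch ranks terms by document frequency (DF), keeps only a top fraction of them, and computes cluster purity. It is not the implementation used in the study: the function names, the toy corpus, the alphabetical tie-breaking and the use of a fraction as the threshold are assumptions made for this example. The default of 0.07 merely echoes the 7% figure, which the abstract reports for TFDF rather than for DF.

```python
from collections import Counter
import math


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set(): a document counts each term at most once
    return df


def select_by_df(docs, keep_fraction=0.07):
    """Keep the top `keep_fraction` of terms ranked by document frequency.

    Assumption for this sketch: ties are broken alphabetically so that the
    result is deterministic, and at least one term is always kept.
    """
    df = document_frequency(docs)
    n_keep = max(1, math.ceil(keep_fraction * len(df)))
    ranked = sorted(df, key=lambda term: (-df[term], term))
    return ranked[:n_keep]


def purity(cluster_labels, class_labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    per_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        per_cluster.setdefault(cluster, Counter())[cls] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(class_labels)


# Hypothetical toy corpus: each document is a list of already tokenized terms.
docs = [
    ["cluster", "document", "term"],
    ["cluster", "feature", "term"],
    ["cluster", "language"],
    ["document", "cluster", "matrix"],
]
selected = select_by_df(docs, keep_fraction=0.5)

# Hypothetical cluster assignments and reference classes for four documents.
score = purity([0, 0, 1, 1], ["a", "a", "a", "b"])
```

With this toy corpus, six distinct terms are ranked and the three with the highest document frequency are kept. In a real pipeline the selected terms would define the columns of the reduced document-term matrix passed to the clustering algorithm, and purity would be computed against manually assigned document categories.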