

Clustering Documents with Maximal Substrings

File: LNBIP102_19.pdf (260.83 kB, Adobe PDF)

Title: Clustering Documents with Maximal Substrings
Authors: Masada, Tomonari / Takasu, Atsuhiro / Shibata, Yuichiro / Oguri, Kiyoshi
Issue Date: 20-May-2012
Publisher: Springer Verlag
Citation: Lecture Notes in Business Information Processing, 102, pp.19-34; 2012
Abstract: This paper presents experimental results showing that maximal substrings can serve as the elementary building blocks of documents in place of the words extracted by a state-of-the-art supervised word extraction method. A maximal substring is a substring whose number of occurrences decreases whenever even a single character is appended to its head or tail. The main advantage of maximal substrings is that they can be extracted efficiently in an unsupervised manner. We extract maximal substrings from a document set and represent each document as a bag of maximal substrings. We also obtain a bag-of-words representation by applying a state-of-the-art supervised word extraction method to the same document set. We then apply the same document clustering method to both representations and compare the quality of the two clustering results. To avoid overfitting, we adopt Bayesian document clustering based on Dirichlet compound multinomials. Our experiment shows that the clustering quality achieved with maximal substrings is good enough for them to be used in place of the words extracted by supervised word extraction.
Keywords: Bayesian modeling / Document clustering / Maximal substring / Suffix array / Unsupervised method
URI: http://hdl.handle.net/10069/29352
ISSN: 1865-1348
DOI: 10.1007/978-3-642-29958-2_2
Rights: © 2012 Springer-Verlag. / The original publication is available at www.springerlink.com.
Type: Journal Article
Text Version: author
Appears in Collections: Articles in academic journal
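
The definition of a maximal substring given in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the keywords suggest that the paper relies on suffix arrays, whereas the code below simply enumerates substrings by brute force, which is only practical for tiny inputs. The function names, the `min_count` frequency threshold, and the `max_len` length cap are assumptions introduced for illustration and do not come from the paper.

```python
from collections import Counter


def substring_counts(docs, max_len):
    """Count every substring of length <= max_len over all documents."""
    counts = Counter()
    for doc in docs:
        n = len(doc)
        for i in range(n):
            for j in range(i + 1, min(n, i + max_len) + 1):
                counts[doc[i:j]] += 1
    return counts


def maximal_substrings(docs, min_count=2, max_len=10):
    """Return substrings whose occurrence count drops under any
    one-character extension at the head or the tail.

    min_count and max_len are illustrative assumptions, not part of
    the definition quoted in the abstract.
    """
    # Count one character beyond max_len so extensions are covered too.
    counts = substring_counts(docs, max_len + 1)
    alphabet = set("".join(docs))
    result = set()
    for s, c in counts.items():
        if c < min_count or len(s) > max_len:
            continue
        if all(
            counts.get(ch + s, 0) < c and counts.get(s + ch, 0) < c
            for ch in alphabet
        ):
            result.add(s)
    return result


def bag_of_maximal_substrings(doc, vocab):
    """Represent one document as a bag (multiset) of maximal substrings."""
    bag = Counter()
    for s in vocab:
        # Count occurrences of s in doc, allowing overlaps.
        n = sum(1 for i in range(len(doc) - len(s) + 1) if doc.startswith(s, i))
        if n:
            bag[s] = n
    return bag


docs = ["abcabc", "zabc"]
vocab = maximal_substrings(docs)
bags = [bag_of_maximal_substrings(d, vocab) for d in docs]
```

In this toy input, "ab" is not maximal because appending "c" to its tail leaves the occurrence count unchanged, while "abc" is maximal because every one-character extension occurs less often. The resulting bags could then be passed to any clustering method that accepts count vectors.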


All items in NAOSITE are protected by copyright, with all rights reserved.

