|
Please use this identifier to cite or link to this item:
http://hdl.handle.net/2320/9820
|
| Title: | Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints |
| Authors: | Darányi, Sándor; Wittek, Peter; Dobreva, Milena |
| Department: | University of Borås. Swedish School of Library and Information Science Other |
| Issue Date: | 2011 |
| Journal Title: | International Journal on Digital Libraries |
| ISSN: | 1432-5012 |
| Publication type: | article, peer reviewed scientific |
| Keywords: | Kernel methods; text classification; support vector machines; semantic enrichment; Hilbert spaces; digital libraries; text categorization; machine learning; analogical information representation; wavelet analysis |
| Subject Category: | Subject categories::Social Sciences::Other Social Sciences not elsewhere specified::Other Social Sciences not elsewhere specified; Subject categories::Engineering and Technology::Computer and Information Science::Computer Science::Computer Science; Subject categories::Humanities::Specific Languages::General Language Studies and Linguistics::Language Technology (Computational Linguistics) |
| Strategic Research Area: | Library and information science |
| Abstract: | Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by using standard test collections. In this paper we present a pilot experiment of replacing such test collections by a set of 6000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels on classification reconstruction from abstracts and vice versa from full-text documents, the latter outcome due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of the persistent archives and the less studied one (compared to representations of digital objects and collections). Our research is an initial step in this direction developing further the methodological approach and demonstrating that text categorisation can be applied to analyse the thematic coverage in digital repositories. |
| URI: | http://hdl.handle.net/2320/9820 |
| Appears in Collections: | Artiklar och rapporter / Articles and reports (BHS)
|
|