University of Borås

Please use this identifier to cite or link to this item: http://hdl.handle.net/2320/9820

Files in This Item:

File: ijdl11using-0519.pdf
Size: 329.15 kB
Format: Adobe PDF
Title: Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints
Authors: Darányi, Sándor
Wittek, Peter
Dobreva, Milena
Department: University of Borås. Swedish School of Library and Information Science
Other
Issue Date: 2011
Journal Title: International Journal on Digital Libraries
ISSN: 1432-5012
Publication type: article, peer reviewed scientific
Keywords: Kernel methods
text classification
support vector machines
semantic enrichment
Hilbert spaces
digital libraries
text categorization
machine learning
analogical information representation
wavelet analysis
Subject Category: Subject categories::Social Sciences::Other Social Sciences not elsewhere specified::Other Social Sciences not elsewhere specified
Subject categories::Engineering and Technology::Computer and Information Science::Computer Science::Computer Science
Subject categories::Humanities::Specific Languages::General Language Studies and Linguistics::Language Technology (Computational Linguistics)
Strategic Research Area: Library and information science
Abstract: Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by using standard test collections. In this paper we present a pilot experiment of replacing such test collections by a set of 6000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels on classification reconstruction from abstracts and vice versa from full-text documents, the latter outcome due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of the persistent archives and the less studied one (compared to representations of digital objects and collections). Our research is an initial step in this direction developing further the methodological approach and demonstrating that text categorisation can be applied to analyse the thematic coverage in digital repositories.
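Illustration (not part of the record): the abstract contrasts wavelet-based kernels with traditional kernels for support vector machines. The sketch below is one simple way such a kernel could be built, and it is not the authors' implementation. It assumes a Haar wavelet, a toy four-term vocabulary, and a term ordering in which semantically related terms sit next to each other (for example, an ordering derived from a lexical resource). All function names are hypothetical. Documents are term-weight vectors; the kernel is the inner product of their coarse Haar approximation coefficients, so two documents that use neighbouring terms can be similar even when they share no exact term.

```python
import math


def haar_step(values):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(values[i] + values[i + 1]) / s for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / s for i in range(0, len(values), 2)]
    return approx, detail


def haar_approximation(values, levels):
    """Approximation coefficients after `levels` Haar steps."""
    if len(values) % (2 ** levels) != 0:
        raise ValueError("vector length must be divisible by 2**levels")
    approx = list(values)
    for _ in range(levels):
        approx, _detail = haar_step(approx)  # detail coefficients are discarded
    return approx


def linear_kernel(u, v):
    """Plain bag-of-words inner product (the 'traditional' baseline here)."""
    return sum(a * b for a, b in zip(u, v))


def haar_kernel(u, v, levels=1):
    """Inner product of coarse Haar approximations of two term-weight vectors."""
    return linear_kernel(haar_approximation(u, levels),
                         haar_approximation(v, levels))


# Hypothetical vocabulary, ASSUMED to be ordered so that related terms are adjacent.
vocabulary = ["library", "archive", "kernel", "wavelet"]
doc_a = [1.0, 0.0, 0.0, 0.0]  # mentions only "library"
doc_b = [0.0, 1.0, 0.0, 0.0]  # mentions only "archive"
doc_c = [0.0, 0.0, 0.0, 1.0]  # mentions only "wavelet"

print(linear_kernel(doc_a, doc_b))  # no shared term
print(haar_kernel(doc_a, doc_b))    # neighbouring terms fall in the same Haar block
print(haar_kernel(doc_a, doc_c))    # distant terms stay dissimilar
```

A Gram matrix filled with `haar_kernel` values could then be passed to any SVM implementation that accepts precomputed kernels. Whether this helps depends heavily on the quality of the term ordering, which is an assumption of the sketch rather than something stated in the abstract.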
URI: http://hdl.handle.net/2320/9820
Appears in Collections:Artiklar och rapporter / Articles and reports (BHS)

All items in Borås Academic Digital Archive are protected by copyright, with all rights reserved.

 
