|
Large Scale Learning with String Kernels
In applications of bioinformatics and text processing, such as splice site recognition and spam detection, large amounts of training sequences are available and needed to achieve sufficiently high prediction performance on classification or regression tasks. Although kernel-based methods such as SVMs often achieve state-of-the-art results, training and evaluation times may be prohibitively large. When single kernel computation time is already linear (w.r.t. the input sequences) it seems difficult to achieve further speed ups. In this work we describe an efficient technique for computing linear combinations of string kernels using sparse data structures such as explicit maps, sorted arrays and suffix tries, trees or arrays [5]. As computing linear combinations of kernels make up the dominant part of SVM training and evaluation, speeding up their computation is essential. Considering the recently proposed and successfully used linear time string kernels, like the Spectrum kernel [2] and the Weighted Spectrum kernel [3] we show that one can accelerate SVM training by factors of 7 and 60 times, respectively, while requiring considerably less memory. Our method allows us to train string kernel SVMs on sets as large as 10 million sequences [4]. Moreover, using these techniques the evaluation on new sequences is often several thousand times faster, allowing us to apply the classifiers on genome-sized data sets with seven billion test examples [6]. The presented algorithms are implemented in our Machine Learning toolbox SHOGUN for which the source code is publicly available at [1] D. Gusfield. Algorithms on strings, trees, and sequences. Cambridge University Press, 1997. [2] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing, page...
Video Length: 0
Date Found: October 13, 2010
Date Produced: December 29, 2007
View Count: 2
|