Large Scale Semi-supervised Linear SVM with Stochastic Gradient Descent


Journal of Computational Information Systems 9: 15 (2013)

Xin ZHOU, Conghui ZHU, Sheng LI, Mo YU
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China

Project supported by the National High Technology Research and Development Program of China (863 Program) (No. 2011AA01A207) and the National Nature Science Foundation of China (No ) and (No ).
Corresponding author. Email address: zhouxin@mtlab.hit.edu.cn (Xin ZHOU).
Copyright 2013 Binary Information Press. DOI: /jcisP0590. August 1, 2013.

Abstract

Semi-supervised learning uses a large collection of unlabeled data together with a few labeled examples to improve generalization performance, and it has proved useful in real-world applications. The bottleneck of existing semi-supervised approaches is their excessive training time on large-scale unlabeled data. In this article we introduce a novel method for semi-supervised linear support vector machines based on averaged stochastic gradient descent, which significantly improves the training speed of S3VM over existing toolkits such as SVMlight-TSVM, CCCP-TSVM and SVMlin. We evaluate our method on text categorization and sentiment classification; the results indicate its efficiency on large-scale semi-supervised tasks.

Keywords: Semi-supervised Learning; Stochastic Gradient Descent; Support Vector Machines

1 Introduction

Supervised learning is employed in many real-world tasks, such as text categorization, web page classification, spam mail detection and image processing. However, its performance depends on the amount of labeled data. In some areas, such as image processing, biology and natural language processing, annotation is extremely laborious, while unlabeled data is relatively easy to obtain. Semi-supervised learning was introduced to exploit this additional unlabeled data. Its goal is to improve generalization performance by using labeled data together with a large amount of unlabeled data. The methods currently in use include EM with generative mixture models, self-training, co-training, transductive support vector machines (TSVMs) and graph-based methods.

Generally, the bottleneck of all these approaches is their excessive training time, which cannot meet the requirements of training on large-scale data. So far little work has used ultra-large-scale unlabeled data, because the training time is unacceptable.

This article aims at reducing the training time of semi-supervised learning, taking TSVMs as an example. Semi-supervised support vector machines (S3VMs), also called TSVMs [1], improve the generalization accuracy of SVMs [2] by using unlabeled data. Because stochastic gradient descent (SGD) performs very well on large-scale supervised learning problems [3], we introduce a large-scale training method for semi-supervised support vector machines based on it. Our method reduces training time significantly and achieves competitive accuracy.

2 Related Work

S3VM [1] is an extension of standard support vector machines to unlabeled data. The goal is to find labels for the unlabeled data such that a linear boundary with maximum margin separates both the original labeled data and the (now labeled) unlabeled data. However, finding an exact transductive SVM solution is an NP-hard problem, so most effort has focused on efficient approximation algorithms.

The early method of [4] cannot handle more than a few hundred unlabeled points. Joachims [5] proposed a heuristic optimization algorithm, implemented in SVMlight-TSVM, which improves the objective function by iteratively switching the labels of two unlabeled points. However, the large number of iterations needed to reach the minimum makes it intractable for large-scale problems in practice. De Bie and Cristianini [6] cast the transductive learning problem as a semi-definite program (SDP). Xu and Schuurmans [7] present a similar multi-class SDP formulation, which yields a multi-class SVM for semi-supervised learning, but the computational cost is still high. The TSVM proposed by Chapelle and Zien [8] performs gradient search in the primal space. Its overall worst-case complexity is $O((L + U)^3)$, which is still not suitable for large-scale data sets. Collobert et al. [9] optimize the hard TSVM problem directly, using an approximate optimization procedure known as the concave-convex procedure (CCCP), which significantly reduces the training time. Sindhwani and Keerthi [10] proposed a fast algorithm for linear S3VMs that uses a multiple-switching trick; it is implemented in SVMlin and is suitable for large-scale text applications.

Stochastic gradient descent (SGD), also referred to as stochastic approximation [11], has shown great promise for large-scale learning. In the stochastic approximation literature, the averaging technique comes with strong theoretical guarantees [12]. Averaged stochastic gradient descent (ASGD) has been shown to reach a good enough solution in a single pass over the data [13].

3 Semi-supervised SVM with Stochastic Gradient Descent

3.1 Semi-supervised support vector machines

In semi-supervised learning, a training data set consists of a labeled set $\{(x_i, y_i) \mid 1 \le i \le L\}$, with $x_i \in \mathbb{R}^n$ and $y_i \in \{1, -1\}$, and an unlabeled set $\{x_i \mid L + 1 \le i \le L + U\}$.

SVMs have a decision function

$$f_\theta(x) = \omega \cdot \Phi(x) + b \tag{1}$$

where $\theta = (\omega, b)$ are the parameters of the model and $\Phi(\cdot)$ is the chosen feature map.

The semi-supervised SVM (S3VM), also called TSVM, was first introduced by Vapnik [1] and has been implemented by different algorithms. S3VM is based on the cluster assumption: examples in the same cluster should have the same class label, which is also the key to other successful semi-supervised learning methods. The assumption implies that classifiers should avoid placing their decision boundaries through high-density regions [8]. S3VM achieves this by making the margin (distance) between the examples and the decision boundary as large as possible. The idea is to find an SVM that separates the training set under constraints which force the unlabeled examples to be as far as possible from the margin. This is encoded in the following optimization problem:

$$\min_{\omega, b} \; \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{L} \xi_i + C^* \sum_{i=L+1}^{L+U} \xi_i$$

subject to

$$y_i f_\theta(x_i) \ge 1 - \xi_i, \quad i = 1, \dots, L$$
$$|f_\theta(x_i)| \ge 1 - \xi_i, \quad i = L + 1, \dots, L + U$$
$$\xi_i \ge 0, \quad i = 1, \dots, L + U$$

Let

$$l(y_i f_\theta(x_i)) = [1 - y_i f_\theta(x_i)]_+ = \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, L$$

and

$$l^*(|f_\theta(x_i)|) = [1 - |f_\theta(x_i)|]_+ = \xi_i, \quad \xi_i \ge 0, \quad i = L + 1, \dots, L + U$$

Then the problem above can be shown to be equivalent to

$$\min_{\omega, b} \; \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{L} l(y_i f_\theta(x_i)) + C^* \sum_{i=L+1}^{L+U} l^*(|f_\theta(x_i)|) \tag{2}$$

Here $l(\cdot)$ and $l^*(\cdot)$ are the loss functions of the S3VM. Following Joachims [5], they are typically chosen to be the hinge loss (3) and the symmetric hinge loss (4), respectively.

$$H_1(t) = \max(0, 1 - t) \tag{3}$$
$$H_2(|t|) = \max(0, 1 - |t|) \tag{4}$$

Chapelle and Zien [8] proposed a smooth version of the hinge loss (5) for unlabeled data. Collobert et al. [9] then used the Ramp Loss (6) for unlabeled examples and showed good performance.

$$S(t) = \exp(-3t^2) \tag{5}$$
$$R_s(t) = \min(1 - s, \max(0, 1 - t)) \tag{6}$$

Here we also use the Ramp Loss for unlabeled data and choose s=0.3, the same as Collobert et al. [9]. For labeled examples, however, we use the Logistic Loss (7), since it is suitable for large sparse data.

$$\mathrm{Logloss}(t) = \log(1 + \exp(-t)) \tag{7}$$
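The loss functions (3), (4), (6) and (7) can be written down directly. The following is a minimal NumPy sketch, not the authors' implementation; the function names are chosen here for illustration, and the logistic loss is assumed to be expressed in the margin t = y f(x), so that it decreases as the margin grows, like the hinge loss.

```python
import numpy as np

def hinge(t):
    """Hinge loss H_1(t) = max(0, 1 - t), Eq. (3)."""
    return np.maximum(0.0, 1.0 - t)

def symmetric_hinge(t):
    """Symmetric hinge loss max(0, 1 - |t|), Eq. (4)."""
    return np.maximum(0.0, 1.0 - np.abs(t))

def ramp(t, s=0.3):
    """Ramp loss R_s(t) = min(1 - s, max(0, 1 - t)), Eq. (6).

    The default s=0.3 is the value quoted in the text.
    """
    return np.minimum(1.0 - s, np.maximum(0.0, 1.0 - t))

def logloss(t):
    """Logistic loss log(1 + exp(-t)), Eq. (7).

    np.logaddexp(0, -t) evaluates log(exp(0) + exp(-t)) without overflow.
    """
    return np.logaddexp(0.0, -t)

# Evaluate all four losses on a few margins for comparison.
margins = np.array([-2.0, 0.0, 0.5, 2.0])
for name, fn in [("hinge", hinge), ("symmetric hinge", symmetric_hinge),
                 ("ramp", ramp), ("logloss", logloss)]:
    print(name, fn(margins))
```

Unlike the hinge loss, the ramp loss is capped at 1 - s, so an example with a very negative margin contributes a bounded penalty.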

Now we can rewrite (2) as

$$\min_{\omega, b} \; \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{L} \mathrm{Logloss}(y_i f_\theta(x_i)) + C^* \sum_{i=L+1}^{L+U} R_s(|f_\theta(x_i)|) \tag{8}$$

This is the minimization problem we consider in the rest of the paper.

3.2 Stochastic gradient descent algorithms

Stochastic gradient descent (SGD), also referred to as stochastic approximation [11], has been applied extensively in machine learning, for example to support vector machines and neural networks. SGD updates the weight vector $\omega$ in the online setting. The standard SGD algorithm is as follows.

(1) Initialize $\omega_0$.
(2) For $t = 1, 2, \dots$
    a. Draw $z_t = (x_t, y_t)$ randomly from $D$.
    b. $\omega_{t+1} = \omega_t - \eta_t \nabla_\omega Q(z_t, \omega_t)$.

Here $Q(z, \omega)$ denotes the loss incurred on example $z$ with weights $\omega$, and $\eta_t$ is the gain (learning rate).

Stochastic gradient descent algorithms are particularly suitable for large-scale applications, where both the number of data points and the problem dimensionality are large. Large-scale experiments with stochastic gradient descent have achieved good performance [3]. Additionally, the optimal prediction performance can be reached with only a small number of iterations over the training data.

To accelerate the convergence of SGD, averaged stochastic gradient descent (ASGD) was proposed by Polyak and Juditsky [12]. ASGD performs the normal stochastic gradient update of $\omega_t$, just like standard SGD, and recursively computes the average $\bar{\omega}_t$ as in (9):

$$\omega_{t+1} = \omega_t - \eta_t \nabla_\omega Q(z_t, \omega_t), \qquad \bar{\omega}_{t+1} = \frac{t}{t+1}\,\bar{\omega}_t + \frac{1}{t+1}\,\omega_t \tag{9}$$

A careful choice of the gains $\eta_t$, as in (10), helps achieve the promised performance [13]:

$$\eta_t = \eta_0 (1 + \lambda \eta_0 t)^{-0.75} \tag{10}$$

The initial gain $\eta_0$ is set by observing the performance on a subset of the labeled samples. Polyak and Juditsky [12] proved that, given enough training samples, ASGD can obtain parameters as good as the empirical optimum in just one epoch over the data. Besides, ASGD is extremely easy to implement compared to second-order SGD.

Since S3VM is designed to use labeled data together with large-scale unlabeled data, SGD algorithms are well suited to it. Because of the advantages above, we choose ASGD as the optimization approach, and we follow the concrete implementation described by Xu [13].
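To make the update (9) and the gain schedule (10) concrete, here is a minimal sketch of ASGD for an L2-regularized logistic loss. It is an illustration under stated assumptions rather than the implementation of [13]: the function name and default values are hypothetical, there is no bias term, examples are visited in a random permutation instead of being drawn with replacement, and the running average starts from the first updated iterate.

```python
import numpy as np

def asgd_logistic(X, y, lam=1e-4, eta0=0.1, epochs=1, seed=0):
    """Averaged SGD on lam/2 * ||w||^2 + log(1 + exp(-y * w.x)).

    Returns (w, w_bar): the last iterate and the running average.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Gain schedule, Eq. (10).
            eta = eta0 * (1.0 + lam * eta0 * t) ** -0.75
            margin = y[i] * X[i].dot(w)
            # 1 / (1 + exp(margin)), written with tanh to avoid overflow.
            sigma = 0.5 * (1.0 - np.tanh(0.5 * margin))
            grad = lam * w - sigma * y[i] * X[i]
            # Plain SGD step, first half of Eq. (9).
            w = w - eta * grad
            t += 1
            # Running mean of the iterates, second half of Eq. (9).
            w_bar = w_bar + (w - w_bar) / t
    return w, w_bar

# Toy usage on synthetic, linearly separable data.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 2))
y_toy = np.where(X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0, 1.0, -1.0)
w_last, w_avg = asgd_logistic(X_toy, y_toy, lam=1e-3, eta0=0.5, epochs=5)
print("training accuracy:", np.mean(np.sign(X_toy @ w_avg) == y_toy))
```

The incremental form of the average needs only one extra vector of memory, which is part of why ASGD is easy to implement.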

3.3 ASGD for S3VM

To use unlabeled examples, the well-known SVMlight [5] assigns positive and negative labels to the unlabeled data in the same proportion as found in the labeled data and then switches the labels heuristically. CCCP-TSVM [9] labels each unlabeled example both positive and negative. In this paper we propose a method similar to self-learning, as follows.

(1) Use ASGD to train a model on the labeled data with logistic loss.
(2) Use this model to label the unlabeled data.
(3) Mix the labeled and unlabeled data, then shuffle.
(4) For each epoch (a pass through all data):
    a. Use ASGD to train a model on the labeled and unlabeled data with logistic loss and ramp loss, respectively.
    b. Use the resulting model to label the unlabeled data.

In our experiments we set $C = \frac{1}{L}$ and $C^* = \frac{1}{U}$ empirically; $\lambda$ varies across tasks and is easy to determine.

The main point we want to emphasize is the advantage of our method over existing approaches in training time. Our method can train S3VMs in linear time: the complexity of ASGD-S3VM is $O(T(L + U))$, where $T$ is the number of iterations and $L$ and $U$ are the numbers of labeled and unlabeled examples, respectively. In our experience it reaches the minimum within five iterations, so our method can deal with very large-scale data.

4 Experiment and Analysis

This section reports experimental results on two typical tasks, both of which are large scale. We take SVMlight-TSVM, CCCP-TSVM and the state-of-the-art SVMlin as baselines. We also compare our method with a standard SVM trained on the labeled examples only. Each task is evaluated with the measure appropriate to it. All experiments were run on a machine with six 64-bit Xeon(R) processors (1.87 GHz) and 256 GB of memory.

4.1 RCV1 experiment

The first experiment is a text categorization task on the Reuters data set prepared by Lewis et al. [14], where the goal is to separate the CCAT (Corporate/Industrial) category from the NOT CCAT categories. The features are constructed using the bag-of-words technique, weighted with a TF.IDF scheme and normalized to 1. The partition of the data is the same as in [2]. As the other S3VMs are intractable on such a large data set, we only perform experiments using 1000 labeled examples. The evaluation measure is simply accuracy.
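For reference, the training procedure of Section 3.3 (steps (1) to (4)) can be sketched as below. This is a simplified illustration, not the authors' code: the names `train_epoch` and `asgd_s3vm` and all default values are hypothetical, the bias term and the weights C and C* are omitted, and the ramp loss is applied to the margin of each pseudo-labeled example.

```python
import numpy as np

def train_epoch(w, w_bar, t, X, y, labeled, lam, eta0, s, rng):
    """One ASGD pass: logistic loss on labeled rows, ramp loss on the rest."""
    for i in rng.permutation(len(y)):      # step (3): shuffled order
        eta = eta0 * (1.0 + lam * eta0 * t) ** -0.75
        m = y[i] * X[i].dot(w)
        if labeled[i]:
            # Gradient of log(1 + exp(-m)) with respect to w.
            g = -0.5 * (1.0 - np.tanh(0.5 * m)) * y[i] * X[i]
        elif s < m < 1.0:
            # Sloped part of the ramp loss R_s.
            g = -y[i] * X[i]
        else:
            # Flat parts of the ramp loss contribute no gradient.
            g = np.zeros_like(w)
        w = w - eta * (lam * w + g)
        t += 1
        w_bar = w_bar + (w - w_bar) / t    # running average of iterates
    return w, w_bar, t

def asgd_s3vm(X_lab, y_lab, X_unl, lam=1e-3, eta0=0.1, s=0.3, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    d = X_lab.shape[1]
    w, w_bar, t = np.zeros(d), np.zeros(d), 0
    # Step (1): train on the labeled data only.
    all_labeled = np.ones(len(y_lab), dtype=bool)
    w, w_bar, t = train_epoch(w, w_bar, t, X_lab, y_lab, all_labeled,
                              lam, eta0, s, rng)
    X = np.vstack([X_lab, X_unl])
    labeled = np.concatenate([all_labeled, np.zeros(len(X_unl), dtype=bool)])
    for _ in range(epochs):                # step (4)
        # Steps (2) and (4b): pseudo-label with the current averaged model.
        y_unl = np.where(X_unl.dot(w_bar) >= 0, 1.0, -1.0)
        y = np.concatenate([y_lab, y_unl])
        # Step (4a): one pass over labeled and pseudo-labeled data.
        w, w_bar, t = train_epoch(w, w_bar, t, X, y, labeled,
                                  lam, eta0, s, rng)
    return w_bar
```

Each epoch touches every example once, which matches the $O(T(L + U))$ cost stated in the text for $T$ epochs.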

Table 1: Accuracy (%) comparison. Columns: labeled size, unlabeled size, SVM, SVMlight-TSVM, SVMlin, CCCP-TSVM, ASGD-S3VM. Sizes shown: labeled 1k; unlabeled 1k, 2k, 5k, 10k.

We then varied the number of unlabeled examples U and recorded the test accuracy for each choice of U. The accuracy comparison against the different S3VMs is given in Table 1. In general, the accuracy of all the methods improves as the number of unlabeled examples increases. Compared to the standard SVM, all the S3VMs obtain higher accuracy, which shows that unlabeled data improves the results on this problem. Table 1 shows that SVMlight-TSVM performs best on this task, while ASGD-S3VM achieves results comparable to the top results of the other S3VM algorithms.

Table 2 shows the training time of all the S3VM algorithms with respect to the number of unlabeled examples. In the case of using 1000 labeled points and unlabeled points, ASGD-S3VM is approximately times faster than SVMlight-TSVM, 5230 times faster than CCCP-TSVM and 100 times faster than SVMlin. Moreover, the training time of ASGD-S3VM grows only slightly as the number of unlabeled examples increases, whereas the other three algorithms become prohibitively expensive.

Table 2: Training time (s) comparison. Columns: labeled size, unlabeled size, SVMlight-TSVM, SVMlin, CCCP-TSVM, ASGD-S3VM. Sizes shown: labeled 1k; unlabeled 1k, 2k, 5k, 10k.

4.2 NLP&CC 2012 evaluation experiment

The second experiment is a sentiment analysis task whose data set is from (NLP&CC Evaluation 2012). The task is subjectivity identification on Chinese micro blogs. To compare with the best result of (NLP&CC Evaluation 2012), we choose the same evaluation criterion, the F-measure, which is defined in (13):

$$\mathrm{Precision} = \frac{\text{system correct}}{\text{system proposed}} \tag{11}$$
$$\mathrm{Recall} = \frac{\text{system correct}}{\text{gold}} \tag{12}$$
$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{13}$$

where system correct is the number of items the system identifies correctly, system proposed is the number of items the system outputs, and gold is the number of items in the gold standard.

The size of our labeled set is the same as that of the first-ranked system of (NLP&CC Evaluation 2012) [15]. Each Chinese micro blog is represented by basic features including part of speech, TF-IDF, emoticons and HowNet. All the features are normalized to 1.

We ran all the S3VM algorithms and reached conclusions similar to those of the previous task. For brevity we report only the results of ASGD-S3VM and SVMlight-TSVM, as SVMlight-TSVM outperforms CCCP-TSVM and SVMlin. Table 3 again shows that unlabeled data greatly improves performance: accuracy and F-measure rise as the number of unlabeled examples increases. However, as still more unlabeled data is added, both criteria grow only slightly.

Table 4 shows that ASGD-S3VM outperforms SVMlight-TSVM in training time by orders of magnitude.

Table 5 shows that our method outperforms the best result of the NLP&CC Evaluation in F-measure and recall. It is worth noting that their method uses additional, more complex features such as subjective words and opinion words, which clearly help to achieve a good result, whereas we use only the simple basic features. This means that our semi-supervised method effectively learns latent knowledge from unlabeled data.

Table 3: Accuracy (%) and F1 comparison. Columns: labeled size, unlabeled size, then ACC(%) and F1 for each of standard SVM, SVMlight-TSVM and ASGD-S3VM. Sizes shown: labeled 2k; unlabeled 5k, 10k, 20k.

Table 4: Training time (s) comparison. Columns: labeled size, unlabeled size, training time (s) for SVMlight-TSVM and ASGD-S3VM. Sizes shown: labeled 2k; unlabeled 5k, 10k, 20k.

Table 5: Our best result vs first of evaluation. Columns: Precision, Recall, F1. Rows: First of Evaluation, ASGD-S3VM.
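The evaluation measures in Eqs. (11) to (13) can be computed as in the following sketch. The function name is chosen for illustration, the counts in the usage line are hypothetical and not taken from the evaluation, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall_f(system_correct, system_proposed, gold):
    """Return (precision, recall, F-measure) from raw counts.

    system_correct:  items the system identified correctly
    system_proposed: items the system output
    gold:            items in the gold standard
    """
    precision = system_correct / system_proposed if system_proposed else 0.0
    recall = system_correct / gold if gold else 0.0
    denom = precision + recall
    f_measure = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(system_correct=80, system_proposed=100, gold=160)
print(p, r, f)
```

Because the F-measure is a harmonic mean, it stays close to the smaller of precision and recall, so a system cannot score well by maximizing only one of them.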

5 Conclusion

In this article we have proposed an efficient method for large-scale linear semi-supervised support vector machines based on averaged stochastic gradient descent. Our method significantly improves the training speed of S3VM over existing approaches such as SVMlight-TSVM, CCCP-TSVM and SVMlin. It is more practical in real applications, where labeled data is scarce and plenty of unlabeled data is easily available. In future work, we will apply this approach to the non-linear setting.

References

[1] V. Vapnik, The Nature of Statistical Learning Theory, Springer, second edition.
[2] B. E. Boser, I. M. Guyon, V. N. Vapnik, A training algorithm for optimal margin classifiers, in: Proc. the Fifth Annual Workshop on Computational Learning Theory, 1992.
[3] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proc. COMPSTAT, 2010.
[4] K. Bennett, A. Demiriz, Semi-supervised support vector machines, Advances in Neural Information Processing Systems (1999).
[5] T. Joachims, Transductive inference for text classification using support vector machines, in: Proc. Machine Learning-International Workshop then Conference, 1999.
[6] T. De Bie, N. Cristianini, Convex methods for transduction, Advances in Neural Information Processing Systems 16 (2004) 73.
[7] L. Xu, D. Schuurmans, Unsupervised and semi-supervised multi-class support vector machines, in: Proc. the National Conference on Artificial Intelligence, 1999.
[8] O. Chapelle, A. Zien, Semi-supervised classification by low density separation, in: Proc. the Tenth International Workshop on Artificial Intelligence and Statistics.
[9] R. Collobert, F. Sinz, J. Weston, et al., Large scale transductive SVMs, The Journal of Machine Learning Research 7 (2006).
[10] V. Sindhwani, S. S. Keerthi, Large scale semi-supervised linear SVMs, in: Proc. the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2006.
[11] Kushner, Yin, Stochastic Approximation Algorithms and Applications, New York: Springer-Verlag.
[12] B. T. Polyak, A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM Journal on Control and Optimization (1992).
[13] W. Xu, Towards optimal one pass large scale learning with averaged stochastic gradient descent, arXiv preprint.
[14] D. D. Lewis, et al., RCV1: A new benchmark collection for text categorization research, The Journal of Machine Learning Research 5 (2004).
[15] Xiao Zhou, Zhenyu Zhou, Fang Li, LTLAB at Chinese Microblog Sentiment Analysis Track, in: Proc. 1st NLP&CC Evaluation, 2012.
