Synthesis of Maximum Margin and Multiview Learning using Unlabeled Data

Sandor Szedmak and John Shawe-Taylor
Electronics and Computer Science, ISIS Group, University of Southampton, SO17 1BJ, United Kingdom

Abstract. In this presentation we show that semi-supervised learning with two input sources can be transformed into a maximum margin problem similar to a binary SVM. Our formulation exploits the unlabeled data to reduce the complexity of the class of learning functions. To measure how much the complexity is decreased we use Rademacher complexity theory. The corresponding optimization problem is convex and can be solved efficiently for large-scale applications as well.

1 Introduction

Semi-supervised learning is one of the main directions of recent machine learning research. The exploitation of unlabeled data is an attractive approach, either to extend the capability of known methods or to derive novel learning devices. In this presentation we give a synthesis of some earlier approaches and present an optimization framework with good statistical generalization capability. Our idea combines:

- Multiview learning, where two or more input sources are given with the same output, see the papers of Blum et al. [4] and Dasgupta et al. [5];
- Maximum margin learning, where the unlabeled cases falling within the margin are considered as errors, see Bennett et al. [3];
- Reduction of the complexity of the learning class, where closeness of the learning functions is assumed on the unlabeled data, see the conference paper about skeleton learning of Lugosi et al. [6] and Balcan et al. [1];
- Combination of multiview and maximum margin learning, which introduces new constraints into the optimization, see Meng et al. [7].

Blending these ideas we arrive at an optimization framework which can be solved efficiently, at the computational complexity of a binary SVM problem. In the first part the optimization problem is presented, and then the generalization performance is analyzed. We apply Rademacher complexity to give an upper bound on the expected error. At the end, experimental results illustrate the usefulness of the approach. The underlying theory is summarized in Appendices A and B. Further details can be found in [2].

This work is supported by the PASCAL Network of Excellence (IST-2002-506778), European Community IST Programmes. The authors want to thank Gabor Lugosi for very fruitful discussions.

2 Optimization Framework

In semi-supervised multiview learning with two views we are given a compound sample S_L of output-input pairs, {(y_i, (x_i^1, x_i^2)) : y_i ∈ {−1, 1}, x_i^k ∈ X_k, i = 1, ..., m_L, k = 1, 2}, independently and identically generated by an unknown multivariate distribution P(Y, X), and a compound sample S_U = {(u_i^1, u_i^2) : u_i^k ∈ X_k, i = 1, ..., m_U, k = 1, 2} of unlabeled cases independently and identically chosen from the marginal distribution P(X) of P(Y, X). Furthermore, we are given embeddings of the inputs into Hilbert spaces by functions φ_k : X_k → H_{φ_k}, k = 1, 2. The image vectors of these mappings are called feature vectors in the sequel. The objective is to find linear functions f_k(x^k) = w_k^T φ_k(x^k) + b_k which can predict the potential label value of any labeled and unlabeled case. The decision function is then defined as sign((f_1 + f_2)/2). In order to exploit the unlabeled data in the optimization problem we choose particular solutions of the SVM subproblems for which the two predictors give similar values on the unlabeled data.

The matrix Y is a diagonal matrix of the labels {y_i}, and the matrices X_k and U_k comprise the labeled and unlabeled inputs in their rows for k = 1, 2. Applying the embedding functions φ_k to X_k and U_k gives matrices with the feature vectors in their rows; otherwise all vectors are column vectors. The optimization problem formulating our learning framework is given by

  min  (1/2) Σ_k w_k^T w_k + C Σ_k 1^T ξ_k + C_η 1^T (η^+ + η^−)
  w.r.t.  (w_k, b_k, ξ_k), k = 1, 2, (η^+, η^−),                                         (1)
  s.t.  (synthesis)    Σ_k (−1)^{k−1} (φ_k(U_k) w_k + 1 b_k) = η^+ − η^−,
        (subproblems)  Y (φ_k(X_k) w_k + 1 b_k) ≥ 1 − ξ_k,  ξ_k ≥ 0,  k = 1, 2,
        (η^+, η^−) ≥ 0.

We use the acronym SVM_2K for problem (1) in the sequel. Introducing Lagrange multipliers α_k for the SVM subproblems and β for the synthesis constraints, we can express the normal vector of the separating hyperplane for each k as w_k = [φ_k(X_k)^T, φ_k(U_k)^T] g_k, where g_k^T = (α_k^T Y, (−1)^{k−1} β^T); then, unfolding the Karush-Kuhn-Tucker conditions, we can derive the corresponding dual problem. The kernel matrices in the dual have the block structure

  K_k = [ φ_k(X_k) φ_k(X_k)^T   φ_k(X_k) φ_k(U_k)^T ]   =   [ K_k^{LL}   K_k^{LU} ]
        [ φ_k(U_k) φ_k(X_k)^T   φ_k(U_k) φ_k(U_k)^T ]       [ K_k^{UL}   K_k^{UU} ],

and we write K_k^L = [K_k^{LL}, K_k^{LU}] and K_k^U = [K_k^{UL}, K_k^{UU}] for their labeled and unlabeled blocks of rows. Thus, the dual reads as

  min_{α_1, α_2, β}  (1/2) Σ_k g_k^T K_k g_k − Σ_k 1^T α_k
  s.t.  1^T g_k = 0,  where g_k^T = (α_k^T Y, (−1)^{k−1} β^T),
        0 ≤ α_k ≤ C,  −C_η ≤ β ≤ C_η,  k = 1, 2.
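As a concrete illustration of the structure of problem (1), the sketch below sets up the primal with explicit (linear) feature maps and a generic convex solver. This is not the authors' implementation: cvxpy, the variable names X1, X2, U1, U2, y, and the default values of C and C_eta are assumptions made for the example, and with nonlinear kernels one would work through the dual given above instead.

```python
# Minimal sketch of the SVM_2K primal (1) for linear feature maps,
# assuming cvxpy as a generic convex solver (illustrative only).
import cvxpy as cp

def svm_2k_primal(X1, X2, U1, U2, y, C=1.0, C_eta=1.0):
    """X1, X2: labeled inputs of the two views; U1, U2: unlabeled inputs;
    y: labels in {-1, +1}. Returns (w1, b1, w2, b2)."""
    m_L, d1 = X1.shape
    d2 = X2.shape[1]
    m_U = U1.shape[0]

    w1, w2 = cp.Variable(d1), cp.Variable(d2)
    b1, b2 = cp.Variable(), cp.Variable()
    xi1 = cp.Variable(m_L, nonneg=True)
    xi2 = cp.Variable(m_L, nonneg=True)
    eta_p = cp.Variable(m_U, nonneg=True)   # eta^+
    eta_m = cp.Variable(m_U, nonneg=True)   # eta^-

    objective = cp.Minimize(
        0.5 * (cp.sum_squares(w1) + cp.sum_squares(w2))
        + C * (cp.sum(xi1) + cp.sum(xi2))
        + C_eta * cp.sum(eta_p + eta_m))
    constraints = [
        # synthesis: the two predictors agree on the unlabeled data up to slack
        (U1 @ w1 + b1) - (U2 @ w2 + b2) == eta_p - eta_m,
        # the two SVM subproblems on the labeled data
        cp.multiply(y, X1 @ w1 + b1) >= 1 - xi1,
        cp.multiply(y, X2 @ w2 + b2) >= 1 - xi2,
    ]
    cp.Problem(objective, constraints).solve()
    return w1.value, b1.value, w2.value, b2.value

# Decision on a new pair of views (x1, x2): sign(0.5 * ((x1 @ w1 + b1) + (x2 @ w2 + b2)))
```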

3 Theoretical Foundation

To illuminate the theoretical background of the presented learning method we use the theory of Rademacher complexity, see [2]. We keep the notation introduced in Section 2.

3.1 Analyzing Semi-supervised Multiview Learning

For SVM_2K the two feature sets are (φ_1(X_1), φ_1(U_1)) and (φ_2(X_2), φ_2(U_2)). An application of Theorem 3 (see Appendix A) shows that

  f_D(u, f_1, f_2) := E_u[ |Σ_k (−1)^{k−1} (w_k^T φ_k(u^k) + b_k)| ]
                    ≤ (1/m_U) 1^T (η^+ + η^−) + (2C/m_U) Σ_k √tr(K_k) + 3 √(ln(2/δ)/(2 m_U)) =: D

with probability at least 1 − δ. We have assumed that ||w_k||^2 + b_k^2 ≤ C^2 for some prefixed C and each k = 1, 2. Hence, the class of functions we are considering when applying SVM_2K to this problem can be restricted to

  F_{C,D} = { f : f(x) = (1/2) Σ_{k=1}^{2} ( K_k(x^k) g_k + b_k ),  g_k^T K_k g_k + b_k^2 ≤ C^2, k = 1, 2,  f_D(u, f_1, f_2) ≤ D },

where K_k(x^k) = [φ_k(x^k)^T φ_k(X_k)^T, φ_k(x^k)^T φ_k(U_k)^T]. The class F_{C,D} is clearly closed under negation. Applying the usual Rademacher techniques for margin bounds on generalization we obtain the following result.

Theorem 1. Fix δ ∈ (0, 1) and let F_{C,D} be the class of functions described above. Let the labeled samples (X_k) and the unlabeled samples (U_k) be drawn independently according to the probability distribution P(X) for k = 1, 2. Then with probability at least 1 − δ over random draws of labeled samples of size m_L, every f ∈ F_{C,D} satisfies

  P_{(x,y)~D}(sign(f(x)) ≠ y) ≤ (1/(2 m_L)) Σ_k 1^T ξ_k + R̂_l(F_{C,D}) + 3 √(ln(2/δ)/(2 m_L)).

It therefore remains to compute the empirical Rademacher complexity of F_{C,D}, which is the critical discriminator between the bounds for the individual SVMs and that of SVM_2K. The details are presented in Appendix B.

4 Experiments

In the experiments we used an image dataset² commonly used for generic object recognition. In the cross-validation 5% of the cases were randomly chosen as labeled cases and all others were unlabeled. This random selection of the labeled cases was repeated ten times. Table 1 shows the mean and the standard deviation of the accuracies and the Rademacher complexities computed for each image class, using an SVM working on the concatenation of the two feature sets and the same feature sets in the case of SVM_2K. It shows that the smaller Rademacher complexity can increase the mean and decrease the standard deviation of the classification accuracies.

  Image class              Airplanes    Faces        Motorbikes
  SVM on two features
    mean(std)              91.4(2.8)    96.8(1.3)    94.1(1.3)
    Rad. comp.             588.2        339.1        574.3
  SVM_2K
    mean(std)              91.9(2.1)    97.4(1.3)    94.3(1.1)
    Rad. comp.             236.4        34.4         197.4

Table 1: Accuracies (%), standard deviations and estimates of the Rademacher complexity for the SVM acting on the two concatenated feature sets and for SVM_2K on three image classes. 5% of the cases were labeled.

² Available at http://www.robots.ox.ac.uk/~vgg/data/

A Appendix: Short Introduction to Rademacher Complexity Theory

We begin with the definitions required for Rademacher complexity, see for example Bartlett and Mendelson [2] (see also [8] for an introductory exposition).

Definition 2. For a sample S = {x_1, ..., x_l} generated by a distribution D on a set X, and a real-valued function class F with domain X, the empirical Rademacher complexity of F is the random variable

  R̂_l(F) = E_σ[ sup_{f∈F} (2/l) Σ_{i=1}^{l} σ_i f(x_i) | x_1, ..., x_l ],

where σ = {σ_1, ..., σ_l} are independent uniform {±1}-valued Rademacher random variables. The Rademacher complexity of F is

  R_l(F) = E_S[ R̂_l(F) ] = E_{S,σ}[ sup_{f∈F} (2/l) Σ_{i=1}^{l} σ_i f(x_i) ].

We use E_D to denote expectation with respect to a distribution D, and E_S when the distribution is the uniform (empirical) distribution on a sample S.

Theorem 3. Fix δ ∈ (0, 1) and let F be a class of functions mapping from S to [0, 1]. Let (x_i)_{i=1}^{l} be drawn independently according to a probability distribution D. Then with probability at least 1 − δ over random draws of samples of size l, every f ∈ F satisfies

  E_D[f(x)] ≤ E_S[f(x)] + R_l(F) + 3 √(ln(2/δ)/(2l)) ≤ E_S[f(x)] + R̂_l(F) + 3 √(ln(2/δ)/(2l)).

Given a training set S, the class of functions that we will primarily be considering is the class of linear functions with bounded norm,

  { x ↦ Σ_{i=1}^{l} α_i κ(x_i, x) : α^T K α ≤ B^2 } ⊆ { x ↦ ⟨w, φ(x)⟩ : ||w|| ≤ B } = F_B,

where φ is the feature mapping corresponding to the kernel κ and K is the corresponding kernel matrix for the sample S. The following result bounds the Rademacher complexity of linear function classes.

Theorem 4. [2] If κ : X × X → R is a kernel and S = {x_1, ..., x_l} is an i.i.d. sample from X, then the empirical Rademacher complexity of the class F_B satisfies

  R̂_l(F_B) ≤ (2B/l) √tr(K).

B Appendix: Empirical Rademacher Complexity of F_{C,D}

We define an auxiliary function of the weight vectors w̃_k = (w_k, b_k), k = 1, 2,

  D(w̃_1, w̃_2) := E_D[ |Σ_k (−1)^{k−1} (w_k^T φ_k(u^k) + b_k)| ],

and write the empirical Rademacher complexity of the class F_{C,D} as

  R̂_l(F_{C,D}) = E_σ[ sup_{f∈F_{C,D}} (2/m_L) Σ_{i=1}^{m_L} σ_i f(x_i) ]
               = E_σ[ sup_{||w̃_k|| ≤ C, k=1,2, D(w̃_1,w̃_2) ≤ D} (1/m_L) Σ_{k=1}^{2} ( σ^T K_k^L g_k + σ^T 1 b_k ) ].

The bound on the constraint set given below is based on the reversed version of the basic Rademacher complexity theorem, in which the roles of the empirical and the true expectations are swapped:

Theorem 5. Fix δ ∈ (0, 1) and let F be a class of functions mapping from S to [0, 1]. Let (x_i)_{i=1}^{l} be drawn independently according to a probability distribution D. Then with probability at least 1 − δ over random draws of samples of size l, every f ∈ F satisfies

  E_S[f(x)] ≤ E_D[f(x)] + R_l(F) + 3 √(ln(2/δ)/(2l)) ≤ E_D[f(x)] + R̂_l(F) + 3 √(ln(2/δ)/(2l)).

The proof tracks that of Theorem 3 but is omitted for lack of space.
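Theorem 4 lends itself to a quick numerical check: for the kernel class F_B the supremum in Definition 2 has the closed form (2B/l)√(σ^T K σ), so the empirical Rademacher complexity can be estimated by averaging this quantity over random draws of σ and compared with the trace bound. The sketch below illustrates only this single-view estimate, not the two-view computation behind Table 1; the Gaussian kernel, the toy data, and all variable names are assumptions made for the example.

```python
# Minimal sketch (assumptions as stated above): Monte Carlo estimate of the
# empirical Rademacher complexity of F_B versus the bound (2B/l)*sqrt(tr(K)).
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def rademacher_estimate(K, B=1.0, n_draws=1000, seed=0):
    """Average of (2B/l)*sqrt(sigma^T K sigma) over random Rademacher sigma.

    For linear functions with ||w|| <= B the supremum over F_B equals
    (2B/l)*sqrt(sigma^T K sigma), so no inner optimization is needed."""
    rng = np.random.default_rng(seed)
    l = K.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, l))
    vals = (2.0 * B / l) * np.sqrt(np.einsum("ni,ij,nj->n", sigma, K, sigma))
    return vals.mean()

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(200, 5))   # toy sample
    K = rbf_kernel(X)
    estimate = rademacher_estimate(K, B=1.0)
    trace_bound = (2.0 / K.shape[0]) * np.sqrt(np.trace(K))   # Theorem 4
    print(f"MC estimate: {estimate:.4f} <= trace bound: {trace_bound:.4f}")
```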

For weight vectors {w̃_k} satisfying D(w̃_1, w̃_2) ≤ D, an application of Theorem 5 shows that with probability at least 1 − δ we have

  D̂(w̃_1, w̃_2) := E_S[ |Σ_k (−1)^{k−1} (w_k^T φ_k(u^k) + b_k)| ]
               ≤ D + (2C/m_U) Σ_k √tr(K_k) + 3 √(ln(2/δ)/(2 m_U))
               = (1/m_U) 1^T (η^+ + η^−) + 2 [ (2C/m_U) Σ_k √tr(K_k) + 3 √(ln(2/δ)/(2 m_U)) ] =: D̂.

The above result shows that, with probability greater than 1 − δ, the empirical Rademacher complexity of F_{C,D} satisfies

  R̂_l(F_{C,D}) ≤ E_σ[ sup_{||w̃_k|| ≤ C, k=1,2, D̂(w̃_1,w̃_2) ≤ D̂} (1/m_L) Σ_k ( σ^T φ_k(X_k) w_k + σ^T 1 b_k ) ],

where σ ∈ {−1, +1}^{m_L}. Note that the expression in square brackets is concentrated under the uniform distribution of the Rademacher variables. Hence, we can estimate the complexity for a randomly chosen instantiation σ̂ of the Rademacher variables σ. We then have to find the values of {w̃_k} that maximize

  max  (1/m_L) Σ_k [ σ̂^T φ_k(X_k) w_k + σ̂^T 1 b_k ] = (1/m_L) Σ_k ( σ̂^T K_k^L g_k + σ̂^T 1 b_k )
  s.t.  g_k^T K_k g_k + b_k^2 ≤ C^2,  k = 1, 2,
        (1/m_U) 1^T | Σ_k (−1)^{k−1} ( K_k^U g_k + 1 b_k ) | ≤ D̂.

The expected value of the objective function computed over randomly chosen σ̂'s is the estimate of the Rademacher complexity.

References

[1] M.F. Balcan and A. Blum. A PAC-style model for learning from labeled and unlabeled data. In COLT, pages 111-126, 2005.
[2] P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.
[3] K. Bennett and A. Demiriz. Semi-supervised support vector machines. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems, volume 12, pages 368-374. MIT Press, Cambridge, MA, 1998.
[4] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 92-100, 1998.
[5] S. Dasgupta, M.L. Littman, and D. McAllester. PAC generalization bounds for co-training. In Advances in Neural Information Processing Systems (NIPS), 2001.
[6] G. Lugosi and M. Pinter. A data-dependent skeleton estimate for learning. In Proceedings of the 9th Annual Conference on Computational Learning Theory, New York, pages 51-58. Association for Computing Machinery, 1996.
[7] H. Meng, J. Shawe-Taylor, S. Szedmak, and J.R.D. Farquhar. Support vector machine to synthesise kernels. In Sheffield Machine Learning Workshop Proceedings, Lecture Notes in Computer Science. Springer, 2005.
[8] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.