What is semi-supervised learning? In many practical learning domains (e.g., text processing, video indexing, bioinformatics), there is a large supply of unlabeled data but only limited labeled data, which can be expensive to generate. Semi-supervised learning: learning from a combination of both labeled and unlabeled data.
Comparing: Supervised learning algorithms require enough labeled training data to learn reasonably accurate classifiers. Unsupervised learning methods are employed to discover structure in unlabeled data. Semi-supervised learning allows taking advantage of the strengths of both.
Why should it be useful? Unlabeled data can help in two different ways. Identify data structure: find a meaningful representation of complicated high-dimensional data through a first unsupervised learning step. Cluster assumption, which can be stated in two equivalent ways: two points which can be connected by a high-density path (i.e., in the same cluster) are likely to have the same label; the decision boundary should lie in a low-density region.
A Toy Dataset (Two Moons)
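As a concrete illustration, a dataset like the two moons can be generated with a short NumPy sketch. The function name, noise level, and moon placement below are our own choices, not from the slides:

```python
import numpy as np

def make_two_moons(n=200, noise=0.1, rng=None):
    """Two interleaving half-circles ('two moons'), labels in {-1, +1}."""
    rng = np.random.default_rng(rng)
    m = n // 2
    t = np.pi * rng.random(m)
    upper = np.c_[np.cos(t), np.sin(t)]             # upper half-circle
    lower = np.c_[1 - np.cos(t), 0.5 - np.sin(t)]   # lower half-circle, shifted
    X = np.vstack([upper, lower]) + noise * rng.standard_normal((2 * m, 2))
    y = np.r_[-np.ones(m), np.ones(m)]
    return X, y

X, y = make_two_moons(200, rng=0)
```

In the semi-supervised setting, only one or two of these labels would be kept and the rest treated as unlabeled.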
Learning from Examples. Input space X and output space Y = {−1, +1}. Training set S = {z_1 = (x_1, y_1), ..., z_l = (x_l, y_l)} in Z = X × Y, drawn i.i.d. from some unknown distribution. Classifier f : X → Y. Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking, p. 2/31
Transductive Setting. Input space X = {x_1, ..., x_n} and output space Y = {−1, +1}. Training set S = {z_1 = (x_1, y_1), ..., z_l = (x_l, y_l)}. Classifier f : X → Y.
Intuition about classification: Manifold. Local consistency: nearby points are likely to have the same label. Global consistency: points on the same structure (typically referred to as a cluster or manifold) are likely to have the same label.
Algorithm.
1. Form the affinity matrix W defined by W_ij = exp(−‖x_i − x_j‖² / 2σ²) if i ≠ j, and W_ii = 0.
2. Construct the matrix S = D^{−1/2} W D^{−1/2}, in which D is a diagonal matrix with its (i, i)-element equal to the sum of the i-th row of W.
3. Iterate f(t + 1) = αSf(t) + (1 − α)y until convergence, where α is a parameter in (0, 1).
4. Let f* denote the limit of the sequence {f(t)}. Label each point x_i as y_i = sgn(f*_i).
Convergence. Theorem: the sequence {f(t)} converges to f* = β(I − αS)^{−1} y, where β = 1 − α. Proof: suppose f(0) = y. By the iteration equation, we have
f(t) = (αS)^t y + (1 − α) Σ_{i=0}^{t−1} (αS)^i y. (1)
Since 0 < α < 1 and the eigenvalues of S lie in [−1, 1],
lim_{t→∞} (αS)^t = 0 and lim_{t→∞} Σ_{i=0}^{t−1} (αS)^i = (I − αS)^{−1}, (2)
hence f(t) → (1 − α)(I − αS)^{−1} y.
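The closed form means the iteration can be replaced by a single linear solve. A small numeric sanity check (using a random symmetric matrix with spectral radius below 1 as a stand-in for S, our own test setup):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
S = (A + A.T) / 2
S /= np.abs(np.linalg.eigvalsh(S)).max() * 1.01  # eigenvalues now in (-1, 1)
Y = rng.standard_normal(5)
alpha = 0.9

# Iterative solution: f(t+1) = alpha S f(t) + (1 - alpha) Y, f(0) = Y.
f = Y.copy()
for _ in range(5000):
    f = alpha * S @ f + (1 - alpha) * Y

# Closed form: f* = (1 - alpha) (I - alpha S)^{-1} Y.
f_closed = (1 - alpha) * np.linalg.solve(np.eye(5) - alpha * S, Y)
print(np.allclose(f, f_closed))  # True
```

In practice the linear solve is usually preferred over iterating to convergence.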
Regularization Framework. Cost function
Q(f) = (1/2) [ Σ_{i,j=1}^n W_ij ( f_i/√D_ii − f_j/√D_jj )² + μ Σ_{i=1}^n ( f_i − y_i )² ].
Smoothness term: measures the changes between nearby points. Fitting term: measures the changes from the initial label assignments.
Regularization Framework. Theorem: f* = arg min_{f ∈ F} Q(f). Proof: differentiating Q(f) with respect to f, we have
∂Q/∂f |_{f=f*} = f* − Sf* + μ(f* − y) = 0, (1)
which can be transformed into
f* − (1/(1 + μ)) Sf* − (μ/(1 + μ)) y = 0. (2)
Let α = 1/(1 + μ) and β = μ/(1 + μ). Then
(I − αS) f* = βy. (3)
Two Variants. Substitute P = D^{−1} W for S in the iteration equation; then f* = (I − αP)^{−1} y. Or replace S with P^T, the transpose of P; then f* = (I − αP^T)^{−1} y, which is equivalent (up to a positive diagonal rescaling, which leaves the signs unchanged) to f* = (D − αW)^{−1} y.
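Both variants are again single linear solves. A minimal sketch, assuming an affinity matrix W (zero diagonal) computed as in the algorithm above; the function name and toy data are our own:

```python
import numpy as np

def variant_solutions(W, Y, alpha=0.95):
    """Closed-form solutions of the two variants.

    W : (n, n) affinity matrix with zero diagonal.
    Y : (n,) initial labels in {-1, 0, +1}.
    """
    D = W.sum(1)
    P = W / D[:, None]                               # P = D^{-1} W (row-stochastic)
    n = len(Y)
    f1 = np.linalg.solve(np.eye(n) - alpha * P, Y)   # (I - alpha P)^{-1} y
    f2 = np.linalg.solve(np.diag(D) - alpha * W, Y)  # (D - alpha W)^{-1} y
    return np.sign(f1), np.sign(f2)

# Toy usage (assumed setup): two tight blobs, one labeled point each.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.2, (10, 2)),
               rng.normal([4, 0], 0.2, (10, 2))])
Y = np.zeros(20)
Y[0], Y[10] = -1, 1
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / 2)
np.fill_diagonal(W, 0.0)
f1, f2 = variant_solutions(W, Y)
```

On well-separated clusters, both variants recover the same labeling as the original method.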
Toy Problem. [Figure: classification of the two-moons data as the iteration proceeds; panels (a) t = 10, (b) t = 50, (c) t = 100, (d) t = 400.]
Handwritten Digit Recognition (USPS). [Figure: test error vs. number of labeled points (10–100) for k-NN (k = 1), SVM (RBF kernel), the consistency method, variant (1), and variant (2).] Dimension: 16×16. Size: 9298. (α = 0.95)
Handwritten Digit Recognition (USPS). [Figure: test error vs. the parameter α (0.7–0.99) for the consistency method, variant (1), and variant (2).] Size of labeled data: l = 50.
Text Classification (20-newsgroups). [Figure: test error vs. number of labeled points (10–100) for k-NN (k = 1), SVM (RBF kernel), the consistency method, variant (1), and variant (2).] Dimension: 8014. Size: 3970. (α = 0.95)
Text Classification (20-newsgroups). [Figure: test error vs. the parameter α (0.7–0.99) for the consistency method, variant (1), and variant (2).] Size of labeled data: l = 50.