Semi-Supervised Learning by Multi-Manifold Separation
Xiaojin (Jerry) Zhu
Department of Computer Sciences, University of Wisconsin-Madison
Joint work with Andrew Goldberg, Zhiting Xu, Aarti Singh, and Rob Nowak
Xiaojin Zhu (Wisconsin) Semi-Supervised Learning 1 / 36
Supervised Learning
Semi-Supervised Learning
Prediction Problems
The feature space X = R^d; the label space Y = {0, 1} or R.
Samples (X, Y) in X x Y, with (X, Y) ~ P_XY; X: feature vector, Y: label.
Goal: construct a predictor f: X -> Y to minimize R(f) = E_{(X,Y)~P_XY}[loss(y, f(x))].
Learning from Data
The optimal predictor f* = argmin_f E_{(X,Y)~P_XY}[loss(y, f(x))] depends on P_XY, which is often unknown. However, we can learn a good predictor from a training set.
Supervised learning: {(X_i, Y_i)}_{i=1}^n iid ~ P_XY
{(X_i, Y_i)}_{i=1}^n -> f^_n
Semi-Supervised Learning
In many applications in science and engineering, labeled data are scarce, but unlabeled data are abundant and cheap.
Semi-supervised learning (SSL): {(X_i, Y_i)}_{i=1}^n iid ~ P_XY, {X_j}_{j=1}^m iid ~ P_X, with m >> n.
{(X_i, Y_i)}_{i=1}^n, {X_j}_{j=1}^m -> f^_{m,n}
Example: Handwritten Digit Recognition
Example: Handwritten Digit Recognition
Many unlabeled data + a few labeled data: knowledge of the manifold/cluster structure, plus a few labels in each manifold/cluster, is sufficient to design a good predictor.
Common Assumptions in SSL
Cluster assumption: f is constant or smooth on connected high-density regions.
Manifold assumption: the support of P_X lies on low-dimensional manifolds, and f is smooth with respect to geodesic distance on the manifolds.
Mathematical Formalization
Generic learning classes: P_XY = { P_X P_{Y|X} : P_X in P_X, P_{Y|X} in P_{Y|X} }
Linked learning classes: P_XY = { P_X P_{Y|X} : P_X in P_X, P_{Y|X} in P_{Y|X}(P_X), a subset of P_{Y|X} }
Link: unlabeled data may inform the design of the predictor.
SSL can yield a faster rate of error convergence than supervised learning:
sup_{P_XY} E[R(f^_{m,n})] << inf_{f_n} sup_{P_XY} E[R(f_n)]
f^_n: predictor based on n labeled examples; f^_{m,n}: based on n labeled and m unlabeled examples.
The Value of Unlabeled Data
Castelli and Cover '95 (classification): assume an identifiable mixture p(x) = p(x | Y=0) p(Y=0) + p(x | Y=1) p(Y=1).
Learn the decision regions from the (many) unlabeled examples; label the decision regions from the (few) labeled examples.
Main result: sup_{P_XY} E[R(f^_{infinity,n})] - R* <= C e^{-alpha n}
What about more general cluster or manifold assumptions?
Do Unlabeled Data Help in General?
No [Lafferty & Wasserman 2007]: fix the complexity of P_XY and let n grow. Given enough labeled data, unlabeled data are superfluous (no faster rates of convergence for SSL).
Yes [Niyogi 2008]: let the complexity of P_XY grow with n. For finite n, the complexity of the learning problem can be such that supervised learning fails while SSL achieves small expected error.
Both are correct, capturing two extremes. Finite-sample bounds give a more complete picture.
Decision Regions
Our assumption: P_XY corresponds to (p_X, f)
- supp(p_X) = union_i C_i, a union of compact sets with gamma separation
- the marginal density p_X is bounded away from zero and smooth on each C_i
- the regression function f(x) is Holder-alpha smooth on each support set
Adaptive Supervised Learning
If gamma >= gamma_0 > 0 and n -> infinity, SL eventually discovers the decision regions, and the excess risk of SL (squared loss) is minimax-optimal (SSL has no advantage):
sup_{P_XY} E[R(f^_n)] - R* <= C n^{-2alpha/(2alpha+d)}
But if n is fixed and gamma -> 0, SL eventually mixes up the decision regions:
inf_{f_n} sup_{P_XY} E[R(f_n)] - R* >= c n^{-1/d}
Unlabeled data identify the decision regions down to smaller gamma, i.e., SSL only "messes up" later.
Unlabeled Data: Now it helps, now it doesn't (Singh, Nowak & Zhu, NIPS 2008)

margin gamma                          | SSL upper bound      | SL lower bound       | SSL helps?
gamma >= n^{-1/d}                     | n^{-2alpha/(2alpha+d)} | n^{-2alpha/(2alpha+d)} | no
m^{-1/d} <= gamma < n^{-1/d}          | n^{-2alpha/(2alpha+d)} | n^{-1/d}             | yes
-m^{-1/d} <= gamma < m^{-1/d}         | n^{-1/d}             | n^{-1/d}             | no
gamma < -m^{-1/d}                     | n^{-2alpha/(2alpha+d)} | n^{-1/d}             | yes
An SSL Algorithm
Given n labeled examples and m unlabeled examples:
1. Use the unlabeled data to infer ~log(n) decision regions C^_i: plug in your favorite manifold clustering algorithm, one that detects abrupt changes in support, density, dimensionality, etc.; carve up the ambient space into the C^_i (Voronoi); each C^_i has to be big enough, with n/log^2(n) labeled examples and m/log^2(n) unlabeled examples.
2. Use the labeled data in C^_i to train a supervised learner f^_i.
3. If a test point x* falls in C^_i, predict f^_i(x*).
Similar to "cluster & label".
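The three steps above can be sketched as a generic "cluster & label" routine. This is a minimal illustration, not the slides' exact implementation: `cluster_fn`, `base_learner`, and the toy `Mean` learner are hypothetical stand-ins for any manifold clustering algorithm and supervised learner.

```python
import numpy as np

def cluster_and_label(X_lab, y_lab, X_unlab, X_test, cluster_fn, base_learner):
    """Sketch of "cluster & label": infer regions from all points, then
    train one supervised learner per region on its labeled points."""
    X_all = np.vstack([X_lab, X_unlab, X_test])
    regions = cluster_fn(X_all)                      # step 1: infer decision regions
    n, m = len(X_lab), len(X_unlab)
    preds = np.empty(len(X_test))
    for r in np.unique(regions):
        mask_lab = regions[:n] == r                  # labeled points in region r
        mask_test = regions[n + m:] == r             # test points in region r
        if mask_test.any():
            f = base_learner()
            f.fit(X_lab[mask_lab], y_lab[mask_lab])  # step 2: per-region learner
            preds[np.where(mask_test)[0]] = f.predict(X_test[mask_test])  # step 3
    return preds

# toy demo: two well-separated 1-D clusters, one label each
X_lab = np.array([[0.0], [10.0]]); y_lab = np.array([0.0, 1.0])
X_unlab = np.array([[0.5], [9.5], [0.2]])
X_test = np.array([[0.1], [9.9]])

class Mean:                                          # trivial per-region learner
    def fit(self, X, y): self.v = y.mean()
    def predict(self, X): return np.full(len(X), self.v)

preds = cluster_and_label(X_lab, y_lab, X_unlab, X_test,
                          cluster_fn=lambda X: (X[:, 0] > 5).astype(int),
                          base_learner=Mean)
```

With the threshold clusterer standing in for manifold clustering, each test point inherits the label of its region.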
Building Blocks: Local Covariance Matrix
For each x in {x_i}_{i=1}^{n+m}, find its ceil(log(n+m)) nearest neighbors (in Euclidean distance). The local covariance matrix of the neighbors is
Sigma_x = (1 / (ceil(log(n+m)) - 1)) sum_j (x_j - mu_x)(x_j - mu_x)^T
Sigma_x captures the local geometry.
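A direct numpy sketch of this building block, assuming the neighborhood of a point includes the point itself (the slides leave this detail open):

```python
import numpy as np

def local_covariances(X):
    """Local covariance of each point's ceil(log(N)) nearest neighbors."""
    N = len(X)
    k = max(2, int(np.ceil(np.log(N))))          # neighborhood size ~ log(n+m)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    covs = []
    for i in range(N):
        nbrs = X[np.argsort(D[i])[:k]]           # k nearest neighbors (incl. x_i)
        mu = nbrs.mean(axis=0)
        covs.append((nbrs - mu).T @ (nbrs - mu) / (k - 1))
    return np.array(covs)
```

For points lying on a line in the plane, each local covariance is (near) rank one, which is exactly the geometric signal the later Hellinger comparison exploits.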
A Distance Between Sigma_1 and Sigma_2
Hellinger distance: H^2(p, q) = (1/2) int (sqrt(p(x)) - sqrt(q(x)))^2 dx; H(p, q) is symmetric and lies in [0, 1].
Let p = N(0, Sigma_1), q = N(0, Sigma_2). We define
H^2(Sigma_1, Sigma_2) = 1 - |Sigma_1|^{1/4} |Sigma_2|^{1/4} / |(Sigma_1 + Sigma_2)/2|^{1/2}
(computed in a common subspace).
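The closed form above is easy to compute with determinants. A sketch, with the eps-smoothing (Sigma + eps I, mentioned on the next slide) added here as an assumption to keep determinants positive for rank-deficient local covariances:

```python
import numpy as np

def hellinger(S1, S2, eps=1e-8):
    """Hellinger distance between N(0, S1) and N(0, S2):
    H^2 = 1 - |S1|^{1/4} |S2|^{1/4} / |(S1+S2)/2|^{1/2}."""
    d = S1.shape[0]
    S1 = S1 + eps * np.eye(d)                    # smoothed version Sigma + eps*I
    S2 = S2 + eps * np.eye(d)
    num = np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
    den = np.linalg.det((S1 + S2) / 2.0) ** 0.5
    return float(np.sqrt(max(0.0, 1.0 - num / den)))
```

Identical covariances give H = 0; covariances of different intrinsic dimension (e.g. rank 1 vs. rank 2) push H toward 1, matching the behavior tabulated on the slides.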
Hellinger Distance
Example values on pairs of local covariances: similar neighborhoods, H ~ 0.02; different density, H ~ 0.28; different dimension, H ~ 1; different orientation, H ~ 1.
(* computed with the smoothed version Sigma + eps*I)
A Sparse Subset
Two close points will have similar neighborhoods, hence small H, even if they lie on different manifolds. Compute a sparse subset of m' = m/log(m) points (the red dots in the figure):
Start from an arbitrary x_0; remove its log(m) nearest neighbors; let x_1 be the next nearest neighbor; repeat. Include all labeled data.
Random sampling might work too.
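The greedy procedure can be sketched directly; `labeled_idx` is an illustrative parameter for the "include all labeled data" rule:

```python
import numpy as np

def sparse_subset(X, labeled_idx=()):
    """Greedy sparse subset: keep a point, drop its log(m) nearest
    neighbors, move to the next-nearest remaining point, repeat."""
    m = len(X)
    k = max(1, int(np.log(m)))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    alive = np.ones(m, dtype=bool)
    keep = set(labeled_idx)                  # always include labeled points
    cur = 0                                  # arbitrary starting point x_0
    while alive.any():
        keep.add(cur)
        alive[cur] = False
        order = np.argsort(D[cur])
        for j in [j for j in order if alive[j]][:k]:
            alive[j] = False                 # remove log(m) nearest neighbors
        rest = [j for j in order if alive[j]]
        if not rest:
            break
        cur = rest[0]                        # next nearest remaining point
    return sorted(keep)
```

On a regular 1-D grid the procedure keeps roughly every (log(m)+1)-th point, thinning the sample while preserving coverage of the support.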
A Sparse Graph on the Sparse Subset
Build a sparse nearest-neighbor graph on the sparse subset, using the Mahalanobis distance to trace the manifold:
d^2(x, y) = (x - y)^T Sigma_x^{-1} (x - y)
Put a Gaussian edge weight on the sparse edges, combining locality and shape:
w_ij = exp( -H^2(Sigma_{x_i}, Sigma_{x_j}) / (2 sigma^2) )
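A sketch of the two ingredients, Mahalanobis distance for choosing edges and Hellinger-based Gaussian weights for scoring them; the Hellinger term is re-derived inline so the snippet is self-contained, and the eps regularizers are assumptions for rank-deficient covariances:

```python
import numpy as np

def mahalanobis_sq(x, y, Sx, eps=1e-6):
    """d^2(x, y) = (x - y)^T Sx^{-1} (x - y), with eps-regularized Sx."""
    d = x - y
    return float(d @ np.linalg.solve(Sx + eps * np.eye(len(x)), d))

def edge_weight(S1, S2, sigma=1.0, eps=1e-8):
    """w_ij = exp(-H^2(S1, S2) / (2 sigma^2)), Gaussian Hellinger weight."""
    d = S1.shape[0]
    S1 = S1 + eps * np.eye(d); S2 = S2 + eps * np.eye(d)
    h2 = 1.0 - (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
                / np.linalg.det((S1 + S2) / 2.0) ** 0.5)
    return float(np.exp(-max(0.0, h2) / (2.0 * sigma ** 2)))
```

Similar local covariances give weights near 1 (strong edges within a manifold); dissimilar ones, e.g. different intrinsic dimension, are down-weighted.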
A Sparse Graph on the Sparse Subset
(figure: red = large w, yellow = small w)
Multi-Manifold Separation as Graph Cut
Cut the graph W = [w_ij] into k ~ ceil(log(n)) parts, where each part has at least n/log^2(n) labeled examples and m'/log^2(n) unlabeled examples.
Formally: RatioCut with size constraints. Let the k parts be A_1, ..., A_k:
cut(A_i, A_i^c) = sum_{s in A_i, t in A_i^c} w_st
cut(A_1, ..., A_k) = sum_{i=1}^k cut(A_i, A_i^c)
Minimizing the cut directly tends to produce very unbalanced parts, hence
RatioCut(A_1, ..., A_k) = sum_{i=1}^k cut(A_i, A_i^c) / |A_i|
However, this balancing heuristic may not satisfy our size constraints.
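The RatioCut objective above is a few lines of numpy, shown here as a sketch for checking candidate partitions:

```python
import numpy as np

def ratiocut(W, labels):
    """RatioCut(A_1..A_k) = sum_i cut(A_i, complement) / |A_i|."""
    total = 0.0
    for a in np.unique(labels):
        in_a = labels == a
        cut = W[np.ix_(in_a, ~in_a)].sum()   # total weight leaving A_i
        total += cut / in_a.sum()
    return total
```

On a graph with two strongly connected pairs joined by weak cross edges, the "natural" partition scores lower than one that splits the pairs, which is why minimizing RatioCut recovers the manifolds.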
RatioCut Approximated by Spectral Clustering
A well-known RatioCut approximation (without size constraints) [e.g., von Luxburg 2006]: define k indicator vectors h_1, ..., h_k with
h_ij = 1/sqrt(|A_j|) if i in A_j, 0 otherwise
Let the matrix H have columns h_1, ..., h_k, so H^T H = I. Then
RatioCut(A_1, ..., A_k) = tr(H^T L H)
min_H tr(H^T L H) subject to H^T H = I
Relax the elements of H to R; by Rayleigh-Ritz, the optimal h_1, ..., h_k are the first k eigenvectors of L. Un-relax H to a hard partition via k-way clustering.
RatioCut Approximated by Spectral Clustering
The spectral clustering algorithm (without size constraints):
1. Unnormalized Laplacian L = D - W
2. Compute the first k eigenvectors v_1, ..., v_k of L
3. Form the matrix V with columns v_1, ..., v_k (and n + m rows)
4. New representation of x_i: the i-th row of V
5. Cluster the x_i into k clusters with k-means; the clusters define A_1, ..., A_k
Next: enforce size constraints in k-means.
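Steps 1-4 can be sketched as a spectral embedding (step 5, k-means on the rows, is covered on the next slide). With `numpy.linalg.eigh` the eigenvalues come back in ascending order, so the first k columns are the bottom eigenvectors:

```python
import numpy as np

def spectral_embedding(W, k):
    """Rows of the first k eigenvectors of L = D - W: the new
    representation of each point (steps 1-4 of the slide)."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                 # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)            # eigh sorts eigenvalues ascending
    return vecs[:, :k]                        # i-th row represents x_i
```

For a graph with two disconnected components, every point in a component maps to the same row, so any reasonable k-means afterwards recovers the components exactly.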
Standard (Unconstrained) k-means
k-means clusters x_1, ..., x_N into k clusters with centers C_1, ..., C_k:
min_{C_1..C_k} sum_{i=1}^N min_{h=1..k} (1/2) ||x_i - C_h||^2
Introduce an indicator matrix T with T_ih = 1 if x_i belongs to cluster h:
min_{C,T} sum_{i=1}^N sum_{h=1}^k T_ih ((1/2) ||x_i - C_h||^2)
s.t. sum_{h=1}^k T_ih = 1, T >= 0
A local optimum is found by starting from arbitrary C_1, ..., C_k and iterating:
1. Update T: assign each point x_i to its closest center C_h
2. Update each center C_h to the mean of the points assigned to it
Note: each step reduces the objective.
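The two alternating updates are Lloyd's iterations. A sketch follows; the farthest-point initialization is an assumption added for determinism (the slides start from arbitrary centers):

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Lloyd's k-means: assign each point to its nearest center (update T),
    then move each center to the mean of its points (update C)."""
    C = [X[0]]                                # farthest-point initialization
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in C], axis=0)
        C.append(X[np.argmax(d)])
    C = np.array(C, dtype=float)
    for _ in range(iters):
        T = np.argmin(np.linalg.norm(X[:, None] - C[None], axis=2), axis=1)  # step 1
        newC = np.array([X[T == h].mean(axis=0) if (T == h).any() else C[h]
                         for h in range(k)])                                 # step 2
        if np.allclose(newC, C):              # both steps reduce the objective
            break
        C = newC
    return T, C
```

This is the routine run on the spectral embedding rows; the next slides replace the unconstrained T-update with a size-constrained one.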
Size-Constrained k-means (Bradley, Bennett & Demiriz, 2000)
Size constraints: cluster h must have at least tau_h points.
min_{C,T} sum_{i=1}^N sum_{h=1}^k T_ih ((1/2) ||x_i - C_h||^2)
s.t. sum_{h=1}^k T_ih = 1, T >= 0, sum_{i=1}^N T_ih >= tau_h for h = 1, ..., k
Solving for T looks like a difficult integer program. Surprise: an efficient integer solution is found via a Minimum Cost Flow linear program.
Minimum Cost Flow
A graph with supply nodes (b_i > 0) and demand nodes (b_i < 0); each directed edge i -> j has unit transportation cost c_ij and traffic variable y_ij >= 0. Meet the demand at minimum transportation cost:
min_y sum_{i->j} c_ij y_ij
s.t. sum_j y_ij - sum_j y_ji = b_i
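The constrained T-step can be sketched as a transportation LP: each point is a supply node shipping one unit, each cluster a demand node requiring at least tau_h units. This uses `scipy.optimize.linprog` as a stand-in for a dedicated min-cost-flow solver; because the constraint matrix is a network (totally unimodular) and the supplies/demands are integers, the LP vertex solution is automatically 0/1, which is the "surprise" on the previous slide.

```python
import numpy as np
from scipy.optimize import linprog

def constrained_assignment(X, C, tau):
    """Assign points to fixed centers C, cluster h receiving >= tau[h]
    points, at minimum total cost (1/2)||x_i - C_h||^2."""
    N, k = len(X), len(C)
    cost = 0.5 * np.linalg.norm(X[:, None] - C[None], axis=2) ** 2  # c_ih
    A_eq = np.zeros((N, N * k))               # variables y_ih, row-major
    for i in range(N):
        A_eq[i, i * k:(i + 1) * k] = 1.0      # each point ships one unit
    A_ub = np.zeros((k, N * k))
    for h in range(k):
        A_ub[h, h::k] = -1.0                  # sum_i y_ih >= tau_h
    res = linprog(cost.ravel(), A_ub=A_ub, b_ub=-np.asarray(tau, float),
                  A_eq=A_eq, b_eq=np.ones(N), bounds=(0, 1), method="highs")
    return np.argmax(res.x.reshape(N, k), axis=1)
```

With three points near one center and one near the other, raising the second cluster's minimum from 1 to 2 forces the cheapest borderline point to switch clusters.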
Minimum Cost Flow with Two Constraints
Each cluster must have at least n/log^2(n) labeled examples and m'/log^2(n) unlabeled examples.
Example Cuts
SSL Example: Two Squares
SSL Example: Dollar Sign