Semi-Supervised Learning by Multi-Manifold Separation

Semi-Supervised Learning by Multi-Manifold Separation
Xiaojin (Jerry) Zhu, Department of Computer Sciences, University of Wisconsin-Madison
Joint work with Andrew Goldberg, Zhiting Xu, Aarti Singh, and Rob Nowak

Supervised Learning (figure)

Semi-Supervised Learning (figure)

Prediction Problems
The feature space X = R^d; the label space Y = {0, 1} or R.
Samples (X, Y) ∈ X × Y drawn from P_{XY}; X is the feature vector, Y the label.
Goal: construct a predictor f: X → Y to minimize the risk
R(f) = E_{(X,Y) \sim P_{XY}}[ loss(Y, f(X)) ]

Learning from Data
The optimal predictor f^* = argmin_f E_{(X,Y) \sim P_{XY}}[ loss(Y, f(X)) ] depends on P_{XY}, which is often unknown. However, we can learn a good predictor from a training set.
Supervised learning: given {(X_i, Y_i)}_{i=1}^n drawn iid from P_{XY}, the learner maps {(X_i, Y_i)}_{i=1}^n to \hat{f}_n.

Semi-Supervised Learning
In many applications in science and engineering, labeled data are scarce, but unlabeled data are abundant and cheap.
Semi-supervised learning (SSL): given {(X_i, Y_i)}_{i=1}^n iid from P_{XY} and {X_j}_{j=1}^m iid from P_X, with m >> n, the learner maps {(X_i, Y_i)}_{i=1}^n, {X_j}_{j=1}^m to \hat{f}_{m,n}.

Example: Handwritten Digits Recognition (figure)

Example: Handwritten Digits Recognition
Many unlabeled data plus a few labeled data: knowledge of the manifold/cluster structure, together with a few labels in each manifold/cluster, is sufficient to design a good predictor.

Common Assumptions in SSL
Cluster assumption: f is constant or smooth on connected high-density regions.
Manifold assumption: the support of P_X lies on low-dimensional manifolds, and f is smooth with respect to geodesic distance on the manifolds.

Mathematical Formalization
Generic learning classes: P_{XY} = { P_X P_{Y|X} : P_X \in \mathcal{P}_X, P_{Y|X} \in \mathcal{P}_{Y|X} }
Linked learning classes: P_{XY} = { P_X P_{Y|X} : P_X \in \mathcal{P}_X, P_{Y|X} \in \mathcal{P}_{Y|X}(P_X) \subseteq \mathcal{P}_{Y|X} }
The link: the marginal P_X constrains P_{Y|X}, so unlabeled data may inform the design of the predictor. SSL can then yield a faster rate of error convergence than supervised learning:
\sup_{P_{XY}} E[R(\hat{f}_{m,n})] < \inf_{f_n} \sup_{P_{XY}} E[R(f_n)]
where \hat{f}_n is a predictor based on n labeled examples and \hat{f}_{m,n} is based on n labeled and m unlabeled examples.

The Value of Unlabeled Data
Castelli and Cover '95 (classification): assume an identifiable mixture
p(x) = p(x | Y=0) P(Y=0) + p(x | Y=1) P(Y=1)
Learn the decision regions from the (many) unlabeled examples; label the decision regions with the (few) labeled examples.
Main result: \sup_{P_{XY}} E[R(\hat{f}_{\infty,n})] - R^* \le C e^{-\alpha n}
What about more general cluster or manifold assumptions?

Do Unlabeled Data Help in General?
No: Lafferty and Wasserman (2007) fix the complexity of P_{XY} and let n grow; given enough labeled data, unlabeled data are superfluous (no faster rates of convergence for SSL).
Yes: Niyogi (2008) lets the complexity of P_{XY} grow with n; for finite n, the complexity of the learning problem can be such that supervised learning fails while SSL achieves small expected error.
Both are correct, capturing two extremes. Finite-sample bounds give a more complete picture.

Decision Regions
Our assumption: P_{XY} is given by the pair (p_X, f), where
supp(p_X) = \bigcup_i C_i, a union of compact sets with \gamma separation;
the marginal density p_X is bounded away from zero and smooth on each C_i;
the regression function f(x) is Hölder-\alpha smooth on each support set.

Adaptive Supervised Learning
If \gamma \ge \gamma_0 > 0 and n \to \infty, SL will eventually discover the decision regions, and the excess risk of SL (squared loss) is minimax, so SSL has no advantage:
\sup_{P_{XY}} E[R(\hat{f}_n)] - R^* \le C n^{-2\alpha/(2\alpha+d)}
But if n is fixed and \gamma \to 0, SL eventually mixes up the decision regions and fails:
\inf_{f_n} \sup_{P_{XY}} E[R(f_n)] - R^* \ge c n^{-1/d}
Unlabeled data also identify the decision regions, and only mix them up later (at a smaller \gamma), because m >> n.

Unlabeled Data: Now It Helps, Now It Doesn't (Singh, Nowak & Zhu, NIPS 2008)

Margin range                     SSL upper bound     SL lower bound    SSL helps?
γ ≥ n^{-1/d}                     n^{-2α/(2α+d)}      n^{-2α/(2α+d)}    no
m^{-1/d} ≤ γ < n^{-1/d}          n^{-2α/(2α+d)}      n^{-1/d}          yes
-m^{-1/d} ≤ γ < m^{-1/d}         n^{-1/d}            n^{-1/d}          no
γ < -m^{-1/d}                    n^{-2α/(2α+d)}      n^{-1/d}          yes

(Negative γ corresponds to overlapping decision regions.)

An SSL Algorithm
Given n labeled examples and m unlabeled examples:
1. Use the unlabeled data to infer about log(n) decision regions \hat{C}_i: plug in your favorite manifold clustering algorithm that detects abrupt changes in support, density, dimensionality, etc., and carve the ambient space up into the \hat{C}_i (Voronoi). Each \hat{C}_i has to be big enough: at least n/log^2(n) labeled and m/log^2(n) unlabeled examples.
2. Use the labeled data in \hat{C}_i to train a supervised learner \hat{f}_i.
3. If a test point x^* \in \hat{C}_i, predict \hat{f}_i(x^*).
Similar to cluster & label; a sketch follows below.
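
As a concrete illustration, a minimal cluster-then-label sketch in Python, assuming scikit-learn is available. SpectralClustering and 1-nearest-neighbor classifiers are stand-ins (the slides do not prescribe a particular manifold clusterer or supervised learner), and the size requirements on each region are not enforced here.

```python
import numpy as np
from sklearn.cluster import SpectralClustering        # stand-in manifold clusterer (assumption)
from sklearn.neighbors import KNeighborsClassifier    # stand-in supervised learner (assumption)

def cluster_then_label(X_lab, y_lab, X_unlab, X_test, n_regions):
    """Sketch of the three steps above; illustrative component choices."""
    X_all = np.vstack([X_lab, X_unlab])
    # Step 1: infer decision regions from labeled + unlabeled points,
    # then carve up the ambient space by nearest clustered point (Voronoi).
    regions = SpectralClustering(n_clusters=n_regions,
                                 affinity="nearest_neighbors").fit_predict(X_all)
    voronoi = KNeighborsClassifier(n_neighbors=1).fit(X_all, regions)
    lab_regions = regions[:len(X_lab)]
    test_regions = voronoi.predict(X_test)
    # Steps 2-3: train a supervised learner per region and use it for test points there.
    y_pred = np.empty(len(X_test), dtype=y_lab.dtype)
    for r in range(n_regions):
        test_idx = test_regions == r
        if not test_idx.any():
            continue
        lab_idx = lab_regions == r
        if lab_idx.any():
            clf = KNeighborsClassifier(n_neighbors=1).fit(X_lab[lab_idx], y_lab[lab_idx])
        else:   # a region with no labels: fall back to all labeled data
            clf = KNeighborsClassifier(n_neighbors=1).fit(X_lab, y_lab)
        y_pred[test_idx] = clf.predict(X_test[test_idx])
    return y_pred
```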

Building Blocks: Local Covariance Matrix
For each x \in \{x_i\}_{i=1}^{n+m}, find its [\log(n+m)] nearest neighbors (in Euclidean distance). The local covariance matrix of the neighbors is
\Sigma_x = \frac{1}{[\log(n+m)] - 1} \sum_j (x_j - \mu_x)(x_j - \mu_x)^\top
\Sigma_x captures the local geometry.
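
A NumPy/SciPy sketch of this building block; the floor of d + 1 neighbors (to keep \Sigma_x from being rank-deficient) is an added safeguard, not something the slide specifies.

```python
import numpy as np
from scipy.spatial import cKDTree

def local_covariances(X):
    """Local covariance Sigma_x over each point's [log(n+m)] nearest neighbors (sketch)."""
    N, d = X.shape
    k = min(max(int(np.round(np.log(N))), d + 1), N)   # [log(n+m)] neighbors, with safeguards
    tree = cKDTree(X)
    _, idx = tree.query(X, k=k)                        # neighbor indices (each point includes itself)
    sigmas = np.empty((N, d, d))
    for i in range(N):
        nbrs = X[idx[i]]
        mu = nbrs.mean(axis=0)
        centered = nbrs - mu
        sigmas[i] = centered.T @ centered / (k - 1)
    return sigmas
```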

Building Blocks: Local Covariance Matrix (figure)

A Distance Between \Sigma_1 and \Sigma_2
Hellinger distance: H^2(p, q) = \frac{1}{2} \int (\sqrt{p(x)} - \sqrt{q(x)})^2 dx; H(p, q) is symmetric and lies in [0, 1].
Let p = N(0, \Sigma_1) and q = N(0, \Sigma_2). We define
H(\Sigma_1, \Sigma_2) = \sqrt{ 1 - \frac{ |\Sigma_1|^{1/4} |\Sigma_2|^{1/4} }{ |(\Sigma_1 + \Sigma_2)/2|^{1/2} } }
(computed in a common subspace)
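
A small sketch of this covariance distance; the \epsilon I term mirrors the smoothed version \Sigma + \epsilon I mentioned on the next slide (the value of \epsilon is an assumption), and the common-subspace projection is omitted.

```python
import numpy as np

def hellinger(sigma1, sigma2, eps=1e-6):
    """Hellinger distance between N(0, sigma1) and N(0, sigma2), per the formula above."""
    d = sigma1.shape[0]
    s1 = sigma1 + eps * np.eye(d)          # smoothed covariances (epsilon value is an assumption)
    s2 = sigma2 + eps * np.eye(d)
    # Bhattacharyya coefficient of the two zero-mean Gaussians
    bc = (np.linalg.det(s1) ** 0.25 * np.linalg.det(s2) ** 0.25
          / np.sqrt(np.linalg.det(0.5 * (s1 + s2))))
    return np.sqrt(max(1.0 - bc, 0.0))     # clip tiny negatives from round-off
```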

Hellinger Distance Comment
Example values of H(\Sigma_1, \Sigma_2) from the figure: similar neighborhoods ≈ 0.02; different density ≈ 0.28; different dimension ≈ 1; different orientation ≈ 1.
(*) smoothed version: \Sigma + \epsilon I

A Sparse Subset
Two close points will have similar neighborhoods, hence a small H even if they lie on different manifolds. So compute a sparse subset of m' = m/log(m) points (the red dots in the figure):
start from an arbitrary x_0;
remove its log(m) nearest neighbors;
let x_1 be the next nearest neighbor; repeat.
Include all labeled data. Random sampling might work too. (A sketch follows below.)
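
A sketch of the greedy subsampling, assuming SciPy. For simplicity the surviving points are visited in index order rather than strictly by next-nearest-neighbor, and the labeled points are appended at the end as the slide requires.

```python
import numpy as np
from scipy.spatial import cKDTree

def sparse_subset(X_unlab, X_lab):
    """Keep a point, drop its log(m) nearest neighbors, repeat; then add all labeled data."""
    m = len(X_unlab)
    k = min(max(int(np.round(np.log(m))), 1), m - 1)   # log(m) neighbors dropped per kept point
    tree = cKDTree(X_unlab)
    alive = np.ones(m, dtype=bool)
    keep = []
    for i in range(m):                                 # simplified visiting order (assumption)
        if not alive[i]:
            continue
        keep.append(i)
        _, nbrs = tree.query(X_unlab[i], k=k + 1)      # +1: the point is its own nearest neighbor
        alive[np.atleast_1d(nbrs)] = False
    return np.vstack([X_lab, X_unlab[keep]])
```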

A Sparse Graph on the Sparse Subset
Build a sparse nearest-neighbor graph on the sparse subset, using the Mahalanobis distance to trace the manifold:
d^2(x, y) = (x - y)^\top \Sigma_x^{-1} (x - y)
Put a Gaussian edge weight on the sparse edges, which combines locality and shape:
w_{ij} = \exp\left( - \frac{ H^2(\Sigma_{x_i}, \Sigma_{x_j}) }{ 2\sigma^2 } \right)
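
A sketch combining the two formulas; the number of neighbors and the bandwidth \sigma are illustrative values, sigmas[i] is the local covariance at S[i] from the earlier sketch, and the small Hellinger helper is repeated inline so the block runs on its own.

```python
import numpy as np

def sparse_hellinger_graph(S, sigmas, n_neighbors=5, bandwidth=0.1, eps=1e-6):
    """kNN graph on the sparse subset S: Mahalanobis neighbors, Hellinger-Gaussian weights."""
    N, d = S.shape
    eye = eps * np.eye(d)
    # Mahalanobis distances d^2(x_i, y) = (x_i - y)^T Sigma_i^{-1} (x_i - y) from each point
    D2 = np.zeros((N, N))
    for i in range(N):
        inv = np.linalg.inv(sigmas[i] + eye)
        diff = S - S[i]
        D2[i] = np.einsum('nd,de,ne->n', diff, inv, diff)

    def hell(s1, s2):   # Hellinger distance between N(0, s1) and N(0, s2), as defined above
        bc = (np.linalg.det(s1) ** 0.25 * np.linalg.det(s2) ** 0.25
              / np.sqrt(np.linalg.det(0.5 * (s1 + s2))))
        return np.sqrt(max(1.0 - bc, 0.0))

    W = np.zeros((N, N))
    for i in range(N):
        for j in np.argsort(D2[i])[1:n_neighbors + 1]:          # skip the point itself
            h = hell(sigmas[i] + eye, sigmas[j] + eye)
            w = np.exp(-h ** 2 / (2 * bandwidth ** 2))
            W[i, j] = W[j, i] = max(W[i, j], w)                  # keep the graph symmetric
    return W
```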

A Sparse Graph on the Sparse Subset (figure)

A Sparse Graph on the Sparse Subset (figure: red = large w, yellow = small w)

Multi-Manifold Separation as Graph Cut
Cut the graph W = [w_{ij}] into k = [\log(n)] parts. Each part must have at least n/\log^2(n) labeled examples and m'/\log^2(n) unlabeled examples.
Formally: RatioCut with size constraints. Let the k parts be A_1, ..., A_k:
cut(A_i, \bar{A}_i) = \sum_{s \in A_i, t \in \bar{A}_i} w_{st}
cut(A_1, ..., A_k) = \sum_{i=1}^k cut(A_i, \bar{A}_i)
Minimizing the cut directly tends to produce very unbalanced parts, hence
RatioCut(A_1, ..., A_k) = \sum_{i=1}^k \frac{cut(A_i, \bar{A}_i)}{|A_i|}
However, this balancing heuristic may not satisfy our size constraints.
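
For concreteness, a tiny sketch that evaluates both objectives for a given partition (labels is an integer vector assigning each vertex of W to a part):

```python
import numpy as np

def cut_and_ratiocut(W, labels):
    """Evaluate cut(A_1,...,A_k) and RatioCut(A_1,...,A_k) for a hard partition (sketch)."""
    cut_total, ratiocut_total = 0.0, 0.0
    for a in np.unique(labels):
        in_a = labels == a
        cut_a = W[in_a][:, ~in_a].sum()        # total weight of edges leaving part A_a
        cut_total += cut_a
        ratiocut_total += cut_a / in_a.sum()   # dividing by |A_a| penalizes tiny parts
    return cut_total, ratiocut_total
```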

RatioCut Approximated by Spectral Clustering
There is a well-known RatioCut approximation (without size constraints), e.g. von Luxburg (2006). Define k indicator vectors h_1, ..., h_k with
h_{ij} = 1/\sqrt{|A_j|} if i \in A_j, and 0 otherwise.
Let H be the matrix with columns h_1, ..., h_k, so that H^\top H = I. Then
RatioCut(A_1, ..., A_k) = tr(H^\top L H)
and we solve \min_H tr(H^\top L H) subject to H^\top H = I.
Relaxing the entries of H to R, by Rayleigh-Ritz the optimal h_1, ..., h_k are the first k eigenvectors of L. Un-relaxing H back to a hard partition gives a k-way clustering.

RatioCut Approximated by Spectral Clustering
The spectral clustering algorithm (without size constraints):
1. Form the unnormalized Laplacian L = D - W.
2. Compute the first k eigenvectors v_1, ..., v_k of L.
3. Form the matrix V with v_1, ..., v_k as columns (one row per graph vertex).
4. The new representation of x_i is the i-th row of V.
5. Cluster the x_i into k clusters with k-means; the clusters define A_1, ..., A_k.
Next: enforce the size constraints in k-means.
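
Steps 1-4 as a short NumPy/SciPy sketch; step 5 is the (size-constrained) k-means discussed next.

```python
import numpy as np
from scipy.linalg import eigh

def spectral_embedding(W, k):
    """Unnormalized Laplacian L = D - W and its first k eigenvectors (steps 1-4)."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    _, eigvecs = eigh(L)          # eigenvalues returned in ascending order
    V = eigvecs[:, :k]            # columns v_1, ..., v_k; row i is the new representation of x_i
    return V
```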

Standard (Unconstrained) k-means
k-means clusters x_1, ..., x_N into k clusters with centers C_1, ..., C_k:
\min_{C_1, ..., C_k} \sum_{i=1}^N \min_{h=1,...,k} \frac{1}{2} \| x_i - C_h \|^2
Introduce an indicator matrix T with T_{ih} = 1 if x_i belongs to cluster h:
\min_{C,T} \sum_{i=1}^N \sum_{h=1}^k T_{ih} \left( \frac{1}{2} \| x_i - C_h \|^2 \right)
s.t. \sum_{h=1}^k T_{ih} = 1, T \ge 0
A local optimum is found by starting from arbitrary C_1, ..., C_k and iterating:
1. Update T: assign each point x_i to its closest center C_h.
2. Update the centers C_1, ..., C_k to the means of the points assigned to them.
Note: each step reduces the objective.
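
A plain NumPy sketch of this alternation; the random initialization and the empty-cluster fallback are illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Unconstrained k-means by alternating the T-update and the center update (sketch)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]       # arbitrary initial centers
    for _ in range(n_iter):
        # Step 1: assign each point to its closest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Step 2: move each center to the mean of its points (keep the old center if empty).
        new_centers = np.array([X[assign == h].mean(axis=0) if (assign == h).any() else centers[h]
                                for h in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers
```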

Size-Constrained k-means (Bradley, Bennett & Demiriz, 2000)
Size constraints: cluster h must contain at least \tau_h points:
\min_{C,T} \sum_{i=1}^N \sum_{h=1}^k T_{ih} \left( \frac{1}{2} \| x_i - C_h \|^2 \right)
s.t. \sum_{h=1}^k T_{ih} = 1, T \ge 0, \sum_{i=1}^N T_{ih} \ge \tau_h for h = 1, ..., k.
Solving for T looks like a difficult integer problem. Surprise: an efficient integer solution is found by a Minimum Cost Flow linear program.

Minimum Cost Flow
A graph with supply nodes (b_i > 0) and demand nodes (b_i < 0); a directed edge i -> j has unit transportation cost c_{ij} and traffic variable y_{ij} \ge 0. Meet the demand at minimum transportation cost:
\min_y \sum_{i \to j} c_{ij} y_{ij}
s.t. \sum_j y_{ij} - \sum_j y_{ji} = b_i
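
A sketch of the size-constrained T-update cast as a minimum cost flow, here solved with networkx (an illustrative choice). Note that networkx's sign convention for node demands is the opposite of the b_i above, and the 1e4 scaling that turns squared distances into integer costs is an assumption.

```python
import numpy as np
import networkx as nx

def constrained_assignment(X, centers, tau):
    """Assign points to centers so cluster h gets >= tau[h] points (assumes sum(tau) <= N)."""
    N, k = len(X), len(centers)
    G = nx.DiGraph()
    for i in range(N):
        G.add_node(("pt", i), demand=-1)              # each point supplies one unit of flow
    for h in range(k):
        G.add_node(("cl", h), demand=int(tau[h]))     # cluster h must retain at least tau_h units
    G.add_node("slack", demand=N - int(sum(tau)))     # the remaining units drain into a slack sink
    for i in range(N):
        for h in range(k):
            cost = int(round(1e4 * 0.5 * np.sum((X[i] - centers[h]) ** 2)))   # integer edge cost
            G.add_edge(("pt", i), ("cl", h), weight=cost, capacity=1)
    for h in range(k):
        G.add_edge(("cl", h), "slack", weight=0, capacity=N)   # excess beyond tau_h flows onward
    flow = nx.min_cost_flow(G)                         # integral optimum of the flow LP
    # Read off T: point i is assigned to the unique cluster receiving its unit of flow.
    return np.array([next(h for h in range(k) if flow[("pt", i)][("cl", h)] == 1)
                     for i in range(N)])
```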

Minimum Cost Flow with Two Constraints
Each cluster must contain at least n/\log^2(n) labeled examples and m'/\log^2(n) unlabeled examples.

Example Cuts (figure)

SSL Example: Two Squares (figure)

SSL Example: Two Squares (figure)

SSL Example: Dollar Sign (figure)