Semi-Supervised Learning with Graphs. Xiaojin (Jerry) Zhu School of Computer Science Carnegie Mellon University


Semi-supervised Learning
- Classifiers need labeled data to train; labeled data are scarce, unlabeled data are abundant.
- Traditional classifiers cannot use unlabeled data.
- My interest (semi-supervised learning): develop classification methods that can use both labeled and unlabeled data.

Motivating examples
- Speech recognition (sound -> sentence). Labeled data: transcription, 10 to 400 times real-time. Unlabeled data: sounds alone, easy to get (radio, call centers).
- Parsing ("I saw a falcon with a telescope." -> parse tree). Labeled data: treebanks, roughly 40,000 English sentences / 5 years, 4,000 Chinese sentences / 2 years. Unlabeled data: sentences without annotation, everywhere.
- Personalized news (article -> interested?): limited by user patience.
- Video surveillance (image -> identity): limited by the availability of named images.
Can unlabeled data be useful?

The message
Unlabeled data can improve classification.

Why unlabeled data might help
Example: classify astronomy vs. travel articles. Articles are represented by content-word occurrence vectors; article similarity is measured by content-word overlap.
[Figure: documents d1-d4 as word-occurrence vectors over words such as asteroid, bright, comet, year, zodiac (astronomy) and airport, bike, camp, yellowstone, zion (travel).]

Why labeled data alone might fail
[Figure: the labeled articles and the test articles d1-d4 share no content words - no overlap!]
This tends to happen when labeled data is scarce.

Unlabeled data are stepping stones
[Figure: labeled articles d1, d2 and unlabeled articles d3-d9 over the same astronomy and travel vocabulary.]
- Observe direct similarity from features: d1 ~ d5, d5 ~ d6, etc.
- Assume similar features imply the same label.
- Labels propagate via the unlabeled articles: indirect similarity.

Unlabeled data are stepping stones
Arrange the l labeled and u unlabeled (= test) points in a graph.
- Nodes: the n = l + u points.
- Edges: the direct similarity W_ij, e.g. the number of overlapping words (in general, a decreasing function of the distance ||x_i - x_j||).
We want to infer indirect similarity, taking all paths into account.
[Figure: example graph over documents d1-d4.]
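As an illustration (not part of the original slides), here is a minimal sketch of one common graph construction: a symmetric kNN graph with Gaussian weights. The choice k = 10, the bandwidth sigma, and the function name are assumptions for the example.

```python
import numpy as np

def knn_graph(X, k=10, sigma=1.0):
    """Symmetric kNN graph with weights W_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                   # no self-edges
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]               # k nearest neighbours of node i
        W[i, nbrs] = np.exp(-d2[i, nbrs] / sigma**2)
    return np.maximum(W, W.T)                      # symmetrize
```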

One way to use labeled and unlabeled data (Zhu and Ghahramani, 2002)
- Input: the n x n graph weights W (important!) and labels Y_l in {0,1}^l.
- Create the transition matrix P_ij = W_ij / sum_k W_ik.
- Repeat until f converges:
  - clamp the labeled data: f_l = Y_l
  - propagate: f <- P f
f converges to a unique solution, the harmonic function, with 0 <= f <= 1 (soft labels).
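A minimal NumPy sketch of this label-propagation loop, assuming the first l points are the labeled ones and labels are coded 0/1; the iteration cap and tolerance are illustrative choices.

```python
import numpy as np

def label_propagation(W, Y_l, n_iter=1000, tol=1e-6):
    """W: (n, n) graph weights, first l nodes labeled; Y_l: (l,) labels in {0, 1}."""
    n, l = W.shape[0], Y_l.shape[0]
    P = W / W.sum(axis=1, keepdims=True)   # P_ij = W_ij / sum_k W_ik
    f = np.zeros(n)
    f[:l] = Y_l                            # clamp labeled data
    for _ in range(n_iter):
        f_new = P @ f                      # propagate
        f_new[:l] = Y_l                    # clamp labeled data again
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new
    return f                               # soft labels in [0, 1]
```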

An electric network interpretation (Zhu, Ghahramani and Lafferty, ICML 2003)
- The harmonic function f is the voltage at the nodes.
- Edges are resistors with resistance R_ij = 1/W_ij.
- A 1-volt battery connects to the labeled nodes (+1 volt and 0 volts).
- Indirect similarity: nodes have similar voltage if many paths exist between them.
[Figure: resistor network with R_ij = 1/w_ij, a +1 volt source and ground attached to the labeled nodes.]

A random walk interpretation of harmonic functions
- The harmonic function f_i = P(hit a node labeled 1 | start from i).
- Random walk from node i to j with probability P_ij; stop if we hit a labeled node.
- Indirect similarity: random walks from nearby nodes have similar destinations.
[Figure: a walk from node i absorbed either at a node labeled 1 or at a node labeled 0.]
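A purely illustrative Monte Carlo check of this interpretation (not from the talk): estimate f_i by simulating absorbing random walks. It is far slower than the closed form and assumes every start node can reach a labeled node.

```python
import numpy as np

def hit_probability(W, labels, i, n_walks=2000, seed=0):
    """labels: dict node -> 0/1 for labeled nodes; returns an estimate of f_i."""
    rng = np.random.default_rng(seed)
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    hits = 0
    for _ in range(n_walks):
        node = i
        while node not in labels:          # walk until absorbed at a labeled node
            node = rng.choice(len(P), p=P[node])
        hits += labels[node]
    return hits / n_walks                  # fraction of walks absorbed at label 1
```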

Closed form solution for the harmonic function
- Define the diagonal degree matrix D, D_ii = sum_j W_ij.
- Define the graph Laplacian matrix Δ = D - W.
- Then f_u = -Δ_uu^{-1} Δ_ul Y_l.
Δ is the graph version of the continuous Laplacian operator ∇² = ∂²/∂x² + ∂²/∂y² + ∂²/∂z²; f is harmonic: Δf = 0 with Dirichlet boundary conditions on the labeled data.
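A minimal sketch of the closed-form computation, under the same convention as above that the first l points are labeled; the helper name is my own.

```python
import numpy as np

def harmonic_function(W, Y_l):
    """Closed-form harmonic solution f_u = -L_uu^{-1} L_ul Y_l, L = D - W."""
    n, l = W.shape[0], Y_l.shape[0]
    D = np.diag(W.sum(axis=1))
    L = D - W                              # graph Laplacian
    L_uu, L_ul = L[l:, l:], L[l:, :l]
    f_u = np.linalg.solve(L_uu, -L_ul @ Y_l)
    return np.concatenate([Y_l, f_u])      # f_l = Y_l on the labeled nodes
```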

Properties of the harmonic function
- Currents: in-flow = out-flow at any node (Kirchhoff's law).
- Minimum energy: f minimizes E(f) = sum_{i~j} W_ij (f_i - f_j)^2 = f^T Δ f.
- Average of neighbors: f_u(i) = sum_{j~i} W_ij f(j) / sum_{j~i} W_ij.
- f uniquely exists, and 0 <= f <= 1.

Text categorization with harmonic functions
[Plot: accuracy on baseball vs. hockey, PC vs. Mac, and religion vs. atheism.]
50 labeled articles, about 2000 unlabeled articles, 10NN graph.

Digit recognition with harmonic functions
[Figure: handwritten digit images. Two images can be dissimilar in pixel-wise Euclidean distance yet indirectly similar through stepping-stone images.]

Digit recognition with harmonic functions
[Plot: accuracy on 1 vs. 2 and on all 10 digits.]
50 labeled images, about 4000 unlabeled images, 10NN graph.

Practical concerns about harmonic functions
Does it scale? The closed form involves a matrix inversion, f_u = -Δ_uu^{-1} Δ_ul Y_l, which is O(u^3) - e.g. millions of crawled web pages.
- Solution 1: use iterative methods - the label propagation algorithm (slow), loopy belief propagation, conjugate gradient (see the sketch below).
- Solution 2: reduce the problem size - use a random small unlabeled subset (Delalleau et al., 2005), or harmonic mixtures.
Can it handle new points (induction)?
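A sketch of solution 1 with sparse matrices and conjugate gradient (SciPy assumed): it solves Δ_uu f_u = -Δ_ul Y_l without forming a dense inverse.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def harmonic_cg(W, Y_l):
    """W: sparse (n, n) weight matrix, first l nodes labeled; Y_l: (l,) labels."""
    d = np.asarray(W.sum(axis=1)).ravel()
    L = (sp.diags(d) - W).tocsr()          # sparse graph Laplacian
    l = Y_l.shape[0]
    L_uu, L_ul = L[l:, l:], L[l:, :l]
    f_u, info = cg(L_uu, -L_ul @ Y_l)      # L_uu is symmetric positive definite
    return f_u
```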

Harmonic mixtures (Zhu and Lafferty, 2005)
Fit the unlabeled data with a mixture model, e.g. Gaussian mixtures for images or multinomial mixtures for documents, trained with EM or other methods.
[Figure: a Gaussian mixture fit with M mixture components, here M = 30, much smaller than u ≈ 1000.]
Learn soft labels for the mixture components, not the unlabeled points.

Harmonic mixtures - learn labels for mixture components
- Assume mixture component labels λ_1, ..., λ_M; labels on the unlabeled points are determined by the mixture model.
- The mixture model defines responsibilities R: R_im = p(m | x_i), and f(i) = sum_{m=1}^M R_im λ_m.
- Learn λ such that f is closest to harmonic: minimize the energy E(f) = f^T Δ f.
- Convex optimization with a closed form solution: λ = -(R^T Δ_uu R)^{-1} R^T Δ_ul Y_l.
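A sketch of this closed form, assuming the responsibilities R come from an already-fitted mixture model; the sign follows the convention f_u = -Δ_uu^{-1} Δ_ul Y_l used earlier.

```python
import numpy as np

def harmonic_mixture_labels(L, R, Y_l):
    """L: (n, n) graph Laplacian, first l nodes labeled; R: (u, M) responsibilities."""
    l = Y_l.shape[0]
    L_uu, L_ul = L[l:, l:], L[l:, :l]
    A = R.T @ L_uu @ R                       # M x M system, cheap to solve
    lam = np.linalg.solve(A, -R.T @ (L_ul @ Y_l))
    f_u = R @ lam                            # soft labels on the unlabeled points
    return lam, f_u
```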

Harmonic mixtures
[Figure: the same Gaussian mixture; the learned mixture component labels λ follow the graph.]

Harmonic mixtures - computational savings
Computation on the unlabeled data:
- harmonic mixtures: f_u = -R (R^T Δ_uu R)^{-1} R^T Δ_ul Y_l, inverting an M x M matrix
- original harmonic function: f_u = -Δ_uu^{-1} Δ_ul Y_l, inverting a u x u matrix
Harmonic mixtures cost O(M^3), much cheaper than O(u^3), so they can handle large problems. They also give induction: f(x) = sum_{m=1}^M R_xm λ_m for a new point x.

From harmonic functions to kernels
Are harmonic functions too specialized? I will show you the kernel behind the harmonic function. Kernels are a general, important concept in machine learning, used in many learning algorithms, e.g. support vector machines; on a graph, a kernel is a symmetric, positive semi-definite n x n matrix. I will then give you an even better kernel - but first a short detour...

The probabilistic model behind harmonic functions
- Random field: p(f) ∝ exp(-E(f)), with energy E(f) = sum_{i~j} W_ij (f_i - f_j)^2 = f^T Δ f.
- Low energy = good label propagation.
[Figure: three example labelings of a small graph with E(f) = 4, E(f) = 2, E(f) = 1.]
- If f is restricted to {0,1}, this is a discrete, standard Markov random field (Boltzmann machine), and inference is hard.

The probabilistic model behind harmonic functions
Gaussian random fields (Zhu, Ghahramani and Lafferty, ICML 2003):
- Continuous relaxation f ∈ R gives a Gaussian random field p(f) ∝ exp(-E(f)) = exp(-f^T Δ f).
- p(f) is an n-dimensional Gaussian with inverse covariance matrix Δ.
- Harmonic functions are the mean of the Gaussian random field (conditioned on the labeled data).
- Gaussian random fields = Gaussian processes on finite data; the covariance matrix = the kernel matrix in Gaussian processes.

The kernel behind harmonic functions
- K = Δ^{-1}; K_ij is the indirect similarity between i and j.
- The direct similarity W_ij may be small, but K_ij will be large if there are many paths between i and j.
- K can be used with many kernel machines: K + support vector machine = semi-supervised SVM, a kernel built on both labeled and unlabeled data. Additional benefit: it handles noisy labeled data.
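A sketch of the "K + support vector machine" combination using scikit-learn's precomputed-kernel interface; since the Laplacian is singular, the pseudo-inverse is used here as a stand-in for Δ^{-1}.

```python
import numpy as np
from sklearn.svm import SVC

def semi_supervised_svm(W, Y_l):
    """Kernel built on all n = l + u points; the SVM is trained on the l labeled ones."""
    l = Y_l.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    K = np.linalg.pinv(L)                  # indirect similarity over all paths
    clf = SVC(kernel='precomputed')
    clf.fit(K[:l, :l], Y_l)                # labeled block K_ll
    return clf.predict(K[l:, :l])          # predictions on the unlabeled points
```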

Kernels should encourage smooth eigenvectors
- Graph spectrum: Δ = sum_{k=1}^n λ_k φ_k φ_k^T.
- Small eigenvalue <-> smooth eigenvector: sum_{i~j} W_ij (φ_k(i) - φ_k(j))^2 = λ_k.
- Kernels encourage smooth eigenvectors by giving them large weights: Laplacian Δ = sum_k λ_k φ_k φ_k^T, harmonic kernel K = Δ^{-1} = sum_k (1/λ_k) φ_k φ_k^T.
- Smooth functions are good for semi-supervised learning: ||f||_K^2 = f^T K^{-1} f = f^T Δ f = sum_{i~j} W_ij (f_i - f_j)^2.

General semi-supervised kernels
Δ^{-1} is not the only semi-supervised kernel, and may not be the best. General principle for creating semi-supervised kernels: K = sum_i r(λ_i) φ_i φ_i^T, where r(λ) should be large when λ is small, to encourage smooth eigenvectors. Specific choices of r() lead to known kernels:
- harmonic function kernel: r(λ) = 1/λ
- diffusion kernel: r(λ) = exp(-σ² λ)
- random walk kernel: r(λ) = (α - λ)^p
Is there a best r()? Yes, as measured by kernel alignment.
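A sketch of these spectral transforms built from an eigendecomposition of the Laplacian; the default σ, α and p values are arbitrary illustrations, and the zero eigenvalue is dropped in the harmonic-style transform.

```python
import numpy as np

def spectral_kernel(L, r):
    """K = sum_i r(lambda_i) phi_i phi_i^T for a graph Laplacian L and transform r."""
    lam, Phi = np.linalg.eigh(L)            # eigenvalues / eigenvectors of L
    return (Phi * r(lam)) @ Phi.T           # scale each eigenvector by r(lambda_i)

# harmonic-function, diffusion, and p-step random-walk transforms
K_harmonic  = lambda L: spectral_kernel(L, lambda lam: np.where(lam > 1e-10, 1.0 / lam, 0.0))
K_diffusion = lambda L, sigma=1.0: spectral_kernel(L, lambda lam: np.exp(-sigma**2 * lam))
K_randwalk  = lambda L, alpha=2.0, p=2: spectral_kernel(L, lambda lam: (alpha - lam)**p)
```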

Alignment measures kernel quality
- Measures a kernel by its alignment to the labeled data Y_l: align(K, Y_l) = <K_ll, Y_l Y_l^T> / (||K_ll|| ||Y_l Y_l^T||), an extension of the cosine angle between vectors.
- High alignment is related to good generalization performance.
- Leads to a convex optimization problem.
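A sketch of the alignment computation, assuming labels coded as +1/-1 so that Y_l Y_l^T is the usual target matrix.

```python
import numpy as np

def alignment(K_ll, y_l):
    """Empirical alignment <K_ll, T> / (||K_ll||_F ||T||_F) with T = y_l y_l^T."""
    T = np.outer(y_l, y_l)                 # target matrix Y_l Y_l^T
    return np.sum(K_ll * T) / (np.linalg.norm(K_ll) * np.linalg.norm(T))
```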

Finding the best kernel (Zhu, Kandola, Ghahramani and Lafferty, NIPS 2004)
The order constrained semi-supervised kernel:
  max_r  align(K, Y_l)
  subject to  K = sum_i r_i φ_i φ_i^T,  r_1 >= ... >= r_n >= 0
The order constraints r_1 >= ... >= r_n encourage smoothness; the problem is a convex optimization, and r is nonparametric.

The order constrained kernel improves alignment and accuracy
Text categorization (religion vs. atheism), 50 labeled and 2000 unlabeled articles:

  kernel                    order   harmonic   RBF
  alignment                 0.31    0.17       0.04
  accuracy (with SVMs)      84.5    80.4       69.3

We now have good kernels for semi-supervised learning.

Other research (1) Graph hyperparameter learning
What if we don't know W_ij? Set up hyperparameters W_ij = exp(-sum_d (x_id - x_jd)^2 / α_d) and learn α with e.g. Bayesian evidence maximization.
[Figure: the average "7" image, the average "9" image, and the learned α.]
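A sketch of these hyperparametrized weights (the evidence-maximization step that learns α is not shown; the function name is my own).

```python
import numpy as np

def weighted_graph(X, alpha):
    """X: (n, d) features; alpha: (d,) per-dimension bandwidths alpha_d."""
    diff = X[:, None, :] - X[None, :, :]              # pairwise feature differences
    W = np.exp(-np.sum(diff**2 / alpha, axis=-1))     # one length-scale per dimension
    np.fill_diagonal(W, 0.0)                          # no self-edges
    return W
```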

Other research (2) Sequences and other structured data (Lafferty, Zhu and Liu, 2004)
What if x_1 ... x_n form sequences? Speech, natural language processing, biosequence analysis, etc.
[Figure: chain-structured graphical models of y given x.]
Conditional random fields (CRF) -> kernel CRF -> kernel CRF + semi-supervised kernels.

Other research (3) Active learning (Zhu, Lafferty and Ghahramani, 2003b)
What if the computer can ask for labels? Smart queries are not necessarily the most ambiguous points.
[Figure: a graph where the most ambiguous point a (f = 0.5) sits between the labeled 1 and 0, while a separate group B has f = 0.4.]
Active learning + semi-supervised learning, with a fast algorithm.

Related work in semi-supervised learning, based on different assumptions:

  method              assumption                     references
  graph               similar feature, same label    this talk; mincuts (Blum and Chawla, 2001); normalized Laplacian (Zhou et al., 2003); regularization (Belkin et al., 2004)
  mixture model, EM   generative model               (Nigam et al., 2000)
  transductive SVM    low density separation         (Joachims, 1999)
  co-training         feature split                  (Blum and Mitchell, 1998)

Semi-supervised learning has so far received relatively little attention in the statistics literature.

Some key contributions
- harmonic function formulations for semi-supervised learning
- solving large scale problems with harmonic mixtures
- semi-supervised kernels by spectral transformation of the graph Laplacian
- kernelizing conditional random fields
- combining active learning and semi-supervised learning

Summary
Unlabeled data can improve classification. The methods have reached the stage where we can apply them to real-world tasks.

Future Plans
- Continue the research on semi-supervised learning: structured data, ranking, clustering; explore different assumptions.
- Application to human language tasks: speech recognition, document categorization, information retrieval, sentiment analysis.
- Explore novel machine learning approaches: text mixed with other modalities, e.g. images; speech and multimodal user interfaces; graphics; vision.
- Collaboration.

Thank You

References
M. Belkin, I. Matveeva and P. Niyogi. Regularization and Semi-supervised Learning on Large Graphs. COLT 2004.
A. Blum and S. Chawla. Learning from Labeled and Unlabeled Data using Graph Mincuts. ICML 2001.
A. Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-training. COLT 1998.
O. Delalleau, Y. Bengio and N. Le Roux. Efficient Non-Parametric Function Induction in Semi-Supervised Learning. AISTATS 2005.
R. Hwa. A Continuum of Bootstrapping Methods for Parsing Natural Languages. 2003.
T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. ICML 1999.
K. Nigam, A. McCallum, S. Thrun and T. Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 2000.
D. Zhou, O. Bousquet, T. N. Lal, J. Weston and B. Schölkopf. Learning with Local and Global Consistency. NIPS 2003.
X. Zhu and J. Lafferty. Harmonic Mixtures. 2005 (submitted).
X. Zhu, J. Kandola, Z. Ghahramani and J. Lafferty. Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning. NIPS 2004.
J. Lafferty, X. Zhu and Y. Liu. Kernel Conditional Random Fields: Representation and Clique Selection. ICML 2004.
X. Zhu, Z. Ghahramani and J. Lafferty. Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. ICML 2003.
X. Zhu, J. Lafferty and Z. Ghahramani. Combining Active Learning and Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. ICML 2003 workshop.
X. Zhu and Z. Ghahramani. Learning from Labeled and Unlabeled Data with Label Propagation. Technical Report CMU-CALD-02-106, 2002.


Graph spectrum: Δ = sum_{i=1}^n λ_i φ_i φ_i^T

Learning component labels with EM
[Figure: the same Gaussian mixture fit as before.]
Labels for the components do not follow the graph. (Nigam et al., 2000)

Other research (2) Sequences and other structured data
- KCRF: p_f(y | x) = Z^{-1}(x, f) exp( sum_c f(x, y_c) ).
- f induces a regularized negative log loss on the training data: R(f) = sum_{i=1}^l -log p_f(y^(i) | x^(i)) + Ω(||f||_K).
- Representer theorem for KCRFs: the loss minimizer has the form f*(x, y_c) = sum_{i=1}^l sum_{c'} sum_{y'_{c'}} α(i, y'_{c'}) K((x^(i), y'_{c'}), (x, y_c)).

Other research (2) Sequences and other structured data
- Learn α to minimize R(f): convex, with sparse training.
- Special case: K((x^(i), y'_c), (x, y_c)) = ψ(K'(x^(i)_c, x_c), y'_c, y_c), where K' can be a semi-supervised kernel.

Other research (3) Active Learning
- Generalization error: err = sum_{i in U} sum_{y_i = 0,1} [sgn(f_i) != y_i] P_true(y_i).
- Approximation 1: P_true(y_i = 1) ≈ f_i.
- Estimated error: êrr = sum_{i in U} min(f_i, 1 - f_i).

Other research (3) Active Learning
- Estimated error after querying k with answer y_k: êrr^{+(x_k, y_k)} = sum_{i in U} min( f_i^{+(x_k, y_k)}, 1 - f_i^{+(x_k, y_k)} ).
- Approximation 2: êrr^{+x_k} = (1 - f_k) êrr^{+(x_k, 0)} + f_k êrr^{+(x_k, 1)}.
- Select the query k that minimizes the estimated error: k* = argmin_k êrr^{+x_k}.

Other research (3) Active Learning
Re-training is fast with semi-supervised learning: f_u^{+(x_k, y_k)} = f_u + (y_k - f_k) (Δ_uu^{-1})_{.k} / (Δ_uu^{-1})_{kk}.
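A sketch of the resulting query-selection loop, combining the estimated error, approximation 2, and the fast retraining formula above; computing and storing Δ_uu^{-1} is assumed to be feasible here, and the function name is my own.

```python
import numpy as np

def select_query(L_uu, f_u):
    """L_uu: unlabeled-unlabeled Laplacian block; f_u: current harmonic values on U."""
    G = np.linalg.inv(L_uu)                     # reused for every candidate query k
    best_k, best_err = None, np.inf
    for k in range(len(f_u)):
        expected_err = 0.0
        for y in (0, 1):                        # hypothetical answer y_k
            # fast retraining: f^{+(x_k, y_k)} = f_u + (y_k - f_k) G[:, k] / G[k, k]
            f_plus = f_u + (y - f_u[k]) * G[:, k] / G[k, k]
            est_err = np.sum(np.minimum(f_plus, 1 - f_plus))
            # approximation 2: weight the answers by the current soft label f_k
            expected_err += (f_u[k] if y == 1 else 1 - f_u[k]) * est_err
        if expected_err < best_err:
            best_k, best_err = k, expected_err
    return best_k                               # index of the unlabeled point to query
```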