Semi-Supervised Learning with Graphs


1 Semi-Supervised Learning with Graphs. Xiaojin (Jerry) Zhu, Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Thesis committee: John Lafferty (co-chair), Ronald Rosenfeld (co-chair), Zoubin Ghahramani, Tommi Jaakkola.

2 Semi-supervised learning. Classifiers need labeled data to train, but labeled data are scarce while unlabeled data are abundant. Traditional classifiers cannot train on unlabeled data. My thesis (semi-supervised learning): develop classification methods that can use both labeled and unlabeled data.

3 Motivating example 1: speech recognition. Sounds → words. Labeled data: transcription, which takes 10 to 400 times real-time. Unlabeled data: radio, call centers. Useful?

4 Motivating example 2: parsing. Sentences → parse trees. [figure: parse tree for "I saw a falcon with a telescope"] Labeled data: treebanks (Hwa, 2003): English, 40,000 sentences, 5 years; Chinese, 4,000 sentences, 2 years. Unlabeled data: sentences without annotation. Useful?

5 Motivating example 3: video surveillance. Images → identities. Labeled data: only a few images. Unlabeled data: abundant. Useful?

6 The message: unlabeled data can improve classification.

7 Why unlabeled data might help. Example: classify astronomy vs. travel articles. [figure: document-word matrix for documents d1-d4 over astronomy words (asteroid, bright, comet, year, zodiac) and travel words (airport, bike, camp, yellowstone, zion)]

8 Why labeled data alone might fail. [figure: the same document-word matrix, but the labeled documents and the test documents share no words] No overlap! This tends to happen when labeled data is scarce.

9 Unlabeled data are stepping stones. [figure: documents d1-d9 over the same astronomy and travel words] Observe direct similarity: d1 ~ d5, d5 ~ d6, ... Assume similar features → same label. Infer indirect similarity.

10 Unlabeled data are stepping stones. Arrange the l labeled and u unlabeled (= test) points in a graph. Nodes: the n = l + u points. Weighted edges W: direct similarity, e.g. $W_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$. Infer indirect similarity using all paths.
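
A minimal numpy sketch of this weight matrix (the function name and the fully-connected choice are mine; the experiments later in the talk use kNN graphs):

    import numpy as np

    def gaussian_graph(X, sigma=1.0):
        # W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), as on slide 10.
        # X is an (n, d) array holding the l labeled points first, then the u unlabeled ones.
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        W = np.exp(-sq_dists / (2.0 * sigma ** 2))
        np.fill_diagonal(W, 0.0)  # no self-loops
        return W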

11 One way to use labeled and unlabeled data (Zhu and Ghahramani, 2002). Input: the n × n graph weights W (important!) and labels $Y_l \in \{0,1\}^l$. Create the transition matrix $P_{ij} = W_{ij} / \sum_k W_{ik}$. Start from any n-vector f and repeat: clamp the labeled data, $f_l = Y_l$; propagate, $f \leftarrow P f$. f converges to a unique solution, the harmonic function, with $0 \le f \le 1$ (soft labels).
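
A sketch of the clamp-and-propagate loop above, assuming the l labeled points come first (function name is mine):

    import numpy as np

    def label_propagation(W, Y_l, n_iter=1000, tol=1e-6):
        # Iterate clamp-and-propagate until the soft labels converge (slide 11).
        n, l = W.shape[0], len(Y_l)
        P = W / W.sum(axis=1, keepdims=True)  # transition matrix P_ij = W_ij / sum_k W_ik
        f = np.zeros(n)
        f[:l] = Y_l                           # clamp labeled data
        for _ in range(n_iter):
            f_new = P @ f                     # propagate
            f_new[:l] = Y_l                   # re-clamp
            if np.abs(f_new - f).max() < tol:
                break
            f = f_new
        return f                              # 0 <= f <= 1, soft labels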

12 An electric network interpretation (Zhu, Ghahramani and Lafferty, ICML 2003). The harmonic function f is the voltage at the nodes. Edges are resistors with $R_{ij} = 1/w_{ij}$. A 1-volt battery connects the labeled nodes to +1 volt and 0 volts.

13 A random walk interpretation. Harmonic function: $f_i = P(\text{hit label 1} \mid \text{start from } i)$. (demo)

14 Closed-form solution for the harmonic function. Diagonal degree matrix D: $D_{ii} = \sum_j W_{ij}$. Graph Laplacian matrix $\Delta = D - W$: the graph version of the continuous Laplacian operator $\nabla^2 = \partial^2/\partial x^2 + \partial^2/\partial y^2 + \partial^2/\partial z^2$. Closed form: $f_u = -\Delta_{uu}^{-1} \Delta_{ul} Y_l$. Harmonic: $\Delta f = 0$ on unlabeled data, with Dirichlet boundary conditions on labeled data.
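
The closed form in numpy, again with labeled points first (a sketch, not the thesis code):

    import numpy as np

    def harmonic_function(W, Y_l):
        # f_u = -Lap_uu^{-1} Lap_ul Y_l; since Lap_ul = -W_ul, this is the
        # solution of the linear system Lap_uu f_u = W_ul Y_l.
        l = len(Y_l)
        D = np.diag(W.sum(axis=1))
        Lap = D - W  # graph Laplacian
        return np.linalg.solve(Lap[l:, l:], W[l:, :l] @ Y_l)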

15 Properties of the harmonic function. Current conservation: in-flow = out-flow at any node (Kirchhoff's law). Minimum energy: $E(f) = \sum_{i \sim j} W_{ij}(f_i - f_j)^2 = f^\top \Delta f$. Average of neighbors: $f_u(i) = \sum_{j \sim i} W_{ij} f(j) / \sum_{j \sim i} W_{ij}$. Uniquely exists; $0 \le f \le 1$.

16 Harmonic functions for text categorization. [figure: accuracy curves for baseball vs. hockey, PC vs. Mac, religion vs. atheism] 50 labeled articles, about 2000 unlabeled articles, 10NN graph.

17 Harmonic functions for digit recognition. [figure: accuracy curves for digit recognition tasks] 50 labeled images, about 4000 unlabeled images, 10NN graph.

18 Why does it work on text? Good graphs: cosine similarity on tf.idf vectors; quoted passages create strong edges between documents.

19 Why does it work on digits? Good graphs: pixel-wise Euclidean distance. Images that are not directly similar become indirectly similar through stepping stones.

20 Where do good graphs come from? Domain knowledge, or learn the hyperparameters, e.g. by Bayesian evidence maximization: $W_{ij} = \exp(-\sum_d (x_{id} - x_{jd})^2 / \alpha_d)$ with per-dimension bandwidths $\alpha_d$. [figure: average 7, average 9, and the learned $\alpha$] Good graphs + harmonic functions = semi-supervised learning.
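
The per-dimension variant as a sketch (alpha would come from evidence maximization, which is not shown here):

    import numpy as np

    def gaussian_graph_per_dim(X, alpha):
        # W_ij = exp(-sum_d (x_id - x_jd)^2 / alpha_d), one bandwidth per feature.
        diff2 = (X[:, None, :] - X[None, :, :]) ** 2  # shape (n, n, d)
        W = np.exp(-(diff2 / alpha).sum(axis=-1))
        np.fill_diagonal(W, 0.0)
        return W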

21 Too much unlabeled data? Does it scale? $f_u = -\Delta_{uu}^{-1} \Delta_{ul} Y_l$ is $O(u^3)$, e.g. millions of crawled web pages. Solution 1: use iterative methods: the label propagation algorithm (slow), loopy belief propagation, or conjugate gradient. Solution 2: reduce the problem size: use a random small unlabeled subset (Delalleau et al., 2005), or harmonic mixtures. Can it handle new points (induction)?
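
A conjugate-gradient sketch of solution 1, using scipy sparse matrices so a sparse kNN graph stays cheap (names are mine):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import cg

    def harmonic_cg(W, Y_l):
        # Solve Lap_uu f_u = W_ul Y_l iteratively instead of forming Lap_uu^{-1}.
        l = len(Y_l)
        W = sp.csr_matrix(W)
        d = np.asarray(W.sum(axis=1)).ravel()
        Lap = (sp.diags(d) - W).tocsr()
        f_u, info = cg(Lap[l:, l:], W[l:, :l] @ Y_l)
        assert info == 0, "CG did not converge"
        return f_u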

22 Harmonic mixtures (Zhu and Lafferty, 2005). Fit the unlabeled data with a mixture model, e.g. Gaussian mixtures for images or multinomial mixtures for documents, using EM or other methods. M mixture components, here M = 30 with u ≈ 1000. Learn soft labels for the mixture components, not the unlabeled points.

23 Harmonic mixtures: learn labels for the mixture components. Assume mixture component labels $\lambda_1, \ldots, \lambda_M$; $\lambda$ determines f through the mixture model responsibilities $R_{im} = p(m \mid x_i)$: $f(i) = \sum_{m=1}^M R_{im} \lambda_m$. Learn $\lambda$ such that f is closest to harmonic: minimize the energy $E(f) = f^\top \Delta f$. Convex optimization with the closed-form solution $\lambda = -(R^\top \Delta_{uu} R)^{-1} R^\top \Delta_{ul} Y_l$.
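
The closed form for λ as a sketch (R is the (n, M) responsibility matrix from the fitted mixture, labeled points first; names are mine):

    import numpy as np

    def harmonic_mixture_labels(R, Lap, Y_l):
        # lambda = -(R_u^T Lap_uu R_u)^{-1} R_u^T Lap_ul Y_l  (slide 23)
        l = len(Y_l)
        R_u = R[l:]                    # responsibilities of the unlabeled points
        A = R_u.T @ Lap[l:, l:] @ R_u  # only M x M: cheap to invert
        b = -R_u.T @ (Lap[l:, :l] @ Y_l)
        return np.linalg.solve(A, b)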

24 Harmonic mixtures: λ follows the graph, and so does f.

25 Harmonic mixtures: computational savings. Harmonic mixtures: $f_u = -R(R^\top \Delta_{uu} R)^{-1} R^\top \Delta_{ul} Y_l$, where $R^\top \Delta_{uu} R$ is only M × M. Original harmonic function: $f_u = -\Delta_{uu}^{-1} \Delta_{ul} Y_l$, where $\Delta_{uu}$ is u × u. Harmonic mixtures cost $O(M^3)$ instead of $O(u^3)$, so they can handle large problems. They also give induction: $f(x) = \sum_{m=1}^M R_{xm} \lambda_m$.
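
Induction then needs only the new point's responsibilities (a sketch; R_x = [p(m | x)]_m comes from the fitted mixture):

    def harmonic_mixture_predict(R_x, lam):
        # f(x) = sum_m p(m | x) * lambda_m  (slide 25)
        return R_x @ lam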

26 Too little labeled data? (Zhu, Lafferty and Ghahramani, 2003b) Choose unlabeled points to label: active learning. Most ambiguous vs. most influential. [figure: two candidate query points, a and B]

27 Active learning. Generalization error: $err = \sum_{i \in U} \sum_{y_i = 0,1} [\operatorname{sgn}(f_i) \ne y_i]\, P_{\text{true}}(y_i)$. Approximation: $P_{\text{true}}(y_i = 1) \approx f_i$. Estimated error: $\hat{err} = \sum_{i \in U} \min(f_i, 1 - f_i)$. Estimated error after querying $x_k$ and receiving answer $y_k$: $\hat{err}^{+(x_k, y_k)} = \sum_{i \in U} \min\big(f_i^{+(x_k, y_k)}, 1 - f_i^{+(x_k, y_k)}\big)$.

28 Active learning. Approximation: $\hat{err}^{+x_k} \approx (1 - f_k)\, \hat{err}^{+(x_k, 0)} + f_k\, \hat{err}^{+(x_k, 1)}$. Select the query $k^* = \arg\min_k \hat{err}^{+x_k}$. Re-training is fast with the harmonic function: $f_u^{+(x_k, y_k)} = f_u + (y_k - f_k)\, \frac{(\Delta_{uu}^{-1})_{\cdot k}}{(\Delta_{uu}^{-1})_{kk}}$. Active learning can be effectively integrated into semi-supervised learning.
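
A sketch of the greedy query selection on slides 27-28, using the fast retraining identity (G stands for the precomputed inverse $\Delta_{uu}^{-1}$; names are mine):

    import numpy as np

    def select_query(f_u, G):
        # Pick the unlabeled point whose answer minimizes the estimated risk.
        best_k, best_risk = -1, np.inf
        for k in range(len(f_u)):
            risk = 0.0
            for y_k, p_yk in ((0, 1.0 - f_u[k]), (1, f_u[k])):
                # fast retrain: f^{+(k, y_k)} = f_u + (y_k - f_k) G[:, k] / G[k, k]
                f_plus = f_u + (y_k - f_u[k]) * G[:, k] / G[k, k]
                f_plus = np.clip(f_plus, 0.0, 1.0)  # guard, not in the slide
                risk += p_yk * np.minimum(f_plus, 1.0 - f_plus).sum()
            if risk < best_risk:
                best_k, best_risk = k, risk
        return best_k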

29 Beyond harmonic functions. Are harmonic functions too specialized? I will show you the things behind harmonic functions: a probabilistic model and a kernel. Then I will give you an even better kernel.

30 The probabilistic model behind the harmonic function. Energy: $E(f) = \sum_{i \sim j} W_{ij}(f_i - f_j)^2 = f^\top \Delta f$; low energy means good label propagation. [figure: example labelings with E(f) = 4, 2, 1] Markov random field: $p(f) \propto \exp(-E(f))$. If f is discrete, $f \in \{0,1\}^n$: Boltzmann machines, inference hard.

31 The probabilistic model behind the harmonic function: Gaussian random fields (Zhu, Ghahramani and Lafferty, ICML 2003). Continuous relaxation: $f \in \mathbb{R}^n$. Gaussian random field: $p(f) \propto \exp(-f^\top \Delta f)$, an n-dimensional Gaussian with inverse covariance matrix $\Delta$. The harmonic function is its mean, which is easy to compute. Equivalent to Gaussian processes on finite data with kernel (covariance) matrix $\Delta^{-1}$.

32 The kernel matrix behind harmonic functions. $K = \Delta^{-1}$; $K_{ij}$ is the indirect similarity: the direct similarity $W_{ij}$ may be small, but $K_{ij}$ will be large if there are many paths between i and j. K can be used with many kernel machines and is built with both labeled and unlabeled data. K + SVM = semi-supervised SVM, different from the transductive SVM (Joachims, 1999). Additional benefit: it handles noisy labeled data.
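
A sketch of this kernel. The Laplacian is singular (the constant vector has eigenvalue 0), so a small ridge is added before inverting; that regularizer is my addition, not part of the slide:

    import numpy as np

    def laplacian_kernel(W, eps=1e-6):
        # K = (Lap + eps I)^{-1}: indirect similarity through all paths.
        Lap = np.diag(W.sum(axis=1)) - W
        return np.linalg.inv(Lap + eps * np.eye(len(W)))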

33 K encourages smooth functions. Graph spectrum: $\Delta = \sum_{k=1}^n \lambda_k \phi_k \phi_k^\top$. A small eigenvalue means a smooth eigenvector: $\sum_{i \sim j} W_{ij}(\phi_k(i) - \phi_k(j))^2 = \lambda_k$. K encourages smooth eigenvectors with large weights: Laplacian $\Delta = \sum_k \lambda_k \phi_k \phi_k^\top$, harmonic kernel $K = \Delta^{-1} = \sum_k \frac{1}{\lambda_k} \phi_k \phi_k^\top$. Smooth functions have small RKHS norm: $\|f\|_K^2 = f^\top K^{-1} f = f^\top \Delta f = \sum_{i \sim j} W_{ij}(f_i - f_j)^2$.

34 General semi-supervised kernels. $K = \Delta^{-1}$ may not be the best semi-supervised kernel. General semi-supervised kernels: $K = \sum_i r(\lambda_i) \phi_i \phi_i^\top$, with $r(\lambda)$ large when $\lambda$ is small, to encourage smooth eigenvectors. Specific choices of r() lead to known kernels: harmonic function kernel, $r(\lambda) = 1/\lambda$; diffusion kernel, $r(\lambda) = \exp(-\sigma^2 \lambda)$; random walk kernel, $r(\lambda) = (\alpha - \lambda)^p$. Is there a best r()? Yes, as measured by kernel alignment.
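
A sketch of the spectral-transform family (the parameter values for sigma, alpha, p are illustrative; alpha must exceed the largest eigenvalue for the random walk kernel to stay positive semi-definite):

    import numpy as np

    def spectral_kernel(W, r):
        # K = sum_i r(lambda_i) phi_i phi_i^T from the Laplacian's eigensystem.
        Lap = np.diag(W.sum(axis=1)) - W
        lam, Phi = np.linalg.eigh(Lap)  # columns of Phi are eigenvectors
        return (Phi * r(lam)) @ Phi.T

    diffusion   = lambda lam, sigma=1.0: np.exp(-sigma ** 2 * lam)
    random_walk = lambda lam, alpha=10.0, p=2: (alpha - lam) ** p
    harmonic_r  = lambda lam: 1.0 / np.maximum(lam, 1e-6)  # guard the zero eigenvalue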

35 Alignment measures kernel quality. Alignment between the kernel and the labeled data $Y_l$: $\text{align}(K, Y_l) = \frac{\langle K_{ll}, Y_l Y_l^\top \rangle}{\|K_{ll}\| \, \|Y_l Y_l^\top\|}$, an extension of the cosine angle between vectors. High alignment is related to good generalization performance, and it leads to a convex optimization problem.
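
Alignment in a few lines (a sketch, with Y_l as +1/-1 labels):

    import numpy as np

    def alignment(K_ll, Y_l):
        # <K_ll, Y_l Y_l^T> / (||K_ll||_F ||Y_l Y_l^T||_F): a cosine between matrices.
        T = np.outer(Y_l, Y_l)
        return (K_ll * T).sum() / (np.linalg.norm(K_ll) * np.linalg.norm(T))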

36 Finding the best kernel (Zhu, Kandola, Ghahramani and Lafferty, NIPS 2004): the order-constrained semi-supervised kernel. $\max_r \text{align}(K, Y_l)$ subject to $K = \sum_i r_i \phi_i \phi_i^\top$ and $r_1 \ge \cdots \ge r_n \ge 0$. The order constraints $r_1 \ge \cdots \ge r_n$ encourage smoothness. Convex optimization; r is nonparametric.

37 The order-constrained kernel improves alignment and accuracy. Text categorization (religion vs. atheism), 50 labeled and 2000 unlabeled articles. [table: alignment, and accuracy with support vector machines, for the order-constrained, harmonic, and RBF kernels; the order-constrained kernel scores highest on both] We now have good kernels for semi-supervised learning.

38 Sequences and other structured data (Lafferty, Zhu and Liu, 2004). What if $x_1, \ldots, x_n$ form sequences? Speech, natural language processing, biosequence analysis, etc. Conditional random fields (CRFs) model $y \mid x$; kernelize them to get kernel CRFs; then combine kernel CRFs with semi-supervised kernels.

39 Related work in semi-supervised learning, based on different assumptions:
method | assumptions | references
graph-based | similar feature, same label | this talk; mincuts (Blum & Chawla, 2001); normalized Laplacian (Zhou et al., 2003); regularization (Belkin et al., 2004)
mixture model, EM | generative model | (Nigam et al., 2000)
transductive SVM | low density separation | (Joachims, 1999)
information regularization | y varies less where p(x) is high | (Szummer & Jaakkola, 2002)
co-training | feature split | (Blum & Mitchell, 1998)

40 Key contributions of the thesis. Use unlabeled data? Harmonic functions. Good graph? Graph hyperparameter learning. Large problems? Harmonic mixtures. Pick items to label? Active semi-supervised learning. Probabilistic framework? Gaussian random fields. Kernel machines? Semi-supervised kernels. Structured data? Kernel conditional random fields.

41 Summary. Unlabeled data can improve classification. The methods have reached the stage where we can apply them to real-world tasks.

42 Thank You

43 Open questions. More efficient graph learning; more research on structured data; regression, ranking, clustering; new semi-supervised assumptions; applications: NLP, bioinformatics, vision, etc.; learning theory: consistency, etc.

44 References
M. Belkin, I. Matveeva and P. Niyogi. Regularization and Semi-supervised Learning on Large Graphs. COLT 2004.
A. Blum and S. Chawla. Learning from Labeled and Unlabeled Data using Graph Mincuts. ICML 2001.
A. Blum, T. Mitchell. Combining Labeled and Unlabeled Data with Co-training. COLT 1998.
O. Delalleau, Y. Bengio, N. Le Roux. Efficient Non-Parametric Function Induction in Semi-Supervised Learning. AISTATS 2005.
R. Hwa. A Continuum of Bootstrapping Methods for Parsing Natural Languages. 2003.
T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. ICML 1999.
K. Nigam, A. McCallum, S. Thrun, T. Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning 2000.
M. Szummer, T. Jaakkola. Information Regularization with Partially Labeled Data. NIPS 2002.
D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Schölkopf. Learning with Local and Global Consistency. NIPS 2003.

45 References (continued)
X. Zhu, J. Lafferty. Harmonic Mixtures. Submitted.
X. Zhu, J. Kandola, Z. Ghahramani, J. Lafferty. Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning. NIPS 2004.
J. Lafferty, X. Zhu, Y. Liu. Kernel Conditional Random Fields: Representation and Clique Selection. ICML 2004.
X. Zhu, Z. Ghahramani, J. Lafferty. Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. ICML 2003.
X. Zhu, J. Lafferty, Z. Ghahramani. Combining Active Learning and Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. ICML 2003 workshop.
X. Zhu, Z. Ghahramani. Learning from Labeled and Unlabeled Data with Label Propagation. CMU-CALD technical report, 2002.

46 Graph spectrum: $\Delta = \sum_{i=1}^n \lambda_i \phi_i \phi_i^\top$.

47 Learning component labels with EM (Nigam et al., 2000). [figure] Labels for the components do not follow the graph.

48 KCRF: sequences and other structured data. KCRF: $p_f(y \mid x) = Z^{-1}(x, f) \exp\big(\sum_c f(x, y_c)\big)$. f induces a regularized negative log loss on the training data: $R(f) = -\sum_{i=1}^l \log p_f(y^{(i)} \mid x^{(i)}) + \Omega(\|f\|_K)$. Representer theorem for KCRFs: the loss minimizer is $f^*(x, y_c) = \sum_{i=1}^l \sum_c \sum_{y'_c} \alpha(i, y'_c) K((x^{(i)}, y'_c), (x, y_c))$.

49 KCRF: sequences and other structured data. Learn α to minimize R(f): convex, with sparse training. Special case: $K((x^{(i)}, y'_c), (x, y_c)) = \psi(K'(x^{(i)}_c, x_c), y'_c, y_c)$, where K' can be a semi-supervised kernel.

50 A random walk interpretation of harmonic functions. Harmonic function: $f_i = P(\text{hit label 1} \mid \text{start from } i)$. Random walk from node i to j with probability $P_{ij}$; stop if we hit a labeled node. Indirect similarity: random walks have similar destinations. [figure: a walk from node i absorbed at a node labeled 1 or 0]

51 An image and its neighbors in the graph. [figure: image 4005 with neighbor 1 (time edge), neighbors 2-4 (color edges), and neighbor 5 (face edge)]
