Graph-Based Semi-Supervised Learning


Graph-Based Semi-Supervised Learning
Olivier Delalleau, Yoshua Bengio and Nicolas Le Roux
Université de Montréal - CIAR Workshop, April 26th, 2005

Outline
1. Semi-Supervised Setting
2. Graph Regularization and Label Propagation
3. Transduction vs. Induction
4. Curse of Dimensionality

Semi-Supervised Learning for Dummies
Task of binary classification with labels y_i ∈ {−1, 1}.
Semi-supervised = learn something about labels using both labeled and unlabeled data:
X = (x_1, x_2, ..., x_n), Y_l = (y_1, y_2, ..., y_l), n = l + u.
Transduction: Ŷ_u = (ŷ_{l+1}, ŷ_{l+2}, ..., ŷ_n).
Induction: ŷ : x ↦ ŷ(x).

The Classical Two-Moon Problem

Where are Manifolds and Kernels?
What is a good labeling Ŷ = (Ŷ_l, Ŷ_u)?
1. one that is consistent with the given labels: Ŷ_l ≈ Y_l
2. one that is smooth on the manifold where the data lie (manifold / cluster assumption): ŷ_i ≈ ŷ_j when x_i is close to x_j
Cost function:
C(Ŷ) = Σ_{i=1}^{l} (ŷ_i − y_i)² + (µ/2) Σ_{i,j=1}^{n} W_ij (ŷ_i − ŷ_j)²
with W_ij = W_X(x_i, x_j) a positive weighting function (e.g. a Gaussian kernel).
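To make the cost concrete, here is a minimal numpy sketch; the Gaussian kernel (and its width sigma) is an assumption standing in for whatever weighting function the slide leaves implicit, and labeled points are assumed to come first in the ordering:

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    # W_ij = exp(-||x_i - x_j||^2 / sigma^2): one common (assumed) choice of W_X
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

def cost(Y_hat, Y_l, W, mu):
    # C(Y_hat) = sum_{i<=l} (y_hat_i - y_i)^2
    #          + (mu/2) * sum_{i,j} W_ij (y_hat_i - y_hat_j)^2
    l = len(Y_l)
    fit = ((Y_hat[:l] - Y_l) ** 2).sum()
    smooth = 0.5 * mu * (W * (Y_hat[:, None] - Y_hat[None, :]) ** 2).sum()
    return fit + smooth
```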

Graphs Would be Cool too!
Nodes = points, edges: (i, j) is an edge iff W_ij > 0.

Regularization on Graph
Graph Laplacian: L_ii = Σ_{j≠i} W_ij and L_ij = −W_ij (i ≠ j).
C(Ŷ) = Σ_{i=1}^{l} (ŷ_i − y_i)² + (µ/2) Σ_{i,j=1}^{n} W_ij (ŷ_i − ŷ_j)² = ‖Ŷ_l − Y_l‖² + µ Ŷᵀ L Ŷ
C(Ŷ) is minimized when (S + µL) Ŷ = S Y, where S is the diagonal matrix with S_ii = 1 for labeled points and 0 otherwise: a linear system with n unknowns and n equations.
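A direct-solve sketch of this system, under the same assumption that the l labeled points come first; this is O(n³) as written, which motivates the label propagation and subset tricks that follow:

```python
import numpy as np

def solve_graph_regularization(W, Y_l, mu):
    # Minimize C(Y_hat) by solving (S + mu * L) Y_hat = S Y,
    # with Y zero-padded on the unlabeled points.
    n, l = W.shape[0], len(Y_l)
    L = np.diag(W.sum(axis=1)) - W        # L_ii = sum_j W_ij, L_ij = -W_ij
    S = np.zeros((n, n))
    S[:l, :l] = np.eye(l)                 # diagonal indicator of labeled points
    Y = np.concatenate([Y_l, np.zeros(n - l)])
    return np.linalg.solve(S + mu * L, S @ Y)
```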

From Matrix Inversion to Label Propagation
The linear system rewrites, for a labeled point,
ŷ_i^{(t+1)} = (Σ_j W_ij ŷ_j^{(t)} + (1/µ) y_i) / (Σ_j W_ij + 1/µ)
and for an unlabeled point,
ŷ_i^{(t+1)} = Σ_j W_ij ŷ_j^{(t)} / Σ_j W_ij.
This is a Jacobi or Gauss-Seidel iteration algorithm.
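A Jacobi-style sketch of these two updates (assumptions: W has zero diagonal, every node has at least one neighbor, labeled points come first, and a fixed iteration count stands in for a convergence test):

```python
import numpy as np

def label_propagation(W, Y_l, mu, n_iter=200):
    # Iterate the labeled and unlabeled update rules above.
    n, l = W.shape[0], len(Y_l)
    deg = W.sum(axis=1)                   # sum_j W_ij
    y = np.zeros(n)
    for _ in range(n_iter):
        prop = W @ y                      # sum_j W_ij * y_j^(t)
        y_next = prop / deg               # unlabeled update
        y_next[:l] = (prop[:l] + Y_l / mu) / (deg[:l] + 1.0 / mu)  # labeled update
        y = y_next
    return y
```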

No, I didn't come up with that last weekend
X. Zhu, Z. Ghahramani and J. Lafferty (2003): Semi-supervised learning using Gaussian fields and harmonic functions
D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Schölkopf (2004): Learning with local and global consistency
M. Belkin, I. Matveeva and P. Niyogi (2004): Regularization and semi-supervised learning on large graphs

Remember your Physics class?
Electric network analogy (Doyle and Snell, 1984; Zhu, Ghahramani and Lafferty, 2003). Graph = electric network with resistors between nodes: R_ij = 1/W_ij.
Ohm's law (potential = label): ŷ_j − ŷ_i = R_ij I_ij.
Kirchhoff's law on an unlabeled node i: Σ_j I_ij = 0.
This yields the same linear system as minimizing C(Ŷ) over Ŷ_u only.
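The analogy can be checked numerically: at every unlabeled node the net current should vanish, and for the minimizer of C over Ŷ_u this holds exactly. A small sketch reusing the solver above (rows l onward are the unlabeled nodes):

```python
import numpy as np

def max_kirchhoff_violation(W, Y_hat, l):
    # I_ij = (y_hat_j - y_hat_i) / R_ij with R_ij = 1 / W_ij;
    # sum_j I_ij should be ~0 (up to numerical error) at every unlabeled node.
    I = W * (Y_hat[None, :] - Y_hat[:, None])
    return np.abs(I[l:].sum(axis=1)).max()
```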

From Transduction to Induction
Solving the linear system → Ŷ (transduction).
For a new point x, with Ŷ already computed, minimize
C(ŷ(x)) = C(Ŷ) + (µ/2) Σ_{i=1}^{n} W_X(x_i, x) (ŷ_i − ŷ(x))²
which gives
ŷ(x) = Σ_{i=1}^{n} W_X(x_i, x) ŷ_i / Σ_{i=1}^{n} W_X(x_i, x).
(Induction like Parzen windows, but using the estimated labels Ŷ.)
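A sketch of this induction formula, again assuming the Gaussian W_X from earlier:

```python
import numpy as np

def induce(x, X, Y_hat, sigma=1.0):
    # Parzen-windows-like prediction, but averaging the *estimated* labels Y_hat.
    w = np.exp(-((X - x) ** 2).sum(axis=1) / sigma ** 2)
    return (w * Y_hat).sum() / w.sum()
```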

Faster Training from Subset
Previous algorithms are at least quadratic in n.
The induction formula could be used to train on a subset S only.
Better: minimize the full cost over Ŷ_S only.
For x_i ∈ R = X \ S:
ŷ_i = Σ_{j∈S} W_ij ŷ_j / Σ_{j∈S} W_ij
i.e. Ŷ_R = W̄_RS Ŷ_S, with W̄_RS the row-normalized weight matrix between R and S: the cost C(Ŷ) now only depends on Ŷ_S.
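In code, eliminating Ŷ_R is one normalized matrix product; W_RS holds the weights between points of R (rows) and points of S (columns), and each point of R is assumed to have nonzero weight to at least one point of S:

```python
import numpy as np

def extend_to_rest(W_RS, Y_S):
    # y_i = sum_{j in S} W_ij * y_j / sum_{j in S} W_ij, for every i in R,
    # i.e. Y_R = row-normalized(W_RS) @ Y_S.
    return (W_RS @ Y_S) / W_RS.sum(axis=1)
```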

Nice Cost, isn't it?
C(Ŷ_S) = (µ/2) Σ_{i,j∈R} W_ij (ŷ_i − ŷ_j)²   [C_RR]
       + µ Σ_{i∈R, j∈S} W_ij (ŷ_i − ŷ_j)²   [C_RS]
       + (µ/2) Σ_{i,j∈S} W_ij (ŷ_i − ŷ_j)²   [C_SS]
       + Σ_{i∈L} (ŷ_i − y_i)²   [C_L]

Let's Make it Simpler
Computing C_RR is quadratic in n → just get rid of it.
Linear system with |S| = m ≪ n unknowns → much faster.
Still need to do matrix multiplications → scales as O(m²n).
This slide looked pretty empty with only three points.
It is important to have reading material when nobody understands your accent.

Subset Selection
1. Random: fast, easy, crappy. Main problem = it does not fill the space well enough → bad approximation by the induction formula (some points have no near neighbors in the subset).
2. Heuristic: greedy construction of the subset, starting with the labeled points and iteratively minimizing Σ_{j∈S} W_ij over i ∉ S, i.e. choosing the point x_i farthest from the current subset; see the sketch below. (Additional tricks to eliminate outliers and sample more points near the decision surface.)
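A sketch of the greedy heuristic; the outlier and decision-surface refinements mentioned above are left out, and labeled_idx is assumed nonempty:

```python
import numpy as np

def greedy_subset(W, labeled_idx, m):
    # Start S with the labeled points; repeatedly add the point i not in S that
    # minimizes sum_{j in S} W_ij, i.e. the point farthest from the current S.
    S = list(labeled_idx)
    coverage = W[:, S].sum(axis=1)        # sum_{j in S} W_ij for every point i
    coverage[S] = np.inf                  # subset members are never re-selected
    while len(S) < m:
        i = int(np.argmin(coverage))      # farthest point from S
        S.append(i)
        coverage += W[:, i]               # account for the new subset member
        coverage[i] = np.inf
    return S
```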

Experimental Results
This is not last-minute research so we do have results!
Table: Comparative Classification Error (Induction) — methods NoSub, RandSub (subset only), RandSub and SmartSub on LETTERS, MNIST and COVTYPE, at several labeled fractions (1%, ...). [table entries not recoverable]
More comparisons between RandSub and SmartSub on 8 more UCI datasets → SmartSub always performs better.

Curse of Dimensionality
Labeling function: ŷ(x) = Σ_i ŷ_i W_X(x_i, x).
Locality: for a far-away point, the prediction tends to that of the nearest neighbor; also, the normal vector ∂ŷ/∂x(x) is approximately in the span of the nearest neighbors of x.
Smoothness: ŷ does not vary much within a ball of radius small w.r.t. σ (obtained from the second derivative); also, by the form of the cost, the learned function varies smoothly in regions with no labeled examples.
Curse: one needs lots of unlabeled examples to fill the region near the decision surface, and lots of labeled examples to account for all clusters.

Almost Real-Life Example
Need unlabeled examples along the sinusoidal decision surface, and labeled examples in each class region.

Conclusion
Simple non-parametric setting → powerful non-parametric semi-supervised algorithm.
Can scale to large datasets thanks to sparsity / subset selection.
Interesting links with electric networks / heat diffusion.
Limitations of local weights: curse of dimensionality.
Makes it possible to be on time for lunch!!!

References
Belkin, M., Matveeva, I., and Niyogi, P. (2004). Regularization and semi-supervised learning on large graphs. In Shawe-Taylor, J. and Singer, Y., editors, COLT 2004. Springer.
Delalleau, O., Bengio, Y., and Le Roux, N. (2005). Efficient non-parametric function induction in semi-supervised learning. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics.
Doyle, P. G. and Snell, J. L. (1984). Random walks and electric networks. Mathematical Association of America.
Zhou, D., Bousquet, O., Navin Lal, T., Weston, J., and Schölkopf, B. (2004). Learning with local and global consistency. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16, Cambridge, MA. MIT Press.
Zhu, X., Ghahramani, Z., and Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In ICML 2003.
