Graphs in Machine Learning
1 Graphs in Machine Learning
Michal Valko, Inria Lille - Nord Europe, France
TA: Pierre Perrault
Partially based on material by: Mikhail Belkin, Jerry Zhu, Olivier Chapelle, Branislav Kveton
October 30, 2017, MVA 2017/2018
2 Previous Lecture
- recommendation on a bipartite graph
- resistive networks: recommendation score as a resistance?
- Laplacian and resistive networks
- resistance distance and random walks
- geometry of the data and the connectivity
- spectral clustering: connectivity vs. compactness; MinCut, RatioCut, NCut; spectral relaxations
- manifold learning with Laplacian eigenmaps
3 Previous Lab Session by Pierre Perrault
Content:
- graph construction
- test sensitivity to parameters: σ, k, ε
- spectral clustering
- spectral clustering vs. k-means
- image segmentation
Short written report (graded; all reports together are around 40% of the grade). Check the course website for the policies. Questions go to Piazza. Deadline: , 23:59
4 This Lecture
- manifold learning with Laplacian eigenmaps
- semi-supervised learning (SSL)
- inductive and transductive semi-supervised learning
- SSL with self-training
- SVMs and semi-supervised SVMs = TSVMs
- Gaussian random fields and the harmonic solution
- graph-based semi-supervised learning
- transductive learning
- manifold regularization
5 Manifold Learning: Recap
Dimensionality reduction / manifold learning, problem definition:
Given {x_i}_{i=1}^N from R^d, find {y_i}_{i=1}^N in R^m, where m ≪ d.
What do we know about dimensionality reduction?
- representation/visualization (2D or 3D)
- an old example: globe to a map
- often assuming a manifold M ⊂ R^d
- feature extraction
- linear vs. nonlinear dimensionality reduction
What do we know about linear vs. nonlinear methods?
- linear: ICA, PCA, SVD, ...
- nonlinear methods often preserve only local distances
6 Manifold Learning: Linear vs. Non-linear
7 Manifold Learning: Preserving (just) local distances
d(y_i, y_j) = d(x_i, x_j) only if d(x_i, x_j) is small
min_y Σ_ij w_ij ||y_i − y_j||^2
Looks familiar?
8 Manifold Learning: Laplacian Eigenmaps
Step 1: Solve the generalized eigenproblem: Lf = λDf
Step 2: Assign m new coordinates: x_i ↦ (f_2(i), ..., f_{m+1}(i))
Note 1: we need the m + 1 smallest eigenvectors
Note 2: f_1 is useless (it is the constant eigenvector for eigenvalue 0)
mbelkin/papers/lem_nc_03.pdf
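The two steps above can be sketched in a few lines of NumPy/SciPy (a minimal illustration; the graph weight matrix W, e.g. from a k-NN or ε-graph, is assumed to be given):

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(W, m):
    """Embed the vertices of a graph with weight matrix W into R^m."""
    D = np.diag(W.sum(axis=1))
    L = D - W                        # unnormalized graph Laplacian
    # Step 1: generalized eigenproblem L f = lambda D f;
    # eigh returns eigenpairs sorted by ascending eigenvalue.
    _, F = eigh(L, D)
    # Step 2: drop f_1 (constant, eigenvalue 0) and keep the next
    # m eigenvectors as the new coordinates.
    return F[:, 1:m + 1]
```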
9 Manifold Learning: Laplacian Eigenmaps to 1D
Laplacian Eigenmaps 1D objective:
min_f f^T L f  s.t. f_i ∈ R, f^T D 1 = 0, f^T D f = 1
The meaning of the constraints is similar as for spectral clustering:
- f^T D f = 1 is for scaling
- f^T D 1 = 0 is to avoid the trivial solution v_1 (the constant eigenvector)
What is the solution?
10 Manifold Learning: Example
laplacian-eigenmap- -diffusion-map- -manifold-learning
11 Semi-supervised learning: How is it possible?
This is how children learn! (hypothesis)
12 Semi-supervised learning (SSL)
SSL problem definition:
Given {x_i}_{i=1}^N from R^d and labels {y_i}_{i=1}^{n_l}, with n_l ≪ N, find {y_i}_{i=n_l+1}^N (transductive), or find f predicting y well beyond the given data (inductive).
Some facts about SSL:
- assumes that the unlabeled data is useful
- works with data-geometry assumptions: cluster assumption, low-density separation, manifold assumption, smoothness assumptions, generative models, ...
- now it helps, now it does not (sic)
- provable cases when it helps
- inductive or transductive/out-of-sample extension
13 SSL: Self-Training
14 SSL: Overview: Self-Training
Input: L = {(x_i, y_i)}_{i=1}^{n_l} and U = {x_i}_{i=n_l+1}^N
Repeat:
- train f using L
- apply f to (some of) U and add them to L
What are the properties of self-training?
- it's a wrapper method
- it heavily depends on the internal classifier
- some theory exists for specific classifiers
- nobody uses it anymore
- errors propagate (unless the clusters are well separated)
15 SSL: Self-Training: Bad Case
16 SSL: Transductive SVM: S3VM
17 SSL: Transductive SVM: Classical SVM
Linear case: f = w^T x + b; we look for (w, b).
Max-margin classification:
max_{w,b} 1/||w||  s.t. y_i(w^T x_i + b) ≥ 1, i = 1, ..., n_l
Note the difference between the functional and the geometric margin.
Equivalently:
min_{w,b} ||w||^2  s.t. y_i(w^T x_i + b) ≥ 1, i = 1, ..., n_l
18 SSL: Transductive SVM: Classical SVM
Max-margin classification, separable case:
min_{w,b} ||w||^2  s.t. y_i(w^T x_i + b) ≥ 1, i = 1, ..., n_l
Max-margin classification, non-separable case:
min_{w,b} λ||w||^2 + Σ_i ξ_i  s.t. y_i(w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, i = 1, ..., n_l
19 SSL: Transductive SVM: Classical SVM
Max-margin classification, non-separable case:
min_{w,b} λ||w||^2 + Σ_i ξ_i  s.t. y_i(w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, i = 1, ..., n_l
Unconstrained formulation using the hinge loss:
min_{w,b} Σ_{i=1}^{n_l} max(1 − y_i(w^T x_i + b), 0) + λ||w||^2
In general:
min_{w,b} Σ_{i=1}^{n_l} V(x_i, y_i, f(x_i)) + λΩ(f)
20 SSL: Transductive SVM: Classical SVM: Hinge loss
V(x_i, y_i, f(x_i)) = max(1 − y_i(w^T x_i + b), 0)
21 SSL: Transductive SVM: Unlabeled Examples
min_{w,b} Σ_{i=1}^{n_l} max(1 − y_i(w^T x_i + b), 0) + λ||w||^2
How to incorporate unlabeled examples? There are no y's for the unlabeled x.
Prediction of f for (any) x: ŷ = sgn(f(x)) = sgn(w^T x + b)
Pretending that sgn(f(x)) is the true label:
V(x, ŷ, f(x)) = max(1 − ŷ(w^T x + b), 0) = max(1 − sgn(w^T x + b)(w^T x + b), 0) = max(1 − |w^T x + b|, 0)
22 SSL: Transductive SVM: Hinge and Hat Loss
What is the difference in the objectives?
Hinge loss penalizes? Hat loss penalizes?
23 SSL: Transductive SVM: S3VM
This is what we wanted!
24 SSL: Transductive SVM: Formulation
The main SVM idea stays the same: penalize the margin.
min_{w,b} Σ_{i=1}^{n_l} max(1 − y_i(w^T x_i + b), 0) + λ_1 ||w||^2 + λ_2 Σ_{i=n_l+1}^{n_l+n_u} max(1 − |w^T x_i + b|, 0)
What is the loss and what is the regularizer?
Think of the unlabeled data as a regularizer for your classifier!
Practical hint: additionally enforce the class balance.
What is the main issue of TSVM?
recent advancements:
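The three terms of the objective can be written down directly as a sketch (evaluating the objective for a fixed (w, b); the function names are illustrative):

```python
import numpy as np

def hinge_loss(y, fx):
    """Labeled term: penalizes points on the wrong side of the margin."""
    return np.maximum(1 - y * fx, 0)

def hat_loss(fx):
    """Unlabeled term: with the pseudo-label sgn(f(x)), the loss becomes
    max(1 - |f(x)|, 0), which penalizes predictions inside the margin."""
    return np.maximum(1 - np.abs(fx), 0)

def tsvm_objective(w, b, X_l, y_l, X_u, lam1, lam2):
    """Value of the (non-convex) TSVM objective for a given (w, b)."""
    f_l = X_l @ w + b
    f_u = X_u @ w + b
    return (hinge_loss(y_l, f_l).sum()
            + lam1 * (w @ w)
            + lam2 * hat_loss(f_u).sum())
```

The hat loss makes the objective non-convex in (w, b), which is why TSVM training only finds local optima.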
25 SSL with Graphs: Prehistory
Blum/Chawla: Learning from Labeled and Unlabeled Data using Graph Mincuts
*following some insights from vision research in the 1980s
26 SSL with Graphs: MinCut
MinCut SSL: an idea similar to MinCut clustering. Where is the link? What is the formal statement?
We look for f(x) ∈ {±1}:
cut = Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))^2 = Ω(f)
Why (f(x_i) − f(x_j))^2 and not |f(x_i) − f(x_j)|?
27 SSL with Graphs: MinCut
We look for f(x) ∈ {±1}:
Ω(f) = Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))^2
Clustering was unsupervised; here we have supervised data. Recall the general objective-function framework:
min_f Σ_{i=1}^{n_l} V(x_i, y_i, f(x_i)) + λΩ(f)
It would be nice if we match the prediction on labeled data:
Σ_{i=1}^{n_l} V(x_i, y_i, f(x_i)) = Σ_{i=1}^{n_l} (f(x_i) − y_i)^2
28 SSL with Graphs: MinCut
Final objective function:
min_{f ∈ {±1}^{n_l+n_u}} Σ_{i=1}^{n_l} (f(x_i) − y_i)^2 + λ Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))^2
This is an integer program :( Can we solve it? Are we happy?
We need a better way to reflect the confidence.
29 SSL with Graphs: Harmonic Functions
Zhu/Ghahramani/Lafferty: Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions
*a seminal paper that convinced people to use graphs for SSL
Idea 1: Look for a unique solution. Idea 2: Find a smooth one. (harmonic solution)
Harmonic SSL
1) As before, we constrain f to match the supervised data: f(x_i) = y_i, i ∈ {1, ..., n_l}
2) We enforce the solution f to be harmonic:
f(x_i) = Σ_{j∼i} f(x_j) w_ij / Σ_{j∼i} w_ij,  i ∈ {n_l + 1, ..., n_l + n_u}
30 SSL with Graphs: Harmonic Functions
The harmonic solution is obtained from the MinCut one ...
min_{f ∈ {±1}^{n_l+n_u}} Σ_{i=1}^{n_l} (f(x_i) − y_i)^2 + λ Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))^2
... if we just relax the integer constraints to be real ...
min_{f ∈ R^{n_l+n_u}} Σ_{i=1}^{n_l} (f(x_i) − y_i)^2 + λ Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))^2
... or equivalently (note that f(x_i) = f_i) ...
min_{f ∈ R^{n_l+n_u}} Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))^2  s.t. y_i = f(x_i), i = 1, ..., n_l
31 SSL with Graphs: Harmonic Functions
Properties of the relaxation from ±1 to R:
- there is a closed-form solution for f
- this solution is unique and globally optimal
- it is either constant or has its maximum/minimum on the boundary
- f(x_i) may not be discrete, but we can threshold it
- electric-network interpretation
- random-walk interpretation
32 SSL with Graphs: Harmonic Functions
Random-walk interpretation:
1) start from the vertex you want to label and walk randomly
2) P(j | i) = w_ij / Σ_k w_ik, i.e., P = D^{−1} W
3) finish when a labeled vertex is hit (absorbing random walk)
f_i = probability of reaching a positively labeled vertex
33 SSL with Graphs: Harmonic Functions
How to compute the HS? Option A: iteration/propagation
Step 1: Set f(x_i) = y_i for i = 1, ..., n_l
Step 2: Propagate iteratively (only for the unlabeled nodes):
f(x_i) ← Σ_{j∼i} f(x_j) w_ij / Σ_{j∼i} w_ij,  i ∈ {n_l + 1, ..., n_l + n_u}
Properties:
- this converges to the harmonic solution
- we can set the initial values for the unlabeled nodes arbitrarily
- an interesting option for large-scale data
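Option A can be sketched directly (a toy implementation; the labeled vertices are assumed to come first in W):

```python
import numpy as np

def harmonic_by_propagation(W, y_l, n_iter=100):
    """Iterative harmonic solution. Unlabeled values start at 0 (any
    initialization converges) and are repeatedly replaced by the
    weighted average of their neighbors' values."""
    n, n_l = W.shape[0], len(y_l)
    f = np.zeros(n)
    f[:n_l] = y_l                    # Step 1: clamp the labels
    d = W.sum(axis=1)
    for _ in range(n_iter):          # Step 2: propagate on unlabeled nodes
        f[n_l:] = (W @ f)[n_l:] / d[n_l:]
    return f
```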
34 SSL with Graphs: Harmonic Functions
How to compute the HS? Option B: closed-form solution
Define f = (f(x_1), ..., f(x_{n_l+n_u})) = (f_1, ..., f_{n_l+n_u}).
Ω(f) = Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))^2 = f^T L f
L is an (n_l + n_u) × (n_l + n_u) matrix:
L = [ L_ll  L_lu ]
    [ L_ul  L_uu ]
How to solve this constrained minimization problem?
35 SSL with Graphs: Harmonic Functions
Let us compute the harmonic solution using the harmonic property!
How did we formalize the harmonic property of a circuit? (Lf)_u = 0_u
In matrix notation:
[ L_ll  L_lu ] [ f_l ]   [ ... ]
[ L_ul  L_uu ] [ f_u ] = [ 0_u ]
f_l is constrained to be y_l, and for f_u we get
L_ul f_l + L_uu f_u = 0_u  ⟹  f_u = L_uu^{−1}(−L_ul f_l) = L_uu^{−1}(W_ul f_l).
Note that this does not depend on L_ll.
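The closed form translates to a few lines (labeled vertices first in W; one linear solve instead of an explicit inverse):

```python
import numpy as np

def harmonic_closed_form(W, y_l):
    """f_u = L_uu^{-1} (W_ul f_l); note that L_ll is never used."""
    n_l = len(y_l)
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.solve(L[n_l:, n_l:], W[n_l:, :n_l] @ y_l)
```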
36 SSL with Graphs: Harmonic Functions
Can we see that this calculates the probability of a random walk?
f_u = L_uu^{−1}(−L_ul f_l) = L_uu^{−1}(W_ul f_l)
Note that P = D^{−1}W. Then, equivalently, f_u = (I − P_uu)^{−1} P_ul f_l.
Split the equation into its positive and negative parts:
f_i = [(I − P_uu)^{−1} P_ul f_l]_i = Σ_{j: y_j = 1} [(I − P_uu)^{−1} P_ul]_ij − Σ_{j: y_j = −1} [(I − P_uu)^{−1} P_ul]_ij = p_i^{(+1)} − p_i^{(−1)}
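A quick numerical check of this equivalence on a three-vertex graph (vertices 0 and 1 labeled +1 and −1, vertex 2 unlabeled and connected to both; the graph is illustrative):

```python
import numpy as np

W = np.array([[0., 0., 1.],
              [0., 0., 1.],
              [1., 1., 0.]])
y_l = np.array([1., -1.])
n_l = 2

P = np.diag(1.0 / W.sum(axis=1)) @ W           # P = D^{-1} W
P_uu, P_ul = P[n_l:, n_l:], P[n_l:, :n_l]

# Absorption probabilities of the random walk into each labeled vertex.
A = np.linalg.solve(np.eye(P_uu.shape[0]) - P_uu, P_ul)
p_plus = A[:, y_l == 1].sum(axis=1)            # reach a +1 vertex first
p_minus = A[:, y_l == -1].sum(axis=1)          # reach a -1 vertex first
f_u = p_plus - p_minus                         # the harmonic solution
```

Here the unlabeled vertex is equidistant from both labels, so the absorption probabilities are both 1/2 and f_u = 0.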
37 SSL with Graphs: Regularized Harmonic Functions
f_i = p_i^{(+1)} − p_i^{(−1)}  ⟹  f_i = |f_i| · sgn(f_i), where |f_i| is the confidence and sgn(f_i) the label.
What if a nasty outlier sneaks in? The prediction for the outlier can be hyperconfident :(
How to control the confidence of the inference? We add a sink to the graph: allow the random walk to die!
sink = an artificial label node with value 0, connected to every other vertex
What will this do to our predictions? It depends on the weight on the edges.
38 SSL with Graphs: Regularized Harmonic Functions
How do we compute this regularized random walk?
f_u = (L_uu + γ_g I)^{−1}(W_ul f_l)
How does γ_g influence the HS? What happens to sneaky outliers?
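The regularized solution is a one-line change from the plain harmonic one (sketch; labeled vertices first in W):

```python
import numpy as np

def regularized_harmonic(W, y_l, gamma_g):
    """f_u = (L_uu + gamma_g I)^{-1} (W_ul f_l). The sink shrinks all
    scores toward 0, so weakly connected vertices (e.g. outliers) get
    near-zero, low-confidence predictions instead of hyperconfident ones."""
    n, n_l = W.shape[0], len(y_l)
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.solve(L[n_l:, n_l:] + gamma_g * np.eye(n - n_l),
                           W[n_l:, :n_l] @ y_l)
```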
39 SSL with Graphs: Harmonic Functions
Why don't we represent the sink in L explicitly? Formally, to get the harmonic solution on the graph with the sink ...
[ L_ll + γ_g I_{n_l}   L_lu                −γ_g 1_{n_l} ] [ f_l ]   [ ... ]
[ L_ul                 L_uu + γ_g I_{n_u}  −γ_g 1_{n_u} ] [ f_u ] = [ 0_u ]
[ −γ_g 1_{n_l}^T       −γ_g 1_{n_u}^T      nγ_g         ] [  0  ]   [ ... ]
... the unlabeled rows give L_ul f_l + (L_uu + γ_g I_{n_u}) f_u = 0_u ...
... which is the same if we disregard the last column and row ...
[ L_ll + γ_g I_{n_l}  L_lu               ] [ f_l ]   [ ... ]
[ L_ul                L_uu + γ_g I_{n_u} ] [ f_u ] = [ 0_u ]
... and therefore we simply add γ_g to the diagonal of L!
40 SSL with Graphs: Soft Harmonic Functions
Regularized HS objective with Q = L + γ_g I:
min_{f ∈ R^{n_l+n_u}} Σ_{i=1}^{n_l} (f(x_i) − y_i)^2 + λ f^T Q f
What if we do not really believe that f(x_i) = y_i for all i?
f* = argmin_{f ∈ R^N} (f − y)^T C (f − y) + f^T Q f
C is diagonal with C_ii = c_l for labeled examples and c_u otherwise.
y contains the true label for labeled examples and the pseudo-target 0 otherwise.
41 SSL with Graphs: Soft Harmonic Functions
f* = argmin_{f ∈ R^n} (f − y)^T C (f − y) + f^T Q f
Closed-form soft harmonic solution: f* = (C^{−1} Q + I)^{−1} y
What are the differences between hard and soft? Not much in practice, but provable generalization guarantees exist for the soft one.
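A sketch of the closed-form soft harmonic solution (c_l, c_u, and gamma_g are the knobs from the previous slide; the defaults below are illustrative):

```python
import numpy as np

def soft_harmonic(W, y_l, c_l=1e6, c_u=1e-2, gamma_g=0.0):
    """f* = (C^{-1} Q + I)^{-1} y with Q = L + gamma_g I.
    A large c_l forces f to (almost) match the labels; a small c_u lets
    smoothness override the unlabeled pseudo-targets (zeros)."""
    n, n_l = W.shape[0], len(y_l)
    Q = np.diag(W.sum(axis=1)) - W + gamma_g * np.eye(n)
    c = np.concatenate([np.full(n_l, c_l), np.full(n - n_l, c_u)])
    y = np.concatenate([y_l, np.zeros(n - n_l)])
    return np.linalg.solve(np.diag(1.0 / c) @ Q + np.eye(n), y)
```

In the limit c_l → ∞, c_u → 0 this recovers the hard harmonic solution.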
42 SSL with Graphs: Regularized Harmonic Functions
Larger implications of random walks: the random walk relates to the commute distance, which should satisfy (*): vertices in the same cluster of the graph have a small commute distance, whereas two vertices in different clusters of the graph have a large commute distance.
Do we have this property for the HS? What if N → ∞?
Luxburg/Radl/Hein: Getting lost in space: Large sample analysis of the commute distance
people/luxburg/publications/luxburgradlhein2010_paperandsupplement.pdf
Solutions? 1) γ_g 2) amplified commute distance 3) L^p 4) L...
The goal of these solutions: make them remember!
43 SSL with Graphs: Out-of-sample extension
Both MinCut and HFS only inferred the labels of the unlabeled data; they are transductive.
What if a new point x_{n_l+n_u+1} arrives? (This is also called the out-of-sample extension.)
Option 1) Add it to the graph and recompute the HFS.
Option 2) Make the algorithms inductive!
- Allow f to be defined everywhere: f : X → R
- Allow f(x_i) ≠ y_i. Why? To deal with noise.
Solution: Manifold Regularization
44 SSL with Graphs: Manifold Regularization
General (S)SL objective:
min_f Σ_{i=1}^{n_l} V(x_i, y_i, f(x_i)) + λΩ(f)
We want to control f also for the out-of-sample data, i.e., everywhere:
Ω(f) = λ_2 f^T L f + λ_1 ∫_{x ∈ X} ||∇f(x)||^2 dx
For general kernels:
min_{f ∈ H_K} Σ_{i=1}^{n_l} V(x_i, y_i, f(x_i)) + λ_1 ||f||_K^2 + λ_2 f^T L f
45 SSL with Graphs: Manifold Regularization
f* = argmin_{f ∈ H_K} Σ_{i=1}^{n_l} V(x_i, y_i, f) + λ_1 ||f||_K^2 + λ_2 f^T L f
Representer Theorem for Manifold Regularization: the minimizer f* has a finite expansion of the form
f*(x) = Σ_{i=1}^{n_l+n_u} α_i K(x, x_i)
- V(x, y, f) = (y − f(x))^2 gives LapRLS (Laplacian Regularized Least Squares)
- V(x, y, f) = max(0, 1 − yf(x)) gives LapSVM (Laplacian Support Vector Machines)
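For the squared loss, the expansion coefficients have a closed form; a hedged sketch (the normalization constants of the original LapRLS formulation are absorbed into lam1 and lam2 here):

```python
import numpy as np

def lap_rls(K, L, y_l, lam1, lam2):
    """LapRLS sketch: squared loss + RKHS norm + Laplacian penalty.
    K: (n x n) kernel matrix over labeled+unlabeled points,
    L: graph Laplacian, y_l: labels of the first n_l points.
    Returns expansion coefficients alpha with f(x_j) = (K @ alpha)[j],
    via alpha = (J K + lam1 I + lam2 L K)^{-1} y, where J selects the
    labeled rows and y is the zero-padded label vector."""
    n, n_l = K.shape[0], len(y_l)
    J = np.zeros((n, n))
    J[:n_l, :n_l] = np.eye(n_l)
    y = np.concatenate([y_l, np.zeros(n - n_l)])
    return np.linalg.solve(J @ K + lam1 * np.eye(n) + lam2 * (L @ K), y)
```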
46 SSL with Graphs: Laplacian SVMs
f* = argmin_{f ∈ H_K} Σ_{i=1}^{n_l} max(0, 1 − y_i f(x_i)) + γ_A ||f||_K^2 + γ_I f^T L f
This allows us to learn a function in an RKHS, e.g., with RBF kernels.
47 SSL with Graphs: Laplacian SVMs
48 Checkpoint 1
Semi-supervised learning with graphs:
min_{f ∈ {±1}^{n_l+n_u}} Σ_{i=1}^{n_l} (f(x_i) − y_i)^2 + λ Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))^2
Regularized harmonic solution: f_u = (L_uu + γ_g I)^{−1} (W_ul f_l)
49 Checkpoint 2
Unconstrained regularization in general:
f* = argmin_{f ∈ R^N} (f − y)^T C (f − y) + f^T Q f
Out-of-sample extension: Laplacian SVMs
f* = argmin_{f ∈ H_K} Σ_{i=1}^{n_l} max(0, 1 − y_i f(x_i)) + λ_1 ||f||_K^2 + λ_2 f^T L f
50 Michal Valko
ENS Paris-Saclay, MVA 2017/2018
SequeL team, Inria Lille - Nord Europe
Graphs in Machine Learning
Graphs in Machine Learning Michal Valko INRIA Lille - Nord Europe, France Partially based on material by: Mikhail Belkin, Jerry Zhu, Olivier Chapelle, Branislav Kveton February 10, 2015 MVA 2014/2015 Previous
More informationGraphs in Machine Learning
Graphs in Machine Learning Michal Valko INRIA Lille - Nord Europe, France Partially based on material by: Ulrike von Luxburg, Gary Miller, Doyle & Schnell, Daniel Spielman January 27, 2015 MVA 2014/2015
More informationGraphs in Machine Learning
Graphs in Machine Learning Michal Valko Inria Lille - Nord Europe, France TA: Pierre Perrault Partially based on material by: Ulrike von Luxburg, Gary Miller, Mikhail Belkin October 16th, 2017 MVA 2017/2018
More informationGraphs, Geometry and Semi-supervised Learning
Graphs, Geometry and Semi-supervised Learning Mikhail Belkin The Ohio State University, Dept of Computer Science and Engineering and Dept of Statistics Collaborators: Partha Niyogi, Vikas Sindhwani In
More informationBeyond the Point Cloud: From Transductive to Semi-Supervised Learning
Beyond the Point Cloud: From Transductive to Semi-Supervised Learning Vikas Sindhwani, Partha Niyogi, Mikhail Belkin Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of
More informationGraphs in Machine Learning
Graphs in Machine Learning Michal Valko Inria Lille - Nord Europe, France TA: Pierre Perrault Partially based on material by: Ulrike von Luxburg, Gary Miller, Doyle & Schnell, Daniel Spielman October 10th,
More informationManifold Coarse Graining for Online Semi-supervised Learning
for Online Semi-supervised Learning Mehrdad Farajtabar, Amirreza Shaban, Hamid R. Rabiee, Mohammad H. Rohban Digital Media Lab, Department of Computer Engineering, Sharif University of Technology, Tehran,
More informationManifold Regularization
Manifold Regularization Vikas Sindhwani Department of Computer Science University of Chicago Joint Work with Mikhail Belkin and Partha Niyogi TTI-C Talk September 14, 24 p.1 The Problem of Learning is
More informationLearning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking p. 1/31
Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking Dengyong Zhou zhou@tuebingen.mpg.de Dept. Schölkopf, Max Planck Institute for Biological Cybernetics, Germany Learning from
More informationManifold Regularization
9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,
More informationAnalysis of Spectral Kernel Design based Semi-supervised Learning
Analysis of Spectral Kernel Design based Semi-supervised Learning Tong Zhang IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Rie Kubota Ando IBM T. J. Watson Research Center Yorktown Heights,
More informationConnection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis
Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis Alvina Goh Vision Reading Group 13 October 2005 Connection of Local Linear Embedding, ISOMAP, and Kernel Principal
More informationWhat is semi-supervised learning?
What is semi-supervised learning? In many practical learning domains, there is a large supply of unlabeled data but limited labeled data, which can be expensive to generate text processing, video-indexing,
More informationLaplacian Eigenmaps for Dimensionality Reduction and Data Representation
Introduction and Data Representation Mikhail Belkin & Partha Niyogi Department of Electrical Engieering University of Minnesota Mar 21, 2017 1/22 Outline Introduction 1 Introduction 2 3 4 Connections to
More informationUnsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto
Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian
More informationSemi-Supervised Learning with Graphs. Xiaojin (Jerry) Zhu School of Computer Science Carnegie Mellon University
Semi-Supervised Learning with Graphs Xiaojin (Jerry) Zhu School of Computer Science Carnegie Mellon University 1 Semi-supervised Learning classification classifiers need labeled data to train labeled data
More informationClassification Semi-supervised learning based on network. Speakers: Hanwen Wang, Xinxin Huang, and Zeyu Li CS Winter
Classification Semi-supervised learning based on network Speakers: Hanwen Wang, Xinxin Huang, and Zeyu Li CS 249-2 2017 Winter Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions Xiaojin
More informationHilbert Space Methods in Learning
Hilbert Space Methods in Learning guest lecturer: Risi Kondor 6772 Advanced Machine Learning and Perception (Jebara), Columbia University, October 15, 2003. 1 1. A general formulation of the learning problem
More informationData-dependent representations: Laplacian Eigenmaps
Data-dependent representations: Laplacian Eigenmaps November 4, 2015 Data Organization and Manifold Learning There are many techniques for Data Organization and Manifold Learning, e.g., Principal Component
More informationSpectral Clustering. Zitao Liu
Spectral Clustering Zitao Liu Agenda Brief Clustering Review Similarity Graph Graph Laplacian Spectral Clustering Algorithm Graph Cut Point of View Random Walk Point of View Perturbation Theory Point of
More informationMachine Learning for Data Science (CS4786) Lecture 11
Machine Learning for Data Science (CS4786) Lecture 11 Spectral clustering Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016sp/ ANNOUNCEMENT 1 Assignment P1 the Diagnostic assignment 1 will
More informationA graph based approach to semi-supervised learning
A graph based approach to semi-supervised learning 1 Feb 2011 Two papers M. Belkin, P. Niyogi, and V Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples.
More informationSemi-Supervised Learning with Graphs
Semi-Supervised Learning with Graphs Xiaojin (Jerry) Zhu LTI SCS CMU Thesis Committee John Lafferty (co-chair) Ronald Rosenfeld (co-chair) Zoubin Ghahramani Tommi Jaakkola 1 Semi-supervised Learning classifiers
More informationSemi-Supervised Learning
Semi-Supervised Learning getting more for less in natural language processing and beyond Xiaojin (Jerry) Zhu School of Computer Science Carnegie Mellon University 1 Semi-supervised Learning many human
More informationGraph-Based Semi-Supervised Learning
Graph-Based Semi-Supervised Learning Olivier Delalleau, Yoshua Bengio and Nicolas Le Roux Université de Montréal CIAR Workshop - April 26th, 2005 Graph-Based Semi-Supervised Learning Yoshua Bengio, Olivier
More informationActive and Semi-supervised Kernel Classification
Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),
More informationCSC 411 Lecture 17: Support Vector Machine
CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17
More informationData Analysis and Manifold Learning Lecture 7: Spectral Clustering
Data Analysis and Manifold Learning Lecture 7: Spectral Clustering Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture 7 What is spectral
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 12: Graph Clustering Cho-Jui Hsieh UC Davis May 29, 2018 Graph Clustering Given a graph G = (V, E, W ) V : nodes {v 1,, v n } E: edges
More informationUnsupervised dimensionality reduction
Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional
More informationLearning gradients: prescriptive models
Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan
More informationSupport Vector Machines
Support Vector Machines Reading: Ben-Hur & Weston, A User s Guide to Support Vector Machines (linked from class web page) Notation Assume a binary classification problem. Instances are represented by vector
More informationSpectral Bandits for Smooth Graph Functions with Applications in Recommender Systems
Spectral Bandits for Smooth Graph Functions with Applications in Recommender Systems Tomáš Kocák SequeL team INRIA Lille France Michal Valko SequeL team INRIA Lille France Rémi Munos SequeL team, INRIA
More informationSemi-Supervised Learning in Gigantic Image Collections. Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT)
Semi-Supervised Learning in Gigantic Image Collections Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT) Gigantic Image Collections What does the world look like? High
More informationHou, Ch. et al. IEEE Transactions on Neural Networks March 2011
Hou, Ch. et al. IEEE Transactions on Neural Networks March 2011 Semi-supervised approach which attempts to incorporate partial information from unlabeled data points Semi-supervised approach which attempts
More informationSemi-Supervised Learning by Multi-Manifold Separation
Semi-Supervised Learning by Multi-Manifold Separation Xiaojin (Jerry) Zhu Department of Computer Sciences University of Wisconsin Madison Joint work with Andrew Goldberg, Zhiting Xu, Aarti Singh, and Rob
More informationHow to learn from very few examples?
How to learn from very few examples? Dengyong Zhou Department of Empirical Inference Max Planck Institute for Biological Cybernetics Spemannstr. 38, 72076 Tuebingen, Germany Outline Introduction Part A
More informationCSE 546 Final Exam, Autumn 2013
CSE 546 Final Exam, Autumn 0. Personal info: Name: Student ID: E-mail address:. There should be 5 numbered pages in this exam (including this cover sheet).. You can use any material you brought: any book,
More informationAbout this class. Maximizing the Margin. Maximum margin classifiers. Picture of large and small margin hyperplanes
About this class Maximum margin classifiers SVMs: geometric derivation of the primal problem Statement of the dual problem The kernel trick SVMs as the solution to a regularization problem Maximizing the
More informationSupport Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2
More informationRandom Walks, Random Fields, and Graph Kernels
Random Walks, Random Fields, and Graph Kernels John Lafferty School of Computer Science Carnegie Mellon University Based on work with Avrim Blum, Zoubin Ghahramani, Risi Kondor Mugizi Rwebangira, Jerry
More informationSelf-Tuning Semantic Image Segmentation
Self-Tuning Semantic Image Segmentation Sergey Milyaev 1,2, Olga Barinova 2 1 Voronezh State University sergey.milyaev@gmail.com 2 Lomonosov Moscow State University obarinova@graphics.cs.msu.su Abstract.
More informationGraph-Based Anomaly Detection with Soft Harmonic Functions
Graph-Based Anomaly Detection with Soft Harmonic Functions Michal Valko Advisor: Milos Hauskrecht Computer Science Department, University of Pittsburgh, Computer Science Day 2011, March 18 th, 2011. Anomaly
More informationCOMS 4721: Machine Learning for Data Science Lecture 20, 4/11/2017
COMS 4721: Machine Learning for Data Science Lecture 20, 4/11/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University SEQUENTIAL DATA So far, when thinking
More informationData Analysis and Manifold Learning Lecture 3: Graphs, Graph Matrices, and Graph Embeddings
Data Analysis and Manifold Learning Lecture 3: Graphs, Graph Matrices, and Graph Embeddings Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline
More informationMLCC 2017 Regularization Networks I: Linear Models
MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational
More informationNonlinear Dimensionality Reduction. Jose A. Costa
Nonlinear Dimensionality Reduction Jose A. Costa Mathematics of Information Seminar, Dec. Motivation Many useful of signals such as: Image databases; Gene expression microarrays; Internet traffic time
More informationOnline Manifold Regularization: A New Learning Setting and Empirical Study