Graphs in Machine Learning
1 Graphs in Machine Learning
Michal Valko
INRIA Lille - Nord Europe, France
Partially based on material by: Mikhail Belkin, Jerry Zhu, Olivier Chapelle, Branislav Kveton
February 10, 2015, MVA 2014/2015
2 Previous Lecture
- resistive networks: recommendation score as a resistance?
- Laplacian and resistive networks: computation of effective resistance
- geometry of the data and the connectivity
- spectral clustering: connectivity vs. compactness; MinCut, RatioCut, NCut; spectral relaxations
- manifold learning
Michal Valko Graphs in Machine Learning Lecture 4-2/43
3 This Lecture
- manifold learning with Laplacian eigenmaps
- Gaussian random fields and harmonic solution
- graph-based semi-supervised learning and manifold regularization
- theory of Laplacian-based manifold methods
- transductive learning
- SSL learnability
4 Previous Lab Session
- content: graph construction; test sensitivity to the parameters σ, k, ε
- spectral clustering; spectral clustering vs. k-means; image segmentation
- short written report (graded, each lab around 5% of the grade)
- hint: order 2.1, 2.6 (find the bend), 2.2, 2.3, 2.4, 2.5
- questions / deadline: ~calandri/ta/graphs/td1_handout.pdf
5 Advanced Learning for Text and Graph Data (ALTeGraD)
- time: Wednesdays 8h30-11h30; 4 lectures and 3 labs
- place: Polytechnique / Amphi Sauvy
- lecturer 1: Michalis Vazirgiannis (Polytechnique)
- lecturer 2: Yassine Faihe (Hewlett-Packard - Vertica)
- ALTeGraD and Graphs in ML run in parallel; the two graph courses are coordinated to be complementary
- some of its graph topics are not covered in this course: ranking algorithms and measures (Kendall tau, NDCG); advanced graph generators; community mining and advanced graph clustering; graph degeneracy (k-core & extensions); privacy in graph mining
- contenus-/advanced-learning-for-text-and-graph-data-altegrad kjsp?rh=
6 PhD proposal at CMU and JIE
SYSU-CMU Joint Institute of Engineering (JIE) in Guangzhou, China: international environment, English working language.
Fully-funded PhD positions available at SYSU-CMU JIE:
- single-degree program at SYSU in Guangzhou, China
- double-degree program (selective): 2 years at CMU, Pittsburgh; rest of the time at JIE in Guangzhou, China
Fundamental research with applications in: supercomputing & big data, biomedical applications, autonomous driving, smart grids and power systems.
Contact: paweng@cmu.edu (A New Engineering School)
7 Manifold Learning: Recap
Problem: dimensionality reduction / manifold learning. Given {x_i}_{i=1}^n from R^d, find {y_i}_{i=1}^n in R^m, where m ≪ d.
What do we know about dimensionality reduction?
- representation/visualization (2D or 3D); an old example: globe to a map
- often assuming M ⊆ R^d
- feature extraction
- linear vs. nonlinear dimensionality reduction
What do we know about linear vs. nonlinear methods?
- linear: ICA, PCA, SVD, ...
- nonlinear methods often preserve only local distances
8 Manifold Learning: Linear vs. Non-linear
9 Manifold Learning: Preserving (just) local distances
d(y_i, y_j) = d(x_i, x_j) only if d(x_i, x_j) is small
min_y Σ_{ij} w_ij ‖y_i − y_j‖²
Looks familiar?
10 Manifold Learning: Laplacian Eigenmaps
Step 1: Solve the generalized eigenproblem Lf = λDf.
Step 2: Assign m new coordinates: x_i ↦ (f_2(i), ..., f_{m+1}(i)).
Note 1: we need the m smallest (non-trivial) eigenvectors.
Note 2: f_1 is useless (the constant eigenvector).
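The two steps above fit in a few lines of numpy/scipy; a minimal sketch, where the helper name and the use of `scipy.linalg.eigh` for the generalized eigenproblem are this sketch's choices, not the slides':

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(W, m):
    """Embed the n nodes of a graph with weight matrix W into R^m."""
    D = np.diag(W.sum(axis=1))
    L = D - W                       # unnormalized graph Laplacian
    # Step 1: generalized eigenproblem L f = lambda D f, eigenvalues ascending
    _, vecs = eigh(L, D)
    # Step 2: drop the constant eigenvector f_1 and keep the next m
    return vecs[:, 1:m + 1]
```

On a k-NN graph built from points sampled along a curve, the first returned coordinate typically recovers the ordering of the points along the curve.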
11 Manifold Learning: Laplacian Eigenmaps to 1D
Laplacian eigenmaps 1D objective:
min_f f^T L f  s.t.  f_i ∈ R,  f^T D 1 = 0,  f^T D f = 1
The meaning of the constraints is similar as for spectral clustering:
- f^T D f = 1 is for scaling
- f^T D 1 = 0 is to not get v_1
What is the solution?
12 Manifold Learning: Example
laplacian-eigenmap-~-diffusion-map-~-manifold-learning
13 Semi-supervised learning: How is it possible? This is how children learn! (hypothesis)
14 Semi-supervised learning (SSL)
SSL problem: definition. Given {x_i}_{i=1}^n from R^d and {y_i}_{i=1}^{n_l}, with n_l ≪ n, find {y_i}_{i=n_l+1}^n (transductive) or find f predicting y well beyond that (inductive).
Some facts about SSL:
- assumes that the unlabeled data is useful
- works with data geometry assumptions: cluster assumption (low-density separation), manifold assumption, smoothness assumptions, generative models, ...
- now it helps, now it does not (sic); there are provable cases when it helps
- inductive or transductive/out-of-sample extension
15 SSL: Self-Training
16 SSL: Overview: Self-Training
Input: L = {x_i, y_i}_{i=1}^{n_l} and U = {x_i}_{i=n_l+1}^n
Repeat: train f using L; apply f to (some of) U and add them to L.
What are the properties of self-training?
- it's a wrapper method
- heavily depends on the internal classifier
- some theory exists for specific classifiers
- nobody uses it anymore
- errors propagate (unless the clusters are well separated)
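The wrapper loop above can be sketched as follows; the nearest-centroid base classifier and the distance-gap confidence rule are illustrative assumptions, not the slides' prescription:

```python
import numpy as np

def self_train(X_l, y_l, X_u, n_rounds=5):
    """Self-training: fit on L, move the most confident points of U into L."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(n_rounds):
        # "train f using L": here f is a nearest-centroid classifier (classes 0/1)
        centroids = np.array([X_l[y_l == c].mean(axis=0) for c in (0, 1)])
        if len(X_u) == 0:
            break
        # "apply f to (some) U": distances to both centroids
        d = np.linalg.norm(X_u[:, None, :] - centroids[None, :, :], axis=2)
        pred = d.argmin(axis=1)
        # confidence = gap between the two distances; absorb the top half
        gap = np.abs(d[:, 0] - d[:, 1])
        take = gap >= np.median(gap)
        X_l, y_l = np.vstack([X_l, X_u[take]]), np.append(y_l, pred[take])
        X_u = X_u[~take]
    return centroids
```

With well-separated clusters the loop behaves well; with overlapping clusters early mistakes get absorbed into L and propagate, which is exactly the failure mode noted above.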
17 SSL: Self-Training: Bad Case
18 SSL: Transductive SVM: S3VM
19 SSL: Transductive SVM: Classical SVM
Linear case: f = w^T x + b; we look for (w, b).
max-margin classification:
max_{w,b} 1/‖w‖  s.t.  y_i(w^T x_i + b) ≥ 1,  i = 1, ..., n_l
max-margin classification, equivalently:
min_{w,b} ‖w‖²  s.t.  y_i(w^T x_i + b) ≥ 1,  i = 1, ..., n_l
20 SSL: Transductive SVM: Classical SVM
max-margin classification, separable case:
min_{w,b} ‖w‖²  s.t.  y_i(w^T x_i + b) ≥ 1,  i = 1, ..., n_l
max-margin classification, non-separable case:
min_{w,b} λ‖w‖² + Σ_i ξ_i  s.t.  y_i(w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n_l
21 SSL: Transductive SVM: Classical SVM
max-margin classification, non-separable case:
min_{w,b} λ‖w‖² + Σ_i ξ_i  s.t.  y_i(w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n_l
Unconstrained formulation:
min_{w,b} Σ_{i=1}^{n_l} max(1 − y_i(w^T x_i + b), 0) + λ‖w‖²
In general:
min_{w,b} Σ_{i=1}^{n_l} V(x_i, y_i, f(x_i)) + λ Ω(f)
22 SSL: Transductive SVM: Unlabeled Examples
min_{w,b} Σ_{i=1}^{n_l} max(1 − y_i(w^T x_i + b), 0) + λ‖w‖²
How to incorporate unlabeled examples? There are no y's for the unlabeled x.
Prediction of f for (any) x: ŷ = sgn(f(x)) = sgn(w^T x + b).
Pretending that sgn(f(x)) is the true label:
V(x, ŷ, f(x)) = max(1 − ŷ(w^T x + b), 0) = max(1 − sgn(w^T x + b)(w^T x + b), 0) = max(1 − |w^T x + b|, 0)
23 SSL: Transductive SVM: Hinge and Hat Loss
What is the difference in the objectives?
- hinge loss penalizes: the margin of being on the wrong side
- hat loss penalizes: predicting in the margin
24 SSL: Transductive SVM: S3VM
This is what we wanted!
25 SSL: Transductive SVM: Formulation
Main SVM idea stays: penalize the margin.
min_{w,b} Σ_{i=1}^{n_l} max(1 − y_i(w^T x_i + b), 0) + λ_1‖w‖² + λ_2 Σ_{i=n_l+1}^{n_l+n_u} max(1 − |w^T x_i + b|, 0)
What is the loss and what is the regularizer? Think of the unlabeled data as a regularizer for your classifier!
Practical hint: additionally enforce the class balance.
Another problem: the optimization is difficult.
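The two-part objective can be written down directly; this sketch only evaluates it for given (w, b) (the hard non-convex minimization over w and b is exactly the "optimization is difficult" part):

```python
import numpy as np

def s3vm_objective(w, b, X_l, y_l, X_u, lam1=1.0, lam2=1.0):
    """S3VM objective: hinge loss on labeled points, hat loss on unlabeled
    points, plus the margin regularizer lam1 * ||w||^2."""
    f_l = X_l @ w + b
    f_u = X_u @ w + b
    hinge = np.maximum(1 - y_l * f_l, 0.0).sum()   # wrong side of the margin
    hat = np.maximum(1 - np.abs(f_u), 0.0).sum()   # prediction inside the margin
    return hinge + lam1 * (w @ w) + lam2 * hat
```

The hat term pushes the decision boundary into low-density regions of the unlabeled data, which is why a class-balance constraint is added in practice (otherwise the trivial "everything on one side" solutions score well).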
26 SSL with Graphs: Prehistory
Blum/Chawla: Learning from Labeled and Unlabeled Data using Graph Mincuts
(*following some insights from vision research in the 1980s)
27 SSL with Graphs: MinCut
MinCut SSL: an idea similar to MinCut clustering.
Where is the link? Connected classes, not necessarily compact ones.
What is the formal statement? We look for f(x) ∈ {±1}:
cut = Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))² = Ω(f)
Why (f(x_i) − f(x_j))² and not |f(x_i) − f(x_j)|? It does not matter: on ±1 values the two differ only by a constant factor.
28 SSL with Graphs: MinCut
We look for f(x) ∈ {±1}:
Ω(f) = Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))²
Clustering was unsupervised; here we have supervised data. Recall the general objective framework:
min_f Σ_{i=1}^{n_l} V(x_i, y_i, f(x_i)) + λ Ω(f)
It would be nice if we matched the prediction on the labeled data:
V(x, y, f(x)) = (f(x) − y)²
29 SSL with Graphs: MinCut
Final objective function:
min_{f ∈ {±1}^{n_l+n_u}} Σ_{i=1}^{n_l} (f(x_i) − y_i)² + λ Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))²
This is an integer program :( Can we solve it? It is still just MinCut.
Are we happy? There are six solutions, all equivalent. We need a better way to reflect the confidence.
30 SSL with Graphs: Harmonic Functions
Zhu/Ghahramani/Lafferty: Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions
(*a seminal paper that convinced people to use graphs for SSL)
Idea 1: Look for a unique solution. Idea 2: Find a smooth one (the harmonic solution).
Harmonic SSL:
1) As before, we constrain f to match the supervised data: f(x_i) = y_i for i ∈ {1, ..., n_l}.
2) We enforce the solution f to be harmonic:
f(x_i) = (Σ_{j∼i} f(x_j) w_ij) / (Σ_{j∼i} w_ij)  for i ∈ {n_l+1, ..., n_l+n_u}
31 SSL with Graphs: Harmonic Functions
The harmonic solution is obtained from the MinCut one ...
min_{f ∈ {±1}^{n_l+n_u}} Σ_{i=1}^{n_l} (f(x_i) − y_i)² + λ Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))²
... if we just relax the integer constraints to be real ...
min_{f ∈ R^{n_l+n_u}} Σ_{i=1}^{n_l} (f(x_i) − y_i)² + λ Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))²
... or equivalently (note that f(x_i) = f_i) ...
min_{f ∈ R^{n_l+n_u}} Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))²  s.t.  y_i = f(x_i),  i = 1, ..., n_l
32 SSL with Graphs: Harmonic Functions
Properties of the relaxation from ±1 to R:
- there is a closed-form solution for f
- this solution is unique and globally optimal
- it is either constant or has its maximum/minimum on a boundary
- f(x_i) may not be discrete, but we can threshold it
- random walk interpretation
- electric networks interpretation
33 SSL with Graphs: Harmonic Functions
Random walk interpretation:
1) start from the vertex to label and follow the random walk with transition matrix P = D^{-1} W, i.e., P(j | i) = w_ij / Σ_k w_ik
2) finish when a labeled vertex is hit (absorbing random walk)
f_i = probability of reaching a positively labeled vertex
34 SSL with Graphs: Harmonic Functions
How to compute the HS? Option A: iteration/propagation.
Step 1: Set f(x_i) = y_i for i = 1, ..., n_l.
Step 2: Propagate iteratively (only for the unlabeled nodes):
f(x_i) ← (Σ_{j∼i} f(x_j) w_ij) / (Σ_{j∼i} w_ij)  for i ∈ {n_l+1, ..., n_l+n_u}
Properties:
- this will converge to the harmonic solution
- we can set the initial values for unlabeled nodes arbitrarily
- an interesting option for large-scale data
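Option A in a few lines of numpy; the labeled-first node ordering and the fixed iteration count are assumptions of this sketch:

```python
import numpy as np

def harmonic_propagation(W, y_l, n_iter=200):
    """Iterate toward the harmonic solution: clamp the n_l labeled nodes and
    repeatedly replace each unlabeled value by its weighted neighbor average."""
    n, n_l = W.shape[0], len(y_l)
    f = np.zeros(n)                # initial unlabeled values are arbitrary
    f[:n_l] = y_l                  # Step 1: labeled values are fixed
    deg = W.sum(axis=1)
    for _ in range(n_iter):
        # Step 2: propagate, updating the unlabeled entries only
        f[n_l:] = (W @ f)[n_l:] / deg[n_l:]
    return f
```

Each sweep costs one sparse matrix-vector product, which is why this is the interesting option for large-scale data.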
35 SSL with Graphs: Harmonic Functions
How to compute the HS? Option B: closed-form solution.
Define f = (f(x_1), ..., f(x_{n_l+n_u})) = (f_1, ..., f_{n_l+n_u}).
Ω(f) = Σ_{i,j=1}^{n_l+n_u} w_ij (f(x_i) − f(x_j))² = f^T L f
L is an (n_l + n_u) × (n_l + n_u) matrix:
L = [ L_ll  L_lu ]
    [ L_ul  L_uu ]
How to solve this constrained minimization problem? Yes, Lagrange multipliers are an option, but ...
36 SSL with Graphs: Harmonic Functions
Let us compute the harmonic solution using the harmonic property!
How did we formalize the harmonic property of a circuit? (Lf)_u = 0.
In matrix notation:
[ L_ll  L_lu ] [ f_l ]   [ 0_l ]
[ L_ul  L_uu ] [ f_u ] = [ 0_u ]
f_l is constrained to be y_l, and for f_u we get
L_ul f_l + L_uu f_u = 0_u  ⟹  f_u = L_uu^{-1}(−L_ul f_l) = L_uu^{-1}(W_ul f_l).
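Option B is a single linear solve; the labeled-first node ordering is an assumption of this sketch:

```python
import numpy as np

def harmonic_solution(W, y_l):
    """Closed-form harmonic solution f_u = L_uu^{-1} (W_ul f_l)."""
    n_l = len(y_l)
    L = np.diag(W.sum(axis=1)) - W        # graph Laplacian L = D - W
    L_uu = L[n_l:, n_l:]                  # unlabeled-unlabeled block
    W_ul = W[n_l:, :n_l]                  # unlabeled-labeled block (= -L_ul)
    return np.linalg.solve(L_uu, W_ul @ y_l)
```

`np.linalg.solve` is preferable to forming `L_uu^{-1}` explicitly; for large sparse graphs one would use a sparse solver instead.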
37 SSL with Graphs: Harmonic Functions
Can we see that this calculates the probability of a random walk?
f_u = L_uu^{-1}(−L_ul f_l) = L_uu^{-1}(W_ul f_l)
Note that P = D^{-1} W. Then, equivalently, f_u = (I − P_uu)^{-1} P_ul f_l.
Split the equation into a positive and a negative part:
f_i = [(I − P_uu)^{-1} P_ul f_l]_i = Σ_{j: y_j = +1} [(I − P_uu)^{-1} P_ul]_ij − Σ_{j: y_j = −1} [(I − P_uu)^{-1} P_ul]_ij = p_i^{(+1)} − p_i^{(−1)}
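A quick numerical check, on a small hypothetical chain graph with the labeled nodes listed first, that the Laplacian form and the absorbing-random-walk form agree:

```python
import numpy as np

# Chain 0 - 2 - 3 - 1, where nodes 0 and 1 are labeled +1 and -1.
W = np.array([[0., 0., 1., 0.],
              [0., 0., 0., 1.],
              [1., 0., 0., 1.],
              [0., 1., 1., 0.]])
f_l = np.array([1., -1.])
n_l = 2
D = np.diag(W.sum(axis=1))
L = D - W
P = np.linalg.inv(D) @ W                           # P = D^{-1} W
f_u_laplacian = np.linalg.solve(L[n_l:, n_l:], W[n_l:, :n_l] @ f_l)
f_u_walk = np.linalg.solve(np.eye(2) - P[n_l:, n_l:], P[n_l:, :n_l] @ f_l)
assert np.allclose(f_u_laplacian, f_u_walk)        # both give (1/3, -1/3)
```

The identity follows by multiplying L_uu f_u = W_ul f_l on the left by D_uu^{-1}, since D_uu^{-1} L_uu = I − P_uu and D_uu^{-1} W_ul = P_ul.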
38 SSL with Graphs: Regularized Harmonic Functions
f_i = |f_i| sgn(f_i): |f_i| is the confidence, sgn(f_i) the label.
What if a nasty outlier sneaks in? The prediction for the outlier can be hyperconfident :(
How to control the confidence of the inference? We add a sink to the graph: allow the random walk to die!
sink = an artificial label node with value 0, connected to every other vertex.
What will this do to our predictions? It depends on the weight of the sink edges.
39 SSL with Graphs: Regularized Harmonic Functions
How do we compute this regularized random walk?
f_u = (L_uu + γ_g I)^{-1} (W_ul f_l)
How does γ_g influence the HS? What happens to sneaky outliers?
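The sink only changes the linear system by γ_g on the diagonal; a minimal sketch, again assuming the labeled nodes come first (the γ_g values used in the usage note are illustrative):

```python
import numpy as np

def regularized_harmonic(W, y_l, gamma_g):
    """Regularized harmonic solution f_u = (L_uu + gamma_g I)^{-1} (W_ul f_l)."""
    n_l = len(y_l)
    n_u = W.shape[0] - n_l
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.solve(L[n_l:, n_l:] + gamma_g * np.eye(n_u),
                           W[n_l:, :n_l] @ y_l)
```

As γ_g grows, more of the walk's probability mass dies in the sink, so |f_u| shrinks toward 0: a weakly connected outlier can no longer be labeled with high confidence.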
40 SSL with Graphs: Soft Harmonic Functions
Regularized HS objective with Q = L + γ_g I:
min_{f ∈ R^{n_l+n_u}} Σ_{i=1}^{n_l} (f(x_i) − y_i)² + λ f^T Q f
What if we do not really believe that f(x_i) = y_i for all i?
f* = arg min_{f ∈ R^n} (f − y)^T C (f − y) + f^T Q f
C is diagonal with C_ii = c_l for labeled examples and c_u otherwise.
y are pseudo-targets: y_i = the true label for labeled examples, 0 otherwise.
41 SSL with Graphs: Soft Harmonic Functions
f* = arg min_{f ∈ R^n} (f − y)^T C (f − y) + f^T Q f
Closed-form soft harmonic solution: f* = (C^{-1} Q + I)^{-1} y
What are the differences between hard and soft? Not much in practice, but there are provable generalization guarantees for the soft version.
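The soft solution is again one linear solve, since setting the gradient 2C(f − y) + 2Qf to zero gives (C + Q)f = Cy, i.e., f* = (C^{-1}Q + I)^{-1} y. A minimal sketch; the c_l, c_u, γ_g defaults and the labeled-first ordering are illustrative assumptions:

```python
import numpy as np

def soft_harmonic(W, y, n_l, c_l=1.0, c_u=0.01, gamma_g=0.01):
    """Soft harmonic solution f* = (C^{-1} Q + I)^{-1} y with Q = L + gamma_g I;
    y holds true labels for the first n_l entries and 0 pseudo-targets elsewhere."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    Q = L + gamma_g * np.eye(n)
    c = np.where(np.arange(n) < n_l, c_l, c_u)     # diagonal of C
    return np.linalg.solve(np.diag(1.0 / c) @ Q + np.eye(n), y)
```

Choosing c_l ≫ c_u recovers nearly hard label constraints, while a small c_l lets the solution disagree with noisy labels.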
42 SSL with Graphs: Regularized Harmonic Functions
Larger implications of random walks: the random walk relates to the commute distance, which should satisfy (*):
(*) Vertices in the same cluster of the graph have a small commute distance, whereas two vertices in different clusters of the graph have a large commute distance.
Do we have this property for the HS? What if n → ∞?
Luxburg/Radl/Hein: Getting lost in space: Large sample analysis of the commute distance
people/luxburg/publications/luxburgradlhein2010_paperandsupplement.pdf
Solutions? 1) γ_g  2) amplified commute distance  3) L^p  4) L...
The goal of these solutions: make them remember!
43 SequeL INRIA Lille MVA 2014/2015 Michal Valko sequel.lille.inria.fr
More informationNonlinear Dimensionality Reduction
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Kernel PCA 2 Isomap 3 Locally Linear Embedding 4 Laplacian Eigenmap
More informationNon-linear Support Vector Machines
Non-linear Support Vector Machines Andrea Passerini passerini@disi.unitn.it Machine Learning Non-linear Support Vector Machines Non-linearly separable problems Hard-margin SVM can address linearly separable
More informationIntroduction to Spectral Graph Theory and Graph Clustering
Introduction to Spectral Graph Theory and Graph Clustering Chengming Jiang ECS 231 Spring 2016 University of California, Davis 1 / 40 Motivation Image partitioning in computer vision 2 / 40 Motivation
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationNonlinear Dimensionality Reduction. Jose A. Costa
Nonlinear Dimensionality Reduction Jose A. Costa Mathematics of Information Seminar, Dec. Motivation Many useful of signals such as: Image databases; Gene expression microarrays; Internet traffic time
More informationLimits of Spectral Clustering
Limits of Spectral Clustering Ulrike von Luxburg and Olivier Bousquet Max Planck Institute for Biological Cybernetics Spemannstr. 38, 72076 Tübingen, Germany {ulrike.luxburg,olivier.bousquet}@tuebingen.mpg.de
More informationSelf-Tuning Semantic Image Segmentation
Self-Tuning Semantic Image Segmentation Sergey Milyaev 1,2, Olga Barinova 2 1 Voronezh State University sergey.milyaev@gmail.com 2 Lomonosov Moscow State University obarinova@graphics.cs.msu.su Abstract.
More informationKernel Methods and Support Vector Machines
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More informationMLCC 2017 Regularization Networks I: Linear Models
MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational
More informationSupport Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2
More informationPAC Learning Introduction to Machine Learning. Matt Gormley Lecture 14 March 5, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University PAC Learning Matt Gormley Lecture 14 March 5, 2018 1 ML Big Picture Learning Paradigms:
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationKernel Methods. Barnabás Póczos
Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationFace Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi
Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi Overview Introduction Linear Methods for Dimensionality Reduction Nonlinear Methods and Manifold
More informationCS540 ANSWER SHEET
CS540 ANSWER SHEET Name Email 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 1 2 Final Examination CS540-1: Introduction to Artificial Intelligence Fall 2016 20 questions, 5 points
More informationSupport Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification
More informationPhysical Metaphors for Graphs
Graphs and Networks Lecture 3 Physical Metaphors for Graphs Daniel A. Spielman October 9, 203 3. Overview We will examine physical metaphors for graphs. We begin with a spring model and then discuss resistor
More informationECE521 Lecture7. Logistic Regression
ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard
More informationSoft-margin SVM can address linearly separable problems with outliers
Non-linear Support Vector Machines Non-linearly separable problems Hard-margin SVM can address linearly separable problems Soft-margin SVM can address linearly separable problems with outliers Non-linearly
More informationA short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie
A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab
More informationAbout this class. Maximizing the Margin. Maximum margin classifiers. Picture of large and small margin hyperplanes
About this class Maximum margin classifiers SVMs: geometric derivation of the primal problem Statement of the dual problem The kernel trick SVMs as the solution to a regularization problem Maximizing the
More informationLearning gradients: prescriptive models
Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan
More informationSupport Vector Machine. Industrial AI Lab.
Support Vector Machine Industrial AI Lab. Classification (Linear) Autonomously figure out which category (or class) an unknown item should be categorized into Number of categories / classes Binary: 2 different
More informationHYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH
HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi
More informationECE-271B. Nuno Vasconcelos ECE Department, UCSD
ECE-271B Statistical ti ti Learning II Nuno Vasconcelos ECE Department, UCSD The course the course is a graduate level course in statistical learning in SLI we covered the foundations of Bayesian or generative
More informationLearning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute
More informationData dependent operators for the spatial-spectral fusion problem
Data dependent operators for the spatial-spectral fusion problem Wien, December 3, 2012 Joint work with: University of Maryland: J. J. Benedetto, J. A. Dobrosotskaya, T. Doster, K. W. Duke, M. Ehler, A.
More informationUnsupervised dimensionality reduction
Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional
More informationKernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning
Kernel Machines Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 SVM linearly separable case n training points (x 1,, x n ) d features x j is a d-dimensional vector Primal problem:
More information