Random Walks, Random Fields, and Graph Kernels

Random Walks, Random Fields, and Graph Kernels
John Lafferty, School of Computer Science, Carnegie Mellon University
Based on work with Avrim Blum, Zoubin Ghahramani, Risi Kondor, Mugizi Rwebangira, and Jerry Zhu

Outline: Graph Kernels, Random Fields, Random Walks, Continuous Fields

Using a Kernel
$\hat{f}(x) = \sum_{i=1}^N \alpha_i y_i \langle x, x_i \rangle$
$\hat{f}(x) = \sum_{i=1}^N \alpha_i y_i K(x, x_i)$
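
The dual form above only ever touches the data through kernel evaluations. As a minimal illustration (not code from the talk; the RBF kernel, the toy points, and the dual coefficients below are stand-ins), a kernel classifier is evaluated as:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_function(x, train_x, train_y, alpha, kernel=rbf_kernel):
    """f_hat(x) = sum_i alpha_i y_i K(x, x_i); its sign is the predicted label."""
    return sum(a * y * kernel(x, xi) for a, y, xi in zip(alpha, train_y, train_x))

# Toy usage with two training points and hypothetical dual coefficients
# (in practice alpha would come from training, e.g. an SVM solver).
train_x = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
train_y = [+1, -1]
alpha = [0.7, 0.7]
print(np.sign(decision_function(np.array([0.1, 0.0]), train_x, train_y, alpha)))  # +1
```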

The Kernel Trick
$K(x, x')$ positive semidefinite: $\int_X \int_X f(x) f(x') K(x, x')\, dx\, dx' \ge 0$
Taking the feature space of functions $\mathcal{F} = \{\Phi(x) = K(\cdot, x),\ x \in X\}$, $K$ has the reproducing property $g(x) = \langle K(\cdot, x), g \rangle$.
$\langle \Phi(x), \Phi(x') \rangle = \langle K(\cdot, x), K(\cdot, x') \rangle = K(x, x')$
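
On a finite set, such as the vertices of a graph, these statements become plain linear algebra: any symmetric positive semidefinite matrix is a kernel, the RKHS inner product is $\langle f, g \rangle_{H_K} = f^\top K^{-1} g$, and the reproducing property can be checked directly. A small sketch (the random Gram matrix is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Any symmetric positive (semi)definite matrix is a kernel on a finite set;
# here a random Gram matrix with a small ridge to keep it invertible.
A = rng.standard_normal((5, 5))
K = A @ A.T + 1e-6 * np.eye(5)

# RKHS inner product on a finite set: <f, g>_{H_K} = f^T K^{-1} g.
K_inv = np.linalg.inv(K)
g = rng.standard_normal(5)
x = 2
# Reproducing property: <K(., x), g>_{H_K} = g(x).
print(np.isclose(K[:, x] @ K_inv @ g, g[x]))   # True
```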

Structured Data
What if data lies on a graph or other data structure? [Figures: a hyperlink graph with nodes Google, Cornell, CMU, foobar.com, NSF; a parse tree (S, N, VP, V, ...) for "time flies like ..."]

Combinatorial Laplacian
Think of edge $e$ as a tangent vector at $e^-$. For $f : V \to \mathbb{R}$, $df : E \to \mathbb{R}$ is the 1-form $df(e) = f(e^+) - f(e^-)$.
Then $\Delta = d^* d$ (as a matrix) is the discrete analogue of $-\operatorname{div} \nabla$.

Combinatorial Laplacian
It is an averaging operator:
$\Delta f(x) = \sum_{y \sim x} w_{xy} (f(x) - f(y)) = d(x) f(x) - \sum_{y \sim x} w_{xy} f(y)$
We say $f$ is harmonic if $\Delta f = 0$. Since $\langle \Delta f, g \rangle = \langle df, dg \rangle$, $\Delta$ is self-adjoint and positive.
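
To make the two equivalent forms concrete, a small numpy sketch (the three-node weighted path below is made up) builds $\Delta$ both as $d^\top W d$ from the incidence matrix and as $D - W$, and applies the averaging form:

```python
import numpy as np

# Weighted path graph on 3 vertices: edges (0,1) with weight 2 and (1,2) with weight 1.
W = np.array([[0.0, 2.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

# Incidence matrix d (one row per oriented edge): (df)(e) = f(e+) - f(e-).
d = np.array([[-1.0,  1.0, 0.0],    # edge 0 -> 1
              [ 0.0, -1.0, 1.0]])   # edge 1 -> 2
w_e = np.array([2.0, 1.0])          # edge weights

Delta_from_forms   = d.T @ np.diag(w_e) @ d      # Delta = d* d (with edge weights)
Delta_from_degrees = np.diag(W.sum(axis=1)) - W  # Delta = D - W
assert np.allclose(Delta_from_forms, Delta_from_degrees)

# Averaging-operator form: (Delta f)(x) = d(x) f(x) - sum_y w_xy f(y).
f = np.array([1.0, 0.5, -1.0])
print(Delta_from_degrees @ f)
```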

Diffusion Kernels on Graphs (Kondor and Lafferty, 2002)
If $\Delta$ is the graph Laplacian, in analogy with the continuous setting,
$\frac{\partial}{\partial t} K_t = -\Delta K_t$
is the heat equation on a graph. The solution $K_t = e^{-t\Delta}$ is the diffusion kernel.
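
A minimal numerical sketch (the 4-cycle graph and the value of $t$ are arbitrary choices, not from the talk) of computing the diffusion kernel by matrix exponentiation:

```python
import numpy as np
from scipy.linalg import expm

# Unweighted 4-cycle and its combinatorial Laplacian Delta = D - W.
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
Delta = np.diag(W.sum(axis=1)) - W

t = 0.5
K_t = expm(-t * Delta)   # diffusion (heat) kernel K_t = exp(-t * Delta)

# K_t is symmetric positive definite, and each row sums to 1 because
# Delta annihilates the constant function (heat flow preserves constants).
print(np.allclose(K_t, K_t.T), np.allclose(K_t.sum(axis=1), 1.0))
```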

Physical Interpretation
$(\partial_t + \Delta) K = 0$, initial condition $\delta_x(y)$:
$e^{-t\Delta} f(x) = \int_M K_t(x, y)\, f(y)\, dy$
For a kernel-based classifier $\hat{y}(x) = \sum_i \alpha_i y_i K_t(x_i, x)$, the decision function is given by heat flow with initial condition
$f(x) = \begin{cases} \alpha_i & x = x_i \text{ positive labeled data} \\ -\alpha_i & x = x_i \text{ negative labeled data} \\ 0 & \text{otherwise} \end{cases}$

RKHS Representation
The general spectral representation of a kernel as $K(x, y) = \sum_{i=1}^n \lambda_i \phi_i(x) \phi_i(y)$ leads to the reproducing kernel Hilbert space with
$\langle \textstyle\sum_i a_i \phi_i, \sum_i b_i \phi_i \rangle_{H_K} = \sum_i \frac{a_i b_i}{\lambda_i}$
For the diffusion kernel, the RKHS inner product is
$\langle f, g \rangle_{H_K} = \sum_i e^{t\mu_i} \hat{f}_i \hat{g}_i$
Interpretation: functions with small norm don't oscillate rapidly on the graph.
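
The diffusion-kernel RKHS norm can be computed directly from the Laplacian eigensystem: components along high-frequency eigenvectors (large $\mu_i$) are penalized exponentially. A small sketch (same arbitrary 4-cycle and $t$ as above):

```python
import numpy as np

# Same arbitrary 4-cycle as above.
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
Delta = np.diag(W.sum(axis=1)) - W

mu, Phi = np.linalg.eigh(Delta)   # eigenvalues mu_i, eigenvectors phi_i (as columns)
t = 0.5

def rkhs_norm_sq(f):
    """||f||^2_{H_K} = sum_i exp(t * mu_i) * f_hat_i^2 for the diffusion kernel."""
    f_hat = Phi.T @ f             # graph Fourier coefficients
    return float(np.sum(np.exp(t * mu) * f_hat ** 2))

smooth = np.array([1.0, 1.0, 1.0, 1.0])     # constant: no oscillation, small norm
rough  = np.array([1.0, -1.0, 1.0, -1.0])   # alternating: fast oscillation, large norm
print(rkhs_norm_sq(smooth), rkhs_norm_sq(rough))
```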

Building Up Kernels
If $K_t^{(i)}$ are kernels on $X_i$, then $K_t = \bigotimes_{i=1}^n K_t^{(i)}$ is a kernel on $X_1 \times \cdots \times X_n$.
For the hypercube: $K_t(x, x') \propto (\tanh t)^{d(x, x')}$, where $d(x, x')$ is the Hamming distance.
Similar kernels apply to standard categorical data. Other graphs with explicit diffusion kernels: infinite trees (Chung & Yau, 1999), rooted trees, cycles, strings with wildcards.
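
On the hypercube the kernel therefore depends only on the Hamming distance, which makes it trivial to evaluate for categorical data. A sketch (the normalization constant is dropped, and the example strings and $t$ are made up):

```python
import numpy as np

def hypercube_diffusion_kernel(x, z, t=1.0):
    """Diffusion kernel on the hypercube, up to normalization:
    K_t(x, z) is proportional to (tanh t)^(Hamming distance between x and z)."""
    hamming = int(np.sum(np.asarray(x) != np.asarray(z)))
    return np.tanh(t) ** hamming

print(hypercube_diffusion_kernel([0, 1, 1, 0], [0, 1, 1, 0]))  # distance 0 -> 1.0
print(hypercube_diffusion_kernel([0, 1, 1, 0], [1, 1, 0, 0]))  # distance 2 -> tanh(1)^2
```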

Results on UCI Datasets

                 Hamming             Diffusion Kernel            Improvement
Data Set         error     SV        error     SV       β        err     SV
Breast Cancer    7.64%     387.0     3.64%     62.9     0.30     62%     83%
Hepatitis        7.98%     750.0     7.66%     34.9     .50      2%      58%
Income           9.9%      49.5      8.50%     033.4    0.40     4%      8%
Mushroom         3.36%     96.3      0.75%     28.2     0.0      77%     70%
Votes            4.69%     286.0     3.9%      252.9    2.00     7%      2%

Recent application to protein classification by Vert and Kanehisa (NIPS 2002).

Random Fields View of Combining Labeled/Unlabeled Data

Random Fields View
View each vertex $x$ as having label $f(x) \in \{+1, -1\}$. Ising model on a graph/lattice, spins $f : V \to \{+1, -1\}$.
Energy: $H(f) = \frac{1}{2} \sum_{x \sim y} w_{xy} (f(x) - f(y))^2 = \mathrm{const} - \sum_{x \sim y} w_{xy} f(x) f(y)$
Gibbs distribution: $P(f) = \frac{1}{Z(\beta)} e^{-\beta H(f)}$, with $\beta = 1/T$
Partition function: $Z(\beta) = \sum_f e^{-\beta H(f)}$
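
On a very small graph the Gibbs distribution can be tabulated exactly by enumerating all $2^{|V|}$ spin configurations. A brute-force sketch (the weighted triangle and $\beta$ are placeholders):

```python
import itertools
import numpy as np

# A made-up weighted triangle graph.
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 1.0],
              [0.5, 1.0, 0.0]])
beta = 1.0
n = W.shape[0]

def energy(f):
    """H(f) = 1/2 * sum_{x ~ y} w_xy (f(x) - f(y))^2, each edge counted once."""
    f = np.asarray(f, dtype=float)
    return 0.25 * np.sum(W * (f[:, None] - f[None, :]) ** 2)

configs = list(itertools.product([-1, +1], repeat=n))
weights = np.array([np.exp(-beta * energy(f)) for f in configs])
Z = weights.sum()          # partition function Z(beta)
probs = weights / Z        # Gibbs distribution P(f)
for f, p in zip(configs, probs):
    print(f, round(p, 3))
```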

Graph Mincuts
Graph mincuts can be very unbalanced. [Figure: examples of unbalanced mincuts]
Graph mincuts don't exploit probabilistic properties of random fields.
Idea: replace by averages under the Ising model:
$E_\beta[f(x)] = \frac{1}{Z(\beta)} \sum_{f:\, f|_{\partial S} = f_B} f(x)\, e^{-\beta H(f)}$
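
These pinned averages can likewise be computed exactly on a tiny graph by summing only over configurations that agree with the labels. A brute-force sketch (the 4-node path and the choice of pinned nodes are made up):

```python
import itertools
import numpy as np

# Path graph 0-1-2-3; nodes 0 and 3 are pinned to +1 and -1, nodes 1 and 2 are free.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pinned = {0: +1, 3: -1}
free = [1, 2]

def energy(f):
    """H(f) = 1/2 * sum_{x ~ y} w_xy (f(x) - f(y))^2, each edge counted once."""
    return 0.25 * np.sum(W * (f[:, None] - f[None, :]) ** 2)

def mean_spin(beta):
    """E_beta[f(x)] over spin configurations that agree with the pinned labels."""
    num, Z = np.zeros(len(W)), 0.0
    for assignment in itertools.product([-1, +1], repeat=len(free)):
        f = np.zeros(len(W))
        for v, s in pinned.items():
            f[v] = s
        for v, s in zip(free, assignment):
            f[v] = s
        w = np.exp(-beta * energy(f))
        Z += w
        num += w * f
    return num / Z

for beta in (3.0, 1.0, 0.1):
    print(beta, np.round(mean_spin(beta), 3))
```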

Pinned Ising Model
[Figure: expected spins of the pinned Ising model for β = 3, 2, 1.5, 1, 0.75, 0.1]

Not (Provably) Efficient to Approximate
Unfortunately, the analogue of the rapid mixing result of Jerrum & Sinclair for the ferromagnetic Ising model is not known for mixed boundary conditions.
Question: Can we compute averages using graph algorithms in the zero temperature limit?

Idea: Relax to Statistical Field Theory
Euclidean field theory on a graph/lattice, fields $f : V \to \mathbb{R}$.
Energy: $H(f) = \frac{1}{2} \sum_{x \sim y} w_{xy} (f(x) - f(y))^2$
Gibbs distribution: $P(f) = \frac{1}{Z(\beta)} e^{-\beta H(f)}$, with $\beta = 1/T$
Partition function: $Z(\beta) = \int e^{-\beta H(f)}\, df$
Physical interpretation: analytic continuation to imaginary time, $t \to it$; Poincaré group $\to$ Euclidean group.

View from Statistical Field Theory (cont.)
The most probable field is harmonic. Weighted graph $G = (V, E)$ with edge weights $w_{xy}$ and combinatorial Laplacian $\Delta$. Subgraph $S$ with boundary $\partial S$.
Dirichlet problem, unique solution:
$\Delta f = 0$ on $S$, $\quad f|_{\partial S} = f_B$
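
Because the most probable continuous field is harmonic with the labels as boundary values, it is found by solving a single linear system. A minimal sketch of this harmonic / label-propagation solution (the 5-node path and its labels are illustrative):

```python
import numpy as np

# Path graph 0-1-2-3-4 with unit weights; nodes 0 and 4 are labeled (+1 and -1).
W = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
Delta = np.diag(W.sum(axis=1)) - W

labeled, unlabeled = [0, 4], [1, 2, 3]
f_L = np.array([+1.0, -1.0])

# Harmonic solution of the Dirichlet problem: Delta f = 0 on the unlabeled set,
# f fixed on the labels, i.e.  Delta_UU f_U = -Delta_UL f_L.
Delta_UU = Delta[np.ix_(unlabeled, unlabeled)]
Delta_UL = Delta[np.ix_(unlabeled, labeled)]
f_U = np.linalg.solve(Delta_UU, -Delta_UL @ f_L)
print(f_U)   # interpolates linearly along the path: [0.5, 0.0, -0.5]
```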

Random Walk Solution
Perform a random walk on the unlabeled data, stopping when a labeled point is hit. What is the probability of hitting a positive labeled point before a negative labeled point? Precisely the same as the minimum energy (continuous) random field. Label propagation.
Related work by Szummer and Jaakkola (NIPS 2001).
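
That equivalence is easy to check numerically: the probability that a walk from an unlabeled node (stepping to neighbors with probability proportional to edge weight) is absorbed at a positive label reproduces the harmonic field from the previous sketch. A Monte Carlo sketch on the same made-up path graph:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same path graph: nodes 0 and 4 are the positive and negative labeled points.
W = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
positive, negative = {0}, {4}

def hit_positive_prob(start, n_walks=20000):
    """Estimate P(the walk from `start` hits a positive label before a negative one)."""
    hits = 0
    for _ in range(n_walks):
        v = start
        while v not in positive and v not in negative:
            p = W[v] / W[v].sum()     # step to a neighbor w.p. proportional to w_xy
            v = int(rng.choice(len(W), p=p))
        hits += v in positive
    return hits / n_walks

for v in (1, 2, 3):
    # 2p - 1 should be close to the harmonic field value at v (0.5, 0.0, -0.5).
    print(v, 2 * hit_positive_prob(v) - 1)
```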

[Figure: unconstrained vs. constrained solutions]

View from Statistical Field Theory
In the one-dimensional case, the low temperature limit of the average Ising model is the same as the minimum energy Euclidean field (Landau).
Intuition: average over graph s-t mincuts; the harmonic solution is linear.
Not true in general...

Computing the Partition Function
Let $\lambda_i$ be the spectrum of $\Delta$ with Dirichlet boundary conditions:
$Z(\beta) = e^{-\beta H(f^*)}\, \frac{(\pi/\beta)^{n/2}}{\sqrt{\det \Delta}}, \qquad \det \Delta = \prod_{i=1}^n \lambda_i$
By a generalization of the matrix-tree theorem (Chung & Langlands, 1996),
$\det \Delta = \frac{\#\{\text{rooted spanning forests}\}}{\prod_i \deg(i)}$
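
Because the pinned field is Gaussian, the closed form above can be checked against brute-force numerical integration on a tiny graph. A sketch (graph, labels, and $\beta$ are made up; the energy is taken as $H(f) = f^\top \Delta f = \sum_{\text{edges}} w_{xy}(f(x)-f(y))^2$, the convention that matches the formula above):

```python
import numpy as np
from scipy.integrate import dblquad

# Path graph 0-1-2-3; nodes 0 and 3 labeled with f = +1 and f = -1.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Delta = np.diag(W.sum(axis=1)) - W
unlabeled, labeled = [1, 2], [0, 3]
f_L = np.array([+1.0, -1.0])
beta = 2.0

def H(f):
    """Energy H(f) = f^T Delta f = sum over edges of w_xy (f(x) - f(y))^2."""
    return f @ Delta @ f

def full(f_U):
    """Assemble the full field from the unlabeled values and the fixed labels."""
    f = np.zeros(4)
    f[labeled] = f_L
    f[unlabeled] = f_U
    return f

# Most probable (harmonic) field on the unlabeled nodes.
Delta_UU = Delta[np.ix_(unlabeled, unlabeled)]
Delta_UL = Delta[np.ix_(unlabeled, labeled)]
f_star_U = np.linalg.solve(Delta_UU, -Delta_UL @ f_L)

# Brute-force partition function: integrate over the two unlabeled field values.
Z_numeric, _ = dblquad(lambda u2, u1: np.exp(-beta * H(full(np.array([u1, u2])))),
                       -6, 6, -6, 6)

# Closed form: Z = exp(-beta * H(f*)) * (pi / beta)^{n/2} / sqrt(det Delta_UU),
# with det Delta_UU equal to the product of the Dirichlet eigenvalues.
lam = np.linalg.eigvalsh(Delta_UU)
Z_closed = (np.exp(-beta * H(full(f_star_U)))
            * (np.pi / beta) ** (len(unlabeled) / 2) / np.sqrt(np.prod(lam)))
print(Z_numeric, Z_closed)   # the two estimates should agree
```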

Connection with Diffusion Kernels
Again take $\Delta$, the combinatorial Laplacian with Dirichlet boundary conditions (zero on labeled data). For $K_t = e^{-t\Delta}$ the diffusion kernel, let
$K = \int_0^\infty K_t\, dt$
Solution to the Dirichlet problem (label propagation, minimum energy continuous field):
$f^*(x) = \sum_{z \in \text{fringe}} K(x, z)\, f_D(z)$

Connection with Diffusion Kernels (cont.)
Want to solve Laplace's equation $\Delta f = g$; the solution is given in terms of $\Delta^{-1}$. A quick way to see the connection uses the spectral representation:
$\Delta_{x,x'} = \sum_i \mu_i\, \phi_i(x)\, \phi_i(x')$
$K_t(x, x') = \sum_i e^{-t\mu_i}\, \phi_i(x)\, \phi_i(x')$
$\Delta^{-1}_{x,x'} = \sum_i \mu_i^{-1}\, \phi_i(x)\, \phi_i(x') = \int_0^\infty K_t(x, x')\, dt$
Used by Chung and Yau (2000).
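
Both facts are easy to verify numerically on a tiny graph (the path graph and labels below are made up for illustration): the integrated diffusion kernel matches $\Delta^{-1}$ on the unlabeled nodes, and applying it to the boundary data reproduces the harmonic label-propagation solution.

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import trapezoid

# Path graph 0-1-2-3-4; nodes 0 and 4 labeled (+1, -1); Dirichlet Laplacian on {1, 2, 3}.
W = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
Delta = np.diag(W.sum(axis=1)) - W
U, L = [1, 2, 3], [0, 4]
Delta_UU = Delta[np.ix_(U, U)]
f_L = np.array([+1.0, -1.0])

# K = int_0^inf exp(-t * Delta_UU) dt.  The integrand decays exponentially, so
# truncating at a large T and applying the trapezoid rule is accurate enough.
ts = np.linspace(0.0, 60.0, 6001)
K = trapezoid(np.array([expm(-t * Delta_UU) for t in ts]), ts, axis=0)
print(np.allclose(K, np.linalg.inv(Delta_UU), atol=1e-3))   # K equals Delta^{-1}

# Green's-function form of the harmonic solution: f*(x) = sum_z K(x, z) f_D(z),
# where f_D(z) = sum over labeled neighbors y of w_zy f(y) (nonzero only on the fringe).
f_D = W[np.ix_(U, L)] @ f_L
print(K @ f_D)   # matches the harmonic interpolation [0.5, 0.0, -0.5]
```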

Bounds on Covering Numbers and Generalization Error, Continuous Case
Eigenvalue bounds from differential geometry (Li and Yau):
$c_1 \left(\frac{j}{V}\right)^{2/d} \le \mu_j \le c_2 \left(\frac{j+1}{V}\right)^{2/d}$
These give bounds on SVM hypothesis class covering numbers:
$\log \mathcal{N}(\epsilon, F_R(x)) = O\!\left( \frac{V}{t^{d/2}}\, \log^{\frac{d+2}{2}}\!\left(\frac{1}{\epsilon}\right) \right)$

Bounds on Generalization Error
Better bounds on generalization error are now available based on Rademacher averages involving the trace of the kernel (Bartlett, Bousquet, & Mendelson, preprint).
Question: Can the diffusion kernel connection be exploited to get transductive generalization error bounds for the random walk approach?

Summary
Random fields with discrete class labels: intractable, unstable.
Continuous fields: tractable, more desirable behavior for segmentation and labeling.
Intimate connections with random walks, electric networks, graph flows, and diffusion kernels.
Advantages/disadvantages?