Predicting Graph Labels using Perceptron
Shuang Song
shs037@eng.ucsd.edu
Online Learning over Graphs. M. Herbster, M. Pontil, and L. Wainer. Proc. 22nd Int. Conf. on Machine Learning (ICML'05), 2005.
Prediction on a Graph with a Perceptron. M. Herbster and M. Pontil. NIPS 20, 2006.
Outline
1. Problem Setting
2. Perceptron
3. Properties of Graphs
4. Bound # of Mistakes
Problem Setting
Known graph, unknown labels on the vertices.
e.g., advertisement service on web pages
e.g., digit recognition on USPS (graph built using nearest neighbours)
Problem Setting
Given a graph $G = (V, E)$ with $V = \{1, \dots, n\}$:
for $t = 1, \dots, \ell$:
  nature selects a vertex $v_t \in V$
  learner predicts $\hat{y}_t \in \{-1, 1\}$
  nature reveals $y_t \in \{-1, 1\}$
  if $\hat{y}_t \neq y_t$: mistakes = mistakes + 1
Goal: minimize mistakes.
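A minimal sketch of this protocol in Python; the learner object with predict/update methods is a hypothetical stand-in, not from the papers:

def run_protocol(vertices, labels, learner):
    # vertices[t]: the vertex nature selects at trial t
    # labels[t]:   the label nature reveals after the prediction
    mistakes = 0
    for v, y in zip(vertices, labels):
        y_hat = learner.predict(v)   # learner predicts -1 or +1
        if y_hat != y:
            mistakes += 1
        learner.update(v, y)         # learner sees the revealed label
    return mistakes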
[Figure: example graph on vertices 1-8]

Node  Predict  Nature  Mistakes
1     1        -1      1
2     -1       -1      1
3     -1       -1      1
4     -1       -1      1
5     -1       1       2
6     1        1       2
7     1        1       2
8     1        1       2
[Figure: two labellings of the same graph — an easy one and a hard one]
Problem Setting
Implicit assumption: adjacent nodes have similar labels.
Nature can be adversarial, forcing the learner to make a mistake on every trial; but if nature's labelling is regular and simple, the learner can get away with only a few mistakes.
We therefore bound the number of mistakes by the complexity of nature's labelling.
Assume the graph is connected and unweighted.
What algorithm are we going to use?
Perceptron
Simple linear classification; assume the data are linearly separable with margin 1.
Perceptron: algorithm
Data: $(x_1, y_1), \dots, (x_\ell, y_\ell) \in H \times \{-1, 1\}$
Initialize $w_1 = 0 \in H$
For $t = 1, \dots, \ell$:
  receive $x_t \in H$
  predict $\hat{y}_t = \mathrm{sign}(\langle w_t, x_t \rangle)$
  receive $y_t \in \{-1, 1\}$
  if $\hat{y}_t \neq y_t$:
    mistakes = mistakes + 1
    $w_{t+1} = w_t + y_t x_t$
  else:
    $w_{t+1} = w_t$
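A minimal numpy sketch of this algorithm, assuming features are rows of an array and the inner product of H is passed in as a function:

import numpy as np

def perceptron(xs, ys, inner):
    # xs: (l, n) array of feature vectors; ys: labels in {-1, +1}
    # inner(u, v): the inner product of the space H
    w = np.zeros(xs.shape[1])
    mistakes = 0
    for x, y in zip(xs, ys):
        y_hat = 1.0 if inner(w, x) >= 0 else -1.0   # sign, with sign(0) taken as +1
        if y_hat != y:
            mistakes += 1
            w = w + y * x
    return w, mistakes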
Perceptron: mistake bound
Theorem: given a sequence $(x_t, y_t)_{t=1}^{\ell} \in H \times \{-1, 1\}$, and with $M$ the set of trials on which the perceptron predicted incorrectly, then for all $w \in \{-1, 1\}^n$ s.t. $\langle w, x_t \rangle = y_t$, $t = 1, \dots, \ell$:
$|M| \le \|w\|^2 \max_{t \in M} \|x_t\|^2$
(norms taken w.r.t. the inner product of $H$)
Perceptron: how to use
For us, what is the inner product? What is $x_t$?
We want an $x_t$ that captures the structure of the whole graph.
Properties of Graphs
Graph Laplacian: $L = D - A$, where $A$ is the adjacency matrix and $D = \mathrm{diag}(d_1, \dots, d_n)$ the degree matrix
Inner product: $\langle f, g \rangle = f^T L g$, for $f, g \in \mathbb{R}^n$
Semi-norm: $\|f\|^2 = \langle f, f \rangle = \sum_{(i,j) \in E} (f_i - f_j)^2$
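A small numpy check of these definitions; the path graph and the labelling are illustrative choices, not from the papers:

import numpy as np

# path graph on 4 vertices: edges (0,1), (1,2), (2,3)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A          # L = D - A

g = np.array([1.0, 1.0, -1.0, -1.0])
quad = g @ L @ g                         # the quadratic form f^T L f
edge = sum((g[i] - g[j])**2 for i in range(4) for j in range(i + 1, 4) if A[i, j])
print(quad, edge)                        # both 4.0: one edge spans different labels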
Properties of Graphs
The norm measures the smoothness, or complexity, of a labelling $g$:
[Figure: two $\pm 1$ labellings of the same graph; the smooth one has $\|g\|^2 = 3 \cdot 4$, the irregular one $\|g\|^2 = 12 \cdot 4$]
Properties of Graphs
[Figure: two more labellings, with $\|g\|^2 = 1 \cdot 4$ and $\|g\|^2 = 9 \cdot 4$]
Properties of Graphs
Eigenvalues $\lambda_i$ and eigenvectors $u_i$ of $L$: if the graph is connected, $0 = \lambda_1 < \lambda_2 \le \lambda_3 \le \dots \le \lambda_n$, with $u_1$ the constant vector
$H = \mathrm{span}\{u_2, \dots, u_n\} = \{g : g^T u_1 = 0\} = \{g : \sum_{i=1}^n g_i = 0\}$
On $H$, the semi-norm becomes a norm.
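Checking the spectrum numerically on the same path graph as above (numpy's eigh returns eigenvalues in ascending order):

import numpy as np

A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
L = np.diag(A.sum(1)) - A
lam, U = np.linalg.eigh(L)
print(lam)       # lam[0] ~ 0, the rest strictly positive (graph is connected)
print(U[:, 0])   # first eigenvector is constant (up to sign and scale)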
Properties of Graphs
Pseudoinverse: $K = L^+ = \sum_{i=2}^n \lambda_i^{-1} u_i u_i^T$
It is the reproducing kernel of $H$: for all $g \in H$, with $K_i = K(:, i)$, $\langle K_i, g \rangle = g_i$
$\|K_t\|^2 = K_{tt}$ measures the remoteness of vertex $t$; it decreases with connectivity.
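A numpy sketch of the kernel and its reproducing property; pinv computes $L^+$, and the test vector g is an arbitrary zero-mean labelling chosen for illustration:

import numpy as np

A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
L = np.diag(A.sum(1)) - A
K = np.linalg.pinv(L)                    # K = L^+

g = np.array([3.0, -1.0, 1.0, -3.0])     # in H: entries sum to zero
print(K @ L @ g)                         # <K_i, g> = K_i^T L g = g_i, so this reproduces g
print(np.diag(K))                        # K_tt: remoteness of each vertex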
This will be our feature vector for vertex $t$, i.e., $x_t = K_t$.
Properties of Graphs
For $p, q \in V$, define:
Distance (between two vertices): $d(p, q) = \min_P |E(P)|$, where $P$ ranges over paths from $p$ to $q$
Eccentricity (of a vertex): $\rho_p = \max_{q \in V} d(p, q)$
Diameter (of a graph): $D_G = \max_{p \in V} \rho_p$
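On an unweighted graph these can all be computed by breadth-first search; a short sketch for the same path graph:

from collections import deque

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # path graph as adjacency lists

def bfs_dist(adj, s):
    # distances d(s, q) to every vertex q; each edge has unit length
    d = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in d:
                d[v] = d[u] + 1
                q.append(v)
    return d

ecc = {p: max(bfs_dist(adj, p).values()) for p in adj}   # eccentricities rho_p
print(ecc)                # {0: 3, 1: 2, 2: 2, 3: 3}
print(max(ecc.values()))  # diameter D_G = 3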
Bound
$|M| \le \|w\|^2 \max_{t \in M} \|x_t\|^2$
We now want to express $\|w\|^2$ and $\max_{t \in M} \|x_t\|^2$ in terms of properties of the graph.
Bound # of mistakes: $\|w\|^2$
First we look at $w \in \{-1, +1\}^n$:
$\|w\|^2 = \sum_{(i,j) \in E} (w_i - w_j)^2$ — the smoothness, or complexity, of the labelling
Each edge joining differently labelled vertices contributes $(\pm 2)^2 = 4$, so
$\|w\|^2 = 4 \cdot (\#\text{edges spanning different labels})$
Bound # of mistakes: $\|x_t\|^2$
Then we look at $x_t = K_t$: $\|K_t\|^2 = K_{tt}$, so we want to bound $K_{tt}$.
Theorem: for a connected graph $G$ with Laplacian kernel $K$,
$K_{tt} \le \min\left(\frac{1}{\lambda_2}, \rho_t\right), \quad \forall t \in V$
($\lambda_2$: 2nd smallest eigenvalue; $\rho_t = \max_{q \in V} d(t, q)$: eccentricity)
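A numeric check of the theorem on the path graph, with eccentricities taken from the BFS sketch above:

import numpy as np

A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
L = np.diag(A.sum(1)) - A
K = np.linalg.pinv(L)
lam = np.linalg.eigvalsh(L)
ecc = [3, 2, 2, 3]                                  # rho_t for the path graph
for t in range(4):
    print(K[t, t], "<=", min(1 / lam[1], ecc[t]))   # the bound holds at every vertex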
Bound # of mistakes: $\|x_t\|^2$
Proof.
$K_{tt} \le 1/\lambda_2$: since $g^T L g \ge \lambda_2\, g^T g$ for all $g \in H$, taking $g = K_t$ gives
$K_{tt} = K_t^T L K_t \ge \lambda_2 \sum_p g_p^2 \ge \lambda_2 K_{tt}^2$, hence $K_{tt} \le 1/\lambda_2$.
$K_{tt} \le \rho_t$: take $g \in H$ with $g_t > 0$. Since $\sum_i g_i = 0$, there exists $s$ with $g_s < 0$, and a path $P$ from $t$ to $s$ with $|E(P)| \le \rho_t$.
Bound # of mistakes : x t 2 g i g j g t g s > g t i,j E(P) n n By n i=1 a 2 i i=1 a i for non negative {a i }, we have i,j E(P) 2 i,j E P g i g j g i g j E(P) i,j E P ρ t 2 g i g j 2 g t 2 ρ t 2
Bound # of mistakes: $\|x_t\|^2$
Taking $g = K_t$, we have $\|K_t\|^2 = \sum_{(i,j) \in E(G)} (K_{ti} - K_{tj})^2$ and $\|K_t\|^2 = \langle K_t, K_t \rangle = K_{tt}$, so
$K_{tt} = \sum_{(i,j) \in E(G)} (K_{ti} - K_{tj})^2 \ge \frac{K_{tt}^2}{\rho_t}$
Hence $K_{tt} \le 1/\lambda_2$ and $K_{tt} \le \rho_t$.
Bound # of mistakes
$|M| \le \|w\|^2 \max_{t \in M} \|x_t\|^2 \le 4 \cdot (\#\text{edges spanning different labels}) \cdot \min\left(\frac{1}{\lambda_2}, D_G\right)$
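Putting it together: a sketch of the graph perceptron on the path graph, with features $x_t = K_t$ and inner product $w^T L x$; the vertex order and labelling are illustrative choices:

import numpy as np

A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
L = np.diag(A.sum(1)) - A
K = np.linalg.pinv(L)
y = np.array([1, 1, -1, -1])             # one edge spans different labels

w = np.zeros(4)
mistakes = 0
for t in range(4):                       # nature presents vertices in order
    x = K[t]                             # feature x_t = K_t
    y_hat = 1 if w @ L @ x >= 0 else -1  # <w, x_t> = w^T L x_t
    if y_hat != y[t]:
        mistakes += 1
        w = w + y[t] * x

lam = np.linalg.eigvalsh(L)
bound = 4 * 1 * min(1 / lam[1], 3)       # 4 * (#cut edges) * min(1/lambda_2, D_G)
print(mistakes, "<=", bound)             # here: 1 <= ~6.83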
Further improvement
Noisy samples: with $M_w$ the set of trials on which $w$ itself errs and $X = \max_{t \in M} \|x_t\|$, the bound degrades gracefully:
$|M| \le 2|M \setminus M_w| + \|w\|^2 X^2 + 2\sqrt{|M \setminus M_w|\,\|w\|^2 X^2 + \|w\|^4 X^4}$
Bound $K_{pp}$ using the effective resistance: $K_{pp} \le \max_{p,q \in V} r(p, q)$
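The effective resistance can be read off from K via the standard identity $r(p, q) = K_{pp} + K_{qq} - 2 K_{pq}$; a quick check on the path graph:

import numpy as np

A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
L = np.diag(A.sum(1)) - A
K = np.linalg.pinv(L)

r = lambda p, q: K[p, p] + K[q, q] - 2 * K[p, q]
print(r(0, 3))                           # 3.0: three unit-resistance edges in series
print(max(np.diag(K)), "<=", max(r(p, q) for p in range(4) for q in range(4)))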