Predicting Graph Labels using Perceptron Shuang Song shs037@eng.ucsd.edu

References:
- Online learning over graphs. M. Herbster, M. Pontil, and L. Wainer. Proc. 22nd Int. Conf. Machine Learning (ICML'05), 2005.
- Prediction on a graph with a perceptron. M. Herbster and M. Pontil. NIPS 20, 2006.

Outline
1. Problem Setting
2. Perceptron
3. Properties of Graphs
4. Bound # of mistakes

Problem Setting. Known graph, unknown labels on the vertices. E.g. an advertisement service on web pages; e.g. a digit recognition task on USPS (where the graph is built using nearest neighbours).

Problem Setting. Given a graph $G = (V, E)$ with $V = \{1, \dots, n\}$, for $t = 1, \dots, \ell$:
- nature selects $v_t \in V$
- the learner predicts $\hat{y}_t \in \{-1, 1\}$
- nature reveals $y_t \in \{-1, 1\}$
- if $\hat{y}_t \neq y_t$, mistakes = mistakes + 1
The goal is to minimize the number of mistakes.
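
A minimal sketch of this protocol in Python; `predict` and `update` are placeholder callbacks standing in for whatever learner is plugged in (the names are mine, for illustration only):

```python
def run_protocol(vertices, true_labels, predict, update):
    """Online graph-labelling protocol.

    vertices: order in which nature presents the nodes
    true_labels[v]: label in {-1, +1}, revealed only after the prediction
    predict(v): learner's guess for vertex v
    update(v, y): learner's bookkeeping after the label is revealed
    """
    mistakes = 0
    for v in vertices:
        y_hat = predict(v)          # learner predicts
        y = true_labels[v]          # nature reveals the true label
        if y_hat != y:
            mistakes += 1
        update(v, y)
    return mistakes
```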

[Figure: an example run on a small graph with 8 vertices.]

Node | Predict | Nature | Mistakes
-----+---------+--------+---------
  1  |    1    |   -1   |    1
  2  |   -1    |   -1   |    1
  3  |   -1    |   -1   |    1
  4  |   -1    |   -1   |    1
  5  |   -1    |    1   |    2
  6  |    1    |    1   |    2
  7  |    1    |    1   |    2
  8  |    1    |    1   |    2

[Figure: an easy labelling versus a hard labelling of the same graph.]

Problem Setting. Implicit assumption: adjacent nodes have similar labels. Nature can be adversarial, in which case the learner can always be forced to make mistakes; but if nature's labelling is regular and simple, the learner can get away with only a few mistakes. We therefore bound the number of mistakes by the complexity of nature's labelling. Assume the graph is connected and unweighted.

What algorithm are we going to use?

Perceptron. Simply linear classification; assume the data are linearly separable with margin 1.

Perceptron: algorithm. Data: $(x_1, y_1), \dots, (x_\ell, y_\ell) \in (\mathcal{H} \times \{-1, 1\})^\ell$. Initialize $w_1 = 0 \in \mathcal{H}$. For $t = 1, \dots, \ell$:
- receive $x_t \in \mathcal{H}$
- predict $\hat{y}_t = \mathrm{sign}(\langle w_t, x_t \rangle)$
- receive $y_t \in \{-1, 1\}$
- if $\hat{y}_t \neq y_t$: mistakes = mistakes + 1 and $w_{t+1} = w_t + y_t x_t$; else $w_{t+1} = w_t$
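
A minimal NumPy sketch of this loop (not the papers' code; the default inner product is Euclidean, and in the graph setting below it becomes $\langle f, g \rangle = f^T L g$):

```python
import numpy as np

def perceptron(X, y, inner=lambda a, b: a @ b):
    """Run the perceptron on feature vectors X[t] with labels y[t] in {-1, +1}.

    `inner` is the inner product of the feature space H (Euclidean by default).
    Returns the final weight vector and the number of mistakes.
    """
    w = np.zeros(X.shape[1])
    mistakes = 0
    for x_t, y_t in zip(X, y):
        y_hat = 1 if inner(w, x_t) >= 0 else -1   # predict sign(<w_t, x_t>)
        if y_hat != y_t:
            mistakes += 1
            w = w + y_t * x_t                      # update only on a mistake
    return w, mistakes
```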

Perceptron: mistake bound. Theorem: given a sequence $(x_t, y_t)_{t=1}^{\ell} \in (\mathcal{H} \times \{-1, 1\})^{\ell}$, let $M$ be the set of trials on which the perceptron predicted incorrectly. Then for every $w \in \{-1, 1\}^n$ such that $\langle w, x_t \rangle = y_t$ for $t = 1, \dots, \ell$,
$|M| \le \|w\|^2 \max_{t \in M} \|x_t\|^2$,
where the norm is taken w.r.t. the inner product of $\mathcal{H}$.

Perceptron: how to use. For us, what is the inner product? What is $x_t$? We would want an $x_t$ that captures the structure of the whole graph.

Properties of Graphs. Graph Laplacian $L = D - A$, where $A$ is the adjacency matrix and $D = \mathrm{diag}(d_1, \dots, d_n)$ is the degree matrix. Inner product: $\langle f, g \rangle = f^T L g$ for $f, g \in \mathbb{R}^n$. Semi-norm: $\|f\|^2 = \langle f, f \rangle = \sum_{(i,j) \in E} (f_i - f_j)^2$.
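
A small NumPy sketch of these definitions. The 5-cycle is a hypothetical graph used only for illustration; the assert checks that $f^T L f$ equals the sum of squared differences over the edges:

```python
import numpy as np

# Hypothetical unweighted graph: a cycle on 5 vertices, given as an edge list.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1          # adjacency matrix
D = np.diag(A.sum(axis=1))         # degree matrix
L = D - A                          # graph Laplacian

f = np.array([1., -1., 1., -1., 1.])
seminorm_sq = f @ L @ f            # <f, f> = f^T L f
assert np.isclose(seminorm_sq, sum((f[i] - f[j]) ** 2 for i, j in edges))
```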

Properties of Graphs. The norm measures the smoothness, or complexity, of a labelling $g$. [Figure: two $\pm 1$ labellings of the same graph; the smooth one has $\|g\|^2 = 3 \cdot 4$, the complex one $\|g\|^2 = 12 \cdot 4$.]

Properties of Graphs. [Figure: two further labellings, with $\|g\|^2 = 1 \cdot 4$ and $\|g\|^2 = 9 \cdot 4$.]

Properties of Graphs. Eigenvalues $\lambda_i$ and eigenvectors $u_i$ of $L$: for a connected graph, $0 = \lambda_1 < \lambda_2 \le \lambda_3 \le \dots \le \lambda_n$, with $u_1$ the constant vector. Let $\mathcal{H} = \mathrm{span}(u_2, \dots, u_n) = \{g : g^T u_1 = 0\} = \{g : \sum_{i=1}^n g_i = 0\}$. On $\mathcal{H}$ the semi-norm becomes a norm.

Properties of Graphs. Pseudoinverse $K = L^+ = \sum_{i=2}^n \lambda_i^{-1} u_i u_i^T$. It is the reproducing kernel of $\mathcal{H}$: for $g \in \mathcal{H}$ and $K_i = K(:, i)$, $\langle K_i, g \rangle = g_i$. Moreover $\|K_t\|^2 = K_{tt}$, which measures the remoteness of vertex $t$ and decreases with connectivity.
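
A sketch of the kernel and the reproducing property on the hypothetical 5-cycle from above (`np.linalg.pinv` plays the role of $L^+$):

```python
import numpy as np

# Same hypothetical 5-cycle as before.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1
L = np.diag(A.sum(axis=1)) - A

K = np.linalg.pinv(L)              # K = L^+ = sum_{i>=2} (1/lambda_i) u_i u_i^T

# Reproducing property on H = {g : sum_i g_i = 0}: <K_i, g> = K_i^T L g = g_i.
g = np.array([2., -1., 0.5, -1., -0.5])       # an arbitrary zero-mean vector
for i in range(n):
    assert np.isclose(K[:, i] @ L @ g, g[i])

# ||K_t||^2 = K_t^T L K_t = K_tt, the "remoteness" of vertex t.
t = 0
assert np.isclose(K[:, t] @ L @ K[:, t], K[t, t])
```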

This will be our feature for a vertex, i.e., $x_t = K_t$.

Properties of Graphs. For $p, q \in V$, define:
- Distance (between two vertices): $d(p, q) = \min_P |P(p, q)|$, where $P$ ranges over paths from $p$ to $q$
- Eccentricity (of a vertex): $\rho_p = \max_{q \in V} d(p, q)$
- Diameter (of the graph): $D_G = \max_p \rho_p$
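
These are plain breadth-first-search quantities; a minimal sketch, with the hypothetical 5-cycle as a check (every vertex of a 5-cycle has eccentricity 2):

```python
from collections import deque

def bfs_distances(adj, p):
    """Hop distances from vertex p in an unweighted graph (adjacency dict)."""
    dist = {p: 0}
    queue = deque([p])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccentricity(adj, p):
    return max(bfs_distances(adj, p).values())

def diameter(adj):
    return max(eccentricity(adj, p) for p in adj)

# Hypothetical 5-cycle again, as an adjacency dict.
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
assert eccentricity(adj, 0) == 2 and diameter(adj) == 2
```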

Bound. $|M| \le \|w\|^2 \max_{t \in M} \|x_t\|^2$. We want to express $\|w\|^2$ and $\max_{t \in M} \|x_t\|^2$ in terms of properties of the graph.

Bound # of mistakes: $\|w\|^2$. First we look at $w$: with $w \in \{-1, +1\}^n$, $\|w\|^2 = \sum_{(i,j) \in E} (w_i - w_j)^2$ measures the smoothness, or complexity, of the labelling, and $\|w\|^2 = 4 \cdot (\#\text{ edges spanning different labels})$.
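
A quick numerical check of this identity on the hypothetical 5-cycle, with a $\pm 1$ labelling chosen for illustration:

```python
import numpy as np

# Hypothetical 5-cycle again.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1
L = np.diag(A.sum(axis=1)) - A

w = np.array([1., 1., -1., -1., 1.])                  # a +/-1 labelling
cut_edges = sum(1 for i, j in edges if w[i] != w[j])  # edges spanning different labels
assert np.isclose(w @ L @ w, 4 * cut_edges)           # ||w||^2 = 4 * (# cut edges)
```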

Bound # of mistakes: $\|x_t\|^2$. Then we look at $x_t = K_t$: $\|K_t\|^2 = K_{tt}$, so we want to bound $K_{tt}$. Theorem: for a connected graph $G$ with Laplacian kernel $K$,
$K_{tt} \le \min\{1/\lambda_2, \rho_t\}$ for all $t \in V$,
where $\lambda_2$ is the second-smallest eigenvalue of $L$ and $\rho_t = \max_{q \in V} \min_P |P(t, q)|$ is the eccentricity of $t$.
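
A numerical check of the theorem on the hypothetical 5-cycle, where every vertex has eccentricity 2:

```python
import numpy as np

# Hypothetical 5-cycle again; every vertex has eccentricity rho_t = 2.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
n, rho = 5, 2
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1
L = np.diag(A.sum(axis=1)) - A

lambda2 = np.sort(np.linalg.eigvalsh(L))[1]   # second-smallest eigenvalue of L
K = np.linalg.pinv(L)                         # Laplacian kernel
# Check K_tt <= min(1/lambda_2, rho_t) for every vertex t.
assert all(K[t, t] <= min(1.0 / lambda2, rho) + 1e-9 for t in range(n))
```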

Bound # of mistakes: $\|x_t\|^2$. Proof. (i) $K_{tt} \le 1/\lambda_2$: for every $g \in \mathcal{H}$ we have $g^T L g \ge \lambda_2\, g^T g$. Taking $g = K_t$ gives $K_{tt} = \|K_t\|^2 \ge \lambda_2 \sum_p g_p^2 \ge \lambda_2 K_{tt}^2$, hence $K_{tt} \le 1/\lambda_2$. (ii) $K_{tt} \le \rho_t$: if $g \in \mathcal{H}$ and $g_t > 0$, then there exists $s$ with $g_s < 0$ (since $\sum_i g_i = 0$), and a path $P$ from $t$ to $s$ with $|E(P)| \le \rho_t$.

Bound # of mistakes : x t 2 g i g j g t g s > g t i,j E(P) n n By n i=1 a 2 i i=1 a i for non negative {a i }, we have i,j E(P) 2 i,j E P g i g j g i g j E(P) i,j E P ρ t 2 g i g j 2 g t 2 ρ t 2

Bound # of mistakes: $\|x_t\|^2$. Taking $g = K_t$, we have $\|K_t\|^2 = \sum_{(i,j) \in E(G)} (K_{ti} - K_{tj})^2$ and $\|K_t\|^2 = \langle K_t, K_t \rangle = K_{tt}$, so
$K_{tt} = \sum_{(i,j) \in E(G)} (K_{ti} - K_{tj})^2 \ge \frac{K_{tt}^2}{\rho_t}$,
hence $K_{tt} \le \rho_t$. Altogether, $K_{tt} \le 1/\lambda_2$ and $K_{tt} \le \rho_t$.

Bound # of mistakes.
$|M| \le \|w\|^2 \max_{t \in M} \|x_t\|^2 \le 4 \cdot (\#\text{ edges spanning different labels}) \cdot \max_{t \in M} \min\{1/\lambda_2, \rho_t\}$
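
Putting the pieces together: an end-to-end sketch that runs the graph perceptron with feature $x_t = K_t$ and inner product $\langle f, g \rangle = f^T L g$, then compares its mistake count with the bound. The graph (two triangles joined by a bridge) and its labelling are hypothetical choices for illustration:

```python
import numpy as np

# Hypothetical graph: two triangles {0,1,2} and {3,4,5} joined by the bridge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1
L = np.diag(A.sum(axis=1)) - A
K = np.linalg.pinv(L)

y = np.array([1, 1, 1, -1, -1, -1])          # labels agree within each triangle

# Graph perceptron: feature x_t = K_t, inner product <f, g> = f^T L g.
w, mistakes = np.zeros(n), 0
for t in np.random.permutation(n):           # nature presents vertices in some order
    y_hat = 1 if w @ L @ K[:, t] >= 0 else -1
    if y_hat != y[t]:
        mistakes += 1
        w = w + y[t] * K[:, t]

# Mistake bound: 4 * (# cut edges) * min(1/lambda_2, diameter).
cut = sum(1 for i, j in edges if y[i] != y[j])   # here: 1 (only the bridge)
lambda2 = np.sort(np.linalg.eigvalsh(L))[1]
diameter = 3                                     # upper bound on every eccentricity rho_t
bound = 4 * cut * min(1.0 / lambda2, diameter)
print(mistakes, "<=", bound)
assert mistakes <= bound
```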

Further improvements. Noisy samples: if the comparator $w$ is allowed to err on a set of trials $M_w$, the bound becomes
$|M| \le 2|M \cap M_w| + \|w\|^2 X^2 + 2\sqrt{|M \cap M_w|\, \|w\|^2 X^2 + \|w\|^4 X^4}$,
with $X = \max_{t \in M} \|x_t\|$. One can also bound $K_{pp}$ using the effective resistance $r(p, q)$: $K_{pp} \le \max_{p, q \in V} r(p, q)$.
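
The effective resistance can be read off the kernel as $r(p, q) = K_{pp} + K_{qq} - 2K_{pq}$; a small sketch on the hypothetical 5-cycle, which also checks $K_{pp} \le \max_{p,q} r(p, q)$ numerically:

```python
import numpy as np

# Hypothetical 5-cycle again.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1
K = np.linalg.pinv(np.diag(A.sum(axis=1)) - A)

def resistance(p, q, K=K):
    """Effective resistance between p and q: r(p, q) = K_pp + K_qq - 2 K_pq."""
    return K[p, p] + K[q, q] - 2 * K[p, q]

# Adjacent vertices of a 5-cycle see two parallel paths (lengths 1 and 4),
# so r = 1 * 4 / (1 + 4) = 0.8.
assert np.isclose(resistance(0, 1), 0.8)

resistance_diameter = max(resistance(p, q) for p in range(n) for q in range(n))
assert all(K[p, p] <= resistance_diameter + 1e-9 for p in range(n))
```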