Discrete vs. Continuous: Two Sides of Machine Learning


Discrete vs. Continuous: Two Sides of Machine Learning. Dengyong Zhou, Department of Empirical Inference, Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tuebingen, Germany. Oct. 18, 2004.

Our contributions: developed a discrete calculus and geometry on discrete objects; constructed a discrete regularization framework for learning on discrete spaces; proposed a family of transductive algorithms derived from the discrete framework.

Outline: Introduction; Discrete analysis and regularization; Related work; Discussion and future work.

Introduction: Definition and motivations of transductive inference; A principled approach to transductive inference.

Problem setting. Consider a finite input space $X = \{x_1, \dots, x_l, x_{l+1}, \dots, x_{l+k}\}$ and output space $Y = \{-1, +1\}$. We observe the labels of the first $l$ points: $(x_1, y_1), \dots, (x_l, y_l)$. We want to predict the labels of the given points $x_{l+1}, \dots, x_{l+k}$. We may use any supervised learning classifier, for instance SVMs (Boser et al., 1992; Cortes and Vapnik, 1995), to estimate the labels of the given points. However...

Motivation in theory: Vapnik's philosophy and transduction (Vapnik, 1998). Do not solve a more general problem as an intermediate step; it is quite possible that you have enough information to solve a particular problem well, but not enough information to solve a general problem. Transductive inference: do not estimate a function defined on the whole space (inductive); directly estimate the function values on the given points of interest (transductive)!

Vapnik's picture: transduction vs. induction. [Figure.]

Further motivations: learning with very few training examples. Engineering: improving classification accuracy by using unlabeled data; labeling needs expensive human labor, whereas unlabeled data is far easier to obtain. Cognitive science: understanding human inference; humans can learn from very few labeled examples.

Introduction: Definition and motivations of transductive inference; A principled approach to transductive inference.

A basic idea in supervised learning: regularization. A common way to achieve smoothness (cf. Tikhonov and Arsenin, 1977): $\min_f \{ R_{\mathrm{emp}}[f] + \lambda\, \Omega[f] \}$, where $R_{\mathrm{emp}}[f]$ is the empirical risk, $\Omega[f]$ is the regularization (stabilization) term, and $\lambda$ is the positive regularization parameter specifying the trade-off.

A basic idea in supervised learning: regularization (cont.). Typically, the regularization term takes the form $\Omega[f] = \|Df\|^2$, where $D$ is a differential operator, such as $D = \nabla$ (gradient). Kernel methods: $\Omega[f] = \|f\|_H^2$, where $H$ denotes a Reproducing Kernel Hilbert Space (RKHS). The two are equivalent (cf. Schölkopf and Smola, 2002).

A principled approach to transductive inference: develop the discrete analysis and geometry on finite discrete spaces consisting of the discrete objects to be classified; discretize the classical regularization framework used in inductive inference. The transductive inference approaches are then derived from the discrete regularization. Transduction vs. induction: discrete vs. continuous regularizer.

Outline: Introduction; Discrete analysis and regularization; Related work; Discussion and future work.

Discrete analysis and regularization: A basic differential calculus on graphs; Discrete regularization and operators: 2-smoothness ($p = 2$, discrete heat flow, linear), 1-smoothness ($p = 1$, discrete curvature flow, non-linear), $\infty$-smoothness ($p = \infty$, discrete large margin).

A prior assumption: pairwise relationships among points. If there is no relation among the discrete points, then we cannot make any prediction which is statistically better than a random guess. Assume pairwise relationships among the points: $w : X \times X \to \mathbb{R}_+$. The set of points may then be thought of as a weighted graph, where the weights of the edges encode the pairwise relationships.

Some basic notions in graph theory. A graph $\Gamma = (V, E)$ consists of a set $V$ of vertices and a set of pairs of vertices $E \subseteq V \times V$ called edges. A graph is undirected if for each edge $(u, v) \in E$ we also have $(v, u) \in E$. A graph is weighted if it is associated with a function $w : E \to \mathbb{R}_+$ satisfying $w(u, v) = w(v, u)$.

Some basic notions in graph theory (cont.). [Figure.]

Some basic notions in graph theory (cont.). The degree function $g : V \to \mathbb{R}_+$ is defined to be $g(v) := \sum_{u \sim v} w(u, v)$, where $u \sim v$ denotes the set of vertices $u$ connected to $v$ via the edges $(u, v)$. The degree can be regarded as a measure.

Some basic notions in graph theory (cont.). [Figure.]

The space of functions defined on graphs. Let $H(V)$ denote the Hilbert space of real-valued functions endowed with the usual inner product $\langle \varphi, \phi \rangle := \sum_v \varphi(v)\phi(v)$, where $\varphi$ and $\phi$ denote any two functions in $H(V)$. Similarly define $H(E)$. Note that a function $\psi \in H(E)$ need not be symmetric, i.e., we do not require $\psi(u, v) = \psi(v, u)$.

Gradient (or boundary) operator. We define the graph gradient operator $d : H(V) \to H(E)$ by (Zhou and Schölkopf, 2004) $(d\varphi)(u, v) := \sqrt{\frac{w(u, v)}{g(u)}}\,\varphi(u) - \sqrt{\frac{w(u, v)}{g(v)}}\,\varphi(v)$ for all $(u, v) \in E$. Remark: in the lattice case the gradient degrades into $(d\varphi)(u, v) = \varphi(u) - \varphi(v)$, which is the standard difference definition in numerical analysis.

Gradient (or boundary) operator. [Figure.]

Divergence (or co-boundary) operator. We define the adjoint $d^* : H(E) \to H(V)$ of $d$ by (Zhou and Schölkopf, 2004) $\langle d\varphi, \psi \rangle = \langle \varphi, d^*\psi \rangle$ for all $\varphi \in H(V)$, $\psi \in H(E)$. We call $d^*$ the graph divergence operator. Note that the two inner products are taken in $H(E)$ and $H(V)$ respectively.

Divergence (or co-boundary) operator (cont.). We can show that $d^*$ is given by $(d^*\psi)(v) = \sum_{u \sim v} \sqrt{\frac{w(u, v)}{g(v)}}\,\bigl(\psi(v, u) - \psi(u, v)\bigr)$. Remark: the divergence of a vector field $F$ in Euclidean space is defined by $\operatorname{div}(F) = \frac{\partial F_x}{\partial x} + \frac{\partial F_y}{\partial y} + \frac{\partial F_z}{\partial z}$; its physical significance is the net flux flowing out of a point.
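To make these two operators concrete, here is a minimal numerical sketch (not part of the talk; the graph weights and function values are arbitrary illustrative choices). It evaluates the graph gradient and divergence as reconstructed above and checks the adjointness relation $\langle d\varphi, \psi \rangle = \langle \varphi, d^*\psi \rangle$.

```python
import numpy as np

# Small undirected weighted graph, given by a symmetric weight matrix W (illustrative values).
W = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 1.0, 0.2],
              [0.5, 1.0, 0.0, 1.0],
              [0.0, 0.2, 1.0, 0.0]])
g = W.sum(axis=1)                      # degree function g(v) = sum_{u ~ v} w(u, v)

def gradient(phi):
    """Graph gradient: (d phi)(u, v) = sqrt(w/g(u)) phi(u) - sqrt(w/g(v)) phi(v)."""
    n = len(phi)
    dphi = np.zeros((n, n))
    for u in range(n):
        for v in range(n):
            if W[u, v] > 0:
                dphi[u, v] = np.sqrt(W[u, v] / g[u]) * phi[u] - np.sqrt(W[u, v] / g[v]) * phi[v]
    return dphi

def divergence(psi):
    """Graph divergence: (d* psi)(v) = sum_{u ~ v} sqrt(w/g(v)) (psi(v, u) - psi(u, v))."""
    n = psi.shape[0]
    out = np.zeros(n)
    for v in range(n):
        for u in range(n):
            if W[u, v] > 0:
                out[v] += np.sqrt(W[u, v] / g[v]) * (psi[v, u] - psi[u, v])
    return out

phi = np.array([1.0, -0.5, 0.3, 2.0])                 # a function in H(V)
psi = np.random.default_rng(0).normal(size=(4, 4))    # a (not necessarily symmetric) function in H(E)
psi[W == 0] = 0.0                                     # restrict psi to existing edges

lhs = np.sum(gradient(phi) * psi)                     # <d phi, psi>, inner product in H(E)
rhs = np.dot(phi, divergence(psi))                    # <phi, d* psi>, inner product in H(V)
print(np.isclose(lhs, rhs))                           # adjointness holds
```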

Edge derivative. The edge derivative $\frac{\partial}{\partial e}\big|_v : H(V) \to \mathbb{R}$ along edge $e = (v, u)$ at vertex $v$ is defined by $\frac{\partial \varphi}{\partial e}\Big|_v := (d\varphi)(v, u)$.

Local smoothness measure. Define the local variation of $\varphi$ at $v$ to be $\|\nabla_v \varphi\| := \Bigl[\sum_{e \dashv v} \Bigl(\frac{\partial \varphi}{\partial e}\Big|_v\Bigr)^2\Bigr]^{1/2}$, where $e \dashv v$ denotes the set of edges incident on $v$.

Global smoothness measure. For any $p \in [1, \infty)$, let $S_p$ denote the functional on $H(V)$ defined by $S_p(\varphi) := \frac{1}{p}\sum_v \|\nabla_v \varphi\|^p$, and, especially, $S_\infty(\varphi) := \max_v \|\nabla_v \varphi\|$. The functional $S_p(\varphi)$ can be thought of as a measure of the smoothness of $\varphi$.

Discrete analysis and regularization: A basic differential calculus on graphs; Discrete regularization and operators: 2-smoothness ($p = 2$, heat flow, linear), 1-smoothness ($p = 1$, curvature flow, non-linear), $\infty$-smoothness ($p = \infty$, large margin); Directed graphs.

Discrete regularization on graphs. Given a function $y$ in $H(V)$, the goal is to search for another function $f$ in $H(V)$ which is not only smooth enough on the graph but also close enough to the given function $y$. This idea is formalized via the following optimization problem: $\operatorname*{argmin}_{f \in H(V)} \bigl\{ S_p(f) + \frac{\mu}{2}\|f - y\|^2 \bigr\}$. For classification problems, define $y(v) = +1$ or $-1$ if $v$ is labeled as positive or negative, and $0$ otherwise. Each vertex $v$ is finally classified as $\operatorname{sgn} f(v)$.

Discrete analysis and regularization: A basic differential calculus on graphs; Discrete regularization and operators: 2-smoothness ($p = 2$, heat flow, linear), 1-smoothness ($p = 1$, curvature flow, non-linear), $\infty$-smoothness ($p = \infty$, large margin); Directed graphs.

Laplacian operator. By analogy with the Laplace-Beltrami operator on forms on Riemannian manifolds, we define the graph Laplacian $\Delta : H(V) \to H(V)$ by (Zhou and Schölkopf, 2004) $\Delta := \frac{1}{2}\, d^* d$. Remark: the Laplacian in Euclidean space is defined by $\Delta f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2}$.

Laplacian operator (cont.). The Laplacian is a self-adjoint linear operator: $\langle \Delta\varphi, \phi \rangle = \frac{1}{2}\langle d^*d\varphi, \phi \rangle = \frac{1}{2}\langle d\varphi, d\phi \rangle = \frac{1}{2}\langle \varphi, d^*d\phi \rangle = \langle \varphi, \Delta\phi \rangle$. The Laplacian is positive semi-definite: $\langle \Delta\varphi, \varphi \rangle = \frac{1}{2}\langle d^*d\varphi, \varphi \rangle = \frac{1}{2}\langle d\varphi, d\varphi \rangle = S_2(\varphi) \ge 0$. Laplacian and smoothness: $\Delta\varphi = \frac{\partial S_2(\varphi)}{\partial \varphi}$.

Laplacian operator (cont.). An equivalent definition of the graph Laplacian: $(\Delta\varphi)(v) := \frac{1}{2}\sum_{e \dashv v} \frac{1}{\sqrt{g}}\,\frac{\partial}{\partial e}\Bigl(\sqrt{g}\,\frac{\partial \varphi}{\partial e}\Bigr)\Big|_v$. This is basically the discrete analogue of the Laplace-Beltrami operator expressed in terms of the gradient.

Laplacian operator (cont.). Computation of the graph Laplacian: substituting the definitions of the gradient and divergence operators into that of the Laplacian, we have $(\Delta\varphi)(v) = \varphi(v) - \sum_{u \sim v} \frac{w(u, v)}{\sqrt{g(u)g(v)}}\,\varphi(u)$. Remark: in spectral graph theory (Chung, 1997), the graph Laplacian is defined as the matrix $D^{-1/2}(D - W)D^{-1/2}$.

Solving the optimization problem ($p = 2$). Theorem [Zhou and Schölkopf, 2004; Zhou et al., 2003]. The solution $f$ of the optimization problem $\operatorname*{argmin}_{f \in H(V)} \bigl\{ \frac{1}{2}\sum_v \|\nabla_v f\|^2 + \frac{\mu}{2}\|f - y\|^2 \bigr\}$ satisfies $\Delta f + \mu(f - y) = 0$. Corollary: $f = \mu(\mu I + \Delta)^{-1} y$.
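A minimal numerical sketch of this closed form (not from the talk; the two-cluster toy graph, the labeled vertices, and the value of $\mu$ are illustrative assumptions):

```python
import numpy as np

# Two weakly connected clusters {0, 1, 2} and {3, 4, 5} (illustrative weights).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[u, v] = W[v, u] = 1.0
W[2, 3] = W[3, 2] = 0.1

g = W.sum(axis=1)                                    # degrees
D_inv_sqrt = np.diag(1.0 / np.sqrt(g))
Delta = np.eye(6) - D_inv_sqrt @ W @ D_inv_sqrt      # Delta = I - D^{-1/2} W D^{-1/2}

y = np.array([1.0, 0, 0, -1.0, 0, 0])                # one labeled vertex per class, 0 elsewhere
mu = 0.1                                             # trade-off parameter (illustrative)

f = mu * np.linalg.solve(mu * np.eye(6) + Delta, y)  # f = mu (mu I + Delta)^{-1} y
print(np.sign(f))                                    # predicted label sgn f(v) for each vertex
```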

An equivalent iterative algorithm. Isotropic information diffusion (or heat flow) (Zhou et al., 2003). Define $p(u, v) = \frac{w(u, v)}{\sqrt{g(u)g(v)}}$. Then $f^{(t+1)}(v) = \sum_{u \sim v} \alpha\, p(u, v)\, f^{(t)}(u) + (1 - \alpha)\, y(v)$, where $\alpha$ is a parameter in $(0, 1)$. Remark: see also (Eells and Sampson, 1964) for heat diffusion on Riemannian manifolds.
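A sketch of the same toy example solved by the diffusion iteration instead of the matrix inverse (again not from the talk; $\alpha$ is an illustrative value):

```python
import numpy as np

# Same two-cluster toy graph as in the previous sketch (illustrative).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[u, v] = W[v, u] = 1.0
W[2, 3] = W[3, 2] = 0.1

g = W.sum(axis=1)
P = W / np.sqrt(np.outer(g, g))        # p(u, v) = w(u, v) / sqrt(g(u) g(v))
y = np.array([1.0, 0, 0, -1.0, 0, 0])
alpha = 0.9                            # alpha in (0, 1), illustrative

f = y.copy()
for t in range(200):                   # iterate f <- alpha P f + (1 - alpha) y until convergence
    f = alpha * (P @ f) + (1 - alpha) * y
print(np.sign(f))
```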

A toy classification problem. [Figure: (a) toy data (two moons) with unlabeled points and one labeled point per class ($-1$ and $+1$); (b) SVM (RBF kernel); (c) k-NN; (d) ideal classification.]

A toy classification problem (cont.). [Figure: the classification function after $t = 10, 50, 100, 400$ diffusion iterations.] Note: a fully connected graph is used, with $w(u, v) = \exp(-\lambda\,\|u - v\|)$.

A toy classification problem (cont.). [Figure. Note: the function is becoming flatter and flatter.]

Handwritten digit recognition. [Figure: test error vs. number of labeled points for k-NN ($k = 1$), SVM (RBF kernel), and our method.] Digit recognition on the USPS dataset of 16x16 handwritten digits (9298 digits in total); the plot shows test errors for the different algorithms as the number of labeled points increases from 10 to 100.

Example-based ranking. Given an input space $X = \{x_0, x_1, \dots, x_n\} \subset \mathbb{R}^m$, the first point is the query. The goal is to rank the remaining points with respect to their relevances or similarities to the query. [See also, e.g., (Crammer and Singer, 2001; Freund et al., 2004) for other ranking work in the machine learning community.] Define $y(v) = 1$ if vertex $v$ is a query and $0$ otherwise. Then rank each vertex $v$ according to the corresponding function value $f(v)$ (largest ranked first).
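A sketch of the ranking variant on a toy point cloud (not from the talk; the points, the RBF bandwidth, and $\mu$ are illustrative assumptions): $y$ is the indicator of the query vertex, the same linear system is solved, and the vertices are then sorted by $f$.

```python
import numpy as np

# Toy point cloud in R^2 with a fully connected RBF-weighted graph (illustrative setup).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq_dists)                   # illustrative RBF weights
np.fill_diagonal(W, 0.0)

g = W.sum(axis=1)
Delta = np.eye(len(X)) - W / np.sqrt(np.outer(g, g))

y = np.zeros(len(X)); y[0] = 1.0        # x_0 is the query
mu = 0.1
f = mu * np.linalg.solve(mu * np.eye(len(X)) + Delta, y)

ranking = np.argsort(-f)                # largest f(v) ranked first (the query itself comes first)
print(ranking[:10])
```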

A toy ranking problem. [Figure: (a) connected graph. Note: shortest-path-based ranking does not work!]

Image ranking. Ranking digits in USPS. The top-left digit in each panel is the query. The left panel shows the top 99 returned by our method; the right panel shows the top 99 returned by Euclidean-distance-based ranking. Note that, in addition to 3s, there are many more 2s with knots in the right panel.

MoonRanker: a recommendation system. http://www.moonranker.com/

Protein ranking (with J. Weston, A. Elisseeff, C. Leslie and W.S. Noble). Protein ranking: from local to global structure in the protein similarity network. PNAS 101(17), 2004.

Discrete analysis and regularization: A basic differential calculus on graphs; Discrete regularization and operators: 2-smoothness ($p = 2$, heat flow, linear), 1-smoothness ($p = 1$, curvature flow, non-linear), $\infty$-smoothness ($p = \infty$, large margin); Directed graphs.

Curvature operator. By analogy with the curvature of a curve, which is measured by the change in the unit normal, we define the graph curvature $\kappa : H(V) \to H(V)$ by (Zhou and Schölkopf, 2004) $\kappa\varphi := d^*\!\Bigl(\frac{d\varphi}{\|\nabla\varphi\|}\Bigr)$.

Curvature operator (cont.). Computation of the graph curvature: $(\kappa\varphi)(v) = \sum_{u \sim v} \frac{w(u, v)}{\sqrt{g(v)}}\left(\frac{1}{\|\nabla_v \varphi\|} + \frac{1}{\|\nabla_u \varphi\|}\right)\left(\frac{\varphi(v)}{\sqrt{g(v)}} - \frac{\varphi(u)}{\sqrt{g(u)}}\right)$. Unlike the graph Laplacian, the graph curvature is a non-linear operator.

Curvature operator (cont.). Another equivalent definition of the graph curvature, based on the gradient: $(\kappa\varphi)(v) := \sum_{e \dashv v} \frac{1}{\sqrt{g}}\,\frac{\partial}{\partial e}\Bigl(\sqrt{g}\,\frac{1}{\|\nabla\varphi\|}\,\frac{\partial \varphi}{\partial e}\Bigr)\Big|_v$. An elegant property of the graph curvature: $\kappa\varphi = \frac{\partial S_1(\varphi)}{\partial \varphi}$.

Solving the optimization problem ($p = 1$). Theorem [Zhou and Schölkopf, 2004]. The solution $f$ of the optimization problem $\operatorname*{argmin}_{f \in H(V)} \bigl\{ \sum_v \|\nabla_v f\| + \frac{\mu}{2}\|f - y\|^2 \bigr\}$ satisfies $\kappa f + \mu(f - y) = 0$. There is no closed-form solution.

An iterative algorithm. Anisotropic information diffusion (curvature flow) (Zhou and Schölkopf, 2004): $f^{(t+1)}(v) = \sum_{u \sim v} p^{(t)}(u, v)\, f^{(t)}(u) + p^{(t)}(v, v)\, y(v)$ for all $v \in V$. Remark: the weight coefficients $p(u, v)$ are adaptively updated at each iteration, in addition to the classifying function being updated. This weight update enhances the diffusion inside clusters and reduces the diffusion across clusters.

Compute the iteration coefficients: step 1. Compute the new weights $m : E \to \mathbb{R}$ defined by $m(u, v) = w(u, v)\left(\frac{1}{\|\nabla_u f\|} + \frac{1}{\|\nabla_v f\|}\right)$. Remark: the smoother the function $f$ at the nodes $u$ and $v$, the larger the function $m$ at the edge $(u, v)$.

Compute the iteration coefficients: step 2. Compute the coefficients $p(u, v) = \dfrac{m(u, v)/\sqrt{g(u)g(v)}}{\sum_{u \sim v} m(u, v)/g(v) + \mu}$ if $u \sim v$, and $p(v, v) = \dfrac{\mu}{\sum_{u \sim v} m(u, v)/g(v) + \mu}$.
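Putting the last three slides together, here is a rough sketch of the curvature-flow iteration under the reconstructions above (not from the talk; the toy graph, labels, and $\mu$ are illustrative, and the local variation is regularized by a small $\varepsilon$ to avoid division by zero, an implementation detail the slides do not discuss):

```python
import numpy as np

eps = 1e-8                                  # guard against zero local variation (implementation detail)

# Same two-cluster toy graph as before (illustrative).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[u, v] = W[v, u] = 1.0
W[2, 3] = W[3, 2] = 0.1
g = W.sum(axis=1)
y = np.array([1.0, 0, 0, -1.0, 0, 0])
mu = 0.1

def local_variation(f):
    """||nabla_v f|| = sqrt( sum_{u~v} (df)(v, u)^2 )."""
    df = np.sqrt(W / g[:, None]) * f[:, None] - np.sqrt(W / g[None, :]) * f[None, :]  # (df)(v, u)
    return np.sqrt((df ** 2).sum(axis=1)) + eps

f = y.copy()
for t in range(100):
    nv = local_variation(f)
    m = W * (1.0 / nv[:, None] + 1.0 / nv[None, :])      # m(u, v) = w(u, v)(1/||nabla_u f|| + 1/||nabla_v f||)
    denom = m.sum(axis=0) / g + mu                        # sum_{u~v} m(u, v)/g(v) + mu, one value per v
    P = (m / np.sqrt(np.outer(g, g))) / denom[None, :]    # p(u, v) for u ~ v
    p_diag = mu / denom                                   # p(v, v)
    f = (P * f[:, None]).sum(axis=0) + p_diag * y         # f(v) <- sum_u p(u, v) f(u) + p(v, v) y(v)
print(np.sign(f))
```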

A toy classification problem. [Figure: classification on the spiral toy data, with one labeled point per class ($+1$, $-1$) and unlabeled points. Top-left: toy data; top-right: spectral clustering; bottom-left: 2-smoothness; bottom-right: 1-smoothness.]

Discrete analysis and regularization: A basic differential calculus on graphs; Discrete regularization and operators: 2-smoothness ($p = 2$, heat flow, linear), 1-smoothness ($p = 1$, curvature flow, non-linear), $\infty$-smoothness ($p = \infty$, large margin); Directed graphs.

Discrete large margin classification ($p = \infty$). Discrete large margin (Zhou and Schölkopf, 2004): $\operatorname*{argmin}_{f \in H(V)} \bigl\{ \max_v \|\nabla_v f\| + \frac{\mu}{2}\|f - y\|^2 \bigr\}$. Only the worst case is considered! Remark: this is closely related to the classic graph bandwidth problem in combinatorial mathematics (cf. Linial, 2002), which is NP-hard and admits a polylogarithmic approximation.

Discrete analysis and regularization: Graph, gradient and divergence; Discrete regularization and operators: 2-smoothness ($p = 2$, heat flow, linear), 1-smoothness ($p = 1$, curvature flow, non-linear), $\infty$-smoothness ($p = \infty$, large margin); Directed graphs.

Classification and ranking on directed graphs. The differential geometry on undirected graphs can be naturally generalized to directed graphs (Zhou et al., 2004).

Two key observations on the WWW. The pages in a densely linked subgraph are likely to belong to a common topic; therefore it is natural to force the classification function to change slowly on densely linked subgraphs. The pairwise similarity is measured based on the mutual reinforcement relationship between hubs and authorities (Kleinberg, 1998): a good hub node points to many good authorities, and a good authority node is pointed to by many good hubs.

The importance of directionality. [Figure: AUC vs. number of labeled points (2 to 20) for directed ($\gamma = 1$, $\alpha = 0.10$) vs. undirected ($\alpha = 0.10$) regularization on four tasks.] Classification on the WebKB dataset: student vs. the rest in each university. Taking the directionality of edges into account can yield substantial accuracy gains.

Outline: Introduction; Discrete analysis and regularization; Related work; Discussion and future work.

A closely related regularizer. The 2-smoothness regularizer can be rewritten as (Zhou et al., 2003) $\sum_{u,v} w(u, v)\left(\frac{f(u)}{\sqrt{g(u)}} - \frac{f(v)}{\sqrt{g(v)}}\right)^2$. A closely related one, $\sum_{u,v} w(u, v)\bigl(f(u) - f(v)\bigr)^2$, is proposed by (Belkin and Niyogi, 2002, 2003; Zhu et al., 2003). See also (Joachims, 2003) for a similar one.

Similarities between the two regularizers. Both can be rewritten as quadratic forms: $\sum_{u,v} w(u, v)\left(\frac{f(u)}{\sqrt{g(u)}} - \frac{f(v)}{\sqrt{g(v)}}\right)^2 = f^T D^{-1/2}(D - W)D^{-1/2} f$ and $\sum_{u,v} w(u, v)\bigl(f(u) - f(v)\bigr)^2 = f^T (D - W) f$. Both $D^{-1/2}(D - W)D^{-1/2}$ and $D - W$ are called the graph Laplacian (an unfortunate truth).
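A quick numerical check of these two identities, offered as a sketch rather than as part of the talk (it sums over unordered pairs $\{u, v\}$ so that each edge is counted once; the random graph and function are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = rng.uniform(0.1, 1.0)   # random positive weights on a small graph
g = W.sum(axis=1)
D = np.diag(g)
f = rng.normal(size=6)

# Sum over unordered pairs {u, v}, i.e., each edge counted once.
norm_sum, unnorm_sum = 0.0, 0.0
for u in range(6):
    for v in range(u + 1, 6):
        norm_sum += W[u, v] * (f[u] / np.sqrt(g[u]) - f[v] / np.sqrt(g[v])) ** 2
        unnorm_sum += W[u, v] * (f[u] - f[v]) ** 2

D_inv_sqrt = np.diag(1.0 / np.sqrt(g))
L_norm = D_inv_sqrt @ (D - W) @ D_inv_sqrt
print(np.isclose(norm_sum, f @ L_norm @ f))      # normalized quadratic form
print(np.isclose(unnorm_sum, f @ (D - W) @ f))   # unnormalized quadratic form
```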

Differences between the two regularizers: limit cases. (Bousquet et al., 2003) showed the following limit: $\sum_{u,v} w(u, v)\bigl(f(u) - f(v)\bigr)^2 \to \int \|\nabla f(x)\|^2\, p^2(x)\, dx$. A conjecture: $\sum_{u,v} w(u, v)\left(\frac{f(u)}{\sqrt{g(u)}} - \frac{f(v)}{\sqrt{g(v)}}\right)^2 \to \int \|\nabla f(x)\|^2\, p(x)\, dx$.

Differences between the two regularizers: experiments (cont.). [Figure: test error vs. number of labeled points (4 to 50) for the unnormalized regularizer and our method.] Note: a subset of USPS containing the digits 1 to 4; the same RBF kernel is used for all methods.

Improving the unnormalized regularizer by heuristics. (Belkin and Niyogi, 2002, 2003): choose a number $k$ and construct a $k$-NN graph with 0/1 weights over the points, using the weight matrix as the affinity among points. (Zhu et al., 2003): estimate the proportion of the different classes based on the labeled points, and then rescale the function based on the estimated proportion. Both of these are essentially empirical approximations to the normalization in our 2-smoothness regularizer.

Improving the unnormalized regularizer by heuristics: experiments. [Figure: test error vs. number of labeled points (4 to 50) for the unnormalized regularizer, the unnormalized regularizer with heuristics, and our method.] Note: a subset of USPS containing the digits 1 to 4; the same RBF kernel is used for all methods.

Another related work: graph/cluster kernels. Graph or cluster kernels (Smola and Kondor, 2003; Chapelle et al., 2002): decompose the (normalized) graph Laplacian as $U^T \Lambda U$ and then replace the eigenvalues $\lambda$ with $\varphi(\lambda)$, where $\varphi$ is a decreasing function, to obtain the so-called graph kernel $K = U^T \operatorname{diag}[\varphi(\lambda_1), \dots, \varphi(\lambda_n)]\, U$. In the closed form of the $p = 2$ case, the matrix $(\mu I + \Delta)^{-1}$ can be viewed as a graph kernel with $\varphi(\lambda) = 1/(\mu + \lambda)$.
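A minimal sketch of this spectral-transform view (not from the talk; the toy graph and $\mu$ are illustrative; a diffusion kernel would instead use, e.g., $\varphi(\lambda) = \exp(-\sigma\lambda)$):

```python
import numpy as np

W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[u, v] = W[v, u] = 1.0
W[2, 3] = W[3, 2] = 0.1
g = W.sum(axis=1)
Delta = np.eye(6) - W / np.sqrt(np.outer(g, g))      # normalized graph Laplacian

lam, U = np.linalg.eigh(Delta)                       # Delta = U diag(lam) U^T
mu = 0.1
phi = lambda l: 1.0 / (mu + l)                       # decreasing spectral transform
K = U @ np.diag(phi(lam)) @ U.T                      # graph kernel built from transformed eigenvalues

# Sanity check: with this phi, K equals (mu I + Delta)^{-1}, the matrix appearing in the p = 2 closed form.
print(np.allclose(K, np.linalg.inv(mu * np.eye(6) + Delta)))
```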

Difference from graph/cluster kernels. The matrix $(\mu I + \Delta)^{-1}$ is naturally derived from our regularization framework for transductive inference; in contrast, graph/cluster kernels are obtained by manipulating the eigenvalues. SVM combined with $(\mu I + \Delta)^{-1}$ does not work well in our transductive experiments. When $p$ takes other values, e.g. $p = 1$, no corresponding kernel exists any more.

Outline: Introduction to learning on discrete spaces; Discrete analysis and regularization; Related work; Limitations and future work.

Limitation: how to beat our method? One can construct arbitrarily bad problems for a given algorithm. Theorem [No Free Lunch; e.g., Devroye, 1996]. For any algorithm, any $n$ and any $\epsilon > 0$, there exists a distribution $P$ such that $R^* = 0$ and $P\bigl[R(g_n) \ge \tfrac{1}{2} - \epsilon\bigr] = 1$, where $g_n$ is the function estimated by the algorithm based on the $n$ training examples.

Limitation: how to beat our method? (cont.) [Figure: (a) toy data (two moons) with unlabeled points and one labeled point per class; (b) true labels.]

Future work: theory. We need a new statistical learning theory. How is a graph like a manifold? (the convergence of the discrete differential operators). How does transductive inference converge? (parallel to the bounds given in the context of inductive inference). What is the transductive principle? (parallel to the inductive inference principle of Structural Risk Minimization).

Future work: algorithms. From kernelize to transductize (or discretize): learning with graph data (undirected, directed, bipartite, hypergraph), time series data, ...; multi-label learning, hierarchical classification, regression, ...; active learning, selection problems, ...; unsupervised learning combined with partial prior knowledge, such as clustering and manifold learning, ...

Future work: applications. Computational biology; Web information retrieval; natural language processing; ...

Future work: beyond machine learning. The discrete differential calculus and geometry over discrete objects provides exact computation of the differential operators and can be applied to further problems. Image processing/computer graphics: digital images and graphics are represented as lattices/graphs. Structure and evolution of real-world networks: WWW, Internet, biological networks, ... (e.g., what do the curvatures of these networks mean?).

This talk is based on the following work. Differential geometry: D. Zhou and B. Schölkopf. Transductive Inference with Graphs. Technical Report, Max Planck Institute for Biological Cybernetics, August 2004. Directed graphs: D. Zhou, B. Schölkopf and T. Hofmann. Semi-supervised Learning on Directed Graphs. NIPS 2004. Undirected graphs: D. Zhou, O. Bousquet, T.N. Lal, J. Weston and B. Schölkopf. Learning with Local and Global Consistency. NIPS 2003. Ranking: D. Zhou, J. Weston, A. Gretton, O. Bousquet and B. Schölkopf. Ranking on Data Manifolds. NIPS 2003. Bioinformatics: J. Weston, A. Elisseeff, D. Zhou, C. Leslie and W.S. Noble. Protein ranking: from local to global structure in the protein similarity network. PNAS 101(17), 2004. Bioinformatics: J. Weston, C. Leslie, D. Zhou, A. Elisseeff and W.S. Noble. Semi-Supervised Protein Classification using Cluster Kernels. NIPS 2003.

Conclusions. Developed the discrete analysis and geometry over discrete objects. Constructed the discrete regularizer, from which effective transductive algorithms are derived. Validated the transductive algorithms on many real-world problems. Transduction and induction are the two sides of machine learning: discrete vs. continuous, two sides of the same coin.

"... perhaps the success of the Heisenberg method points to a purely algebraic method of description of nature, that is, to the elimination of continuous functions from physics. Then, however, we must give up, by principle, the space-time continuum ..." (Albert Einstein). "Although there have been suggestions that space-time may have a discrete structure I see no reason to abandon the continuum theories that have been so successful." (Stephen Hawking)