Data and Structure in Machine Learning


1 Data and Structure in Machine Learning
John Lafferty, School of Computer Science, Carnegie Mellon University
Based on joint work with Zoubin Ghahramani, Risi Kondor, Guy Lebanon, Jerry Zhu

2 The Data Century
"The coming century is surely the century of data. Our society is investing massively in the collection and processing of data of all kinds, on scales unimaginable until recently."
David Donoho (August 2000)
Satellite images, DNA sequences, Internet portals, sensor networks, text and hypertext, financial transactions, environment monitoring, astrostatistics, ...

3 Data Maven: IBM's John Cocke
- Language models from data
- Machine translation from data
- Optimizing compilers

4 Scholarly Documents

5 Relating Phenotype and Genotype

6 NCAR Mass Storage System
Central facility for storing data used and generated by climate models, field experiments, and other earth-science models.
- Current size (October 2003): 1.5 petabytes
- Growth rate: 50 terabytes/month
- Coupled to, but exceeding, Moore's law growth in compute cycles

7 Garbage Out: Municipal Waste
(Figure: U.S. population in millions vs. municipal waste in millions of tons.)
Waste importers: 1. Pennsylvania, 2. Virginia, 3. Michigan, 4. Illinois, 5. Indiana.
Waste exporters: 1. New York, 2. New Jersey, 3. Missouri, 4. Maryland, 5. Massachusetts.
(Tonnage figures not preserved in this transcription.)

8 Using Labeled and Unlabeled Data
- The standard paradigm in machine learning is supervised learning: examples given by teacher to student; well understood theoretically, with startling successes
- Labeled data is limited: labeling examples is often very expensive (example: protein shape classification, crystallography)
- How can unlabeled data be most effectively leveraged?
Making theoretical and practical progress on the labeled/unlabeled data question is currently one of the most important and interesting problems in machine learning.

9 Labeled and Unlabeled Data as a Graph

10 Outline
- Laplacians and graph kernels: Laplacians on graphs; diffusion kernels on graphs; diffusion kernels on statistical manifolds
- Semi-supervised learning: Gaussian fields and graph kernels; active learning; experiments with text and images
- Kernel conditional random fields
- Challenges

11 Kernels (Aizerman, Braverman & Rozonoér, 1964; Boser, Guyon & Vapnik, 1992)
- A powerful and elegant tool: dramatic improvements in accuracy (and sometimes computational efficiency)
- Non-linear classifiers, with an implicit representation of the feature space
- A popular indoor sport: "kernel X", where X is PCA, ICA, clustering, ...
- Primarily a tool for Euclidean space

12 Linear Classifiers vs. Kernel Classifiers
$\hat{f}(x) = \sum_{i=1}^N \alpha_i y_i \langle x, x_i \rangle$ versus $\hat{f}(x) = \sum_{i=1}^N \alpha_i y_i K(x, x_i)$

13 Popular Kernels
Gaussian: $K_\sigma(x, x') = \exp\left( -\|x - x'\|^2 / 2\sigma^2 \right)$
Polynomial: $K_d(x, x') = (1 + \langle x, x' \rangle)^d$
Processing data into vectors in $\mathbb{R}^n$ is the "dirty laundry" of machine learning (Dietterich); a proper set of features is crucial. Often data is shoehorned into Euclidean space (e.g., the vector space representation for text processing). Are there methods that take advantage of the natural structure of the data and problem?
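
A minimal NumPy sketch of the two kernels above (the function names are illustrative, not from the talk):

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), for all pairs of rows
        sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
        return np.exp(-sq / (2 * sigma**2))

    def polynomial_kernel(X, Y, d=2):
        # K(x, x') = (1 + <x, x'>)^d
        return (1 + X @ Y.T) ** d

    X = np.random.randn(5, 3)
    K = gaussian_kernel(X, X)   # 5x5 Gram matrix, symmetric positive semidefinite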

14 Structured Data
What if data lies on a graph or other discrete structure? (Pictured: a parse tree with nodes S, NP, VP, N, V over the words "time flies like ..."; a hyperlink graph over Google, Cornell, CMU, foobar.com, NSF.)

15 An Elliptical Sun
"The Laplace operator in its various manifestations is the most beautiful and central object in all of mathematics. Probability theory, mathematical physics, Fourier analysis, partial differential equations, the theory of Lie groups, and differential geometry all revolve around this sun, and its light even penetrates such obscure regions as number theory and algebraic geometry." Edward Nelson, Tensor Analysis
"There is nothing new under the sun." Ecclesiastes 1:9

16 Laplacian on a Riemannian Manifold
On a Riemannian manifold with metric g, the geometric Laplacian is $\Delta = -\mathrm{div} \circ \mathrm{grad}$, or
$\Delta f = -\frac{1}{\sqrt{\det g}} \sum_i \partial_i \left( g^{ij} \sqrt{\det g}\; \partial_j f \right) = -\sum_{ij} g^{ij} \partial_j \partial_i f + \text{(lower order terms)}$
More generally, $\Delta = d d^* + d^* d$.

17 Laplacian: Invariant Definition
$\Delta f(p) = \lim_{r \to 0} \frac{2n}{r^2} \left( f(p) - \frac{1}{\mathrm{vol}(S_r)} \int_{S_r(p)} f \right)$
f is harmonic if $\Delta f = 0$.

18 Combinatorial Laplacian
Think of edge $e$ as a tangent vector at $e^-$. For $f : V \to \mathbb{R}$, $df : E \to \mathbb{R}$ is the 1-form
$df(e) = f(e^+) - f(e^-)$
Then $\Delta = d^* d$ is the discrete analogue of $-\mathrm{div} \circ \mathrm{grad}$.

19 Combinatorial Laplacian
Since $\langle \Delta f, g \rangle = \langle df, dg \rangle$, $\Delta$ is self-adjoint and positive. $\Delta = D - W$ is an averaging operator:
$\Delta f(x) = \sum_{y \sim x} w_{xy} \left( f(x) - f(y) \right) = d(x)\, f(x) - \sum_{y \sim x} w_{xy} f(y)$
We say f is harmonic if $\Delta f = 0$.
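
A small sketch of the combinatorial Laplacian in NumPy, checking the averaging form above on a toy graph (the weight matrix is illustrative):

    import numpy as np

    # Combinatorial Laplacian Delta = D - W of a weighted graph
    W = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)   # symmetric edge weights
    D = np.diag(W.sum(axis=1))                  # degrees d(x)
    Delta = D - W

    # (Delta f)(x) = d(x) f(x) - sum_{y ~ x} w_xy f(y), as on the slide
    f = np.random.randn(4)
    print(np.allclose(Delta @ f, D @ f - W @ f))   # True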

20 Diffusion Kernels on Graphs (Kondor and L., ICML 2002)
Kernels are generalized covariances. The crucial positive semidefiniteness property required by RKHS theory is in general difficult to obtain.
Let $\Delta$ be the graph Laplacian. In analogy with the continuous setting,
$\frac{d}{dt} K_t = -\Delta K_t$
is the heat equation on a graph. The solution $K_t = e^{-t\Delta}$ is called the heat kernel or diffusion kernel.

21 Simple Example: $K_n$
For the complete graph $K_n$ on n vertices, $\Delta_{ij} = n\,\delta(i, j) - 1$ and
$K_t(i, j) = \begin{cases} \dfrac{1 + (n-1)e^{-nt}}{n} & i = j \\[4pt] \dfrac{1 - e^{-nt}}{n} & i \neq j \end{cases}$
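
A quick numerical check of this closed form against the matrix exponential (a sketch; SciPy's expm computes $e^{-t\Delta}$ directly):

    import numpy as np
    from scipy.linalg import expm

    n, t = 5, 0.3
    Delta = n * np.eye(n) - np.ones((n, n))    # Delta_ij = n*delta(i,j) - 1
    K_t = expm(-t * Delta)                     # heat kernel K_t = exp(-t Delta)

    diag = (1 + (n - 1) * np.exp(-n * t)) / n  # closed form, i = j
    off = (1 - np.exp(-n * t)) / n             # closed form, i != j
    expected = off * np.ones((n, n)) + (diag - off) * np.eye(n)
    print(np.allclose(K_t, expected))          # True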

22 Building Up Kernels
If $K_t^{(i)}$ are kernels on $X_i$, then $K_t = \prod_{i=1}^n K_t^{(i)}$ is a kernel on $X_1 \times \cdots \times X_n$.
For the hypercube, we get
$K_t(x, x') \propto (\tanh t)^{d(x, x')}$
where $d(x, x')$ is the Hamming distance (checked numerically in the sketch below).
Other graphs with explicit diffusion kernels:
- Infinite trees (Chung & Yau, 1999)
- Rooted trees
- Cycles
- Strings with wildcards
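
A sketch verifying the hypercube formula: the n-cube is the n-fold product of the two-vertex complete graph, so its heat kernel factors, and the off-diagonal decay is exactly $(\tanh t)^d$ in the Hamming distance d:

    import numpy as np
    from scipy.linalg import expm
    from itertools import product

    n, t = 3, 0.7
    verts = np.array(list(product([0, 1], repeat=n)))   # 2^n hypercube vertices

    # Adjacency: pairs at Hamming distance 1; Laplacian L = D - A
    d = np.abs(verts[:, None, :] - verts[None, :, :]).sum(-1)  # Hamming distances
    A = (d == 1).astype(float)
    L = np.diag(A.sum(axis=1)) - A

    K = expm(-t * L)
    print(np.allclose(K / K[0, 0], np.tanh(t) ** d))    # True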

23 RKHS Representation
The general spectral representation of a kernel,
$K(x, y) = \sum_{i=1}^n \lambda_i \phi_i(x) \phi_i(y)$
leads to the reproducing kernel Hilbert space inner product
$\left\langle \sum_i a_i \phi_i,\ \sum_i b_i \phi_i \right\rangle_{H_K} = \sum_i \frac{a_i b_i}{\lambda_i}$
For the diffusion kernel, the RKHS inner product is
$\langle f, g \rangle_{H_K} = \sum_i e^{t \mu_i} \hat{f}_i \hat{g}_i$
Regularization interpretation: functions with small norm don't oscillate rapidly on the graph.
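
A small sketch of this RKHS norm for the diffusion kernel, computed from the Laplacian spectrum and checked against $f^\top K_t^{-1} f$ (toy path graph, purely illustrative):

    import numpy as np
    from scipy.linalg import expm

    W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph
    Delta = np.diag(W.sum(axis=1)) - W
    mu, phi = np.linalg.eigh(Delta)             # eigenvalues mu_i, eigenvectors phi_i

    t, f = 1.0, np.array([1.0, -2.0, 1.0])
    fhat = phi.T @ f                            # graph Fourier coefficients
    norm_sq = np.sum(np.exp(t * mu) * fhat**2)  # sum_i e^{t mu_i} fhat_i^2

    K_t = expm(-t * Delta)
    print(np.isclose(norm_sq, f @ np.linalg.solve(K_t, f)))   # True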

24 Results on UCI Datasets
Using the voted perceptron (Freund & Schapire, 1999). Test error with the Hamming-distance kernel, and the reduction in support vectors with the diffusion kernel (the remaining table entries did not survive transcription):

Data Set        Hamming kernel error    Diffusion kernel: SV improvement
Breast Cancer   7.64%                   83%
Hepatitis       17.98%                  58%
Income          19.19%                  8%
Mushroom        3.36%                   70%
Votes           4.69%                   12%

For categorical attributes $A_i$,
$K(x, x') \propto \prod_i \left( \frac{1 - e^{-|A_i| t}}{1 + (|A_i| - 1)\, e^{-|A_i| t}} \right)^{d(x_i, x'_i)}$

25 Recent Application of Diffusion Kernels (Vert and Kanehisa, NIPS 02)
Task: gene function prediction from microarray data.
Question: can knowledge of metabolic pathways and catalyzing enzymes improve the performance?
Idea: the documented network of biochemical reactions forms a graph whose vertices are genes. Relevant features are likely to exhibit correlation with respect to the topology of the graph. Combine the diffusion kernel with expression profiles using kernel CCA (Bach and Jordan, 2002).
Experiments on the genes of the yeast S. cerevisiae yield consistent improvement in predicting genes' functional class.

26 Diffusion on Statistical Manifolds (L. and Lebanon, NIPS 02)
Similarity between x and x' may be difficult to quantify, even for domain experts. Generative models are fairly easy to derive, and a good generative model will use domain knowledge. The Fisher information on the family codifies this into a distance, giving the statistical family the structure of a Riemannian manifold.
Associate each data point with a model, $x \mapsto \theta(x)$, and consider the diffusion kernel on the information manifold.

27 Information Geometry
$\mathcal{F} = \{ p(\cdot \mid \theta),\ \theta \in \Theta \subset \mathbb{R}^d \}$ is a d-dimensional statistical family. The Fisher information matrix $[g_{ij}(\theta)]$,
$g_{ij}(\theta) = \int_X \partial_i \log p(x \mid \theta)\; \partial_j \log p(x \mid \theta)\; p(x \mid \theta)\, dx$
defines a Riemannian metric on $\Theta$ which is invariant under reparameterization. See (Amari, 00; Kass & Vos, 97).

28 Special Case: Multinomial
The Fisher metric maps the simplex isometrically onto (a portion of) the sphere; the geodesic distance is
$d(\theta, \theta') = 2 \arccos\left( \sum_{i=1}^{d+1} \sqrt{\theta_i\, \theta'_i} \right)$

29 Parametrices for the Heat Kernel
The heat kernel has an asymptotic expansion
$K_t(x, y) \sim (4\pi t)^{-n/2} \exp\left( -\frac{d^2(x, y)}{4t} \right) \left( \sum_{i=0}^{N} \psi_i(x, y)\, t^i + O(t^N) \right)$
in terms of the parametrices $\psi_i$. Much of modern differential geometry is based on Laplacians, heat kernels, and related constructions.

30 Special Case: Multinomial
The parametrix expansion leads to the approximation
$K_t(x, x') \approx \frac{1}{(4\pi t)^{d/2}} \exp\left( -\frac{1}{t} \arccos^2 \left( \sum_{i=1}^{d+1} \sqrt{\theta_i(x)\, \theta_i(x')} \right) \right)$
Positive definite for small t.
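
A sketch of this approximate kernel in NumPy. Here $\theta(x)$ is taken to be a document's L1-normalized term-frequency vector, one natural choice for the association $x \mapsto \theta(x)$ (an assumption for illustration):

    import numpy as np

    def multinomial_diffusion_kernel(theta, theta_p, t=0.1):
        # theta, theta_p: points on the d-simplex (nonnegative, summing to 1)
        d = len(theta) - 1
        cos = np.clip(np.sqrt(theta * theta_p).sum(), 0.0, 1.0)
        return (4 * np.pi * t) ** (-d / 2) * np.exp(-np.arccos(cos) ** 2 / t)

    tf1 = np.array([3.0, 1.0, 0.0, 2.0])   # toy term-frequency vectors
    tf2 = np.array([1.0, 1.0, 1.0, 1.0])
    print(multinomial_diffusion_kernel(tf1 / tf1.sum(), tf2 / tf2.sum()))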

31 SVM Decision Boundaries: Trinomial
(Figure: decision boundaries on the trinomial simplex for the Gaussian kernel vs. the information diffusion kernel.)

32 WebKB Text Classification Experiments
(Plots: test set error rate vs. number of training examples for linear, RBF, and diffusion kernels, on two tasks.)
Intuition: the improvement comes from the metric, which places more emphasis on values close to the boundary of the simplex.
Related work: the Fisher kernel of Jaakkola and Haussler.

33 Bounds on Covering Numbers and Generalization Error
Theorem. Suppose the Fisher information manifold M is compact, with positive curvature, dimension d, and volume V. Then the SVM hypothesis class covering numbers $\mathcal{N}(\epsilon, F_R(x))$ for the diffusion kernel $K_t$ on M satisfy
$\log \mathcal{N}(\epsilon, F_R(x)) = O\left( \frac{V}{t^{d/2}} \log^{\frac{d+2}{2}} \left( \frac{1}{\epsilon} \right) \right)$
Based on eigenvalue bounds in differential geometry:
$c_1(d) \left( \frac{j}{V} \right)^{2/d} \leq \mu_j \leq c_2(d) \left( \frac{j+1}{V} \right)^{2/d}$
Better error bounds are now available based on Rademacher averages.

34 Semi-Supervised Learning: Initial Explorations
Two main frameworks: generative and discriminative. According to standard statistical theory, only generative models can benefit from unlabeled data.
- Castelli and Cover (Trans. IT, 1996): assumes a simple, identifiable generative mixture model; quantifies the relative value of labeled/unlabeled data in terms of Fisher information; asymptotic analysis
- Blum and Mitchell (COLT, 1998), Co-Training: two discriminative classifiers, sufficiently redundant; can PAC-learn from unlabeled data; unrealistic independence assumptions

35 Labeled and Unlabeled Data as a Graph

36 Labeled and Unlabeled Data as a Graph
Idea: construct a random field on the graph.
Intuition: similar examples have similar labels; information propagates from labeled examples.
The graph encodes prior information, which may be inaccurate or misleading!

37 Random Fields
Energy function
$E(y) = \sum_{ij} w_{ij} (y_i - y_j)^2$
with (conditional) random field $p(y \mid x) \propto \exp(-\beta E(y))$.
Discrete case, $y_i \in \{+1, -1\}$: the most probable configuration is given by graph mincuts (Blum and Chawla, 2001); not unique; intractable to compute marginals.
Our framework: relaxation to the continuous case, $y_i \in \mathbb{R}$: convex, with a unique mode; a Gaussian field, with efficient algorithms.

38 Gaussian Fields and Harmonic Functions
Partition the combinatorial Laplacian over the labeled and unlabeled points:
$\Delta = \begin{bmatrix} \Delta_{LL} & \Delta_{LU} \\ \Delta_{UL} & \Delta_{UU} \end{bmatrix} = D - W$
where $W = [w_{ij}]$ is the weight matrix and $D = \mathrm{diag}(d_1, \ldots, d_n)$.
The field is clamped on the labeled data. On the unlabeled data it is Gaussian,
$y_U \sim N\left( f_U,\ \Delta_{UU}^{-1} \right)$, with harmonic mean $f_U = -\Delta_{UU}^{-1} \Delta_{UL}\, f_L$
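
A minimal sketch of the harmonic solution on a toy graph (the weights and labels are illustrative):

    import numpy as np

    W = np.array([[0, 1, 1, 0, 0],
                  [1, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)
    Delta = np.diag(W.sum(axis=1)) - W

    L, U = [0, 4], [1, 2, 3]        # labeled / unlabeled vertex indices
    f_L = np.array([1.0, 0.0])      # clamped labels

    # f_U = -Delta_UU^{-1} Delta_UL f_L
    f_U = -np.linalg.solve(Delta[np.ix_(U, U)], Delta[np.ix_(U, L)] @ f_L)
    print(f_U)                      # harmonic values, interpolating in [0, 1]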

39 Random Walks and Electric Networks
(Figure: the graph as an electric network with edge resistances $R_{ij} = 1/w_{ij}$, a +1 volt source at the positively labeled points and 0 volts at the negatively labeled points.)
Related recent work: Szummer and Jaakkola 2002, Belkin and Niyogi 2002, Chapelle et al. 2003, Deng et al.

40 Graph Kernels
Integrate the diffusion kernel over time:
$\Delta_{UU}^{-1} = \int_0^\infty e^{-t\, \Delta_{UU}}\, dt$
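
A quick numerical check of this identity on a tiny positive-definite $\Delta_{UU}$ (invertible when the graph is connected and at least one point is labeled):

    import numpy as np
    from scipy.linalg import expm

    Delta_UU = np.array([[2.0, -1.0], [-1.0, 2.0]])
    dt = 0.01
    integral = sum(expm(-t * Delta_UU) * dt for t in np.arange(0.0, 40.0, dt))
    print(np.allclose(integral, np.linalg.inv(Delta_UU), atol=1e-2))   # True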

41 Gaussian Fields vs. Gaussian Processes
Gaussian field: $p(y) \propto \exp\left( -y^\top \Delta\, y \right)$, which uses the unlabeled data.
Gaussian process: $p(y) \propto \exp\left( -y^\top \Sigma^{-1} y \right)$, which ignores the unlabeled data.

42 Text Classification: PC vs. MAC (Zhu, Ghahramani, and L., ICML 2003)
(Plot: accuracy vs. labeled set size for the thresholded Gaussian field vs. the SVM.)

43 Digit Recognition (all 10)
(Plot: accuracy vs. labeled set size for the thresholded Gaussian field vs. the SVM.)

44 Active Learning (Experimental Design)
Idea: choose the query point to be labeled. The common heuristic is to query the most uncertain example. Our approach: minimize the (estimated) expected risk, combining semi-supervised and active learning. This can be implemented efficiently in the Gaussian field/process view.
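
A brute-force sketch of this criterion on the Gaussian field: for each candidate query, average the post-query risk over its two possible labels, weighted by the current harmonic prediction, and query the minimizer. (The talk notes this can be done efficiently, e.g. with rank-one updates; this naive version and its helper names are ours.)

    import numpy as np

    def harmonic(Delta, L, U, f_L):
        return -np.linalg.solve(Delta[np.ix_(U, U)], Delta[np.ix_(U, L)] @ f_L)

    def risk(f_U):
        # estimated risk: sum over unlabeled points of min(f, 1 - f)
        return np.minimum(f_U, 1 - f_U).sum()

    def best_query(Delta, L, U, f_L):
        f_U = harmonic(Delta, L, U, f_L)
        expected = {}
        for idx, k in enumerate(U):
            U_k = [u for u in U if u != k]
            expected[k] = sum(
                prob * risk(harmonic(Delta, L + [k], U_k, np.append(f_L, lab)))
                for lab, prob in ((1.0, f_U[idx]), (0.0, 1 - f_U[idx])))
        return min(expected, key=expected.get)   # vertex with least expected risk

    W = np.array([[0, 1, 1, 0, 0], [1, 0, 1, 0, 0], [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1], [0, 0, 0, 1, 0]], dtype=float)
    Delta = np.diag(W.sum(axis=1)) - W
    print(best_query(Delta, [0, 4], [1, 2, 3], np.array([1.0, 0.0])))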

45 Active Learning: Text Classification
(Plot: accuracy vs. labeled set size for active learning, random query, and the SVM.)

46 Active Learning: Digit Classification
(Plot: accuracy vs. labeled set size for active learning, random query, and the SVM.)

47 Why Does This Work So Well?
The data matches our assumptions: clusters in the graph have like labels. In newsgroups data, one posting will quote part of a previous posting, creating threads of discussion that are well captured by the graph. In active learning, a single label suffices to disambiguate an entire thread.
If your computer could ask you just one question, should it ask the one it's most unsure about, or the one from which it can learn the most? The graph structure gives the program a way to estimate the impact of each such question.

48 Generalization Error Bounds
Gaussian prior over y; the covariance kernel uses the unlabeled data. What if the prior assumptions are wrong?
Information-theoretic, PAC bounds (McAllester, 2000; Seeger, 2002; Derbeko et al., 2003):
$\epsilon_{\mathrm{gen}} \leq \epsilon_{\mathrm{emp}} + O\left( \sqrt{ \frac{ D(Q \,\|\, P) + 1 + \log(1/\delta) }{ n } } \right)$
Q is the approximate posterior (e.g., Laplace, Expectation Propagation), P is the prior. If the classification task doesn't match the prior, the bound blows up.

49 Inference for the Kernel (Prior)
The generalization bound suggests minimizing $D(Q \,\|\, P)$. With a Gaussian approximation,
$D(Q \,\|\, P) = \log \det K + \mathrm{tr}\left( f^\top K^{-1} f \right) + \text{constant}$
This is a convex optimization problem for graph selection, closely related to computing the channel capacity of Gaussian channels.

50 Recent Work: Free Food Cam Data
Our goal is not to solve the object/person recognition problem, but to develop machine learning methods that complement image processing techniques, improving the system by leveraging different sources of information.

51 Extension: Conditional Random Fields (L., McCallum & Pereira, 2001)
$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_s \lambda_s \phi_s(x, y) \right)$
Global normalization. Undirected graphical models: a general and expressive modeling technique.
- Using features, incorporate domain knowledge without increasing the state space
- Do not make strong independence assumptions
- Comparable to HMMs in computational efficiency
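
A toy sketch of this model: a two-feature linear-chain CRF with $Z(x)$ computed by brute-force enumeration (in practice one uses forward-backward; the features and weights here are made up for illustration):

    import numpy as np
    from itertools import product

    def score(x, y, lam):
        # phi_1: how often the label matches the observation (emission feature)
        # phi_2: how often consecutive labels agree (transition feature)
        phi1 = sum(1.0 for xi, yi in zip(x, y) if xi == yi)
        phi2 = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
        return lam[0] * phi1 + lam[1] * phi2

    def p_y_given_x(x, y, lam, labels=(0, 1)):
        # p(y|x) = exp(score(x, y)) / Z(x), with Z(x) summed over all sequences
        Z = sum(np.exp(score(x, yp, lam)) for yp in product(labels, repeat=len(x)))
        return np.exp(score(x, y, lam)) / Z

    x = (0, 1, 1, 0)
    print(p_y_given_x(x, (0, 1, 1, 0), lam=(2.0, 0.5)))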

52 Protein Secondary Structure Prediction
Predict protein secondary structure: coil, α-helix, or β-strand. The current state of the art: SVM window-based classifiers using a polynomial kernel derived from PSI-BLAST queries. CRFs give improvements by modeling sequential structure.
Our research question: can unlabeled protein data be effectively incorporated using a graph kernel?
Current work with Yan Liu and Jerry Zhu, part of a joint UPenn/CMU NSF ITR grant.

53 Kernel Conditional Random Fields
Proposition. For each clique c, let $K_c$ be a Mercer kernel with associated RKHS norm $\|\cdot\|_c$, and let $\Omega_c : \mathbb{R} \to \mathbb{R}$ be a strictly monotonic function. Then the minimizer
$\{f_c^\star\} = \operatorname{argmin}_{f_c \in \mathcal{H}_c} \sum_{i=1}^m l\left( x_i, y_i, \{ f_c(x_i, y_c) \}_{c, y_c} \right) + \sum_{c=1}^C \Omega_c\left( \|f_c\|_c \right)$
if it exists, has the form
$f_c^\star(\cdot) = \sum_{i, y_c} \alpha_{c,i}(y_c)\, K_c(x_i, y_c;\ \cdot)$
The dual parameters $\alpha_{c,i}(y_c)$ depend on the i-th instance and all of the clique label assignments $y_c$.

54 Challenges
- Fundamental limits for using unlabeled data: capacity and risk; consistency and error bounds
- Algorithms for graph construction: random forests; multiple similarity metrics
- Approximation methods and scaling to large data sets: propagation algorithms; unbiasedness vs. computation
