Data and Structure in Machine Learning
1 Data and Structure in Machine Learning John Lafferty School of Computer Science Carnegie Mellon University based on joint work with Zoubin Ghahramani, Risi Kondor, Guy Lebanon, Jerry Zhu
2 The Data Century The coming century is surely the century of data. Our society is investing massively in the collection and processing of data of all kinds, on scales unimaginable until recently. David Donoho (August 2000) Satellite images, DNA sequences, Internet portals, sensor networks, text and hypertext, financial transactions, environment monitoring, astrostatistics...
3 Data Maven: IBM's John Cocke Language models from data; machine translation from data; optimizing compilers.
4 Scholarly Documents
5 Relating Phenotype and Genotype
6 NCAR Mass Storage System Central facility for storing data used and generated by climate models, field experiments, and other earth-science models. Current size (October 2003): 1.5 petabytes. Growth rate: 50 terabytes/month. Coupled to, but exceeds, Moore's-law growth for cycles.
7 Garbage Out: Municipal Waste [figure: U.S. population (millions) vs. municipal waste (millions of tons); table of top waste importers (Pennsylvania, Virginia, Michigan, Illinois, Indiana) and top waste exporters (New York, New Jersey, Missouri, Maryland, Massachusetts), in millions of tons]
8 Using Labeled and Unlabeled Data The standard paradigm in machine learning is supervised learning: examples given by teacher to student. It is well understood theoretically, with startling successes. But labeled data is limited, and labeling examples is often very expensive (example: protein shape classification via crystallography). How can unlabeled data be most effectively leveraged? Making theoretical and practical progress on the labeled/unlabeled data question is currently one of the most important and interesting problems in machine learning.
9 Labeled and Unlabeled Data as a Graph
10 Outline Laplacians and graph kernels: Laplacians on graphs; diffusion kernels on graphs; diffusion kernels on statistical manifolds. Semi-supervised learning: Gaussian fields and graph kernels; active learning; experiments with text and images; kernel conditional random fields. Challenges.
11 Kernels (Aizerman, Braverman & Rozonoér, 1964) (Boser, Guyon & Vapnik, 1992) A powerful and elegant tool: dramatic improvements in accuracy (and sometimes computational efficiency). Non-linear classifiers with an implicit representation in feature space. Popular indoor sport: "Kernel X," where X is PCA, ICA, clustering, .... Primarily a tool for Euclidean space.
12 Linear Classifiers vs. Kernel Classifiers
$$\hat{f}(x) = \sum_{i=1}^N \alpha_i y_i \langle x, x_i \rangle \qquad \text{vs.} \qquad \hat{f}(x) = \sum_{i=1}^N \alpha_i y_i K(x, x_i)$$
13 Popular Kernels
Gaussian: $K_\sigma(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$
Polynomial: $K_d(x, x') = (1 + \langle x, x' \rangle)^d$
Processing data into vectors in $\mathbb{R}^n$ is the dirty laundry of machine learning (Dietterich); a proper set of features is crucial. Often data is shoehorned into Euclidean space (e.g., the vector space representation for text processing). Are there methods that take advantage of the natural structure of the data and problem?
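A minimal Python sketch of these two kernels (the function names and test vectors are illustrative, not from the talk):

```python
import numpy as np

def gaussian_kernel(x, xp, sigma):
    # K_sigma(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(x, xp, d):
    # K_d(x, x') = (1 + <x, x'>)^d
    return (1.0 + x @ xp) ** d

x, xp = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(gaussian_kernel(x, xp, sigma=1.0), polynomial_kernel(x, xp, d=3))
```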
14 Structured Data What if data lies on a graph or other discrete structure? [figures: a parse tree (S, NP, VP, ...) and a hyperlink graph over nodes such as Google, Cornell, CMU, foobar.com, NSF]
15 An Elliptical Sun The Laplace operator in its various manifestations is the most beautiful and central object in all of mathematics. Probability theory, mathematical physics, Fourier analysis, partial differential equations, the theory of Lie groups, and differential geometry all revolve around this sun, and its light even penetrates such obscure regions as number theory and algebraic geometry. Edward Nelson, Tensor Analysis There is nothing new under the sun. Ecclesiastes, 1:9
16 Laplacian on a Riemannian Manifold On a Riemannian manifold with metric $g$, the geometric Laplacian is $\Delta = \operatorname{div} \nabla$, or
$$\Delta f = \frac{1}{\sqrt{\det g}} \sum_{i,j} \partial_i \left( g^{ij} \sqrt{\det g}\; \partial_j f \right) = \sum_{i,j} g^{ij} \partial_j \partial_i f + \text{(lower order terms)}$$
More generally, $\Delta = d d^* + d^* d$.
17 Laplacian: Invariant Definition
$$\Delta f(p) = \lim_{r \to 0} \frac{2n}{r^2} \left( f(p) - \frac{1}{\operatorname{vol}(S_r)} \int_{S_r(p)} f \right)$$
$f$ is harmonic if $\Delta f = 0$.
18 Combinatorial Laplacian Think of edge $e$ as a tangent vector at $e^-$. For $f : V \to \mathbb{R}$, the 1-form $df : E \to \mathbb{R}$ is
$$df(e) = f(e^+) - f(e^-)$$
Then $\Delta = d^* d$ is the discrete analogue of $\operatorname{div} \nabla$.
19 Combinatorial Laplacian Since $\langle \Delta f, g \rangle = \langle df, dg \rangle$, $\Delta$ is self-adjoint and positive. $\Delta = D - W$ is an averaging operator:
$$\Delta f(x) = \sum_{y \sim x} w_{xy} \left( f(x) - f(y) \right) = d(x)\, f(x) - \sum_{y \sim x} w_{xy}\, f(y)$$
We say $f$ is harmonic if $\Delta f = 0$.
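To make the discrete Laplacian concrete, here is a minimal NumPy sketch (mine, not the talk's; the weight matrix is a made-up 4-node graph) that builds $\Delta = D - W$ and checks the averaging identity and positive semidefiniteness:

```python
import numpy as np

# Build Delta = D - W from a symmetric weight matrix W on a toy graph.
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

D = np.diag(W.sum(axis=1))   # degree matrix, d(x) = sum_y w_xy
L = D - W                    # combinatorial Laplacian

# Averaging identity: (Delta f)(x) = d(x) f(x) - sum_y w_xy f(y)
f = np.array([1., 2., 0., -1.])
assert np.allclose(L @ f, W.sum(axis=1) * f - W @ f)

# Positive semidefinite: <f, Delta f> = sum over edges of w_xy (f(x)-f(y))^2 >= 0
assert np.all(np.linalg.eigvalsh(L) >= -1e-10)
```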
20 Diffusion Kernels on Graphs (Kondor and L., ICML 2002) Kernels are generalized covariances. The crucial positive semidefinite property required by RKHS theory is in general difficult to obtain. Let $\Delta$ be the graph Laplacian. In analogy with the continuous setting,
$$\frac{d}{dt} K_t = -\Delta K_t$$
is the heat equation on a graph. The solution $K_t = e^{-t\Delta}$ is called the heat kernel or diffusion kernel.
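A hedged sketch of computing $K_t = e^{-t\Delta}$ numerically via the eigendecomposition of the Laplacian (the helper name and the example graph are illustrative):

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(L, t):
    # Delta = U diag(mu) U^T with mu >= 0, so K_t = U diag(exp(-t mu)) U^T
    mu, U = np.linalg.eigh(L)
    return (U * np.exp(-t * mu)) @ U.T

W = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # path graph on 3 nodes
L = np.diag(W.sum(1)) - W
K = diffusion_kernel(L, t=0.7)
assert np.allclose(K, expm(-0.7 * L))      # agrees with the matrix exponential
assert np.all(np.linalg.eigvalsh(K) > 0)   # positive definite for every t > 0
```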
21 Simple Example: $K_n$ For the complete graph $K_n$ on $n$ vertices, $\Delta_{ij} = n\,\delta(i,j) - 1$ and
$$K_t(i, j) = \begin{cases} \dfrac{1 + (n-1)\, e^{-nt}}{n} & i = j \\[1.5ex] \dfrac{1 - e^{-nt}}{n} & i \neq j \end{cases}$$
22 Building Up Kernels If $K_t^{(i)}$ are kernels on $X_i$ then $K_t = \bigotimes_{i=1}^n K_t^{(i)}$ is a kernel on $X_1 \times \cdots \times X_n$. For the hypercube, we get
$$K_t(x, x') \propto (\tanh t)^{d(x,\, x')}$$
where $d(x, x')$ is the Hamming distance. Other graphs with explicit diffusion kernels: infinite trees (Chung & Yau, 1999), rooted trees, cycles, strings with wildcards.
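A minimal sketch of the hypercube case, assuming the unnormalized form above (the helper name is mine):

```python
import numpy as np

def hypercube_diffusion_kernel(x, xp, t):
    # Product construction over coordinates gives tanh(t)^Hamming(x, x')
    d = np.sum(x != xp)        # Hamming distance
    return np.tanh(t) ** d     # unnormalized kernel value

x  = np.array([0, 1, 1, 0, 1])
xp = np.array([0, 0, 1, 1, 1])
print(hypercube_diffusion_kernel(x, xp, t=1.0))  # tanh(1)^2, about 0.58
```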
23 RKHS Representation The general spectral representation of a kernel,
$$K(x, y) = \sum_{i=1}^n \lambda_i\, \phi_i(x)\, \phi_i(y),$$
leads to the reproducing kernel Hilbert space with
$$\left\langle \sum_i a_i \phi_i,\; \sum_i b_i \phi_i \right\rangle_{H_K} = \sum_i \frac{a_i b_i}{\lambda_i}$$
For the diffusion kernel, the RKHS inner product is
$$\langle f, g \rangle_{H_K} = \sum_i e^{t \mu_i}\, \hat{f}_i\, \hat{g}_i$$
Regularization interpretation: functions with small norm don't oscillate rapidly on the graph.
24 Results on UCI Datasets Comparison of the Hamming kernel and the diffusion kernel using the voted perceptron (Freund & Schapire, 1999); only these columns are recoverable:

Data Set        Hamming kernel error   SV improvement
Breast Cancer   7.64%                  83%
Hepatitis       17.98%                 58%
Income          19.19%                 8%
Mushroom        3.36%                  70%
Votes           4.69%                  12%

$$K(x, x') \propto \prod_i \left( \frac{1 - e^{-|A_i|\, t}}{1 + (|A_i| - 1)\, e^{-|A_i|\, t}} \right)^{d(x_i,\, x'_i)}$$
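A hedged sketch of this categorical-feature kernel, where feature $i$ takes values in an alphabet of size $|A_i|$ (per-coordinate complete-graph diffusion, multiplied across coordinates; names and test values are mine):

```python
import numpy as np

def categorical_diffusion_kernel(x, xp, alphabet_sizes, t):
    # Each matching coordinate contributes 1; each mismatch contributes the
    # off-diagonal/diagonal ratio of the K_n heat kernel with n = |A_i|.
    k = 1.0
    for xi, xpi, A in zip(x, xp, alphabet_sizes):
        if xi != xpi:
            k *= (1 - np.exp(-A * t)) / (1 + (A - 1) * np.exp(-A * t))
    return k

print(categorical_diffusion_kernel([0, 1, 2], [0, 2, 2], [2, 3, 4], t=0.5))
```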
25 Recent Application of Diffusion Kernels (Vert and Kanehisa, NIPS 02) Task: gene function prediction from microarray data. Question: can knowledge of metabolic pathways and catalyzing enzymes improve the performance? Idea: the documented network of biochemical reactions forms a graph where genes are vertices, and relevant features are likely to exhibit correlation with respect to the topology of the graph. Combine the diffusion kernel with expression profiles using kernel CCA (Bach and Jordan, 2002). Experiments on genes of the yeast S. cerevisiae yield consistent improvement in predicting genes' functional class.
26 Diffusion on Statistical Manifolds (L. and Lebanon, NIPS 02) Similarity between $x$ and $x'$ may be difficult to quantify even for domain experts. Generative models are fairly easy to derive, and a good generative model will use domain knowledge. Fisher information on the family codifies this into a distance, giving the statistical family the structure of a Riemannian manifold. Associate each data point with a model, $x \mapsto \theta(x)$, and consider the diffusion kernel on the information manifold.
27 Information Geometry $\mathcal{F} = \{ p(\cdot \mid \theta),\; \theta \in \Theta \subset \mathbb{R}^d \}$ is a $d$-dimensional statistical family. The Fisher information matrix $[g_{ij}(\theta)]$,
$$g_{ij}(\theta) = \int_X \partial_i \log p(x \mid \theta)\; \partial_j \log p(x \mid \theta)\; p(x \mid \theta)\, dx,$$
defines a Riemannian metric on $\Theta$, which is invariant under reparameterization. See (Amari, 00; Kass & Vos, 97).
28 Special Case: Multinomial The map $\theta \mapsto \sqrt{\theta}$ maps the simplex to the sphere, with geodesic distance
$$d(\theta, \theta') = 2 \arccos \left( \sum_{i=1}^{d+1} \sqrt{\theta_i\, \theta'_i} \right)$$
29 Parametrices for the Heat Kernel We have an asymptotic expansion for the heat kernel,
$$K_t(x, y) = (4\pi t)^{-n/2} \exp\left( -\frac{d^2(x, y)}{4t} \right) \left( \sum_{i=0}^N \psi_i(x, y)\, t^i + O(t^N) \right),$$
in terms of the parametrices $\psi_i$. Much of modern differential geometry is based on Laplacians, heat kernels, and related constructions.
30 Special Case: Multinomial The parametrix expansion leads to the approximation
$$K_t(x, x') \approx \frac{1}{(4\pi t)^{d/2}} \exp\left( -\frac{1}{t} \arccos^2 \left( \sum_{i=1}^{d+1} \sqrt{\theta_i(x)\, \theta_i(x')} \right) \right)$$
Positive definite for small $t$.
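A hedged Python sketch of this approximation, assuming $\theta(x)$ is, say, a normalized term-frequency vector for document $x$ (the function name and example vectors are illustrative):

```python
import numpy as np

def multinomial_diffusion_kernel(theta, theta_p, t):
    # K_t(x,x') ~ (4 pi t)^(-d/2) exp(-arccos^2(sum_i sqrt(theta_i theta'_i)) / t)
    d = len(theta) - 1                        # dimension of the simplex
    cos = np.clip(np.sqrt(theta * theta_p).sum(), -1.0, 1.0)
    return (4 * np.pi * t) ** (-d / 2) * np.exp(-np.arccos(cos) ** 2 / t)

theta   = np.array([0.5, 0.3, 0.2])           # hypothetical tf-normalized docs
theta_p = np.array([0.4, 0.4, 0.2])
print(multinomial_diffusion_kernel(theta, theta_p, t=0.1))
```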
31 SVM Decision Boundaries: Trinomial [figure: decision boundaries with the Gaussian kernel vs. the information diffusion kernel]
32 WebKB Text Classification Experiments [plots: test set error rate vs. number of training examples for linear, RBF, and diffusion kernels] Intuition: the improvement comes from the metric, which places more emphasis on values close to the boundary of the simplex. Related work: the Fisher kernel of Haussler and Jaakkola.
33 Bounds on Covering Numbers and Generalization Error Theorem. Suppose the Fisher information manifold $M$ is compact, with positive curvature, dimension $d$ and volume $V$. Then the SVM hypothesis class covering numbers $\mathcal{N}(\epsilon, \mathcal{F}_R(x))$ for the diffusion kernel $K_t$ on $M$ satisfy
$$\log \mathcal{N}(\epsilon, \mathcal{F}_R(x)) = O\left( \frac{V}{t^{d/2}}\, \log^{\frac{d+2}{2}} \left( \frac{1}{\epsilon} \right) \right)$$
Based on eigenvalue bounds in differential geometry:
$$c_1(d) \left( \frac{j}{V} \right)^{2/d} \leq \mu_j \leq c_2(d) \left( \frac{j+1}{V} \right)^{2/d}$$
Better error bounds are now available based on Rademacher averages.
34 Semi-Supervised Learning: Initial Explorations Two main frameworks: generative and discriminative. According to standard statistical theory, only generative models can benefit from unlabeled data. Castelli and Cover (Trans. IT, 1996): assumes a simple, identifiable generative mixture model; quantifies the relative value of labeled/unlabeled data in terms of Fisher information; asymptotic analysis. Blum and Mitchell (COLT, 1998): Co-Training; two discriminative classifiers, sufficiently redundant; can PAC-learn from unlabeled data; unrealistic independence assumptions.
35 Labeled and Unlabeled Data as a Graph
36 Labeled and Unlabeled Data as a Graph Idea: construct a random field on the graph. Intuition: similar examples have similar labels, and information propagates from labeled examples. The graph encodes prior information, which may be inaccurate or misleading!
37 Random Fields Energy function
$$E(y) = \sum_{ij} w_{ij}\, (y_i - y_j)^2$$
with (conditional) random field $p(y \mid x) \propto \exp(-\beta E(y))$. Discrete case, $y_i \in \{+1, -1\}$: the most probable configuration = graph mincuts (Blum and Chawla, 2001); not unique; intractable to compute marginals. Our framework: relaxation to the continuous case, $y_i \in \mathbb{R}$: convex, unique mode; Gaussian field, efficient algorithms.
38 Gaussian Fields and Harmonic Functions Combinatorial Laplacian
$$\Delta = \begin{bmatrix} \Delta_{LL} & \Delta_{LU} \\ \Delta_{UL} & \Delta_{UU} \end{bmatrix} = D - W, \quad \text{where } W = \begin{bmatrix} w_{11} & \cdots & w_{1n} \\ \vdots & & \vdots \\ w_{n1} & \cdots & w_{nn} \end{bmatrix}, \quad D = \operatorname{diag}(w_1, \ldots, w_n),\ w_i = \sum_j w_{ij}$$
The field is clamped on the labeled data. On the unlabeled data, it is Gaussian:
$$y_U \sim N\left( f_U,\; \Delta_{UU}^{-1} \right), \qquad f_U = -\Delta_{UU}^{-1} \Delta_{UL}\, f_L \quad \text{(harmonic)}$$
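A minimal NumPy sketch of this computation (the helper name is mine; the 4-node graph is a made-up example), solving the linear system rather than inverting $\Delta_{UU}$:

```python
import numpy as np

def harmonic_solution(W, labeled_idx, f_L):
    # Partition Delta = D - W into labeled/unlabeled blocks, clamp f on the
    # labeled vertices, and solve f_U = -Delta_UU^{-1} Delta_UL f_L.
    n = W.shape[0]
    L = np.diag(W.sum(1)) - W
    u = np.setdiff1d(np.arange(n), labeled_idx)      # unlabeled indices
    f = np.empty(n)
    f[labeled_idx] = f_L
    f[u] = -np.linalg.solve(L[np.ix_(u, u)],
                            L[np.ix_(u, labeled_idx)] @ f_L)
    return f                                         # Delta f = 0 on U

W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
print(harmonic_solution(W, labeled_idx=np.array([0, 3]), f_L=np.array([1., 0.])))
# -> [1.0, 0.8, 0.6, 0.0]: values interpolate smoothly between the labels
```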
39 Random Walks and Electric Networks Edge resistance $R_{ij} = 1/w_{ij}$. [figure: electric network with labeled nodes clamped to +1 volt and 0] Related recent work: Szummer and Jaakkola 2002, Belkin and Niyogi 2002, Chapelle et al. 2003, Deng et al.
40 Graph Kernels Integrate the diffusion kernel over time:
$$\Delta_{UU}^{-1} = \int_0^\infty e^{-t \Delta_{UU}}\, dt$$
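Why the identity holds (a one-line spectral argument, assuming $\Delta_{UU}$ is positive definite, which is the case whenever every connected component of the graph contains a labeled vertex):

```latex
\Delta_{UU} = \sum_i \mu_i\, \phi_i \phi_i^{\top},\ \mu_i > 0
\quad\Longrightarrow\quad
\int_0^\infty e^{-t \Delta_{UU}}\, dt
  = \sum_i \left( \int_0^\infty e^{-t \mu_i}\, dt \right) \phi_i \phi_i^{\top}
  = \sum_i \mu_i^{-1}\, \phi_i \phi_i^{\top}
  = \Delta_{UU}^{-1}
```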
41 Gaussian Fields vs. Gaussian Processes
$$p(y) \propto \exp\left( -y^\top \Delta\, y \right) \quad \text{uses unlabeled data} \qquad\qquad p(y) \propto \exp\left( -y^\top \Sigma^{-1} y \right) \quad \text{ignores unlabeled data}$$
42 Text Classification: PC vs. MAC (Zhu, Ghahramani, and L., ICML 2003) [plot: accuracy vs. labeled set size for the Gaussian field (GF), its thresholded version, and an SVM]
43 Digit Recognition (all 10) [plot: accuracy vs. labeled set size for the Gaussian field (GF), its thresholded version, and an SVM]
44 Active Learning (Experimental Design) Idea: choose the query point to be labeled. A common heuristic: the most uncertain example. Our approach: minimize the (estimated) expected risk, combining semi-supervised and active learning. Can be implemented efficiently in the Gaussian field/process view.
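A hedged sketch of the expected-risk criterion, reusing the hypothetical harmonic_solution helper from the sketch after slide 38; the naive retraining below ignores the rank-one updates that make the method efficient in practice:

```python
import numpy as np

def estimated_risk(f_U):
    # Estimated risk of the harmonic solution f (entries in [0, 1]):
    # R(f) = sum_u min(f_u, 1 - f_u)
    return np.minimum(f_U, 1.0 - f_U).sum()

def select_query(W, labeled_idx, f_L):
    # For each candidate k, hypothesize each label y, re-solve the field,
    # and weight the resulting risks by the current prediction f[k].
    n = W.shape[0]
    unlabeled = np.setdiff1d(np.arange(n), labeled_idx)
    f = harmonic_solution(W, labeled_idx, f_L)
    best_k, best_risk = None, np.inf
    for k in unlabeled:
        risk_y = []
        for y in (0.0, 1.0):
            f_plus = harmonic_solution(W, np.append(labeled_idx, k),
                                       np.append(f_L, y))
            rest = np.setdiff1d(unlabeled, [k])
            risk_y.append(estimated_risk(f_plus[rest]))
        expected = (1.0 - f[k]) * risk_y[0] + f[k] * risk_y[1]
        if expected < best_risk:
            best_k, best_risk = k, expected
    return best_k
```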
45 Active Learning: Text Classification [plot: accuracy vs. labeled set size for active learning, random query, and SVM]
46 Active Learning: Digit Classification [plot: accuracy vs. labeled set size for active learning, random query, and SVM]
47 Why Does This Work So Well? The data matches our assumptions: clusters in the graph have like labels. In newsgroups data, like email, one posting will quote part of a previous posting, creating threads of discussion that are well captured by the graph. In active learning, a single label suffices to disambiguate the entire thread. If your computer could ask you just one question, should it ask the one it's most unsure about, or the one from which it can learn the most? The graph structure gives the program a way to estimate the impact of each such question.
48 Generalization Error Bounds Gaussian prior over $y$; the covariance kernel uses unlabeled data. What if the prior assumptions are wrong? Information-theoretic, PAC bounds (McAllester, 2000; Seeger, 2002; Derbeko et al., 2003):
$$\varepsilon_{\text{gen}} \leq \varepsilon_{\text{emp}} + O\left( \sqrt{\frac{D(Q \,\|\, P) + 1 + \log(1/\delta)}{n}} \right)$$
$Q$ is the approximate posterior (e.g., Laplace, Expectation Propagation), $P$ is the prior. If the classification task doesn't match the prior, the bound blows up.
49 Inference for Kernel (Prior) The generalization bound suggests minimizing $D(Q \,\|\, P)$. Gaussian approximation:
$$D(Q \,\|\, P) = \log \det K + \operatorname{trace}\left( f^\top K^{-1} f \right) + \text{constant}$$
A convex optimization problem for graph selection. Closely related to computing channel capacity for Gaussian channels.
50 Recent Work: Free Food Cam Data Our goal is not to solve the object/person recognition problem, but to develop machine learning methods that will complement image processing techniques, improving the system by leveraging different sources of information.
51 Extension: Conditional Random Fields (L., McCallum & Pereira, 2001)
$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_s \lambda_s\, \phi_s(x, y) \right)$$
Global normalization. Undirected graphical models; a general and expressive modeling technique. Using features, incorporate domain knowledge without increasing the state space. Do not make strong independence assumptions. Comparable to HMMs in computational efficiency.
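As an illustration of the global normalization, here is a hedged sketch of computing $\log Z(x)$ for a linear-chain CRF by the standard forward recursion (the potential arrays are hypothetical placeholders for $\sum_s \lambda_s \phi_s$):

```python
import numpy as np
from scipy.special import logsumexp

# log_psi0[j]     = sum_s lambda_s phi_s(x, y_1 = j)
# log_psi[t,i,j]  = sum_s lambda_s phi_s(x, y_t = i, y_{t+1} = j)
def log_partition(log_psi0, log_psi):
    alpha = log_psi0                                  # forward scores for y_1
    for lp in log_psi:                                # recurse over positions
        alpha = logsumexp(alpha[:, None] + lp, axis=0)
    return logsumexp(alpha)                           # sum over the last label

K, T = 3, 5
rng = np.random.default_rng(0)
print(log_partition(rng.normal(size=K), rng.normal(size=(T - 1, K, K))))
```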
52 Protein Secondary Structure Prediction Predict protein secondary structure: coil, α-helix, or β-strand. Current state of the art: SVM window-based classifiers using a polynomial kernel derived from PSI-BLAST queries. CRFs give improvements by modeling sequential structure. Our research question: can unlabeled protein data be effectively incorporated using a graph kernel? Current work with Yan Liu and Jerry Zhu, part of a joint UPenn/CMU NSF ITR grant.
53 Kernel Conditional Random Fields Proposition. For each clique $c$ let $K_c$ be a Mercer kernel with associated RKHS norm $\|\cdot\|_c$, and let $\Omega_c : \mathbb{R} \to \mathbb{R}$ be a strictly monotonic function. Then the minimizer
$$\{f_c^\star\} = \operatorname*{argmin}_{f_c \in \mathcal{H}_c} \sum_{i=1}^m \ell\left( x_i, y_i, \{ f_c(x_i, y_c) \}_{c,\, y_c} \right) + \sum_{c=1}^C \Omega_c\left( \|f_c\|_c \right),$$
if it exists, has the form
$$f_c^\star(\cdot) = \sum_{i,\, y_c} \alpha_{c,i}(y_c)\, K_c(x_i, y_c;\, \cdot)$$
The dual parameters $\alpha_{c,i}(y_c)$ depend on the $i$-th instance and all of the clique label assignments $y_c$.
54 Challenges Fundamental limits for using unlabeled data: capacity and risk; consistency and error bounds. Algorithms for graph construction: random forests; multiple similarity metrics. Approximation methods and scaling to large data sets: propagation algorithms; unbiasedness vs. computation.
More information