Structured Prediction
1 Structured Prediction. Ningshan Zhang. Advanced Machine Learning, Spring 2016.
2 Outline. Ensemble Methods for Structured Prediction [1]: on-line learning; boosting. A Generalized Kernel Approach to Structured Output Learning [2].
3 Structured Prediction Problem. Structured output: $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_l$. Loss function: $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, decomposable. Training data: a sample drawn i.i.d. from $\mathcal{X} \times \mathcal{Y}$ according to some distribution $\mathcal{D}$, $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$. Problem: find a hypothesis $h : \mathcal{X} \to \mathcal{Y}$ with small generalization error $\mathbb{E}_{(x,y) \sim \mathcal{D}}[L(h(x), y)]$.
4 Ensemble Methods: Further Assumptions. The number of substructures $l \ge 1$ is fixed. The loss function $L$ can be decomposed as a sum of loss functions $l_k : \mathcal{Y}_k \times \mathcal{Y}_k \to \mathbb{R}_+$, i.e. for all $y = (y^1, \ldots, y^l)$ and $y' = (y'^1, \ldots, y'^l)$, $L(y, y') = \sum_{k=1}^{l} l_k(y^k, y'^k)$. A set of structured prediction experts $h_1, \ldots, h_p$: $h_j(x) = (h_j^1(x), \ldots, h_j^l(x))$, $j = 1, \ldots, p$.
5 Path Experts. Intuition: one particular expert may be better at predicting the $k$-th substructure, but not elsewhere. Figure: a path expert (one expert chosen per substructure). Construct a larger set of experts, the path experts: $\mathcal{H} = \{(h_{j_1}^1, \ldots, h_{j_l}^l) : j_k \in \{1, \ldots, p\},\ k = 1, \ldots, l\}$, with $|\mathcal{H}| = p^l$. The best $h$ selected from $\mathcal{H}$ may not coincide with any of the original experts.
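As a quick illustration of the size of this set, here is a toy enumeration in Python (the values of $p$ and $l$ are arbitrary, made up for the example):

```python
from itertools import product

p, l = 2, 3  # toy values: p experts, l substructures

# A path expert picks one of the p experts for each of the l substructures,
# so the path-expert set has p**l elements.
paths = list(product(range(p), repeat=l))
assert len(paths) == p ** l  # 8 paths for p=2, l=3
print(paths[:3])             # (0, 0, 0), (0, 0, 1), (0, 1, 0)
```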
6 Review: Randomized Weighted Majority (WM). Task: find a distribution over experts. Initialize weights for all experts: $w_{1,j} = 1/p$, $j = 1, \ldots, p$. After receiving sample $(x_t, y_t)$, update the weight of each expert $j = 1, \ldots, p$: $w_{t+1,j} = \dfrac{w_{t,j}\, e^{-\eta L(\hat{y}_{t,j},\, y_t)}}{\sum_{j'=1}^{p} w_{t,j'}\, e^{-\eta L(\hat{y}_{t,j'},\, y_t)}}$, with $\eta > 0$ a constant. Return $w_{T+1} = (w_{T+1,1}, \ldots, w_{T+1,p})$.
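A minimal NumPy sketch of this update rule (the array name `expert_losses` and the value of `eta` are illustrative; per-round losses are assumed precomputed):

```python
import numpy as np

def randomized_weighted_majority(expert_losses, eta=0.5):
    """expert_losses: (T, p) array; entry (t, j) is L(yhat_{t,j}, y_t).
    Returns w_{T+1}, the final distribution over the p experts."""
    T, p = expert_losses.shape
    w = np.full(p, 1.0 / p)                   # w_{1,j} = 1/p
    for t in range(T):
        w *= np.exp(-eta * expert_losses[t])  # exponential downweighting
        w /= w.sum()                          # renormalize to a distribution
    return w
```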
7 Randomized Weighted Majority (WM). Apply the Weighted Majority algorithm to the path expert set $\mathcal{H}$: for all $h \in \mathcal{H}$, $w_{t+1}(h) = \dfrac{w_t(h)\, e^{-\eta L(h(x_t),\, y_t)}}{\sum_{h' \in \mathcal{H}} w_t(h')\, e^{-\eta L(h'(x_t),\, y_t)}}$. That is $p^l$ updates per round! Using the structure of $\mathcal{Y}$ and the additive loss assumption, we can reduce this to $pl$ updates per round.
8 Weighted Majority Weight Pushing (WMWP). For every $h = (h_{j_1}^1, \ldots, h_{j_l}^l)$, set $w(h) = \prod_{k=1}^{l} w_{j_k}^k$. Enforce that the weights at each substructure sum to one: $\sum_{j=1}^{p} w_j^k = 1$ for all $k \in \{1, \ldots, l\}$. Then the overall weights automatically sum to one: $\sum_{h \in \mathcal{H}} w(h) = 1$. Example: $p = 2$, $l = 2$: $w_1^1 w_1^2 + w_1^1 w_2^2 + w_2^1 w_1^2 + w_2^1 w_2^2 = (w_1^1 + w_2^1)(w_1^2 + w_2^2) = 1$.
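A short numeric check of this factorization (random weights, made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
p, l = 2, 2
W = rng.random((l, p))
W /= W.sum(axis=1, keepdims=True)  # each substructure's p weights sum to 1

# Total mass over all p**l path experts is the product of row sums, i.e. 1.
total = sum(W[0, j1] * W[1, j2] for j1 in range(p) for j2 in range(p))
assert np.isclose(total, 1.0)
```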
9 Weighted Majority Weight Pushing (WMWP). Initialize weights: $w_{1,j}^k = 1/p$ for all $(k, j) \in \{1, \ldots, l\} \times \{1, \ldots, p\}$. At round $t$, for all $(k, j) \in \{1, \ldots, l\} \times \{1, \ldots, p\}$, update $w_{t+1,j}^k = \dfrac{w_{t,j}^k\, e^{-\eta\, l_k(h_j^k(x_t),\, y_t^k)}}{\sum_{j'=1}^{p} w_{t,j'}^k\, e^{-\eta\, l_k(h_{j'}^k(x_t),\, y_t^k)}}$. Return $W_1, \ldots, W_{T+1}$, where $W_t = (w_{t,j}^k)$.
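A sketch of the full WMWP loop, assuming the per-substructure losses $l_k(h_j^k(x_t), y_t^k)$ are precomputed into a $(T, l, p)$ array (all names illustrative):

```python
import numpy as np

def wmwp(sub_losses, eta=0.5):
    """sub_losses: (T, l, p) array; entry (t, k, j) is l_k(h_j^k(x_t), y_t^k).
    Returns [W_1, ..., W_{T+1}], each an (l, p) matrix whose rows are
    per-substructure distributions over the p experts."""
    T, l, p = sub_losses.shape
    W = np.full((l, p), 1.0 / p)              # w_{1,j}^k = 1/p
    history = [W.copy()]
    for t in range(T):
        W = W * np.exp(-eta * sub_losses[t])  # p*l updates, not p^l
        W /= W.sum(axis=1, keepdims=True)     # each row sums to one
        history.append(W.copy())
    return history
```

The implicit weight of any path expert is the product of its row entries, exactly the factorization on the previous slide.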
10 On-line to Batch Conversion. How do we define a hypothesis based on $\{W_1, \ldots, W_{T+1}\}$? Two steps: (1) choose a good collection of distributions; $\mathcal{P} = \{W_1, \ldots, W_{T+1}\}$ is one choice, but not necessarily the best. (2) Use $\mathcal{P}$ to define hypotheses for prediction.
11 On-line to Batch Conversion. Step 1: choose a good collection of distributions. For any collection $\mathcal{P}$, define the score $\Gamma(\mathcal{P}) = \frac{1}{|\mathcal{P}|} \sum_{W_t \in \mathcal{P}} \sum_{h \in \mathcal{H}} W_t(h)\, L(h(x_t), y_t) + M \sqrt{\frac{\log \frac{1}{\delta}}{|\mathcal{P}|}}$. Choose $\mathcal{P}_\delta = \arg\min_{\mathcal{P}} \Gamma(\mathcal{P})$.
12 On-line to Batch Conversion. Step 2: define hypotheses. Randomized: randomly select a path expert $h \in \mathcal{H}$ according to $p(h) = \frac{1}{|\mathcal{P}_\delta|} \sum_{W_t \in \mathcal{P}_\delta} W_t(h)$; denote the set of generated hypotheses by $H_{\mathrm{Rand}}$. Deterministic: define a scoring function $\tilde{h}_{\mathrm{MVote}}(x, y) = \prod_{k=1}^{l} \left( \frac{1}{|\mathcal{P}_\delta|} \sum_{W_t \in \mathcal{P}_\delta} \sum_{j=1}^{p} w_{t,j}^k\, 1_{h_j^k(x) = y^k} \right)$ and predict $H_{\mathrm{MVote}}(x) = \arg\max_{y \in \mathcal{Y}} \tilde{h}_{\mathrm{MVote}}(x, y)$.
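Because the scoring function is a product of non-negative per-substructure factors, the argmax over the exponentially large $\mathcal{Y}$ decomposes into $l$ independent argmaxes. A sketch of the resulting prediction rule (all names illustrative):

```python
import numpy as np

def h_mvote(expert_preds, W_avg, label_sets):
    """expert_preds: length-l list; expert_preds[k][j] is h_j^k(x).
    W_avg: (l, p) array averaging the W_t in P_delta.
    label_sets: length-l list of candidate labels per substructure."""
    y_hat = []
    for k, labels in enumerate(label_sets):
        # weighted vote for each candidate label of substructure k
        scores = [sum(W_avg[k, j] for j, pred in enumerate(expert_preds[k])
                      if pred == y_k) for y_k in labels]
        y_hat.append(labels[int(np.argmax(scores))])
    return y_hat
```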
13 Learning Guarantees. Regret: $R_T = \sum_{t=1}^{T} \mathbb{E}_{h \sim W_t}[L(h(x_t), y_t)] - \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} L(h(x_t), y_t)$. For any $\delta > 0$, with probability at least $1 - \delta$ over a sample $(x_i, y_i)_{i=1}^{T}$ drawn i.i.d. from $\mathcal{D}$, the following hold: $\mathbb{E}[L(H_{\mathrm{Rand}}(x), y)] \le \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{h \sim W_t}[L(h(x_t), y_t)] + M \sqrt{\frac{\log \frac{T}{\delta}}{T}}$, and $\mathbb{E}[L(H_{\mathrm{Rand}}(x), y)] \le \inf_{h \in \mathcal{H}} \mathbb{E}[L(h(x), y)] + \frac{R_T}{T} + 2M \sqrt{\frac{\log \frac{2T}{\delta}}{T}}$.
14 Learning Guarantees. The following inequality always holds, with expectations taken over $(x, y) \sim \mathcal{D}$ and over $h \sim p$ for $H_{\mathrm{Rand}}$: $\mathbb{E}[L_{\mathrm{Ham}}(H_{\mathrm{MVote}}(x), y)] \le 2\, \mathbb{E}[L_{\mathrm{Ham}}(H_{\mathrm{Rand}}(x), y)]$. Proof: by definition of $H_{\mathrm{MVote}}$, the following always holds: $1_{H_{\mathrm{MVote}}^k(x) \neq y^k} \le \frac{2}{|\mathcal{P}_\delta|} \sum_{W_t \in \mathcal{P}_\delta} \sum_{j=1}^{p} w_{t,j}^k\, 1_{h_j^k(x) \neq y^k}$ (if the majority vote is wrong on substructure $k$, at least half of the total weight sits on experts that are wrong there). Summing over $k$ and taking expectations over $\mathcal{D}$ yields the desired result.
15 AdaBoost: Review. Hypothesis space: $\mathcal{H} = \left\{ \sum_{j=1}^{N} \alpha_j h_j : \alpha_j \ge 0 \right\}$, where the $h_j \in H_0$ are base classifiers, $h_j : \mathcal{X} \to \mathbb{R}$. Objective function: $F(\alpha) = \sum_{i=1}^{m} e^{-y_i \sum_{j=1}^{N} \alpha_j h_j(x_i)}$. Apply coordinate descent to $F(\alpha)$. Return the hypothesis $h(x) = \mathrm{sgn}\left( \sum_{j=1}^{N} \alpha_j h_j(x) \right)$.
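A compact sketch of this coordinate-descent view of AdaBoost, under the extra assumption that the base classifiers output $\pm 1$, in which case the line search along each coordinate has the familiar closed form (all names illustrative):

```python
import numpy as np

def adaboost(H, y, T=100):
    """Coordinate descent on F(alpha) = sum_i exp(-y_i sum_j alpha_j h_j(x_i)).
    H: (m, N) array with h_j(x_i) in {-1, +1}; y: (m,) labels in {-1, +1}."""
    m, N = H.shape
    alpha = np.zeros(N)
    D = np.full(m, 1.0 / m)                   # distribution over examples
    for _ in range(T):
        errs = np.array([D[H[:, j] != y].sum() for j in range(N)])
        j = int(np.argmin(errs))              # steepest coordinate
        eps = float(np.clip(errs[j], 1e-10, 1 - 1e-10))
        step = 0.5 * np.log((1 - eps) / eps)  # closed-form line search
        alpha[j] += step
        D *= np.exp(-step * y * H[:, j])      # reweight examples
        D /= D.sum()
    return alpha                              # predict with sign(H @ alpha)
```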
16 ESPBoost: Hypothesis Space. How do we form a convex combination of path experts? Prediction → score → convex combination of scores → combined prediction. For each path expert $h_t \in \mathcal{H}$, define the score $\tilde{h}_t : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$: for $(x, y) \in \mathcal{X} \times \mathcal{Y}$, $\tilde{h}_t(x, y) = \sum_{k=1}^{l} 1_{h_t^k(x) = y^k}$. Convex combinations of scores: $\left\{ \tilde{h} = \sum_{t=1}^{T} \alpha_t \tilde{h}_t : \tilde{h}_t \text{ derived from path experts},\ \alpha_t \ge 0 \right\}$. Combined prediction: $h(x) = \arg\max_{y \in \mathcal{Y}} \tilde{h}(x, y) = \arg\max_{y \in \mathcal{Y}} \sum_{t=1}^{T} \sum_{k=1}^{l} \alpha_t\, 1_{h_t^k(x) = y^k}$.
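Unlike the majority-vote score, this combined score is additive over substructures, but the consequence is the same: the argmax over $\mathcal{Y}$ reduces to $l$ coordinate-wise argmaxes. A sketch (all names illustrative):

```python
import numpy as np

def esp_predict(path_preds, alpha, label_sets):
    """path_preds: (T, l) array; entry (t, k) is h_t^k(x) for one input x.
    alpha: (T,) non-negative coefficients.
    label_sets: length-l list of candidate labels per substructure."""
    T, l = path_preds.shape
    y_hat = []
    for k, labels in enumerate(label_sets):
        # total coefficient mass voting for each candidate label
        scores = [alpha[path_preds[:, k] == y_k].sum() for y_k in labels]
        y_hat.append(labels[int(np.argmax(scores))])
    return y_hat
```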
17 Loss Function and Upper Bound. Margin: $\rho(\tilde{h}^k, x_i, y_i) = \tilde{h}^k(x_i, y_i^k) - \max_{y^k \neq y_i^k} \tilde{h}^k(x_i, y^k)$. Normalized Hamming loss and its exponential upper bound: $\frac{1}{m} \sum_{i=1}^{m} L_{\mathrm{Ham}}(H_{\mathrm{ESPBoost}}(x_i), y_i) = \frac{1}{ml} \sum_{i=1}^{m} \sum_{k=1}^{l} 1_{\rho(\tilde{h}^k, x_i, y_i) < 0} \le \frac{1}{ml} \sum_{i=1}^{m} \sum_{k=1}^{l} \exp\left( -\sum_{t=1}^{T} \alpha_t\, \rho(\tilde{h}_t^k, x_i, y_i) \right) =: F(\alpha)$.
18 ESPBoost. Hypothesis space: the set of combined predictions $\left\{ H_{\mathrm{ESPBoost}}(x) = \arg\max_{y \in \mathcal{Y}} \tilde{h}(x, y) \right\}$. Objective function $F : \mathbb{R}_+^T \to \mathbb{R}$: $F(\alpha) = \frac{1}{ml} \sum_{i=1}^{m} \sum_{k=1}^{l} \exp\left( -\sum_{t=1}^{T} \alpha_t\, \rho(\tilde{h}_t^k, x_i, y_i) \right)$. Apply coordinate descent.
19 ESPBoost: Algorithm. (The algorithm pseudocode appears as a figure in the original slides and was not transcribed; a sketch follows.)
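Since the pseudocode itself is missing here, below is a rough sketch of the coordinate-descent loop the surrounding slides describe, under the assumption (mirroring AdaBoost) that each round maintains a distribution over (example, substructure) pairs, picks the candidate path expert with the smallest weighted error, and takes the closed-form step; see [1] for the exact algorithm. All names are illustrative:

```python
import numpy as np

def espboost(path_preds, y, T=50):
    """Assumed ESPBoost-style loop (see caveat above).
    path_preds: (P, m, l) array; entry (t, i, k) is h_t^k(x_i).
    y: (m, l) array of true substructure labels."""
    P, m, l = path_preds.shape
    alpha = np.zeros(P)
    D = np.full((m, l), 1.0 / (m * l))              # distribution over (i, k)
    mistakes = path_preds != y[None, :, :]          # (P, m, l) boolean
    for _ in range(T):
        eps = np.array([(D * mistakes[t]).sum() for t in range(P)])
        t_best = int(np.argmin(eps))                # best candidate expert
        e = float(np.clip(eps[t_best], 1e-10, 1 - 1e-10))
        step = 0.5 * np.log((1 - e) / e)            # closed-form step size
        alpha[t_best] += step
        D *= np.exp(step * (2 * mistakes[t_best] - 1))  # upweight errors
        D /= D.sum()
    return alpha
```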
20 Generalized Kernel Approach. Problem setting: given $(x_i, y_i)_{i=1}^{n} \in \mathcal{X} \times \mathcal{Y}$, we want to learn a mapping $f$ from $\mathcal{X}$ to $\mathcal{Y}$. Let $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be a kernel function, $\mathcal{F}_{\mathcal{Y}}$ the RKHS associated with $l$, and $\Phi_l$ the mapping from $\mathcal{Y}$ to $\mathcal{F}_{\mathcal{Y}}$. Step 1: learn the mapping $g$ from $\mathcal{X}$ to $\mathcal{F}_{\mathcal{Y}}$. Step 2: find the pre-image of $g(x)$.
21 Generalized Kernel Approach. Step 1: learn the mapping $g : \mathcal{X} \to \mathcal{F}_{\mathcal{Y}}$. Preliminaries: operator-valued kernels; function-valued RKHS.
22 Operator-valued Kernels. Let $\mathcal{L}(\mathcal{F}_{\mathcal{Y}})$ be the set of bounded operators $T : \mathcal{F}_{\mathcal{Y}} \to \mathcal{F}_{\mathcal{Y}}$. A non-negative $\mathcal{L}(\mathcal{F}_{\mathcal{Y}})$-valued kernel is a map $K : \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{F}_{\mathcal{Y}})$ such that: (i) $\forall x_i, x_j \in \mathcal{X}$, $K(x_i, x_j) = K(x_j, x_i)^*$; (ii) for any $m > 0$ and $\{(x_i, \varphi_i)\}_{i=1,\ldots,m} \subset \mathcal{X} \times \mathcal{F}_{\mathcal{Y}}$, $\sum_{i,j=1}^{m} \langle K(x_i, x_j) \varphi_j, \varphi_i \rangle_{\mathcal{F}_{\mathcal{Y}}} \ge 0$. Note: $*$ denotes the adjoint operator, i.e. $\forall \varphi_1, \varphi_2 \in \mathcal{F}_{\mathcal{Y}}$, $\langle K \varphi_1, \varphi_2 \rangle_{\mathcal{F}_{\mathcal{Y}}} = \langle \varphi_1, K^* \varphi_2 \rangle_{\mathcal{F}_{\mathcal{Y}}}$.
23 Function-valued RKHS. A Hilbert space $\mathcal{F}_{\mathcal{XY}}$ of functions $g : \mathcal{X} \to \mathcal{F}_{\mathcal{Y}}$ is an $\mathcal{F}_{\mathcal{Y}}$-valued RKHS if there is a non-negative $\mathcal{L}(\mathcal{F}_{\mathcal{Y}})$-valued kernel $K$ with the following properties: (i) $\forall x \in \mathcal{X},\ \varphi \in \mathcal{F}_{\mathcal{Y}}$, the function $K(x, \cdot)\varphi \in \mathcal{F}_{\mathcal{XY}}$; (ii) $\forall g \in \mathcal{F}_{\mathcal{XY}},\ x \in \mathcal{X},\ \varphi \in \mathcal{F}_{\mathcal{Y}}$, $\langle g, K(x, \cdot)\varphi \rangle_{\mathcal{F}_{\mathcal{XY}}} = \langle g(x), \varphi \rangle_{\mathcal{F}_{\mathcal{Y}}}$. Theorem: there is a bijection between $\mathcal{F}_{\mathcal{XY}}$ and $K$, as long as $K$ is non-negative.
24 Kernel Ridge Regression. Step 1: learn $g : \mathcal{X} \to \mathcal{F}_{\mathcal{Y}}$ by kernel ridge regression, which has a closed-form solution: $\arg\min_{g \in \mathcal{F}_{\mathcal{XY}}} \sum_{i=1}^{n} \| g(x_i) - \Phi_l(y_i) \|_{\mathcal{F}_{\mathcal{Y}}}^2 + \lambda \| g \|_{\mathcal{F}_{\mathcal{XY}}}^2$, solved by $g(x) = K_x (\mathbf{K} + \lambda I)^{-1} \Phi_l$, where $K_x$ is a row vector of operators $[K(\cdot, x_i) \in \mathcal{L}(\mathcal{F}_{\mathcal{Y}})]_{i=1}^{n}$, $\mathbf{K}$ is a matrix of operators $[K(x_i, x_j) \in \mathcal{L}(\mathcal{F}_{\mathcal{Y}})]_{i,j=1}^{n}$, and $\Phi_l$ is a column vector of functions $[\Phi_l(y_i) \in \mathcal{F}_{\mathcal{Y}}]_{i=1}^{n}$.
25 Find Pre-image. Step 2: find the pre-image of $g(x)$: $f(x) = \arg\min_{y \in \mathcal{Y}} \| g(x) - \Phi_l(y) \|_{\mathcal{F}_{\mathcal{Y}}}^2 = \arg\min_{y \in \mathcal{Y}} \| K_x (\mathbf{K} + \lambda I)^{-1} \Phi_l - \Phi_l(y) \|_{\mathcal{F}_{\mathcal{Y}}}^2 = \arg\min_{y \in \mathcal{Y}} l(y, y) - 2 \langle K_x (\mathbf{K} + \lambda I)^{-1} \Phi_l, \Phi_l(y) \rangle_{\mathcal{F}_{\mathcal{Y}}}$. $\Phi_l(y)$ is unknown, so use a generalized kernel trick: $\langle T \Phi_l(y_i), \Phi_l(y) \rangle = [T\, l(y_i, \cdot)](y)$. This expresses $f(x)$ using only kernel functions: $f(x) = \arg\min_{y \in \mathcal{Y}} l(y, y) - 2 \left[ K_x (\mathbf{K} + \lambda I)^{-1} \mathbf{L} \right](y)$, where $\mathbf{L}$ is a column vector of $[l(y_i, \cdot)]_{i=1}^{n}$.
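A sketch of both steps in the simplest special case $K(x_i, x_j) = k(x_i, x_j) \cdot I$ (identity operator), where the operator algebra collapses to ordinary matrices and the pre-image is a search over a finite candidate set; with the covariance-based kernels of the next slide, the same formula picks up the output Gram matrix as well. All names are illustrative:

```python
import numpy as np

def predict_structured(K_train, k_test, L_cand, l_diag, lam=0.1):
    """K_train: (n, n) input Gram matrix k(x_i, x_j).
    k_test: (n,) vector k(x_i, x) for one test input x.
    L_cand: (n, c) matrix l(y_i, y) for c candidate outputs y.
    l_diag: (c,) vector of self-similarities l(y, y)."""
    n = K_train.shape[0]
    # Step 1: g(x) = sum_i beta_i Phi_l(y_i), with beta = (K + lam I)^{-1} k_x
    beta = np.linalg.solve(K_train + lam * np.eye(n), k_test)
    # Step 2: f(x) = argmin_y  l(y, y) - 2 sum_i beta_i l(y_i, y)
    scores = l_diag - 2.0 * (beta @ L_cand)
    return int(np.argmin(scores))  # index of the best candidate output
```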
26 Covariance-based Operator-valued Kernels. Covariance-based operator-valued kernels: $K(x_i, x_j) = k(x_i, x_j)\, C_{YY}$, where $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a scalar-valued kernel and $C_{YY} : \mathcal{F}_{\mathcal{Y}} \to \mathcal{F}_{\mathcal{Y}}$ is a covariance operator defined by a random variable $Y \in \mathcal{Y}$: $\langle \varphi_1, C_{YY} \varphi_2 \rangle_{\mathcal{F}_{\mathcal{Y}}} = \mathbb{E}[\varphi_1(Y)\, \varphi_2(Y)]$. Empirical covariance operator: $\hat{C}_{YY}(\varphi) = \frac{1}{n} \sum_{i=1}^{n} \varphi(y_i)\, l(\cdot, y_i)$. To account for the effects of the input, we can also use the conditional covariance operator $C_{YY|X} = C_{YY} - C_{YX} C_{XX}^{-1} C_{XY}$.
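One concrete consequence of the empirical definition: applied to a function $\varphi = \sum_j c_j\, l(\cdot, y_j)$ in the span of the training outputs, $\hat{C}_{YY}$ stays in that span, with new coefficient vector $\frac{1}{n} \mathbf{L} c$, where $\mathbf{L}$ is the output Gram matrix. A short sketch (illustrative names):

```python
import numpy as np

def apply_empirical_cov(L_gram, c):
    """Coefficients of C_hat_YY(phi) for phi = sum_j c_j l(., y_j):
    phi(y_i) = (L c)_i, so C_hat_YY(phi) = (1/n) sum_i (L c)_i l(., y_i)."""
    n = L_gram.shape[0]
    return (L_gram @ c) / n
```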
27 Conclusion. Ensemble methods: ensemble learning over an expanded set of path experts. On-line algorithm (WMWP): efficient learning and inference by exploiting the output structure. On-line-to-batch conversion: randomized and deterministic algorithms with learning guarantees. Boosting: efficient for structured outputs. Kernel method: uses a joint feature space; a covariance-based operator-valued kernel encodes interactions between outputs; the conditional covariance operator correlates the input with the interactions between outputs; the final hypothesis is expressed using only kernel functions.
28 References. [1] Corinna Cortes, Vitaly Kuznetsov, and Mehryar Mohri. Ensemble methods for structured prediction. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), 2014. [2] Hachem Kadri, Mohammad Ghavamzadeh, and Philippe Preux. A generalized kernel approach to structured output learning. In International Conference on Machine Learning (ICML), Atlanta, United States, June 2013.