Empirical Risk Minimization
|
|
- Dorcas Lang
- 5 years ago
- Views:
Transcription
1 Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018
2 Outline Introduction PAC learning ERM in practice 2
3 General setting Data X the input space and Y the output space D a fixed and unknown distribution on X Y Loss function A loss function l is a function from Y Y to R + such that Y Y, l(y, Y) = 0 Model, loss and risk a model g is a function from X to Y given a loss function l the risk of g is R l (g) = E (X,Y) D (l(g(x), Y)) optimal risk R l = inf g R l (g) 3
4 Supervised learning Data set D = ((X i, Y i )) 1 i N (X i, Y i ) D (i.i.d.) D D N (product distribution) General problem a learning algorithm creates from D a model g D does R l (g D ) reaches Rl when D goes to infinity? if so, how quickly? 4
5 Empirical risk minimization Empirical risk R l (g, D) = 1 N N i=1 l(g(x i ), Y i ) = 1 D (x,y) D l(g(x), y) ERM algorithm choose a class of functions G from X to Y define g ERM,l,G,D = arg min R l (g, D) g G is ERM a good machine learning algorithm? 5
6 Three distinct problems 1. an optimization problem given l and G how difficult is finding arg min g G Rl (g, D)? given limited computational resources, how close can we get to arg min g G Rl (g, D)? 2. an estimation problem given G a class of function, define R l,g = inf g G R l (g) can we bound R l (g D) R l,g? 3. an approximation problem can be bound R l,g R l? in a way that is compatible with estimation? 6
7 Three distinct problems 1. an optimization problem given l and G how difficult is finding arg min g G Rl (g, D)? given limited computational resources, how close can we get to arg min g G Rl (g, D)? 2. an estimation problem given G a class of function, define R l,g = inf g G R l (g) can we bound R l (g D) R l,g? 3. an approximation problem can be bound R l,g R l? in a way that is compatible with estimation? Focus of this course the estimation problem and then the approximation problem with a few words about the optimization problem 6
8 Outline Introduction PAC learning ERM in practice 7
9 A simplified case Learning concepts a concept c is a mapping from X to Y = {0, 1} in concept learning, the loss function l b with l b (p, t) = 1 p t we consider only a distribution D X over X risk and empirical risk definitions are adapted to this setting: risk: R(g) = E X DX (1 g(x) c(x) ) empirical risk: R(g, D) = 1 N N i=1 1 g(x i ) c(x i ) in essence the pair (D X, c) replaces D: this corresponds to a noise free situation as a consequence a data set is D = {X 1,..., X N } and has to complemented by a concept to learn 8
10 PAC learning Notations If A is a learning algorithm, then A(D) is the model produced by running A on the data set D Definition A concept class C (i.e. a set of concepts) is PAC-learnable if there is an algorithm A and a function N C from [0, 1] 2 to N such that: for any 1 > ɛ > 0 and any 1 > δ > 0, for any distribution D X and any concept c C, if N N C (ɛ, δ) then P D D N X {R (A(D)) ɛ} 1 δ probably 1 δ approximately correct ɛ 9
11 Concept learning and ERM Remark the concept to learn c is in C thus R G = 0 in addition, for any D, R(g ERM,G,D, D) = 0 then ERM provides PAC-learnability if for any g C such that R(g, D) = 0, P D D N X {R (g) ɛ} 1 δ Theorem Let C be a finite concept class and let A be an algorithm that outputs A(D) such that R(A(D), D) = 0. Then when N, P D D N X {R (A(D)) ɛ} 1 δ 1 ɛ log C δ 10
12 Proof 1. we consider ways to break the AC part, i.e. having both R(g, D) = 0 and R(g) > ɛ. We have Q = P( g C, R(g, D) = 0 and R(g) > ɛ) = P ) ( R(g, D) = 0 and R(g) > ɛ g C 2. union bound Q g C P( R(g, D) = 0 and R(g) > ɛ) 3. then we have P( R(g, D) = 0 and R(g) > ɛ) = P( R(g, D) = 0 R(g) > ɛ)p(r(g) > ɛ) P( R(g, D) = 0 R(g) > ɛ) 11
13 Proof cont. notice that R(g) = P X DX (g(x) c(x)) thus P X DX (g(x) = c(x) R(g) > ɛ) 1 ɛ as the observations are i.i.d, P( R(g, D) = 0 R(g) > ɛ) (1 ɛ) N e Nɛ finally P( g C, R(g, D) = 0 and R(g) > ɛ) C e Nɛ then if R(A(D), D) = 0, P D D N X {R (A(D)) ɛ} 1 C e Nɛ we want C e Nɛ δ, which happens when N 1 ɛ log C δ 12
14 PAC concept learning ERM ERM provides PAC-learnability for finite concept classes optimization computational cost in Θ (N C ) Data consumption the data needed to reach some PAC level grows with the logarithm of the concept class a finite set C can be encoded with log 2 C bits (by numbering the elements) each observation X fixes one bit of the solution 13
15 Generalization Concept learning is too limited no noise fixed loss function Agnostic PAC learnability A class of models G (functions from X to Y) is PAC-learnable with respect to a loss function l if there is an algorithm A and a function N G from [0, 1] 2 to N such that: for any 1 > ɛ > 0 and any 1 > δ > 0, for any distribution D on X Y if N N G (ɛ, δ) then { P D D N Rl (A(D)) Rl,G + ɛ } 1 δ Main questions does ERM provide agnostic PAC learnability? does that apply to infinite classes of models? 14
16 Uniform approximation Lemma Controlling the ERM can be done by ensuring the empirical risk is uniformly a good approximation of the true risk: R l (g ERM,l,G,D ) Rl,G 2 sup R l (g) R l (g, D) g G 15
17 Uniform approximation Lemma Controlling the ERM can be done by ensuring the empirical risk is uniformly a good approximation of the true risk: R l (g ERM,l,G,D ) Rl,G 2 sup R l (g) R l (g, D) g G Proof. for any g G, we have R l (g ERM,l,G,D ) R l (g) = R l (g ERM,l,G,D ) R l (g ERM,l,G,D, D) + R l (g ERM,l,G,D, D) R l (g), R l (g ERM,l,G,D ) R l (g ERM,l,G,D, D) + R l (g, D) R l (g), R l (g ERM,l,G,D ) R l (g ERM,l,G,D, D) + R l (g, D) R l (g), 2 sup R l (g) R l (g, D), g G which leads to the conclusion. 15
18 Finite classes Theorem If G < and if l [a, b] them when N log( 2 G δ )(b a) 2 } P D D N X {sup R l (g) R l (g, D) ɛ 1 δ g G 2ɛ 2 Proof. very rough sketch 1. we use the union technique to focus on a single model g 2. then we use the Hoeffding inequality to bound the difference between an empirical average and an expectation. In our context it says ( ) { Rl P D D N (g) R } ɛ 2 l (g, D) ɛ 2 exp 2N X (b a) 2 3. the conclusion is obtained as in the simple case of concept learning 16
19 Finite classes Theorem If G < and if l [0, 1] the ERM provides agnostic PAC-learnability 2 log( with N G (ɛ, δ) = 2 G δ )(b a) 2 Discussion ɛ 2 obvious consequence of the uniform approximation result the limitation l [a, b] can be lifted but only asymptotically the dependency of the data size to the quality (i.e. to ɛ) is far less satisfactory than in the simple case: this is a consequence of allowing noise 17
20 Infinite classes Restriction we keep the noise but move back to a simple case Y = {0, 1} and l = l b Growth function if {v 1,..., v m } is a finite subset of X G {v1,...,v m} = {(g(v 1 ),..., g(v m )) g G} {0, 1} m the growth function of G is S G (m) = sup {v 1,...,v m} X G{v1,...,v m} 18
21 Interpretation Going back to finite things G{v1,...,v m} gives the number of models as seen by the inputs {v 1,..., v m } it corresponds to the number of possible classification decisions (a.k.a. binary labelling) of those inputs the growth function corresponds to the worst case analysis: the set of inputs that can be labelled in the largest number of different ways Vocabulary if G{v1,...,v m} = 2 m then {v 1,..., v m } is said to be shattered by G S G (m) is the m-th shatter coefficient of G 19
22 Uniform approximation Theorem For any 1 > ɛ > 0 and any 1 > δ > 0 and for any distribution D P D D N X {sup R lb (g) R lb (g, D) 4 + } log(s G (2N)) g G δ 1 δ 2N Consequences strong link between the growth function and uniform approximation useful only if log(s G(2m)) m goes to zero when m if G shatters sets of arbitrary sizes log(s G (2m)) = 2m log 2 20
23 Vapnik Chervonenkis dimension VC-dimension Characterization VCdim(G) = m if and only if VCdim(G) = sup {m N S G (m) = 2 m } 1. there is a set of m points {v 1,..., v m } that is shattered by G 2. no set of m + 1 points {v 1,..., v m+1 } is shattered by G Lemma (Sauer) If VCdim(G) <, for all m S G (m) VCdim(G) k=0 m VCdim(G) ( em S G (m) VCdim(G) ) VCdim(G) ( m ) k. In particular when 21
24 Finite VC-dimension Consequences If VCdim(G) = d <, for any 1 > ɛ > 0 and any 1 > δ > 0 and for any distribution D, if N d then P D D N X sup g G log( 2eN R lb (g) R lb (g, D) 4 + δ 2N d ) 1 δ Learnability a finite VC-dimension ensures agnostic PAC-learnability of the ERM ( ) VCdim(G)+log 1 δ it can be shown that N G (ɛ, δ) = Θ ɛ 2 22
25 VC-dimension VC-dimension calculation is very difficult! A useful result: Theorem Let F be a vector space of functions from X to R of dimension p. Let G be the class of models given by G = { g : X {0, 1} f F, X X g(x) = 1 f (X) 0 }. Then VCdim(G) p. 23
26 Is a finite VC-dimension needed? Theorem Let G be a class of models from X to Y = {0, 1}. Then the following properties are equivalent: 1. G is agnostic PAC-learnable with the binary loss l b 2. ERM provides agnostic PAC-learnable with the binary loss l b for G 3. VCdim(G) < Interpretation learnability in the PAC sense is therefore uniquely characterized by the VC-dimension of the class of models no algorithmic tricks can be used to circumvent this fact! but this applies only to a fix class! 24
27 Beyond binary classification numerous extensions are available to the regression setting (with quadratic or absolute loss) to classification with more than two classes refined complexity measures are available Rademacher complexity Covering numbers better bounds are also available in general in the noise free situation But the overall message remains the same: learnability is only possible in classes of bounded complexity. 25
28 Outline Introduction PAC learning ERM in practice 26
29 ERM in practice Empirical risk minimization g ERM,l,G,D = arg min R l (g, D) g G Implementation? what class G should we use? potential candidates? how to chose among them? how to implement the minimization part complexity? approximate solutions? 27
30 Model classes Some examples fixed basis models, e.g. for Y = { 1, 1} { ( K )} G = X sign α k f k (X), k=1 where the f k are fixed functions from X to R parametric basis models { ( K )} G = X sign α k f k (w k, X), k=1 where the f k (w k,.) are fixed functions from X to R and the w k are parameters that enable tuning the f k useful also for Y = R (remove the indicator function) 28
31 Examples Linear models X = R P the linearity is with respect to α basic model f k (X) = X k P k=1 α kf k (X) = α T X general models f k (X) can be any polynomial function on R P or more generally a function from R P to R e.g. f k (X) = X 1 X 2 2, f k (X) = log X 3, etc. 29
32 Examples Nonlinear models Radial Basis Function (RBF) neural networks: X = R P f k ((β, w k ), X) = exp ( β X w k 2) one hidden layer perceptron: X = R P 1 f k ((β, w k ), X) = 1 + exp( β w T k X) More complex outputs if Y <, write Y = {y 1,..., y L } possible class G = { X y t(x), with t(x) = arg max l ( exp ( ))} K α kl f kl (w kl, X) k=1 30
33 (Meta)Parameters Parametric view previous classes are described by parameters ERM is defined at the model level but can equivalently be considered at the parameter level { if e.g. G = X sign( } K k=1 α kf k (X)) solving min g G Rlb (g, D) is equivalent to solving min R α R K lb (g α, D), where g α is the model associated to α 31
34 (Meta)Parameters Parametric view previous classes are described by parameters ERM is defined at the model level but can equivalently be considered at the parameter level { if e.g. G = X sign( } K k=1 α kf k (X)) solving min g G Rlb (g, D) is equivalent to solving min R α R K lb (g α, D), where g α is the model associated to α Meta-parameters to avoid confusion, we use the term "meta-parameters" to refer to parameters of the machine learning algorithm in ERM, those are class level parameters (G itself): K the f k functions the parametric form of f k (w k,.) 31
35 ERM and optimization Standard optimization problem computing arg min g G Rl (g, D) is a classical optimization problem no closed-form solution in general ERM relies on standard algorithms: gradient based algorithms if possible, combinatorial optimization tools if needed Very different complexities from easy cases: linear models with quadratic loss to NP-hard ones: binary loss even with super simple models 32
36 Linear regression ERM version of linear regression class of models { } G = g : R P R (β 0, β), X R P g β0,β(x) = β 0 + β T x loss: l(p, t) = (p t) 2 empirical risk: R l (g β0,β, D) = 1 N N i=1 (Y i β 0 β T X i ) 2 standard solution ( β 0, β ) T = ( X T X ) 1 X T Y with X = 1 X T 1 1 X T N computational cost in Θ ( NP 2) Y = Y T 1 Y T N 33
37 Linear classification Linear model with binary loss class of models { } G = g : R P {0, 1} (β 0, β), X R P g β0,β(x) = sign(β 0 + β T x) loss: l b (p, t) = 1 p t empirical risk: misclassification rate in this context ERM is NP-hard and tight approximations are also NP-hard notice that the input dimension is the source of complexity Noise if the optimal model makes zero error, then ERM is polynomial! complexity comes from both noise and the binary loss 34
38 Gradient descent Smooth functions Y = R parametric case G = {X F(w, X)} assume the loss function l and the models in the class G are differentiable (can be extended to subgradients) gradient of the empirical loss w Rl (w, D) = 1 N N i=1 l p (F(w, X i), Y i ) w F(w, X i ) ERM through standard gradient based algorithms, such as gradient descent w t = w t 1 γ t w Rl (w t 1, D) 35
39 Stochastic gradient descent Finite sum leverage the structure of R l (w, D) = 1 N N i=1 l(f (w, X i), Y i ) what about updating according to only one example? stochastic gradient descent 1. start with a random w 0 2. iterate 2.1 select i t randomly uniformly in {1,..., N} 2.2 w t = w t 1 γ t l p (F (wt 1, X i t ), Y i t ) wf (w t 1, X i t ) practical tips: use the Polyak-Ruppert averaging: w m = 1 m 1 m t=0 wt γ t = (γ 0 + t) κ, κ ]0.5, 1], γ 0 0 numerous acceleration techniques such as momentum ( averaged" gradients) 36
40 Ad hoc methods Heuristics numerous heuristics have been proposed for ERM and related problems one of the main tools is alternate/separate optimization: optimize with respect to some parameters while holding the other constants Radial basis function { G = X K k=1 α k exp ( β X w k 2)} set β heuristically, e.g. to the inverse of the smallest squared distance between two X in the data set set the w k via a unsupervised method applied to the (X i ) 1 i N only (for instance the K-means algorithm) consider β and the w k fixed and apply standard ERM to the α k, e.g. linear regression if the loss function is quadratic and Y = R 37
41 ERM and statistics Maximum likelihood the classical way of estimating parameters in statistics consists in maximizing the likelihood function this is empirical risk minimization in disguise learnability results apply! Linear regression in linear regression one assumes that the following conditional distribution: Y i X i = x N (β 0 + β T x, σ 2 ) the MLE estimate of β 0 and β is obtained as MLE=ERM here (β 0, β) MLE = arg min β 0,β N ) 2 (Y i β 0 β T X i i=1 38
42 Logistic Regression MLE in logistic regression one assumes that (with Y = {0, 1}) P(Y i = 1 X i = x) = exp( β 0 β T x) = h β 0,β(x) the MLE estimate is obtained by maximizing over (β 0, β) the following function N (Y i log h β0,β(x i ) + (1 Y i ) log(1 h β0,β(x i ))) i=1 39
43 ERM version Machine learning view assume Y = R use again the class of linear models the loss is given by l(p, t) = t log(1 + exp( p)) + (1 t) log(1 + exp(p)) Extended ML framework the standard ML approach consists in looking for g in the set of functions from X to Y the logistic regression does not model directly the link between X and Y but rather a probabilistic link the ML version is based on a new ML paradigm where the loss function is defined on Y Y and g is a function from X to Y 40
44 Relaxation Simplifying the ERM ERM with binary loss is complex: non convex loss with no meaningful gradient goal: keep the binary decision but remove the binary loss solution: ask to the model a score rather than a binary decision build a loss function that compares a score to the binary decision use a decision technique consistent with the score generally simpler to formulate with Y = { 1, 1} and the sign function this a relaxation as we do not look anymore for a crisp 0/1 solution but for a continuous one 41
45 Examples Numerous possible solutions Y = { 1, 1} with decision based on sign(p) logistic loss: l logi (p, t) = log(1 + exp( pt)) perceptron loss: l per (p, t) = max(0, pt) hinge loss (Support Vector Machine): l hinge (p, t) = max(0, 1 pt) exponential loss (Ada boost): l exp (p, t) = exp( pt) quadratic loss: l 2 (p, t) = (p t) 2 42
46 Beyond ERM This is not ERM anymore! sign g is a model in the original sense (a function from X to Y) but l relax (g(x), Y ) l b (sign(g(x)), Y) surrogate loss minimization some theoretical results are available but in general no guarantee to find the best model with a surrogate loss Are there other reasons to avoid ERM? 43
47 Negative results (binary classification) VCdim(G) = consider a fixed ML algorithm that picks up a classifier in G with infinite VC dimension (using whatever criterion) for all ɛ > 0 and all N, there is D such that R G = 0 and E D D N (R(g D )) 1 2e ɛ VCdim(G) < for all ɛ > 0, there is D such that R G R > 1 2 ɛ 44
48 Conclusion Summary The empirical risk minimization framework seems appealing at first but it has several limitations the binary loss is associated to practical difficulties: implementation is difficult (because of the lack of smoothness) complexity can be high in the case of noisy data learnability is guaranteed but only for model classes with finite VC dimension which are strictly limited! Beyond ERM surrogate loss function data adaptive model class 45
49 Licence This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. 46
50 Version Last git commit: By: Fabrice Rossi Git hash: 1b39c1bacfc1b07f96d689db230b a62d4 47
Support Vector Machine
Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary
More informationOptimal decision and naive Bayes
Optimal decision and naive Bayes Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2017 1 Outline Standard supervised learning hypothesis Optimal model The Naive Bayes Classifier 2 Data model Data
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationCOMS 4771 Introduction to Machine Learning. Nakul Verma
COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric
More informationThe PAC Learning Framework -II
The PAC Learning Framework -II Prof. Dan A. Simovici UMB 1 / 1 Outline 1 Finite Hypothesis Space - The Inconsistent Case 2 Deterministic versus stochastic scenario 3 Bayes Error and Noise 2 / 1 Outline
More informationIntroduction to Machine Learning
Introduction to Machine Learning Vapnik Chervonenkis Theory Barnabás Póczos Empirical Risk and True Risk 2 Empirical Risk Shorthand: True risk of f (deterministic): Bayes risk: Let us use the empirical
More informationLecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;
CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationAn Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI
An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process
More informationIFT Lecture 7 Elements of statistical learning theory
IFT 6085 - Lecture 7 Elements of statistical learning theory This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s): Brady Neal and
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationMachine Learning Theory (CS 6783)
Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Hollister, 306 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)
More informationPart of the slides are adapted from Ziko Kolter
Part of the slides are adapted from Ziko Kolter OUTLINE 1 Supervised learning: classification........................................................ 2 2 Non-linear regression/classification, overfitting,
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More information12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016
12. Structural Risk Minimization ECE 830 & CS 761, Spring 2016 1 / 23 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses
More informationMachine Learning Theory (CS 6783)
Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Kimball, B-11 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)
More informationCSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18
CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H
More informationWarm up: risk prediction with logistic regression
Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T
More informationPAC Learning Introduction to Machine Learning. Matt Gormley Lecture 14 March 5, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University PAC Learning Matt Gormley Lecture 14 March 5, 2018 1 ML Big Picture Learning Paradigms:
More informationMachine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d.
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU10701 11. Learning Theory Barnabás Póczos Learning Theory We have explored many ways of learning from data But How good is our classifier, really? How much data do we
More informationThe sample complexity of agnostic learning with deterministic labels
The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018 Linear Regression (LFD 3.2) Regression Classification: Customer record Yes/No Regression: predicting
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationCS229 Supplemental Lecture notes
CS229 Supplemental Lecture notes John Duchi Binary classification In binary classification problems, the target y can take on at only two values. In this set of notes, we show how to model this problem
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationStatistical Learning Learning From Examples
Statistical Learning Learning From Examples We want to estimate the working temperature range of an iphone. We could study the physics and chemistry that affect the performance of the phone too hard We
More informationFORMULATION OF THE LEARNING PROBLEM
FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we
More informationDoes Unlabeled Data Help?
Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline
More informationGeneralization, Overfitting, and Model Selection
Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How
More informationMachine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationIntroduction to Machine Learning (67577) Lecture 3
Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz
More informationPAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht
PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear
More informationSVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning
SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are
More informationComputational Learning Theory
Computational Learning Theory Pardis Noorzad Department of Computer Engineering and IT Amirkabir University of Technology Ordibehesht 1390 Introduction For the analysis of data structures and algorithms
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationComputational Learning Theory
CS 446 Machine Learning Fall 2016 OCT 11, 2016 Computational Learning Theory Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes 1 PAC Learning We want to develop a theory to relate the probability of successful
More informationGeneralization bounds
Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationCOMP9444: Neural Networks. Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization
: Neural Networks Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization 11s2 VC-dimension and PAC-learning 1 How good a classifier does a learner produce? Training error is the precentage
More informationFoundations of Machine Learning
Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about
More informationIntroduction to Statistical Learning Theory
Introduction to Statistical Learning Theory Definition Reminder: We are given m samples {(x i, y i )} m i=1 Dm and a hypothesis space H and we wish to return h H minimizing L D (h) = E[l(h(x), y)]. Problem
More informationGeneralization and Overfitting
Generalization and Overfitting Model Selection Maria-Florina (Nina) Balcan February 24th, 2016 PAC/SLT models for Supervised Learning Data Source Distribution D on X Learning Algorithm Expert / Oracle
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationClass 2 & 3 Overfitting & Regularization
Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating
More informationStatistical learning theory, Support vector machines, and Bioinformatics
1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.
More informationComputational and Statistical Learning theory
Computational and Statistical Learning theory Problem set 2 Due: January 31st Email solutions to : karthik at ttic dot edu Notation : Input space : X Label space : Y = {±1} Sample : (x 1, y 1,..., (x n,
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationMinimax risk bounds for linear threshold functions
CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationMachine Learning. Linear Models. Fabio Vandin October 10, 2017
Machine Learning Linear Models Fabio Vandin October 10, 2017 1 Linear Predictors and Affine Functions Consider X = R d Affine functions: L d = {h w,b : w R d, b R} where ( d ) h w,b (x) = w, x + b = w
More informationWarm up. Regrade requests submitted directly in Gradescope, do not instructors.
Warm up Regrade requests submitted directly in Gradescope, do not email instructors. 1 float in NumPy = 8 bytes 10 6 2 20 bytes = 1 MB 10 9 2 30 bytes = 1 GB For each block compute the memory required
More informationMachine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012
Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Computational Learning Theory Le Song Lecture 11, September 20, 2012 Based on Slides from Eric Xing, CMU Reading: Chap. 7 T.M book 1 Complexity of Learning
More informationMachine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution
More informationClassification objectives COMS 4771
Classification objectives COMS 4771 1. Recap: binary classification Scoring functions Consider binary classification problems with Y = { 1, +1}. 1 / 22 Scoring functions Consider binary classification
More informationClassification with Reject Option
Classification with Reject Option Bartlett and Wegkamp (2008) Wegkamp and Yuan (2010) February 17, 2012 Outline. Introduction.. Classification with reject option. Spirit of the papers BW2008.. Infinite
More informationLogistic regression and linear classifiers COMS 4771
Logistic regression and linear classifiers COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random
More informationIntroduction to Empirical Processes and Semiparametric Inference Lecture 12: Glivenko-Cantelli and Donsker Results
Introduction to Empirical Processes and Semiparametric Inference Lecture 12: Glivenko-Cantelli and Donsker Results Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics
More informationIntroduction to Machine Learning
Introduction to Machine Learning PAC Learning and VC Dimension Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationECE521 Lecture7. Logistic Regression
ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard
More informationOnline Learning and Sequential Decision Making
Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning
More informationComputational Learning Theory
1 Computational Learning Theory 2 Computational learning theory Introduction Is it possible to identify classes of learning problems that are inherently easy or difficult? Can we characterize the number
More informationLecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning
Lecture 0 Neural networks and optimization Machine Learning and Data Mining November 2009 UBC Gradient Searching for a good solution can be interpreted as looking for a minimum of some error (loss) function
More informationMachine Learning And Applications: Supervised Learning-SVM
Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationSTATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION
STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization
More informationVapnik-Chervonenkis Dimension of Neural Nets
P. L. Bartlett and W. Maass: Vapnik-Chervonenkis Dimension of Neural Nets 1 Vapnik-Chervonenkis Dimension of Neural Nets Peter L. Bartlett BIOwulf Technologies and University of California at Berkeley
More informationMACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension
MACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension Prof. Dan A. Simovici UMB Prof. Dan A. Simovici (UMB) MACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension 1 / 30 The
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationCOMS 4771 Introduction to Machine Learning. Nakul Verma
COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationMachine Learning 4771
Machine Learning 477 Instructor: Tony Jebara Topic 5 Generalization Guarantees VC-Dimension Nearest Neighbor Classification (infinite VC dimension) Structural Risk Minimization Support Vector Machines
More informationA Bound on the Label Complexity of Agnostic Active Learning
A Bound on the Label Complexity of Agnostic Active Learning Steve Hanneke March 2007 CMU-ML-07-103 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Machine Learning Department,
More informationGeneralization theory
Generalization theory Chapter 4 T.P. Runarsson (tpr@hi.is) and S. Sigurdsson (sven@hi.is) Introduction Suppose you are given the empirical observations, (x 1, y 1 ),..., (x l, y l ) (X Y) l. Consider the
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationRecap from previous lecture
Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience
More informationRegression with Numerical Optimization. Logistic
CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204
More informationPATTERN RECOGNITION AND MACHINE LEARNING
PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality
More information18.9 SUPPORT VECTOR MACHINES
744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the
More informationMachine Learning. Computational Learning Theory. Eric Xing , Fall Lecture 9, October 5, 2016
Machine Learning 10-701, Fall 2016 Computational Learning Theory Eric Xing Lecture 9, October 5, 2016 Reading: Chap. 7 T.M book Eric Xing @ CMU, 2006-2016 1 Generalizability of Learning In machine learning
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationStatistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
More informationLearning Theory. Sridhar Mahadevan. University of Massachusetts. p. 1/38
Learning Theory Sridhar Mahadevan mahadeva@cs.umass.edu University of Massachusetts p. 1/38 Topics Probability theory meet machine learning Concentration inequalities: Chebyshev, Chernoff, Hoeffding, and
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More information