Empirical Risk Minimization


1 Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018

2 Outline
- Introduction
- PAC learning
- ERM in practice

3 General setting
Data:
- $X$ the input space and $Y$ the output space
- $D$ a fixed and unknown distribution on $X \times Y$
Loss function:
- a loss function $l$ is a function from $Y \times Y$ to $\mathbb{R}_+$ such that $\forall y \in Y$, $l(y, y) = 0$
Model, loss and risk:
- a model $g$ is a function from $X$ to $Y$
- given a loss function $l$, the risk of $g$ is $R_l(g) = \mathbb{E}_{(X,Y) \sim D}[l(g(X), Y)]$
- optimal risk: $R_l^* = \inf_g R_l(g)$

4 Supervised learning
Data set:
- $\mathcal{D} = ((X_i, Y_i))_{1 \le i \le N}$ with $(X_i, Y_i) \sim D$ (i.i.d.), i.e. $\mathcal{D} \sim D^N$ (product distribution)
General problem:
- a learning algorithm creates from $\mathcal{D}$ a model $g_{\mathcal{D}}$
- does $R_l(g_{\mathcal{D}})$ reach $R_l^*$ when $N$ goes to infinity?
- if so, how quickly?

5 Empirical risk minimization
Empirical risk:
$\widehat{R}_l(g, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} l(g(X_i), Y_i) = \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} l(g(x), y)$
ERM algorithm:
- choose a class $\mathcal{G}$ of functions from $X$ to $Y$
- define $g_{ERM,l,\mathcal{G},\mathcal{D}} = \arg\min_{g \in \mathcal{G}} \widehat{R}_l(g, \mathcal{D})$
Is ERM a good machine learning algorithm?
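
As an illustration of the definition above, here is a minimal Python/NumPy sketch of ERM over a small finite class (threshold classifiers on $[0, 1]$); the synthetic data, the grid of thresholds and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D binary classification data (illustrative assumption):
# the "true" rule is 1_{x > 0.3}, labels flipped with probability 0.1 (noise).
N = 200
X = rng.uniform(0.0, 1.0, size=N)
Y = ((X > 0.3).astype(int) ^ (rng.random(N) < 0.1)).astype(int)

def empirical_risk(g, X, Y):
    """Empirical risk with the binary loss: average of 1_{g(x) != y}."""
    return np.mean(g(X) != Y)

# A small finite class G: threshold classifiers x -> 1_{x > t} on a grid of t.
thresholds = np.linspace(0.0, 1.0, 21)
G = [lambda x, t=t: (x > t).astype(int) for t in thresholds]

# ERM: pick the model with the smallest empirical risk.
risks = [empirical_risk(g, X, Y) for g in G]
best = int(np.argmin(risks))
print("best threshold:", thresholds[best], "empirical risk:", risks[best])
```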

6-7 Three distinct problems
1. An optimization problem: given $l$ and $\mathcal{G}$, how difficult is finding $\arg\min_{g \in \mathcal{G}} \widehat{R}_l(g, \mathcal{D})$? Given limited computational resources, how close can we get to it?
2. An estimation problem: given a class of functions $\mathcal{G}$, define $R_{l,\mathcal{G}}^* = \inf_{g \in \mathcal{G}} R_l(g)$. Can we bound $R_l(g_{\mathcal{D}}) - R_{l,\mathcal{G}}^*$?
3. An approximation problem: can we bound $R_{l,\mathcal{G}}^* - R_l^*$? In a way that is compatible with estimation?
Focus of this course: the estimation problem, then the approximation problem, with a few words about the optimization problem.

8 Outline
- Introduction
- PAC learning
- ERM in practice

9 A simplified case
Learning concepts:
- a concept $c$ is a mapping from $X$ to $Y = \{0, 1\}$
- in concept learning, the loss function is the binary loss $l_b$ with $l_b(p, t) = \mathbf{1}_{p \neq t}$
- we consider only a distribution $D_X$ over $X$
- risk and empirical risk definitions are adapted to this setting:
  - risk: $R(g) = \mathbb{E}_{X \sim D_X}[\mathbf{1}_{g(X) \neq c(X)}]$
  - empirical risk: $\widehat{R}(g, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_{g(X_i) \neq c(X_i)}$
- in essence the pair $(D_X, c)$ replaces $D$: this corresponds to a noise-free situation
- as a consequence a data set is $\mathcal{D} = \{X_1, \ldots, X_N\}$ and has to be complemented by a concept to learn

10 PAC learning
Notations: if $A$ is a learning algorithm, then $A(\mathcal{D})$ is the model produced by running $A$ on the data set $\mathcal{D}$.
Definition: a concept class $\mathcal{C}$ (i.e. a set of concepts) is PAC-learnable if there is an algorithm $A$ and a function $N_{\mathcal{C}}$ from $[0,1]^2$ to $\mathbb{N}$ such that: for any $1 > \epsilon > 0$ and any $1 > \delta > 0$, for any distribution $D_X$ and any concept $c \in \mathcal{C}$, if $N \geq N_{\mathcal{C}}(\epsilon, \delta)$ then
$P_{\mathcal{D} \sim D_X^N}\{R(A(\mathcal{D})) \leq \epsilon\} \geq 1 - \delta$
("probably": $1 - \delta$; "approximately correct": $\epsilon$).

11 Concept learning and ERM
Remark:
- the concept to learn $c$ is in $\mathcal{C}$, thus $R_{\mathcal{G}}^* = 0$
- in addition, for any $\mathcal{D}$, $\widehat{R}(g_{ERM,\mathcal{G},\mathcal{D}}, \mathcal{D}) = 0$
- then ERM provides PAC-learnability if for any $g \in \mathcal{C}$ such that $\widehat{R}(g, \mathcal{D}) = 0$, $P_{\mathcal{D} \sim D_X^N}\{R(g) \leq \epsilon\} \geq 1 - \delta$
Theorem: let $\mathcal{C}$ be a finite concept class and let $A$ be an algorithm that outputs $A(\mathcal{D})$ such that $\widehat{R}(A(\mathcal{D}), \mathcal{D}) = 0$. Then when $N \geq \frac{1}{\epsilon} \log \frac{|\mathcal{C}|}{\delta}$,
$P_{\mathcal{D} \sim D_X^N}\{R(A(\mathcal{D})) \leq \epsilon\} \geq 1 - \delta$

12 Proof
1. We consider ways to break the AC part, i.e. having both $\widehat{R}(g, \mathcal{D}) = 0$ and $R(g) > \epsilon$. We have
$Q = P(\exists g \in \mathcal{C},\ \widehat{R}(g, \mathcal{D}) = 0 \text{ and } R(g) > \epsilon) = P\left(\bigcup_{g \in \mathcal{C}} \left\{\widehat{R}(g, \mathcal{D}) = 0 \text{ and } R(g) > \epsilon\right\}\right)$
2. Union bound: $Q \leq \sum_{g \in \mathcal{C}} P(\widehat{R}(g, \mathcal{D}) = 0 \text{ and } R(g) > \epsilon)$
3. Then we have $P(\widehat{R}(g, \mathcal{D}) = 0 \text{ and } R(g) > \epsilon) = P(\widehat{R}(g, \mathcal{D}) = 0 \mid R(g) > \epsilon)\, P(R(g) > \epsilon) \leq P(\widehat{R}(g, \mathcal{D}) = 0 \mid R(g) > \epsilon)$

13 Proof cont.
- notice that $R(g) = P_{X \sim D_X}(g(X) \neq c(X))$, thus $P_{X \sim D_X}(g(X) = c(X) \mid R(g) > \epsilon) \leq 1 - \epsilon$
- as the observations are i.i.d., $P(\widehat{R}(g, \mathcal{D}) = 0 \mid R(g) > \epsilon) \leq (1 - \epsilon)^N \leq e^{-N\epsilon}$
- finally $P(\exists g \in \mathcal{C},\ \widehat{R}(g, \mathcal{D}) = 0 \text{ and } R(g) > \epsilon) \leq |\mathcal{C}|\, e^{-N\epsilon}$
- then if $\widehat{R}(A(\mathcal{D}), \mathcal{D}) = 0$, $P_{\mathcal{D} \sim D_X^N}\{R(A(\mathcal{D})) \leq \epsilon\} \geq 1 - |\mathcal{C}|\, e^{-N\epsilon}$
- we want $|\mathcal{C}|\, e^{-N\epsilon} \leq \delta$, which happens when $N \geq \frac{1}{\epsilon} \log \frac{|\mathcal{C}|}{\delta}$

14 PAC concept learning
ERM:
- ERM provides PAC-learnability for finite concept classes
- optimization: computational cost in $\Theta(N\, |\mathcal{C}|)$
Data consumption:
- the data needed to reach some PAC level grows with the logarithm of the size of the concept class
- a finite set $\mathcal{C}$ can be encoded with $\log_2 |\mathcal{C}|$ bits (by numbering the elements)
- each observation $X$ fixes one bit of the solution
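
A quick numeric reading of the sample size $N \geq \frac{1}{\epsilon}\log\frac{|\mathcal{C}|}{\delta}$; the values of $|\mathcal{C}|$, $\epsilon$ and $\delta$ below are illustrative assumptions.

```python
import math

def pac_sample_size(card_C, eps, delta):
    """Sample size sufficient for PAC learning a finite concept class in the
    noise-free case: N >= (1/eps) * log(|C| / delta)."""
    return math.ceil(math.log(card_C / delta) / eps)

# Illustrative values: a concept class with one million concepts.
for eps in (0.1, 0.01):
    print("eps =", eps, "-> N >=", pac_sample_size(card_C=10**6, eps=eps, delta=0.05))
```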

15 Generalization
Concept learning is too limited:
- no noise
- fixed loss function
Agnostic PAC learnability: a class of models $\mathcal{G}$ (functions from $X$ to $Y$) is PAC-learnable with respect to a loss function $l$ if there is an algorithm $A$ and a function $N_{\mathcal{G}}$ from $[0,1]^2$ to $\mathbb{N}$ such that: for any $1 > \epsilon > 0$ and any $1 > \delta > 0$, for any distribution $D$ on $X \times Y$, if $N \geq N_{\mathcal{G}}(\epsilon, \delta)$ then
$P_{\mathcal{D} \sim D^N}\left\{R_l(A(\mathcal{D})) \leq R_{l,\mathcal{G}}^* + \epsilon\right\} \geq 1 - \delta$
Main questions:
- does ERM provide agnostic PAC learnability?
- does that apply to infinite classes of models?

16-17 Uniform approximation
Lemma: controlling the ERM can be done by ensuring that the empirical risk is uniformly a good approximation of the true risk:
$R_l(g_{ERM,l,\mathcal{G},\mathcal{D}}) - R_{l,\mathcal{G}}^* \leq 2 \sup_{g \in \mathcal{G}} \left|R_l(g) - \widehat{R}_l(g, \mathcal{D})\right|$
Proof. Write $g_{ERM}$ for $g_{ERM,l,\mathcal{G},\mathcal{D}}$. For any $g \in \mathcal{G}$, we have
$R_l(g_{ERM}) - R_l(g) = R_l(g_{ERM}) - \widehat{R}_l(g_{ERM}, \mathcal{D}) + \widehat{R}_l(g_{ERM}, \mathcal{D}) - R_l(g)$
$\leq R_l(g_{ERM}) - \widehat{R}_l(g_{ERM}, \mathcal{D}) + \widehat{R}_l(g, \mathcal{D}) - R_l(g)$ (as $g_{ERM}$ minimizes the empirical risk over $\mathcal{G}$)
$\leq 2 \sup_{g' \in \mathcal{G}} \left|R_l(g') - \widehat{R}_l(g', \mathcal{D})\right|$,
and taking the infimum over $g \in \mathcal{G}$ leads to the conclusion.

18 Finite classes
Theorem: if $|\mathcal{G}| < \infty$ and if $l$ takes values in $[a, b]$, then when $N \geq \frac{\log(2|\mathcal{G}|/\delta)\,(b-a)^2}{2\epsilon^2}$,
$P_{\mathcal{D} \sim D^N}\left\{\sup_{g \in \mathcal{G}} \left|R_l(g) - \widehat{R}_l(g, \mathcal{D})\right| \leq \epsilon\right\} \geq 1 - \delta$
Proof (very rough sketch):
1. we use the union bound technique to focus on a single model $g$
2. then we use the Hoeffding inequality to bound the difference between an empirical average and an expectation; in our context it says
$P_{\mathcal{D} \sim D^N}\left\{\left|R_l(g) - \widehat{R}_l(g, \mathcal{D})\right| \geq \epsilon\right\} \leq 2 \exp\left(-\frac{2N\epsilon^2}{(b-a)^2}\right)$
3. the conclusion is obtained as in the simple case of concept learning
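
A small simulation sketch of this theorem, comparing the observed uniform deviation $\sup_{g}|R_{l_b}(g) - \widehat{R}_{l_b}(g,\mathcal{D})|$ with the Hoeffding-plus-union bound for a loss in $[0,1]$; the finite class (threshold classifiers) and the synthetic distribution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite class of threshold classifiers on [0, 1] (illustrative assumption).
thresholds = np.linspace(0.0, 1.0, 50)

def true_risks(p_flip=0.1, t_star=0.3, n_mc=200_000):
    """Monte-Carlo estimate of the true risk of every threshold classifier
    under X ~ U[0,1], Y = 1_{X > t_star} flipped with probability p_flip."""
    X = rng.uniform(0.0, 1.0, n_mc)
    Y = ((X > t_star).astype(int) ^ (rng.random(n_mc) < p_flip)).astype(int)
    preds = X[None, :] > thresholds[:, None]          # |G| x n_mc predictions
    return (preds != Y[None, :]).mean(axis=1)

R = true_risks()

# Empirical risks on a sample of size N, and the observed uniform deviation.
N, delta = 500, 0.05
X = rng.uniform(0.0, 1.0, N)
Y = ((X > 0.3).astype(int) ^ (rng.random(N) < 0.1)).astype(int)
preds = X[None, :] > thresholds[:, None]
R_hat = (preds != Y[None, :]).mean(axis=1)

sup_dev = np.max(np.abs(R - R_hat))
# Hoeffding + union bound for a loss in [0, 1]:
# with probability >= 1 - delta, sup_g |R(g) - R_hat(g)| <= eps with
eps = np.sqrt(np.log(2 * len(thresholds) / delta) / (2 * N))
print("observed sup deviation:", sup_dev, " bound:", eps)
```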

19 Finite classes
Theorem: if $|\mathcal{G}| < \infty$ and if $l$ takes values in $[a, b]$, then ERM provides agnostic PAC-learnability with $N_{\mathcal{G}}(\epsilon, \delta) = \frac{2\log(2|\mathcal{G}|/\delta)\,(b-a)^2}{\epsilon^2}$
Discussion:
- obvious consequence of the uniform approximation result
- the limitation $l \in [a, b]$ can be lifted, but only asymptotically
- the dependency of the data size on the quality (i.e. on $\epsilon$) is far less satisfactory than in the simple case ($\epsilon^{-2}$ instead of $\epsilon^{-1}$): this is a consequence of allowing noise

20 Infinite classes
Restriction: we keep the noise but move back to a simple case, $Y = \{0, 1\}$ and $l = l_b$.
Growth function: if $\{v_1, \ldots, v_m\}$ is a finite subset of $X$, let
$\mathcal{G}_{\{v_1,\ldots,v_m\}} = \{(g(v_1), \ldots, g(v_m)) \mid g \in \mathcal{G}\} \subset \{0,1\}^m$;
the growth function of $\mathcal{G}$ is $S_{\mathcal{G}}(m) = \sup_{\{v_1,\ldots,v_m\} \subset X} \left|\mathcal{G}_{\{v_1,\ldots,v_m\}}\right|$.

21 Interpretation
Going back to finite things:
- $\left|\mathcal{G}_{\{v_1,\ldots,v_m\}}\right|$ gives the number of models as seen by the inputs $\{v_1, \ldots, v_m\}$: it corresponds to the number of possible classification decisions (a.k.a. binary labellings) of those inputs
- the growth function corresponds to a worst-case analysis: the set of inputs that can be labelled in the largest number of different ways
Vocabulary:
- if $\left|\mathcal{G}_{\{v_1,\ldots,v_m\}}\right| = 2^m$ then $\{v_1, \ldots, v_m\}$ is said to be shattered by $\mathcal{G}$
- $S_{\mathcal{G}}(m)$ is the $m$-th shatter coefficient of $\mathcal{G}$
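
A brute-force sketch of these quantities for a concrete class, threshold classifiers $x \mapsto \mathbf{1}_{x > t}$ on $[0,1]$; the grids and the random search over point sets are illustrative assumptions, and the search only lower-bounds the true supremum.

```python
import numpy as np

def restriction(points, thresholds):
    """Set of distinct labellings of `points` produced by the class of
    threshold classifiers x -> 1_{x > t}, t in `thresholds`."""
    return {tuple((points > t).astype(int)) for t in thresholds}

def shatter_coefficient(m, thresholds, candidates, n_trials=200, rng=None):
    """Brute-force (approximate) m-th shatter coefficient: maximum number of
    labellings over randomly drawn m-point subsets of `candidates`."""
    rng = np.random.default_rng(0) if rng is None else rng
    best = 0
    for _ in range(n_trials):
        pts = rng.choice(candidates, size=m, replace=False)
        best = max(best, len(restriction(pts, thresholds)))
    return best

thresholds = np.linspace(0.0, 1.0, 1000)   # dense grid of threshold classifiers
candidates = np.linspace(0.01, 0.99, 200)  # candidate input points
for m in (1, 2, 3, 4):
    print(m, shatter_coefficient(m, thresholds, candidates), 2 ** m)
# For thresholds, S_G(m) = m + 1, which is << 2^m as soon as m >= 2:
# only single points are shattered, so VCdim = 1 for this class.
```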

22 Uniform approximation
Theorem: for any $1 > \delta > 0$ and for any distribution $D$,
$P_{\mathcal{D} \sim D^N}\left\{\sup_{g \in \mathcal{G}} \left|R_{l_b}(g) - \widehat{R}_{l_b}(g, \mathcal{D})\right| \leq \frac{4 + \sqrt{\log S_{\mathcal{G}}(2N)}}{\delta\sqrt{2N}}\right\} \geq 1 - \delta$
Consequences:
- strong link between the growth function and uniform approximation
- useful only if $\frac{\log S_{\mathcal{G}}(2m)}{m}$ goes to zero when $m \to \infty$
- if $\mathcal{G}$ shatters sets of arbitrary sizes, $\log S_{\mathcal{G}}(2m) = 2m \log 2$

23 Vapnik-Chervonenkis dimension
VC-dimension: $\mathrm{VCdim}(\mathcal{G}) = \sup\{m \in \mathbb{N} \mid S_{\mathcal{G}}(m) = 2^m\}$
Characterization: $\mathrm{VCdim}(\mathcal{G}) = m$ if and only if
1. there is a set of $m$ points $\{v_1, \ldots, v_m\}$ that is shattered by $\mathcal{G}$
2. no set of $m+1$ points $\{v_1, \ldots, v_{m+1}\}$ is shattered by $\mathcal{G}$
Lemma (Sauer): if $\mathrm{VCdim}(\mathcal{G}) < \infty$, then for all $m$, $S_{\mathcal{G}}(m) \leq \sum_{k=0}^{\mathrm{VCdim}(\mathcal{G})} \binom{m}{k}$. In particular, when $m \geq \mathrm{VCdim}(\mathcal{G})$, $S_{\mathcal{G}}(m) \leq \left(\frac{em}{\mathrm{VCdim}(\mathcal{G})}\right)^{\mathrm{VCdim}(\mathcal{G})}$.

24 Finite VC-dimension
Consequences: if $\mathrm{VCdim}(\mathcal{G}) = d < \infty$, for any $1 > \delta > 0$ and for any distribution $D$, if $N \geq d$ then
$P_{\mathcal{D} \sim D^N}\left\{\sup_{g \in \mathcal{G}} \left|R_{l_b}(g) - \widehat{R}_{l_b}(g, \mathcal{D})\right| \leq \frac{4 + \sqrt{d \log(2eN/d)}}{\delta\sqrt{2N}}\right\} \geq 1 - \delta$
Learnability:
- a finite VC-dimension ensures agnostic PAC-learnability of the ERM
- it can be shown that $N_{\mathcal{G}}(\epsilon, \delta) = \Theta\left(\frac{\mathrm{VCdim}(\mathcal{G}) + \log\frac{1}{\delta}}{\epsilon^2}\right)$

25 VC-dimension
VC-dimension calculation is very difficult! A useful result:
Theorem: let $\mathcal{F}$ be a vector space of functions from $X$ to $\mathbb{R}$ of dimension $p$. Let $\mathcal{G}$ be the class of models given by
$\mathcal{G} = \left\{g : X \to \{0,1\} \mid \exists f \in \mathcal{F},\ \forall x \in X,\ g(x) = \mathbf{1}_{f(x) \geq 0}\right\}$.
Then $\mathrm{VCdim}(\mathcal{G}) \leq p$.
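
A small sketch illustrating this result for affine functions on $\mathbb{R}^2$ (so $p = 3$): an exact linear-programming feasibility check (using SciPy, an assumed dependency) tests whether a labelling is realizable by a halfspace; the helper names and point configurations are illustrative.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Exact check that a labelling in {0,1}^m is realizable by an affine
    halfspace 1_{w.x + b >= 0} in R^2, via a feasibility linear program:
    find (w, b) with s_i (w.x_i + b) >= 1, where s_i = 2*labels_i - 1."""
    s = 2 * np.asarray(labels) - 1                     # map {0,1} -> {-1,+1}
    A_ub = -s[:, None] * np.hstack([points, np.ones((len(points), 1))])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0                             # 0 = feasible optimum found

def shattered(points):
    """True if every labelling of the points is realizable by a halfspace."""
    return all(linearly_separable(points, labels)
               for labels in product([0, 1], repeat=len(points)))

# Three points in general position are shattered (so VCdim >= 3)...
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print("3 points shattered:", shattered(tri))
# ...but this 4-point (XOR) configuration is not; in fact no 4 points in the
# plane can be shattered, consistent with VCdim <= p = 3 for affine functions.
quad = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print("4 points shattered:", shattered(quad))
```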

26 Is a finite VC-dimension needed?
Theorem: let $\mathcal{G}$ be a class of models from $X$ to $Y = \{0, 1\}$. Then the following properties are equivalent:
1. $\mathcal{G}$ is agnostic PAC-learnable with the binary loss $l_b$
2. ERM provides agnostic PAC-learnability with the binary loss $l_b$ for $\mathcal{G}$
3. $\mathrm{VCdim}(\mathcal{G}) < \infty$
Interpretation:
- learnability in the PAC sense is therefore uniquely characterized by the VC-dimension of the class of models
- no algorithmic tricks can be used to circumvent this fact!
- but this applies only to a fixed class!

27 Beyond binary classification
- numerous extensions are available: to the regression setting (with quadratic or absolute loss), to classification with more than two classes
- refined complexity measures are available: Rademacher complexity, covering numbers
- better bounds are also available, both in general and in the noise-free situation
But the overall message remains the same: learnability is only possible in classes of bounded complexity.

28 Outline
- Introduction
- PAC learning
- ERM in practice

29 ERM in practice
Empirical risk minimization: $g_{ERM,l,\mathcal{G},\mathcal{D}} = \arg\min_{g \in \mathcal{G}} \widehat{R}_l(g, \mathcal{D})$
Implementation?
- what class $\mathcal{G}$ should we use? potential candidates? how to choose among them?
- how to implement the minimization part? complexity? approximate solutions?

30 Model classes
Some examples:
- fixed basis models, e.g. for $Y = \{-1, 1\}$:
$\mathcal{G} = \left\{x \mapsto \mathrm{sign}\left(\sum_{k=1}^{K} \alpha_k f_k(x)\right)\right\}$,
where the $f_k$ are fixed functions from $X$ to $\mathbb{R}$
- parametric basis models:
$\mathcal{G} = \left\{x \mapsto \mathrm{sign}\left(\sum_{k=1}^{K} \alpha_k f_k(w_k, x)\right)\right\}$,
where the $f_k(w_k, \cdot)$ are fixed functions from $X$ to $\mathbb{R}$ and the $w_k$ are parameters that enable tuning the $f_k$
- useful also for $Y = \mathbb{R}$ (remove the sign function)

31 Examples
Linear models ($X = \mathbb{R}^P$): the linearity is with respect to $\alpha$
- basic model: $f_k(x) = x_k$, so that $\sum_{k=1}^{P} \alpha_k f_k(x) = \alpha^T x$
- general models: $f_k(x)$ can be any polynomial function on $\mathbb{R}^P$, or more generally a function from $\mathbb{R}^P$ to $\mathbb{R}$, e.g. $f_k(x) = x_1 x_2^2$, $f_k(x) = \log x_3$, etc.

32 Examples
Nonlinear models:
- Radial Basis Function (RBF) neural networks: $X = \mathbb{R}^P$, $f_k((\beta, w_k), x) = \exp\left(-\beta \|x - w_k\|^2\right)$
- one-hidden-layer perceptron: $X = \mathbb{R}^P$, $f_k((\beta, w_k), x) = \frac{1}{1 + \exp(-\beta\, w_k^T x)}$
More complex outputs: if $|Y| < \infty$, write $Y = \{y_1, \ldots, y_L\}$; a possible class is
$\mathcal{G} = \left\{x \mapsto y_{t(x)}, \text{ with } t(x) = \arg\max_l \exp\left(\sum_{k=1}^{K} \alpha_{kl} f_{kl}(w_{kl}, x)\right)\right\}$

33-34 (Meta)Parameters
Parametric view:
- the previous classes are described by parameters
- ERM is defined at the model level but can equivalently be considered at the parameter level
- if e.g. $\mathcal{G} = \left\{x \mapsto \mathrm{sign}\left(\sum_{k=1}^{K} \alpha_k f_k(x)\right)\right\}$, solving $\min_{g \in \mathcal{G}} \widehat{R}_{l_b}(g, \mathcal{D})$ is equivalent to solving $\min_{\alpha \in \mathbb{R}^K} \widehat{R}_{l_b}(g_\alpha, \mathcal{D})$, where $g_\alpha$ is the model associated to $\alpha$
Meta-parameters:
- to avoid confusion, we use the term "meta-parameters" to refer to parameters of the machine learning algorithm
- in ERM, those are class-level parameters ($\mathcal{G}$ itself): $K$, the $f_k$ functions, the parametric form of $f_k(w_k, \cdot)$

35 ERM and optimization
Standard optimization problem:
- computing $\arg\min_{g \in \mathcal{G}} \widehat{R}_l(g, \mathcal{D})$ is a classical optimization problem
- no closed-form solution in general
- ERM relies on standard algorithms: gradient-based algorithms if possible, combinatorial optimization tools if needed
Very different complexities:
- from easy cases: linear models with quadratic loss
- to NP-hard ones: binary loss, even with super simple models

36 Linear regression
ERM version of linear regression:
- class of models: $\mathcal{G} = \left\{g : \mathbb{R}^P \to \mathbb{R} \mid \exists (\beta_0, \beta),\ \forall x \in \mathbb{R}^P,\ g_{\beta_0,\beta}(x) = \beta_0 + \beta^T x\right\}$
- loss: $l(p, t) = (p - t)^2$
- empirical risk: $\widehat{R}_l(g_{\beta_0,\beta}, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \beta_0 - \beta^T X_i)^2$
Standard solution: $(\widehat{\beta}_0, \widehat{\beta}^T)^T = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$ with
$\mathbf{X} = \begin{pmatrix} 1 & X_1^T \\ \vdots & \vdots \\ 1 & X_N^T \end{pmatrix}, \qquad \mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_N \end{pmatrix}$;
computational cost in $\Theta(NP^2)$.
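
A minimal NumPy sketch of this closed form on synthetic data (all sizes and coefficients are illustrative assumptions); a least-squares solver is shown alongside the explicit normal equations since it is numerically safer than forming an inverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative assumption).
N, P = 200, 3
X = rng.normal(size=(N, P))
beta_true = np.array([1.0, -2.0, 0.5])
Y = 0.7 + X @ beta_true + 0.1 * rng.normal(size=N)

# Design matrix with an intercept column, as on the slide.
X1 = np.hstack([np.ones((N, 1)), X])

# Normal equations (beta0, beta) = (X^T X)^{-1} X^T Y; in practice a
# least-squares solver is preferable to an explicit inverse.
coef = np.linalg.solve(X1.T @ X1, X1.T @ Y)
coef_lstsq, *_ = np.linalg.lstsq(X1, Y, rcond=None)
print("normal equations:", coef)
print("lstsq           :", coef_lstsq)
```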

37 Linear classification
Linear model with binary loss:
- class of models: $\mathcal{G} = \left\{g : \mathbb{R}^P \to \{0, 1\} \mid \exists (\beta_0, \beta),\ \forall x \in \mathbb{R}^P,\ g_{\beta_0,\beta}(x) = \mathrm{sign}(\beta_0 + \beta^T x)\right\}$
- loss: $l_b(p, t) = \mathbf{1}_{p \neq t}$; the empirical risk is the misclassification rate
- in this context ERM is NP-hard, and tight approximations are also NP-hard
- notice that the input dimension is the source of complexity
Noise:
- if the optimal model makes zero error, then ERM is polynomial!
- complexity comes from both noise and the binary loss

38 Gradient descent
Smooth functions:
- $Y = \mathbb{R}$, parametric case $\mathcal{G} = \{x \mapsto F(w, x)\}$
- assume the loss function $l$ and the models in the class $\mathcal{G}$ are differentiable (can be extended to subgradients)
- gradient of the empirical risk:
$\nabla_w \widehat{R}_l(w, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial l}{\partial p}(F(w, X_i), Y_i)\, \nabla_w F(w, X_i)$
- ERM through standard gradient-based algorithms, such as gradient descent: $w_t = w_{t-1} - \gamma_t \nabla_w \widehat{R}_l(w_{t-1}, \mathcal{D})$
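
A minimal sketch of batch gradient descent for ERM, assuming the quadratic loss and a linear model $F(w, x) = w^T x$ (so the gradient formula above is explicit); the data, step size and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setting: quadratic loss l(p, t) = (p - t)^2 and a linear
# model F(w, x) = w . x, so that grad_w F(w, x) = x.
N, P = 500, 5
X = rng.normal(size=(N, P))
w_true = rng.normal(size=P)
Y = X @ w_true + 0.1 * rng.normal(size=N)

def grad_empirical_risk(w, X, Y):
    """(1/N) sum_i dl/dp(F(w, X_i), Y_i) * grad_w F(w, X_i),
    with dl/dp(p, t) = 2 (p - t) for the quadratic loss."""
    residuals = X @ w - Y
    return 2.0 * X.T @ residuals / len(Y)

w = np.zeros(P)
gamma = 0.1                     # constant step size (illustrative choice)
for t in range(500):
    w = w - gamma * grad_empirical_risk(w, X, Y)

print("estimation error:", np.linalg.norm(w - w_true))
```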

39 Stochastic gradient descent
Finite sum:
- leverage the structure of $\widehat{R}_l(w, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} l(F(w, X_i), Y_i)$
- what about updating according to only one example?
Stochastic gradient descent:
1. start with a random $w_0$
2. iterate:
   2.1 select $i_t$ uniformly at random in $\{1, \ldots, N\}$
   2.2 $w_t = w_{t-1} - \gamma_t \frac{\partial l}{\partial p}(F(w_{t-1}, X_{i_t}), Y_{i_t})\, \nabla_w F(w_{t-1}, X_{i_t})$
Practical tips:
- use Polyak-Ruppert averaging: $\bar{w}_m = \frac{1}{m} \sum_{t=0}^{m-1} w_t$
- $\gamma_t = (\gamma_0 + t)^{-\kappa}$, $\kappa \in\, ]0.5, 1]$, $\gamma_0 \geq 0$
- numerous acceleration techniques such as momentum ("averaged" gradients)
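
A matching sketch of stochastic gradient descent with the step-size schedule and Polyak-Ruppert averaging described above, in the same illustrative quadratic-loss linear-model setting; the choices $\gamma_0 = 100$ and $\kappa = 0.75$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setting: quadratic loss and linear model F(w, x) = w . x.
N, P = 500, 5
X = rng.normal(size=(N, P))
w_true = rng.normal(size=P)
Y = X @ w_true + 0.1 * rng.normal(size=N)

def sgd(X, Y, n_steps=20000, gamma0=100.0, kappa=0.75):
    """Stochastic gradient descent with step gamma_t = (gamma0 + t)^(-kappa)
    and Polyak-Ruppert averaging of the iterates."""
    w = np.zeros(X.shape[1])
    w_avg = np.zeros_like(w)
    for t in range(n_steps):
        i = rng.integers(len(Y))                       # uniform example choice
        grad = 2.0 * (X[i] @ w - Y[i]) * X[i]          # single-example gradient
        w = w - (gamma0 + t) ** (-kappa) * grad
        w_avg += (w - w_avg) / (t + 1)                 # running average of iterates
    return w, w_avg

w_last, w_avg = sgd(X, Y)
print("last iterate error    :", np.linalg.norm(w_last - w_true))
print("averaged iterate error:", np.linalg.norm(w_avg - w_true))
```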

40 Ad hoc methods
Heuristics:
- numerous heuristics have been proposed for ERM and related problems
- one of the main tools is alternate/separate optimization: optimize with respect to some parameters while holding the others constant
Radial basis functions: $\mathcal{G} = \left\{x \mapsto \sum_{k=1}^{K} \alpha_k \exp\left(-\beta \|x - w_k\|^2\right)\right\}$
- set $\beta$ heuristically, e.g. to the inverse of the smallest squared distance between two $X$ in the data set
- set the $w_k$ via an unsupervised method applied to the $(X_i)_{1 \leq i \leq N}$ only (for instance the K-means algorithm)
- consider $\beta$ and the $w_k$ fixed and apply standard ERM to the $\alpha_k$, e.g. linear regression if the loss function is quadratic and $Y = \mathbb{R}$
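
A sketch of the three-step RBF heuristic, using scikit-learn's KMeans for the unsupervised step (an assumed dependency); with dense inputs the $1/\min$ squared-distance rule tends to give very localized basis functions, so a common alternative scale (median-based, not from the slide) is shown for comparison. Data, sizes and helper names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Small illustrative 1-D regression problem.
N, K = 60, 8
X = rng.uniform(-3.0, 3.0, size=(N, 1))
Y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.normal(size=N)

def rbf_fit(beta):
    """Steps 2 and 3 of the heuristic: K-means centers on the inputs only,
    then linear least squares on the RBF features (quadratic-loss ERM)."""
    centers = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).cluster_centers_
    Phi = np.exp(-beta * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    alpha, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return np.sqrt(np.mean((Phi @ alpha - Y) ** 2))

sq_dists = pdist(X, metric="sqeuclidean")
beta_min = 1.0 / np.min(sq_dists)        # heuristic from the slide (very localized here)
beta_med = 1.0 / np.median(sq_dists)     # common alternative scale (not from the slide)
print("RMSE, beta = 1/min squared distance   :", rbf_fit(beta_min))
print("RMSE, beta = 1/median squared distance:", rbf_fit(beta_med))
```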

41 ERM and statistics
Maximum likelihood:
- the classical way of estimating parameters in statistics consists in maximizing the likelihood function
- this is empirical risk minimization in disguise: the learnability results apply!
Linear regression:
- in linear regression one assumes the following conditional distribution: $Y_i \mid X_i = x \sim \mathcal{N}(\beta_0 + \beta^T x, \sigma^2)$
- the MLE estimate of $\beta_0$ and $\beta$ is obtained as $(\beta_0, \beta)_{MLE} = \arg\min_{\beta_0, \beta} \sum_{i=1}^{N} (Y_i - \beta_0 - \beta^T X_i)^2$
- MLE = ERM here

42 Logistic Regression
MLE:
- in logistic regression one assumes that (with $Y = \{0, 1\}$)
$P(Y_i = 1 \mid X_i = x) = \frac{1}{1 + \exp(-\beta_0 - \beta^T x)} = h_{\beta_0,\beta}(x)$
- the MLE estimate is obtained by maximizing over $(\beta_0, \beta)$ the following function (the log-likelihood):
$\sum_{i=1}^{N} \left(Y_i \log h_{\beta_0,\beta}(X_i) + (1 - Y_i) \log(1 - h_{\beta_0,\beta}(X_i))\right)$

43 ERM version
Machine learning view:
- assume the predictions take values in $\mathbb{R}$
- use again the class of linear models
- the loss is given by $l(p, t) = t \log(1 + \exp(-p)) + (1 - t) \log(1 + \exp(p))$
Extended ML framework:
- the standard ML approach consists in looking for $g$ in the set of functions from $X$ to $Y$
- logistic regression does not model directly the link between $X$ and $Y$ but rather a probabilistic link
- this version is based on a new ML paradigm where the loss function is defined on $\widehat{Y} \times Y$ and $g$ is a function from $X$ to a prediction space $\widehat{Y}$ (here $\widehat{Y} = \mathbb{R}$)
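
A tiny numerical check that the per-observation negative log-likelihood of logistic regression coincides with the loss $l(p, t)$ above, evaluated at the score $p = \beta_0 + \beta^T x$; the scores and targets below are random illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

def nll_term(p, t):
    """Negative log-likelihood of one observation in logistic regression,
    written with h(p) = 1 / (1 + exp(-p)) where p = beta0 + beta^T x."""
    h = 1.0 / (1.0 + np.exp(-p))
    return -(t * np.log(h) + (1 - t) * np.log(1 - h))

def erm_loss(p, t):
    """The loss from the slide: l(p, t) = t log(1+exp(-p)) + (1-t) log(1+exp(p))."""
    return t * np.log1p(np.exp(-p)) + (1 - t) * np.log1p(np.exp(p))

p = rng.normal(size=10)                  # arbitrary scores beta0 + beta^T x
t = rng.integers(0, 2, size=10)          # binary targets in {0, 1}
print(np.allclose(nll_term(p, t), erm_loss(p, t)))   # True: MLE = ERM with this loss
```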

44 Relaxation
Simplifying the ERM:
- ERM with the binary loss is complex: non-convex loss with no meaningful gradient
- goal: keep the binary decision but remove the binary loss
- solution: ask the model for a score rather than a binary decision, build a loss function that compares a score to the binary decision, and use a decision technique consistent with the score
- generally simpler to formulate with $Y = \{-1, 1\}$ and the sign function
- this is a relaxation, as we no longer look for a crisp 0/1 solution but for a continuous one

45 Examples
Numerous possible solutions, for $Y = \{-1, 1\}$ with the decision based on $\mathrm{sign}(p)$:
- logistic loss: $l_{logi}(p, t) = \log(1 + \exp(-pt))$
- perceptron loss: $l_{per}(p, t) = \max(0, -pt)$
- hinge loss (Support Vector Machine): $l_{hinge}(p, t) = \max(0, 1 - pt)$
- exponential loss (AdaBoost): $l_{exp}(p, t) = \exp(-pt)$
- quadratic loss: $l_2(p, t) = (p - t)^2$
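
These losses are straightforward to write down; here is a short sketch evaluating them on a few illustrative scores for positive examples ($t = +1$), with the binary loss as a reference.

```python
import numpy as np

# Surrogate losses for Y = {-1, +1}, as functions of the score p and the
# target t (most of them depend only on the margin p * t).
losses = {
    "binary (reference)":     lambda p, t: (np.sign(p) != t).astype(float),
    "logistic":               lambda p, t: np.log1p(np.exp(-p * t)),
    "perceptron":             lambda p, t: np.maximum(0.0, -p * t),
    "hinge (SVM)":            lambda p, t: np.maximum(0.0, 1.0 - p * t),
    "exponential (AdaBoost)": lambda p, t: np.exp(-p * t),
    "quadratic":              lambda p, t: (p - t) ** 2,
}

p = np.array([-2.0, -0.5, 0.5, 2.0])      # scores
t = np.ones_like(p)                       # positive examples, so the margin is p
for name, l in losses.items():
    print(f"{name:24s}", np.round(l(p, t), 3))
```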

46 Beyond ERM
This is not ERM anymore!
- $\mathrm{sign} \circ g$ is a model in the original sense (a function from $X$ to $Y$), but $l_{relax}(g(x), y) \neq l_b(\mathrm{sign}(g(x)), y)$: this is surrogate loss minimization
- some theoretical results are available, but in general there is no guarantee to find the best model with a surrogate loss
Are there other reasons to avoid ERM?

47 Negative results (binary classification)
$\mathrm{VCdim}(\mathcal{G}) = \infty$:
- consider a fixed ML algorithm that picks a classifier in $\mathcal{G}$ with infinite VC dimension (using whatever criterion)
- for all $\epsilon > 0$ and all $N$, there is a distribution $D$ such that $R_{\mathcal{G}}^* = 0$ and $\mathbb{E}_{\mathcal{D} \sim D^N}[R(g_{\mathcal{D}})] \geq \frac{1}{2e} - \epsilon$
$\mathrm{VCdim}(\mathcal{G}) < \infty$:
- for all $\epsilon > 0$, there is a distribution $D$ such that $R_{\mathcal{G}}^* - R^* > \frac{1}{2} - \epsilon$

48 Conclusion
Summary:
- the empirical risk minimization framework seems appealing at first, but it has several limitations
- the binary loss is associated with practical difficulties: implementation is difficult (because of the lack of smoothness), and complexity can be high in the case of noisy data
- learnability is guaranteed, but only for model classes with finite VC dimension, which are strictly limited!
Beyond ERM:
- surrogate loss functions
- data-adaptive model classes

49 Licence
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

50 Version
Last git commit:
By: Fabrice Rossi
Git hash: 1b39c1bacfc1b07f96d689db230b a62d4
