6.867 Machine learning, lecture 2. Topics: review of the learning problem; hypotheses and estimation; estimation criterion; regression.


6.867 Machine learning: lecture 2
Tommi S. Jaakkola, MIT CSAIL (tommi@csail.mit.edu)

Topics
- The learning problem: hypothesis class, estimation algorithm; loss and estimation criterion; sampling, empirical and expected losses
- Regression, example
- Linear regression: estimation, errors, analysis

Review: the learning problem
Recall the image (face) recognition problem (face images of Indyk, Barzilay, Collins, Jaakkola).
- Hypothesis class: we consider some restricted set $F$ of mappings $f : X \to L$ from images to labels.
- Estimation: on the basis of a training set of examples and labels, $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, we find an estimate $\hat f \in F$.
- Evaluation: we measure how well $\hat f$ generalizes to yet unseen examples, i.e., whether $\hat f(x_{new})$ agrees with $y_{new}$.

Hypotheses and estimation
We used a simple linear classifier, a parameterized mapping $f(x; \theta)$ from images $X$ to labels $L$, to solve a binary image classification problem (2's vs 3's):

    $\hat y = f(x; \theta) = \mathrm{sign}(\theta^T x)$

where $x$ is a pixel image and $\hat y \in \{-1, 1\}$. The parameters $\theta$ were adjusted on the basis of the training examples and labels according to a simple mistake-driven update rule (written here in vector form):

    $\theta \leftarrow \theta + y_i x_i$ whenever $y_i \ne \mathrm{sign}(\theta^T x_i)$

The update rule attempts to minimize the number of errors that the classifier makes on the training examples.

Estimation criterion
We can formulate the estimation problem more explicitly by defining a zero-one loss

    $\mathrm{Loss}(y, \hat y) = \begin{cases} 0, & y = \hat y \\ 1, & y \ne \hat y \end{cases}$

so that $\frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(y_i, \hat y_i)$ gives the fraction of prediction errors on the training set. This is a function of the parameters $\theta$ and we can try to minimize it directly.
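To make the zero-one loss and the mistake-driven update above concrete, here is a minimal numerical sketch in Python/numpy; the synthetic two-class data and the number of passes over the training set are illustrative assumptions, not part of the lecture.

    import numpy as np

    def zero_one_empirical_loss(theta, X, y):
        # Fraction of training examples with sign(theta^T x_i) != y_i.
        return np.mean(np.sign(X @ theta) != y)

    def mistake_driven_train(X, y, passes=10):
        # theta <- theta + y_i x_i whenever y_i != sign(theta^T x_i).
        theta = np.zeros(X.shape[1])
        for _ in range(passes):
            for x_i, y_i in zip(X, y):
                if y_i != np.sign(theta @ x_i):
                    theta = theta + y_i * x_i
        return theta

    # Illustrative data: two Gaussian classes with labels in {-1, +1}.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
                   rng.normal(-1.0, 1.0, size=(50, 2))])
    y = np.concatenate([np.ones(50), -np.ones(50)])

    theta_hat = mistake_driven_train(X, y)
    print("empirical zero-one loss:", zero_one_empirical_loss(theta_hat, X, y))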

Estimation criterion cont'd
We have reduced the estimation problem to a minimization problem: find $\theta$ that minimizes the empirical loss

    $\frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(y_i, f(x_i; \theta))$

This formulation is
- valid for any parameterized class of mappings from examples to predictions
- valid when the predictions are discrete labels, real valued, or other, provided that the loss is defined appropriately
- possibly ill-posed (under-constrained) as stated.
But why is it sensible to minimize the empirical loss in the first place, since we are only interested in the performance on new examples?

Training and test performance: sampling
We assume that each training and test example-label pair, $(x, y)$, is drawn independently at random from the same but unknown population of examples and labels. We can represent this population as a joint probability distribution $P(x, y)$ so that each training/test example is a sample from this distribution, $(x_i, y_i) \sim P$.

    Empirical (training) loss $= \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(y_i, f(x_i; \theta))$
    Expected (test) loss $= E_{(x,y) \sim P} \{ \mathrm{Loss}(y, f(x; \theta)) \}$

The training loss, based on a few sampled examples and labels, serves as a proxy for the test performance measured over the whole population.
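The sampling view above can be illustrated with a small Monte Carlo sketch; the population $P(x, y)$ below (a Gaussian input with noisily thresholded labels) and the fixed classifier are assumptions made only for illustration, with a very large sample standing in for the expectation over $P$.

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_P(n):
        # Draw (x_i, y_i) ~ P i.i.d. from an assumed toy population.
        x = rng.normal(size=n)
        y = np.sign(x + 0.5 * rng.normal(size=n))   # labels in {-1, +1}
        return x, y

    def zero_one_loss(y, y_hat):
        return np.mean(y != y_hat)

    theta = 1.0                       # a fixed classifier f(x; theta) = sign(theta * x)
    x_tr, y_tr = sample_P(20)         # small training sample
    x_te, y_te = sample_P(200_000)    # large sample approximating E_{(x,y)~P}

    print("empirical (training) loss:", zero_one_loss(y_tr, np.sign(theta * x_tr)))
    print("expected (test) loss ~   :", zero_one_loss(y_te, np.sign(theta * x_te)))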

Regression
The goal is to make quantitative (real valued) predictions on the basis of a (vector of) features or attributes. Example: predicting vehicle fuel efficiency (mpg) from 8 attributes.

    (example data table: columns y = mpg, cyls, disp, hp, weight, ...)

We need to
- specify the class of functions (e.g., linear)
- select how to measure prediction loss
- solve the resulting minimization problem.

Linear regression
We begin by considering linear regression (easy to extend to more complex predictions later on):

    $f : R \to R$: $f(x; w) = w_0 + w_1 x$
    $f : R^d \to R$: $f(x; w) = w_0 + w_1 x_1 + \ldots + w_d x_d$

where $w = [w_0, w_1, \ldots, w_d]^T$ are parameters we need to set.

Linear regression: squared loss
We can measure the prediction loss in terms of squared error, $\mathrm{Loss}(y, \hat y) = (y - \hat y)^2$, so that the empirical loss on $n$ training samples becomes the mean squared error

    $J_n(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w))^2$

Linear regression: estimation
We have to minimize the empirical squared loss

    $J_n(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w))^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2$ (1-dim case)

By setting the derivatives with respect to $w_1$ and $w_0$ to zero, we get necessary conditions for the optimal parameter values:

    $\frac{\partial}{\partial w_1} J_n(w) = 0$, $\frac{\partial}{\partial w_0} J_n(w) = 0$
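As a sketch of the one-dimensional estimation problem, the snippet below fits $w_0$ and $w_1$ by the closed form that the two optimality conditions yield (derived on the next slides); the synthetic data are an assumption for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 50
    x = rng.uniform(-3, 3, size=n)                       # assumed inputs
    y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=n)    # assumed noisy linear outputs

    # Solving dJ/dw1 = 0 and dJ/dw0 = 0 gives the usual slope/intercept formulas.
    w1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    w0_hat = y.mean() - w1_hat * x.mean()

    J_n = np.mean((y - w0_hat - w1_hat * x) ** 2)         # empirical squared loss at the optimum
    print("w0_hat, w1_hat:", w0_hat, w1_hat)
    print("J_n(w_hat):", J_n)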

Optimality conditions: derivation

    $\frac{\partial}{\partial w_1} J_n(w) = \frac{\partial}{\partial w_1} \frac{1}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2$
    $= \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial w_1} (y_i - w_0 - w_1 x_i)^2$
    $= \frac{1}{n} \sum_{i=1}^{n} 2\,(y_i - w_0 - w_1 x_i)\, \frac{\partial}{\partial w_1} (y_i - w_0 - w_1 x_i)$
    $= \frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)(-x_i) = 0$

Similarly,

    $\frac{\partial}{\partial w_0} J_n(w) = \frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)(-1) = 0$

Interpretation
If we denote the prediction error as $\epsilon_i = (y_i - w_0 - w_1 x_i)$, then the optimality conditions can be written as

    $\sum_{i=1}^{n} \epsilon_i x_i = 0$, $\sum_{i=1}^{n} \epsilon_i = 0$

Thus the prediction error is uncorrelated with any linear function of the inputs, but not with a quadratic function of the inputs: $\sum_{i=1}^{n} \epsilon_i x_i^2 \ne 0$ (in general).

    (figure: fitted line with residuals)

Linear regression: matrix notation
We can express the solution a bit more generally by resorting to matrix notation

    $y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$, $X = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$, $w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$

so that

    $\frac{1}{n} \sum_{t=1}^{n} (y_t - w_0 - w_1 x_t)^2 = \frac{1}{n} \| y - Xw \|^2$
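A short numerical check of the interpretation above, on assumed synthetic data with a nonlinear target (so a line cannot fit exactly): at the least-squares optimum the residuals sum to zero and are orthogonal to the inputs, but not, in general, to a quadratic function of the inputs.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    x = rng.uniform(-2, 2, size=n)
    y = np.sin(x) + rng.normal(scale=0.1, size=n)     # assumed nonlinear target

    # Matrix notation: X has a column of ones for w0 and a column for x.
    X = np.column_stack([np.ones(n), x])
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares fit

    eps = y - X @ w_hat                               # prediction errors epsilon_i
    print("sum eps_i        :", eps.sum())            # ~ 0
    print("sum eps_i * x_i  :", (eps * x).sum())      # ~ 0
    print("sum eps_i * x_i^2:", (eps * x ** 2).sum()) # generally nonzero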

Linear regression: solution
By setting the derivatives of $\|y - Xw\|^2 / n$ to zero, we get the same optimality conditions as before, now expressed in matrix form:

    $\frac{\partial}{\partial w} \frac{1}{n} \|y - Xw\|^2 = \frac{\partial}{\partial w} \frac{1}{n} (y - Xw)^T (y - Xw) = -\frac{2}{n} X^T (y - Xw) = 0$

which gives

    $X^T (y - Xw) = 0 \;\Rightarrow\; X^T y - X^T X w = 0 \;\Rightarrow\; \hat w = (X^T X)^{-1} X^T y$

The solution is a linear function of the outputs $y$.

Linear regression: generalization
As the number of training examples increases, our solution gets better. We'd like to understand the error a bit better.

    (figure: fitted lines for increasing training-set sizes, and test mean squared error as a function of the number of training examples)

Linear regression: types of errors
Structural error measures the error introduced by the limited function class (infinite training data):

    $\min_{w_0, w_1} E_{(x,y) \sim P} (y - w_0 - w_1 x)^2 = E_{(x,y) \sim P} (y - w_0^* - w_1^* x)^2$

where $(w_0^*, w_1^*)$ are the optimal linear regression parameters.

Approximation error measures how close we can get to the optimal linear predictions with limited training data:

    $E_{(x,y) \sim P} (w_0^* + w_1^* x - \hat w_0 - \hat w_1 x)^2$

where $(\hat w_0, \hat w_1)$ are the parameter estimates based on a small training set (and therefore themselves random variables).
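The closed-form solution and the generalization behaviour can be sketched together; the population below (a truly linear relationship with additive noise, so the structural error is essentially the noise variance) is an assumption for illustration, and the test mean squared error should shrink toward it as $n$ grows.

    import numpy as np

    rng = np.random.default_rng(4)
    w_true = np.array([1.0, -2.0])                    # assumed population parameters [w0, w1]

    def sample(n):
        x = rng.uniform(-3, 3, size=n)
        X = np.column_stack([np.ones(n), x])
        y = X @ w_true + rng.normal(scale=1.0, size=n)
        return X, y

    X_test, y_test = sample(100_000)                  # large test set approximating P

    for n in [5, 10, 20, 50, 100, 500]:
        X_tr, y_tr = sample(n)
        w_hat = np.linalg.solve(X_tr.T @ X_tr, X_tr.T @ y_tr)   # (X^T X)^{-1} X^T y
        test_mse = np.mean((y_test - X_test @ w_hat) ** 2)
        print(f"n = {n:4d}  w_hat = {w_hat}  test MSE = {test_mse:.3f}")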

Linear regression: error decomposition
The expected error of our linear regression function decomposes into the sum of structural and approximation errors:

    $E_{(x,y) \sim P} (y - \hat w_0 - \hat w_1 x)^2 = E_{(x,y) \sim P} (y - w_0^* - w_1^* x)^2 + E_{(x,y) \sim P} (w_0^* + w_1^* x - \hat w_0 - \hat w_1 x)^2$

    (figure: mean squared error as a function of the number of training examples)

Error decomposition: derivation

    $E_{(x,y) \sim P} (y - \hat w_0 - \hat w_1 x)^2$
    $= E_{(x,y) \sim P} \big( (y - w_0^* - w_1^* x) + (w_0^* + w_1^* x - \hat w_0 - \hat w_1 x) \big)^2$
    $= E_{(x,y) \sim P} (y - w_0^* - w_1^* x)^2$
    $\;\; + 2\, E_{(x,y) \sim P} (y - w_0^* - w_1^* x)(w_0^* + w_1^* x - \hat w_0 - \hat w_1 x)$
    $\;\; + E_{(x,y) \sim P} (w_0^* + w_1^* x - \hat w_0 - \hat w_1 x)^2$

The second (cross) term has to be zero since the error $(y - w_0^* - w_1^* x)$ of the best linear predictor is necessarily uncorrelated with any linear function of the input, including $(w_0^* + w_1^* x - \hat w_0 - \hat w_1 x)$.
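Finally, the decomposition can be verified numerically under an assumed nonlinear population (so the structural error is nonzero); the best linear predictor is approximated on a very large sample standing in for infinite data, and all three expectations are estimated on a fresh large evaluation sample.

    import numpy as np

    rng = np.random.default_rng(5)

    def sample(n):
        x = rng.uniform(-3, 3, size=n)
        y = np.sin(x) + rng.normal(scale=0.2, size=n)   # assumed population P(x, y)
        return x, y

    def design(x):
        return np.column_stack([np.ones(len(x)), x])

    # Best linear predictor (w0*, w1*), approximated on a very large sample ("infinite data").
    x_big, y_big = sample(500_000)
    w_star = np.linalg.lstsq(design(x_big), y_big, rcond=None)[0]

    # Parameter estimates (w0_hat, w1_hat) from one small training set.
    x_tr, y_tr = sample(15)
    w_hat = np.linalg.lstsq(design(x_tr), y_tr, rcond=None)[0]

    # Estimate the three expectations on a fresh large sample.
    x_ev, y_ev = sample(200_000)
    X_ev = design(x_ev)
    total = np.mean((y_ev - X_ev @ w_hat) ** 2)
    structural = np.mean((y_ev - X_ev @ w_star) ** 2)
    approx = np.mean((X_ev @ w_star - X_ev @ w_hat) ** 2)
    print(f"total ~ {total:.4f}")
    print(f"structural + approximation ~ {structural + approx:.4f}")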