6.867 Machine learning: lecture 3
Tommi S. Jaakkola, MIT CSAIL, tommi@csail.mit.edu

Topics
Beyond linear regression models: additive regression models, examples; generalization and cross-validation; population minimizer.
Statistical regression models: model formulation, motivation; maximum likelihood estimation.

Linear regression
Linear regression functions,
f(x; w) = w_0 + w_1 x, or
f(x; w) = w_0 + w_1 x_1 + ... + w_d x_d,
combined with the squared loss, are convenient because the least squares parameter estimates have the closed form
ŵ = (X^T X)^{-1} X^T y,
where, for example, y = [y_1, ..., y_n]^T.
The resulting prediction errors ε_i = y_i − f(x_i; ŵ) are uncorrelated with any linear function of the inputs.
We can easily extend these to non-linear functions of the inputs while still keeping them linear in the parameters.
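As a concrete check of the closed-form solution above (not part of the lecture; the synthetic data and variable names are my own), here is a short NumPy sketch that builds the design matrix with a leading column of ones, solves the normal equations, and verifies that the residuals are uncorrelated with the inputs:

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X_raw = rng.normal(size=(n, d))              # inputs x_1, ..., x_n, each in R^d
w_true = np.array([1.0, -2.0, 0.5, 3.0])     # [w_0, w_1, ..., w_d] used only to simulate data
y = w_true[0] + X_raw @ w_true[1:] + 0.1 * rng.normal(size=n)

X = np.hstack([np.ones((n, 1)), X_raw])      # design matrix with a column of ones for w_0
w_hat = np.linalg.solve(X.T @ X, X.T @ y)    # least squares estimate (X^T X)^{-1} X^T y

residuals = y - X @ w_hat                    # ε_i = y_i − f(x_i; ŵ)
print(w_hat)                                 # close to w_true
print(X.T @ residuals)                       # ≈ 0: errors uncorrelated with any linear function of the inputs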

Beyond linear regression
Example extension: m-th order polynomial regression, where f(x; w) is given by
f(x; w) = w_0 + w_1 x + ... + w_{m−1} x^{m−1} + w_m x^m.

Polynomial regression
Polynomial regression is linear in the parameters, non-linear in the inputs. The solution is as before,
ŵ = (X^T X)^{-1} X^T y,
where
ŵ = [ŵ_0, ŵ_1, ..., ŵ_m]^T,
X = [ 1  x_1  ...  x_1^m
      1  x_2  ...  x_2^m
      ...
      1  x_n  ...  x_n^m ].

[Figure: polynomial fits of degree 1, 3, 5, and 7 to the same training data.]

Complexity and overfitting
With limited training examples our polynomial regression model may achieve zero training error but nevertheless have a large test (generalization) error:
training error: (1/n) Σ_{t=1}^n (y_t − f(x_t; ŵ))²,
test error: E_{(x,y)∼P} (y − f(x; ŵ))².
We suffer from over-fitting when the training error no longer bears any relation to the generalization error.

Avoiding over-fitting: cross-validation
Cross-validation allows us to estimate the generalization error based on training examples alone. Leave-one-out cross-validation treats each training example in turn as a test example:
CV = (1/n) Σ_{i=1}^n (y_i − f(x_i; ŵ^{−i}))²,
where ŵ^{−i} are the least squares estimates of the parameters without the i-th training example (a sketch of this procedure follows the next figure).

Polynomial regression: example cont'd
[Figure: the polynomial fits of degree 1, 3, 5, and 7 again, each annotated with its leave-one-out CV estimate.]
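Here is a minimal sketch of the leave-one-out procedure for polynomial regression (my own illustration, assuming 1-D inputs, NumPy, and synthetic data; the function names are not from the lecture):

import numpy as np

def poly_design(x, m):
    # columns 1, x, x^2, ..., x^m
    return np.vander(x, N=m + 1, increasing=True)

def loocv(x, y, m):
    # CV = (1/n) * sum_i (y_i - f(x_i; w^{-i}))^2
    n = len(x)
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        w = np.linalg.lstsq(poly_design(x[keep], m), y[keep], rcond=None)[0]  # fit without example i
        pred = poly_design(x[i:i + 1], m) @ w
        errs.append((y[i] - pred[0]) ** 2)
    return float(np.mean(errs))

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, size=12))
y = np.sin(x) + 0.2 * rng.normal(size=12)     # small, noisy 1-D training set
for m in (1, 3, 5, 7):
    print(m, loocv(x, y, m))                  # training error alone would keep decreasing with m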

Additive models
More generally, predictions can be based on a linear combination of a set of basis functions (or features) {φ_1(x), ..., φ_m(x)}, where each φ_i(x): R^d → R, and
f(x; w) = w_0 + w_1 φ_1(x) + ... + w_m φ_m(x).
Examples:
If φ_i(x) = x^i, i = 1, ..., m, then
f(x; w) = w_0 + w_1 x + ... + w_{m−1} x^{m−1} + w_m x^m.
If m = d and φ_i(x) = x_i, i = 1, ..., d, then
f(x; w) = w_0 + w_1 x_1 + ... + w_d x_d.

The basis functions can capture various (e.g., qualitative) properties of the inputs. For example, we can try to rate companies based on text descriptions:
x = text document (collection of words),
φ_i(x) = 1 if word i appears in the document, 0 otherwise,
f(x; w) = w_0 + Σ_{i ∈ words} w_i φ_i(x).

We can also make predictions by gauging the similarity of examples to prototypes. For example, our additive regression function could be
f(x; w) = w_0 + w_1 φ_1(x) + ... + w_m φ_m(x),
where the basis functions are radial basis functions
φ_k(x) = exp{ −(1/(2σ²)) ||x − x_k||² }
measuring the similarity to the prototype x_k; σ² controls how quickly the basis function vanishes as a function of the distance to the prototype. (The training examples themselves could serve as prototypes; a short sketch of this appears below.)

We can view the additive models graphically in terms of simple units and weights: the basis function outputs φ_1(x), ..., φ_m(x) are combined through the weights w_1, ..., w_m, together with the offset w_0, to produce f(x; w). In neural networks the basis functions themselves have adjustable parameters (cf. prototypes).

Squared loss and population minimizer
What do we get if we have unlimited training examples (the whole population) and no constraints on the regression function? We minimize
E_{(x,y)∼P} (y − f(x))²
with respect to an unconstrained function f.

To minimize
E_{(x,y)∼P} (y − f(x))² = E_{x∼P_x} [ E_{y|x} (y − f(x))² ]
we can focus on each x separately, since f(x) can be chosen independently for each different x. For a particular x we can set the derivative to zero:
∂/∂f(x) E_{y|x} (y − f(x))² = −2 E_{y|x} (y − f(x)) = −2 (E{y|x} − f(x)) = 0.
Thus the function we are trying to approximate is the conditional expectation
f*(x) = E{y|x}.
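Returning to the radial-basis-function model above, here is a small sketch (my own, assuming NumPy; the training points are used as prototypes, as the slide suggests) showing that the model is still fit by ordinary least squares because it remains linear in w:

import numpy as np

def rbf_features(x, prototypes, sigma2):
    # phi_k(x) = exp{ -||x - x_k||^2 / (2 sigma^2) }, one column per prototype x_k
    d2 = ((x[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma2))

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=40)

Phi = rbf_features(x, prototypes=x, sigma2=0.5)     # training examples as prototypes
Phi = np.hstack([np.ones((len(x), 1)), Phi])        # offset term w_0
w_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]      # linear in the parameters, so least squares applies

x_new = np.array([[0.7]])
phi_new = np.hstack([[1.0], rbf_features(x_new, x, 0.5)[0]])
print(phi_new @ w_hat)                              # prediction f(x_new; ŵ)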

Statistical view of linear regression
In a statistical regression model we model both the function and the noise:
observed output = function + noise,
y = f(x; w) + ε,
where, e.g., ε ∼ N(0, σ²). Whatever we cannot capture with our chosen family of functions will be interpreted as noise.

f(x; w) is trying to capture the mean of the observations y given the input x:
E{y | x} = E{ f(x; w) + ε | x } = f(x; w),
where E{y | x} is the conditional expectation of y given x, evaluated according to the model (not according to the underlying distribution P).

According to our statistical model y = f(x; w) + ε, ε ∼ N(0, σ²), the outputs y given x are normally distributed with mean f(x; w) and variance σ²:
p(y | x, w, σ²) = (1/√(2πσ²)) exp{ −(1/(2σ²)) (y − f(x; w))² }.
(We model the uncertainty in the predictions, not just the mean.) Loss function? Estimation?

Maximum likelihood estimation
Given observations D = {(x_1, y_1), ..., (x_n, y_n)}, we find the parameters w that maximize the (conditional) likelihood of the outputs,
L(D; w, σ²) = ∏_{i=1}^n p(y_i | x_i, w, σ²).
Example: linear function,
p(y | x, w, σ²) = (1/√(2πσ²)) exp{ −(1/(2σ²)) (y − w_0 − w_1 x)² }.
[Figure: the example linear fit plotted against the data.] (Why is this a bad fit according to the likelihood criterion?)

Maximum likelihood estimation cont'd
Likelihood of the observed outputs:
L(D; w, σ²) = ∏_{i=1}^n P(y_i | x_i, w, σ²).
It is often easier (but equivalent) to try to maximize the log-likelihood:
l(D; w, σ²) = log L(D; w, σ²) = Σ_{i=1}^n log P(y_i | x_i, w, σ²)
= Σ_{i=1}^n ( −(1/(2σ²)) (y_i − f(x_i; w))² − log √(2πσ²) )
= −(1/(2σ²)) Σ_{i=1}^n (y_i − f(x_i; w))² + ...
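To connect the likelihood view with the earlier least squares criterion, here is a small sketch (illustrative only, assuming NumPy and a linear f(x; w) = w_0 + w_1 x on synthetic data): for a fixed σ² the log-likelihood differs from the negative sum of squared errors only by a constant, so the least squares fit is also the maximum likelihood fit for w.

import numpy as np

def log_likelihood(w, sigma2, x, y):
    # l(D; w, sigma^2) = sum_i [ -(y_i - f(x_i; w))^2 / (2 sigma^2) - log sqrt(2 pi sigma^2) ]
    resid = y - (w[0] + w[1] * x)
    return np.sum(-resid ** 2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2))

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=30)
y = 2.0 + 1.5 * x + 0.3 * rng.normal(size=30)

X = np.column_stack([np.ones_like(x), x])
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]              # least squares estimate of [w_0, w_1]

print(log_likelihood(w_ls, 0.09, x, y))                  # higher log-likelihood ...
print(log_likelihood(np.array([0.0, 0.0]), 0.09, x, y))  # ... than any other w, e.g. the zero fit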

Maximum likelihood estimation cont'd
Maximizing the log-likelihood is equivalent to minimizing the empirical loss when the loss is defined according to
Loss(y_i, f(x_i; w)) = − log P(y_i | x_i, w, σ²).
Loss defined as the negative log-probability is known as the log-loss.

The log-likelihood of the observations,
log L(D; w, σ²) = Σ_{i=1}^n log P(y_i | x_i, w, σ²),
is a generic fitting criterion and can be used to estimate the noise variance σ² as well. Let ŵ be the maximum likelihood (here least squares) setting of the parameters. What is the maximum likelihood estimate of σ², obtained by solving
∂/∂σ² log L(D; ŵ, σ²) = 0?

The maximum likelihood estimate of the noise variance σ² is
σ̂² = (1/n) Σ_{i=1}^n (y_i − f(x_i; ŵ))²,
i.e., the mean squared prediction error.
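As a quick numerical check of this last result (again my own sketch on synthetic data, assuming NumPy and a linear f), the mean squared prediction error of the least squares fit gives the largest log-likelihood among nearby values of σ²:

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=100)
y = 2.0 + 1.5 * x + 0.3 * rng.normal(size=100)

X = np.column_stack([np.ones_like(x), x])
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ w_hat

sigma2_hat = np.mean(resid ** 2)                    # claimed ML estimate: mean squared prediction error

def log_likelihood(sigma2):
    return np.sum(-resid ** 2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2))

for s2 in (0.5 * sigma2_hat, sigma2_hat, 2.0 * sigma2_hat):
    print(round(s2, 4), log_likelihood(s2))         # the middle value attains the maximum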