Regression and generalization

Regression and generalization. CE-717: Machine Learning, Sharif University of Technology. M. Soleymani, Fall 2016.

Curve fitting: probabilistic perspective. Describing uncertainty over the value of the target variable as a probability distribution. [Figure: a fitted curve f(x; w) with the conditional distribution p(y | x_0, w, σ) drawn at an input x_0.]

The learning diagram including a noisy target. Training examples (x^(1), y^(1)), ..., (x^(N), y^(N)) are drawn from the joint distribution P(x, y) = P(x) P(y | x), where P(x) is the distribution on the features and P(y | x) is the target distribution; the learner picks a hypothesis h: X → Y so that h(x) approximates the target f: X → Y. [Y.S. Abu-Mostafa, et al.]

Curve fitting: probabilistic perspective (Example). Special case: observed output = function + noise, y = f(x; w) + ε, e.g., ε ~ N(0, σ²). Noise: whatever we cannot capture with our chosen family of functions.
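To make this noise model concrete, here is a minimal Python sketch (with made-up values for w and σ, not taken from the slides) that generates observations y = f(x; w) + ε for a univariate linear f:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true parameters and noise level (illustrative only)
w_true = np.array([1.0, 2.0])            # f(x; w) = w0 + w1 * x
sigma = 0.3                              # standard deviation of the Gaussian noise

n = 50
x = rng.uniform(-1.0, 1.0, size=n)
f = w_true[0] + w_true[1] * x            # deterministic part f(x; w)
y = f + rng.normal(0.0, sigma, size=n)   # observed output = function + noise
```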

Curve fitting: probabilistic perspective (Example). The best regression f(x; w) ≈ E[y | x] = E[f(x; w) + ε] = f(x; w) (with ε ~ N(0, σ²)) is trying to capture the mean of the observations y given the input x. E[y | x]: conditional expectation of y given x, evaluated according to the model (not according to the underlying distribution P).
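A quick numerical check of this statement, under the same toy linear model and Gaussian noise assumed above: averaging many noisy observations at a fixed input x_0 recovers f(x_0; w).

```python
import numpy as np

rng = np.random.default_rng(1)

w = np.array([1.0, 2.0])                 # hypothetical parameters
sigma = 0.3
x0 = 0.5
f_x0 = w[0] + w[1] * x0                  # f(x0; w) = E[y | x0] under the model

# Sample many noisy observations at the same input x0 and average them
y_samples = f_x0 + rng.normal(0.0, sigma, size=100_000)
print(y_samples.mean(), "is close to", f_x0)
```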

Curve fitting using probabilistic estimation: Maximum Likelihood (ML) estimation, Maximum A Posteriori (MAP) estimation, Bayesian approach.

Maximum likelihood estimation. Given observations D = {(x^(i), y^(i))}_{i=1}^n, find the parameters that maximize the (conditional) likelihood of the outputs:

L(D; θ) = p(y | X, θ) = ∏_{i=1}^n p(y^(i) | x^(i), θ)

where y = [y^(1), ..., y^(n)]^T and X is the n × (d+1) design matrix whose i-th row is [1, x_1^(i), ..., x_d^(i)].

Maximum likelihood estimation (Cont'd). With y = f(x; w) + ε, ε ~ N(0, σ²), y given x is normally distributed with mean f(x; w) and variance σ²; we model the uncertainty in the predictions, not just the mean:

p(y | x, w, σ²) = (1 / (√(2π) σ)) exp{ −(1/(2σ²)) (y − f(x; w))² }

Maximum likelihood estimation (Cont'd). Example: univariate linear function,

p(y | x, w, σ²) = (1 / (√(2π) σ)) exp{ −(1/(2σ²)) (y − w_0 − w_1 x)² }

Why is this line a bad fit according to the likelihood criterion? p(y | x, w, σ²) for most of the points will be near zero (as they are far from this line).
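A small sketch of this density for the univariate linear case (hypothetical numbers): a point close to the line gets an appreciable likelihood, while a point far from it gets a likelihood near zero.

```python
import numpy as np

def gaussian_likelihood(y, x, w, sigma2):
    """p(y | x, w, sigma^2) for the linear model f(x; w) = w0 + w1 * x."""
    mean = w[0] + w[1] * x
    return np.exp(-(y - mean) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

w = np.array([1.0, 2.0])       # hypothetical line y = 1 + 2x
sigma2 = 0.3 ** 2

print(gaussian_likelihood(2.0, 0.5, w, sigma2))   # y close to f(0.5; w) = 2.0: large density
print(gaussian_likelihood(5.0, 0.5, w, sigma2))   # y far from the line: density near zero
```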

Maximum likelihood estimation (Cont'd). Maximize the likelihood of the outputs (i.i.d.):

L(D; w, σ²) = ∏_{i=1}^n p(y^(i) | x^(i), w, σ²)

ŵ = argmax_w L(D; w, σ²) = argmax_w ∏_{i=1}^n p(y^(i) | x^(i), w, σ²)

Maximum likelihood estimation (Cont'd). It is often easier (but equivalent) to maximize the log-likelihood:

ln p(y | X, w, σ²) = ∑_{i=1}^n ln N(y^(i) | f(x^(i); w), σ²)
= −(n/2) ln σ² − (n/2) ln 2π − (1/(2σ²)) ∑_{i=1}^n (y^(i) − f(x^(i); w))²

ŵ = argmax_w ∑_{i=1}^n ln p(y^(i) | x^(i), w, σ²); the last term is (up to sign and scale) the sum of squares error.
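The sketch below (toy data, hypothetical parameter values) evaluates this log-likelihood directly and shows that, for a fixed σ², ranking parameter settings by log-likelihood matches ranking them by SSE.

```python
import numpy as np

def log_likelihood(w, sigma2, x, y):
    """Gaussian log-likelihood for the linear model f(x; w) = w0 + w1 * x."""
    resid = y - (w[0] + w[1] * x)
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum(resid ** 2) / (2 * sigma2)

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=30)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=30)   # toy data

# Higher log-likelihood <=> lower SSE (sigma^2 held fixed)
for w in [np.array([1.0, 2.0]), np.array([0.0, 0.0])]:
    sse = np.sum((y - (w[0] + w[1] * x)) ** 2)
    print(w, "log-likelihood:", log_likelihood(w, 0.09, x, y), "SSE:", sse)
```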

Maximum likelihood estimation (Cont'd). Maximizing the log-likelihood (when we assume y = f(x; w) + ε, ε ~ N(0, σ²)) is equivalent to minimizing the SSE. Let ŵ be the maximum likelihood (here least squares) setting of the parameters. What is the maximum likelihood estimate of σ²? Setting ∂ log L(D; ŵ, σ²) / ∂σ² = 0 gives

σ̂² = (1/n) ∑_{i=1}^n (y^(i) − f(x^(i); ŵ))²   (mean squared prediction error)
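As an illustration (toy data, assumed linear model), the ML estimate of w is the least-squares solution and the ML estimate of σ² is the mean squared residual of that fit:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data from an assumed linear model y = w0 + w1*x + eps (illustrative values)
n = 100
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=n)

# Design matrix with a column of ones, as in the slides' X
X = np.column_stack([np.ones(n), x])

# ML (= least squares) estimate of w
w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)

# ML estimate of sigma^2: mean squared prediction error of the fitted model
sigma2_ml = np.mean((y - X @ w_ml) ** 2)
print(w_ml, sigma2_ml)
```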

Maximum likelihood estimation (Cont'd). Generally, maximizing the log-likelihood is equivalent to minimizing the empirical loss when the loss is defined as

Loss(y^(i), f(x^(i); w)) = −ln p(y^(i) | x^(i), w, θ)

Loss: negative log-probability. More general distributions for p(y | x) can be considered.

Maximum A Posteriori (MAP) estimation. MAP: given observations D, find the parameters that maximize the probability of the parameters after observing the data (the posterior probability):

θ_MAP = argmax_θ p(θ | D)

Since p(θ | D) ∝ p(D | θ) p(θ),

θ_MAP = argmax_θ p(D | θ) p(θ)

Maximum A Posteriori (MAP) estimation. Given observations D = {(x^(i), y^(i))}_{i=1}^n, maximize over w:

p(w | X, y) ∝ p(y | X, w) p(w)

with a Gaussian prior p(w) = N(0, α²I) = (1 / (√(2π) α))^{d+1} exp{ −(1/(2α²)) wᵀw }

Maximum A Posteriori (MAP) estimation. Given observations D = {(x^(i), y^(i))}_{i=1}^n:

max_w ln [ p(y | X, w, σ²) p(w) ]  ⟺  min_w (1/σ²) ∑_{i=1}^n (y^(i) − f(x^(i); w))² + (1/α²) wᵀw

Equivalent to regularized SSE with λ = σ²/α².
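For a linear f, this MAP estimate has the familiar ridge-regression closed form (XᵀX + λI)⁻¹Xᵀy. A minimal sketch with made-up values for σ² and α²:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data (illustrative true parameters and noise level)
n = 100
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=n)
X = np.column_stack([np.ones(n), x])

sigma2 = 0.3 ** 2        # assumed noise variance
alpha2 = 1.0             # assumed prior variance of the weights
lam = sigma2 / alpha2    # regularization strength lambda = sigma^2 / alpha^2

# MAP estimate for a linear model with a Gaussian prior = ridge regression
d = X.shape[1]
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_map)
```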

Bayesian approach. Given observations D = {(x^(i), y^(i))}_{i=1}^n, instead of committing to a single parameter estimate, average the predictions over the posterior distribution of the parameters:

p(y | x, D) = ∫ p(y | w, x) p(w | D) dw

Example of a prior distribution: p(w) = N(0, α²I).

Bayesian approach. Given observations D = {(x^(i), y^(i))}_{i=1}^N, for a linear model:

Likelihood: p(D | w) = L(D; w, θ) = ∏_{i=1}^N p(y^(i) | wᵀx^(i), θ), with p(y^(i) | f(x^(i); w), θ) = N(y^(i) | wᵀx^(i), σ²)

Prior: p(w) = N(0, α²I)

Posterior: p(w | D) ∝ p(D | w) p(w)

Predictive distribution: p(y | x, D) = ∫ p(y | w, x) p(w | D) dw = N(y | m_Nᵀ x, σ_N²(x))
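For this linear-Gaussian model the posterior and the predictive distribution are available in closed form (the standard result, e.g., Bishop): p(w | D) = N(m_N, S_N) with S_N⁻¹ = (1/α²)I + (1/σ²)XᵀX, m_N = (1/σ²) S_N Xᵀy, and σ_N²(x) = σ² + xᵀ S_N x. A minimal sketch with toy data and assumed variances:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data (illustrative values); linear model with a bias feature
n = 30
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=n)
X = np.column_stack([np.ones(n), x])

sigma2, alpha2 = 0.3 ** 2, 1.0          # assumed noise and prior variances
d = X.shape[1]

# Posterior p(w | D) = N(m_N, S_N)
S_N = np.linalg.inv(np.eye(d) / alpha2 + X.T @ X / sigma2)
m_N = S_N @ X.T @ y / sigma2

# Predictive distribution at a new input x*: N(m_N^T x*, sigma^2 + x*^T S_N x*)
x_star = np.array([1.0, 0.5])           # features of a new point: (bias, x)
pred_mean = m_N @ x_star
pred_var = sigma2 + x_star @ S_N @ x_star
print(pred_mean, pred_var)
```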

Predictive distribution: example. Example: sinusoidal data, 9 Gaussian basis functions. The red curve shows the mean of the predictive distribution; the pink region spans one standard deviation on either side of the mean. [Bishop]

Predictive distribution: example. Functions whose parameters are sampled from p(w | D). [Bishop]
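To reproduce this kind of plot for the simple linear-Gaussian sketch above, one can draw weight vectors from the posterior N(m_N, S_N) and evaluate the corresponding functions on a grid (toy data and assumed variances again):

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy data and posterior as in the previous sketch (illustrative values)
n = 30
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=n)
X = np.column_stack([np.ones(n), x])
sigma2, alpha2 = 0.3 ** 2, 1.0

S_N = np.linalg.inv(np.eye(2) / alpha2 + X.T @ X / sigma2)
m_N = S_N @ X.T @ y / sigma2

# Draw several weight vectors from p(w | D) and evaluate the sampled functions
w_samples = rng.multivariate_normal(m_N, S_N, size=5)
x_grid = np.linspace(-1, 1, 100)
Phi = np.column_stack([np.ones_like(x_grid), x_grid])
f_samples = Phi @ w_samples.T       # each column is one sampled function f(x; w)
print(f_samples.shape)              # (100, 5)
```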