Linear Regression. Introduction to Machine Learning. Matt Gormley, Lecture 5, September 14, 2016. Readings: Bishop, 3.1

School of Computer Science, 10-601 Introduction to Machine Learning. Linear Regression. Readings: Bishop, 3.1. Matt Gormley, Lecture 5, September 14, 2016.

Homework 2: Reminders. Extension: due Friday (9/16) at 5:30pm. Recitation schedule posted on course website.

Outline: Linear Regression: Simple example; Model; Learning: Gradient Descent, SGD, Closed Form; Advanced Topics: Geometric and Probabilistic Interpretation of LMS, L2 Regularization, L1 Regularization, Features.

Outline: Linear Regression: Simple example; Model; Learning (aka. Least Squares): Gradient Descent, SGD (aka. Least Mean Squares (LMS)), Closed Form (aka. Normal Equations); Advanced Topics: Geometric and Probabilistic Interpretation of LMS, L2 Regularization (aka. Ridge Regression), L1 Regularization (aka. LASSO), Features (aka. non-linear basis functions).

Linear regression. Our goal is to estimate $w$ from training data of $\langle x_i, y_i \rangle$ pairs, where $Y = wX + \epsilon$. Optimization goal: minimize squared error (least squares): $\arg\min_w \sum_i (y_i - w x_i)^2$. Why least squares? - minimizes squared distance between measurements and the predicted line (see HW) - has a nice probabilistic interpretation - the math is pretty.

Solving linear regression. To optimize in closed form, we just take the derivative w.r.t. $w$ and set it to zero: $\frac{\partial}{\partial w} \sum_i (y_i - w x_i)^2 = -2 \sum_i x_i (y_i - w x_i) = 0 \;\Rightarrow\; \sum_i x_i y_i = w \sum_i x_i^2 \;\Rightarrow\; w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$.
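A minimal numerical sketch of this one-dimensional closed-form solution, assuming NumPy; the generating slope, noise level, and sample size are illustrative choices, not values from the lecture:

```python
import numpy as np

# Synthetic 1D data: y = 2x + Gaussian noise (illustrative values).
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=100)
y = 2.0 * x + rng.normal(0.0, 1.0, size=100)

# Setting the derivative of sum_i (y_i - w x_i)^2 to zero gives
# w = sum_i x_i y_i / sum_i x_i^2.
w_hat = np.sum(x * y) / np.sum(x * x)
print(w_hat)  # should be close to 2.0
```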

Linear regression. Given an input $x$ we would like to compute an output $y$. In linear regression we assume that $y$ (what we are trying to predict) and $x$ (the observed values) are related by the following equation: $y = wx + \epsilon$, where $w$ is a parameter and $\epsilon$ represents measurement or other noise.

Regression example. Generated: w=2; Recovered: w=2.03; Noise: std=1.

Regression example. Generated: w=2; Recovered: w=2.05; Noise: std=2.

Regression example. Generated: w=2; Recovered: w=2.08; Noise: std=4.

Bias term. So far we assumed that the line passes through the origin. What if the line does not? No problem, simply change the model to $y = w_0 + w_1 x + \epsilon$. We can use least squares to determine $w_0, w_1$: $w_0 = \frac{1}{n}\sum_i (y_i - w_1 x_i)$ and $w_1 = \frac{\sum_i x_i (y_i - w_0)}{\sum_i x_i^2}$.
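Substituting the expression for $w_0$ into the equation for $w_1$ yields the familiar centered estimates, which the sketch below computes (NumPy assumed; the data-generating values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=200)

# Substituting w0 = mean(y) - w1 * mean(x) into the equation for w1
# gives the usual centered least-squares estimates.
x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar
print(w0, w1)  # should be close to (3.0, 2.0)
```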

Linear Regression. Data: Inputs are continuous vectors of length K; outputs are continuous scalars: $\mathcal{D} = \{ (x^{(i)}, y^{(i)}) \}_{i=1}^N$ where $x \in \mathbb{R}^K$ and $y \in \mathbb{R}$. What are some example problems of this form?

Linear Regression. Data: Inputs are continuous vectors of length K; outputs are continuous scalars: $\mathcal{D} = \{ (x^{(i)}, y^{(i)}) \}_{i=1}^N$ where $x \in \mathbb{R}^K$ and $y \in \mathbb{R}$. Prediction: Output is a linear function of the inputs: $\hat{y} = h_\theta(x) = \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_K x_K = \theta^T x$ (we assume $x_1$ is 1). Learning: finds the parameters that minimize some objective function: $\theta^* = \arg\min_\theta J(\theta)$.

Least Squares. Learning: finds the parameters that minimize some objective function, $\theta^* = \arg\min_\theta J(\theta)$. We minimize the sum of the squares: $J(\theta) = \frac{1}{2} \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2$. Why? 1. Reduces distance between true measurements and the predicted hyperplane (a line in 1D). 2. Has a nice probabilistic interpretation.

Least Squares. Learning: finds the parameters that minimize some objective function, $\theta^* = \arg\min_\theta J(\theta)$. We minimize the sum of the squares: $J(\theta) = \frac{1}{2} \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2$. This is a very general optimization setup; we could solve it in lots of ways. Today, we'll consider three ways. Why least squares? 1. Reduces distance between true measurements and the predicted hyperplane (a line in 1D). 2. Has a nice probabilistic interpretation.

Least Squares. Learning: Three approaches to solving $\theta^* = \arg\min_\theta J(\theta)$. Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient). Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient). Approach 3: Closed Form (set derivatives equal to zero and solve for parameters).

Gradient Descent. Algorithm 1: Gradient Descent
1: procedure GD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     θ ← θ − λ ∇_θ J(θ)
5:   return θ
In order to apply GD to Linear Regression, all we need is the gradient of the objective function (i.e. the vector of partial derivatives): $\nabla_\theta J(\theta) = \left[ \frac{d}{d\theta_1} J(\theta), \frac{d}{d\theta_2} J(\theta), \ldots, \frac{d}{d\theta_K} J(\theta) \right]^T$.

Gradient Descent. Algorithm 1: Gradient Descent
1: procedure GD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     θ ← θ − λ ∇_θ J(θ)
5:   return θ
There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient, $\|\nabla_\theta J(\theta)\|_2$, is below some small tolerance. Alternatively, we could check that the reduction in the objective function from one iteration to the next is small.
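A minimal sketch (not from the lecture) of batch gradient descent for $J(\theta) = \frac{1}{2}\sum_i (\theta^T x^{(i)} - y^{(i)})^2$, using the gradient-norm stopping rule described above; the learning rate, tolerance, and synthetic data are illustrative choices:

```python
import numpy as np

def gradient_descent(X, y, lr=0.005, tol=1e-6, max_iter=10000):
    """Batch GD for J(theta) = 0.5 * sum_i (theta^T x_i - y_i)^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (X @ theta - y)       # full gradient over all N examples
        if np.linalg.norm(grad) < tol:     # convergence test from the slide
            break
        theta -= lr * grad                 # step opposite the gradient
    return theta

# Illustrative data with a leading column of ones (so theta[0] is the bias).
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(gradient_descent(X, y))
```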

Stochastic Gradient Descent (SGD). Algorithm 2: Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, ..., N}) do
5:       θ ← θ − λ ∇_θ J^(i)(θ)
6:   return θ
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$ where $J^{(i)}(\theta) = \frac{1}{2} (\theta^T x^{(i)} - y^{(i)})^2$.

Stochastic Gradient Descent (SGD). Algorithm 2: Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, ..., N}) do
5:       for k ∈ {1, 2, ..., K} do
6:         θ_k ← θ_k − λ (d/dθ_k) J^(i)(θ)
7:   return θ
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$ where $J^{(i)}(\theta) = \frac{1}{2} (\theta^T x^{(i)} - y^{(i)})^2$.

Stochastic Gradient Descent (SGD). Algorithm 2: Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, ..., N}) do
5:       for k ∈ {1, 2, ..., K} do
6:         θ_k ← θ_k − λ (d/dθ_k) J^(i)(θ)
7:   return θ
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$ where $J^{(i)}(\theta) = \frac{1}{2} (\theta^T x^{(i)} - y^{(i)})^2$. Let's start by calculating this partial derivative for the Linear Regression objective function.

Partial Derivatives for Linear Regression. Let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$ where $J^{(i)}(\theta) = \frac{1}{2}(\theta^T x^{(i)} - y^{(i)})^2$. Then
$\frac{d}{d\theta_k} J^{(i)}(\theta) = \frac{d}{d\theta_k} \frac{1}{2}(\theta^T x^{(i)} - y^{(i)})^2 = \frac{1}{2} \frac{d}{d\theta_k} (\theta^T x^{(i)} - y^{(i)})^2 = (\theta^T x^{(i)} - y^{(i)}) \frac{d}{d\theta_k} (\theta^T x^{(i)} - y^{(i)}) = (\theta^T x^{(i)} - y^{(i)}) \frac{d}{d\theta_k} \left( \sum_{k'=1}^K \theta_{k'} x^{(i)}_{k'} - y^{(i)} \right) = (\theta^T x^{(i)} - y^{(i)}) x^{(i)}_k$.

Partial Derivatives for Linear Regression. Let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$ where $J^{(i)}(\theta) = \frac{1}{2}(\theta^T x^{(i)} - y^{(i)})^2$. Then $\frac{d}{d\theta_k} J^{(i)}(\theta) = (\theta^T x^{(i)} - y^{(i)}) x^{(i)}_k$ (used by SGD, aka. LMS), and $\frac{d}{d\theta_k} J(\theta) = \sum_{i=1}^N \frac{d}{d\theta_k} J^{(i)}(\theta) = \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)}) x^{(i)}_k$ (used by Gradient Descent).

Least Mean Squares (LMS). Algorithm 3: Least Mean Squares (LMS)
1: procedure LMS(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, ..., N}) do
5:       for k ∈ {1, 2, ..., K} do
6:         θ_k ← θ_k − λ (θᵀx^(i) − y^(i)) x^(i)_k
7:   return θ
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.
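A minimal NumPy sketch of the LMS update, assuming the same design-matrix convention as in the earlier gradient descent sketch. It follows Algorithm 3, except that it updates all K coordinates at once (a vectorized equivalent of the inner loop over k) and runs a fixed number of epochs in place of the convergence test; the step size is an illustrative choice:

```python
import numpy as np

def lms(X, y, lr=0.01, epochs=50, seed=0):
    """SGD for least squares (the LMS update)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):       # shuffle({1, ..., N})
            residual = X[i] @ theta - y[i]      # theta^T x^(i) - y^(i)
            theta -= lr * residual * X[i]       # per-example gradient step
    return theta

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(lms(X, y))
```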

Optimization for Linear Reg. vs. Logistic Reg. Can use the same tricks for both: regularization; tuning the learning rate on development data; shuffling examples that don't fit in memory (out-of-core) and streaming over them; local hill climbing yields the global optimum (both problems are convex); etc. But Logistic Regression does not have a closed-form solution for the MLE parameters. What about Linear Regression?

Least Squares. Learning: Three approaches to solving $\theta^* = \arg\min_\theta J(\theta)$. Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient). Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient). Approach 3: Closed Form (set derivatives equal to zero and solve for parameters).

The normal equations. Write the cost function in matrix form:
$J(\theta) = \frac{1}{2} \sum_{i=1}^n (x_i^T \theta - y_i)^2 = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) = \frac{1}{2} \left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right)$,
where $X = [x_1^T; x_2^T; \ldots; x_n^T]$ is the design matrix and $\vec{y} = [y_1, \ldots, y_n]^T$.
To minimize $J(\theta)$, take the derivative and set it to zero:
$\nabla_\theta J = \frac{1}{2} \nabla_\theta \,\mathrm{tr}\!\left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right) = X^T X \theta - X^T \vec{y} = 0$,
which gives the normal equations $X^T X \theta = X^T \vec{y}$, and hence $\theta^* = (X^T X)^{-1} X^T \vec{y}$. (Eric Xing @ CMU, 2006-2011)
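A sketch of the closed-form solution in NumPy. Solving the linear system $X^T X \theta = X^T \vec{y}$ rather than forming the inverse explicitly is an implementation choice, not something stated on the slide:

```python
import numpy as np

def normal_equations(X, y):
    """Solve X^T X theta = X^T y (assumes X has full column rank)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([0.5, 1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=100)
print(normal_equations(X, y))
# np.linalg.lstsq(X, y, rcond=None) gives the same answer more robustly.
```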

Some matrix derivatives. For $f : \mathbb{R}^{m \times n} \to \mathbb{R}$, define $\nabla_A f(A)$ as the $m \times n$ matrix with entries $[\nabla_A f(A)]_{ij} = \frac{\partial f}{\partial A_{ij}}$. Trace: $\mathrm{tr}\,A = \sum_{i=1}^n A_{ii}$ for square $A$, $\mathrm{tr}\,a = a$ for a scalar $a$, and $\mathrm{tr}\,ABC = \mathrm{tr}\,CAB = \mathrm{tr}\,BCA$. Some facts about matrix derivatives (without proof): $\nabla_A \mathrm{tr}\,AB = B^T$, $\nabla_A \mathrm{tr}\,ABA^T C = CAB + C^T A B^T$, $\nabla_A |A| = |A| (A^{-1})^T$. (Eric Xing @ CMU, 2006-2011)

Comments on the normal equations. In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space and the matrix X is of full column rank. If this condition holds, then it is easy to verify that $X^T X$ is necessarily invertible. The assumption that $X^T X$ is invertible implies that it is positive definite, thus the critical point we have found is a minimum. What if X has less than full column rank? → regularization (later). (Eric Xing @ CMU, 2006-2011)

Direct and iterative methods. Direct methods: we can obtain the solution in a single step by solving the normal equations. Using Gaussian elimination or QR decomposition, we converge in a finite number of steps. This can be infeasible when data are streaming in in real time, or are of very large amount. Iterative methods: stochastic or steepest gradient descent. They converge in a limiting sense, but are more attractive in large practical problems. Caution is needed for deciding the learning rate α. (Eric Xing @ CMU, 2006-2011)

Convergence rate. Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the learning rate is chosen small enough. A formal analysis of LMS needs more math muscle; in practice, one can use a small α, or gradually decrease α. (Eric Xing @ CMU, 2006-2011)

Convergence curves. For the batch method, the training MSE is initially large due to uninformed initialization. In the online update, the N updates per epoch reduce the MSE to a much smaller value. (Eric Xing @ CMU, 2006-2011)

Least Squares. Learning: Three approaches to solving $\theta^* = \arg\min_\theta J(\theta)$.
Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient). Pros: conceptually simple, guaranteed convergence. Cons: batch, often slow to converge.
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient). Pros: memory efficient, fast convergence, less prone to local optima. Cons: convergence in practice requires tuning and fancier variants.
Approach 3: Closed Form (set derivatives equal to zero and solve for parameters). Pros: one-shot algorithm! Cons: does not scale to large datasets (the matrix inverse is the bottleneck).

Matching Game. Goal: Match the Algorithm to its Update Rule. Algorithms: 1. SGD for Logistic Regression, $h_\theta(x) = p(y \mid x)$; 2. Least Mean Squares, $h_\theta(x) = \theta^T x$; 3. Perceptron (next lecture), $h_\theta(x) = \mathrm{sgn}(\theta^T x)$. The candidate update rules (4-6) are variants of $\theta_k \leftarrow \theta_k + \lambda \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_k$, differing in the form of $h_\theta$ (e.g. $\frac{1}{1 + \exp(-\theta^T x^{(i)})}$). Answer choices: A. 1=5, 2=4, 3=6; B. 1=5, 2=6, 3=4; C. 1=6, 2=4, 3=4; D. 1=5, 2=6, 3=6; E. 1=6, 2=6, 3=6.

Geometric Interpretation of LMS. The predictions on the training data are $\hat{y} = X\theta^* = X (X^T X)^{-1} X^T \vec{y}$. Note that $\hat{y} - \vec{y} = \left( X (X^T X)^{-1} X^T - I \right) \vec{y}$ and $X^T (\hat{y} - \vec{y}) = X^T \left( X (X^T X)^{-1} X^T - I \right) \vec{y} = \left( X^T X (X^T X)^{-1} X^T - X^T \right) \vec{y} = 0$, so $\hat{y}$ is the orthogonal projection of $\vec{y}$ into the space spanned by the columns of $X$, where $X = [x_1^T; \ldots; x_n^T]$ and $\vec{y} = [y_1, \ldots, y_n]^T$. (Eric Xing @ CMU, 2006-2011)
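A quick numerical check (not from the slides) that the residual is orthogonal to the column space of X, i.e. that $X^T(\hat{y} - \vec{y}) = 0$ up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = rng.normal(size=50)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ theta_star            # orthogonal projection of y onto col(X)
print(X.T @ (y_hat - y))          # ~0 up to floating-point error
```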

Probabilistic Interpretation of LMS. Let us assume that the target variable and the inputs are related by the equation $y_i = \theta^T x_i + \epsilon_i$, where $\epsilon$ is an error term of unmodeled effects or random noise. Now assume that $\epsilon$ follows a Gaussian $N(0, \sigma^2)$; then we have $p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)$. By the independence assumption: $L(\theta) = \prod_{i=1}^n p(y_i \mid x_i; \theta) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^n \exp\!\left( -\frac{\sum_{i=1}^n (y_i - \theta^T x_i)^2}{2\sigma^2} \right)$. (Eric Xing @ CMU, 2006-2011)

Probabilistic Interpretation of LMS, cont. Hence the log-likelihood is: $\ell(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \theta^T x_i)^2$. Do you recognize the last term? Yes, it is $J(\theta) = \frac{1}{2} \sum_{i=1}^n (x_i^T \theta - y_i)^2$. Thus, under the independence assumption, LMS is equivalent to MLE of $\theta$! (Eric Xing @ CMU, 2006-2011)

Ridge Regression. Adds an L2 regularizer to Linear Regression: $J_{RR}(\theta) = J(\theta) + \lambda \|\theta\|_2^2 = \frac{1}{2} \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2 + \lambda \sum_{k=1}^K \theta_k^2$. Bayesian interpretation: MAP estimation with a Gaussian prior on the parameters: $\theta_{MAP} = \arg\max_\theta \left[ \sum_{i=1}^N \log p_\theta(y^{(i)} \mid x^{(i)}) + \log p(\theta) \right]$, which is equivalent to $\arg\min_\theta J_{RR}(\theta)$, where $p(\theta) \sim N(0, \frac{1}{\lambda})$.
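Setting the gradient of the ridge objective to zero gives a closed form. The sketch below minimizes $\frac{1}{2}\|X\theta - y\|^2 + \frac{\lambda}{2}\|\theta\|^2$, a scaling convention that differs from the slide's $\lambda\|\theta\|_2^2$ only by a constant factor absorbed into $\lambda$; whether to exempt a bias column from the penalty is a modeling choice not addressed here:

```python
import numpy as np

def ridge(X, y, lam=1.0):
    """Minimize 0.5*||X theta - y||^2 + 0.5*lam*||theta||^2 in closed form."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)
print(ridge(X, y, lam=10.0))   # coefficients are shrunk toward zero
```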

LASSO. Adds an L1 regularizer to Linear Regression: $J_{LASSO}(\theta) = J(\theta) + \lambda \|\theta\|_1 = \frac{1}{2} \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2 + \lambda \sum_{k=1}^K |\theta_k|$. Bayesian interpretation: MAP estimation with a Laplace prior on the parameters: $\theta_{MAP} = \arg\max_\theta \left[ \sum_{i=1}^N \log p_\theta(y^{(i)} \mid x^{(i)}) + \log p(\theta) \right]$, which is equivalent to $\arg\min_\theta J_{LASSO}(\theta)$, where $p(\theta) \sim \mathrm{Laplace}(0, f(\lambda))$.
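Unlike ridge, the LASSO has no closed-form solution; solvers typically use coordinate descent or proximal methods. A sketch using scikit-learn's Lasso to illustrate the sparsity of the solution (this assumes scikit-learn is available, and its objective scales the squared error by 1/(2N), so its alpha is not numerically identical to the λ above):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[[0, 3]] = [2.0, -1.5]          # only two relevant features
y = X @ true_w + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # most coordinates are driven exactly to zero (sparsity)
```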

Ridge Regression vs Lasso. [Figure: level sets of J(β) (curves of constant J(β)) intersecting the constraint regions: βs with constant ℓ2 norm (a disk) for Ridge, βs with constant ℓ1 norm (a diamond) for Lasso.] Lasso (ℓ1 penalty) results in sparse solutions — vectors with more zero coordinates. Good for high-dimensional problems: you don't have to store all coordinates! (Eric Xing @ CMU, 2006-2011)

Non-linear basis functions. So far we only used the observed values $x_1, x_2, \ldots$. However, linear regression can be applied in the same way to functions of these values. E.g., to add a term $w\, x_1 x_2$, add a new variable $z = x_1 x_2$ so each example becomes $x_1, x_2, \ldots, z$. As long as these functions can be directly computed from the observed values, the parameters are still linear in the data and the problem remains a multivariate linear regression problem: $y = w_0 + w_1 x_1 + \cdots + w_k x_k + \epsilon$.

Non-linear basis functions. What type of functions can we use? A few common examples: Polynomial: $\phi_j(x) = x^j$ for $j = 0, \ldots, n$. Gaussian: $\phi_j(x) = \exp\!\left( -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right)$. Sigmoid: $\phi_j(x) = \frac{1}{1 + \exp(-s_j x)}$. Logs: $\phi_j(x) = \log(x + 1)$. Any function of the input values can be used; the solution for the parameters of the regression remains the same.

General linear regression problem. Using our new notation for the basis functions, linear regression can be written as $y = \sum_{j=0}^n w_j \phi_j(x)$, where $\phi_j(x)$ can be either $x_j$ for multivariate regression or one of the non-linear basis functions we defined, and $\phi_0(x) = 1$ for the intercept term.
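A sketch of this recipe: build a design matrix $\Phi$ with $\Phi_{ij} = \phi_j(x_i)$ and then solve the same normal equations as before. The polynomial and Gaussian bases below follow the definitions above; the centers, width, and data are illustrative assumptions:

```python
import numpy as np

def polynomial_features(x, degree):
    """phi_j(x) = x^j for j = 0..degree (phi_0 = 1 is the intercept)."""
    return np.vander(x, degree + 1, increasing=True)

def gaussian_features(x, centers, width):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 * width^2)), plus a constant column."""
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))
    return np.column_stack([np.ones_like(x), phi])

rng = np.random.default_rng(8)
x = np.sort(rng.uniform(0, 1, size=30))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

# Fit with a cubic polynomial basis; gaussian_features could be swapped in.
Phi = polynomial_features(x, degree=3)
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # same normal equations as before
print(w)
```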

An example: polynomial basis vectors on a small dataset. (From Bishop, Ch. 1.)

0th Order Polynomial, n = 10.

1st Order Polynomial.

3rd Order Polynomial.

9th Order Polynomial.

Over-fitting. Root-Mean-Square (RMS) Error: $E_{RMS} = \sqrt{2 E(\mathbf{w}^*)/N}$.

Polynomial Coefficients.

9th Order Polynomial, Data Set Size: N = 15 and N = 100.

Regularization. Penalize large coefficient values: $J_{\lambda}(\mathbf{w}) = \frac{1}{2} \sum_i \left( y_i - \sum_j w_j \phi_j(x_i) \right)^2 + \lambda \|\mathbf{w}\|^2$.
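A sketch reproducing the flavor of the Bishop example: a 9th-order polynomial fit to 10 noisy samples of sin(2πx), once without and once with the penalty above (the value of λ, the noise level, and the coefficient-magnitude comparison are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 10
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

Phi = np.vander(x, 10, increasing=True)        # 9th-order polynomial basis

w_unreg = np.linalg.lstsq(Phi, y, rcond=None)[0]   # interpolates the noise
lam = 1e-3
w_reg = np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y)

print(np.abs(w_unreg).max(), np.abs(w_reg).max())  # huge vs. modest coefficients
```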

Regularization: [regularized polynomial fits, from Bishop, Ch. 1].

Polynomial Coefficients (for λ = none, exp(−18), and huge).

Over-Regularization: [fit with λ too large, from Bishop, Ch. 1].