Lecture Notes on Linear Regression

Similar documents
CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

Generalized Linear Methods

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

Singular Value Decomposition: Theory and Applications

Lecture 20: November 7

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Maximum Likelihood Estimation

Lecture 10 Support Vector Machines. Oct

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression

10-701/ Machine Learning, Fall 2005 Homework 3

Linear Approximation with Regularization and Moving Least Squares

The Geometry of Logit and Probit

Classification as a Regression Problem

Learning Theory: Lecture Notes

Week 5: Neural Networks

Problem Set 9 Solutions

MMA and GCMMA two methods for nonlinear optimization

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

Multilayer Perceptron (MLP)

Solutions to exam in SF1811 Optimization, Jan 14, 2015

8/25/17. Data Modeling. Data Modeling. Data Modeling. Patrice Koehl Department of Biological Sciences National University of Singapore

Lecture 10 Support Vector Machines II

The exam is closed book, closed notes except your one-page cheat sheet.

1 Convex Optimization

CS 229, Public Course Problem Set #3 Solutions: Learning Theory and Unsupervised Learning

Linear Regression Introduction to Machine Learning. Matt Gormley Lecture 5 September 14, Readings: Bishop, 3.1

Solutions HW #2. minimize. Ax = b. Give the dual problem, and make the implicit equality constraints explicit. Solution.

MATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2)

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Logistic Regression Maximum Likelihood Estimation

Feature Selection: Part 1

Which Separator? Spring 1

U.C. Berkeley CS294: Beyond Worst-Case Analysis Luca Trevisan September 5, 2017

Supporting Information

Kernel Methods and SVMs Extension

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

CHALMERS, GÖTEBORGS UNIVERSITET. SOLUTIONS to RE-EXAM for ARTIFICIAL NEURAL NETWORKS. COURSE CODES: FFR 135, FIM 720 GU, PhD

Chapter Newton s Method

Composite Hypotheses testing

xp(x µ) = 0 p(x = 0 µ) + 1 p(x = 1 µ) = µ

Vector Norms. Chapter 7 Iterative Techniques in Matrix Algebra. Cauchy-Bunyakovsky-Schwarz Inequality for Sums. Distances. Convergence.

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011

Chapter - 2. Distribution System Power Flow Analysis

Support Vector Machines CS434

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF

Ensemble Methods: Boosting

A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS

Assortment Optimization under MNL

Multilayer Perceptrons and Backpropagation. Perceptrons. Recap: Perceptrons. Informatics 1 CG: Lecture 6. Mirella Lapata

Lecture 2: Prelude to the big shrink

Linear Classification, SVMs and Nearest Neighbors

Lossy Compression. Compromise accuracy of reconstruction for increased compression.

THE ARIMOTO-BLAHUT ALGORITHM FOR COMPUTATION OF CHANNEL CAPACITY. William A. Pearlman. References: S. Arimoto - IEEE Trans. Inform. Thy., Jan.

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

The Expectation-Maximization Algorithm

Neural networks. Nuno Vasconcelos ECE Department, UCSD

VQ widely used in coding speech, image, and video

Review: Fit a line to N data points

Support Vector Machines CS434

Newton s Method for One - Dimensional Optimization - Theory

Estimation: Part 2. Chapter GREG estimation

SDMML HT MSc Problem Sheet 4

Erratum: A Generalized Path Integral Control Approach to Reinforcement Learning

Limited Dependent Variables

Goodness of fit and Wilks theorem

Markov Chain Monte Carlo Lecture 6

Discriminative classifier: Logistic Regression. CS534-Machine Learning

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

First Year Examination Department of Statistics, University of Florida

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty

CS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements

Introduction to the R Statistical Computing Environment R Programming

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

On an Extension of Stochastic Approximation EM Algorithm for Incomplete Data Problems. Vahid Tadayon 1

Errors for Linear Systems

The Minimum Universal Cost Flow in an Infeasible Flow Network

Maximum Likelihood Estimation (MLE)

An Experiment/Some Intuition (Fall 2006): Lecture 18 The EM Algorithm heads coin 1 tails coin 2 Overview Maximum Likelihood Estimation

Maximal Margin Classifier

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD

Finding Dense Subgraphs in G(n, 1/2)

Hydrological statistics. Hydrological statistics and extremes

Support Vector Machines

Lecture 3: Shannon s Theorem

Laboratory 1c: Method of Least Squares

A 2D Bounded Linear Program (H,c) 2D Linear Programming

EEE 241: Linear Systems

STAT 405 BIOSTATISTICS (Fall 2016) Handout 15 Introduction to Logistic Regression

Laboratory 3: Method of Least Squares

IV. Performance Optimization

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0

CSC 411 / CSC D11 / CSC C11

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

STAT 3008 Applied Regression Analysis

Linear Regression Analysis: Terminology and Notation

Transcription:

Feng Li
fli@sdu.edu.cn
Shandong University, China

1 Linear Regression Problem

In a regression problem, we aim at predicting a continuous target value given an input feature vector. We assume an $n$-dimensional feature vector denoted by $x \in \mathbb{R}^n$, while $y \in \mathbb{R}$ is the output variable. In linear regression models, the hypothesis function is defined by

$$h_\theta(x) = \theta^T x \tag{1}$$

where $\theta \in \mathbb{R}^{n+1}$ is a parameter vector (the feature vector is implicitly augmented with a constant entry $x_0 = 1$ so that $\theta_0$ plays the role of the intercept). Taking 2D linear regression as an example, we have $x \in \mathbb{R}$ and $\theta = [\theta_1, \theta_0]^T$. Therefore, the hypothesis function is $h_\theta(x) = \theta_1 x + \theta_0$. The 2D hypothesis function is actually a straight line defined on a 2D plane.

It is apparent that the hypothesis function is parameterized by $\theta$. Since our goal is to make predictions according to the hypothesis function given new test data, we need to find the optimal value of $\theta$ such that the resulting predictions are as accurate as possible. Such a procedure is the so-called training. The training procedure is performed based on a given set of $m$ training data $\{x^{(i)}, y^{(i)}\}_{i=1,\dots,m}$. In particular, we are supposed to find a hypothesis function (parameterized by $\theta$) which fits the training data as closely as possible. To measure the error between $h_\theta$ and the training data, we define a cost function (also called error function) $J(\theta): \mathbb{R}^{n+1} \to \mathbb{R}$ as follows:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Our linear regression problem can then be formulated as

$$\min_\theta \; J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(\theta^T x^{(i)} - y^{(i)}\right)^2$$

We give an illustration in Fig. 1 to explain linear regression in 3D space (i.e., $n = 2$). In the 3D space, the hypothesis function is represented by a hyperplane. The red points denote the training data, and the figure shows that $J(\theta)$ is (half) the sum of the squared differences between the target values of the training data and the hyperplane.

Figure 1: 3D linear regression.
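To make the definitions above concrete, here is a minimal NumPy sketch of the hypothesis and the cost function $J(\theta)$. It is an illustration rather than code from the notes: the names `X_aug`, `theta`, and `y` are assumptions, and `X_aug` is taken to already contain the constant column of ones mentioned above.

```python
# Minimal sketch of h_theta(x) = theta^T x and J(theta); names are illustrative.
import numpy as np

def hypothesis(theta, X_aug):
    """h_theta(x) = theta^T x, evaluated for every row of X_aug."""
    return X_aug @ theta

def cost(theta, X_aug, y):
    """J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2."""
    residuals = hypothesis(theta, X_aug) - y
    return 0.5 * np.sum(residuals ** 2)

# Toy 2D example (one feature plus the intercept column of ones).
X_aug = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # columns: [1, x]
y = np.array([1.0, 3.0, 5.0])                           # roughly y = 2x + 1
theta = np.array([1.0, 2.0])                            # [theta_0, theta_1]
print(cost(theta, X_aug, y))                            # 0.0 for a perfect fit
```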

2 Gradient Descent

The Gradient Descent (GD) method is a first-order iterative optimization algorithm for finding the minimum of a function. If the multi-variable function $J(\theta)$ is differentiable in a neighborhood of a point $\theta$, then $J(\theta)$ decreases fastest if one goes from $\theta$ in the direction of the negative gradient of $J$ at $\theta$. Let

$$\nabla J(\theta) = \left[\frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1}, \dots, \frac{\partial J}{\partial \theta_n}\right]^T \tag{2}$$

denote the gradient of $J(\theta)$. In each iteration, we update $\theta$ according to the following rule:

$$\theta \leftarrow \theta - \alpha \nabla J(\theta) \tag{3}$$

where $\alpha$ is a step size. In more detail, for each component,

$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} \tag{4}$$

The update is terminated when convergence is achieved. In our linear regression model, the gradient can be calculated as

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j}\,\frac{1}{2}\sum_{i=1}^{m}\left(\theta^T x^{(i)} - y^{(i)}\right)^2 = \sum_{i=1}^{m}\left(\theta^T x^{(i)} - y^{(i)}\right)x_j^{(i)} \tag{5}$$

We summarize the GD method in Algorithm 1. The algorithm usually starts with a randomly initialized $\theta$. In each iteration, we update $\theta$ such that the objective function decreases monotonically. The algorithm is said to have converged when the difference of $J$ in successive iterations is less than (or equal to) a predefined threshold (say $\varepsilon$). Assuming $\theta^{(t)}$ and $\theta^{(t+1)}$ are the values of $\theta$ in the $t$-th and the $(t+1)$-th iterations, respectively, the algorithm has converged when

$$\left|J(\theta^{(t+1)}) - J(\theta^{(t)})\right| \le \varepsilon \tag{6}$$
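As a hedged illustration of Eqs. (3)–(5), not code from the notes, the gradient can be written in vectorized form and a single update applied as follows; the sketch reuses the `X_aug`/`y` conventions of the previous snippet.

```python
# Illustrative sketch of Eqs. (3)-(5) for linear regression.
import numpy as np

def gradient(theta, X_aug, y):
    # Vectorized form of Eq. (5): grad_j = sum_i (theta^T x_i - y_i) * x_ij
    return X_aug.T @ (X_aug @ theta - y)

def gd_step(theta, X_aug, y, alpha):
    # One application of the update rule of Eq. (3): theta <- theta - alpha * grad J(theta)
    return theta - alpha * gradient(theta, X_aug, y)

# First GD step from theta = 0 on the toy data used earlier.
X_aug = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = gd_step(np.zeros(2), X_aug, y, alpha=0.1)
```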

Algorithm 1: Gradient Descent
Given a starting point $\theta \in \operatorname{dom} J$
repeat
    1: Calculate the gradient $\nabla J(\theta)$;
    2: Update $\theta \leftarrow \theta - \alpha \nabla J(\theta)$
until the convergence criterion is satisfied

Figure 2: The convergence of the GD algorithm.

Another convergence criterion is to set a fixed value for the maximum number of iterations, such that the algorithm terminates once the number of iterations exceeds this threshold. We illustrate how the algorithm converges iteratively in Fig. 2. The colored contours represent the objective function, and the GD algorithm converges to the minimum step by step. The choice of the step size $\alpha$ has a very important influence on the convergence of the GD algorithm. We illustrate the convergence processes under different step sizes in Fig. 3.

Figure 3: The convergence of the GD algorithm under different step sizes.
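Before turning to stochastic variants, the sketch below puts Algorithm 1 together with both stopping rules just discussed: the threshold test of Eq. (6) and a cap on the number of iterations. It is a self-contained illustration; the default values of `alpha`, `eps`, and `max_iters` are assumptions, not recommendations from the notes.

```python
# Sketch of Algorithm 1 with both stopping criteria.
import numpy as np

def gradient_descent(X_aug, y, alpha=0.01, eps=1e-8, max_iters=10000, seed=0):
    cost = lambda t: 0.5 * np.sum((X_aug @ t - y) ** 2)   # J(theta)
    grad = lambda t: X_aug.T @ (X_aug @ t - y)            # Eq. (5), vectorized
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=X_aug.shape[1])               # randomly initialized theta
    prev_cost = cost(theta)
    for _ in range(max_iters):                            # criterion 2: iteration cap
        theta = theta - alpha * grad(theta)
        curr_cost = cost(theta)
        if abs(curr_cost - prev_cost) <= eps:             # criterion 1, Eq. (6)
            break
        prev_cost = curr_cost
    return theta

# Toy usage: recovers theta close to [1, 2] for y ~ 2x + 1.
X_aug = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(gradient_descent(X_aug, y, alpha=0.1))
```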

3 Stochastic Gradient Descent

According to Eq. (5), we have to visit all of the training data in each iteration. The induced cost is therefore considerable, especially when the training data set is large. Stochastic Gradient Descent (SGD), also known as incremental gradient descent, is a stochastic approximation of the gradient descent optimization method. In each iteration, the parameters are updated according to the gradient of the error (i.e., the cost function) with respect to one training sample only. Hence, each iteration entails very limited cost.

We summarize the SGD method in Algorithm 2. In each iteration, we first randomly shuffle the training data, and then use only one training example to calculate the gradient (i.e., $\nabla J(\theta; x^{(i)}, y^{(i)})$) and update $\theta$. In our linear regression model, $\nabla J(\theta; x^{(i)}, y^{(i)})$ is defined as

$$\nabla J(\theta; x^{(i)}, y^{(i)}) = \left(\theta^T x^{(i)} - y^{(i)}\right)x^{(i)} \tag{7}$$

and the update rule is

$$\theta_j \leftarrow \theta_j - \alpha\left(\theta^T x^{(i)} - y^{(i)}\right)x_j^{(i)} \tag{8}$$

Algorithm 2: Stochastic Gradient Descent for Linear Regression
1: Given a starting point $\theta \in \operatorname{dom} J$
2: repeat
3:     Randomly shuffle the training data;
4:     for $i = 1, 2, \dots, m$ do
5:         $\theta \leftarrow \theta - \alpha \nabla J(\theta; x^{(i)}, y^{(i)})$
6:     end for
7: until the convergence criterion is satisfied

Compared with GD, where the objective cost function decreases at every step, SGD has no such guarantee. In fact, SGD usually entails more steps to converge, but each step is much cheaper. One variant of SGD is the so-called mini-batch SGD, where we pick a small group of training data and average their gradients to accelerate and smooth the convergence. For example, by randomly choosing $k$ training data, we can calculate the averaged gradient

$$\frac{1}{k}\sum_{i=1}^{k}\nabla J(\theta; x^{(i)}, y^{(i)}) \tag{9}$$
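The following sketch illustrates one epoch of Algorithm 2. It is an illustration rather than code from the notes; the `batch_size` parameter is an assumed knob that recovers plain SGD when set to 1 and the mini-batch averaging of Eq. (9) otherwise.

```python
# One epoch of SGD / mini-batch SGD for linear regression.
import numpy as np

def sgd_epoch(theta, X_aug, y, alpha, batch_size=1, seed=None):
    rng = np.random.default_rng(seed)
    m = X_aug.shape[0]
    order = rng.permutation(m)                      # randomly shuffle the training data
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X_aug[idx], y[idx]
        grad = Xb.T @ (Xb @ theta - yb) / len(idx)  # averaged gradient, Eq. (9)
        theta = theta - alpha * grad                # update rule, Eq. (8)
    return theta
```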

4 A Closed-Form Solution to Linear Regression

We first look at the vector form of the linear regression model. Assume

$$X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \qquad Y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} \tag{10}$$

Therefore, we have

$$X\theta - Y = \begin{bmatrix} (x^{(1)})^T\theta \\ \vdots \\ (x^{(m)})^T\theta \end{bmatrix} - \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix}$$

Then, the cost function $J(\theta)$ can be rewritten as

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{1}{2}(X\theta - Y)^T(X\theta - Y) \tag{11}$$

To minimize the cost function, we calculate its derivative and set it to zero:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \frac{1}{2}(Y - X\theta)^T(Y - X\theta) \\
&= \frac{1}{2}\nabla_\theta (Y^T - \theta^T X^T)(Y - X\theta) \\
&= \frac{1}{2}\nabla_\theta \operatorname{tr}\left(Y^T Y - Y^T X\theta - \theta^T X^T Y + \theta^T X^T X\theta\right) \\
&= \frac{1}{2}\nabla_\theta \operatorname{tr}\left(\theta^T X^T X\theta\right) - X^T Y \\
&= \frac{1}{2}\left(X^T X\theta + X^T X\theta\right) - X^T Y \\
&= X^T X\theta - X^T Y
\end{aligned}
$$

Setting $X^T X\theta - X^T Y = 0$, we have

$$\theta = (X^T X)^{-1} X^T Y$$

Note that the inverse of $X^T X$ does not always exist. In fact, the matrix $X^T X$ is invertible if and only if the columns of $X$ are linearly independent.
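A small illustrative sketch of the closed-form solution follows (not code from the notes). It solves the linear system $X^T X\theta = X^T Y$ rather than forming the inverse explicitly, and it raises a `LinAlgError` when $X^T X$ is singular, i.e., when the columns of $X$ are linearly dependent.

```python
# Normal-equation solution: solve X^T X theta = X^T Y.
import numpy as np

def normal_equation(X, Y):
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Toy usage (same data as the earlier snippets): recovers theta = [1, 2] exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y = np.array([1.0, 3.0, 5.0])
print(normal_equation(X, Y))
```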

5 A Probabilistic Interpretation

An interesting question is why the least-squares form of the linear regression model is reasonable. We hereby give a probabilistic interpretation. We suppose a target value $y^{(i)}$ is sampled from the line $\theta^T x$ with certain noise. Therefore, we have

$$y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}$$

where $\varepsilon^{(i)}$ denotes the noise and is independently and identically distributed (i.i.d.) according to a Gaussian distribution $\mathcal{N}(0, \sigma^2)$. The density of $\varepsilon^{(i)}$ is given by

$$f(\varepsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}\right)$$

and we equivalently have the following conditional probability

$$\Pr(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$

Therefore, the distribution of $y^{(i)}$ given $x^{(i)}$ is parameterized by $\theta$, i.e., $y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$. Recall that $Y$ and $X$ are the vector of target values and the matrix of features, respectively (see Eq. (10)), so that the model can be written as $Y = X\theta + \varepsilon$ with $\varepsilon = [\varepsilon^{(1)}, \dots, \varepsilon^{(m)}]^T$. Since the $\varepsilon^{(i)}$'s are i.i.d., we define the following likelihood function

$$L(\theta) = \prod_{i=1}^{m}\Pr(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$

which denotes the probability of the given target values in the training data. To calculate the optimal $\theta$ such that the resulting linear regression model fits the given training data best, we need to maximize the likelihood $L(\theta)$. To simplify the computation, we use the log-likelihood function instead, i.e.,

$$
\begin{aligned}
\ell(\theta) = \log L(\theta) &= \log \prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\
&= \sum_{i=1}^{m}\log \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\
&= m\log\frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2
\end{aligned}
$$

Since the first term does not depend on $\theta$, maximizing $L(\theta)$ is equivalent to minimizing

$$\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

which is exactly the least-squares cost function $J(\theta)$.
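As a closing illustration (assumed code, with an arbitrary noise level $\sigma = 1$), the snippet below checks numerically that the log-likelihood $\ell(\theta)$ and the least-squares cost are optimized by the same $\theta$: up to an additive constant and the positive factor $1/\sigma^2$, $\ell(\theta)$ is just $-J(\theta)$.

```python
# Numerical check that maximizing l(theta) matches minimizing J(theta).
import numpy as np

def least_squares_cost(theta, X, y):
    return 0.5 * np.sum((X @ theta - y) ** 2)            # J(theta)

def log_likelihood(theta, X, y, sigma=1.0):
    m = X.shape[0]
    return (m * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma))
            - np.sum((y - X @ theta) ** 2) / (2 * sigma ** 2))

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)            # least-squares optimum
theta_other = theta_hat + np.array([0.1, -0.2])          # any perturbation
assert least_squares_cost(theta_hat, X, y) < least_squares_cost(theta_other, X, y)
assert log_likelihood(theta_hat, X, y) > log_likelihood(theta_other, X, y)
```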