Linear Regression

S. Sumitra

Notations: $x_i$: $i$th data point; $x^T$: transpose of $x$; $x_{ij}$: $i$th data point's $j$th attribute.

Let $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ be the given data, $x_i \in D$ and $y_i \in Y$. Here $D$ denotes the space of input values and $Y$ the space of output values. If the target variable is continuous, the learning problem is known as regression; if the target variable $y$ takes only discrete values, it is a classification problem. In this paper, $D = \mathbb{R}^n$ and $Y = \mathbb{R}$.

The simplest linear model represents $f$ as a linear combination of the attributes of $x$. That is,

$$f(x_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + \cdots + w_n x_{in} \tag{1}$$

where $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})^T$ and $w = (w_0, w_1, \ldots, w_n)^T \in \mathbb{R}^{n+1}$. Here the $w_j$'s are the parameters, parameterizing the space of linear functions mapping from $D$ to $Y$. (1) is the equation of a hyperplane. By taking $x_{i0} = 1$, (1) can be written as

$$f(x_i) = \sum_{j=0}^{n} w_j x_{ij} = w^T x_i \tag{2}$$

Given a training set, how do we choose the values of the $w_j$'s? As there are $N$ data points, we can write $N$ equations such that

$$f(x_i) = \sum_{j=0}^{n} w_j x_{ij} = w^T x_i, \quad i = 1, 2, \ldots, N \tag{3}$$

Define the design matrix to be

$$X = \begin{pmatrix} (x_1)^T \\ (x_2)^T \\ \vdots \\ (x_N)^T \end{pmatrix}$$

Here the $x_i$'s are the training data (each $x_i$ is an $n$-dimensional vector). $X$ is an $N \times (n+1)$ matrix if we include the intercept term, that is, if we set $x_i = (x_{i0}, x_{i1}, \ldots, x_{in})^T$ with $x_{i0} = 1$. Let $y$ be the $N$-dimensional vector that contains all the target values, that is, $y = (y_1, y_2, \ldots, y_N)^T$. Now the matrix representation of (3) is

$$Xw = y \tag{4}$$
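As a concrete illustration (not part of the original notes; the toy numbers and variable names are invented), the following minimal NumPy sketch builds the design matrix $X$ with the intercept attribute $x_{i0} = 1$ and evaluates all model outputs $f(x_i) = w^T x_i$ at once as the entries of $Xw$:

```python
import numpy as np

# Toy data: N = 4 points with n = 2 attributes each (values are made up).
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])          # shape (N, n)
y = np.array([3.0, 2.5, 4.0, 6.0])      # target values, shape (N,)

# Design matrix: prepend the intercept attribute x_{i0} = 1 to every row,
# giving the N x (n+1) matrix X of equation (4).
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# For any parameter vector w = (w_0, ..., w_n)^T, the model outputs
# f(x_i) = w^T x_i for all i are the entries of the product Xw.
w = np.array([0.5, 1.0, 0.2])           # arbitrary illustrative parameters
predictions = X @ w                     # shape (N,)
```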

The range space of $X$ is spanned by the columns of $X$. If $X$ is a square matrix and its inverse exists, $w = X^{-1} y$. This might not always be the case: (4) may have no solution or more than one solution. The former case happens when $X$ is not onto, and the latter case happens when $X$ is not one-to-one. Next we consider the case of rectangular matrices.

1. $y$ is not in the range of $X$ ($n + 1 < N$)

We first consider the case when $y$ is not in the range of $X$. In this case we find the preimage of the projection of $y$ onto the range space of $X$. That is, the optimal solution is

$$w = \arg\min_{w \in \mathbb{R}^{n+1}} J(w)$$

where

$$J(w) = \frac{1}{2}\,(d(Xw, y))^2 = \frac{1}{2}\,\|Xw - y\|^2 \tag{5}$$

$J(w)$ is called the least squares cost function. Now

$$J(w) = \frac{1}{2}\,(d(Xw, y))^2 = \frac{1}{2}\,\|Xw - y\|^2 = \frac{1}{2}\,\langle Xw - y,\, Xw - y \rangle$$

At the minimum of $J$, $\nabla J = 0$. That is,

$$\nabla J = X^T(Xw - y) = X^T X w - X^T y = 0$$

Hence,

$$X^T X w = X^T y \tag{6}$$

(6) is called the normal equation. Hence,

$$w = (X^T X)^{-1} X^T y \tag{7}$$

provided $(X^T X)^{-1}$ exists. $(X^T X)^{-1} X^T$ is called the pseudo-inverse.
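The closed form (7) is straightforward to compute numerically. The sketch below assumes the design matrix `X` and target vector `y` from the earlier example; both functions use standard NumPy routines and return the same $w$ whenever $X^T X$ is invertible:

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve the normal equation X^T X w = X^T y, i.e. equation (6).

    Assumes X is the N x (n+1) design matrix (intercept column included)
    and that X^T X is invertible, which is the usual n + 1 < N case.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_pseudoinverse(X, y):
    """Equivalent closed form w = (X^T X)^{-1} X^T y, written via the
    pseudo-inverse; np.linalg.pinv avoids forming the inverse explicitly."""
    return np.linalg.pinv(X) @ y
```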

Determining $w$ by this derivative method requires computing the inverse of $X^T X$, which is not computationally efficient for large data sets. Hence we resort to iterative search algorithms for finding $w$.

1.1 Least Mean Squares Algorithm

For finding the $w$ that minimizes $J$, we apply an iterative search algorithm. An iterative search algorithm that minimizes $J(w)$ starts with an initial guess of $w$ and then repeatedly changes $w$ to make $J(w)$ smaller, until it converges to the value that minimizes $J(w)$. We consider gradient descent for finding $w$.

[Gradient descent: If a real-valued function $F(x)$ is defined and differentiable in a neighbourhood of a point $a$, then $F(x)$ decreases fastest if one goes from $a$ in the direction of the negative gradient of $F$ at $a$, $-\nabla F(a)$. The gradient of the function is always perpendicular to the contour lines. (A contour line of a function of two variables is a curve along which the function has a constant value.)]

(5) can also be written as

$$J(w) = \frac{1}{2} \sum_{i=1}^{N} (f(x_i) - y_i)^2 \tag{8}$$

For applying gradient descent, consider the following steps. Choose an initial $w = (w_0, w_1, \ldots, w_n)^T \in \mathbb{R}^{n+1}$. Then repeatedly perform the update

$$w := w - \alpha \nabla J \tag{9}$$

Here, $\alpha > 0$ is called the learning rate. Since $f$ is a function of $w_0, w_1, \ldots, w_n$, $J$ is a function of $w_0, w_1, \ldots, w_n$. Therefore,

$$\nabla J = \left( \frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_n} \right)^T \tag{10}$$

Substituting (10) in (9),

$$(w_0, w_1, \ldots, w_n)^T := (w_0, w_1, \ldots, w_n)^T - \alpha \left( \frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_n} \right)^T \tag{11}$$

Hence,

$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}, \quad j = 0, 1, \ldots, n \tag{12}$$

Now

$$\begin{aligned}
\frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j}\, \frac{1}{2} \sum_{i=1}^{N} (f(x_i) - y_i)^2 \\
&= \sum_{i=1}^{N} (f(x_i) - y_i)\, \frac{\partial}{\partial w_j} (f(x_i) - y_i) \\
&= \sum_{i=1}^{N} (f(x_i) - y_i)\, \frac{\partial}{\partial w_j} \left( \sum_{k=0}^{n} w_k x_{ik} \right) \\
&= \sum_{i=1}^{N} (f(x_i) - y_i)\, x_{ij}
\end{aligned}$$

Therefore,

$$w_j := w_j + \alpha \sum_{i=1}^{N} (y_i - f(x_i))\, x_{ij}, \quad j = 0, 1, \ldots, n \tag{13}$$

(13) is called the LMS (least mean squares) update or the Widrow-Hoff learning rule. Pseudocode of the algorithm can be written as:

Iterate until convergence {
    $w_j := w_j + \alpha \sum_{i=1}^{N} (y_i - f(x_i))\, x_{ij}$, $\quad j = 0, 1, \ldots, n$
}

The magnitude of the parameter update is proportional to the error term $(y_i - f(x_i))$: a larger change is made to the parameter when the error is large, and vice versa. To update the parameters, the algorithm looks at every data point in the training set at every step, and hence it is called batch gradient descent. In general, gradient descent does not guarantee a global minimum. However, as $J$ is a convex quadratic function, the algorithm converges to the global minimum (assuming $\alpha$ is not too large).
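A minimal vectorized NumPy sketch of the batch LMS update (13) is given below; the learning rate, the fixed iteration count in place of a convergence test, and the zero initialization are illustrative choices rather than anything prescribed by the notes:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch LMS / Widrow-Hoff update of equation (13).

    X: N x (n+1) design matrix (intercept column included).
    y: length-N target vector.
    alpha and num_iters are illustrative; the notes only require alpha > 0
    and "not too large".
    """
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = y - X @ w            # (y_i - f(x_i)) for every data point
        w = w + alpha * X.T @ errors  # w_j += alpha * sum_i (y_i - f(x_i)) x_ij
    return w
```

The line `w = w + alpha * X.T @ errors` updates every $w_j$ simultaneously, since the $j$th entry of $X^T(y - Xw)$ is exactly $\sum_{i=1}^{N} (y_i - f(x_i))\, x_{ij}$.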

There is an alternative to batch gradient descent called stochastic gradient descent, which can be stated as follows:

Iterate until convergence {
    for $i = 1$ to $N$ {
        $w_j := w_j + \alpha (y_i - f(x_i))\, x_{ij}$, $\quad j = 0, 1, \ldots, n$
    }
}

In contrast to batch gradient descent, stochastic gradient descent processes only one training point at each step. Hence when $N$ becomes large, that is, for large data sets, stochastic gradient descent is more computationally efficient than batch gradient descent.
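A corresponding sketch of the stochastic version, again with illustrative values for the learning rate and the number of passes over the data (the notes only say "iterate until convergence"):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=50):
    """Stochastic (per-example) version of the LMS update.

    Each inner step uses a single training point, matching the pseudocode
    "for i = 1 to N"; points are processed in the given order, as in the notes.
    alpha and num_epochs are illustrative choices.
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        for i in range(N):
            error = y[i] - X[i] @ w       # y_i - f(x_i) for one data point
            w = w + alpha * error * X[i]  # update all w_j from this single point
    return w
```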

2. $y$ has more than one pre-image ($N < n + 1$)

The second case is when $y$ is in the range of $X$ but has more than one pre-image. As there would be more than one $w$ that satisfies the given equation, the following constrained optimization problem has to be considered:

$$\min_{w \in \mathbb{R}^{n+1}} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad Xw = y \tag{14}$$

By applying Lagrangian theory,

$$L(w, \lambda) = \frac{1}{2}\|w\|^2 + \lambda^T (Xw - y)$$

where $\lambda^T = (\lambda_1, \lambda_2, \ldots, \lambda_N)$; $\lambda_i$, $i = 1, \ldots, N$ are the Lagrangian parameters. Equating $\frac{\partial L}{\partial w} = 0$ gives $w + X^T \lambda = 0$, hence

$$w = -X^T \lambda \tag{15}$$

Equating $\frac{\partial L}{\partial \lambda} = 0$, we get

$$Xw - y = 0 \tag{16}$$

Using (15), the above equation becomes $-XX^T \lambda = y$. Therefore,

$$\lambda = -(XX^T)^{-1} y \tag{17}$$

Substituting (17) into (15),

$$w = X^T (XX^T)^{-1} y$$

References

(1) Andrew Ng's lecture notes (CS 229).
(2) C. M. Bishop, Pattern Recognition and Machine Learning.