Linear Models for Regression

Linear Models for Regression CSE 4309 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1

The Regression Problem Training data: a set of input-output pairs (x_n, t_n). x_n is the n-th training input. t_n is the target output for x_n. Goal: learn a function y(x) that can predict the target value t for a new input x. So far, this is the standard definition of a generic supervised learning problem. What differentiates regression problems is that the target outputs come from a continuous space. 2

A Regression Example Circles: training data. Red curve: one possible solution, a line. Green curve: another possible solution, a cubic polynomial. 3

Linear Models for Regression y(x, w) = w_0 + Σ_{j=1}^{M-1} w_j φ_j(x) Functions φ_j are called basis functions. You must decide what these functions should be, before you start training. They can be any functions you want. Parameters w_j are weights. They are real numbers. The goal of linear regression is to estimate these weights. The output of training is the values of these weights. 4

The Dummy φ_0 Function y(x, w) = w_0 + Σ_{j=1}^{M-1} w_j φ_j(x) To simplify notation, we define a "dummy" basis function φ_0(x) = 1. Then, the above formula becomes: y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) 5

Dot Product Version y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x) w is a column vector of weights: w = (w_0, w_1, …, w_{M-1})^T. w^T is the transpose of w: w^T = [w_0, w_1, …, w_{M-1}]. φ(x) is a column vector: φ(x) = (φ_0(x), φ_1(x), …, φ_{M-1}(x))^T. 6
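
To make the dot product notation concrete, here is a minimal NumPy sketch (not part of the original slides) that builds the basis function vector φ(x), including the dummy basis function φ_0(x) = 1, and evaluates y(x, w) = w^T φ(x). The polynomial basis and the example weights are arbitrary illustrative choices.

```python
import numpy as np

def phi(x, M):
    # Basis function vector for a scalar input x.
    # Illustrative choice: polynomial basis functions phi_j(x) = x**j,
    # so phi_0(x) = x**0 = 1 plays the role of the dummy basis function.
    return np.array([x**j for j in range(M)])

def y(x, w):
    # Linear model output: y(x, w) = w^T phi(x).
    return w @ phi(x, len(w))

w = np.array([1.0, 0.5, -2.0])   # example weights (M = 3)
print(y(0.3, w))                 # 1.0 + 0.5*0.3 - 2.0*0.3**2 = 0.97
```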

Common Choices for Basis Functions Gaussian basis functions: φ_j(x) = exp(-(x - μ_j)^2 / (2s^2)) 7

Common Choices for Basis Functions Sigmoidal basis functions: φ_j(x) = σ((x - μ_j) / s), where σ(a) = 1 / (1 + e^{-a}). σ(a) is called the logistic sigmoid function. We will see it again in neural networks. 9

Common Choices for Basis Functions Polynomial basis functions, for example powers of x: φ_j(x) = x^j. If the basis functions are powers of x, then the regression process fits a polynomial to the data. In other words, the regression process estimates the parameters w_j of a polynomial of degree M-1: y(x, w) = Σ_{j=0}^{M-1} w_j x^j 11
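
The three families of basis functions above can be written in a few lines of NumPy. This is a rough sketch (not from the slides); the center μ and scale s below are arbitrary values chosen only for illustration.

```python
import numpy as np

def gaussian_basis(x, mu, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
    return np.exp(-(x - mu)**2 / (2 * s**2))

def sigmoidal_basis(x, mu, s):
    # phi_j(x) = sigma((x - mu_j) / s), with sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

def polynomial_basis(x, j):
    # phi_j(x) = x^j
    return x**j

x = 0.5   # example input; mu and s below are arbitrary
print(gaussian_basis(x, mu=0.0, s=0.2))
print(sigmoidal_basis(x, mu=0.0, s=0.2))
print(polynomial_basis(x, j=3))
```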

Global Versus Local Functions Polynomial basis functions are global functions. Their values are far from zero for most of the input range from -∞ to +∞. Changing a single weight w_j affects the output y(x, w) in the entire input range. Gaussian functions are local functions. Their values are practically (but not mathematically) zero for most of the input range from -∞ to +∞. Changing a single weight w_j practically does not affect the output y(x, w) except in a specific small interval. It is often easier to fit data with local basis functions. Each basis function fits a small region of the input space. 12

Linear Versus Nonlinear Functions A linear function y(x, w) produces an output that depends linearly on both x and w. Note that polynomial, Gaussian, and sigmoidal basis functions are nonlinear. If we use nonlinear basis functions, then the regression process produces a function y(x, w) which is: linear in w, nonlinear in x. It is important to remember: linear regression can be used to estimate nonlinear functions of x. It is called linear regression because y is linear in w, NOT because y is linear in x. 13

Solving Regression Problems There are different methods for solving regression problems. We will study two approaches: Least squares: find the weights w that minimize the sum-of-squares error. Regularized least squares: find the weights w that minimize the sum-of-squares error plus a penalty on the weights, controlled by a hand-picked regularization parameter λ. 14

The Gaussian Noise Assumption Suppose that we want to find the most likely solution. We want to find the weights w that maximize the likelihood of the training data. In order to do that, we need to make an additional assumption about the process that generates outputs based on inputs. A common approach is to assume a Gaussian noise model: t = y(x, w) + ε In words, t is generated by computing y(x, w) and then adding some noise. The noise ε is a random variable from a zero-mean Gaussian distribution. 15

The Gaussian Noise Assumption t = y(x, w) + ε The noise ε is a random variable from a zero-mean Gaussian distribution. Therefore: p(t | x, w, β) = N(t | y(x, w), β^{-1}) In the above equation, β is called the precision of the Gaussian, and is defined as the inverse of the variance: β = 1/σ^2. The likelihood that input x leads to output t is a Gaussian whose mean is y(x, w) and whose variance is β^{-1}. 16
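
As a sanity check of the noise model, the following sketch (my own illustration, with made-up "true" weights and precision) generates synthetic training data according to t = y(x, w) + ε, where ε is zero-mean Gaussian noise with precision β.

```python
import numpy as np

rng = np.random.default_rng(0)

w_true = np.array([0.5, -1.0])   # hypothetical "true" weights: y(x) = 0.5 - 1.0*x
beta = 25.0                      # precision; the noise variance is 1/beta = 0.04

N = 20
x = rng.uniform(0.0, 1.0, size=N)                    # training inputs x_n
y_clean = w_true[0] + w_true[1] * x                  # y(x_n, w)
noise = rng.normal(0.0, np.sqrt(1.0 / beta), size=N) # zero-mean Gaussian noise
t = y_clean + noise                                  # observed targets t_n
```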

Finding the Most Likely Solution What is the value of w that is most likely given the data? Suppose we have a set of training inputs: X = {x_1, …, x_N}. We also have a set of corresponding outputs: t = (t_1, …, t_N). We assume that training inputs are independent of each other. We assume that outputs are conditionally independent of each other, given their inputs and noise parameter β. Then: p(w | X, β, t) = p(t | X, w, β) p(w | X, β) / p(t | X, β) 17

Finding the Most Likely Solution p(w | X, β, t) = p(t | X, w, β) p(w | X, β) / p(t | X, β) We assume that, given X and β, all values of w are equally likely. Then, p(w | X, β) / p(t | X, β) is a constant that does not depend on w. Therefore, finding the w that maximizes p(w | X, β, t) is the same as finding the w that maximizes p(t | X, w, β). p(t | X, w, β) is the likelihood of the training data. So, to find the most likely answer w, we must find the value of w that maximizes the likelihood of the training data. 18

Likelihood of the Training Data What is the probability of the training data given w? We assume that outputs are conditionally independent of each other, given their inputs and noise parameter β. Then: p(t | X, w, β) = ∏_{n=1}^{N} N(t_n | y(x_n, w), β^{-1}) 19

Likelihood of the Training Data Remember that, using dot product notation: y(x_n, w) = w^T φ(x_n). Therefore: p(t | X, w, β) = ∏_{n=1}^{N} N(t_n | y(x_n, w), β^{-1}) = ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}) Our goal is to find the weights w that maximize p(t | X, w, β). 20

Log Likelihood of the Training Data p(t | X, w, β) = ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}) Maximizing p(t | X, w, β) is the same as maximizing ln p(t | X, w, β), where ln is the natural logarithm. ln p(t | X, w, β) = ln( ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}) ) 21

Log Likelihood of the Training Data ln p(t | X, w, β) = ln( ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}) ) = ln( ∏_{n=1}^{N} √(β/(2π)) e^{-β (t_n - w^T φ(x_n))^2 / 2} ) = N ln √(β/(2π)) + Σ_{n=1}^{N} ln e^{-β (t_n - w^T φ(x_n))^2 / 2} 22

Log Likelihood of the Training Data ln p(t | X, w, β) = N ln √(β/(2π)) + Σ_{n=1}^{N} ln e^{-β (t_n - w^T φ(x_n))^2 / 2} = N ln √(β/(2π)) - (β/2) Σ_{n=1}^{N} (t_n - w^T φ(x_n))^2 Note that N ln √(β/(2π)) is independent of w. Therefore, to maximize ln p(t | X, w, β) we must maximize -(β/2) Σ_{n=1}^{N} (t_n - w^T φ(x_n))^2 23

Log Likelihood and Sum-of-Squares Error To maximize ln p(t | X, w, β) we must maximize: -(β/2) Σ_{n=1}^{N} (t_n - w^T φ(x_n))^2 Remember that the sum-of-squares error is defined in the textbook as: E_D(w) = (1/2) Σ_{n=1}^{N} (t_n - w^T φ(x_n))^2 Therefore, we want to maximize -β E_D(w), which is the same as saying that we want to minimize E_D(w). Therefore, we have proven that the w that maximizes the likelihood of the training data is the same w that minimizes the sum-of-squares error. 24
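
For reference, the chain of equalities on the last few slides can be written compactly (in the slides' notation) as:

```latex
\begin{aligned}
\ln p(\mathbf{t}\mid X,\mathbf{w},\beta)
  &= \sum_{n=1}^{N} \ln \mathcal{N}\!\left(t_n \mid \mathbf{w}^{T}\boldsymbol{\phi}(x_n),\ \beta^{-1}\right) \\
  &= \frac{N}{2}\ln\beta \;-\; \frac{N}{2}\ln(2\pi)
     \;-\; \frac{\beta}{2}\sum_{n=1}^{N}\bigl(t_n - \mathbf{w}^{T}\boldsymbol{\phi}(x_n)\bigr)^{2} \\
  &= \frac{N}{2}\ln\beta \;-\; \frac{N}{2}\ln(2\pi) \;-\; \beta\, E_D(\mathbf{w}),
  \qquad E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl(t_n - \mathbf{w}^{T}\boldsymbol{\phi}(x_n)\bigr)^{2}
\end{aligned}
```

Only the last term depends on w, which is exactly the equivalence stated above.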

Minimizing Sum-of-Squares Error We want to find the w that minimizes: E_D(w) = (1/2) Σ_{n=1}^{N} (t_n - w^T φ(x_n))^2 E_D(w) is a function mapping an M-dimensional vector to a real number. Remember from calculus, to minimize any such function: Compute the gradient vector ∇E_D(w). This gradient is an M-dimensional row vector. Find the values of w that solve the equation ∇E_D(w) = 0. These values of w are possible maxima or minima. 25

Minimizing Sum-of-Squares Error We want to find the w that minimizes: E_D(w) = (1/2) Σ_{n=1}^{N} (t_n - w^T φ(x_n))^2 To calculate ∇E_D(w), first let's calculate the gradient ∇E_{D,n}(w) of the function: E_{D,n}(w) = (t_n - w^T φ(x_n))^2 E_{D,n}(w) is the composition f(g(w)) of: f(x) = x^2, g(w) = t_n - w^T φ(x_n). According to the chain rule: ∇(f ∘ g) = f′(g) ∇g. 26

Minimizing Sum-of-Squares Error E_{D,n}(w) = (t_n - w^T φ(x_n))^2 E_{D,n}(w) is the composition f(g(w)) of: f(x) = x^2, g(w) = t_n - w^T φ(x_n). According to the chain rule: ∇(f ∘ g) = f′(g) ∇g, with f′(x) = 2x and ∇g(w) = -φ(x_n)^T. Therefore: ∇E_{D,n}(w) = f′(g(w)) ∇g(w) = -2 (t_n - w^T φ(x_n)) φ(x_n)^T 27
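
A quick finite-difference check (my own sketch, with randomly generated stand-ins for φ(x_n) and t_n) confirms the gradient formula ∇E_{D,n}(w) = -2 (t_n - w^T φ(x_n)) φ(x_n)^T:

```python
import numpy as np

rng = np.random.default_rng(6)
M = 4
w = rng.normal(size=M)
phi_n = rng.normal(size=M)       # stand-in for phi(x_n)
t_n = 1.3                        # stand-in for t_n

def E_Dn(w):
    # Per-example error: (t_n - w^T phi(x_n))^2
    return (t_n - w @ phi_n)**2

grad_analytic = -2.0 * (t_n - w @ phi_n) * phi_n

# Central finite differences along each coordinate direction
eps = 1e-6
grad_numeric = np.array([(E_Dn(w + eps * e) - E_Dn(w - eps * e)) / (2 * eps)
                         for e in np.eye(M)])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True
```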

Minimizing Sum-of-Squares Error E_{D,n}(w) = (t_n - w^T φ(x_n))^2 ∇E_{D,n}(w) = -2 (t_n - w^T φ(x_n)) φ(x_n)^T E_D(w) = (1/2) Σ_{n=1}^{N} (t_n - w^T φ(x_n))^2 = (1/2) Σ_{n=1}^{N} E_{D,n}(w) Therefore: ∇E_D(w) = (1/2) Σ_{n=1}^{N} ∇E_{D,n}(w) = (1/2) Σ_{n=1}^{N} (-2 (t_n - w^T φ(x_n)) φ(x_n)^T) 28

Minimizing Sum-of-Squares Error ∇E_D(w) = (1/2) Σ_{n=1}^{N} (-2 (t_n - w^T φ(x_n)) φ(x_n)^T) = Σ_{n=1}^{N} (w^T φ(x_n) - t_n) φ(x_n)^T = Σ_{n=1}^{N} w^T φ(x_n) φ(x_n)^T - Σ_{n=1}^{N} t_n φ(x_n)^T 29

Minimizing Sum-of-Squares Error ∇E_D(w) = Σ_{n=1}^{N} w^T φ(x_n) φ(x_n)^T - Σ_{n=1}^{N} t_n φ(x_n)^T We want to solve the equation ∇E_D(w) = 0. We can simplify expressions using vector and matrix notation: Φ is the N×M design matrix whose n-th row is [φ_0(x_n), φ_1(x_n), …, φ_{M-1}(x_n)], and t = (t_1, t_2, …, t_N)^T. 30

Minimizing Sum-of-Squares Error φ(x_n) φ(x_n)^T is the M×M outer-product matrix whose (i, j) entry is φ_i(x_n) φ_j(x_n). Summing these matrices over all training examples gives: Σ_{n=1}^{N} φ(x_n) φ(x_n)^T = Φ^T Φ 32
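
This identity is easy to verify numerically. The sketch below (my own check, using a polynomial design matrix as an example) compares the sum of outer products with Φ^T Φ:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 10, 4
x = rng.uniform(-1.0, 1.0, size=N)

# Design matrix: row n is [phi_0(x_n), ..., phi_{M-1}(x_n)], with phi_j(x) = x^j
Phi = np.vander(x, M, increasing=True)

# Sum of the outer products phi(x_n) phi(x_n)^T over all training examples
outer_sum = sum(np.outer(Phi[n], Phi[n]) for n in range(N))

print(np.allclose(outer_sum, Phi.T @ Phi))   # True
```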

Minimizing Sum-of-Squares Error ∇E_D(w) = Σ_{n=1}^{N} w^T φ(x_n) φ(x_n)^T - Σ_{n=1}^{N} t_n φ(x_n)^T, where Φ and t are as defined above. 33

Minimizing Sum-of-Squares Error Substituting Σ_{n=1}^{N} φ(x_n) φ(x_n)^T = Φ^T Φ into the first term: ∇E_D(w) = w^T Φ^T Φ - Σ_{n=1}^{N} t_n φ(x_n)^T 34

Minimizing Sum-of-Squares Error The remaining sum Σ_{n=1}^{N} t_n φ(x_n)^T equals the row vector t^T Φ, so: ∇E_D(w) = w^T Φ^T Φ - t^T Φ 36

Minimizing Sum-of-Squares Error ∇E_D(w) = w^T Φ^T Φ - t^T Φ Setting ∇E_D(w) = 0 gives w^T Φ^T Φ = t^T Φ. Multiplying both sides by (Φ^T Φ)^{-1}: w^T = t^T Φ (Φ^T Φ)^{-1} 37

Minimizing Sum-of-Squares Error ∇E_D(w) = w^T Φ^T Φ - t^T Φ Setting ∇E_D(w) = 0 gives w^T Φ^T Φ = t^T Φ. Transposing both sides of w^T = t^T Φ (Φ^T Φ)^{-1}: w = (t^T Φ (Φ^T Φ)^{-1})^T 38

Minimizing Sum-of-Squares Error Using the rule (AB)^T = B^T A^T: w = (t^T Φ (Φ^T Φ)^{-1})^T = ((Φ^T Φ)^{-1})^T Φ^T t 39

Minimizing Sum-of-Squares Error Using the rule (A^T)^{-1} = (A^{-1})^T: w = ((Φ^T Φ)^{-1})^T Φ^T t = ((Φ^T Φ)^T)^{-1} Φ^T t 40

Minimizing Sum-of-Squares Error Using the rule (AB)^T = B^T A^T again, (Φ^T Φ)^T = Φ^T Φ, so: w = ((Φ^T Φ)^T)^{-1} Φ^T t = (Φ^T Φ)^{-1} Φ^T t 41

Notation: w_ML From the previous slides, we got the formula w = (Φ^T Φ)^{-1} Φ^T t We denote this value of w as w_ML, because it is the maximum likelihood estimate of w. In other words, w_ML is the most likely value of w given the data. So, we rewrite the formula as: w_ML = (Φ^T Φ)^{-1} Φ^T t 42

Solving for β Remember the Gaussian noise model: p(t | x, w, β) = N(t | y(x, w), β^{-1}) In the above equation, β is the precision of the Gaussian, and is defined as the inverse of the variance: β = 1/σ^2. Given our estimate w_ML, we can also estimate the precision β and the variance σ^2: σ^2_ML = 1/β_ML = (1/N) Σ_{n=1}^{N} (t_n - w_ML^T φ(x_n))^2 43

Least Squares Solution - Summary w = (w_0, w_1, …, w_{M-1})^T, t = (t_1, t_2, …, t_N)^T, and Φ is the N×M design matrix whose n-th row is [φ_0(x_n), φ_1(x_n), …, φ_{M-1}(x_n)]. Given the above notation: w_ML = (Φ^T Φ)^{-1} Φ^T t, and σ^2_ML = 1/β_ML = (1/N) Σ_{n=1}^{N} (t_n - w_ML^T φ(x_n))^2 44
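
Putting the summary together, here is a rough NumPy sketch (not the course's code) that builds Φ for polynomial basis functions, computes w_ML = (Φ^T Φ)^{-1} Φ^T t by solving the corresponding linear system, and then estimates σ^2_ML. The synthetic data and the choice M = 4 are only for illustration.

```python
import numpy as np

def fit_least_squares(x, t, M):
    # Design matrix for polynomial basis functions phi_j(x) = x^j (illustrative choice)
    Phi = np.vander(x, M, increasing=True)              # N x M
    # w_ML = (Phi^T Phi)^{-1} Phi^T t, computed by solving (Phi^T Phi) w = Phi^T t
    w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
    # sigma^2_ML = 1/beta_ML = (1/N) sum_n (t_n - w_ML^T phi(x_n))^2
    sigma2_ml = np.mean((t - Phi @ w_ml)**2)
    return w_ml, sigma2_ml

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=30)   # synthetic data
w_ml, sigma2_ml = fit_least_squares(x, t, M=4)
print(w_ml, sigma2_ml)
```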

Numerical Issues Black circles: training data. Green curve: true function y(x) = sin(2πx). Red curve: top figure, fitted 7-degree polynomial; bottom figure, fitted 8-degree polynomial. Do you see anything strange? 45

Numerical Issues The 8-degree polynomial fits the data worse than the 7-degree polynomial. Is this possible? 46

Numerical Issues The 8-degree polynomial fits the data worse than the 7-degree polynomial. Mathematically, this is impossible. 8-degree polynomials are a superset of 7-degree polynomials. Thus, the best-fitting 8-degree polynomial cannot fit the data worse than the best-fitting 7-degree polynomial. 47

Numerical Issues The 8-degree polynomial fits the data worse than the 7-degree polynomial. Mathematically, this is impossible. Cause of the problem: numerical issues. Computing the inverse of Φ^T Φ may produce a result that is far from the correct one. 48

Numerical Issues Work-around in Matlab: inv(phi' * phi) leads to the incorrect 8-degree polynomial fit below. pinv(phi' * phi) leads to the correct 8-degree polynomial fit above (almost identical to the 7-degree result). 49
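
The slide gives the work-around in Matlab. A rough NumPy analogue of the same idea (my own sketch, not from the course): the explicit inverse of Φ^T Φ can be numerically unreliable when the matrix is ill-conditioned, while the pseudo-inverse or a least-squares solver behaves better.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=10)

M = 9                                     # 8-degree polynomial: basis x^0 .. x^8
Phi = np.vander(x, M, increasing=True)

print(np.linalg.cond(Phi.T @ Phi))        # a huge condition number signals trouble

# Explicit inverse: may be far from the correct solution when ill-conditioned
w_inv = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ t

# Pseudo-inverse (analogous to Matlab's pinv) and a least-squares solver are more robust
w_pinv = np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ t
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```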

Sequential Learning The w_ML estimate was obtained by processing the training data as a batch. We use all the data at once to estimate w_ML. However, batch processing can be computationally expensive for large datasets. It involves multiplications of M×N matrices. An alternative is sequential learning. This is also called online learning. In this scenario: We first, somehow, get an initial estimate w^(0). Then, we observe training examples, one by one. Every time we observe a new training example, we update the estimate. 50

Sequential Learning We first, somehow, get an initial estimate w^(0). That can be obtained, for example, by computing w_ML using the first few training examples as a batch. Or, we initialize w to a random value. Then, we observe training examples, one by one. When we observe the n-th training example, we update the estimate from w^(τ) to w^(τ+1). Remember that the n-th training example contributes to the overall error E_D(w) a term E_{D,n}(w) defined as: E_{D,n}(w) = (t_n - w^T φ(x_n))^2 51

Sequential Learning The n-th training example contributes to the overall error E_D(w) a term E_{D,n}(w) defined as: E_{D,n}(w) = (t_n - w^T φ(x_n))^2 When we observe the n-th training example, we update the estimate from w^(τ) to w^(τ+1) as follows: w^(τ+1) = w^(τ) - η ∇E_{D,n}(w^(τ)) = w^(τ) + η (t_n - (w^(τ))^T φ(x_n)) φ(x_n) (the constant factor from the gradient is absorbed into η). η is called the learning rate. It is picked manually. This whole process is called stochastic gradient descent. 52

Sequential Learning - Intuition When we observe the n-th training example, we update the estimate from w^(τ) to w^(τ+1) as follows: w^(τ+1) = w^(τ) - η ∇E_{D,n}(w^(τ)) = w^(τ) + η (t_n - (w^(τ))^T φ(x_n)) φ(x_n) What is the intuition behind this update? The gradient ∇E_{D,n}(w) is a vector that points in the direction where E_{D,n}(w) increases. Therefore, subtracting a very small amount of ∇E_{D,n}(w) from w should make E_{D,n}(w) a little bit smaller. 53

Sequential Learning - Intuition When we observe the n-th training example, we update the estimate from w^(τ) to w^(τ+1) as follows: w^(τ+1) = w^(τ) - η ∇E_{D,n}(w^(τ)) = w^(τ) + η (t_n - (w^(τ))^T φ(x_n)) φ(x_n) Choosing a good value for η is important. If η is too small, the minimization may happen too slowly and require too many training examples. If η is too large, the update may overfit the most recent training example, and overall w may fluctuate too much from one update to the next. Unfortunately, picking a good η is more of an art than a science, and involves trial and error. 54
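
Here is a minimal sketch of the stochastic gradient descent update described above (my own illustration; the polynomial basis, learning rate, number of passes, and random initialization are all arbitrary choices):

```python
import numpy as np

def sgd_linear_regression(x, t, M, eta=0.1, epochs=50, seed=0):
    # Sequential learning with the update
    #   w <- w + eta * (t_n - w^T phi(x_n)) * phi(x_n)
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 0.1, size=M)          # random initial estimate w^(0)
    Phi = np.vander(x, M, increasing=True)    # polynomial basis (illustrative)
    for _ in range(epochs):
        for n in rng.permutation(len(x)):     # visit training examples one by one
            phi_n = Phi[n]
            w = w + eta * (t[n] - w @ phi_n) * phi_n
    return w

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=50)
print(sgd_linear_regression(x, t, M=4))
```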

Regularized Least Squares We estimated w_ML by minimizing the sum-of-squares error E_D(w): E_D(w) = (1/2) Σ_{n=1}^{N} (t_n - w^T φ(x_n))^2 We have seen in Chapter 1 the regularized version of this error: E_D(w) = (1/2) Σ_{n=1}^{N} (t_n - w^T φ(x_n))^2 + (λ/2) w^T w 55

Solving Regularized Least Squares E_D(w) = (1/2) Σ_{n=1}^{N} (t_n - w^T φ(x_n))^2 + (λ/2) w^T w The value of w that minimizes E_D(w) is: w = (λI + Φ^T Φ)^{-1} Φ^T t In the above equation, I is the M×M identity matrix. 56
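
A corresponding sketch for the regularized solution (again my own illustration, with polynomial basis functions and arbitrary λ values) solves (λI + Φ^T Φ) w = Φ^T t directly:

```python
import numpy as np

def fit_regularized_least_squares(x, t, M, lam):
    # w = (lambda * I + Phi^T Phi)^{-1} Phi^T t
    Phi = np.vander(x, M, increasing=True)    # polynomial basis (illustrative)
    A = lam * np.eye(M) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=10)
for lam in (1e-8, 1e-3, 1.0):
    print(lam, fit_regularized_least_squares(x, t, M=9, lam=lam))
```

Larger λ shrinks the weights toward zero, which corresponds to the flatter fits discussed on the following slides.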

Why Use Regularization? E_D(w) = (1/2) Σ_{n=1}^{N} (t_n - w^T φ(x_n))^2 + (λ/2) w^T w We use regularization when we know, in advance, that some solutions should be preferred over other solutions. Then, we add to the error function a term that penalizes undesirable solutions. In the linear regression case, people know (from past experience) that large weight values are indicative of overfitting. So, for each weight w_i we add to the error a penalty proportional to (w_i)^2, which punishes large weights much more heavily than small ones. 57

Impact of Parameter λ 58

Impact of Parameter λ 59

Impact of Parameter λ As the previous figures show: Low values of λ lead to polynomials whose values fluctuate more and more rapidly. This can lead to increased overfitting. High values of λ lead to flatter and flatter polynomials that look more and more like straight lines. This can lead to increased underfitting, i.e., not fitting the data sufficiently. 60