Linear Models for Regression
CSE 4309 Machine Learning
Vassilis Athitsos
Computer Science and Engineering Department
University of Texas at Arlington
The Regression Problem
Training data: a set of input-output pairs $(x_n, t_n)$, where $x_n$ is the n-th training input and $t_n$ is the target output for $x_n$.
Goal: learn a function $y(x)$ that can predict the target value $t$ for a new input $x$.
So far, this is the standard definition of a generic supervised learning problem. What differentiates regression problems is that the target outputs come from a continuous space.
A Regression Example
Circles: training data.
Red curve: one possible solution, a line.
Green curve: another possible solution, a cubic polynomial.
Linear Models for Regression
$y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)$
The functions $\phi_j$ are called basis functions. You must decide what these functions should be before you start training; they can be any functions you want.
The parameters $w_j$ are weights. They are real numbers. The goal of linear regression is to estimate these weights; the output of training is the values of these weights.
The Dummy $\phi_0$ Function
$y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)$
To simplify notation, we define a "dummy" basis function $\phi_0(x) = 1$. Then, the above formula becomes:
$y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x)$
Dot Product Version
$y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)$
$w$ is a column vector of weights: $w = (w_0, w_1, \ldots, w_{M-1})^T$.
$w^T$ is the transpose of $w$: $w^T = [w_0, w_1, \ldots, w_{M-1}]$.
$\phi(x)$ is a column vector of basis function values: $\phi(x) = (\phi_0(x), \phi_1(x), \ldots, \phi_{M-1}(x))^T$.
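To make the dot product version concrete, here is a minimal NumPy sketch (not part of the original slides; the basis functions and weights are made-up examples) that evaluates $y(x, w) = w^T \phi(x)$:

```python
import numpy as np

# Hypothetical basis functions: phi_0 is the dummy function, the others are powers of x.
basis_functions = [lambda x: 1.0, lambda x: x, lambda x: x ** 2]

def predict(x, w):
    """Evaluate y(x, w) = w^T phi(x) for a scalar input x."""
    phi_x = np.array([phi(x) for phi in basis_functions])  # the column vector phi(x)
    return w @ phi_x                                        # the dot product w^T phi(x)

w = np.array([0.5, -1.0, 2.0])   # example weights w_0, w_1, w_2
print(predict(1.5, w))           # 0.5 - 1.0*1.5 + 2.0*1.5^2 = 3.5
```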
Common Choices for Basis Functions
Gaussian basis functions:
$\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2s^2} \right)$
Common Choices for Basis Functions
Sigmoidal basis functions:
$\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + e^{-a}}$
$\sigma(a)$ is called the logistic sigmoid function. We will see it again in neural networks.
Common Choices for Basis Functions
Polynomial basis functions, for example powers of $x$: $\phi_j(x) = x^j$.
If the basis functions are powers of $x$, then the regression process fits a polynomial to the data. In other words, the regression process estimates the parameters $w_j$ of a polynomial of degree $M-1$:
$y(x, w) = \sum_{j=0}^{M-1} w_j x^j$
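For reference, here is a small NumPy sketch (an illustration, not from the slides) of the three basis-function families above; the centers $\mu_j$ and scale $s$ used in the example calls are arbitrary choices:

```python
import numpy as np

def gaussian_basis(x, mu, s):
    """Gaussian basis function: exp(-(x - mu)^2 / (2 s^2))."""
    return np.exp(-((x - mu) ** 2) / (2.0 * s ** 2))

def sigmoidal_basis(x, mu, s):
    """Sigmoidal basis function: the logistic sigmoid applied to (x - mu) / s."""
    a = (x - mu) / s
    return 1.0 / (1.0 + np.exp(-a))

def polynomial_basis(x, j):
    """Polynomial basis function: x raised to the power j."""
    return x ** j

x = np.linspace(-1.0, 1.0, 5)
print(gaussian_basis(x, mu=0.0, s=0.5))
print(sigmoidal_basis(x, mu=0.0, s=0.5))
print(polynomial_basis(x, j=3))
```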
Global Versus Local Functions
Polynomial basis functions are global functions: their values are far from zero over most of the input range from $-\infty$ to $+\infty$. Changing a single weight $w_j$ affects the output $y(x, w)$ over the entire input range.
Gaussian basis functions are local functions: their values are practically (but not mathematically) zero over most of the input range from $-\infty$ to $+\infty$. Changing a single weight $w_j$ has practically no effect on the output $y(x, w)$ outside a specific small interval.
It is often easier to fit data with local basis functions, because each basis function fits a small region of the input space.
Linear Versus Nonlinear Functions
A linear function $y(x, w)$ produces an output that depends linearly on both $x$ and $w$.
Note that polynomial, Gaussian, and sigmoidal basis functions are nonlinear.
If we use nonlinear basis functions, then the regression process produces a function $y(x, w)$ which is linear in $w$ but nonlinear in $x$.
It is important to remember that linear regression can be used to estimate nonlinear functions of $x$. It is called linear regression because $y$ is linear in $w$, NOT because $y$ is linear in $x$.
Solving Regression Problems
There are different methods for solving regression problems. We will study two approaches:
Least squares: find the weights $w$ that minimize the squared error.
Regularized least squares: find the weights $w$ that minimize the squared error plus a regularization penalty, weighted by some hand-picked regularization parameter $\lambda$.
The Gaussian Noise Assumption
Suppose that we want to find the most likely solution: we want to find the weights $w$ that maximize the likelihood of the training data.
In order to do that, we need to make an additional assumption about the process that generates outputs based on inputs. A common approach is to assume a Gaussian noise model:
$t = y(x, w) + \varepsilon$
In words, $t$ is generated by computing $y(x, w)$ and then adding some noise. The noise $\varepsilon$ is a random variable drawn from a zero-mean Gaussian distribution.
The Gaussian Noise Assumption
$t = y(x, w) + \varepsilon$
The noise $\varepsilon$ is a random variable drawn from a zero-mean Gaussian distribution. Therefore:
$p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})$
In the above equation, $\beta$ is called the precision of the Gaussian, and is defined as the inverse of the variance: $\beta = \frac{1}{\sigma^2}$.
The likelihood that input $x$ leads to output $t$ is a Gaussian whose mean is $y(x, w)$ and whose variance is $\beta^{-1}$.
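As a small illustration of this noise model (my own sketch, not from the slides; the "true" function $\sin(2\pi x)$ and the precision $\beta = 25$ are assumptions), the following code generates training pairs according to $t = y(x, w) + \varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10
beta = 25.0                       # assumed precision; the noise variance is 1/beta
x = rng.uniform(0.0, 1.0, N)      # training inputs
y_true = np.sin(2.0 * np.pi * x)  # hypothetical "true" function y(x, w)

# t = y(x, w) + epsilon, with epsilon drawn from a zero-mean Gaussian of variance 1/beta
t = y_true + rng.normal(0.0, np.sqrt(1.0 / beta), N)
print(np.c_[x, t])
```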
Finding the Most Likely Solution
What is the value of $w$ that is most likely given the data?
Suppose we have a set of training inputs $X = \{x_1, \ldots, x_N\}$, and a set of corresponding outputs $t = \{t_1, \ldots, t_N\}$.
We assume that training inputs are independent of each other. We assume that outputs are conditionally independent of each other, given their inputs and the noise parameter $\beta$. Then:
$p(w \mid X, \beta, t) = \frac{p(t \mid X, w, \beta)\, p(w \mid X, \beta)}{p(t \mid X, \beta)}$
Finding the Most Likely Solution
$p(w \mid X, \beta, t) = \frac{p(t \mid X, w, \beta)\, p(w \mid X, \beta)}{p(t \mid X, \beta)}$
We assume that, given $X$ and $\beta$, all values of $w$ are equally likely. Then, $\frac{p(w \mid X, \beta)}{p(t \mid X, \beta)}$ is a constant that does not depend on $w$.
Therefore, finding the $w$ that maximizes $p(w \mid X, \beta, t)$ is the same as finding the $w$ that maximizes $p(t \mid X, w, \beta)$.
$p(t \mid X, w, \beta)$ is the likelihood of the training data. So, to find the most likely answer $w$, we must find the value of $w$ that maximizes the likelihood of the training data.
Likelihood of the Training Data
What is the probability of the training data given $w$?
We assume that outputs are conditionally independent of each other, given their inputs and the noise parameter $\beta$. Then:
$p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1})$
Likelihood of the Training Data
Remember that, using dot product notation, $y(x_n, w) = w^T \phi(x_n)$. Therefore:
$p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1}) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})$
Our goal is to find the weights $w$ that maximize $p(t \mid X, w, \beta)$.
Log Likelihood of the Training Data
$p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})$
Maximizing $p(t \mid X, w, \beta)$ is the same as maximizing $\ln p(t \mid X, w, \beta)$, where $\ln$ is the natural logarithm:
$\ln p(t \mid X, w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})$
Log Likelihood of the Training Data
$\ln p(t \mid X, w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})$
$= \sum_{n=1}^{N} \ln \left( \frac{1}{\sqrt{2\pi\beta^{-1}}} \exp\left( -\frac{(t_n - w^T \phi(x_n))^2}{2\beta^{-1}} \right) \right)$
$= \sum_{n=1}^{N} \ln \left( \sqrt{\frac{\beta}{2\pi}}\, \exp\left( -\frac{\beta\,(t_n - w^T \phi(x_n))^2}{2} \right) \right)$
$= N \ln \sqrt{\frac{\beta}{2\pi}} + \sum_{n=1}^{N} \ln \exp\left( -\frac{\beta\,(t_n - w^T \phi(x_n))^2}{2} \right)$
Log Likelihood of the Training Data
$\ln p(t \mid X, w, \beta) = N \ln \sqrt{\frac{\beta}{2\pi}} + \sum_{n=1}^{N} \ln \exp\left( -\frac{\beta\,(t_n - w^T \phi(x_n))^2}{2} \right)$
$= N \ln \sqrt{\frac{\beta}{2\pi}} - \frac{\beta}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2$
Note that $N \ln \sqrt{\frac{\beta}{2\pi}}$ is independent of $w$. Therefore, to maximize $\ln p(t \mid X, w, \beta)$ we must maximize
$-\frac{\beta}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2$
Log Likelihood and Sum-of-Squares Error
To maximize $\ln p(t \mid X, w, \beta)$ we must maximize:
$-\frac{\beta}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2$
Remember that the sum-of-squares error is defined in the textbook as:
$E_D(w) = \frac{1}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2$
Therefore, we want to maximize $-\beta E_D(w)$, which is the same as saying that we want to minimize $E_D(w)$.
Therefore, we have proven that the $w$ that maximizes the likelihood of the training data is the same $w$ that minimizes the sum-of-squares error.
Minimizing Sum-of-Squares Error
We want to find the $w$ that minimizes:
$E_D(w) = \frac{1}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2$
$E_D(w)$ is a function mapping an M-dimensional vector to a real number. Remember from calculus how to minimize any such function:
Compute the gradient $\nabla E_D(w)$. Here, this gradient is an M-dimensional row vector.
Find the values of $w$ that solve the equation $\nabla E_D(w) = 0$. These values of $w$ are the possible maxima or minima.
Minimizing Sum-of-Squares Error
We want to find the $w$ that minimizes:
$E_D(w) = \frac{1}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2$
To calculate $\nabla E_D(w)$, let's first calculate the gradient $\nabla E_{D,n}(w)$ of the function:
$E_{D,n}(w) = (t_n - w^T \phi(x_n))^2$
$E_{D,n}(w)$ is the composition $f(g(w))$ of:
$f(x) = x^2$
$g(w) = t_n - w^T \phi(x_n)$
According to the chain rule: $\nabla (f \circ g) = f'(g)\, \nabla g$.
Minimizing Sum-of-Squares Error
$E_{D,n}(w) = (t_n - w^T \phi(x_n))^2$
$E_{D,n}(w)$ is the composition $f(g(w))$ of:
$f(x) = x^2$, so $f'(x) = 2x$.
$g(w) = t_n - w^T \phi(x_n)$, so $\nabla g(w) = -\phi(x_n)^T$.
According to the chain rule, $\nabla (f \circ g) = f'(g)\, \nabla g$. Therefore:
$\nabla E_{D,n}(w) = f'(g(w))\, \nabla g(w) = 2\,(t_n - w^T \phi(x_n))\,(-\phi(x_n)^T)$
Minimizing Sum-of-Squares Error
$\nabla E_{D,n}(w) = -2\,(t_n - w^T \phi(x_n))\,\phi(x_n)^T$
$E_D(w) = \frac{1}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2 = \frac{1}{2} \sum_{n=1}^{N} E_{D,n}(w)$
Therefore:
$\nabla E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \nabla E_{D,n}(w) = \frac{1}{2} \sum_{n=1}^{N} \left( -2\,(t_n - w^T \phi(x_n))\,\phi(x_n)^T \right)$
Minimizing Sum-of-Squares Error
$\nabla E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \left( -2\,(t_n - w^T \phi(x_n))\,\phi(x_n)^T \right)$
$= \sum_{n=1}^{N} \left( w^T \phi(x_n) - t_n \right) \phi(x_n)^T$
$= \sum_{n=1}^{N} w^T \phi(x_n)\, \phi(x_n)^T - \sum_{n=1}^{N} t_n\, \phi(x_n)^T$
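Before continuing, this gradient formula can be sanity-checked numerically. The sketch below (my own illustration; the random data and the polynomial basis are arbitrary assumptions) compares the analytic gradient with a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 20
x = rng.uniform(-1.0, 1.0, N)
t = rng.normal(0.0, 1.0, N)
Phi = np.vander(x, M, increasing=True)   # N x M design matrix with phi_j(x) = x^j
w = rng.normal(0.0, 1.0, M)

def E_D(w):
    """Sum-of-squares error E_D(w) = (1/2) sum_n (t_n - w^T phi(x_n))^2."""
    return 0.5 * np.sum((t - Phi @ w) ** 2)

# Analytic gradient: sum_n (w^T phi(x_n) - t_n) phi(x_n)^T
grad_analytic = (Phi @ w - t) @ Phi

# Finite-difference approximation of the same gradient
eps = 1e-6
grad_numeric = np.array([(E_D(w + eps * e) - E_D(w - eps * e)) / (2 * eps)
                         for e in np.eye(M)])
print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be close to zero
```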
Minimizing Sum-of-Squares Error
$\nabla E_D(w) = \sum_{n=1}^{N} w^T \phi(x_n)\, \phi(x_n)^T - \sum_{n=1}^{N} t_n\, \phi(x_n)^T$
We want to solve the equation $\nabla E_D(w) = 0$. We can simplify expressions using vector and matrix notation:
$\Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & & \vdots \\ \phi_0(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}, \qquad t = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{pmatrix}$
Minimizing Sum-of-Squares Error
$\phi(x_n)\, \phi(x_n)^T = \begin{pmatrix} \phi_0(x_n) \\ \vdots \\ \phi_{M-1}(x_n) \end{pmatrix} \left( \phi_0(x_n), \ldots, \phi_{M-1}(x_n) \right) = \begin{pmatrix} \phi_0(x_n)^2 & \phi_0(x_n)\phi_1(x_n) & \cdots & \phi_0(x_n)\phi_{M-1}(x_n) \\ \phi_1(x_n)\phi_0(x_n) & \phi_1(x_n)^2 & \cdots & \phi_1(x_n)\phi_{M-1}(x_n) \\ \vdots & & & \vdots \\ \phi_{M-1}(x_n)\phi_0(x_n) & \phi_{M-1}(x_n)\phi_1(x_n) & \cdots & \phi_{M-1}(x_n)^2 \end{pmatrix}$
Summing these M×M outer products over all N training examples gives:
$\sum_{n=1}^{N} \phi(x_n)\, \phi(x_n)^T = \Phi^T \Phi$
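This identity is easy to verify numerically; a quick sketch (illustrative only, using an arbitrary random design matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(20, 4))   # arbitrary N x M design matrix

# Sum of the outer products phi(x_n) phi(x_n)^T over all rows of Phi
outer_sum = sum(np.outer(row, row) for row in Phi)

print(np.allclose(outer_sum, Phi.T @ Phi))   # True
```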
Minimizing Sum-of-Squares Error
Using this matrix notation, and the facts that $\sum_{n=1}^{N} \phi(x_n)\,\phi(x_n)^T = \Phi^T \Phi$ and $\sum_{n=1}^{N} t_n\, \phi(x_n)^T = t^T \Phi$, the gradient becomes:
$\nabla E_D(w) = \sum_{n=1}^{N} w^T \phi(x_n)\, \phi(x_n)^T - \sum_{n=1}^{N} t_n\, \phi(x_n)^T = w^T \Phi^T \Phi - t^T \Phi$
Minimizing Sum-of-Squares Error
$\nabla E_D(w) = w^T \Phi^T \Phi - t^T \Phi$
$\nabla E_D(w) = 0 \;\Rightarrow\; w^T \Phi^T \Phi = t^T \Phi$
Multiplying both sides on the right by $(\Phi^T \Phi)^{-1}$:
$w^T = t^T \Phi\, (\Phi^T \Phi)^{-1}$
Minimizing Sum-of-Squares Error
$w^T = t^T \Phi\, (\Phi^T \Phi)^{-1}$
Transposing both sides:
$w = \left( t^T \Phi\, (\Phi^T \Phi)^{-1} \right)^T$
Using the rule $(AB)^T = B^T A^T$:
$w = \left( (\Phi^T \Phi)^{-1} \right)^T \Phi^T t$
Using the rule $(A^{-1})^T = (A^T)^{-1}$:
$w = \left( (\Phi^T \Phi)^T \right)^{-1} \Phi^T t$
Using the rule $(AB)^T = B^T A^T$ again, $(\Phi^T \Phi)^T = \Phi^T \Phi$, so:
$w = (\Phi^T \Phi)^{-1} \Phi^T t$
Notation: $w_{ML}$
From the previous slides, we got the formula $w = (\Phi^T \Phi)^{-1} \Phi^T t$.
We denote this value of $w$ as $w_{ML}$, because it is the maximum likelihood estimate of $w$. In other words, $w_{ML}$ is the most likely value of $w$ given the data. So, we rewrite the formula as:
$w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t$
Solving for β
Remember the Gaussian noise model:
$p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})$
In the above equation, $\beta$ is the precision of the Gaussian, and is defined as the inverse of the variance: $\beta = \frac{1}{\sigma^2}$.
Given our estimate $w_{ML}$, we can also estimate the precision $\beta$ and the variance $\sigma^2$:
$\sigma^2_{ML} = \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \left( t_n - w_{ML}^T \phi(x_n) \right)^2$
Least Squares Solution - Summary
$w = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_{M-1} \end{pmatrix}, \qquad t = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{pmatrix}, \qquad \Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & & \vdots \\ \phi_0(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}$
Given the above notation:
$w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t$
$\sigma^2_{ML} = \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \left( t_n - w_{ML}^T \phi(x_n) \right)^2$
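Putting the summary together, here is a minimal NumPy sketch of the least squares solution (my own illustration; the noisy sine data and the polynomial basis are assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical training data: noisy samples of sin(2*pi*x)
N, M = 10, 4
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, N)

# Design matrix Phi with polynomial basis functions phi_j(x) = x^j
Phi = np.vander(x, M, increasing=True)            # N x M

# w_ML = (Phi^T Phi)^-1 Phi^T t
w_ml = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ t

# sigma^2_ML = (1/N) sum_n (t_n - w_ML^T phi(x_n))^2
sigma2_ml = np.mean((t - Phi @ w_ml) ** 2)

print("w_ML =", w_ml)
print("sigma^2_ML =", sigma2_ml)
```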
Numerical Issues
Black circles: training data.
Green curve: true function $y(x) = \sin(2\pi x)$.
Red curve:
Top figure: fitted 7-degree polynomial.
Bottom figure: fitted 8-degree polynomial.
Do you see anything strange?
Numerical Issues
The 8-degree polynomial fits the data worse than the 7-degree polynomial. Is this possible?
Numerical Issues
The 8-degree polynomial fits the data worse than the 7-degree polynomial. Mathematically, this is impossible: polynomials of degree up to 8 are a superset of polynomials of degree up to 7. Thus, the best-fitting 8-degree polynomial cannot fit the data worse than the best-fitting 7-degree polynomial.
Numerical Issues
The 8-degree polynomial fits the data worse than the 7-degree polynomial. Mathematically, this is impossible.
Cause of the problem: numerical issues. Computing the inverse of $\Phi^T \Phi$ may produce a result that is far from the correct one.
Numerical Issues
Work-around in Matlab:
inv(phi' * phi) leads to the incorrect 8-degree polynomial fit below.
pinv(phi' * phi) leads to the correct 8-degree polynomial fit above (almost identical to the 7-degree result).
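The same work-around exists outside Matlab. For example, in NumPy (a sketch of my own, not from the slides; the random data and the 8-degree polynomial basis are assumptions), the pseudo-inverse or a dedicated least-squares solver is typically more reliable than forming the explicit inverse of an ill-conditioned $\Phi^T \Phi$:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, 10)
Phi = np.vander(x, 9, increasing=True)        # design matrix for an 8-degree polynomial

# May be numerically unreliable: explicit inverse of an ill-conditioned Phi^T Phi
w_inv = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ t

# Typically more stable: pseudo-inverse, or a dedicated least-squares solver
w_pinv = np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ t
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

print(w_inv)
print(w_pinv)
print(w_lstsq)
```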
Sequential Learning
The $w_{ML}$ estimate was obtained by processing the training data as a batch: we use all the data at once to estimate $w_{ML}$.
However, batch processing can be computationally expensive for large datasets. It involves multiplications of M×N matrices.
An alternative is sequential learning, also called online learning. In this scenario:
We first, somehow, get an initial estimate $w^{(0)}$.
Then, we observe training examples one by one. Every time we observe a new training example, we update the estimate.
Sequential Learning
We first, somehow, get an initial estimate $w^{(0)}$. That can be obtained, for example, by computing $w_{ML}$ using the first few training examples as a batch. Or, we can initialize $w$ to a random value.
Then, we observe training examples one by one. When we observe the n-th training example, we update the estimate from $w^{(\tau)}$ to $w^{(\tau+1)}$.
Remember that the n-th training example contributes to the overall error $E_D(w)$ a term $E_{D,n}(w)$ defined as:
$E_{D,n}(w) = (t_n - w^T \phi(x_n))^2$
Sequential Learning
The n-th training example contributes to the overall error $E_D(w)$ a term $E_{D,n}(w)$ defined as:
$E_{D,n}(w) = (t_n - w^T \phi(x_n))^2$
When we observe the n-th training example, we update the estimate from $w^{(\tau)}$ to $w^{(\tau+1)}$ as follows:
$w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla E_{D,n}(w^{(\tau)}) = w^{(\tau)} + \eta\, \left( t_n - (w^{(\tau)})^T \phi(x_n) \right) \phi(x_n)$
(The constant factor 2 from differentiating the square is absorbed into $\eta$.)
$\eta$ is called the learning rate. It is picked manually. This whole process is called stochastic gradient descent.
Sequential Learning - Intuition
When we observe the n-th training example, we update the estimate from $w^{(\tau)}$ to $w^{(\tau+1)}$ as follows:
$w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla E_{D,n}(w^{(\tau)}) = w^{(\tau)} + \eta\, \left( t_n - (w^{(\tau)})^T \phi(x_n) \right) \phi(x_n)$
What is the intuition behind this update?
The gradient $\nabla E_{D,n}(w)$ is a vector that points in the direction in which $E_{D,n}(w)$ increases. Therefore, subtracting a very small amount of $\nabla E_{D,n}(w)$ from $w$ should make $E_{D,n}(w)$ a little bit smaller.
Sequential Learning - Intuition
When we observe the n-th training example, we update the estimate from $w^{(\tau)}$ to $w^{(\tau+1)}$ as follows:
$w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla E_{D,n}(w^{(\tau)}) = w^{(\tau)} + \eta\, \left( t_n - (w^{(\tau)})^T \phi(x_n) \right) \phi(x_n)$
Choosing a good value for $\eta$ is important:
If $\eta$ is too small, the minimization may happen too slowly and require too many training examples.
If $\eta$ is too large, the update may overfit the most recent training example, and overall $w$ may fluctuate too much from one update to the next.
Unfortunately, picking a good $\eta$ is more of an art than a science, and involves trial and error. A sketch of the full update loop is shown below.
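Here is a minimal sketch of that update loop (an illustration under my own assumptions: a cubic polynomial basis, a randomly initialized $w^{(0)}$, a stream of noisy sine data, and $\eta = 0.1$):

```python
import numpy as np

rng = np.random.default_rng(5)

M = 4
eta = 0.1                            # learning rate, picked manually
w = rng.normal(0.0, 0.1, M)          # random initial estimate w^(0)

def phi(x):
    """Polynomial basis vector (phi_0(x), ..., phi_{M-1}(x)) = (1, x, x^2, x^3)."""
    return np.array([x ** j for j in range(M)])

# Observe training examples one by one, updating w after each one
for _ in range(1000):
    x_n = rng.uniform(0.0, 1.0)
    t_n = np.sin(2.0 * np.pi * x_n) + rng.normal(0.0, 0.1)   # hypothetical data stream
    phi_n = phi(x_n)
    # w^(tau+1) = w^(tau) + eta * (t_n - w^(tau)^T phi_n) * phi_n
    w = w + eta * (t_n - w @ phi_n) * phi_n

print(w)
```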
Regularized Least Squares
We estimated $w_{ML}$ by minimizing the sum-of-squares error $E_D(w)$:
$E_D(w) = \frac{1}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2$
We have seen in Chapter 1 the regularized version of this error:
$E_D(w) = \frac{1}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2 + \frac{\lambda}{2} w^T w$
Solving Regularized Least Squares
$E_D(w) = \frac{1}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2 + \frac{\lambda}{2} w^T w$
The value of $w$ that minimizes this $E_D(w)$ is:
$w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$
In the above equation, $I$ is the M×M identity matrix.
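A minimal sketch of the regularized solution (illustrative only; the data, the polynomial basis, and $\lambda = 0.01$ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

N, M, lam = 10, 9, 0.01
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, N)
Phi = np.vander(x, M, increasing=True)        # N x M design matrix, phi_j(x) = x^j

# w = (lambda * I + Phi^T Phi)^-1 Phi^T t
w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w_reg)
```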
Why Use Regularization?
$E_D(w) = \frac{1}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2 + \frac{\lambda}{2} w^T w$
We use regularization when we know, in advance, that some solutions should be preferred over other solutions. Then, we add to the error function a term that penalizes undesirable solutions.
In the linear regression case, people know (from past experience) that high values of weights are indicative of overfitting. So, for each weight $w_i$ the error includes a penalty proportional to $(w_i)^2$, which punishes high weight values more heavily.
Impact of Parameter λ
[Figures: fitted polynomials for different values of λ.]
Impact of Parameter λ
As the previous figures show:
Low values of λ lead to polynomials whose values fluctuate more and more rapidly. This can lead to increased overfitting.
High values of λ lead to flatter and flatter polynomials that look more and more like straight lines. This can lead to increased underfitting, that is, not fitting the data sufficiently.