Linear Models for Regression CS534
Prediction Problems
Predict housing price based on: house size, lot size, location, # of rooms
Predict stock price based on: price history of the past month
Predict the abundance of a species based on: environmental conditions
General setup: given a set of training examples $(\mathbf{x}_i, t_i)$, $i = 1, \ldots, N$
Goal: learn a function $\hat{y}(\mathbf{x})$ to minimize some loss function $L(\hat{y}, t)$
An example regression problem
We want to learn to predict a person's height based on his/her knee height and/or arm span
This is useful for patients who are bed-bound and cannot stand to take an accurate measurement of their height
Training data: some measurements taken from past classes
Target function
Let $y$ represent a person's height, and let $\mathbf{x}$ represent the measurements we use to predict $y$
Here $\mathbf{x}$ contains two measurements, referred to as features: $\mathbf{x} = (x_1, x_2)^T$, where $x_1$ is knee height and $x_2$ is arm span
Hypothesis space
Linear functions (thus the name linear regression): $y = w_1 x_1 + w_2 x_2 + b$
To simplify, we rename $b$ to $w_0$ and define $x_0 = 1$, so that $y = w_0 x_0 + w_1 x_1 + w_2 x_2$
Let $\mathbf{w} = (w_0, w_1, w_2)^T$ and $\mathbf{x} = (x_0, x_1, x_2)^T$; then $y = \mathbf{w}^T \mathbf{x}$
Mean Squared Loss
Given a set of training examples $(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), \ldots, (\mathbf{x}_N, t_N)$, our hypothesis space considers functions of the form $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \mathbf{x}$
Loss function: $E(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N} \left(\mathbf{w}^T \mathbf{x}_i - t_i\right)^2$
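As a minimal sketch (not part of the original slides), this loss can be computed with NumPy; the variable names `X`, `t`, and `w` are illustrative:

```python
import numpy as np

def sum_of_squares_loss(w, X, t):
    """E(w) = 1/2 * sum_i (w^T x_i - t_i)^2.

    X: (N, d+1) matrix with one example per row (first column is x_0 = 1)
    t: (N,) target vector
    w: (d+1,) weight vector
    """
    residuals = X @ w - t          # w^T x_i - t_i for every example
    return 0.5 * np.sum(residuals ** 2)
```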
Optimization of the loss function
There are many ways to optimize this objective function
One of the simplest approaches is called gradient descent
The idea is quite simple:
Start with some random guess of the parameters
Improve the parameters by following the steepest descent direction
Gradient descent
1. Start from some initial guess
2. Find the direction of steepest descent (the opposite of the gradient direction)
3. Take a step in that direction
4. Repeat until no local improvement is possible
Different starting points may lead to different local minima
Batch Gradient Descent for LMS
Start with some initial guess
Take a gradient descent step: $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E(\mathbf{w}) = \mathbf{w} - \eta \sum_{i=1}^{N} \left(\mathbf{w}^T\mathbf{x}_i - t_i\right)\mathbf{x}_i$
Repeat until convergence
Convergence is achieved when the magnitude of the gradient goes to zero
The learning rate $\eta$ controls how fast it converges
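A hedged sketch of batch gradient descent for this objective; the learning rate, tolerance, and iteration cap are illustrative choices, not values from the slides:

```python
import numpy as np

def batch_gradient_descent(X, t, eta=1e-3, tol=1e-6, max_iters=10000):
    """Minimize E(w) = 1/2 * sum_i (w^T x_i - t_i)^2 by batch gradient descent."""
    w = np.zeros(X.shape[1])                   # initial guess
    for _ in range(max_iters):
        grad = X.T @ (X @ w - t)               # gradient of E(w) over the full batch
        if np.linalg.norm(grad) < tol:         # converged: gradient magnitude near zero
            break
        w -= eta * grad                        # step in the direction opposite the gradient
    return w
```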
Stochastic gradient descent
Start with some initial guess
Loop until converged:
  for m = 1 to N:
    for j = 0, 1, \ldots, d: $\;w_j \leftarrow w_j - \eta\left(\mathbf{w}^T\mathbf{x}_m - t_m\right)x_{mj}$
Stochastic gradient descent takes an updating step every time it sees a single training example
Appropriate for very large data sets; in such cases, it often converges faster than batch gradient descent
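A corresponding sketch of stochastic gradient descent, updating after each single example; the number of passes and the learning rate are illustrative:

```python
import numpy as np

def stochastic_gradient_descent(X, t, eta=1e-3, n_passes=50):
    """SGD for the least-squares objective: one update per training example."""
    w = np.zeros(X.shape[1])                       # initial guess
    n = X.shape[0]
    for _ in range(n_passes):                      # loop "until converged" (fixed passes here)
        for m in np.random.permutation(n):         # visit examples one at a time
            error = X[m] @ w - t[m]                # w^T x_m - t_m
            w -= eta * error * X[m]                # update every weight w_j at once
    return w
```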
Alternative way of optimizing
We will now represent $\mathbf{w}$, the inputs, and the targets in matrix/vector form. Define
$X = \begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_N^T \end{pmatrix}, \quad Y = \begin{pmatrix} t_1 \\ \vdots \\ t_N \end{pmatrix}$
Setting the gradient of the loss to zero gives the normal equation: $X^T X \mathbf{w} = X^T Y$
The normal equation: $X^T X \mathbf{w} = X^T Y$
A set of $d+1$ linear equations with $d+1$ variables
Existence: there is always a solution to this equation
Uniqueness: the solution is unique if the columns of $X$ are linearly independent, i.e., the features are linearly independent, so that $X^T X$ is invertible
In this case, we can directly solve for $\mathbf{w}$: $\;\mathbf{w} = (X^T X)^{-1} X^T Y$
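A minimal sketch of solving the normal equation directly (assuming $X^T X$ is invertible); using `np.linalg.solve` rather than forming an explicit inverse is a standard numerical choice, and the usage comment below refers to hypothetical variables for the height example:

```python
import numpy as np

def fit_normal_equation(X, t):
    """Solve X^T X w = X^T t for w, assuming the columns of X are linearly independent."""
    return np.linalg.solve(X.T @ X, X.T @ t)

# Hypothetical usage for the two-feature height problem (x_0 = 1 prepended):
# X = np.column_stack([np.ones(len(knee)), knee, arm_span])
# w = fit_normal_equation(X, heights)
```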
Probabilistic Interpretation of the Least Squares Objective
One might ask, why is the least squares objective a good choice?
There is a natural probabilistic explanation for this objective when we make some simple assumptions
Let's assume that the target variable $t$ and the input vector $\mathbf{x}$ are related as follows: $t = \mathbf{w}^T\mathbf{x} + \epsilon$, where $\epsilon$ is an error term distributed according to a Gaussian distribution with zero mean and some variance $\sigma^2$, i.e., $\epsilon \sim \mathcal{N}(0, \sigma^2)$
This is equivalent to assuming $t \sim \mathcal{N}(\mathbf{w}^T\mathbf{x}, \sigma^2)$
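To make the assumption concrete, here is a small sketch (with made-up weights and noise level) that generates targets according to $t = \mathbf{w}^T\mathbf{x} + \epsilon$ with Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, 0.5, -0.3])          # illustrative "true" weights (w_0, w_1, w_2)
sigma = 0.1                                   # illustrative noise standard deviation

X = np.column_stack([np.ones(100), rng.uniform(size=(100, 2))])  # x_0 = 1 plus two features
epsilon = rng.normal(0.0, sigma, size=100)    # epsilon ~ N(0, sigma^2)
t = X @ w_true + epsilon                      # equivalently, t ~ N(w^T x, sigma^2)
```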
Likelihood Function
Now we are given the observed data $(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), \ldots, (\mathbf{x}_N, t_N)$
We assume the error terms of these examples are independently and identically distributed (i.i.d.)
Given the inputs $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$, what is the probability that their target outputs are $t_1, t_2, \ldots, t_N$?
$p(Y \mid X; \mathbf{w}) = \prod_{i=1}^{N} p(t_i \mid \mathbf{x}_i; \mathbf{w})$
One can view this as a function of $Y$ for fixed $\mathbf{w}$ and $X$
Alternatively, this can also be viewed as a function of $\mathbf{w}$, with fixed $Y$ and $X$; this is called the likelihood function $L(\mathbf{w})$
Maximum likelihood principle
Maximum likelihood estimation seeks to find the parameter that maximizes the likelihood function
It is often easier to work with the log of the likelihood function:
$\log L(\mathbf{w}) = \sum_{i=1}^{N} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(t_i - \mathbf{w}^T\mathbf{x}_i\right)^2}{2\sigma^2}\right) = N \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(t_i - \mathbf{w}^T\mathbf{x}_i\right)^2$
Therefore $\arg\max_{\mathbf{w}} L(\mathbf{w}) = \arg\min_{\mathbf{w}} \frac{1}{2}\sum_{i=1}^{N}\left(t_i - \mathbf{w}^T\mathbf{x}_i\right)^2$
Background: Maximum Likelihood Estimation
Parameter Estimation
Given observations of some random variable(s), and assuming the variable(s) follow some fixed distribution, how do we estimate the parameters?
Examples:
Coin tosses (Bernoulli distribution; parameter: $p$, the probability of heads)
Dice or grades (discrete distribution; parameters: $p_i$ for $i = 1, \ldots, k-1$, with $k$ categories)
Height and weight of a person (continuous variables; the parameters depend on the distribution, e.g., for a Gaussian distribution the parameters are $\mu$, the mean, and $\Sigma$, the covariance)
Maximum Likelihood Principle
We will use a general term $\theta$ to denote the model (parameters) we try to estimate
We will use a general term $D$ to denote the data that we observe; $D$ is a set of examples, and $D_j$ denotes the $j$-th example in $D$
Assuming a fixed model $\theta$, what is the probability of observing $D$? This can be written as $P(D; \theta)$
The maximum likelihood principle seeks to find $\theta$ that maximizes this probability
$P(D; \theta)$ is called the likelihood function: $L(\theta) = P(D; \theta) = \prod_j P(D_j; \theta)$; here we are making the i.i.d. assumption, that examples in $D$ are independently and identically distributed, thus we can use the product
It is often more convenient to work with the log of $L(\theta)$: $\log L(\theta) = \sum_j \log P(D_j; \theta)$
Example: Coin Toss
You are given a coin, and want to determine the probability of heads $p$ for this coin
You toss it 500 times and observe heads 242 times
What is your estimate of $p$? $\;\hat{p} = \frac{242}{500} = 0.484$
But why? This is doing maximum likelihood estimation
In other words, $p = 0.484$ leads to the highest probability of observing 242 heads out of 500 tosses
Now let's go over the math to convince ourselves
Example Cont.
We have $D = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the outcome of the $i$-th coin toss: $x_i = 1$ means heads, and $x_i = 0$ means tails
Let's write down the likelihood function: $L(p) = P(D; p) = \prod_{i=1}^{n} P(x_i; p)$
For a binary variable, we have a nice compact form for its probability mass function: $P(x_i; p) = p^{x_i}(1-p)^{1-x_i}$
So we have $L(p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{n_H}(1-p)^{n_T}$
Take the log: $\log L(p) = n_H \log p + n_T \log(1-p)$, where $n_H = \sum_{i=1}^{n} x_i$ and $n_T = n - n_H$
Maximizing the likelihood
We take the derivative of $\log L(p)$: $\;\frac{d}{dp}\log L(p) = \frac{n_H}{p} - \frac{n_T}{1-p}$
Setting it to zero: $\;\frac{n_H}{p} - \frac{n_T}{1-p} = 0$
Solving this leads to: $\;\hat{p} = \frac{n_H}{n_H + n_T}$
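A quick numerical check of this result (a sketch, not from the slides): evaluating the log-likelihood on a grid of $p$ values shows the maximum sits at $n_H / n$:

```python
import numpy as np

n_H, n_T = 242, 258                              # 242 heads out of 500 tosses
p_grid = np.linspace(0.01, 0.99, 981)            # grid of candidate p values
log_lik = n_H * np.log(p_grid) + n_T * np.log(1 - p_grid)

p_hat = p_grid[np.argmax(log_lik)]
print(p_hat)                                     # approximately 0.484 = 242/500
```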
Revisit example: Polynomial Curve Fitting
In this example, there is only one feature $x$. We learn an $M$-th order polynomial function.
Alternatively, we could view this as learning a linear model that uses $1, x, x^2, \ldots, x^M$ as the features.
Note that this new feature space is derived from the original feature $x$. We refer to these derived features as basis functions.
Linear Basis Function Models (1)
Generally, the linear models we have been discussing can be used to learn nonlinear functions when using appropriate basis functions:
$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$
where the $\phi_j$'s are known as basis functions. Typically $\phi_0(\mathbf{x}) = 1$, so that $w_0$ acts as the bias term.
Simplest case: use linear basis functions, $\phi_j(\mathbf{x}) = x_j$
Using nonlinear basis functions allows us to use the basic linear regression algorithm to learn nonlinear functions
Linear Basis Function Models (2)
Polynomial basis functions: $\phi_j(x) = x^j$
These are global basis functions: a small change in $x$ affects all basis functions.
Linear Basis Function Models (3)
Gaussian basis functions: $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$
These are local: a small change in $x$ only affects nearby basis functions
The $\mu_j$'s and $s$ control the location and scale (width)
Linear Basis Function Models (4)
Sigmoidal basis functions: $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$
Also local: a small change in $x$ only affects nearby basis functions
The $\mu_j$'s and $s$ control location and scale (slope)
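A sketch of building the design matrix $\Phi$ for two of these basis-function families; the centers and width used here are illustrative choices:

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Columns are phi_j(x) = x^j for j = 0, ..., M (global basis functions)."""
    return np.column_stack([x ** j for j in range(M + 1)])

def gaussian_design_matrix(x, centers, s):
    """Columns are phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) (local basis functions),
    with a leading column of ones for the bias term."""
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s ** 2))
    return np.column_stack([np.ones(len(x)), phi])

# Hypothetical usage:
# x = np.linspace(0, 1, 50)
# Phi = gaussian_design_matrix(x, centers=np.linspace(0, 1, 9), s=0.1)
```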
Overfitting issue
What can we do to curb overfitting?
Use a less complex model
Use more training examples
Regularization
Regularized Least Squares (1)
Consider the error function: data term + regularization term (which penalizes complex models)
With the sum-of-squares error function and a quadratic regularizer, we get
$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$
which is minimized by
$\mathbf{w} = \left(\lambda I + \Phi^T\Phi\right)^{-1}\Phi^T \mathbf{t}$
The quadratic regularizer encourages small weight values; $\lambda$ is called the regularization coefficient.
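A minimal sketch of the regularized (ridge) solution above, given a design matrix Phi built from any of the basis functions; the lambda value shown is illustrative:

```python
import numpy as np

def fit_ridge(Phi, t, lam=1e-2):
    """Minimize 1/2 * sum_n (t_n - w^T phi(x_n))^2 + lam/2 * w^T w.
    Closed form: w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    d = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(d) + Phi.T @ Phi, Phi.T @ t)
```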
Regularized Least Squares (2)
With a more general regularizer, we have
$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 + \frac{\lambda}{2}\sum_{j}|w_j|^q$
$q = 1$ gives the Lasso; $q = 2$ gives the quadratic (ridge) regularizer
Note this is equivalent to minimizing the sum-of-squares loss subject to a constraint that $\sum_j |w_j|^q$ is bounded
Regularized Least Squares (3)
The Lasso tends to generate sparser solutions (the majority of the weights shrink to exactly zero) than a quadratic regularizer.
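One way to see the sparsity effect in practice (a sketch assuming scikit-learn is available; the data and alpha values are illustrative) is to compare the fitted coefficients of Lasso and ridge on the same data:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                   # only 3 informative features
t = X @ w_true + rng.normal(0.0, 0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, t)
ridge = Ridge(alpha=0.1).fit(X, t)
print("nonzero Lasso weights:", np.sum(np.abs(lasso.coef_) > 1e-6))  # typically close to 3
print("nonzero ridge weights:", np.sum(np.abs(ridge.coef_) > 1e-6))  # typically all 20
```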
What we are doing next
We have seen that model complexity can be controlled by choosing different basis functions or by adjusting the parameter $\lambda$ under the regularization framework (larger $\lambda$ means less complexity)
By controlling model complexity (via different mechanisms), we are trading off the risk of over-fitting against the risk of using an oversimplified model that is not flexible enough to capture the underlying function (under-fitting)
In the next few slides, we will see some theoretical analysis to understand this trade-off from a statistical point of view
In particular, we will analyze the expected loss of our prediction, and see how it can be decomposed
Analyzing the Expected Loss
Let $y(\mathbf{x})$ be our estimate of the true value $t$, and let $h(\mathbf{x}) = E[t \mid \mathbf{x}]$ denote the optimal prediction
The expected loss is then written as:
$E[L] = \iint \left(y(\mathbf{x}) - t\right)^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt$
Writing $y(\mathbf{x}) - t = \left(y(\mathbf{x}) - h(\mathbf{x})\right) + \left(h(\mathbf{x}) - t\right)$ and expanding the square:
$\left(y(\mathbf{x}) - t\right)^2 = \left(y(\mathbf{x}) - h(\mathbf{x})\right)^2 + 2\left(y(\mathbf{x}) - h(\mathbf{x})\right)\left(h(\mathbf{x}) - t\right) + \left(h(\mathbf{x}) - t\right)^2$
The expectation of the cross term is 0
The Bias Variance Decomposition (1)
Consider the expected squared loss:
$E[L] = \int \left(y(\mathbf{x}) - h(\mathbf{x})\right)^2 p(\mathbf{x})\, d\mathbf{x} + \iint \left(h(\mathbf{x}) - t\right)^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt$
The second term of $E[L]$ corresponds to the noise inherent in the target variable. This is part of the data; we have no control over it.
What about the first term?
The Bias Variance Decomposition (2)
Suppose we were given multiple data sets, each of size $N$. Any particular training data set $D$ will give a particular function $y(\mathbf{x}; D)$.
Let $\bar{y}(\mathbf{x}) = E_D[y(\mathbf{x}; D)]$; the quantity inside the first term can be written as:
$\left(y(\mathbf{x}; D) - h(\mathbf{x})\right)^2 = \left(y(\mathbf{x}; D) - \bar{y}(\mathbf{x})\right)^2 + \left(\bar{y}(\mathbf{x}) - h(\mathbf{x})\right)^2 + 2\left(y(\mathbf{x}; D) - \bar{y}(\mathbf{x})\right)\left(\bar{y}(\mathbf{x}) - h(\mathbf{x})\right)$
Taking the expectation over $D$, the last term goes to zero
The Bias Variance Decomposition (3)
Taking the expectation over $D$ yields
$E_D\left[\left(y(\mathbf{x}; D) - h(\mathbf{x})\right)^2\right] = \left(\bar{y}(\mathbf{x}) - h(\mathbf{x})\right)^2 + E_D\left[\left(y(\mathbf{x}; D) - \bar{y}(\mathbf{x})\right)^2\right]$
(the first term is the squared bias, the second the variance)
The Bias Variance Decomposition (4)
Thus we can write
expected loss = $(\text{bias})^2$ + variance + noise
where
$(\text{bias})^2 = \int \left(\bar{y}(\mathbf{x}) - h(\mathbf{x})\right)^2 p(\mathbf{x})\, d\mathbf{x}$
$\text{variance} = \int E_D\left[\left(y(\mathbf{x}; D) - \bar{y}(\mathbf{x})\right)^2\right] p(\mathbf{x})\, d\mathbf{x}$
$\text{noise} = \iint \left(h(\mathbf{x}) - t\right)^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt$
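The decomposition can be checked empirically. Below is a hedged simulation sketch in the spirit of the sinusoidal example that follows: it draws many small data sets from $\sin(2\pi x)$ plus noise, fits a regularized model to each, and estimates the $(\text{bias})^2$ and variance terms by averaging over data sets. All sizes, the noise level, and the lambda value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_datasets, n_points, lam = 100, 25, 1e-2
x_test = np.linspace(0, 1, 200)
h = np.sin(2 * np.pi * x_test)                       # "true" function h(x)

def gaussian_phi(x, centers=np.linspace(0, 1, 24), s=0.1):
    """Design matrix with 24 Gaussian basis functions plus a bias column."""
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s ** 2))
    return np.column_stack([np.ones(len(x)), phi])

predictions = []
for _ in range(n_datasets):                          # one fitted y(x; D) per data set D
    x = rng.uniform(size=n_points)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n_points)
    Phi = gaussian_phi(x)
    w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
    predictions.append(gaussian_phi(x_test) @ w)

y = np.array(predictions)                            # shape (n_datasets, len(x_test))
y_bar = y.mean(axis=0)                               # average prediction over data sets
bias_sq = np.mean((y_bar - h) ** 2)                  # (bias)^2 averaged over x
variance = np.mean((y - y_bar) ** 2)                 # variance averaged over D and x
print(bias_sq, variance)
```

Re-running with a larger lambda should increase the bias term and shrink the variance term, matching the trade-off described in the following slides.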
The Bias Variance Decomposition (5)
Example: 25 training data sets generated from the sinusoidal function, varying the degree of regularization $\lambda$. The model uses 24 Gaussian basis functions.
The Bias Variance Decomposition (6)
The Bias Variance Decomposition (7)
The Bias Variance Trade-off
From these plots, we note that an over-regularized model (large $\lambda$) will have a high bias, while an under-regularized model (small $\lambda$) will have a high variance.
Summary
Linear regression with the least squares objective
Gradient descent algorithm for optimization
Closed-form solution by solving the normal equation
Can be viewed as maximum likelihood estimation assuming i.i.d. Gaussian noise with zero mean
Using basis functions allows us to learn nonlinear functions
Controlling over-fitting by regularization
Ridge regression is equivalent to adding a diagonal matrix $\lambda I$ to $\Phi^T\Phi$
LASSO encourages sparse solutions