Today. Calculus. Linear Regression. Lagrange Multipliers


Optimization with constraints
What if I want to constrain the parameters of the model? For example, the mean is less than 10.
Find the best likelihood, subject to a constraint.
Two functions:
  An objective function to maximize
  An inequality that must be satisfied

Lagrange Multipliers
Find maxima of $f(x, y)$ subject to a constraint.
$$f(x, y) = x + 2y$$
$$x^2 + y^2 = 1$$

General form
Maximizing: $f(x, y)$
Subject to: $g(x, y) = c$
Introduce a new variable, $\lambda$, and find the stationary points of the Lagrangian.
$$\Lambda(x, y, \lambda) = f(x, y) + \lambda\,(g(x, y) - c)$$

Example
Maximizing: $f(x, y) = x + 2y$
Subject to: $x^2 + y^2 = 1$
Introduce a new variable, $\lambda$, and find the stationary points of the Lagrangian.
$$\Lambda(x, y, \lambda) = x + 2y + \lambda\,(x^2 + y^2 - 1)$$

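Example
$$\frac{\partial \Lambda(x, y, \lambda)}{\partial x} = 1 + 2\lambda x = 0$$
$$\frac{\partial \Lambda(x, y, \lambda)}{\partial y} = 2 + 2\lambda y = 0$$
$$\frac{\partial \Lambda(x, y, \lambda)}{\partial \lambda} = x^2 + y^2 - 1 = 0$$
We now have 3 equations with 3 unknowns.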

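Example
Eliminate $\lambda$:
$$1 = -2\lambda x, \qquad 2 = -2\lambda y$$
$$\frac{1}{x} = -2\lambda = \frac{2}{y} \quad\Rightarrow\quad y = 2x$$
Substitute and solve:
$$x^2 + y^2 = 1$$
$$x^2 + (2x)^2 = 1$$
$$5x^2 = 1$$
$$x = \pm\frac{1}{\sqrt{5}}, \qquad y = \pm\frac{2}{\sqrt{5}}$$
Because $y = 2x$, the signs match. The positive pair gives the maximum, $f = \sqrt{5}$, and the negative pair gives the minimum, $f = -\sqrt{5}$.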
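The result above can be sanity-checked numerically. The sketch below is only an illustration, not part of the original slides: it evaluates $f$ at the two stationary points and compares against a brute-force scan of the unit circle. The sample count of 100000 is an arbitrary assumption.

```python
import math

def f(x, y):
    """Objective from the example: f(x, y) = x + 2y."""
    return x + 2 * y

# Stationary points from the derivation: x = +/- 1/sqrt(5), y = 2x.
candidates = [(s / math.sqrt(5), 2 * s / math.sqrt(5)) for s in (1, -1)]
best = max(candidates, key=lambda p: f(*p))

# Brute-force check: sample points on the unit circle and keep the largest f.
num_samples = 100000  # arbitrary resolution chosen for this sketch
brute = max(
    f(math.cos(2 * math.pi * k / num_samples), math.sin(2 * math.pi * k / num_samples))
    for k in range(num_samples)
)

print("constrained maximum at", best, "with f =", f(*best))
print("brute-force maximum on the circle:", brute)
```

The brute-force value should approach, but never exceed, the analytic maximum.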

Basics of Linear Regression
Regression algorithm
Supervised technique.
In one dimension: identify $y : \mathbb{R} \rightarrow \mathbb{R}$
In D dimensions: identify $y : \mathbb{R}^D \rightarrow \mathbb{R}$
Given training data: $\{x_0, x_1, \ldots, x_N\}$
And targets: $\{t_0, t_1, \ldots, t_N\}$

Graphical Example of Regression
(Figure: targets t plotted against inputs x, with an unknown target marked "t?".)

Definition
In linear regression, we assume that the model that generates the data involves only a linear combination of the input variables.
$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_D x_D$$
$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{D} w_j x_j$$
where $\mathbf{w}$ is a vector of weights that defines the parameters of the model: one weight per input dimension, plus the bias $w_0$.

Evaluation
How can we evaluate the performance of a regression solution?
Error Functions (or Loss functions)
Squared Error:
$$E(t_i, y(x_i, \mathbf{w})) = \frac{1}{2}\,(t_i - y(x_i, \mathbf{w}))^2$$
Linear Error:
$$E(t_i, y(x_i, \mathbf{w})) = |t_i - y(x_i, \mathbf{w})|$$

Regression Error

Empirical Risk
Empirical risk is the measure of the loss from data.
$$R_{emp} = \frac{1}{N} \sum_{i=1}^{N} E(t_i, y(x_i, \mathbf{w})) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2}\,(t_i - y(x_i, \mathbf{w}))^2$$
By minimizing risk on the training data, we optimize the fit with respect to the loss function.
$$\nabla_{\mathbf{w}} R = 0$$

Model Likelihood and Empirical Risk
Two related but distinct ways to look at a model.
1. Model Likelihood: What is the likelihood that a model generated the observed data?
2. Empirical Risk: How much error does the model have on the training data?

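Model Likelihood
$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}(t;\, y(x, \mathbf{w}),\, \beta^{-1}) \quad \text{where } \beta = \frac{1}{\sigma^2}$$
$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{i=1}^{N} \mathcal{N}(t_i;\, y(x_i, \mathbf{w}),\, \beta^{-1})$$
Assuming independent and identically distributed (iid) data.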

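Understanding Model Likelihood
$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{i=1}^{N} \mathcal{N}(t_i;\, y(x_i, \mathbf{w}),\, \beta^{-1})$$
Substitute the equation of a Gaussian:
$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{i=1}^{N} \sqrt{\frac{\beta}{2\pi}}\, \exp\!\left(-\frac{\beta}{2}\,(y(x_i, \mathbf{w}) - t_i)^2\right)$$
Apply a log function:
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \ln \prod_{i=1}^{N} \sqrt{\frac{\beta}{2\pi}}\, \exp\!\left(-\frac{\beta}{2}\,(y(x_i, \mathbf{w}) - t_i)^2\right)$$
Let the log dissolve products into sums:
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{i=1}^{N} (y(x_i, \mathbf{w}) - t_i)^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi$$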

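Understanding Model Likelihood
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{i=1}^{N} (y(x_i, \mathbf{w}) - t_i)^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi$$
Optimize the weights (Maximum Likelihood Estimation):
$$\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\nabla_{\mathbf{w}} \frac{\beta}{2} \sum_{i=1}^{N} (y(x_i, \mathbf{w}) - t_i)^2 = 0$$
Log Likelihood:
$$-\nabla_{\mathbf{w}} \frac{\beta}{2} \sum_{i=1}^{N} (y(x_i, \mathbf{w}) - t_i)^2 = 0$$
Empirical Risk w/ Squared Loss Function:
$$R_{emp} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2}\,(t_i - y(x_i, \mathbf{w}))^2$$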
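The link between the two views can be checked numerically: under the Gaussian model, the log likelihood equals a constant minus $\beta N$ times the empirical risk, so lower risk means higher likelihood. The sketch below is only an illustration; the targets, the two sets of predictions, and $\beta = 4$ are made-up values.

```python
import math

def log_likelihood(ys, ts, beta):
    """Sum of ln N(t_i; y_i, 1/beta) over all data points."""
    total = 0.0
    for y, t in zip(ys, ts):
        total += (0.5 * math.log(beta)
                  - 0.5 * math.log(2 * math.pi)
                  - 0.5 * beta * (y - t) ** 2)
    return total

def empirical_risk(ys, ts):
    """Mean squared-error loss, (1/N) * sum of 0.5 * (t_i - y_i)^2."""
    return sum(0.5 * (t - y) ** 2 for y, t in zip(ys, ts)) / len(ts)

# Made-up targets and two candidate sets of model predictions.
ts = [1.1, 2.9, 5.2, 6.8]
ys_a = [1.0, 3.0, 5.0, 7.0]   # close to the targets
ys_b = [1.5, 2.5, 5.5, 6.5]   # further from the targets
beta = 4.0                     # assumed noise precision
n = len(ts)

# Terms of the log likelihood that do not depend on the predictions.
const = 0.5 * n * math.log(beta) - 0.5 * n * math.log(2 * math.pi)

for name, ys in (("a", ys_a), ("b", ys_b)):
    print(name, empirical_risk(ys, ts), log_likelihood(ys, ts, beta))
```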

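Maximizing Log Likelihood (1-D)
Find the optimal settings of $\mathbf{w}$.
$$\nabla_{\mathbf{w}} R = 0$$
$$R(\mathbf{w}) = \frac{1}{2N} \sum_i (t_i - w_1 x_i - w_0)^2, \qquad \mathbf{w} = (w_0, w_1)^T$$
$$\nabla_{\mathbf{w}} R = \left(\frac{\partial R}{\partial w_0},\; \frac{\partial R}{\partial w_1}\right)^T = (0, 0)^T$$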

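Maximizing Log Likelihood
$$\nabla_{\mathbf{w}} R(\mathbf{w}) = \nabla_{\mathbf{w}} \frac{1}{2N} \sum_i (t_i - w_1 x_i - w_0)^2$$
Partial derivative:
$$\frac{\partial R}{\partial w_0} = \frac{1}{N} \sum_i (t_i - w_1 x_i - w_0)(-1)$$
Set to zero:
$$\frac{1}{N} \sum_i (t_i - w_1 x_i - w_0)(-1) = 0$$
Separate the sum to isolate $w_0$:
$$\frac{1}{N} \sum_i w_0 = \frac{1}{N} \sum_i (t_i - w_1 x_i)$$
$$w_0 = \frac{1}{N} \sum_i (t_i - w_1 x_i)$$
$$w_0 = \frac{1}{N} \sum_i t_i - w_1 \frac{1}{N} \sum_i x_i$$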

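Maximizing Log Likelihood
$$\nabla_{\mathbf{w}} R(\mathbf{w}) = \nabla_{\mathbf{w}} \frac{1}{2N} \sum_i (t_i - w_1 x_i - w_0)^2$$
Partial derivative:
$$\frac{\partial R}{\partial w_1} = \frac{1}{N} \sum_i (t_i - w_1 x_i - w_0)(-x_i)$$
Set to zero:
$$\frac{1}{N} \sum_i (t_i - w_1 x_i - w_0)(-x_i) = 0$$
$$\frac{1}{N} \sum_i (t_i x_i - w_1 x_i^2 - w_0 x_i) = 0$$
Separate the sum to isolate $w_1$:
$$w_1 \frac{1}{N} \sum_i x_i^2 = \frac{1}{N} \sum_i t_i x_i - w_0 \frac{1}{N} \sum_i x_i$$
$$w_1 \sum_i x_i^2 = \sum_i t_i x_i - w_0 \sum_i x_i$$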

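Maximizing Log Likelihood
From the previous partial derivative:
$$w_1 \sum_i x_i^2 = \sum_i t_i x_i - w_0 \sum_i x_i$$
From the previous slide:
$$w_0 = \frac{1}{N} \sum_i t_i - w_1 \frac{1}{N} \sum_i x_i$$
Substitute:
$$w_1 \sum_i x_i^2 = \sum_i t_i x_i - \left(\frac{1}{N} \sum_i t_i - w_1 \frac{1}{N} \sum_i x_i\right) \sum_i x_i$$
Isolate $w_1$:
$$w_1 \left(\sum_i x_i^2 - \frac{1}{N} \sum_i x_i \sum_i x_i\right) = \sum_i t_i x_i - \frac{1}{N} \sum_i t_i \sum_i x_i$$
$$w_1 = \frac{\sum_i t_i x_i - \frac{1}{N} \sum_i t_i \sum_i x_i}{\sum_i x_i^2 - \frac{1}{N} \sum_i x_i \sum_i x_i}$$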

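Maximizing Log Likelihood
Clean and easy. Or not.
$$w_0 = \frac{1}{N} \sum_i t_i - w_1 \frac{1}{N} \sum_i x_i$$
$$w_1 = \frac{\sum_i t_i x_i - \frac{1}{N} \sum_i t_i \sum_i x_i}{\sum_i x_i^2 - \frac{1}{N} \sum_i x_i \sum_i x_i}$$
Apply some linear algebra.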
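The two closed-form expressions translate directly into code. The following is a minimal sketch, assuming plain Python lists of equal length; the function name `fit_line` and the sample data are made up for illustration.

```python
def fit_line(xs, ts):
    """Closed-form 1-D least squares; returns (w0, w1)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_t = sum(ts)
    sum_xx = sum(x * x for x in xs)
    sum_tx = sum(t * x for x, t in zip(xs, ts))

    # Denominator of w1: sum(x^2) - (1/N) * sum(x) * sum(x).
    denom = sum_xx - sum_x * sum_x / n
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    w1 = (sum_tx - sum_t * sum_x / n) / denom
    w0 = sum_t / n - w1 * sum_x / n
    return w0, w1

# Noise-free data generated from t = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ts = [1.0, 3.0, 5.0, 7.0, 9.0]
w0, w1 = fit_line(xs, ts)
print(w0, w1)
```

On noise-free data the fit recovers the generating intercept and slope.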

Likelihood using linear algebra
Representing the linear regression function in terms of vectors.
$$y = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_D x_D$$
$$\mathbf{x} = (1, x_1, x_2, \ldots, x_D)^T$$
$$\mathbf{w} = (w_0, w_1, w_2, \ldots, w_D)^T$$
$$y = \mathbf{x}^T \mathbf{w}$$

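Likelihood using linear algebra
Stack $\mathbf{x}^T$ into a matrix of data points, $\mathbf{X}$.
$$R_{emp}(\mathbf{w}) = \frac{1}{2N} \sum_i (t_i - w_1 x_i - w_0)^2$$
Representation as vectors:
$$R_{emp}(\mathbf{w}) = \frac{1}{2N} \sum_i \left(t_i - \begin{pmatrix} 1 & x_i \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \end{pmatrix}\right)^2$$
Stack the data into a matrix and use the norm operation to handle the sum:
$$R_{emp}(\mathbf{w}) = \frac{1}{2N} \left\| \begin{pmatrix} t_0 \\ t_1 \\ \vdots \\ t_{N-1} \end{pmatrix} - \begin{pmatrix} 1 & x_0 \\ 1 & x_1 \\ \vdots & \vdots \\ 1 & x_{N-1} \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \end{pmatrix} \right\|^2 = \frac{1}{2N} \left\| \mathbf{t} - \mathbf{X}\mathbf{w} \right\|^2$$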

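Likelihood in multiple dimensions
This representation of risk has no inherent dimensionality.
$$R_{emp}(\mathbf{w}) = \frac{1}{2N} \left\| \mathbf{t} - \mathbf{X}\mathbf{w} \right\|^2$$
$$\nabla_{\mathbf{w}} R_{emp}(\mathbf{w}) = 0$$
$$\nabla_{\mathbf{w}} \frac{1}{2N} \left\| \mathbf{t} - \mathbf{X}\mathbf{w} \right\|^2 = 0$$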

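Maximum Likelihood Estimation redux
$$\nabla_{\mathbf{w}} R_{emp}(\mathbf{w}) = 0$$
$$\nabla_{\mathbf{w}} \frac{1}{2N} \left\| \mathbf{t} - \mathbf{X}\mathbf{w} \right\|^2 = 0$$
Decompose the norm:
$$\frac{1}{2N} \nabla_{\mathbf{w}} (\mathbf{t} - \mathbf{X}\mathbf{w})^T (\mathbf{t} - \mathbf{X}\mathbf{w}) = 0$$
FOIL, linear algebra style:
$$\frac{1}{2N} \nabla_{\mathbf{w}} \left(\mathbf{t}^T\mathbf{t} - \mathbf{t}^T\mathbf{X}\mathbf{w} - \mathbf{w}^T\mathbf{X}^T\mathbf{t} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}\right) = 0$$
Differentiate:
$$\frac{1}{2N} \left(-\mathbf{X}^T\mathbf{t} - \mathbf{X}^T\mathbf{t} + 2\mathbf{X}^T\mathbf{X}\mathbf{w}\right) = 0$$
Combine terms:
$$\frac{1}{2N} \left(-2\mathbf{X}^T\mathbf{t} + 2\mathbf{X}^T\mathbf{X}\mathbf{w}\right) = 0$$
Isolate $\mathbf{w}$:
$$\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{t}$$
$$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{t}$$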
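The normal equations can be solved without forming an explicit inverse. The sketch below is one possible implementation in pure Python, assuming small dense data held in lists. It solves $\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{t}$ by Gaussian elimination, a technique swapped in here for the matrix inverse on the slide. All function names and the sample data are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def fit_normal_equations(features, ts):
    """Return w solving X^T X w = X^T t, where X has a leading column of ones."""
    X = [[1.0] + list(row) for row in features]
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xtt = [sum(x * t for x, t in zip(row, ts)) for row in Xt]
    return solve(XtX, Xtt)

# Noise-free data generated from t = 1 + 2a - 3b.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ts = [1.0 + 2.0 * a - 3.0 * b for a, b in features]
w = fit_normal_equations(features, ts)
print(w)
```

Solving the linear system rather than inverting $\mathbf{X}^T\mathbf{X}$ is a design choice made for numerical stability; the result is the same weight vector when the inverse exists.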

Extension to polynomial regression
$$y = c_0 + c_1 x_1 + c_2 x_2$$
$$y = c_0 + c_1 x + c_2 x^2$$
Polynomial regression is the same as linear regression in D dimensions.

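Generate new features
Standard polynomial with coefficients $\mathbf{w}$:
$$y(x, \mathbf{w}) = \sum_{d=1}^{D} w_d x^d + w_0$$
$$\text{Risk} = \frac{1}{2} \left\| \begin{pmatrix} t_0 \\ t_1 \\ \vdots \\ t_{n-1} \end{pmatrix} - \begin{pmatrix} 1 & x_0 & \ldots & x_0^p \\ 1 & x_1 & \ldots & x_1^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n-1} & \ldots & x_{n-1}^p \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_p \end{pmatrix} \right\|^2$$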

Generate new features
Feature trick: to fit a polynomial of degree P, create a vector of the P + 1 powers of each $x_i$:
$$\mathbf{x}_i = (x_i^0, x_i^1, \ldots, x_i^P)^T$$
Then apply standard linear regression in this (P + 1)-dimensional feature space.
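The feature trick can be sketched in a few lines. This is only an illustration: the function names, the degree, and the weights are made up, and no fitting is done here. The point is that the prediction is a plain dot product between the weights and the expanded features.

```python
def poly_features(x, degree):
    """Expand a scalar x into the vector (x^0, x^1, ..., x^degree)."""
    return [x ** p for p in range(degree + 1)]

def predict(x, w):
    """Linear model in feature space: the dot product of w and the features."""
    phi = poly_features(x, len(w) - 1)
    return sum(wj * fj for wj, fj in zip(w, phi))

xs = [0.0, 1.0, 2.0, 3.0]
design = [poly_features(x, 2) for x in xs]   # one row per data point
w = [1.0, -2.0, 0.5]                         # made-up weights: 1 - 2x + 0.5x^2
ys = [predict(x, w) for x in xs]
print(design)
print(ys)
```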

How is this still linear regression?
The regression is linear in the parameters, despite projecting $x_i$ from one dimension to D dimensions.
Now we fit a plane (or hyperplane) to a representation of $x_i$ in a higher dimensional feature space.
This generalizes to any set of functions $\phi_i : \mathbb{R} \rightarrow \mathbb{R}$
$$\mathbf{x}_i = (\phi_0(x_i), \phi_1(x_i), \ldots, \phi_P(x_i))^T$$

Basis functions as feature extraction
These functions are called basis functions, $\phi_i : \mathbb{R} \rightarrow \mathbb{R}$.
They define the bases of the feature space.
This allows a linear decomposition of any type of function to data points.
Common choices:
  Polynomial
  Gaussian
  Sigmoids
  Wave functions (sine, etc.)
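A feature vector built from mixed basis functions might look like the sketch below. The particular centers, widths, and slopes are arbitrary assumptions, and the helper names are hypothetical; any set of scalar functions could be substituted.

```python
import math

def gaussian_basis(center, width):
    """Gaussian bump centered at `center`."""
    def phi(x):
        return math.exp(-((x - center) ** 2) / (2 * width ** 2))
    return phi

def sigmoid_basis(center, slope):
    """Logistic sigmoid centered at `center`."""
    def phi(x):
        return 1.0 / (1.0 + math.exp(-slope * (x - center)))
    return phi

# A constant (bias) term, one Gaussian, one sigmoid, and a sine wave.
basis = [lambda x: 1.0, gaussian_basis(0.0, 1.0), sigmoid_basis(0.0, 2.0), math.sin]

def features(x):
    """Map a scalar x to its feature vector (phi_0(x), ..., phi_P(x))."""
    return [phi(x) for phi in basis]

print(features(0.0))
print(features(1.0))
```

Regression on these features is still linear in the weights, exactly as in the polynomial case.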

Training data vs. Testing Data
Evaluating the performance of a classifier on training data is meaningless.
With enough parameters, a model can simply memorize (encode) every training point.
To evaluate performance, data is divided into training and testing (or evaluation) data.
  Training data is used to learn model parameters.
  Testing data is used to evaluate performance.
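The memorization point can be illustrated with a toy comparison. In the sketch below, all data values are made up, and `memorize` is a hypothetical lookup model that returns the target of the nearest training input. It achieves zero training risk but does worse than a fitted line on held-out points.

```python
def fit_line(xs, ts):
    """Least-squares line; returns a function x -> w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    w1 = (sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_t - w1 * mean_x
    return lambda x: w0 + w1 * x

def memorize(xs, ts):
    """Hypothetical memorizer: predicts the target of the nearest training x."""
    table = list(zip(xs, ts))
    return lambda x: min(table, key=lambda pair: abs(pair[0] - x))[1]

def risk(model, xs, ts):
    """Empirical risk with squared loss."""
    return sum(0.5 * (t - model(x)) ** 2 for x, t in zip(xs, ts)) / len(xs)

# Training targets follow t = 2x + 1 plus fixed, made-up noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_t = [1.3, 2.8, 5.1, 6.6, 9.2, 11.0]
# Held-out points, noise-free for simplicity.
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_t = [2.0, 4.0, 6.0, 8.0, 10.0]

line = fit_line(train_x, train_t)
memo = memorize(train_x, train_t)
for name, model in (("line", line), ("memorizer", memo)):
    print(name, risk(model, train_x, train_t), risk(model, test_x, test_t))
```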

Overfitting

Overfitting performance

Definition of overfitting
When the model describes the noise, rather than the signal.
How can you tell the difference between overfitting and a bad model?

Possible detection of overfitting
Stability
  An appropriately fit model is stable under different samples of the training data.
  An overfit model generates inconsistent performance.
Performance
  A good model has low test error.
  A bad model has high test error.

What is the optimal model size?
The best model size is the one that generalizes best to unseen data.
Approximate this by testing error.
One way to optimize parameters is to minimize testing error.
  This operation uses testing data as tuning or development data.
  It sacrifices training data in favor of parameter optimization.
Can we do this without explicit evaluation data?

Context for linear regression
Simple approach
Efficient learning
Extensible
Regularization provides robust models

Break
Coffee. Stretch.

Linear Regression
Identify the best parameters, $\mathbf{w}$, for a regression function
$$y = w_0 + \sum_{i=1}^{N} w_i x_i$$
$$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{t}$$

Overfitting
Recall: overfitting happens when a model is capturing idiosyncrasies of the data rather than generalities.
Often caused by too many parameters relative to the amount of training data.
E.g., a polynomial of order N can pass through any N + 1 data points, provided their x values are distinct.
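The claim about polynomials passing through every point can be demonstrated directly. The sketch below builds the unique polynomial of order N through N + 1 points using Lagrange interpolation, a construction chosen here for convenience and not taken from the slides. The targets are arbitrary made-up values, so the zero training error reflects memorization, not any real structure in the data.

```python
def interpolate(xs, ts):
    """Return the polynomial of order len(xs) - 1 through all (x, t) pairs.

    Uses the Lagrange interpolation formula; requires distinct xs.
    """
    def model(x):
        total = 0.0
        for i, (xi, ti) in enumerate(zip(xs, ts)):
            term = ti
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return model

xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, -2.0, 4.0, 0.5]      # arbitrary targets, no underlying pattern
model = interpolate(xs, ts)     # order-3 polynomial through 4 points

train_fit = [model(x) for x in xs]
print(train_fit)                # reproduces the targets
print(model(1.5))               # prediction between the training points
```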

Dealing with Overfitting
Use more data
Use a tuning set
Regularization
Be a Bayesian

Regularization
In a linear regression model, overfitting is characterized by large weights.

        M = 0     M = 1     M = 3          M = 9
w_0      0.19      0.82      0.31           0.35
w_1               -1.27      7.99         232.37
w_2                        -25.43       -5321.83
w_3                         17.37       48568.31
w_4                                   -231639.30
w_5                                    640042.26
w_6                                  -1061800.52
w_7                                   1042400.18
w_8                                   -557682.99
w_9                                    125201.43

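Penalize large weights
Introduce a penalty term in the loss function.
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=0}^{N-1} \{t_n - y(x_n, \mathbf{w})\}^2$$
Regularized Regression (L2-Regularization or Ridge Regression):
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, \mathbf{w}))^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$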

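Regularization Derivation
$$\nabla_{\mathbf{w}} E(\mathbf{w}) = 0$$
$$\nabla_{\mathbf{w}} \left(\frac{1}{2} \sum_i (y(x_i, \mathbf{w}) - t_i)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2\right) = 0$$
$$\nabla_{\mathbf{w}} \left(\frac{1}{2} \|\mathbf{t} - \mathbf{X}\mathbf{w}\|^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2\right) = 0$$
$$\nabla_{\mathbf{w}} \left(\frac{1}{2} (\mathbf{t} - \mathbf{X}\mathbf{w})^T (\mathbf{t} - \mathbf{X}\mathbf{w}) + \frac{\lambda}{2} \mathbf{w}^T\mathbf{w}\right) = 0$$
$$-\mathbf{X}^T\mathbf{t} + \mathbf{X}^T\mathbf{X}\mathbf{w} + \nabla_{\mathbf{w}} \frac{\lambda}{2} \mathbf{w}^T\mathbf{w} = 0$$
$$-\mathbf{X}^T\mathbf{t} + \mathbf{X}^T\mathbf{X}\mathbf{w} + \lambda\mathbf{w} = 0$$
$$-\mathbf{X}^T\mathbf{t} + \mathbf{X}^T\mathbf{X}\mathbf{w} + \lambda\mathbf{I}\mathbf{w} = 0$$
$$-\mathbf{X}^T\mathbf{t} + (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\mathbf{w} = 0$$
$$(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\mathbf{w} = \mathbf{X}^T\mathbf{t}$$
$$\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{t}$$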
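For a single input feature, the regularized system is only 2 x 2 and can be solved by hand. The sketch below is an illustration under that assumption: rows of $\mathbf{X}$ are $(1, x_i)$, the penalty is applied to both weights as in the derivation, and the function name and data are made up.

```python
def fit_ridge_1d(xs, ts, lam):
    """Solve (X^T X + lam * I) w = X^T t for rows X_i = (1, x_i).

    Returns (w0, w1). The penalty covers w0 as well, matching lam * I.
    """
    a = len(xs) + lam                      # top-left entry of X^T X + lam * I
    b = sum(xs)                            # off-diagonal entries
    d = sum(x * x for x in xs) + lam       # bottom-right entry
    r0 = sum(ts)                           # first entry of X^T t
    r1 = sum(x * t for x, t in zip(xs, ts))
    det = a * d - b * b
    if det == 0:
        raise ValueError("system is singular; use lam > 0 or distinct x values")
    # Inverse of the symmetric 2 x 2 matrix [[a, b], [b, d]] applied to (r0, r1).
    w0 = (d * r0 - b * r1) / det
    w1 = (a * r1 - b * r0) / det
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ts = [1.0, 3.0, 5.0, 7.0, 9.0]             # noise-free t = 2x + 1
unregularized = fit_ridge_1d(xs, ts, 0.0)
regularized = fit_ridge_1d(xs, ts, 10.0)
print(unregularized, regularized)
```

With the penalty switched on, both weights shrink toward zero relative to the unregularized fit.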

Regularization in Practice

Regularization Results

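More regularization
The penalty term defines the styles of regularization.
L2-Regularization:
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, \mathbf{w}))^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$
L1-Regularization:
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, \mathbf{w}))^2 + \lambda \|\mathbf{w}\|_1$$
L0-Regularization:
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, \mathbf{w}))^2 + \lambda \sum_{n} \delta(w_n \neq 0)$$
L0-norm is the optimal subset of features.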
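The three penalty terms differ only in how they measure the weight vector. A minimal sketch, with a made-up weight vector and $\lambda = 0.5$:

```python
def l2_penalty(w, lam):
    """(lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(wj * wj for wj in w)

def l1_penalty(w, lam):
    """lam * sum of absolute weights."""
    return lam * sum(abs(wj) for wj in w)

def l0_penalty(w, lam):
    """lam * number of nonzero weights."""
    return lam * sum(1 for wj in w if wj != 0)

w = [0.0, 3.0, -4.0, 0.0]   # made-up weights, two of them exactly zero
lam = 0.5
print(l2_penalty(w, lam), l1_penalty(w, lam), l0_penalty(w, lam))
```

Only the L0 penalty is indifferent to how large the nonzero weights are; it counts them.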

Curse of dimensionality
Increasing dimensionality of features increases the data requirements exponentially.
For example, if a single feature can be accurately approximated with 100 data points, optimizing the joint over two features requires 100 * 100 data points.
Models should be small relative to the amount of available data.
Dimensionality reduction techniques, such as feature selection, can help.
  L0-regularization is explicit feature selection.
  L1- and L2-regularizations approximate feature selection.

Bayesians v. Frequentists
What is a probability?
Frequentists
  A probability is the likelihood that an event will happen.
  It is approximated by the ratio of the number of observed events to the number of total events.
  Assessment is vital to selecting a model.
  Point estimates are absolutely fine.
Bayesians
  A probability is a degree of believability of a proposition.
  Bayesians require that probabilities be prior beliefs conditioned on data.
  The Bayesian approach is optimal, given a good model, a good prior and a good loss function. Don't worry so much about assessment.
  If you are ever making a point estimate, you've made a mistake. The only valid probabilities are posteriors based on evidence given some prior.

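Bayesian Linear Regression
The previous MLE derivation of linear regression uses point estimates for the weight vector, $\mathbf{w}$.
Bayesians say, "hold it right there."
Use a prior distribution over $\mathbf{w}$ to estimate parameters:
$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\!\left(-\frac{\alpha}{2} \mathbf{w}^T\mathbf{w}\right)$$
Alpha is a hyperparameter over $\mathbf{w}$, where alpha is the precision or inverse variance of the distribution.
Now optimize:
$$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)$$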

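Optimize the Bayesian posterior
$$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)$$
As usual, it's easier to optimize after a log transform:
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) + \ln p(\mathbf{w} \mid \alpha)$$
The likelihood term:
$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=0}^{N-1} \sqrt{\frac{\beta}{2\pi}}\, \exp\!\left(-\frac{\beta}{2}\,(t_n - y(x_n, \mathbf{w}))^2\right)$$
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \frac{\beta}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, \mathbf{w}))^2$$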

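Optimize the Bayesian posterior
$$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)$$
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) + \ln p(\mathbf{w} \mid \alpha)$$
The prior term:
$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\!\left(-\frac{\alpha}{2} \mathbf{w}^T\mathbf{w}\right)$$
$$\ln p(\mathbf{w} \mid \alpha) = \frac{M+1}{2} \ln \alpha - \frac{M+1}{2} \ln 2\pi - \frac{\alpha}{2} \mathbf{w}^T\mathbf{w}$$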

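Optimize the Bayesian posterior
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) + \ln p(\mathbf{w} \mid \alpha)$$
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \frac{\beta}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, \mathbf{w}))^2$$
$$\ln p(\mathbf{w} \mid \alpha) = \frac{M+1}{2} \ln \alpha - \frac{M+1}{2} \ln 2\pi - \frac{\alpha}{2} \mathbf{w}^T\mathbf{w}$$
Ignoring terms that do not depend on $\mathbf{w}$, maximizing the log posterior is the same as minimizing
$$\frac{\beta}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, \mathbf{w}))^2 + \frac{\alpha}{2} \mathbf{w}^T\mathbf{w}$$
IDENTICAL formulation as L2-regularization.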
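To make the correspondence explicit, the objective can be divided by $\beta$, which does not move the minimizer since $\beta > 0$. The short worked step below uses the slides' notation; the label $\mathbf{w}_{MAP}$ for the posterior maximizer is introduced here for illustration, and the last line assumes the model is linear in $\mathbf{w}$ with design matrix $\mathbf{X}$, as in the ridge derivation.

```latex
\begin{align*}
\mathbf{w}_{MAP}
  &= \arg\min_{\mathbf{w}} \; \frac{\beta}{2} \sum_{n=0}^{N-1} \bigl(t_n - y(x_n, \mathbf{w})\bigr)^2
     + \frac{\alpha}{2}\, \mathbf{w}^T\mathbf{w} \\
  &= \arg\min_{\mathbf{w}} \; \frac{1}{2} \sum_{n=0}^{N-1} \bigl(t_n - y(x_n, \mathbf{w})\bigr)^2
     + \frac{\lambda}{2}\, \mathbf{w}^T\mathbf{w},
     \qquad \lambda = \frac{\alpha}{\beta} \\
  &= (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{t}
\end{align*}
```

A tighter prior (larger $\alpha$) or noisier data (smaller $\beta$) both correspond to stronger regularization.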

Context
Overfitting is bad.
Bayesians vs. Frequentists: is one better?
Machine Learning uses techniques from both camps.

Next Time
Logistic Regression
Read Chapter 4.1, 4.3