Linear regression COMS 4771

1. Old Faithful and prediction functions

Prediction problem: Old Faithful geyser (Yellowstone)

Task: Predict the time of the next eruption.

Statistical model for time between eruptions

[Figure: timeline of historical eruption records, with times a_i, b_i marked and the gaps between eruptions labeled Y_1, Y_2, Y_3, ...; a later time t, after the recorded data, is where the next prediction is made.]

Galton board iid model: Y_1, ..., Y_n, Y iid ∼ N(µ, σ²).
Data: Y_i := a_i − b_{i−1}, i = 1, ..., n.
(Admittedly not a great model, since durations are non-negative. Better: take Y_i from an exponential distribution.)

Task: At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.

On Old Faithful data:
Using 136 past observations, we form the estimate ˆµ = 70.7941.
Mean squared loss of ˆµ on the next 136 observations is 187.1894.
(Easier to interpret the square root, 13.6817, which has the same units as Y.)
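
A minimal numpy sketch of this baseline computation. The array below is a placeholder standing in for the real Old Faithful waiting times (the actual data, file format, and split are not part of the lecture text):

```python
import numpy as np

# Placeholder for the 272 Old Faithful inter-eruption waiting times (minutes),
# oldest first; with the real data the printed values are ~70.79, ~187.19, ~13.68.
rng = np.random.default_rng(0)
waiting = rng.normal(loc=71.0, scale=13.6, size=272)

train, test = waiting[:136], waiting[136:]      # past vs. future observations

mu_hat = train.mean()                           # MLE of mu under the iid normal model
mse = np.mean((test - mu_hat) ** 2)             # mean squared loss of the constant prediction
print(mu_hat, mse, np.sqrt(mse))
```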

Looking at the data

Naturalist Harry Woodward observed that the time until the next eruption seems to be related to the duration of the last eruption.

[Scatter plot: time until next eruption (roughly 50-90 minutes) vs. duration of last eruption (roughly 1.5-5.5 minutes).]

Using side-information

At prediction time t, the duration of the last eruption is available as side-information.

[Timeline figure: recorded eruption times ..., a_{n−1}, b_{n−1}, a_n, b_n, ..., up to time t, with the last duration/gap pair labeled X_n, Y_n.]

IID model for supervised learning: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid random pairs (i.e., labeled examples); X takes values in 𝒳 (e.g., 𝒳 = R), and Y takes values in R.

1. We observe (X_1, Y_1), ..., (X_n, Y_n) and then choose a prediction function (a.k.a. predictor) ˆf : 𝒳 → R. This is called learning or training.
2. At prediction time, we observe X and form the prediction ˆf(X).
3. The outcome is Y, and the squared loss is (ˆf(X) − Y)².

How should we choose ˆf based on data?

2. Optimal predictors and linear regression models

Distributions over labeled examples

𝒳: space of possible side-information (feature space).
𝒴: space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y) taking values in 𝒳 × 𝒴 can be thought of in two parts:
1. The marginal distribution P_X of X: P_X is a probability distribution on 𝒳.
2. The conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ 𝒳: P_{Y|X=x} is a probability distribution on 𝒴.

This lecture: 𝒴 = R (regression problems).

Optimal predictor

What function f : 𝒳 → R has the smallest (squared loss) risk R(f) := E[(f(X) − Y)²]?

Conditional on X = x, the minimizer of the conditional risk ŷ ↦ E[(ŷ − Y)² | X = x] is the conditional mean E[Y | X = x]. (Recall the Galton board!)

Therefore, the function f* : 𝒳 → R with

  f*(x) = E[Y | X = x], x ∈ 𝒳,

has the smallest risk. f* is called the regression function or conditional mean function.
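
One line of algebra makes this explicit (a standard decomposition, not spelled out on the slide): completing the square around E[Y | X = x] gives

```latex
\mathbb{E}\!\left[(\hat{y} - Y)^2 \mid X = x\right]
  = \bigl(\hat{y} - \mathbb{E}[Y \mid X = x]\bigr)^2 + \operatorname{Var}(Y \mid X = x).
```

The second term does not depend on ŷ, so the conditional risk is minimized by choosing ŷ = E[Y | X = x].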

Linear regression models

When side-information is encoded as a vector of real numbers x = (x_1, ..., x_d) (the coordinates are called features or variables), it is common to use a linear regression model, such as the following:

  Y | X = x ∼ N(x^T β, σ²), x ∈ R^d.

Parameters: β = (β_1, ..., β_d) ∈ R^d and σ² > 0.
X = (X_1, ..., X_d) is a random vector (i.e., a vector of random variables).
The conditional distribution of Y given X is normal; the marginal distribution of X is not specified.

In this model, the regression function f* is the linear function f_β : R^d → R,

  f_β(x) = x^T β = Σ_{i=1}^d x_i β_i, x ∈ R^d.

(We'll often refer to f_β just by β.)

[Plot: an example of a linear regression function f* for d = 1, over x ∈ [−1, 1].]

Enhancing linear regression models

Linear functions might sound rather restricted, but they can actually be quite powerful if you are creative about the side-information.

Examples:
1. Non-linear transformation of an existing variable: for x ∈ R, φ(x) = ln(1 + x).
2. Logical formula of binary variables: for x = (x_1, ..., x_d) ∈ {0, 1}^d, φ(x) = (x_1 ∧ x_5 ∧ x_{10}) ∨ (¬x_2 ∧ x_7).
3. Trigonometric expansion: for x ∈ R, φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), ...).
4. Polynomial expansion: for x = (x_1, ..., x_d) ∈ R^d, φ(x) = (1, x_1, ..., x_d, x_1², ..., x_d², x_1x_2, ..., x_1x_d, ..., x_{d−1}x_d).

Example: Taking advantage of linearity

Suppose you are trying to predict some health outcome. A physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:

  φ(x) = (x_temp − 98.6)².

What if you didn't know about this magic constant 98.6? Instead, use

  φ(x) = (1, x_temp, x_temp²).

We can then learn coefficients β such that β^T φ(x) = (x_temp − 98.6)², or any other quadratic polynomial in x_temp (which may work even better!).

Quadratic expansion

A quadratic function f : R → R,

  f(x) = ax² + bx + c, x ∈ R,

for a, b, c ∈ R, can be written as a linear function of φ(x), where φ(x) := (1, x, x²), since

  f(x) = β^T φ(x) with β = (c, b, a).

For a multivariate quadratic function f : R^d → R, use

  φ(x) := (1, x_1, ..., x_d [linear terms], x_1², ..., x_d² [squared terms], x_1x_2, ..., x_1x_d, ..., x_{d−1}x_d [cross terms]).
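
A small sketch of these feature maps in numpy (the function names are mine, not from the lecture):

```python
import numpy as np
from itertools import combinations

def quad_features_1d(x):
    """phi(x) = (1, x, x^2) for a scalar x."""
    return np.array([1.0, x, x * x])

def quad_features(x):
    """phi(x) = (1, linear terms, squared terms, cross terms) for a vector x."""
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], x, x ** 2, cross))

# f(x) = a x^2 + b x + c equals beta^T phi(x) with beta = (c, b, a).
a, b, c = 2.0, -1.0, 0.5
beta = np.array([c, b, a])
x0 = 3.0
print(beta @ quad_features_1d(x0), a * x0**2 + b * x0 + c)   # same value twice
print(quad_features([1.0, 2.0, 3.0]).shape)                  # 1 + d + d + d(d-1)/2 = 10 for d = 3
```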

Affine expansion and Old Faithful

Woodward needed an affine expansion for the Old Faithful data: φ(x) := (1, x).

[Plot: time until next eruption (minutes) vs. duration of last eruption (minutes), with a fitted affine function overlaid.]

An affine function f_{a,b} : R → R, f_{a,b}(x) = a + bx for a, b ∈ R, is a linear function f_β of φ(x) with β = (a, b). (This easily generalizes to multivariate affine functions.)

Why linear regression models?

1. Linear regression models benefit from a good choice of features.
2. The structure of linear functions is very well understood.
3. There are many well-understood and efficient algorithms for learning linear functions from data, even when n and d are large.

3. From data to prediction functions

Maximum likelihood estimation for linear regression

Linear regression model with Gaussian noise: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, with

  Y | X = x ∼ N(x^T β, σ²), x ∈ R^d.

(It is traditional to study linear regression in the context of this model.)

Log-likelihood of (β, σ²), given data (X_i, Y_i) = (x_i, y_i) for i = 1, ..., n:

  Σ_{i=1}^n { −(1/(2σ²)) (x_i^T β − y_i)² + (1/2) ln(1/(2πσ²)) } + {terms not involving (β, σ²)}.

The β that maximizes the log-likelihood is also the β that minimizes

  (1/n) Σ_{i=1}^n (x_i^T β − y_i)².

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model...

Empirical distribution and empirical risk

The empirical distribution P_n on (x_1, y_1), ..., (x_n, y_n) has probability mass function p_n given by

  p_n((x, y)) := (1/n) Σ_{i=1}^n 1{(x, y) = (x_i, y_i)}, (x, y) ∈ R^d × R.

Plug-in principle: The goal is to find a function f that minimizes the (squared loss) risk R(f) = E[(f(X) − Y)²]. But we don't know the distribution P of (X, Y).

Replace P with P_n to get the empirical (squared loss) risk ˆR(f):

  ˆR(f) := (1/n) Σ_{i=1}^n (f(x_i) − y_i)².

Empirical risk minimization

Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss, ERM returns

  ˆβ ∈ arg min_{β ∈ R^d} ˆR(β),

which coincides with the MLE under the basic linear regression model.

In general:
- MLE makes sense in the context of the statistical model for which it is derived.
- ERM makes sense in the context of the general iid model for supervised learning.

Empirical risk minimization in pictures

[Figure: red dots are data points; an affine hyperplane represents a linear function β (via the affine expansion (x_1, x_2) ↦ (1, x_1, x_2)); ERM minimizes the sum of squared vertical lengths from the hyperplane to the points.]

Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

  A := (1/√n) [x_1^T; ...; x_n^T],   b := (1/√n) (y_1, ..., y_n)^T.

Then the empirical risk can be written as ˆR(β) = ||Aβ − b||²_2.

A necessary condition for β to be a minimizer of ˆR: ∇ˆR(β) = 0, i.e., β is a critical point of ˆR. This translates to

  (A^T A) β = A^T b,

a system of linear equations called the normal equations.

It can be proved that every critical point of ˆR is a minimizer of ˆR:

Aside: Convexity

Let f : R^d → R be a differentiable function. Suppose we find x ∈ R^d such that ∇f(x) = 0. Is x a minimizer of f?

Yes, if f is a convex function:

  f((1−t)x + tx′) ≤ (1−t) f(x) + t f(x′)

for any 0 ≤ t ≤ 1 and any x, x′ ∈ R^d.

Convexity of empirical risk

Checking convexity of g(x) = ||Ax − b||²_2:

  g((1−t)x + tx′)
    = ||(1−t)(Ax − b) + t(Ax′ − b)||²_2
    = (1−t)² ||Ax − b||²_2 + t² ||Ax′ − b||²_2 + 2(1−t)t (Ax − b)^T (Ax′ − b)
    = (1−t) ||Ax − b||²_2 + t ||Ax′ − b||²_2
        − (1−t)t [ ||Ax − b||²_2 + ||Ax′ − b||²_2 ] + 2(1−t)t (Ax − b)^T (Ax′ − b)
    ≤ (1−t) ||Ax − b||²_2 + t ||Ax′ − b||²_2,

where the last step uses the Cauchy-Schwarz inequality and the arithmetic mean/geometric mean (AM/GM) inequality.

Convexity of empirical risk, another way (preview of convex analysis)

Recall ˆR(β) = (1/n) Σ_{i=1}^n (x_i^T β − y_i)².

- The scalar function g(z) = cz² is convex for any c ≥ 0.
- The composition (g ∘ a) : R^d → R of any convex function g : R → R and any affine function a : R^d → R is convex.
- Therefore, each function β ↦ (1/n)(x_i^T β − y_i)² is convex.
- A sum of convex functions is convex.
- Therefore ˆR is convex.

Convexity is a useful mathematical property to understand! (We'll study more convex analysis in a few weeks.)

Algorithm for ERM

ERM with linear functions and squared loss
  input: data (x_1, y_1), ..., (x_n, y_n) from R^d × R.
  output: linear function ˆβ ∈ R^d.
  1: Find a solution ˆβ to the normal equations defined by the data (using, e.g., Gaussian elimination).
  2: return ˆβ.

Also called ordinary least squares in this context.

Running time (dominated by Gaussian elimination): O(nd²). Note: there are many approximate solvers that run in nearly linear time!
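
A sketch of this procedure in numpy on synthetic data. Instead of literal Gaussian elimination it uses `numpy.linalg.solve` on the normal equations and, equivalently, `numpy.linalg.lstsq` (more robust when A^T A is ill-conditioned):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                       # rows are feature vectors x_i
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

A = X / np.sqrt(n)                                # A and b as defined on the slide
b = y / np.sqrt(n)

beta_normal = np.linalg.solve(A.T @ A, A.T @ b)   # solve (A^T A) beta = A^T b
beta_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))       # True: same ERM solution
```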

Geometric interpretation of ERM

Let a_j ∈ R^n be the j-th column of the matrix A ∈ R^{n×d}, so A = [a_1 | ... | a_d].

Minimizing ||Aβ − b||²_2 is the same as finding the vector ˆb ∈ range(A) closest to b.

[Figure: b is projected orthogonally onto the plane spanned by a_1 and a_2, giving ˆb.]

- The solution ˆb is the orthogonal projection of b onto range(A) = {Aβ : β ∈ R^d}, and ˆb is uniquely determined.
- If rank(A) < d, then there is more than one way to write ˆb as a linear combination of a_1, ..., a_d; in that case the ERM solution is not unique.
- To get β from ˆb: solve the system of linear equations Aβ = ˆb.
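
A quick numeric check of this picture on assumed synthetic data: the fitted vector ˆb = Aˆβ is the orthogonal projection of b onto range(A), so the residual b − ˆb is orthogonal to every column of A.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 4))
b = rng.normal(size=50)

beta_hat = np.linalg.lstsq(A, b, rcond=None)[0]
b_hat = A @ beta_hat                              # orthogonal projection of b onto range(A)

print(np.allclose(A.T @ (b - b_hat), 0.0, atol=1e-10))   # residual is orthogonal to columns of A
```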

Statistical interpretation of ERM

Let (X, Y) ∼ P, where P is some distribution on R^d × R. Which β have the smallest risk R(β) = E[(X^T β − Y)²]?

A necessary condition for β to be a minimizer of R: ∇R(β) = 0, i.e., β is a critical point of R. This translates to

  E[XX^T] β = E[Y X],

a system of linear equations called the population normal equations. It can be proved that every critical point of R is a minimizer of R.

Looks familiar? If (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, then E[A^T A] = E[XX^T] and E[A^T b] = E[Y X], so ERM can be regarded as a plug-in estimator for a minimizer of R.

4. Risk, empirical risk, and estimating risk

Risk of ERM

IID model: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, taking values in R^d × R. Let β* be a minimizer of R over all β ∈ R^d, i.e., β* satisfies the population normal equations E[XX^T] β* = E[Y X].

If the ERM solution ˆβ is not unique (e.g., if n < d), then R(ˆβ) can be arbitrarily worse than R(β*).

What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X,

  R(ˆβ) − R(β*) = O( tr(cov(εW)) / n )

asymptotically, where W := E[XX^T]^{−1/2} X and ε := Y − X^T β*.

Risk of ERM analysis (rough sketch)

Let ε_i := Y_i − X_i^T β* for each i = 1, ..., n, so

  E[ε_i X_i] = E[Y_i X_i] − E[X_i X_i^T] β* = 0

and

  √n (ˆβ − β*) = ( (1/n) Σ_{i=1}^n X_i X_i^T )^{−1} · (1/√n) Σ_{i=1}^n ε_i X_i.

1. By the LLN: (1/n) Σ_{i=1}^n X_i X_i^T →_p E[XX^T].
2. By the CLT: (1/√n) Σ_{i=1}^n ε_i X_i →_d cov(εX)^{1/2} Z, where Z ∼ N(0, I).

Therefore, the asymptotic distribution of √n (ˆβ − β*) is

  √n (ˆβ − β*) →_d E[XX^T]^{−1} cov(εX)^{1/2} Z.

A few more steps give

  n ( E[(X^T ˆβ − Y)²] − E[(X^T β* − Y)²] ) →_d || E[XX^T]^{−1/2} cov(εX)^{1/2} Z ||²_2.

The random variable on the right-hand side is concentrated around its mean, tr(cov(εW)).

Risk of ERM: postscript

- The analysis does not assume that the linear regression model is "correct"; the data distribution need not come from the normal linear regression model. The only assumptions are those needed for the LLN and CLT to hold.
- However, if the normal linear regression model does hold, i.e., Y | X = x ∼ N(x^T β, σ²), then the bound from the theorem becomes

    R(ˆβ) − R(β*) = O(σ² d / n),

  which is familiar to those who have taken introductory statistics.
- With more work, one can also prove a non-asymptotic risk bound of similar form.
- In the homework/reading, we look at a simpler setting for studying ERM for linear regression, called fixed design.
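
A small simulation sketch consistent with this bound, under the normal linear regression model with X ∼ N(0, I_d); in that special case the excess risk of the ERM is exactly ||ˆβ − β*||²_2. The dimensions, noise level, and number of trials are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 500, 10, 2.0
beta_star = rng.normal(size=d)

def excess_risk_once():
    X = rng.normal(size=(n, d))                      # X ~ N(0, I_d)
    y = X @ beta_star + sigma * rng.normal(size=n)   # normal linear regression model
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((beta_hat - beta_star) ** 2)       # = R(beta_hat) - R(beta_star) here

trials = [excess_risk_once() for _ in range(200)]
print(np.mean(trials), sigma**2 * d / n)             # both come out close to 0.08
```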

Risk vs empirical risk

Let ˆβ be the ERM solution.
1. Empirical risk of ERM: ˆR(ˆβ).
2. True risk of ERM: R(ˆβ).

Theorem. E[ˆR(ˆβ)] ≤ E[R(ˆβ)].
(Empirical risk can sometimes be larger than true risk, but not on average.)

Overfitting: the empirical risk is small, but the true risk is much higher.

Overfitting example

(X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid; X is a continuous random variable in R. Suppose we use the degree-k polynomial expansion

  φ(x) = (1, x, x², ..., x^k), x ∈ R,

so the dimension is d = k + 1.

Fact: Any function on k + 1 points can be interpolated by a polynomial of degree at most k.

[Figure: a degree-k polynomial passing exactly through a handful of training points on [0, 1].]

Conclusion: If n ≤ k + 1 = d, the ERM solution ˆβ with this feature expansion has ˆR(ˆβ) = 0 always, regardless of its true risk (which can be ≫ 0).
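
A sketch of this phenomenon on a synthetic problem of my own choosing: with a degree-(n−1) polynomial expansion and n training points, the ERM fit interpolates the training data (empirical risk ≈ 0) while a large held-out sample shows the true risk is much bigger.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                             # training points
x_train = rng.uniform(size=n)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.normal(size=n)

def poly_features(x, k):
    """phi(x) = (1, x, x^2, ..., x^k) for each entry of x."""
    return np.vander(np.asarray(x), N=k + 1, increasing=True)

k = n - 1                                          # d = k + 1 = n, so ERM interpolates
beta = np.linalg.lstsq(poly_features(x_train, k), y_train, rcond=None)[0]

x_test = rng.uniform(size=10000)
y_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.normal(size=10000)

train_risk = np.mean((poly_features(x_train, k) @ beta - y_train) ** 2)
test_risk = np.mean((poly_features(x_test, k) @ beta - y_test) ** 2)
print(train_risk, test_risk)                       # ~0 vs. much larger
```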

Estimating risk

IID model: (X_1, Y_1), ..., (X_n, Y_n), (X̃_1, Ỹ_1), ..., (X̃_m, Ỹ_m) are iid ∼ P.
- Training data (X_1, Y_1), ..., (X_n, Y_n) is used to learn ˆf.
- Test data (X̃_1, Ỹ_1), ..., (X̃_m, Ỹ_m) is used to estimate the risk, via the test risk

    R_test(f) := (1/m) Σ_{i=1}^m (f(X̃_i) − Ỹ_i)².

The training data is independent of the test data, so ˆf is independent of the test data.

Let L_i := (ˆf(X̃_i) − Ỹ_i)² for each i = 1, ..., m, so

  E[R_test(ˆf) | ˆf] = (1/m) Σ_{i=1}^m E[L_i | ˆf] = R(ˆf).

Moreover, L_1, ..., L_m are conditionally iid given ˆf, and hence by the Law of Large Numbers,

  R_test(ˆf) →_p R(ˆf) as m → ∞.

By the CLT, the rate of convergence is m^{−1/2}.

Rates for risk minimization vs. rates for risk estimation

One may think that ERM works because, somehow, the training risk is a good plug-in estimate of the true risk. Sometimes this is partially true; we'll revisit this when we discuss generalization theory. Roughly speaking, under some assumptions, one can expect that

  |ˆR(β) − R(β)| ≤ O(√(d/n)) for all β ∈ R^d.

However...
- By the CLT, we know the following holds for a fixed β: ˆR(β) →_p R(β) at an n^{−1/2} rate. (Here, we ignore the dependence on d.)
- Yet, for the ERM ˆβ: R(ˆβ) →_p R(β*) at an n^{−1} rate. (Also ignoring the dependence on d.)

Implication: Selecting a good predictor can be easier than estimating how good predictors are!

Old Faithful example

Linear regression model + affine expansion on the duration of the last eruption.
- Learn ˆβ = (35.0929, 10.3258) from 136 past observations.
- Mean squared loss of ˆβ on the next 136 observations is 35.9404. (Recall: the mean squared loss of ˆµ = 70.7941 was 187.1894.)

[Plot: time until next eruption vs. duration of last eruption, with the fitted linear model and the constant prediction overlaid.]

(Unfortunately, √35.9 ≈ 6.0 > mean duration ≈ 3.5.)
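
The corresponding computation, sketched in numpy. The arrays below are placeholders standing in for the Old Faithful columns; with the real data this should reproduce ˆβ ≈ (35.09, 10.33) and mean squared loss ≈ 35.94:

```python
import numpy as np

# Placeholders for the Old Faithful columns: duration of last eruption (x)
# and time until the next eruption (y), both in minutes.
rng = np.random.default_rng(0)
duration = rng.uniform(1.5, 5.5, size=272)
waiting = 35.0 + 10.3 * duration + 6.0 * rng.normal(size=272)

x_tr, y_tr = duration[:136], waiting[:136]
x_te, y_te = duration[136:], waiting[136:]

Phi_tr = np.column_stack([np.ones_like(x_tr), x_tr])   # affine expansion (1, x)
beta = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)[0]

Phi_te = np.column_stack([np.ones_like(x_te), x_te])
mse = np.mean((Phi_te @ beta - y_te) ** 2)
print(beta, mse)
```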

5. Regularization

Inductive bias

Suppose the ERM solution is not unique. What should we do?

One possible answer: Pick the β of shortest length.
- Fact: The shortest solution ˆβ to (A^T A)β = A^T b is always unique.
- Obtain ˆβ via ˆβ = A†b, where A† is the (Moore-Penrose) pseudoinverse of A.

Why should this be a good idea?
- The data does not give a reason to choose a shorter β over a longer β.
- The preference for shorter β is an inductive bias: it will work well for some problems (e.g., when the true β is short), and not for others.
- All learning algorithms encode some kind of inductive bias.
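
A sketch of the minimum-norm ERM in an underdetermined case (n < d, synthetic data): `numpy.linalg.pinv` gives the pseudoinverse, and `numpy.linalg.lstsq` returns the same minimum-norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                                    # fewer equations than unknowns
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

beta_pinv = np.linalg.pinv(A) @ b               # shortest beta among the ERM solutions
beta_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(beta_pinv, beta_lstsq))       # True: both are the least-norm ERM
print(np.allclose(A @ beta_pinv, b))            # True here, since A has full row rank
```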

Example

ERM with the scaled trigonometric feature expansion

  φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), ...).

[Figures: the same training data shown (i) alone, (ii) with some arbitrary ERM fit, and (iii) with the least ℓ2-norm ERM fit.]

It is not a given that the least norm ERM is better than the other ERM!

Regularized ERM

Combine the two concerns: For a given λ ≥ 0, find a minimizer of

  ˆR(β) + λ ||β||²_2

over β ∈ R^d.

- Fact: If λ > 0, then the solution is always unique (even if n < d)!
- This is called ridge regression. (λ = 0 is ERM / ordinary least squares.)
- The parameter λ controls how much attention is paid to the regularizer ||β||²_2 relative to the data-fitting term ˆR(β).
- Choose λ using cross-validation.
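
A sketch of ridge regression via its closed-form solution: setting the gradient of ˆR(β) + λ||β||²_2 to zero gives (X^T X / n + λ I) β = X^T y / n, which is solvable even when n < d (synthetic data; λ would normally be chosen by cross-validation).

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize (1/n)||X beta - y||^2 + lam * ||beta||^2 in closed form."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
n, d = 30, 50                                   # n < d: unregularized ERM is not unique
X = rng.normal(size=(n, d))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=n)

beta = ridge(X, y, lam=0.1)
print(np.linalg.norm(beta), np.mean((X @ beta - y) ** 2))
```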

Another interpretation of ridge regression

Define the (n + d) × d matrix Ã and the (n + d) × 1 column vector b̃ by stacking d extra rows below the original data:

  Ã := (1/√n) [ x_1^T; ...; x_n^T; √(nλ) e_1^T; ...; √(nλ) e_d^T ],   b̃ := (1/√n) (y_1, ..., y_n, 0, ..., 0)^T,

where e_1, ..., e_d are the standard basis vectors in R^d. Then

  ||Ãβ − b̃||²_2 = ˆR(β) + λ ||β||²_2.

Interpretation:
- d "fake" data points; they ensure that the augmented data matrix Ã has rank d.
- The squared length of each fake feature vector is nλ, and all of the corresponding labels are 0.
- The prediction of β on the i-th fake feature vector is √(nλ) β_i.
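
A quick numeric check of this identity, following the construction on the slide with assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 30, 5, 0.3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
beta = rng.normal(size=d)

A_tilde = np.vstack([X, np.sqrt(n * lam) * np.eye(d)]) / np.sqrt(n)
b_tilde = np.concatenate([y, np.zeros(d)]) / np.sqrt(n)

lhs = np.sum((A_tilde @ beta - b_tilde) ** 2)            # ||A~ beta - b~||^2
rhs = np.mean((X @ beta - y) ** 2) + lam * np.sum(beta ** 2)   # R-hat(beta) + lam ||beta||^2
print(np.isclose(lhs, rhs))                              # True
```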

Regularization with a different norm

Lasso: For a given λ ≥ 0, find a minimizer of

  ˆR(β) + λ ||β||_1

over β ∈ R^d. Here, ||v||_1 = Σ_{i=1}^d |v_i| is the ℓ1-norm.

- Like ridge, the lasso prefers shorter β, but using a different notion of length.
- It tends to produce β that are sparse, i.e., have few non-zero coordinates, or at least are well-approximated by sparse vectors.
- Fact: Vectors with small ℓ1-norm are well-approximated by sparse vectors. If β̃ keeps just the 1/ε² largest coefficients of β (by magnitude) and zeroes out the rest, then ||β̃ − β||_2 ≤ ε ||β||_1.
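
For completeness, a sketch of the lasso on synthetic data using scikit-learn (using that library is my own choice, not part of the lecture); note how many coordinates of the lasso solution are exactly zero compared with ridge.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 100, 50
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.5, 1.0]                 # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)               # alpha plays the role of lambda
ridge = Ridge(alpha=0.1).fit(X, y)

print(np.sum(lasso.coef_ != 0))                  # few non-zero coordinates
print(np.sum(np.abs(ridge.coef_) > 1e-8))        # typically all d coordinates are non-zero
```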