Linear regression COMS 4771



1. Old Faithful and prediction functions

Prediction problem: Old Faithful geyser (Yellowstone). Task: predict the time of the next eruption.

Statistical model for time between eruptions

Historical records of eruptions, where eruption i starts at time a_i and ends at time b_i. [Figure: timeline showing a_0, b_0, a_1, b_1, a_2, b_2, ..., a_n, b_n, the waiting times Y_1, Y_2, Y_3, ..., and the later time t at which the n-th eruption ends.]

IID model (recall the Galton board): Y_1, ..., Y_n, Y are iid ∼ N(µ, σ²). Data: Y_i := a_i − b_{i−1}, i = 1, ..., n. (Admittedly not a great model, since durations are non-negative. Better: take Y_i from an exponential distribution.)

Task: At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.

On Old Faithful data: Using 136 past observations, we form the estimate µ̂ = . The mean squared loss of µ̂ on the next 136 observations is . (Easier to interpret the square root, , which has the same units as Y.)
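A minimal sketch of this baseline in code, assuming NumPy and synthetic inter-eruption times in place of the actual Old Faithful records (so the numbers here are illustrative and do not reproduce the values omitted above):

```python
import numpy as np

# Synthetic stand-in for the Old Faithful inter-eruption times (in minutes);
# the distribution parameters are made up for illustration.
rng = np.random.default_rng(0)
times = rng.normal(loc=71.0, scale=13.0, size=272)

past, future = times[:136], times[136:]

# Estimate mu by the sample mean of the past observations.
mu_hat = past.mean()

# Mean squared loss of the constant prediction mu_hat on the next observations.
mse = np.mean((future - mu_hat) ** 2)
print(f"mu_hat = {mu_hat:.1f}, mean squared loss = {mse:.1f}, root = {np.sqrt(mse):.1f}")
```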

Looking at the data

Naturalist Harry Woodward observed that the time until the next eruption seems to be related to the duration of the last eruption. [Scatter plot: time until next eruption vs. duration of last eruption.]

Using side-information

At prediction time t, the duration of the last eruption is available as side-information. [Figure: timeline with the labeled examples (X_1, Y_1), ..., (X_n, Y_n) and the new side-information X observed at time t.]

IID model for supervised learning: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid random pairs (i.e., labeled examples). X takes values in a feature space 𝒳 (e.g., 𝒳 = R); Y takes values in R.

1. We observe (X_1, Y_1), ..., (X_n, Y_n), and then choose a prediction function (a.k.a. predictor) f̂ : 𝒳 → R. This is called learning or training.
2. At prediction time, we observe X and form the prediction f̂(X).
3. The outcome is Y, and the squared loss is (f̂(X) − Y)².

How should we choose f̂ based on data?

2. Optimal predictors and linear regression models

Distributions over labeled examples

𝒳: space of possible side-information (feature space). 𝒴: space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y) taking values in 𝒳 × 𝒴 can be thought of in two parts:

1. Marginal distribution P_X of X: P_X is a probability distribution on 𝒳.
2. Conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ 𝒳: P_{Y|X=x} is a probability distribution on 𝒴.

This lecture: 𝒴 = R (regression problems).

Optimal predictor

What function f : 𝒳 → R has the smallest (squared loss) risk R(f) := E[(f(X) − Y)²]?

Conditional on X = x, the minimizer of the conditional risk ŷ ↦ E[(ŷ − Y)² | X = x] is the conditional mean E[Y | X = x]. (Recall the Galton board!)

Therefore, the function f* : R → R with f*(x) = E[Y | X = x], x ∈ R, has the smallest risk. f* is called the regression function or conditional mean function.
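A small numerical illustration of this fact, assuming NumPy and a made-up conditional distribution of Y given X = x: the conditional risk of a constant prediction is minimized at the conditional mean.

```python
import numpy as np

# Toy conditional distribution of Y given X = x (values made up for illustration).
y_vals = np.array([1.0, 2.0, 5.0])
probs = np.array([0.2, 0.5, 0.3])

cond_mean = np.sum(probs * y_vals)          # E[Y | X = x] = 2.7

def cond_risk(y_hat):
    # E[(y_hat - Y)^2 | X = x] for a constant prediction y_hat.
    return np.sum(probs * (y_hat - y_vals) ** 2)

# A fine grid search recovers the conditional mean as the minimizer.
grid = np.linspace(0.0, 6.0, 6001)
best = grid[np.argmin([cond_risk(y) for y in grid])]
print(cond_mean, best)                      # both are (approximately) 2.7
```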

Linear regression models

When side-information is encoded as a vector of real numbers x = (x_1, ..., x_d) (called features or variables), it is common to use a linear regression model, such as the following:

Y | X = x ∼ N(x^T β, σ²), x ∈ R^d.

Parameters: β = (β_1, ..., β_d) ∈ R^d and σ² > 0. X = (X_1, ..., X_d) is a random vector (i.e., a vector of random variables). The conditional distribution of Y given X is normal. The marginal distribution of X is not specified.

In this model, the regression function f* is the linear function f_β : R^d → R,

f_β(x) = x^T β = Σ_{i=1}^d x_i β_i, x ∈ R^d.

(We'll often refer to f_β just by β.) [Figure: plot of the regression function f* as a line.]

Enhancing linear regression models

Linear functions might sound rather restricted, but actually they can be quite powerful if you are creative about side-information.

Examples (a code sketch of some of these maps follows the list):

1. Non-linear transformation of an existing variable: for x ∈ R, φ(x) = ln(1 + x).
2. Logical formula of binary variables: for x = (x_1, ..., x_d) ∈ {0, 1}^d, φ(x) = (x_1 ∧ x_5 ∧ x_{10}) ∨ (¬x_2 ∧ x_7).
3. Trigonometric expansion: for x ∈ R, φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), ...).
4. Polynomial expansion: for x = (x_1, ..., x_d) ∈ R^d, φ(x) = (1, x_1, ..., x_d, x_1², ..., x_d², x_1 x_2, ..., x_1 x_d, ..., x_{d−1} x_d).
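A sketch of a few of these feature maps as Python functions, assuming NumPy; the function names are only for illustration.

```python
import numpy as np

def phi_log(x):
    # Non-linear transformation of a scalar variable.
    return np.array([np.log1p(x)])

def phi_trig(x, K=3):
    # Truncated trigonometric expansion of a scalar variable.
    feats = [1.0]
    for k in range(1, K + 1):
        feats += [np.sin(k * x), np.cos(k * x)]
    return np.array(feats)

def phi_quadratic(x):
    # Degree-2 polynomial expansion of a vector: 1, linear, squared, cross terms.
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    cross = [x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate(([1.0], x, x ** 2, cross))

print(phi_quadratic([2.0, 3.0]))   # [1. 2. 3. 4. 9. 6.]
```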

Example: Taking advantage of linearity

Suppose you are trying to predict some health outcome. A physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature: φ(x) = (x_temp − 98.6)².

What if you didn't know about this magic constant 98.6? Instead, use φ(x) = (1, x_temp, x_temp²). We can then learn coefficients β such that β^T φ(x) = (x_temp − 98.6)², or any other quadratic polynomial in x_temp (which may be better!).

Quadratic expansion

A quadratic function f : R → R,

f(x) = ax² + bx + c, x ∈ R,

for a, b, c ∈ R, can be written as a linear function of φ(x), where φ(x) := (1, x, x²), since f(x) = β^T φ(x) with β = (c, b, a).

For a multivariate quadratic function f : R^d → R, use

φ(x) := (1, x_1, ..., x_d, x_1², ..., x_d², x_1 x_2, ..., x_1 x_d, ..., x_{d−1} x_d)

(linear terms, squared terms, and cross terms).

Affine expansion and Old Faithful

Woodward needed an affine expansion for the Old Faithful data: φ(x) := (1, x). [Scatter plot: time until next eruption vs. duration of last eruption, with a fitted affine function.]

The affine function f_{a,b} : R → R, f_{a,b}(x) = a + bx for a, b ∈ R, is a linear function f_β of φ(x) for β = (a, b). (This easily generalizes to multivariate affine functions.)
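A minimal sketch of such an affine fit, assuming NumPy and synthetic (duration, waiting time) pairs in place of the real Old Faithful records; the coefficients used to generate the data are made up and are not the slide's estimates.

```python
import numpy as np

# Synthetic (duration of last eruption, time until next eruption) pairs;
# the affine relationship and noise level below are made up.
rng = np.random.default_rng(1)
duration = rng.uniform(1.5, 5.5, size=136)
wait = 30.0 + 10.0 * duration + rng.normal(scale=6.0, size=136)

# Affine expansion phi(x) = (1, x), then least squares.
A = np.column_stack([np.ones_like(duration), duration])
beta_hat, *_ = np.linalg.lstsq(A, wait, rcond=None)
print(beta_hat)   # approximately recovers (30.0, 10.0)
```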

Why linear regression models?

1. Linear regression models benefit from a good choice of features.
2. The structure of linear functions is very well understood.
3. There are many well-understood and efficient algorithms for learning linear functions from data, even when n and d are large.

3. From data to prediction functions

Maximum likelihood estimation for linear regression

Linear regression model with Gaussian noise: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, with Y | X = x ∼ N(x^T β, σ²), x ∈ R^d. (It is traditional to study linear regression in the context of this model.)

Log-likelihood of (β, σ²), given data (X_i, Y_i) = (x_i, y_i) for i = 1, ..., n:

Σ_{i=1}^n [ −(1/(2σ²)) (x_i^T β − y_i)² + ln(1/√(2πσ²)) ] + terms not involving (β, σ²).

The β that maximizes the log-likelihood is also the β that minimizes

(1/n) Σ_{i=1}^n (x_i^T β − y_i)².

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model.

Empirical distribution and empirical risk

The empirical distribution P_n on (x_1, y_1), ..., (x_n, y_n) has probability mass function p_n given by

p_n((x, y)) := (1/n) Σ_{i=1}^n 1{(x, y) = (x_i, y_i)}, (x, y) ∈ R^d × R.

Plug-in principle: the goal is to find a function f that minimizes the (squared loss) risk R(f) = E[(f(X) − Y)²]. But we don't know the distribution P of (X, Y). Replace P with P_n to get the empirical (squared loss) risk R̂(f):

R̂(f) := (1/n) Σ_{i=1}^n (f(x_i) − y_i)².

Empirical risk minimization

Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss: ERM returns β̂ ∈ arg min_{β ∈ R^d} R̂(β), which coincides with the MLE under the basic linear regression model.

In general: MLE makes sense in the context of the statistical model for which it is derived; ERM makes sense in the context of the general iid model for supervised learning.

Empirical risk minimization in pictures

[Figure: data points and a fitted affine hyperplane.] Red dots: data points. Affine hyperplane: linear function β (via the affine expansion (x_1, x_2) ↦ (1, x_1, x_2)). ERM: minimize the sum of squared vertical lengths from the hyperplane to the points.

Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [x_1^T; ...; x_n^T],  b := (1/√n) (y_1, ..., y_n).

Can write the empirical risk as R̂(β) = ‖Aβ − b‖²₂.

Necessary condition for β to be a minimizer of R̂: ∇R̂(β) = 0, i.e., β is a critical point of R̂. This translates to

(A^T A) β = A^T b,

a system of linear equations called the normal equations.

It can be proved that every critical point of R̂ is a minimizer of R̂.

Aside: Convexity

Let f : R^d → R be a differentiable function. Suppose we find x ∈ R^d such that ∇f(x) = 0. Is x a minimizer of f?

Yes, if f is a convex function:

f((1 − t)x + t x′) ≤ (1 − t) f(x) + t f(x′), for any 0 ≤ t ≤ 1 and any x, x′ ∈ R^d.

Convexity of empirical risk

Checking convexity of g(x) = ‖Ax − b‖²₂:

g((1 − t)x + t x′)
= ‖(1 − t)(Ax − b) + t(Ax′ − b)‖²₂
= (1 − t)² ‖Ax − b‖²₂ + t² ‖Ax′ − b‖²₂ + 2(1 − t)t (Ax − b)^T (Ax′ − b)
= (1 − t) ‖Ax − b‖²₂ + t ‖Ax′ − b‖²₂ − (1 − t)t [ ‖Ax − b‖²₂ + ‖Ax′ − b‖²₂ ] + 2(1 − t)t (Ax − b)^T (Ax′ − b)
≤ (1 − t) ‖Ax − b‖²₂ + t ‖Ax′ − b‖²₂,

where the last step uses the Cauchy-Schwarz inequality and the arithmetic mean/geometric mean (AM/GM) inequality.

Convexity of empirical risk, another way (preview of convex analysis)

Recall R̂(β) = (1/n) Σ_{i=1}^n (x_i^T β − y_i)².

The scalar function g(z) = cz² is convex for any c ≥ 0. The composition (g ∘ a) : R^d → R of any convex function g : R → R and any affine function a : R^d → R is convex. Therefore, the function β ↦ (1/n)(x_i^T β − y_i)² is convex. The sum of convex functions is convex. Therefore R̂ is convex.

Convexity is a useful mathematical property to understand! (We'll study more convex analysis in a few weeks.)

Algorithm for ERM

Algorithm for ERM with linear functions and squared loss.
input: data (x_1, y_1), ..., (x_n, y_n) from R^d × R.
output: linear function β̂ ∈ R^d.
1: Find a solution β̂ to the normal equations defined by the data (using, e.g., Gaussian elimination).
2: return β̂.

Also called ordinary least squares in this context. Running time (dominated by Gaussian elimination): O(nd²). Note: there are many approximate solvers that run in nearly linear time!
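A sketch of this algorithm in Python, assuming NumPy: a generic linear solver stands in for the Gaussian elimination step, and the helper name erm_linear is just for illustration.

```python
import numpy as np

def erm_linear(xs, ys):
    """Ordinary least squares via the normal equations (A^T A) beta = A^T b.

    xs: (n, d) array of feature vectors; ys: (n,) array of labels.
    Assumes A^T A is invertible (otherwise the ERM solution is not unique).
    """
    n = xs.shape[0]
    A = xs / np.sqrt(n)            # same 1/sqrt(n) scaling as in the slides
    b = ys / np.sqrt(n)
    return np.linalg.solve(A.T @ A, A.T @ b)

# Small check on made-up data.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=200)
print(erm_linear(X, y))            # close to beta_true
```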

Geometric interpretation of ERM

Let a_j ∈ R^n be the j-th column of the matrix A ∈ R^{n×d}, so A = [a_1 ··· a_d].

Minimizing ‖Aβ − b‖²₂ is the same as finding the vector b̂ ∈ range(A) closest to b. The solution b̂ is the orthogonal projection of b onto range(A) = {Aβ : β ∈ R^d}. [Figure: b, its projection b̂, and the plane spanned by a_1 and a_2.]

b̂ is uniquely determined. If rank(A) < d, then there is more than one way to write b̂ as a linear combination of a_1, ..., a_d, so the ERM solution is not unique. To get β from b̂: solve the system of linear equations Aβ = b̂.
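A quick numerical check of this picture, assuming NumPy: the fitted vector Aβ̂ is the orthogonal projection of b onto range(A), so the residual b − Aβ̂ is orthogonal to every column of A.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(50, 4))      # columns a_1, ..., a_d
b = rng.normal(size=50)

beta_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
b_hat = A @ beta_hat              # orthogonal projection of b onto range(A)

# The residual is orthogonal to each column of A.
print(np.allclose(A.T @ (b - b_hat), 0.0, atol=1e-8))   # True
```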

Statistical interpretation of ERM

Let (X, Y) ∼ P, where P is some distribution on R^d × R. Which β have the smallest risk R(β) = E[(X^T β − Y)²]?

Necessary condition for β to be a minimizer of R: ∇R(β) = 0, i.e., β is a critical point of R. This translates to

E[XX^T] β = E[Y X],

a system of linear equations called the population normal equations. It can be proved that every critical point of R is a minimizer of R.

Looks familiar? If (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, then E[A^T A] = E[XX^T] and E[A^T b] = E[Y X], so ERM can be regarded as a plug-in estimator for a minimizer of R.

4. Risk, empirical risk, and estimating risk

Risk of ERM

IID model: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, taking values in R^d × R. Let β* be a minimizer of R over all β ∈ R^d, i.e., β* satisfies the population normal equations E[XX^T] β* = E[Y X].

If the ERM solution β̂ is not unique (e.g., if n < d), then R(β̂) can be arbitrarily worse than R(β*). What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X,

R(β̂) − R(β*) = O( tr(cov(εW)) / n )

asymptotically, where W := E[XX^T]^{−1/2} X and ε := Y − X^T β*.

Risk of ERM analysis (rough sketch)

Let ε_i := Y_i − X_i^T β* for each i = 1, ..., n, so

E[ε_i X_i] = E[Y_i X_i] − E[X_i X_i^T] β* = 0

and

√n (β̂ − β*) = ( (1/n) Σ_{i=1}^n X_i X_i^T )^{−1} (1/√n) Σ_{i=1}^n ε_i X_i.

1. By the LLN: (1/n) Σ_{i=1}^n X_i X_i^T →p E[XX^T].
2. By the CLT: (1/√n) Σ_{i=1}^n ε_i X_i →d cov(εX)^{1/2} Z, where Z ∼ N(0, I).

Therefore, the asymptotic distribution of √n(β̂ − β*) is

√n(β̂ − β*) →d E[XX^T]^{−1} cov(εX)^{1/2} Z.

A few more steps give

n ( E[(X^T β̂ − Y)²] − E[(X^T β* − Y)²] ) →d ‖E[XX^T]^{−1/2} cov(εX)^{1/2} Z‖²₂.

The random variable on the RHS is concentrated around its mean tr(cov(εW)).

Risk of ERM: postscript

The analysis does not assume that the linear regression model is "correct"; the data distribution need not come from the normal linear regression model. The only assumptions are those needed for the LLN and CLT to hold.

However, if the normal linear regression model holds, i.e., Y | X = x ∼ N(x^T β*, σ²), then the bound from the theorem becomes

R(β̂) − R(β*) = O( σ² d / n ),

which is familiar to those who have taken introductory statistics.

With more work, can also prove a non-asymptotic risk bound of similar form. In homework/reading, we look at a simpler setting for studying ERM for linear regression, called fixed design.
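A small Monte Carlo sketch of the σ²d/n behavior, assuming NumPy, the normal linear regression model, and isotropic Gaussian features (so that the excess risk equals ‖β̂ − β*‖²₂); all parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 500, 10, 2.0
beta_star = rng.normal(size=d)

excess = []
for _ in range(200):
    X = rng.normal(size=(n, d))                   # E[X X^T] = I
    y = X @ beta_star + sigma * rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # With isotropic X, R(beta_hat) - R(beta*) = ||beta_hat - beta*||^2.
    excess.append(np.sum((beta_hat - beta_star) ** 2))

print(np.mean(excess), sigma**2 * d / n)          # both are around 0.08
```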

Risk vs empirical risk

Let β̂ be the ERM solution.

1. Empirical risk of ERM: R̂(β̂).
2. True risk of ERM: R(β̂).

Theorem. E[R̂(β̂)] ≤ E[R(β̂)]. (Empirical risk can sometimes be larger than true risk, but not on average.)

Overfitting: empirical risk is small, but true risk is much higher.

Overfitting example

(X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid; X is a continuous random variable in R. Suppose we use the degree-k polynomial expansion

φ(x) = (1, x, x², ..., x^k), x ∈ R,

so the dimension is d = k + 1.

Fact: any function on k + 1 points can be interpolated by a polynomial of degree at most k. [Figure: a degree-k polynomial passing through every training point.]

Conclusion: if n ≤ k + 1 = d, the ERM solution β̂ with this feature expansion has R̂(β̂) = 0 always, regardless of its true risk (which can be far from 0).
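A sketch of this conclusion in code, assuming NumPy: with n = k + 1 training points and the degree-k expansion, the ERM fit interpolates the data, so the empirical risk is (numerically) zero even though the labels are pure noise.

```python
import numpy as np

rng = np.random.default_rng(5)
k = 9
x = rng.uniform(-1.0, 1.0, size=k + 1)
y = rng.normal(size=k + 1)                      # labels unrelated to x

Phi = np.vander(x, N=k + 1, increasing=True)    # columns 1, x, ..., x^k
beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

train_mse = np.mean((Phi @ beta_hat - y) ** 2)
print(train_mse)   # ~ 0 (up to round-off): the fit interpolates the training points
```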

Estimating risk

IID model: (X_1, Y_1), ..., (X_n, Y_n), (X̃_1, Ỹ_1), ..., (X̃_m, Ỹ_m) are iid ∼ P.

Training data (X_1, Y_1), ..., (X_n, Y_n): used to learn f̂.
Test data (X̃_1, Ỹ_1), ..., (X̃_m, Ỹ_m): used to estimate risk, via the test risk

R_test(f) := (1/m) Σ_{i=1}^m (f(X̃_i) − Ỹ_i)².

The training data is independent of the test data, so f̂ is independent of the test data. Let L_i := (f̂(X̃_i) − Ỹ_i)² for each i = 1, ..., m, so

E[ R_test(f̂) | f̂ ] = (1/m) Σ_{i=1}^m E[ L_i | f̂ ] = R(f̂).

Moreover, L_1, ..., L_m are conditionally iid given f̂, and hence by the Law of Large Numbers, R_test(f̂) →p R(f̂) as m → ∞. By the CLT, the rate of convergence is m^{−1/2}.
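A sketch of risk estimation with a held-out test set, assuming NumPy and a synthetic data distribution in place of a real one.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, d = 200, 1000, 5
beta_star = rng.normal(size=d)

def draw(num):
    X = rng.normal(size=(num, d))
    y = X @ beta_star + rng.normal(size=num)    # noise variance 1
    return X, y

X_train, y_train = draw(n)     # used only to learn f_hat
X_test, y_test = draw(m)       # used only to estimate the risk of f_hat

beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

test_risk = np.mean((X_test @ beta_hat - y_test) ** 2)
print(test_risk)   # estimates R(beta_hat); here a bit above the noise variance 1.0
```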

Rates for risk minimization vs. rates for risk estimation

One may think that ERM works because, somehow, training risk is a good plug-in estimate of true risk. Sometimes this is partially true; we'll revisit this when we discuss generalization theory. Roughly speaking, under some assumptions, can expect that

|R̂(β) − R(β)| ≤ O(√(d/n)) for all β ∈ R^d.

However...

By the CLT, we know the following holds for a fixed β: R̂(β) →p R(β) at the n^{−1/2} rate. (Here, we ignore the dependence on d.)

Yet, for the ERM β̂: R(β̂) →p R(β*) at the n^{−1} rate. (Also ignoring dependence on d.)

Implication: selecting a good predictor can be easier than estimating how good predictors are!

Old Faithful example

Linear regression model + affine expansion on duration of last eruption.

Learn β̂ = ( , ) from 136 past observations. Mean squared loss of β̂ on the next 136 observations is . (Recall: mean squared loss of µ̂ = was .)

[Scatter plot: time until next eruption vs. duration of last eruption, showing the linear model and the constant prediction.]

(Unfortunately, 35.9 > mean duration 3.5.)

5. Regularization

Inductive bias

Suppose the ERM solution is not unique. What should we do? One possible answer: pick the β of shortest length.

Fact: the shortest solution β̂ to (A^T A)β = A^T b is always unique. Obtain β̂ via β̂ = A⁺ b, where A⁺ is the (Moore-Penrose) pseudoinverse of A.

Why should this be a good idea? The data does not give a reason to choose a shorter β over a longer β. The preference for shorter β is an inductive bias: it will work well for some problems (e.g., when the true β is short), not for others. All learning algorithms encode some kind of inductive bias.
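A sketch of the minimum-norm choice, assuming NumPy, in an underdetermined problem (n < d) where the ERM solution is not unique.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 5, 10                               # fewer observations than features
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

beta_min_norm = np.linalg.pinv(A) @ b      # shortest solution to the normal equations

# Adding a null-space direction of A gives another ERM solution with the same fit.
null_dir = np.linalg.svd(A)[2][-1]         # right singular vector in the null space
other = beta_min_norm + 3.0 * null_dir

print(np.allclose(A @ beta_min_norm, A @ other))              # True: same predictions
print(np.linalg.norm(beta_min_norm) < np.linalg.norm(other))  # True: pinv picks the shortest
```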

Example

ERM with a scaled trigonometric feature expansion:

φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), ...).

[Figures: the training data alone; the training data with some arbitrary ERM; the training data with the least ℓ2 norm ERM.]

It is not a given that the least norm ERM is better than the other ERM!

Regularized ERM

Combine the two concerns: for a given λ ≥ 0, find a minimizer of

R̂(β) + λ ‖β‖²₂

over β ∈ R^d.

Fact: if λ > 0, then the solution is always unique (even if n < d)!

This is called ridge regression. (λ = 0 is ERM / ordinary least squares.) The parameter λ controls how much attention is paid to the regularizer ‖β‖²₂ relative to the data-fitting term R̂(β). Choose λ using cross-validation.
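A sketch of ridge regression via its closed-form solution, assuming NumPy and the same 1/√n scaling of A and b as above; setting the gradient of ‖Aβ − b‖²₂ + λ‖β‖²₂ to zero gives (A^T A + λI)β = A^T b.

```python
import numpy as np

def ridge(xs, ys, lam):
    """Minimize the empirical risk plus lam * ||beta||_2^2 (ridge regression)."""
    n, d = xs.shape
    A, b = xs / np.sqrt(n), ys / np.sqrt(n)
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

# Illustration on made-up data: larger lam shrinks the coefficients toward zero.
rng = np.random.default_rng(8)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=50)
for lam in [0.0, 0.1, 10.0]:
    print(lam, ridge(X, y, lam))
```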

Another interpretation of ridge regression

Define the (n + d) × d matrix Ã and the (n + d) × 1 column vector b̃ by

Ã := (1/√n) [x_1^T; ...; x_n^T; √(nλ) I_d],  b̃ := (1/√n) (y_1, ..., y_n, 0, ..., 0).

Then

‖Ãβ − b̃‖²₂ = R̂(β) + λ ‖β‖²₂.

Interpretation: there are d fake data points; they ensure that the augmented data matrix Ã has rank d. The squared length of each fake feature vector is nλ, and all corresponding labels are 0. The prediction of β on the i-th fake feature vector is √(nλ) β_i.
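A quick numerical check, assuming NumPy, that least squares on the augmented data (Ã, b̃) reproduces the ridge solution.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, lam = 50, 3, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Ridge solution from the closed form (A^T A + lam I) beta = A^T b.
A, b = X / np.sqrt(n), y / np.sqrt(n)
beta_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

# Augmented data: d fake points with feature vectors sqrt(n*lam) * e_i and label 0,
# then the same overall 1/sqrt(n) scaling.
A_tilde = np.vstack([X, np.sqrt(n * lam) * np.eye(d)]) / np.sqrt(n)
b_tilde = np.concatenate([y, np.zeros(d)]) / np.sqrt(n)
beta_aug, *_ = np.linalg.lstsq(A_tilde, b_tilde, rcond=None)

print(np.allclose(beta_ridge, beta_aug))   # True
```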

Regularization with a different norm

Lasso: for a given λ ≥ 0, find a minimizer of

R̂(β) + λ ‖β‖₁

over β ∈ R^d. Here, ‖v‖₁ = Σ_{i=1}^d |v_i| is the ℓ1-norm.

Prefers shorter β, but using a different notion of length than ridge. Tends to produce β that are sparse, i.e., have few non-zero coordinates, or at least are well-approximated by sparse vectors.

Fact: vectors with small ℓ1-norm are well-approximated by sparse vectors. If β̃ contains just the 1/ε² largest coefficients (by magnitude) of β, then ‖β − β̃‖₂ ≤ ε ‖β‖₁.
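A sketch of the sparsity tendency using scikit-learn's Lasso; this is an outside library not mentioned on the slides, and its objective scales the squared-loss term by 1/(2n), so its alpha does not exactly equal the λ above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sparse ground truth: only 3 of 20 coefficients are non-zero (data made up).
rng = np.random.default_rng(10)
n, d = 100, 20
beta_star = np.zeros(d)
beta_star[[0, 5, 12]] = [3.0, -2.0, 1.5]
X = rng.normal(size=(n, d))
y = X @ beta_star + rng.normal(scale=0.5, size=n)

# scikit-learn's Lasso minimizes (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1.
model = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print(np.count_nonzero(model.coef_), "non-zero coefficients out of", d)
```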


More information

Linear Models for Regression. Sargur Srihari

Linear Models for Regression. Sargur Srihari Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

HT Introduction. P(X i = x i ) = e λ λ x i

HT Introduction. P(X i = x i ) = e λ λ x i MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Perceptron Revisited: Linear Separators. Support Vector Machines

Perceptron Revisited: Linear Separators. Support Vector Machines Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department

More information

Linear Regression (9/11/13)

Linear Regression (9/11/13) STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Lasso, Ridge, and Elastic Net

Lasso, Ridge, and Elastic Net Lasso, Ridge, and Elastic Net David Rosenberg New York University February 7, 2017 David Rosenberg (New York University) DS-GA 1003 February 7, 2017 1 / 29 Linearly Dependent Features Linearly Dependent

More information

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

Lecture 6. Regression

Lecture 6. Regression Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January Optimization, MLE, regularization ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

STAT5044: Regression and Anova. Inyoung Kim

STAT5044: Regression and Anova. Inyoung Kim STAT5044: Regression and Anova Inyoung Kim 2 / 47 Outline 1 Regression 2 Simple Linear regression 3 Basic concepts in regression 4 How to estimate unknown parameters 5 Properties of Least Squares Estimators:

More information

COMS 4721: Machine Learning for Data Science Lecture 1, 1/17/2017

COMS 4721: Machine Learning for Data Science Lecture 1, 1/17/2017 COMS 4721: Machine Learning for Data Science Lecture 1, 1/17/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University OVERVIEW This class will cover model-based

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

Linear Methods for Prediction

Linear Methods for Prediction This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Convex envelopes, cardinality constrained optimization and LASSO. An application in supervised learning: support vector machines (SVMs)

Convex envelopes, cardinality constrained optimization and LASSO. An application in supervised learning: support vector machines (SVMs) ORF 523 Lecture 8 Princeton University Instructor: A.A. Ahmadi Scribe: G. Hall Any typos should be emailed to a a a@princeton.edu. 1 Outline Convexity-preserving operations Convex envelopes, cardinality

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Bias and variance (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 49 Our plan today We saw in last lecture that model scoring methods seem to be trading off two different

More information

Day 4: Classification, support vector machines

Day 4: Classification, support vector machines Day 4: Classification, support vector machines Introduction to Machine Learning Summer School June 18, 2018 - June 29, 2018, Chicago Instructor: Suriya Gunasekar, TTI Chicago 21 June 2018 Topics so far

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining Linear Classifiers: predictions Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due Friday of next

More information

MLCC 2017 Regularization Networks I: Linear Models

MLCC 2017 Regularization Networks I: Linear Models MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Least Squares Regression

Least Squares Regression CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the

More information

Support Vector Machines and Bayes Regression

Support Vector Machines and Bayes Regression Statistical Techniques in Robotics (16-831, F11) Lecture #14 (Monday ctober 31th) Support Vector Machines and Bayes Regression Lecturer: Drew Bagnell Scribe: Carl Doersch 1 1 Linear SVMs We begin by considering

More information

Week 3: Linear Regression

Week 3: Linear Regression Week 3: Linear Regression Instructor: Sergey Levine Recap In the previous lecture we saw how linear regression can solve the following problem: given a dataset D = {(x, y ),..., (x N, y N )}, learn to

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

Linear Regression Linear Regression with Shrinkage

Linear Regression Linear Regression with Shrinkage Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.

Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods. TheThalesians Itiseasyforphilosopherstoberichiftheychoose Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods Ivan Zhdankin

More information

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1). Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes

More information

Introduction to Machine Learning

Introduction to Machine Learning 1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,

More information

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination

More information

Fitting Linear Statistical Models to Data by Least Squares: Introduction

Fitting Linear Statistical Models to Data by Least Squares: Introduction Fitting Linear Statistical Models to Data by Least Squares: Introduction Radu Balan, Brian R. Hunt and C. David Levermore University of Maryland, College Park University of Maryland, College Park, MD Math

More information

CSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1

CSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1 CSCI5654 (Linear Programming, Fall 2013) Lectures 10-12 Lectures 10,11 Slide# 1 Today s Lecture 1. Introduction to norms: L 1,L 2,L. 2. Casting absolute value and max operators. 3. Norm minimization problems.

More information

Warm up: risk prediction with logistic regression

Warm up: risk prediction with logistic regression Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate

More information

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices Lecture 3: Simple Linear Regression in Matrix Format To move beyond simple regression we need to use matrix algebra We ll start by re-expressing simple linear regression in matrix form Linear algebra is

More information

Support Vector Machines

Support Vector Machines EE 17/7AT: Optimization Models in Engineering Section 11/1 - April 014 Support Vector Machines Lecturer: Arturo Fernandez Scribe: Arturo Fernandez 1 Support Vector Machines Revisited 1.1 Strictly) Separable

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Bayesian Linear Regression [DRAFT - In Progress]

Bayesian Linear Regression [DRAFT - In Progress] Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

1 Machine Learning Concepts (16 points)

1 Machine Learning Concepts (16 points) CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Least Squares Regression

Least Squares Regression E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Vector Calculus. Lecture Notes

Vector Calculus. Lecture Notes Vector Calculus Lecture Notes Adolfo J. Rumbos c Draft date November 23, 211 2 Contents 1 Motivation for the course 5 2 Euclidean Space 7 2.1 Definition of n Dimensional Euclidean Space........... 7 2.2

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Linear Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1

More information

y Xw 2 2 y Xw λ w 2 2

y Xw 2 2 y Xw λ w 2 2 CS 189 Introduction to Machine Learning Spring 2018 Note 4 1 MLE and MAP for Regression (Part I) So far, we ve explored two approaches of the regression framework, Ordinary Least Squares and Ridge Regression:

More information

Lecture 24 May 30, 2018

Lecture 24 May 30, 2018 Stats 3C: Theory of Statistics Spring 28 Lecture 24 May 3, 28 Prof. Emmanuel Candes Scribe: Martin J. Zhang, Jun Yan, Can Wang, and E. Candes Outline Agenda: High-dimensional Statistical Estimation. Lasso

More information

Support Vector Machines

Support Vector Machines Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)

More information

Lecture 2 Part 1 Optimization

Lecture 2 Part 1 Optimization Lecture 2 Part 1 Optimization (January 16, 2015) Mu Zhu University of Waterloo Need for Optimization E(y x), P(y x) want to go after them first, model some examples last week then, estimate didn t discuss

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Prediction Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict the

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak.

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak. Large Sample Theory Large Sample Theory is a name given to the search for approximations to the behaviour of statistical procedures which are derived by computing limits as the sample size, n, tends to

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Machine Learning Basics: Maximum Likelihood Estimation

Machine Learning Basics: Maximum Likelihood Estimation Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning

More information

Bias-Variance Tradeoff

Bias-Variance Tradeoff What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

More information