Linear regression COMS 4771



1. Old Faithful and prediction functions

Prediction problem: Old Faithful geyser (Yellowstone). Task: predict the time of the next eruption.

Statistical model for time between eruptions

Historical records of eruptions, where eruption i starts at time a_i and ends at time b_i. [Figure: timeline showing a_0, b_0, a_1, b_1, a_2, b_2, ..., a_n, b_n, the waiting times Y_1, Y_2, Y_3, ..., and the later time t at which the n-th eruption ends.]

IID model (recall the Galton board): Y_1, ..., Y_n, Y are iid ∼ N(µ, σ²). Data: Y_i := a_i − b_{i−1}, i = 1, ..., n. (Admittedly not a great model, since durations are non-negative. Better: take Y_i from an exponential distribution.)

Task: At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.

On Old Faithful data: Using 136 past observations, we form the estimate µ̂ = . The mean squared loss of µ̂ on the next 136 observations is . (Easier to interpret the square root, , which has the same units as Y.)
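A minimal sketch of this baseline in code, assuming NumPy and synthetic inter-eruption times in place of the actual Old Faithful records (so the numbers here are illustrative and do not reproduce the values omitted above):

```python
import numpy as np

# Synthetic stand-in for the Old Faithful inter-eruption times (in minutes);
# the distribution parameters are made up for illustration.
rng = np.random.default_rng(0)
times = rng.normal(loc=71.0, scale=13.0, size=272)

past, future = times[:136], times[136:]

# Estimate mu by the sample mean of the past observations.
mu_hat = past.mean()

# Mean squared loss of the constant prediction mu_hat on the next observations.
mse = np.mean((future - mu_hat) ** 2)
print(f"mu_hat = {mu_hat:.1f}, mean squared loss = {mse:.1f}, root = {np.sqrt(mse):.1f}")
```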

Looking at the data

Naturalist Harry Woodward observed that the time until the next eruption seems to be related to the duration of the last eruption. [Scatter plot: time until next eruption vs. duration of last eruption.]

Using side-information

At prediction time t, the duration of the last eruption is available as side-information. [Figure: timeline with the labeled examples (X_1, Y_1), ..., (X_n, Y_n) and the new side-information X observed at time t.]

IID model for supervised learning: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid random pairs (i.e., labeled examples). X takes values in a feature space 𝒳 (e.g., 𝒳 = R); Y takes values in R.

1. We observe (X_1, Y_1), ..., (X_n, Y_n), and then choose a prediction function (a.k.a. predictor) f̂ : 𝒳 → R. This is called learning or training.
2. At prediction time, we observe X and form the prediction f̂(X).
3. The outcome is Y, and the squared loss is (f̂(X) − Y)².

How should we choose f̂ based on data?

2. Optimal predictors and linear regression models

Distributions over labeled examples

𝒳: space of possible side-information (feature space). 𝒴: space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y) taking values in 𝒳 × 𝒴 can be thought of in two parts:

1. Marginal distribution P_X of X: P_X is a probability distribution on 𝒳.
2. Conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ 𝒳: P_{Y|X=x} is a probability distribution on 𝒴.

This lecture: 𝒴 = R (regression problems).

Optimal predictor

What function f : 𝒳 → R has the smallest (squared loss) risk R(f) := E[(f(X) − Y)²]?

Conditional on X = x, the minimizer of the conditional risk ŷ ↦ E[(ŷ − Y)² | X = x] is the conditional mean E[Y | X = x]. (Recall the Galton board!)

Therefore, the function f* : R → R with f*(x) = E[Y | X = x], x ∈ R, has the smallest risk. f* is called the regression function or conditional mean function.
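A small numerical illustration of this fact, assuming NumPy and a made-up conditional distribution of Y given X = x: the conditional risk of a constant prediction is minimized at the conditional mean.

```python
import numpy as np

# Toy conditional distribution of Y given X = x (values made up for illustration).
y_vals = np.array([1.0, 2.0, 5.0])
probs = np.array([0.2, 0.5, 0.3])

cond_mean = np.sum(probs * y_vals)          # E[Y | X = x] = 2.7

def cond_risk(y_hat):
    # E[(y_hat - Y)^2 | X = x] for a constant prediction y_hat.
    return np.sum(probs * (y_hat - y_vals) ** 2)

# A fine grid search recovers the conditional mean as the minimizer.
grid = np.linspace(0.0, 6.0, 6001)
best = grid[np.argmin([cond_risk(y) for y in grid])]
print(cond_mean, best)                      # both are (approximately) 2.7
```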

Linear regression models

When side-information is encoded as a vector of real numbers x = (x_1, ..., x_d) (called features or variables), it is common to use a linear regression model, such as the following:

Y | X = x ∼ N(x^T β, σ²), x ∈ R^d.

Parameters: β = (β_1, ..., β_d) ∈ R^d and σ² > 0. X = (X_1, ..., X_d) is a random vector (i.e., a vector of random variables). The conditional distribution of Y given X is normal. The marginal distribution of X is not specified.

In this model, the regression function f* is the linear function f_β : R^d → R,

f_β(x) = x^T β = Σ_{i=1}^d x_i β_i, x ∈ R^d.

(We'll often refer to f_β just by β.) [Figure: plot of the regression function f* as a line.]

Enhancing linear regression models

Linear functions might sound rather restricted, but actually they can be quite powerful if you are creative about side-information.

Examples (a code sketch of some of these maps follows the list):

1. Non-linear transformation of an existing variable: for x ∈ R, φ(x) = ln(1 + x).
2. Logical formula of binary variables: for x = (x_1, ..., x_d) ∈ {0, 1}^d, φ(x) = (x_1 ∧ x_5 ∧ x_{10}) ∨ (¬x_2 ∧ x_7).
3. Trigonometric expansion: for x ∈ R, φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), ...).
4. Polynomial expansion: for x = (x_1, ..., x_d) ∈ R^d, φ(x) = (1, x_1, ..., x_d, x_1², ..., x_d², x_1 x_2, ..., x_1 x_d, ..., x_{d−1} x_d).
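A sketch of a few of these feature maps as Python functions, assuming NumPy; the function names are only for illustration.

```python
import numpy as np

def phi_log(x):
    # Non-linear transformation of a scalar variable.
    return np.array([np.log1p(x)])

def phi_trig(x, K=3):
    # Truncated trigonometric expansion of a scalar variable.
    feats = [1.0]
    for k in range(1, K + 1):
        feats += [np.sin(k * x), np.cos(k * x)]
    return np.array(feats)

def phi_quadratic(x):
    # Degree-2 polynomial expansion of a vector: 1, linear, squared, cross terms.
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    cross = [x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate(([1.0], x, x ** 2, cross))

print(phi_quadratic([2.0, 3.0]))   # [1. 2. 3. 4. 9. 6.]
```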

Example: Taking advantage of linearity

Suppose you are trying to predict some health outcome. A physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature: φ(x) = (x_temp − 98.6)².

What if you didn't know about this magic constant 98.6? Instead, use φ(x) = (1, x_temp, x_temp²). We can then learn coefficients β such that β^T φ(x) = (x_temp − 98.6)², or any other quadratic polynomial in x_temp (which may be better!).

Quadratic expansion

A quadratic function f : R → R,

f(x) = ax² + bx + c, x ∈ R,

for a, b, c ∈ R, can be written as a linear function of φ(x), where φ(x) := (1, x, x²), since f(x) = β^T φ(x) with β = (c, b, a).

For a multivariate quadratic function f : R^d → R, use

φ(x) := (1, x_1, ..., x_d, x_1², ..., x_d², x_1 x_2, ..., x_1 x_d, ..., x_{d−1} x_d)

(linear terms, squared terms, and cross terms).

Affine expansion and Old Faithful

Woodward needed an affine expansion for the Old Faithful data: φ(x) := (1, x). [Scatter plot: time until next eruption vs. duration of last eruption, with a fitted affine function.]

The affine function f_{a,b} : R → R, f_{a,b}(x) = a + bx for a, b ∈ R, is a linear function f_β of φ(x) for β = (a, b). (This easily generalizes to multivariate affine functions.)
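A minimal sketch of such an affine fit, assuming NumPy and synthetic (duration, waiting time) pairs in place of the real Old Faithful records; the coefficients used to generate the data are made up and are not the slide's estimates.

```python
import numpy as np

# Synthetic (duration of last eruption, time until next eruption) pairs;
# the affine relationship and noise level below are made up.
rng = np.random.default_rng(1)
duration = rng.uniform(1.5, 5.5, size=136)
wait = 30.0 + 10.0 * duration + rng.normal(scale=6.0, size=136)

# Affine expansion phi(x) = (1, x), then least squares.
A = np.column_stack([np.ones_like(duration), duration])
beta_hat, *_ = np.linalg.lstsq(A, wait, rcond=None)
print(beta_hat)   # approximately recovers (30.0, 10.0)
```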

Why linear regression models?

1. Linear regression models benefit from a good choice of features.
2. The structure of linear functions is very well understood.
3. There are many well-understood and efficient algorithms for learning linear functions from data, even when n and d are large.

3. From data to prediction functions

Maximum likelihood estimation for linear regression

Linear regression model with Gaussian noise: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, with Y | X = x ∼ N(x^T β, σ²), x ∈ R^d. (It is traditional to study linear regression in the context of this model.)

Log-likelihood of (β, σ²), given data (X_i, Y_i) = (x_i, y_i) for i = 1, ..., n:

Σ_{i=1}^n [ −(1/(2σ²)) (x_i^T β − y_i)² + ln(1/√(2πσ²)) ] + terms not involving (β, σ²).

The β that maximizes the log-likelihood is also the β that minimizes

(1/n) Σ_{i=1}^n (x_i^T β − y_i)².

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model.

Empirical distribution and empirical risk

The empirical distribution P_n on (x_1, y_1), ..., (x_n, y_n) has probability mass function p_n given by

p_n((x, y)) := (1/n) Σ_{i=1}^n 1{(x, y) = (x_i, y_i)}, (x, y) ∈ R^d × R.

Plug-in principle: the goal is to find a function f that minimizes the (squared loss) risk R(f) = E[(f(X) − Y)²]. But we don't know the distribution P of (X, Y). Replace P with P_n to get the empirical (squared loss) risk R̂(f):

R̂(f) := (1/n) Σ_{i=1}^n (f(x_i) − y_i)².

Empirical risk minimization

Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss: ERM returns β̂ ∈ arg min_{β ∈ R^d} R̂(β), which coincides with the MLE under the basic linear regression model.

In general: MLE makes sense in the context of the statistical model for which it is derived; ERM makes sense in the context of the general iid model for supervised learning.

Empirical risk minimization in pictures

[Figure: data points and a fitted affine hyperplane.] Red dots: data points. Affine hyperplane: linear function β (via the affine expansion (x_1, x_2) ↦ (1, x_1, x_2)). ERM: minimize the sum of squared vertical lengths from the hyperplane to the points.

Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [x_1^T; ...; x_n^T],  b := (1/√n) (y_1, ..., y_n).

Can write the empirical risk as R̂(β) = ‖Aβ − b‖²₂.

Necessary condition for β to be a minimizer of R̂: ∇R̂(β) = 0, i.e., β is a critical point of R̂. This translates to

(A^T A) β = A^T b,

a system of linear equations called the normal equations.

It can be proved that every critical point of R̂ is a minimizer of R̂.

Aside: Convexity

Let f : R^d → R be a differentiable function. Suppose we find x ∈ R^d such that ∇f(x) = 0. Is x a minimizer of f?

Yes, if f is a convex function:

f((1 − t)x + t x′) ≤ (1 − t) f(x) + t f(x′), for any 0 ≤ t ≤ 1 and any x, x′ ∈ R^d.

Convexity of empirical risk

Checking convexity of g(x) = ‖Ax − b‖²₂:

g((1 − t)x + t x′)
= ‖(1 − t)(Ax − b) + t(Ax′ − b)‖²₂
= (1 − t)² ‖Ax − b‖²₂ + t² ‖Ax′ − b‖²₂ + 2(1 − t)t (Ax − b)^T (Ax′ − b)
= (1 − t) ‖Ax − b‖²₂ + t ‖Ax′ − b‖²₂ − (1 − t)t [ ‖Ax − b‖²₂ + ‖Ax′ − b‖²₂ ] + 2(1 − t)t (Ax − b)^T (Ax′ − b)
≤ (1 − t) ‖Ax − b‖²₂ + t ‖Ax′ − b‖²₂,

where the last step uses the Cauchy-Schwarz inequality and the arithmetic mean/geometric mean (AM/GM) inequality.

Convexity of empirical risk, another way (preview of convex analysis)

Recall R̂(β) = (1/n) Σ_{i=1}^n (x_i^T β − y_i)².

The scalar function g(z) = cz² is convex for any c ≥ 0. The composition (g ∘ a) : R^d → R of any convex function g : R → R and any affine function a : R^d → R is convex. Therefore, the function β ↦ (1/n)(x_i^T β − y_i)² is convex. The sum of convex functions is convex. Therefore R̂ is convex.

Convexity is a useful mathematical property to understand! (We'll study more convex analysis in a few weeks.)

Algorithm for ERM

Algorithm for ERM with linear functions and squared loss.
input: data (x_1, y_1), ..., (x_n, y_n) from R^d × R.
output: linear function β̂ ∈ R^d.
1: Find a solution β̂ to the normal equations defined by the data (using, e.g., Gaussian elimination).
2: return β̂.

Also called ordinary least squares in this context. Running time (dominated by Gaussian elimination): O(nd²). Note: there are many approximate solvers that run in nearly linear time!
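A sketch of this algorithm in Python, assuming NumPy: a generic linear solver stands in for the Gaussian elimination step, and the helper name erm_linear is just for illustration.

```python
import numpy as np

def erm_linear(xs, ys):
    """Ordinary least squares via the normal equations (A^T A) beta = A^T b.

    xs: (n, d) array of feature vectors; ys: (n,) array of labels.
    Assumes A^T A is invertible (otherwise the ERM solution is not unique).
    """
    n = xs.shape[0]
    A = xs / np.sqrt(n)            # same 1/sqrt(n) scaling as in the slides
    b = ys / np.sqrt(n)
    return np.linalg.solve(A.T @ A, A.T @ b)

# Small check on made-up data.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=200)
print(erm_linear(X, y))            # close to beta_true
```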

Geometric interpretation of ERM

Let a_j ∈ R^n be the j-th column of the matrix A ∈ R^{n×d}, so A = [a_1 ··· a_d].

Minimizing ‖Aβ − b‖²₂ is the same as finding the vector b̂ ∈ range(A) closest to b. The solution b̂ is the orthogonal projection of b onto range(A) = {Aβ : β ∈ R^d}. [Figure: b, its projection b̂, and the plane spanned by a_1 and a_2.]

b̂ is uniquely determined. If rank(A) < d, then there is more than one way to write b̂ as a linear combination of a_1, ..., a_d, so the ERM solution is not unique. To get β from b̂: solve the system of linear equations Aβ = b̂.
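A quick numerical check of this picture, assuming NumPy: the fitted vector Aβ̂ is the orthogonal projection of b onto range(A), so the residual b − Aβ̂ is orthogonal to every column of A.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(50, 4))      # columns a_1, ..., a_d
b = rng.normal(size=50)

beta_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
b_hat = A @ beta_hat              # orthogonal projection of b onto range(A)

# The residual is orthogonal to each column of A.
print(np.allclose(A.T @ (b - b_hat), 0.0, atol=1e-8))   # True
```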

Statistical interpretation of ERM

Let (X, Y) ∼ P, where P is some distribution on R^d × R. Which β have the smallest risk R(β) = E[(X^T β − Y)²]?

Necessary condition for β to be a minimizer of R: ∇R(β) = 0, i.e., β is a critical point of R. This translates to

E[XX^T] β = E[Y X],

a system of linear equations called the population normal equations. It can be proved that every critical point of R is a minimizer of R.

Looks familiar? If (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, then E[A^T A] = E[XX^T] and E[A^T b] = E[Y X], so ERM can be regarded as a plug-in estimator for a minimizer of R.

4. Risk, empirical risk, and estimating risk

Risk of ERM

IID model: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, taking values in R^d × R. Let β* be a minimizer of R over all β ∈ R^d, i.e., β* satisfies the population normal equations E[XX^T] β* = E[Y X].

If the ERM solution β̂ is not unique (e.g., if n < d), then R(β̂) can be arbitrarily worse than R(β*). What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X,

R(β̂) − R(β*) = O( tr(cov(εW)) / n )

asymptotically, where W := E[XX^T]^{−1/2} X and ε := Y − X^T β*.

Risk of ERM analysis (rough sketch)

Let ε_i := Y_i − X_i^T β* for each i = 1, ..., n, so

E[ε_i X_i] = E[Y_i X_i] − E[X_i X_i^T] β* = 0

and

√n (β̂ − β*) = ( (1/n) Σ_{i=1}^n X_i X_i^T )^{−1} (1/√n) Σ_{i=1}^n ε_i X_i.

1. By the LLN: (1/n) Σ_{i=1}^n X_i X_i^T →p E[XX^T].
2. By the CLT: (1/√n) Σ_{i=1}^n ε_i X_i →d cov(εX)^{1/2} Z, where Z ∼ N(0, I).

Therefore, the asymptotic distribution of √n(β̂ − β*) is

√n(β̂ − β*) →d E[XX^T]^{−1} cov(εX)^{1/2} Z.

A few more steps give

n ( E[(X^T β̂ − Y)²] − E[(X^T β* − Y)²] ) →d ‖E[XX^T]^{−1/2} cov(εX)^{1/2} Z‖²₂.

The random variable on the RHS is concentrated around its mean tr(cov(εW)).

Risk of ERM: postscript

The analysis does not assume that the linear regression model is "correct"; the data distribution need not come from the normal linear regression model. The only assumptions are those needed for the LLN and CLT to hold.

However, if the normal linear regression model holds, i.e., Y | X = x ∼ N(x^T β*, σ²), then the bound from the theorem becomes

R(β̂) − R(β*) = O( σ² d / n ),

which is familiar to those who have taken introductory statistics.

With more work, can also prove a non-asymptotic risk bound of similar form. In homework/reading, we look at a simpler setting for studying ERM for linear regression, called fixed design.
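A small Monte Carlo sketch of the σ²d/n behavior, assuming NumPy, the normal linear regression model, and isotropic Gaussian features (so that the excess risk equals ‖β̂ − β*‖²₂); all parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 500, 10, 2.0
beta_star = rng.normal(size=d)

excess = []
for _ in range(200):
    X = rng.normal(size=(n, d))                   # E[X X^T] = I
    y = X @ beta_star + sigma * rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # With isotropic X, R(beta_hat) - R(beta*) = ||beta_hat - beta*||^2.
    excess.append(np.sum((beta_hat - beta_star) ** 2))

print(np.mean(excess), sigma**2 * d / n)          # both are around 0.08
```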

Risk vs empirical risk

Let β̂ be the ERM solution.

1. Empirical risk of ERM: R̂(β̂).
2. True risk of ERM: R(β̂).

Theorem. E[R̂(β̂)] ≤ E[R(β̂)]. (Empirical risk can sometimes be larger than true risk, but not on average.)

Overfitting: empirical risk is small, but true risk is much higher.

Overfitting example

(X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid; X is a continuous random variable in R. Suppose we use the degree-k polynomial expansion

φ(x) = (1, x, x², ..., x^k), x ∈ R,

so the dimension is d = k + 1.

Fact: any function on k + 1 points can be interpolated by a polynomial of degree at most k. [Figure: a degree-k polynomial passing through every training point.]

Conclusion: if n ≤ k + 1 = d, the ERM solution β̂ with this feature expansion has R̂(β̂) = 0 always, regardless of its true risk (which can be far from 0).
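A sketch of this conclusion in code, assuming NumPy: with n = k + 1 training points and the degree-k expansion, the ERM fit interpolates the data, so the empirical risk is (numerically) zero even though the labels are pure noise.

```python
import numpy as np

rng = np.random.default_rng(5)
k = 9
x = rng.uniform(-1.0, 1.0, size=k + 1)
y = rng.normal(size=k + 1)                      # labels unrelated to x

Phi = np.vander(x, N=k + 1, increasing=True)    # columns 1, x, ..., x^k
beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

train_mse = np.mean((Phi @ beta_hat - y) ** 2)
print(train_mse)   # ~ 0 (up to round-off): the fit interpolates the training points
```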

Estimating risk

IID model: (X_1, Y_1), ..., (X_n, Y_n), (X̃_1, Ỹ_1), ..., (X̃_m, Ỹ_m) are iid ∼ P.

Training data (X_1, Y_1), ..., (X_n, Y_n): used to learn f̂.
Test data (X̃_1, Ỹ_1), ..., (X̃_m, Ỹ_m): used to estimate risk, via the test risk

R_test(f) := (1/m) Σ_{i=1}^m (f(X̃_i) − Ỹ_i)².

The training data is independent of the test data, so f̂ is independent of the test data. Let L_i := (f̂(X̃_i) − Ỹ_i)² for each i = 1, ..., m, so

E[ R_test(f̂) | f̂ ] = (1/m) Σ_{i=1}^m E[ L_i | f̂ ] = R(f̂).

Moreover, L_1, ..., L_m are conditionally iid given f̂, and hence by the Law of Large Numbers, R_test(f̂) →p R(f̂) as m → ∞. By the CLT, the rate of convergence is m^{−1/2}.
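A sketch of risk estimation with a held-out test set, assuming NumPy and a synthetic data distribution in place of a real one.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, d = 200, 1000, 5
beta_star = rng.normal(size=d)

def draw(num):
    X = rng.normal(size=(num, d))
    y = X @ beta_star + rng.normal(size=num)    # noise variance 1
    return X, y

X_train, y_train = draw(n)     # used only to learn f_hat
X_test, y_test = draw(m)       # used only to estimate the risk of f_hat

beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

test_risk = np.mean((X_test @ beta_hat - y_test) ** 2)
print(test_risk)   # estimates R(beta_hat); here a bit above the noise variance 1.0
```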

Rates for risk minimization vs. rates for risk estimation

One may think that ERM works because, somehow, training risk is a good plug-in estimate of true risk. Sometimes this is partially true; we'll revisit this when we discuss generalization theory. Roughly speaking, under some assumptions, can expect that

|R̂(β) − R(β)| ≤ O(√(d/n)) for all β ∈ R^d.

However...

By the CLT, we know the following holds for a fixed β: R̂(β) →p R(β) at the n^{−1/2} rate. (Here, we ignore the dependence on d.)

Yet, for the ERM β̂: R(β̂) →p R(β*) at the n^{−1} rate. (Also ignoring dependence on d.)

Implication: selecting a good predictor can be easier than estimating how good predictors are!

Old Faithful example

Linear regression model + affine expansion on duration of last eruption.

Learn β̂ = ( , ) from 136 past observations. Mean squared loss of β̂ on the next 136 observations is . (Recall: mean squared loss of µ̂ = was .)

[Scatter plot: time until next eruption vs. duration of last eruption, showing the linear model and the constant prediction.]

(Unfortunately, 35.9 > mean duration 3.5.)

5. Regularization

Inductive bias

Suppose the ERM solution is not unique. What should we do? One possible answer: pick the β of shortest length.

Fact: the shortest solution β̂ to (A^T A)β = A^T b is always unique. Obtain β̂ via β̂ = A⁺ b, where A⁺ is the (Moore-Penrose) pseudoinverse of A.

Why should this be a good idea? The data does not give a reason to choose a shorter β over a longer β. The preference for shorter β is an inductive bias: it will work well for some problems (e.g., when the true β is short), not for others. All learning algorithms encode some kind of inductive bias.
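A sketch of the minimum-norm choice, assuming NumPy, in an underdetermined problem (n < d) where the ERM solution is not unique.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 5, 10                               # fewer observations than features
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

beta_min_norm = np.linalg.pinv(A) @ b      # shortest solution to the normal equations

# Adding a null-space direction of A gives another ERM solution with the same fit.
null_dir = np.linalg.svd(A)[2][-1]         # right singular vector in the null space
other = beta_min_norm + 3.0 * null_dir

print(np.allclose(A @ beta_min_norm, A @ other))              # True: same predictions
print(np.linalg.norm(beta_min_norm) < np.linalg.norm(other))  # True: pinv picks the shortest
```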

Example

ERM with a scaled trigonometric feature expansion:

φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), ...).

[Figures: the training data alone; the training data with some arbitrary ERM; the training data with the least ℓ2 norm ERM.]

It is not a given that the least norm ERM is better than the other ERM!

Regularized ERM

Combine the two concerns: for a given λ ≥ 0, find a minimizer of

R̂(β) + λ ‖β‖²₂

over β ∈ R^d.

Fact: if λ > 0, then the solution is always unique (even if n < d)!

This is called ridge regression. (λ = 0 is ERM / ordinary least squares.) The parameter λ controls how much attention is paid to the regularizer ‖β‖²₂ relative to the data-fitting term R̂(β). Choose λ using cross-validation.
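A sketch of ridge regression via its closed-form solution, assuming NumPy and the same 1/√n scaling of A and b as above; setting the gradient of ‖Aβ − b‖²₂ + λ‖β‖²₂ to zero gives (A^T A + λI)β = A^T b.

```python
import numpy as np

def ridge(xs, ys, lam):
    """Minimize the empirical risk plus lam * ||beta||_2^2 (ridge regression)."""
    n, d = xs.shape
    A, b = xs / np.sqrt(n), ys / np.sqrt(n)
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

# Illustration on made-up data: larger lam shrinks the coefficients toward zero.
rng = np.random.default_rng(8)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=50)
for lam in [0.0, 0.1, 10.0]:
    print(lam, ridge(X, y, lam))
```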

Another interpretation of ridge regression

Define the (n + d) × d matrix Ã and the (n + d) × 1 column vector b̃ by

Ã := (1/√n) [x_1^T; ...; x_n^T; √(nλ) I_d],  b̃ := (1/√n) (y_1, ..., y_n, 0, ..., 0).

Then

‖Ãβ − b̃‖²₂ = R̂(β) + λ ‖β‖²₂.

Interpretation: there are d fake data points; they ensure that the augmented data matrix Ã has rank d. The squared length of each fake feature vector is nλ, and all corresponding labels are 0. The prediction of β on the i-th fake feature vector is √(nλ) β_i.
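A quick numerical check, assuming NumPy, that least squares on the augmented data (Ã, b̃) reproduces the ridge solution.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, lam = 50, 3, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Ridge solution from the closed form (A^T A + lam I) beta = A^T b.
A, b = X / np.sqrt(n), y / np.sqrt(n)
beta_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

# Augmented data: d fake points with feature vectors sqrt(n*lam) * e_i and label 0,
# then the same overall 1/sqrt(n) scaling.
A_tilde = np.vstack([X, np.sqrt(n * lam) * np.eye(d)]) / np.sqrt(n)
b_tilde = np.concatenate([y, np.zeros(d)]) / np.sqrt(n)
beta_aug, *_ = np.linalg.lstsq(A_tilde, b_tilde, rcond=None)

print(np.allclose(beta_ridge, beta_aug))   # True
```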

Regularization with a different norm

Lasso: for a given λ ≥ 0, find a minimizer of

R̂(β) + λ ‖β‖₁

over β ∈ R^d. Here, ‖v‖₁ = Σ_{i=1}^d |v_i| is the ℓ1-norm.

Prefers shorter β, but using a different notion of length than ridge. Tends to produce β that are sparse, i.e., have few non-zero coordinates, or at least are well-approximated by sparse vectors.

Fact: vectors with small ℓ1-norm are well-approximated by sparse vectors. If β̃ contains just the 1/ε² largest coefficients (by magnitude) of β, then ‖β − β̃‖₂ ≤ ε ‖β‖₁.
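A sketch of the sparsity tendency using scikit-learn's Lasso; this is an outside library not mentioned on the slides, and its objective scales the squared-loss term by 1/(2n), so its alpha does not exactly equal the λ above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sparse ground truth: only 3 of 20 coefficients are non-zero (data made up).
rng = np.random.default_rng(10)
n, d = 100, 20
beta_star = np.zeros(d)
beta_star[[0, 5, 12]] = [3.0, -2.0, 1.5]
X = rng.normal(size=(n, d))
y = X @ beta_star + rng.normal(scale=0.5, size=n)

# scikit-learn's Lasso minimizes (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1.
model = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print(np.count_nonzero(model.coef_), "non-zero coefficients out of", d)
```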


More information

Linear Models for Regression. Sargur Srihari

Linear Models for Regression. Sargur Srihari Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

HT Introduction. P(X i = x i ) = e λ λ x i

HT Introduction. P(X i = x i ) = e λ λ x i MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Perceptron Revisited: Linear Separators. Support Vector Machines

Perceptron Revisited: Linear Separators. Support Vector Machines Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department

More information

Linear Regression (9/11/13)

Linear Regression (9/11/13) STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Lasso, Ridge, and Elastic Net

Lasso, Ridge, and Elastic Net Lasso, Ridge, and Elastic Net David Rosenberg New York University February 7, 2017 David Rosenberg (New York University) DS-GA 1003 February 7, 2017 1 / 29 Linearly Dependent Features Linearly Dependent

More information

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

Lecture 6. Regression

Lecture 6. Regression Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January Optimization, MLE, regularization ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

STAT5044: Regression and Anova. Inyoung Kim

STAT5044: Regression and Anova. Inyoung Kim STAT5044: Regression and Anova Inyoung Kim 2 / 47 Outline 1 Regression 2 Simple Linear regression 3 Basic concepts in regression 4 How to estimate unknown parameters 5 Properties of Least Squares Estimators:

More information

COMS 4721: Machine Learning for Data Science Lecture 1, 1/17/2017

COMS 4721: Machine Learning for Data Science Lecture 1, 1/17/2017 COMS 4721: Machine Learning for Data Science Lecture 1, 1/17/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University OVERVIEW This class will cover model-based

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

Linear Methods for Prediction

Linear Methods for Prediction This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Convex envelopes, cardinality constrained optimization and LASSO. An application in supervised learning: support vector machines (SVMs)

Convex envelopes, cardinality constrained optimization and LASSO. An application in supervised learning: support vector machines (SVMs) ORF 523 Lecture 8 Princeton University Instructor: A.A. Ahmadi Scribe: G. Hall Any typos should be emailed to a a a@princeton.edu. 1 Outline Convexity-preserving operations Convex envelopes, cardinality

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Bias and variance (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 49 Our plan today We saw in last lecture that model scoring methods seem to be trading off two different

More information

Day 4: Classification, support vector machines

Day 4: Classification, support vector machines Day 4: Classification, support vector machines Introduction to Machine Learning Summer School June 18, 2018 - June 29, 2018, Chicago Instructor: Suriya Gunasekar, TTI Chicago 21 June 2018 Topics so far

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining Linear Classifiers: predictions Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due Friday of next

More information

MLCC 2017 Regularization Networks I: Linear Models

MLCC 2017 Regularization Networks I: Linear Models MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Least Squares Regression

Least Squares Regression CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the

More information

Support Vector Machines and Bayes Regression

Support Vector Machines and Bayes Regression Statistical Techniques in Robotics (16-831, F11) Lecture #14 (Monday ctober 31th) Support Vector Machines and Bayes Regression Lecturer: Drew Bagnell Scribe: Carl Doersch 1 1 Linear SVMs We begin by considering

More information

Week 3: Linear Regression

Week 3: Linear Regression Week 3: Linear Regression Instructor: Sergey Levine Recap In the previous lecture we saw how linear regression can solve the following problem: given a dataset D = {(x, y ),..., (x N, y N )}, learn to

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

Linear Regression Linear Regression with Shrinkage

Linear Regression Linear Regression with Shrinkage Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.

Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods. TheThalesians Itiseasyforphilosopherstoberichiftheychoose Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods Ivan Zhdankin

More information

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1). Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes

More information

Introduction to Machine Learning

Introduction to Machine Learning 1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,

More information

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination

More information

Fitting Linear Statistical Models to Data by Least Squares: Introduction

Fitting Linear Statistical Models to Data by Least Squares: Introduction Fitting Linear Statistical Models to Data by Least Squares: Introduction Radu Balan, Brian R. Hunt and C. David Levermore University of Maryland, College Park University of Maryland, College Park, MD Math

More information

CSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1

CSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1 CSCI5654 (Linear Programming, Fall 2013) Lectures 10-12 Lectures 10,11 Slide# 1 Today s Lecture 1. Introduction to norms: L 1,L 2,L. 2. Casting absolute value and max operators. 3. Norm minimization problems.

More information

Warm up: risk prediction with logistic regression

Warm up: risk prediction with logistic regression Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate

More information

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices Lecture 3: Simple Linear Regression in Matrix Format To move beyond simple regression we need to use matrix algebra We ll start by re-expressing simple linear regression in matrix form Linear algebra is

More information

Support Vector Machines

Support Vector Machines EE 17/7AT: Optimization Models in Engineering Section 11/1 - April 014 Support Vector Machines Lecturer: Arturo Fernandez Scribe: Arturo Fernandez 1 Support Vector Machines Revisited 1.1 Strictly) Separable

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Bayesian Linear Regression [DRAFT - In Progress]

Bayesian Linear Regression [DRAFT - In Progress] Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

1 Machine Learning Concepts (16 points)

1 Machine Learning Concepts (16 points) CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Least Squares Regression

Least Squares Regression E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Vector Calculus. Lecture Notes

Vector Calculus. Lecture Notes Vector Calculus Lecture Notes Adolfo J. Rumbos c Draft date November 23, 211 2 Contents 1 Motivation for the course 5 2 Euclidean Space 7 2.1 Definition of n Dimensional Euclidean Space........... 7 2.2

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Linear Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1

More information

y Xw 2 2 y Xw λ w 2 2

y Xw 2 2 y Xw λ w 2 2 CS 189 Introduction to Machine Learning Spring 2018 Note 4 1 MLE and MAP for Regression (Part I) So far, we ve explored two approaches of the regression framework, Ordinary Least Squares and Ridge Regression:

More information

Lecture 24 May 30, 2018

Lecture 24 May 30, 2018 Stats 3C: Theory of Statistics Spring 28 Lecture 24 May 3, 28 Prof. Emmanuel Candes Scribe: Martin J. Zhang, Jun Yan, Can Wang, and E. Candes Outline Agenda: High-dimensional Statistical Estimation. Lasso

More information

Support Vector Machines

Support Vector Machines Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)

More information

Lecture 2 Part 1 Optimization

Lecture 2 Part 1 Optimization Lecture 2 Part 1 Optimization (January 16, 2015) Mu Zhu University of Waterloo Need for Optimization E(y x), P(y x) want to go after them first, model some examples last week then, estimate didn t discuss

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Prediction Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict the

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak.

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak. Large Sample Theory Large Sample Theory is a name given to the search for approximations to the behaviour of statistical procedures which are derived by computing limits as the sample size, n, tends to

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Machine Learning Basics: Maximum Likelihood Estimation

Machine Learning Basics: Maximum Likelihood Estimation Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning

More information

Bias-Variance Tradeoff

Bias-Variance Tradeoff What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

More information