Linear regression COMS 4771
1. Old Faithful and prediction functions
Prediction problem: Old Faithful geyser (Yellowstone). Task: predict the time of the next eruption.
Statistical model for time between eruptions

Historical records of eruptions give the start time a_i and end time b_i of each eruption:
a_0, b_0, a_1, b_1, a_2, b_2, a_3, b_3, ...
with waiting times Y_1, Y_2, Y_3, ... in between.

IID model (recall the Galton board): Y_1, ..., Y_n, Y iid ~ N(μ, σ²).
Data: Y_i := a_i − b_{i−1}, i = 1, ..., n.
(Admittedly not a great model, since durations are non-negative. Better: take Y_i from an exponential distribution.)

Task: at a later time t (when an eruption ends), predict the time of the next eruption, t + Y.

On Old Faithful data: using 136 past observations, we form the estimate μ̂, and measure the mean squared loss of μ̂ on the next 136 observations. (Easier to interpret its square root, which has the same units as Y.)
Looking at the data

Naturalist Harry Woodward observed that the time until the next eruption seems to be related to the duration of the last eruption.

[Scatter plot: time until next eruption vs. duration of last eruption.]
Using side-information

At prediction time t, the duration of the last eruption is available as side-information.

IID model for supervised learning: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid random pairs (i.e., labeled examples). X takes values in a space 𝒳 (e.g., 𝒳 = R); Y takes values in R.

1. We observe (X_1, Y_1), ..., (X_n, Y_n), and then choose a prediction function (a.k.a. predictor) f̂ : 𝒳 → R. This is called learning or training.
2. At prediction time, observe X, and form the prediction f̂(X).
3. The outcome is Y, and the squared loss is (f̂(X) − Y)².

How should we choose f̂ based on data?
2. Optimal predictors and linear regression models
Distributions over labeled examples

𝒳: space of possible side-information (feature space).
𝒴: space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y) taking values in 𝒳 × 𝒴 can be thought of in two parts:
1. Marginal distribution P_X of X: P_X is a probability distribution on 𝒳.
2. Conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ 𝒳: P_{Y|X=x} is a probability distribution on 𝒴.

This lecture: 𝒴 = R (regression problems).
Optimal predictor

What function f : 𝒳 → R has the smallest (squared loss) risk R(f) := E[(f(X) − Y)²]?

Conditional on X = x, the minimizer of the conditional risk ŷ ↦ E[(ŷ − Y)² | X = x] is the conditional mean E[Y | X = x]. (Recall the Galton board!)

Therefore, the function f* : R → R where f*(x) = E[Y | X = x], x ∈ R, has the smallest risk.

f* is called the regression function or conditional mean function.
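The claim that the conditional mean minimizes squared-loss risk can be checked by simulation. The sketch below (the distribution, coefficients, and sample size are made up for illustration, not from the lecture) estimates R(f) by Monte Carlo for the regression function and for one competing predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate pairs (X, Y) with known conditional mean E[Y | X = x] = 2x + 1.
n = 200_000
X = rng.uniform(0.0, 1.0, size=n)
Y = 2.0 * X + 1.0 + rng.normal(0.0, 0.5, size=n)

def risk(f):
    """Monte Carlo estimate of R(f) = E[(f(X) - Y)^2]."""
    return np.mean((f(X) - Y) ** 2)

r_star = risk(lambda x: 2.0 * x + 1.0)   # the regression function f*
r_other = risk(lambda x: 2.5 * x + 0.8)  # any other predictor

# f* should attain (approximately) the smallest possible risk,
# which here is the noise variance 0.25.
```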
Linear regression models

When side-information is encoded as vectors of real numbers x = (x_1, ..., x_d) (called features or variables), it is common to use a linear regression model, such as the following:

Y | X = x ~ N(x^T β, σ²), x ∈ R^d.

- Parameters: β = (β_1, ..., β_d) ∈ R^d, σ² > 0.
- X = (X_1, ..., X_d), a random vector (i.e., a vector of random variables).
- Conditional distribution of Y given X is normal.
- Marginal distribution of X is not specified.

In this model, the regression function f* is a linear function f_β : R^d → R,

f_β(x) = x^T β = Σ_{i=1}^d x_i β_i, x ∈ R^d.

(We'll often refer to f_β just by β.)

[Plot: data points and the line y = f*(x).]
Enhancing linear regression models

Linear functions might sound rather restricted, but they can actually be quite powerful if you are creative about side-information.

Examples:
1. Non-linear transformation of existing variables: for x ∈ R,
   φ(x) = ln(1 + x).
2. Logical formula of binary variables: for x = (x_1, ..., x_d) ∈ {0, 1}^d,
   φ(x) = (x_1 ∧ x_5 ∧ x_10) ∨ (¬x_2 ∧ x_7).
3. Trigonometric expansion: for x ∈ R,
   φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), ...).
4. Polynomial expansion: for x = (x_1, ..., x_d) ∈ R^d,
   φ(x) = (1, x_1, ..., x_d, x_1², ..., x_d², x_1 x_2, ..., x_1 x_d, ..., x_{d−1} x_d).
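The four feature maps above can be written as short functions. The sketch below implements one version of each; the truncation level of the trigonometric expansion and the assumption d ≥ 10 in the logical formula are illustrative choices, not from the lecture.

```python
import numpy as np

def phi_log(x):
    # 1. non-linear transformation of a scalar feature: ln(1 + x)
    return np.log1p(x)

def phi_logic(x):
    # 2. logical formula of binary features (indices as in the example; assumes d >= 10)
    x1, x2, x5, x7, x10 = x[0], x[1], x[4], x[6], x[9]
    return float((x1 and x5 and x10) or ((not x2) and x7))

def phi_trig(x, k=2):
    # 3. trigonometric expansion, truncated at frequency k
    feats = [1.0]
    for j in range(1, k + 1):
        feats += [np.sin(j * x), np.cos(j * x)]
    return np.array(feats)

def phi_quad(x):
    # 4. quadratic polynomial expansion: 1, linear, squared, and cross terms
    x = np.asarray(x, dtype=float)
    d = len(x)
    cross = [x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([[1.0], x, x ** 2, cross])
```

For d = 3, phi_quad produces 1 + 3 + 3 + 3 = 10 features; in general the quadratic expansion has dimension 1 + 2d + d(d−1)/2.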
Example: Taking advantage of linearity

Suppose you are trying to predict some health outcome.

A physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:
φ(x) = (x_temp − 98.6)².

What if you didn't know about this magic constant 98.6? Instead, use
φ(x) = (1, x_temp, x_temp²).

We can learn coefficients β such that β^T φ(x) = (x_temp − 98.6)², or any other quadratic polynomial in x_temp (which may be better!).
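A minimal sketch of this idea on synthetic data (the data-generating process, temperature range, and noise level are invented for illustration): fitting least squares on φ(x) = (1, x_temp, x_temp²) recovers coefficients close to the expansion (x − 98.6)² = x² − 197.2x + 9721.96, without the learner ever being told 98.6.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcome generated from the physician's feature
# (x_temp - 98.6)^2; the constant 98.6 is unknown to the learner.
x_temp = rng.uniform(96.0, 104.0, size=500)
y = (x_temp - 98.6) ** 2 + rng.normal(0.0, 0.1, size=500)

# Expansion that does not hard-code 98.6:
Phi = np.stack([np.ones_like(x_temp), x_temp, x_temp ** 2], axis=1)

beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
# beta should be close to (9721.96, -197.2, 1.0),
# i.e. the coefficients of (x - 98.6)^2 in the basis (1, x, x^2).
```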
Quadratic expansion

A quadratic function f : R → R,
f(x) = ax² + bx + c, x ∈ R, for a, b, c ∈ R,
can be written as a linear function of φ(x), where φ(x) := (1, x, x²), since
f(x) = β^T φ(x) where β = (c, b, a).

For a multivariate quadratic function f : R^d → R, use
φ(x) := (1, x_1, ..., x_d, x_1², ..., x_d², x_1 x_2, ..., x_1 x_d, ..., x_{d−1} x_d)
(linear terms, squared terms, cross terms).
Affine expansion and Old Faithful

Woodward needed an affine expansion for the Old Faithful data: φ(x) := (1, x).

[Plot: time until next eruption vs. duration of last eruption, with fitted affine function.]

An affine function f_{a,b} : R → R, f_{a,b}(x) = a + bx for a, b ∈ R, is a linear function f_β of φ(x) for β = (a, b). (This easily generalizes to multivariate affine functions.)
Why linear regression models?

1. Linear regression models benefit from a good choice of features.
2. The structure of linear functions is very well-understood.
3. There are many well-understood and efficient algorithms for learning linear functions from data, even when n and d are large.
3. From data to prediction functions
Maximum likelihood estimation for linear regression

Linear regression model with Gaussian noise: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, with
Y | X = x ~ N(x^T β, σ²), x ∈ R^d.
(It is traditional to study linear regression in the context of this model.)

Log-likelihood of (β, σ²), given data (X_i, Y_i) = (x_i, y_i) for i = 1, ..., n:

Σ_{i=1}^n [ −(x_i^T β − y_i)² / (2σ²) + ln(1/√(2πσ²)) ] + terms not involving (β, σ²).

The β that maximizes the log-likelihood is also the β that minimizes

(1/n) Σ_{i=1}^n (x_i^T β − y_i)².

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model.
Empirical distribution and empirical risk

The empirical distribution P_n on (x_1, y_1), ..., (x_n, y_n) has probability mass function p_n given by

p_n((x, y)) := (1/n) Σ_{i=1}^n 1{(x, y) = (x_i, y_i)}, (x, y) ∈ R^d × R.

Plug-in principle: the goal is to find a function f that minimizes the (squared loss) risk R(f) = E[(f(X) − Y)²]. But we don't know the distribution P of (X, Y). Replace P with P_n to get the empirical (squared loss) risk R̂(f):

R̂(f) := (1/n) Σ_{i=1}^n (f(x_i) − y_i)².
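The empirical risk is just the average squared loss over the sample, i.e., the risk under P_n. A minimal helper (the function name is mine):

```python
import numpy as np

def empirical_risk(f, xs, ys):
    """R-hat(f): average squared loss of f over the sample (x_i, y_i),
    i.e. the risk under the empirical distribution P_n."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    return np.mean((f(xs) - ys) ** 2)

xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 1.0, 4.0])
# For the identity predictor f(x) = x, the losses are (0, 0, 4),
# so the empirical risk is 4/3.
```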
Empirical risk minimization

Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss, ERM returns
β̂ ∈ arg min_{β ∈ R^d} R̂(β),
which coincides with the MLE under the basic linear regression model.

In general:
- MLE makes sense in the context of the statistical model for which it is derived.
- ERM makes sense in the context of the general iid model for supervised learning.
Empirical risk minimization in pictures

Red dots: data points. Affine hyperplane: linear function β (via the affine expansion (x_1, x_2) ↦ (1, x_1, x_2)). ERM: minimize the sum of squared vertical lengths from the hyperplane to the points.
Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [x_1^T; ...; x_n^T],   b := (1/√n) (y_1, ..., y_n),

so that we can write the empirical risk as R̂(β) = ‖Aβ − b‖₂².

A necessary condition for β to be a minimizer of R̂:
∇R̂(β) = 0, i.e., β is a critical point of R̂.
This translates to
(A^T A) β = A^T b,
a system of linear equations called the normal equations.

It can be proved that every critical point of R̂ is a minimizer of R̂.
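The normal equations can be solved directly. A small sketch on synthetic data (coefficients and noise level invented for illustration); the 1/√n scaling of A and b is chosen so that ‖Aβ − b‖₂² equals the empirical risk, and the gradient 2A^T(Aβ̂ − b) should vanish at the solution.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 50, 3
xs = rng.normal(size=(n, d))
ys = xs @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=n)

# A and b scaled by 1/sqrt(n), so that ||A beta - b||^2 = R-hat(beta).
A = xs / np.sqrt(n)
b = ys / np.sqrt(n)

# Normal equations: (A^T A) beta = A^T b.
beta_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Gradient of R-hat should vanish at beta_hat (critical-point condition).
grad = 2.0 * A.T @ (A @ beta_hat - b)
```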
Aside: Convexity

Let f : R^d → R be a differentiable function. Suppose we find x ∈ R^d such that ∇f(x) = 0. Is x a minimizer of f?

Yes, if f is a convex function:
f((1 − t)x + t x′) ≤ (1 − t) f(x) + t f(x′), for any 0 ≤ t ≤ 1 and any x, x′ ∈ R^d.
Convexity of empirical risk

Checking convexity of g(x) = ‖Ax − b‖₂²:

g((1 − t)x + t x′)
  = ‖(1 − t)(Ax − b) + t(Ax′ − b)‖₂²
  = (1 − t)² ‖Ax − b‖₂² + t² ‖Ax′ − b‖₂² + 2(1 − t)t (Ax − b)^T (Ax′ − b)
  = (1 − t) ‖Ax − b‖₂² + t ‖Ax′ − b‖₂² − (1 − t)t [ ‖Ax − b‖₂² + ‖Ax′ − b‖₂² ] + 2(1 − t)t (Ax − b)^T (Ax′ − b)
  ≤ (1 − t) ‖Ax − b‖₂² + t ‖Ax′ − b‖₂²,

where the last step uses the Cauchy–Schwarz inequality and the arithmetic mean/geometric mean (AM/GM) inequality.
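The convexity inequality can be sanity-checked numerically for a random A, b and a random pair of points (a spot check along one chord, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))
b = rng.normal(size=6)

def g(x):
    # g(x) = ||Ax - b||_2^2
    return np.sum((A @ x - b) ** 2)

x, xp = rng.normal(size=3), rng.normal(size=3)
ts = np.linspace(0.0, 1.0, 101)

# convexity: g((1-t)x + t x') <= (1-t) g(x) + t g(x') for all t in [0, 1]
lhs = np.array([g((1 - t) * x + t * xp) for t in ts])
rhs = (1 - ts) * g(x) + ts * g(xp)
```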
Convexity of empirical risk, another way (preview of convex analysis)

Recall R̂(β) = (1/n) Σ_{i=1}^n (x_i^T β − y_i)².

- The scalar function g(z) = cz² is convex for any c ≥ 0.
- The composition (g ∘ a) : R^d → R of any convex function g : R → R and any affine function a : R^d → R is convex.
- Therefore, each function β ↦ (1/n)(x_i^T β − y_i)² is convex.
- A sum of convex functions is convex.
- Therefore R̂ is convex.

Convexity is a useful mathematical property to understand! (We'll study more convex analysis in a few weeks.)
Algorithm for ERM

Algorithm for ERM with linear functions and squared loss:
  input: data (x_1, y_1), ..., (x_n, y_n) from R^d × R.
  output: linear function β̂ ∈ R^d.
  1: Find a solution β̂ to the normal equations defined by the data (using, e.g., Gaussian elimination).
  2: return β̂.

In this context, this is also called ordinary least squares.

Running time (dominated by Gaussian elimination): O(nd²). Note: there are many approximate solvers that run in nearly linear time!
Geometric interpretation of ERM

Let a_j ∈ R^n be the j-th column of the matrix A ∈ R^{n×d}, so A = [a_1 | ... | a_d].

Minimizing ‖Aβ − b‖₂² is the same as finding the vector b̂ ∈ range(A) closest to b. The solution b̂ is the orthogonal projection of b onto range(A) = {Aβ : β ∈ R^d}.

- b̂ is uniquely determined.
- If rank(A) < d, then there is more than one way to write b̂ as a linear combination of a_1, ..., a_d, so the ERM solution is not unique.
- To get β̂ from b̂: solve the system of linear equations Aβ = b̂.

[Figure: b, its orthogonal projection b̂, and the plane spanned by a_1 and a_2.]
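A numerical illustration of the projection view (synthetic A and b; with n > d and Gaussian entries, A has full column rank with probability 1): the residual b − b̂ is orthogonal to every column of A, and b̂ is at least as close to b as any other point of range(A).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))   # full column rank (almost surely)
b = rng.normal(size=10)

# ERM solution via the normal equations, then project b onto range(A).
beta_hat = np.linalg.solve(A.T @ A, A.T @ b)
b_hat = A @ beta_hat           # orthogonal projection of b onto range(A)

# The residual is orthogonal to range(A), i.e. to every column of A.
residual = b - b_hat

# Any other point of range(A), e.g. A @ (1, 1, 1), is no closer to b.
other = A @ np.ones(3)
```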
Statistical interpretation of ERM

Let (X, Y) ~ P, where P is some distribution on R^d × R. Which β have the smallest risk R(β) = E[(X^T β − Y)²]?

A necessary condition for β to be a minimizer of R:
∇R(β) = 0, i.e., β is a critical point of R.
This translates to
E[XX^T] β = E[Y X],
a system of linear equations called the population normal equations. It can be proved that every critical point of R is a minimizer of R.

Looks familiar? If (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, then E[A^T A] = E[XX^T] and E[A^T b] = E[Y X], so ERM can be regarded as a plug-in estimator for a minimizer of R.
4. Risk, empirical risk, and estimating risk
Risk of ERM

IID model: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid, taking values in R^d × R. Let β* be a minimizer of R over all β ∈ R^d, i.e., β* satisfies the population normal equations E[XX^T] β* = E[Y X].

If the ERM solution β̂ is not unique (e.g., if n < d), then R(β̂) can be arbitrarily worse than R(β*). What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X, asymptotically,
R(β̂) − R(β*) = O( tr(cov(εW)) / n ),
where W := E[XX^T]^{−1/2} X and ε := Y − X^T β*.
Risk of ERM analysis (rough sketch)

Let ε_i := Y_i − X_i^T β* for each i = 1, ..., n, so

E[ε_i X_i] = E[Y_i X_i] − E[X_i X_i^T] β* = 0

and

√n (β̂ − β*) = ( (1/n) Σ_{i=1}^n X_i X_i^T )^{−1} · ( (1/√n) Σ_{i=1}^n ε_i X_i ).

1. By the LLN: (1/n) Σ_{i=1}^n X_i X_i^T →p E[XX^T].
2. By the CLT: (1/√n) Σ_{i=1}^n ε_i X_i →d cov(εX)^{1/2} Z, where Z ~ N(0, I).

Therefore, the asymptotic distribution of √n (β̂ − β*) is given by

√n (β̂ − β*) →d E[XX^T]^{−1} cov(εX)^{1/2} Z.

A few more steps give

n ( E[(X^T β̂ − Y)^2] − E[(X^T β* − Y)^2] ) →d ‖ E[XX^T]^{−1/2} cov(εX)^{1/2} Z ‖_2^2.

The random variable on the RHS is concentrated around its mean, tr(cov(εW)).
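The asymptotic-normality step can also be checked numerically. A sketch under illustrative assumptions (d = 1, X ~ Uniform(−1, 1), σ = 1; none of these specifics are from the slides): the variance of √n (β̂ − β*) over many simulated data sets should approach E[XX^T]^{−1} cov(εX) E[XX^T]^{−1} = σ^2 / E[X^2] = 3.

```python
import random

def scaled_error(rng, n=200, beta_star=1.0, sigma=1.0):
    """One draw of sqrt(n) * (beta_hat - beta_star) for a 1-d linear model."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [beta_star * x + rng.gauss(0.0, sigma) for x in xs]
    beta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return (n ** 0.5) * (beta_hat - beta_star)

rng = random.Random(0)
draws = [scaled_error(rng) for _ in range(2000)]
# Empirical variance of the scaled error; the asymptotic theory predicts
# sigma^2 / E[X^2] = 1 / (1/3) = 3 for X ~ Uniform(-1, 1).
var = sum(d * d for d in draws) / len(draws)
mean = sum(draws) / len(draws)
```

The empirical variance lands close to 3 and the mean close to 0, as the limit distribution predicts.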
Risk of ERM: postscript

The analysis does not assume that the linear regression model is "correct"; the data distribution need not come from the normal linear regression model. The only assumptions are those needed for the LLN and CLT to hold.

However, if the normal linear regression model holds, i.e., Y | X = x ~ N(x^T β, σ^2), then the bound from the theorem becomes

R(β̂) − R(β*) = O( σ^2 d / n ),

which is familiar to those who have taken introductory statistics.

With more work, one can also prove a non-asymptotic risk bound of similar form.

In homework/reading, we look at a simpler setting for studying ERM for linear regression, called "fixed design".
Risk vs empirical risk

Let β̂ be the ERM solution.

1. Empirical risk of ERM: R̂(β̂)
2. True risk of ERM: R(β̂)

Theorem. E[R̂(β̂)] ≤ E[R(β̂)]. (Empirical risk can sometimes be larger than true risk, but not on average.)

Overfitting: empirical risk is small, but true risk is much higher.
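The inequality E[R̂(β̂)] ≤ E[R(β̂)] is easy to see empirically. A sketch (an illustrative 1-d model with made-up parameters, not from the slides): over many simulated training sets, the average training risk of the ERM fit stays below its average true risk, which here has the closed form R(b) = E[X^2] (b − β*)^2 + σ^2.

```python
import random

def one_trial(rng, n=5, beta_star=1.0, sigma=1.0):
    """Fit 1-d least squares (no intercept) on a fresh sample; return
    (empirical risk of the fit, true risk of the fit). For X ~ Uniform(-1,1)
    and y = beta_star*x + N(0, sigma^2), the true risk of any b is
    E[X^2]*(b - beta_star)^2 + sigma^2, with E[X^2] = 1/3."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [beta_star * x + rng.gauss(0.0, sigma) for x in xs]
    b = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    emp = sum((x * b - y) ** 2 for x, y in zip(xs, ys)) / n
    true = (1.0 / 3.0) * (b - beta_star) ** 2 + sigma ** 2
    return emp, true

rng = random.Random(0)
trials = [one_trial(rng) for _ in range(500)]
avg_emp = sum(t[0] for t in trials) / len(trials)
avg_true = sum(t[1] for t in trials) / len(trials)
```

With n = 5 the gap is pronounced: the average empirical risk sits near σ^2 (n − 1)/n, below the average true risk, which exceeds σ^2.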
Overfitting example

(X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid; X is a continuous random variable in R. Suppose we use the degree-k polynomial expansion

φ(x) = (1, x^1, ..., x^k), x ∈ R,

so the dimension is d = k + 1.

Fact: Any function on k + 1 points can be interpolated by a polynomial of degree at most k.

Conclusion: If n ≤ k + 1 = d, the ERM solution β̂ with this feature expansion has R̂(β̂) = 0 always, regardless of its true risk (which can be far from 0).
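The interpolation fact behind this example is constructive: the Lagrange formula below (a standard construction, not specific to the course; the toy points are made up) fits any labels on k + 1 distinct points exactly, so the empirical risk of the interpolant is 0 no matter what the labels are.

```python
def lagrange_interpolate(points, x):
    """Evaluate at x the unique polynomial of degree <= k passing through
    the given k+1 points (distinct x-coordinates), via Lagrange's formula."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Arbitrary labels on k+1 = 4 distinct points are fit exactly,
# so the empirical (training) risk of the interpolant is 0.
pts = [(0.0, 1.0), (1.0, -2.0), (2.0, 0.5), (3.0, 7.0)]
train_risk = sum((lagrange_interpolate(pts, x) - y) ** 2 for x, y in pts) / len(pts)
```

The training risk is 0 (up to floating-point error) regardless of how the labels were chosen, which is exactly the overfitting phenomenon on the slide.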
Estimating risk

IID model: (X_1, Y_1), ..., (X_n, Y_n), (X̃_1, Ỹ_1), ..., (X̃_m, Ỹ_m) iid ~ P.

- Training data (X_1, Y_1), ..., (X_n, Y_n) is used to learn f̂.
- Test data (X̃_1, Ỹ_1), ..., (X̃_m, Ỹ_m) is used to estimate risk, via the test risk

R̂_test(f) := (1/m) Σ_{i=1}^m (f(X̃_i) − Ỹ_i)^2.

Training data is independent of test data, so f̂ is independent of the test data.

Let L_i := (f̂(X̃_i) − Ỹ_i)^2 for each i = 1, ..., m, so

E[ R̂_test(f̂) | f̂ ] = (1/m) Σ_{i=1}^m E[ L_i | f̂ ] = R(f̂).

Moreover, L_1, ..., L_m are conditionally iid given f̂, and hence by the Law of Large Numbers,

R̂_test(f̂) →p R(f̂) as m → ∞.

By the CLT, the rate of convergence is m^{−1/2}.
Rates for risk minimization vs. rates for risk estimation

One may think that ERM works because, somehow, training risk is a good plug-in estimate of true risk.

Sometimes this is partially true; we'll revisit this when we discuss generalization theory. Roughly speaking, under some assumptions, one can expect that

| R̂(β) − R(β) | ≤ O( √(d/n) )   for all β ∈ R^d.

However...

By the CLT, the following holds for a fixed β: R̂(β) →p R(β) at an n^{−1/2} rate. (Here, we ignore the dependence on d.)

Yet, for the ERM β̂: R(β̂) →p R(β*) at an n^{−1} rate. (Also ignoring the dependence on d.)

Implication: Selecting a good predictor can be easier than estimating how good predictors are!
Old Faithful example

Linear regression model + affine expansion on duration of last eruption.

Learn β̂ = ( , ) from 136 past observations.

Mean squared loss of β̂ on next 136 observations is . (Recall: mean squared loss of μ̂ = was .)

[Figure: time until next eruption vs. duration of last eruption, showing the linear model and the constant prediction.]

(Unfortunately, 35.9 > mean duration 3.5.)
5. Regularization
Inductive bias

Suppose the ERM solution is not unique. What should we do?

One possible answer: pick the β of shortest length.

Fact: The shortest solution β̂ to (A^T A) β = A^T b is always unique. Obtain β̂ via β̂ = A† b, where A† is the (Moore–Penrose) pseudoinverse of A.

Why should this be a good idea?

- The data does not give a reason to choose a shorter β over a longer β.
- The preference for shorter β is an inductive bias: it will work well for some problems (e.g., when the true β is short), and not for others.
- All learning algorithms encode some kind of inductive bias.
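For n = 1 < d, the pseudoinverse solution has a closed form that makes the "shortest solution" idea concrete: the minimum-norm β solving the single equation a · β = b is β = a b / ‖a‖^2. A sketch with toy numbers (not from the slides):

```python
import math

def min_norm_solution(a, b):
    """Shortest beta solving the underdetermined system a . beta = b
    (one equation, d unknowns): the pseudoinverse solution a * b / ||a||^2."""
    nrm2 = sum(ai * ai for ai in a)
    return [ai * b / nrm2 for ai in a]

a, b = [1.0, 2.0, 2.0], 9.0
beta = min_norm_solution(a, b)
# It fits the data exactly...
fit = sum(ai * bi for ai, bi in zip(a, beta))
# ...and is shorter than another exact solution, e.g. (9, 0, 0), of length 9.
norm_beta = math.sqrt(sum(bi * bi for bi in beta))
```

Here beta = (1, 2, 2) satisfies a · beta = 9 with length 3, while the alternative solution (9, 0, 0) has length 9: the data alone cannot distinguish them, and the pseudoinverse encodes the preference for the shorter one.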
Example

ERM with a scaled trigonometric feature expansion:

φ(x) = ( 1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), ... ).

[Figures: training data; training data with some arbitrary ERM fit; training data with the least ℓ2-norm ERM fit.]

It is not a given that the least-norm ERM is better than the other ERM!
Regularized ERM

Combine the two concerns: for a given λ ≥ 0, find the minimizer of

R̂(β) + λ ‖β‖_2^2

over β ∈ R^d.

Fact: If λ > 0, then the solution is always unique (even if n < d)!

This is called ridge regression. (λ = 0 is ERM / Ordinary Least Squares.)

The parameter λ controls how much attention is paid to the regularizer ‖β‖_2^2 relative to the data-fitting term R̂(β). Choose λ using cross-validation.
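In one dimension with no intercept, the ridge objective (1/n) Σ (x_i b − y_i)^2 + λ b^2 has a closed-form minimizer that makes both facts visible: the solution is unique for any λ > 0, and it shrinks toward 0 as λ grows. A sketch with made-up data (the closed form below is the 1-d special case, not the slides' general formula):

```python
def ridge_1d(xs, ys, lam):
    """Ridge regression for one feature, no intercept: the unique minimizer
    of (1/n) sum_i (x_i*b - y_i)^2 + lam*b^2 is
    b = sum(x*y) / (sum(x^2) + n*lam), found by setting the derivative to 0."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)

xs, ys = [1.0, 2.0], [1.0, 2.0]
b_ols = ridge_1d(xs, ys, 0.0)   # lambda = 0: ordinary least squares, b = 1
b_reg = ridge_1d(xs, ys, 1.0)   # lambda > 0: shrinks b toward 0
```

With λ = 0 the fit is exact (b = 1); with λ = 1 the coefficient is pulled strictly between 0 and the OLS value.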
Another interpretation of ridge regression

Define the (n + d) × d matrix Ã and the (n + d) × 1 column vector b̃ by stacking the rows x_1^T, ..., x_n^T on top of √(nλ) I_d, and the labels y_1, ..., y_n on top of d zeros, both scaled by 1/√n:

Ã := (1/√n) [ x_1^T ; ... ; x_n^T ; √(nλ) I_d ],    b̃ := (1/√n) [ y_1 ; ... ; y_n ; 0 ].

Then

‖ Ã β − b̃ ‖_2^2 = R̂(β) + λ ‖β‖_2^2.

Interpretation:

- d "fake" data points ensure that the augmented data matrix Ã has rank d.
- The squared length of each fake feature vector is nλ. All corresponding labels are 0.
- The prediction of β on the i-th fake feature vector is √(nλ) β_i.
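The fake-data interpretation is easy to verify in the d = 1 case: appending a single fake point with feature √(nλ) and label 0, then running plain least squares on the augmented data, reproduces the ridge solution exactly. A sketch with made-up numbers:

```python
import math

def ols_1d(xs, ys):
    """Ordinary least squares for one feature, no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_via_fake_data(xs, ys, lam):
    """Ridge as plain ERM on augmented data: one fake point per coordinate
    (here d = 1) with feature sqrt(n*lam) and label 0."""
    n = len(xs)
    xs_aug = xs + [math.sqrt(n * lam)]
    ys_aug = ys + [0.0]
    return ols_1d(xs_aug, ys_aug)

xs, ys, lam = [1.0, 2.0, 3.0], [1.5, 1.9, 3.2], 0.5
# Direct ridge closed form in 1-d: sum(x*y) / (sum(x^2) + n*lam).
b_direct = sum(x * y for x, y in zip(xs, ys)) / (
    sum(x * x for x in xs) + len(xs) * lam)
b_fake = ridge_via_fake_data(xs, ys, lam)
```

The two computations agree to floating-point precision: the fake point contributes exactly nλ to the Gram term and nothing to the cross term.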
Regularization with a different norm

Lasso: for a given λ ≥ 0, find the minimizer of

R̂(β) + λ ‖β‖_1

over β ∈ R^d. Here, ‖v‖_1 = Σ_{i=1}^d |v_i| is the ℓ1-norm.

- Prefers shorter β, but using a different notion of length than ridge.
- Tends to produce β that are sparse (i.e., have few non-zero coordinates), or at least well-approximated by sparse vectors.

Fact: Vectors with small ℓ1-norm are well-approximated by sparse vectors. If β̃ contains just the 1/ε^2 largest coefficients (by magnitude) of β, then ‖β − β̃‖_2 ≤ ε ‖β‖_1.
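The closing fact can be checked directly: keeping the k largest-magnitude coordinates of β leaves a tail of ℓ2-norm at most ‖β‖_1 / √k, so taking k = ⌈1/ε^2⌉ gives ‖β − β̃‖_2 ≤ ε ‖β‖_1. A sketch with a toy vector (not from the slides):

```python
import math

def best_k_sparse(beta, k):
    """Keep the k largest-magnitude coordinates of beta, zero out the rest."""
    idx = sorted(range(len(beta)), key=lambda i: -abs(beta[i]))[:k]
    keep = set(idx)
    return [b if i in keep else 0.0 for i, b in enumerate(beta)]

def check_fact(beta, eps):
    """Check ||beta - beta_k||_2 <= eps * ||beta||_1 for k = ceil(1/eps^2)."""
    k = math.ceil(1.0 / eps ** 2)
    bk = best_k_sparse(beta, k)
    err = math.sqrt(sum((b - c) ** 2 for b, c in zip(beta, bk)))
    return err <= eps * sum(abs(b) for b in beta)

beta = [5.0, -3.0, 1.0, 0.5, 0.5, -0.25, 0.1, 0.05]
```

For any ε, the best k-sparse truncation of this vector satisfies the stated bound; the guarantee is what makes small-ℓ1 solutions "well-approximated by sparse vectors."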
More informationMLCC 2017 Regularization Networks I: Linear Models
MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational
More informationSupport Vector Machine
Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary
More informationCS 195-5: Machine Learning Problem Set 1
CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of
More informationLeast Squares Regression
CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the
More informationSupport Vector Machines and Bayes Regression
Statistical Techniques in Robotics (16-831, F11) Lecture #14 (Monday ctober 31th) Support Vector Machines and Bayes Regression Lecturer: Drew Bagnell Scribe: Carl Doersch 1 1 Linear SVMs We begin by considering
More informationWeek 3: Linear Regression
Week 3: Linear Regression Instructor: Sergey Levine Recap In the previous lecture we saw how linear regression can solve the following problem: given a dataset D = {(x, y ),..., (x N, y N )}, learn to
More informationLinear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d
Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y
More informationMultivariate Regression
Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the
More informationLinear Regression Linear Regression with Shrinkage
Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle
More informationISyE 691 Data mining and analytics
ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)
More informationData Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.
TheThalesians Itiseasyforphilosopherstoberichiftheychoose Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods Ivan Zhdankin
More informationClassification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).
Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,
More informationSupport'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan
Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination
More informationFitting Linear Statistical Models to Data by Least Squares: Introduction
Fitting Linear Statistical Models to Data by Least Squares: Introduction Radu Balan, Brian R. Hunt and C. David Levermore University of Maryland, College Park University of Maryland, College Park, MD Math
More informationCSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1
CSCI5654 (Linear Programming, Fall 2013) Lectures 10-12 Lectures 10,11 Slide# 1 Today s Lecture 1. Introduction to norms: L 1,L 2,L. 2. Casting absolute value and max operators. 3. Norm minimization problems.
More informationWarm up: risk prediction with logistic regression
Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T
More informationMS&E 226: Small Data
MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate
More informationLecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices
Lecture 3: Simple Linear Regression in Matrix Format To move beyond simple regression we need to use matrix algebra We ll start by re-expressing simple linear regression in matrix form Linear algebra is
More informationSupport Vector Machines
EE 17/7AT: Optimization Models in Engineering Section 11/1 - April 014 Support Vector Machines Lecturer: Arturo Fernandez Scribe: Arturo Fernandez 1 Support Vector Machines Revisited 1.1 Strictly) Separable
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationBayesian Linear Regression [DRAFT - In Progress]
Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory
More informationCS145: INTRODUCTION TO DATA MINING
CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More information1 Machine Learning Concepts (16 points)
CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions
More informationChap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University
Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationVector Calculus. Lecture Notes
Vector Calculus Lecture Notes Adolfo J. Rumbos c Draft date November 23, 211 2 Contents 1 Motivation for the course 5 2 Euclidean Space 7 2.1 Definition of n Dimensional Euclidean Space........... 7 2.2
More informationIntroduction to Machine Learning
Introduction to Machine Learning Linear Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1
More informationy Xw 2 2 y Xw λ w 2 2
CS 189 Introduction to Machine Learning Spring 2018 Note 4 1 MLE and MAP for Regression (Part I) So far, we ve explored two approaches of the regression framework, Ordinary Least Squares and Ridge Regression:
More informationLecture 24 May 30, 2018
Stats 3C: Theory of Statistics Spring 28 Lecture 24 May 3, 28 Prof. Emmanuel Candes Scribe: Martin J. Zhang, Jun Yan, Can Wang, and E. Candes Outline Agenda: High-dimensional Statistical Estimation. Lasso
More informationSupport Vector Machines
Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)
More informationLecture 2 Part 1 Optimization
Lecture 2 Part 1 Optimization (January 16, 2015) Mu Zhu University of Waterloo Need for Optimization E(y x), P(y x) want to go after them first, model some examples last week then, estimate didn t discuss
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Prediction Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict the
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationP n. This is called the law of large numbers but it comes in two forms: Strong and Weak.
Large Sample Theory Large Sample Theory is a name given to the search for approximations to the behaviour of statistical procedures which are derived by computing limits as the sample size, n, tends to
More informationLecture 4: Types of errors. Bayesian regression models. Logistic regression
Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture
More informationMachine Learning Basics: Maximum Likelihood Estimation
Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More information