Recap from previous lecture

Learning is using past experience to improve future performance. Different types of learning: supervised, unsupervised, reinforcement, active, online, ... For a machine, experience is encoded in data.
Supervised learning: example

Goal: teach a machine to distinguish handwritten 4s from handwritten 9s.

Desired outcome: a rule (function) for deciding whether a given $10 \times 10$ pattern of black-and-white pixels is a 4 or a 9:

  x \in \{0, 1\}^{100} \mapsto y \in \{-1, +1\}

How to get there? Assuming that the training data are representative of a stable probabilistic relationship between the input (x) and the output (y), do what you would do if you actually knew this relationship and could find an optimal predictor.

Figure taken from G. Shakhnarovich.
Statistical decision theory: general framework

Given:
  X : input (or feature)
  Y : output (or label)
  P_{XY} : joint probability distribution

Goal: find a function f that predicts Y given the value of X.

Evaluating predictors: a loss function

  \ell(y, f(x)) = loss (or penalty) for predicting the true Y with \hat{Y} = f(X)

Expected prediction error of a candidate predictor f:

  \mathcal{E}(f) = \mathbb{E}[\ell(Y, f(X))] = \int \ell(y, f(x)) \, P_{XY}(dx, dy)

Optimal predictor:

  f^* = \arg\min_f \mathcal{E}(f), \qquad \mathcal{E}^* = \mathcal{E}(f^*)
Statistical decision theory: classification

Given:
  X \in \mathbb{R}^p : feature
  Y \in \{0, 1\} : label
  P_{XY} : joint probability distribution

Goal: find a function f : \mathbb{R}^p \to \{0, 1\} that accurately predicts the label Y given the feature X.

Evaluating predictors: a loss function \ell(y, f(x)). For example, use the 0-1 loss:

  \ell(y, f(x)) = 1_{\{y \neq f(x)\}}

Expected prediction error of a candidate predictor f:

  \mathcal{E}(f) = \mathbb{P}[Y \neq f(X)] = \int 1_{\{y \neq f(x)\}} \, P_{XY}(dx, dy)
Minimizing expected classification error

How to pick f to minimize \mathcal{E}(f)? Conditioning:

  \mathcal{E}(f) = \int 1_{\{y \neq f(x)\}} \, P_{XY}(dx, dy)
    = \int \Big( \int 1_{\{y \neq f(x)\}} \, P_{Y|X}(dy \mid x) \Big) P_X(dx)
    = 1 - \int \Big( \int 1_{\{y = f(x)\}} \, P_{Y|X}(dy \mid x) \Big) P_X(dx)
    = 1 - \int \Big( \mathbb{P}[Y = 0 \mid X = x] \, 1_{\{f(x) = 0\}} + \mathbb{P}[Y = 1 \mid X = x] \, 1_{\{f(x) = 1\}} \Big) P_X(dx)

Maximize the inner integrand for each fixed x by setting

  f^*(x) = \arg\max_{y \in \{0,1\}} \mathbb{P}[Y = y \mid X = x]

This is known as the Bayes classifier: classify to the most probable label given the observation X = x. \mathcal{E}(f^*) is known as the Bayes rate.
The Bayes classifier

Classify to the most probable label:

  f^*(x) = \arg\max_{y \in \{0,1\}} \mathbb{P}[Y = y \mid X = x]

Define the regression function:

  \eta(x) = \mathbb{P}[Y = 1 \mid X = x].

This leads to the following:

Bayes classifier

  f^*(x) = 1 if \eta(x) \ge 1/2; \quad f^*(x) = 0 if \eta(x) < 1/2

  \mathcal{E}^* = \frac{1}{2} - \frac{1}{2} \int |2\eta(x) - 1| \, P_X(dx)

[Figure: plot of \eta(x) against x, with the 1/2 threshold marked.]
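As a small sketch of the Bayes classifier, assume a hypothetical logistic-shaped regression function $\eta$ and $X$ uniform on $[0, 1]$ (both choices are for illustration only, not from the lecture). Since $\min(\eta(x), 1 - \eta(x)) = \tfrac{1}{2} - \tfrac{1}{2}|2\eta(x) - 1|$, the Bayes rate equals $\mathbb{E}[\min(\eta(X), 1 - \eta(X))]$, which can be estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # Hypothetical regression function eta(x) = P[Y = 1 | X = x]
    return 1.0 / (1.0 + np.exp(-4.0 * (x - 0.5)))

def bayes_classifier(x):
    # Classify to the most probable label: 1 iff eta(x) >= 1/2
    return (eta(x) >= 0.5).astype(int)

# Monte Carlo estimate of the Bayes rate E[min(eta(X), 1 - eta(X))],
# assuming X ~ Uniform(0, 1)
x = rng.uniform(0.0, 1.0, size=200_000)
bayes_rate = np.mean(np.minimum(eta(x), 1.0 - eta(x)))
```

No classifier, however trained, can achieve expected error below this `bayes_rate` for the assumed $(\eta, P_X)$.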
Statistical decision theory: regression

Given:
  X \in \mathbb{R}^p : input variable
  Y \in \mathbb{R} : output variable
  P_{XY} : joint probability distribution

Goal: find a function f : \mathbb{R}^p \to \mathbb{R} that predicts Y given the value of X.

Evaluating predictors: a loss function \ell(y, f(x)). For example, use the squared error loss:

  \ell(y, f(x)) = (y - f(x))^2

Expected prediction error of a candidate predictor f:

  \mathcal{E}(f) = \mathbb{E}[(Y - f(X))^2] = \int (y - f(x))^2 \, P_{XY}(dx, dy)
Minimizing mean squared error

How to pick f to minimize \mathcal{E}(f)? Conditioning:

  \mathcal{E}(f) = \int (y - f(x))^2 \, P_{XY}(dx, dy)
    = \int \underbrace{\Big( \int (y - f(x))^2 \, P_{Y|X}(dy \mid x) \Big)}_{\mathbb{E}[(Y - f(x))^2 \mid X = x]} P_X(dx)

Minimize the inner integral for each fixed x by setting

  f^*(x) = \arg\min_c \mathbb{E}[(Y - c)^2 \mid X = x]

The solution is the regression function:

  f^*(x) = \mathbb{E}[Y \mid X = x]
Proof that the regression function is optimal

For any predictor f, write

  \mathcal{E}(f) = \mathbb{E}[(Y - f(X))^2]
    = \mathbb{E}[(Y - f^*(X) + f^*(X) - f(X))^2]
    = \mathbb{E}[(Y - f^*(X))^2] + 2\,\mathbb{E}[(Y - f^*(X))(f^*(X) - f(X))] + \mathbb{E}[(f(X) - f^*(X))^2]

We now show that the middle term is zero:

  \mathbb{E}[(Y - f^*(X))(f^*(X) - f(X))]
    = \mathbb{E}\big[ \mathbb{E}[(Y - f^*(X))(f^*(X) - f(X)) \mid X] \big]
    = \mathbb{E}\big[ \mathbb{E}[Y - f^*(X) \mid X] \, (f^*(X) - f(X)) \big]
    = \mathbb{E}\big[ (\underbrace{\mathbb{E}[Y \mid X]}_{= f^*(X)} - f^*(X)) (f^*(X) - f(X)) \big] = 0

Therefore, for any f we have

  \mathcal{E}(f) = \mathcal{E}(f^*) + \mathbb{E}[(f(X) - f^*(X))^2] \ge \mathcal{E}(f^*),

and the term \mathbb{E}[(f(X) - f^*(X))^2] = 0 if and only if f = f^*.
Restricting the predictor

Sometimes, even if we know the distribution P_{XY}, computing the optimal regression function f^* may not be worth it. In such cases, we may want to settle for the best predictor in some restricted class. This is one way to introduce inductive bias.

Example: best linear predictor

Suppose we are only interested in predictors of the form

  f(x) = \beta^T x = \sum_{i=1}^p \beta(i) x(i),

where \beta \in \mathbb{R}^p is some fixed vector of weights. Then

  \mathcal{E}(f) = \mathcal{E}(\beta) = \mathbb{E}[(Y - \beta^T X)^2] = \mathbb{E}[Y^2] - 2\beta^T \mathbb{E}[XY] + \mathbb{E}[(\beta^T X)^2].

Optimize by differentiating w.r.t. \beta to get the optimal linear predictor:

  \beta^* = (\mathbb{E}[XX^T])^{-1} \mathbb{E}[XY]
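The formula $\beta^* = (\mathbb{E}[XX^T])^{-1}\mathbb{E}[XY]$ can be sketched numerically by replacing the population moments with sample averages (the data-generating model below is an assumption chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed): Y = beta_true . X + small noise
n, p = 50_000, 3
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Plug-in estimate of beta* = (E[X X^T])^{-1} E[X Y]:
# sample second-moment matrix and sample cross-moment vector
Exx = (X.T @ X) / n
Exy = (X.T @ Y) / n
beta_star = np.linalg.solve(Exx, Exy)
```

With $n$ this large, `beta_star` recovers `beta_true` up to small sampling error, which is exactly the empirical-risk-minimization story of the following slides.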
Decomposition of the expected prediction error

What happens when we use restricted predictors? Let F be a restricted class of candidate predictors (e.g., linear or quadratic ones); define the best prediction error over F:

  \mathcal{E}_F = \min_{f \in F} \mathcal{E}(f).

Then for any f \in F we have

  \mathcal{E}(f) - \mathcal{E}^* = \underbrace{\mathcal{E}(f) - \mathcal{E}_F}_{= T_1} + \underbrace{\mathcal{E}_F - \mathcal{E}^*}_{= T_2}

where:
  T_1 can be set to zero by letting f achieve the minimum \min_{f \in F} \mathcal{E}(f)
  T_2 is large or small depending on how well the predictors in F can do
Statistical learning: unknown P_{XY}

To compute the best predictor f^*, you need to know the underlying distribution P_{XY}! What should we do when we don't know P_{XY}?

Empirical risk minimization

If we can draw a large number n of independent samples (X_1, Y_1), ..., (X_n, Y_n) from P_{XY}, then we can find \hat{f} that minimizes the empirical prediction error

  \hat{\mathcal{E}}(f) = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i))

Why does it work? Without getting into details, the reason is the Law of Large Numbers: for any f,

  \hat{\mathcal{E}}(f) \approx \mathcal{E}(f) with high probability.

It can then be proved that if \hat{f} minimizes the empirical prediction error, then

  \mathcal{E}(\hat{f}) \approx \mathcal{E}(f^*) with high probability,

where f^* is the true optimal predictor (for P_{XY}).
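The empirical prediction error above is just a sample average of losses. A minimal helper (the two loss functions are the ones used earlier in the lecture; the helper name is ours):

```python
import numpy as np

def empirical_risk(f, X, Y, loss):
    # hat{E}(f) = (1/n) * sum_i loss(Y_i, f(X_i))
    return float(np.mean([loss(y, f(x)) for x, y in zip(X, Y)]))

squared_loss = lambda y, yhat: (y - yhat) ** 2
zero_one_loss = lambda y, yhat: float(y != yhat)

# Example: empirical risk of the constant predictor f(x) = 0
risk = empirical_risk(lambda x: 0.0, [1, 2], [1.0, 2.0], squared_loss)
```

Empirical risk minimization then searches a class F for the f with the smallest `empirical_risk`.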
Example: least-squares linear regression

Given training data (X_1, Y_1), ..., (X_n, Y_n), find \hat{\beta} \in \mathbb{R}^p to minimize

  \hat{\mathcal{E}}(\beta) = \frac{1}{n} \sum_{i=1}^n (Y_i - \beta^T X_i)^2

Least-squares linear regression

Define the n \times p design matrix X = [X_{ij}], where X_{ij} is the jth component of X_i, and the response vector Y = [Y_1 \; Y_2 \; \cdots \; Y_n]^T. Then

  \hat{\mathcal{E}}(\beta) = \frac{1}{n} (Y - X\beta)^T (Y - X\beta)

Differentiate to get the normal equations:

  X^T (Y - X\beta) = 0

If X^T X is invertible, solve for the optimum:

  \hat{\beta} = (X^T X)^{-1} X^T Y

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.
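A quick numerical sketch of the normal equations (the synthetic data are an assumption for illustration). In practice one solves the least-squares problem with a dedicated routine rather than forming $(X^TX)^{-1}$ explicitly; both routes below agree when $X^TX$ is invertible:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.normal(size=(n, p))                  # n x p design matrix
beta = np.array([2.0, 0.0, -1.0, 0.5])       # assumed true coefficients
Y = X @ beta + rng.normal(scale=0.1, size=n)

# Numerically preferred: solve the least-squares problem directly
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Textbook route: solve the normal equations X^T X beta = X^T Y
beta_direct = np.linalg.solve(X.T @ X, X.T @ Y)
```

`lstsq` is more robust when $X^TX$ is ill-conditioned, but for well-conditioned designs the two solutions coincide.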
Classification using linear regression

If we have training data (X_1, Y_1), ..., (X_n, Y_n), where X_i \in \mathbb{R}^p are features and Y_i \in \{0, 1\} are binary labels, we can attempt to find the best linear classifier.

Classification via regression

Find the least-squares linear regressor:

  \hat{\beta} = (X^T X)^{-1} X^T Y

Use it to build the classifier:

  \hat{Y}(x) = 1 if \hat{\beta}^T x \ge 1/2; \quad \hat{Y}(x) = 0 if \hat{\beta}^T x < 1/2

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.

This will work well if the two classes are linearly separable (or almost so).
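The recipe above can be sketched in a few lines. One assumption beyond the slide: an intercept column is prepended to the design matrix, which is standard so that the $1/2$ threshold is meaningful for shifted data:

```python
import numpy as np

def classify_via_regression(X_train, y_train, X_test):
    # Fit least-squares linear regression on the 0/1 labels,
    # then threshold the fitted values at 1/2.
    # An intercept column of ones is prepended (our assumption).
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta_hat, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    return (A_test @ beta_hat >= 0.5).astype(int)
```

On two well-separated clusters this recovers the obvious decision rule; when classes overlap heavily, the regression fit can be pulled around by points far from the boundary.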
Nearest-neighbor methods

Nearest-neighbor methods use the training samples closest to the test point x to form \hat{Y}. These methods work well if the regression function is continuous.

Nearest-neighbor methods

k-nearest-neighbor regression:

  \hat{Y}(x) = \frac{1}{k} \sum_{X_i \in N_k(x)} Y_i

k-NN classification:

  \hat{Y}(x) = 1 if \sum_{X_i \in N_k(x)} Y_i \ge k/2; \quad \hat{Y}(x) = 0 if \sum_{X_i \in N_k(x)} Y_i < k/2

N_k(x): the set of k nearest neighbors of x in the training sample.

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.
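Both rules fit in one short function (Euclidean distance is assumed here; the slide does not fix a metric). Averaging the labels of $N_k(x)$ gives the regression estimate; thresholding that average at $1/2$ gives the classifier:

```python
import numpy as np

def knn_predict(X_train, Y_train, x, k, classify=False):
    # N_k(x): indices of the k nearest training points (Euclidean)
    dists = np.linalg.norm(X_train - x, axis=1)
    nbrs = np.argsort(dists)[:k]
    avg = Y_train[nbrs].mean()        # (1/k) * sum over N_k(x) of Y_i
    # Classification thresholds the local average at 1/2
    return int(avg >= 0.5) if classify else avg
```

This brute-force version scans all training points per query; tree- or hash-based neighbor search is used when n is large.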
Issues and pitfalls Curse of dimensionality Local methods (such as k-nearest-neighbors) require fairly dense training samples, which is hard to obtain when the problem dimension is very high. Bias-variance trade-off Using more complex models may lead to overfitting; using models that are overly simple may lead to underfitting. In both cases, we will have poor generalization.
The Curse of Dimensionality

For nearest-neighbor methods, we need to capture a fixed fraction of the data (say, 10%) to form local averages. In high dimension, the data will be spread very thinly, and we will need to use almost the entire sample, which would conflict with the local nature of the method.

Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.
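The standard quantitative version of this point (it matches the figure in Hastie et al., though the formula itself is not spelled out on the slide): for data uniform on the unit cube $[0,1]^p$, a subcube capturing a fraction $r$ of the data must have edge length $r^{1/p}$.

```python
# Edge length a subcube needs in [0, 1]^p to capture a fraction r
# of uniformly distributed data: volume r  =>  edge r**(1/p)
def required_edge_length(r, p):
    return r ** (1.0 / p)

# In p = 10 dimensions, capturing 10% of the data requires each edge
# to span about 80% of the full range -- hardly "local" anymore.
edge_10d = required_edge_length(0.1, 10)
```

In one dimension, a 10% neighborhood has edge 0.1; in ten dimensions it needs edge about 0.79, so "nearest" neighbors are not near at all.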
Bias and variance

Suppose the true model is Y = f^*(X) + \epsilon, where \epsilon is noise, which is zero-mean (\mathbb{E}\epsilon = 0) and independent of X. Let's see how well a predictor \hat{f} learned from a training sample T = \{(X_1, Y_1), ..., (X_n, Y_n)\} does on an independent test sample (X, Y):

  \mathbb{E}_T[(Y - \hat{f}(X))^2]
    = \mathbb{E}_T[(Y - \mathbb{E}_T \hat{f}(X) + \mathbb{E}_T \hat{f}(X) - \hat{f}(X))^2]
    = \mathbb{E}_T[(Y - \mathbb{E}_T \hat{f}(X))^2] + \underbrace{2\,\mathbb{E}_T[(Y - \mathbb{E}_T \hat{f}(X))(\mathbb{E}_T \hat{f}(X) - \hat{f}(X))]}_{=0} + \mathbb{E}_T[(\hat{f}(X) - \mathbb{E}_T \hat{f}(X))^2]
    = [Y - \mathbb{E}_T \hat{f}(X)]^2 + \mathbb{E}_T[(\hat{f}(X) - \mathbb{E}_T \hat{f}(X))^2]
    = O(\epsilon) + \underbrace{[f^*(X) - \mathbb{E}_T \hat{f}(X)]^2}_{= \text{Bias}^2} + \underbrace{\mathbb{E}_T[(\hat{f}(X) - \mathbb{E}_T \hat{f}(X))^2]}_{= \text{Variance}}

Bias: how well \hat{f} approximates the true regression function f^*.
Variance: how much \hat{f}(X) wiggles around its mean.
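Both terms can be estimated by simulation: refit $\hat{f}$ on many independent training samples $T$ and look at the spread of $\hat{f}(x_0)$ at a fixed test point. Everything below (the true function $x^2$, the deliberately too-simple linear model, the evaluation point) is an assumed setup for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
f_star = lambda x: x ** 2          # assumed true regression function
x0, n, trials = 0.0, 30, 2000      # evaluate at x0 over many samples T

preds = np.empty(trials)
for t in range(trials):
    # Draw a fresh training sample T and fit a linear model;
    # a line cannot represent x^2, so the fit is biased at x0
    x = rng.uniform(0.0, 1.0, n)
    y = f_star(x) + rng.normal(scale=0.1, size=n)
    A = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    preds[t] = b[0] + b[1] * x0

bias_sq = (f_star(x0) - preds.mean()) ** 2   # [f*(x0) - E_T f_hat(x0)]^2
variance = preds.var()                       # E_T[(f_hat - E_T f_hat)^2]
```

Here the bias term dominates: the best linear approximation to $x^2$ on $[0,1]$ misses $f^*(0) = 0$ by about $1/6$, while the variance shrinks as $n$ grows.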
Bias-variance trade-off There is a trade-off between the bias and the variance: Complex predictors will generally have low bias and high variance Simple predictors will generally have high bias and low variance Figure taken from Hastie, Tibshirani and Friedman, 2nd ed.
Regularization

One way to mitigate the bias-variance trade-off is to explicitly penalize predictors that are too complex. Instead of finding the minimizer of the empirical prediction error

  \hat{\mathcal{E}}(f) = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i)),

minimize the complexity-regularized empirical prediction error

  \hat{\mathcal{E}}_{cr}(f) = \hat{\mathcal{E}}(f) + \lambda J(f) = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i)) + \lambda J(f),

where:
  The regularizer J(f) is high for complicated predictors f and low for simple predictors f
  The regularization constant \lambda > 0 controls the balance between the data-fit term \hat{\mathcal{E}}(f) and the complexity term J(f)
Examples of regularization

Roughness penalty: F is the class of twice-differentiable functions, and

  J(f) = \int f''(x)^2 \, dx

This penalizes functions that wiggle around too much. The complexity-regularized solution is the smoothing spline.

Subset selection: F is the class of linear regression functions f(x) = \beta^T x, and

  J(\beta) = \|\beta\|_0 = the number of nonzero components of \beta

This is effective, but may be computationally expensive.

\ell_1-penalty: F is again the class of linear regression functions, and

  J(\beta) = \|\beta\|_1 = \sum_{i=1}^p |\beta(i)|

This often serves as a reasonably tractable relaxation of the subset-selection penalty. This approach is known as the LASSO, and is very popular right now.
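A small sketch of why the $\ell_1$ penalty produces sparse solutions, like the $\ell_0$ penalty it relaxes. The key primitive is soft-thresholding, the proximal operator of $\lambda|\cdot|$; in the special (assumed) case of an orthonormal design, the LASSO solution is simply the least-squares coefficient vector soft-thresholded at $\lambda$:

```python
import numpy as np

def soft_threshold(z, lam):
    # Proximal operator of lam * |.|: shrink toward 0, and set
    # exactly to 0 anything with magnitude below lam. This is the
    # building block of LASSO coordinate-descent / ISTA algorithms.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Orthonormal-design case: least-squares coefficients, then threshold
beta_ls = np.array([2.0, -0.3, 0.05])
beta_lasso = soft_threshold(beta_ls, 0.5)   # small coefficients -> exactly 0
```

The small coefficients are zeroed out exactly, not merely shrunk, which is what makes the LASSO perform variable selection.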
Next lecture: a sneak preview Next time, we will focus on linear methods for classification: Logistic regression Separating hyperplanes and the Perceptron