Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9)
Yongdai Kim
Seoul National University

0. Learning vs Statistical learning
Learning procedure
- Construct a claim by observing data or using logic
- Perform experiments
- Make a conclusion
Statistical learning procedure
- Collect data
- Analyze the data
- Find new rules
Let the data tell something.

Why is statistical learning necessary?
- We already know most of the rules that our brains can imagine.
- Life (nature, socio-economic status, human behavior, biology etc.) is more complex than we have thought.
- Our world is changing too fast for us to keep up with based only on our logic.
- Due to digitalization, the amount of data is increasing very fast.
- Most of the information in huge data remains undiscovered.
Sample questions
- What are the risk factors for heart failure?
- Are there genes which characterize differences between various races?
- How does the stock market behave?
- Which chemical compounds are effective for a specific disease?

Sample questions (continued)
- Who are valuable customers for our company?
- What are the influential factors for changing the amount of ozone?
- Are there patterns in the content of spam mails?
In statistical learning, the common objective is to find causes for a given phenomenon.
One of the common features of these problems is that the set of possible causes we can think of is very large.
The learning procedure suffers from time limitations unless we are lucky.

Machine learning vs Statistical learning (personal view)
- Machine learning is a method to educate a machine (computer).
- Two tasks
  - Without errors (e.g. rule-based learning)
  - With errors
- Statistical learning is a subset of machine learning, which deals with tasks with errors.

Statistical view of statistical learning
- Analysis of ultra-high dimensional data
- Methods to overcome the curse of dimensionality

Supervised and Unsupervised learning
Supervised learning
- Use the inputs to predict the values of the outputs
- Examples: Regression and Classification
Unsupervised learning
- Only use inputs to describe the data
- Examples: Clustering, PCA

1. Basic set-up of Supervised learning
- Input (Covariate): $x \in \mathbb{R}^p$
- Output (Response): $y \in \mathcal{Y}$
- System (Model): $y = \phi(x, \epsilon)$
- Loss function: $l(y, a)$
- Assumption: $f$ belongs to a family of functions $\mathcal{F}$.
- Learning set (Data): $\mathcal{L} = \{(y_i, x_i), i = 1, \ldots, n\}$, assumed to be a random sample of $(Y, X) \sim P$
- Objective: Find $f_0 = \arg\min_{f \in \mathcal{F}} E_{(Y,X)}\, l(Y, f(X))$.
- Predictor (Estimator): $\hat f(x) = f(x, \mathcal{L})$.
- Prediction: If the new input is $x$, predict the unknown $y$ by $\hat f(x)$.

- If $y$ is categorical: Classification
- If $y$ is continuous: Regression

2. From Least Squares to Nearest Neighbor (for regression)
Least Squares
- Assumption: $f(x) \in \{\beta_0 + \sum_{i=1}^p x_i \beta_i\}$
- Estimate $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ by $\hat\beta$, which minimizes the residual sum of squares
$$RSS(\beta) = \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{k=1}^p x_{ki} \beta_k \Big)^2.$$
- $f(x, \mathcal{L}) = \hat\beta_0 + \sum_{i=1}^p x_i \hat\beta_i$.

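The least squares fit can be sketched in a few lines for the special case of a single input (p = 1), where the minimizer of the RSS has a closed form. This is only an illustration: the function names `fit_least_squares` and `rss`, the random seed, and the choice of standard normal inputs are assumptions made here, not part of the slides.

```python
import random

def fit_least_squares(xs, ys):
    """Return (beta0, beta1) minimizing the RSS for one input variable (p = 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

def rss(beta0, beta1, xs, ys):
    """Residual sum of squares of the line beta0 + beta1 * x."""
    return sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: y = x + eps with standard normal x and eps (assumed here).
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(100)]
ys = [x + random.gauss(0, 1) for x in xs]

b0, b1 = fit_least_squares(xs, ys)
print("beta0_hat =", b0, "beta1_hat =", b1)
print("training error (RSS / n) =", rss(b0, b1, xs, ys) / len(xs))
```

For general p the same idea applies, with the closed form replaced by solving the normal equations.
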
Nearest Neighbor (NN)
- $N_k(x)$: the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample.
- $f(x, \mathcal{L}) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$.

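A minimal sketch of the k-NN predictor for a one-dimensional input, assuming absolute difference as the distance. The function name `knn_predict` and the toy data are made up for illustration.

```python
def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points closest to x0 (p = 1)."""
    # Sort training indices by distance to the target point and keep the first k.
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    neighbors = order[:k]
    return sum(ys[i] for i in neighbors) / k

# Toy training sample (hypothetical): y = x^2 without noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]

print(knn_predict(1.2, xs, ys, k=1))  # uses only the point x = 1
print(knn_predict(1.2, xs, ys, k=3))  # averages y at x = 1, 2 and 0
print(knn_predict(1.2, xs, ys, k=5))  # k = n: the overall mean of y
```

With k = n the prediction no longer depends on the target point, which is the least complex predictor in this family.
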
Simulation 1
- Model: $y = x + \epsilon$ and $\epsilon \sim N(0, 1)$.
- Training sample size is 100.
- The test error is calculated from a test sample of size 5000.
Result
Method    Training error    Test error
Linear    0.8247196         3.395535
1-NN      0.0000000         3.915410
5-NN      0.7080551         3.434624
15-NN     0.8412333         3.400420

Plot (four panels of y against x): Linear Regression; Nearest Neighbor with k = 1; Nearest Neighbor with k = 5; Nearest Neighbor with k = 15.

Simulation 2
- Model: $y = x(1 - x) + \epsilon$ and $\epsilon \sim N(0, 1)$.
- Training sample size is 100.
- The test error is calculated from a test sample of size 5000.
Result
Method    Training error    Test error
Linear    3.3307623         3.051589
1-NN      0.0000000         1.892876
5-NN      0.9872481         1.387429
15-NN     2.1303585         2.069501

Plot (four panels of y against x): Linear Regression; Nearest Neighbor with k = 1; Nearest Neighbor with k = 5; Nearest Neighbor with k = 15.

Comments
- The linear model is the best when the true model is linear and the worst when the true model is nonlinear.
- NN performs reasonably well regardless of what the true function is.
- Training error is not a good estimate of the test error.
- Complicated models do not always perform well.
- The number of neighbors $k$ controls the complexity of the predictor.

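These comments can be checked with a small simulation in the spirit of Simulation 2. The sketch below assumes the model is y = x(1 - x) + eps with standard normal inputs; the input distribution, the seed, and all function names are assumptions, so the numbers it prints will not reproduce the table on the slides.

```python
import random

def true_f(x):
    return x * (1 - x)

def draw_sample(n, rng):
    """Draw n pairs (x, y) with x ~ N(0, 1) (assumed) and y = f(x) + N(0, 1) noise."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, 1) for x in xs]
    return xs, ys

def fit_linear(xs, ys):
    """Least squares line for one input; returns a prediction function."""
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    b1 = sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) / sum((x - xb) ** 2 for x in xs)
    b0 = yb - b1 * xb
    return lambda x: b0 + b1 * x

def fit_knn(xs, ys, k):
    """k-NN regression for one input; returns a prediction function."""
    def predict(x0):
        nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
        return sum(ys[i] for i in nearest) / k
    return predict

def mse(predict, xs, ys):
    """Mean squared prediction error on the sample (xs, ys)."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)
train_x, train_y = draw_sample(100, rng)
test_x, test_y = draw_sample(5000, rng)

models = {"Linear": fit_linear(train_x, train_y)}
for k in (1, 5, 15):
    models[f"{k}-NN"] = fit_knn(train_x, train_y, k)

results = {name: (mse(m, train_x, train_y), mse(m, test_x, test_y))
           for name, m in models.items()}
for name, (train_err, test_err) in results.items():
    print(f"{name:8s} training error = {train_err:.3f}   test error = {test_err:.3f}")
```

The 1-NN training error is exactly zero because each training point is its own nearest neighbor, which is why training error alone cannot be used to compare these predictors.
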
LS vs NN
                    LS                                    NN
Assumption          linear                                nothing
Data size           small to medium                       large
Interpretation      easy                                  almost impossible
Predictability      good when the true model is simple    stable regardless of the true model
Tuning parameter    nothing                               the size of the neighborhood

3. Statistical Decision theory
Regression
- The training sample $\mathcal{L}$ is a random sample from the joint distribution $P(y, x)$.
- Let $l(y, f(x))$ be a loss function for penalizing errors in prediction.
- The most popular loss function is the squared error loss: $l(y, f(x)) = (y - f(x))^2$.
- The expected prediction error of $f$ ($EPE(f)$) is defined as
$$EPE(f) = E(Y - f(X))^2,$$
  where $(Y, X) \sim P(y, x)$.
- Theorem: $f_0(x) = E(Y \mid X = x)$ minimizes $EPE(f)$.
- $E(Y \mid X = x)$ is called the regression function.

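The theorem follows from a pointwise argument, sketched here under the assumption that $E(Y^2) < \infty$ so that every term is finite; $c$ stands for a candidate value of $f(x)$ at a fixed $x$.

```latex
\begin{aligned}
EPE(f) &= E_X \, E\big[ (Y - f(X))^2 \mid X \big], \\
E\big[ (Y - c)^2 \mid X = x \big]
  &= E\big[ (Y - E(Y \mid X = x))^2 \mid X = x \big]
     + \big( E(Y \mid X = x) - c \big)^2 \\
  &= Var(Y \mid X = x) + \big( E(Y \mid X = x) - c \big)^2 .
\end{aligned}
```

The cross term vanishes because $E[\,Y - E(Y \mid X = x) \mid X = x\,] = 0$. The first term does not depend on $c$, and the second is zero exactly when $c = E(Y \mid X = x)$, so minimizing at every $x$ gives the regression function.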
For the NN method, $f$ is estimated by $\hat f$:
$$\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x)).$$
Two approximations are made:
- the expectation is approximated by averaging over sample data;
- conditioning at a point is relaxed to conditioning on some region close to the target point.
Theorem: Under regularity conditions, $\hat f(x) \to f_0(x)$ for all $x \in \mathbb{R}^p$ when $n, k \to \infty$ and $k/n \to 0$.
Together, $k \to \infty$ and $k/n \to 0$ mean that the model complexity (roughly $n/k$) should grow with the sample size, but more slowly than the sample size.

For LS, $f$ is assumed to be a linear function:
$$f(x) = \beta_0 + \sum_{i=1}^p x_i \beta_i.$$
- $f$ with $\beta = \big( E(XX^T) \big)^{-1} E(XY)$ minimizes the EPE (here $X$ is taken to include a constant 1 for the intercept).
- The LS estimator replaces the expectations by averages over the training sample.

Classification
- $y \in \{1, \ldots, J\}$.
- For a given loss function $l$, the EPE is defined as $E(l(Y, f(X)))$.
- Since
$$EPE(f) = E_X \sum_{j=1}^J l(j, f(X)) P(Y = j \mid X),$$
  the predictor
$$f(x) = \arg\min_{k=1,\ldots,J} \sum_{j=1}^J l(j, k) P(Y = j \mid X = x)$$
  minimizes the EPE.
- If $l(y, f(x)) = I(y \neq f(x))$, $f(x)$ becomes
$$f(x) = \arg\max_{j=1,\ldots,J} P(Y = j \mid X = x). \quad (1)$$
- This predictor is called the Bayes rule (Bayes classifier) and its EPE is called the Bayes rate.

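As an illustration of the minimization above, the following sketch picks the class with the smallest expected loss when the conditional probabilities at a point are known. The function name `bayes_rule`, the 0-based class indices, and the probability and loss values are all hypothetical.

```python
def bayes_rule(posterior, loss):
    """Return the class index k minimizing sum_j loss[j][k] * posterior[j].

    posterior[j] plays the role of P(Y = j | X = x) and loss[j][k] of l(j, k);
    classes are indexed from 0 here for convenience.
    """
    n_classes = len(posterior)
    expected = [
        sum(loss[j][k] * posterior[j] for j in range(n_classes))
        for k in range(n_classes)
    ]
    return min(range(n_classes), key=lambda k: expected[k])

posterior = [0.2, 0.5, 0.3]  # hypothetical conditional class probabilities

# 0-1 loss: the rule reduces to picking the most probable class.
zero_one = [[0 if j == k else 1 for k in range(3)] for j in range(3)]
print(bayes_rule(posterior, zero_one))

# Asymmetric loss: missing true class 2 costs 10, so the rule prefers class 2.
costly = [[0, 1, 1], [1, 0, 1], [10, 10, 0]]
print(bayes_rule(posterior, costly))
```

The second call shows that the most probable class is the optimal prediction only under the 0-1 loss.
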
Estimate the Bayes classifier via function estimation
- First, estimate $\phi_j(x) = P(Y = j \mid X = x)$, and
- estimate the Bayes classifier by replacing $P(Y = j \mid X = x)$ by $\hat\phi_j(x)$ in (1).
- The NN estimate of $\phi_j$ is
$$\hat\phi_j(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = j).$$
- Linear models do not fit well for estimating $\phi_j$ since $\phi_j$ should have values between 0 and 1.
- Logistic regression is a promising alternative.

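The NN estimate of the class probabilities, followed by the plug-in classifier, might look like this for a one-dimensional input. The function name `knn_class_probs` and the toy data are assumptions for illustration.

```python
def knn_class_probs(x0, xs, labels, k, classes):
    """Estimate P(Y = j | X = x0) by the share of class j among the k nearest points."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return {j: sum(1 for i in nearest if labels[i] == j) / k for j in classes}

# Toy training sample (hypothetical) with classes 1 and 2.
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [1, 1, 2, 2, 2, 1]

probs = knn_class_probs(0.25, xs, labels, k=3, classes=(1, 2))
predicted = max(probs, key=probs.get)  # plug the estimates into the argmax rule
print(probs, predicted)
```

By construction the estimates lie between 0 and 1 and sum to one over the classes, which a linear model fitted to class indicators does not guarantee.
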
4. Curse of dimensionality
When $p$ is large, the concept of a neighborhood does not work for local averaging.
Phenomenon 1
- $X = (X_1, \ldots, X_p) \sim \mathrm{Uniform}[0, 1]^p$
- Consider a hypercubical neighborhood about a target point.
- We want to capture a fraction $r$ of the sample.
- Then the expected edge length will be $e_p(r) = r^{1/p}$.
- $e_{10}(0.01) = 0.63$ and $e_{10}(0.1) = 0.80$.
- To capture 1% or 10% of the data to form a local average, we must cover 63% or 80% of the range of each input variable.
- Such neighborhoods are no longer local.

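The edge length formula is easy to evaluate directly. The short sketch below (the function name `edge_length` is made up here) tabulates it for a few dimensions to show how quickly the neighborhood stops being local.

```python
def edge_length(r, p):
    """Edge length of a hypercube capturing a fraction r of Uniform[0, 1]^p."""
    return r ** (1.0 / p)

for p in (1, 2, 10, 100):
    print(p, round(edge_length(0.01, p), 3), round(edge_length(0.10, p), 3))

# For p = 10 this gives about 0.631 and 0.794, i.e. the 63% and
# (after rounding up) 80% quoted above.
```
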
Phenomenon 2
- $X = (X_1, \ldots, X_p) \sim$ Uniform in a $p$-dimensional unit ball centered at the origin.
- For the sample size $n$, let $R_i = \sqrt{\sum_{k=1}^p X_{ki}^2}$ for $i = 1, \ldots, n$.
- Let $R_{(1)} = \min_i \{R_i\}$. Then, the median of $R_{(1)}$ is
$$\big( 1 - (1/2)^{1/n} \big)^{1/p}.$$
- For $n = 500$, $p = 10$, the median is approximately 0.52, more than halfway to the boundary.
- Most data points are closer to the boundary of the sample space than to the origin.
- Prediction is much more difficult near the edges since one must extrapolate rather than interpolate.

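The median formula comes from $P(R_{(1)} > r) = (1 - r^p)^n$, set equal to 1/2. A small sketch evaluating it (the function name `median_nearest_distance` is an assumption) shows that the value of about 0.52 corresponds to n = 500 and p = 10, the setting used in HTF Section 2.5.

```python
def median_nearest_distance(n, p):
    """Median distance from the origin to the closest of n points that are
    uniform in the p-dimensional unit ball: (1 - (1/2)^(1/n))^(1/p)."""
    return (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / p)

for p in (1, 2, 10, 50):
    print(p, round(median_nearest_distance(500, p), 3))
```

For fixed n the median moves toward 1 as p grows, so even the closest training point ends up near the boundary.
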
Phenomenon 3
- Suppose $X \sim \mathrm{Uniform}[-1, 1]^p$.
- Assume that the true relation is $Y = f(X) = \exp(-8 \|X\|^2)$.
- Consider the 1-NN estimate at $x = 0$.
- The bias of the estimator is $1 - \exp(-8 \|x_{(1)}\|^2)$, where $x_{(1)}$ is the training point with the smallest norm.
- Since $\|X\|^2 = \sum_{i=1}^p X_i^2 \geq X_{(p)}^2$ and $X_{(p)}^2 \to 1$ as $p \to \infty$, the bias tends to increase as $p$ increases. Here $X_{(p)}^2$ denotes the largest of $X_1^2, \ldots, X_p^2$.

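The growth of the bias with p can be seen in a quick Monte Carlo sketch. The training sample size n = 200, the number of replications, the seed, and the function name are choices made here for illustration; the slides do not specify them.

```python
import math
import random

def one_nn_bias_at_origin(n, p, rng):
    """Bias 1 - exp(-8 * ||x_(1)||^2) of 1-NN at x = 0 for one training sample
    of n points drawn from Uniform[-1, 1]^p."""
    min_sq_norm = min(
        sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(p)) for _ in range(n)
    )
    return 1.0 - math.exp(-8.0 * min_sq_norm)

rng = random.Random(0)
n, reps = 200, 50  # hypothetical sample size and number of replications

average_bias = {}
for p in (1, 2, 5, 10):
    biases = [one_nn_bias_at_origin(n, p, rng) for _ in range(reps)]
    average_bias[p] = sum(biases) / reps
    print(f"p = {p:2d}   average bias = {average_bias[p]:.4f}")
```

Since there is no noise in this model, the error of the 1-NN estimate at the origin is entirely bias, and it is driven by how far the nearest training point lies from the origin.
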
5. Overfitting and Bias-Variance tradeoff
- As we have seen, in the NN method, the size of the neighborhood $k$ controls the complexity of the predictor.
- The question is how to choose $k$.
- If we knew $P(y, x)$, we could choose $k$ by minimizing the EPE (test error):
$$EPE(\hat f_k) = E(Y - \hat f_k(X))^2,$$
  where $\hat f_k$ is the k-NN estimate of $f$.
- Unfortunately, we do not know $P(y, x)$.
- One naive answer is to estimate the EPE of $\hat f_k$ by the residual sum of squares (training error):
$$\sum_{i=1}^n (y_i - \hat f_k(x_i))^2.$$

- The training error is a downward-biased estimator of the test error since the data set is used twice (once for constructing $\hat f$ and once for calculating the training error).
- Moreover, the training error keeps decreasing as $k$ gets smaller, while the test error decreases initially and increases later.
- This means that overly complicated models (models that fit the training data too closely, i.e. overfitted models) show poor performance.
- This seemingly mysterious phenomenon can be explained by the bias-variance decomposition.
- Several ways of choosing the model complexity (i.e. $k$ in the NN method) will be explained later.

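Bias-Variance tradeoff (for regression)
- Suppose $Y = f(X) + \epsilon$ with $E(\epsilon) = 0$ and $Var(\epsilon) = \sigma^2$.
- The test error of $f(x, \mathcal{L})$, averaged over training samples $\mathcal{L}$, is given by
$$TE = E_{\mathcal{L}} E_{(Y,X)} \big( (Y - f(X, \mathcal{L}))^2 \big),$$
  which is decomposed as
$$\begin{aligned}
TE &= E_{(Y,X)} \big( (Y - f(X))^2 \big) + E_X \big( (f(X) - E_{\mathcal{L}} f(X, \mathcal{L}))^2 \big) + E_X \big( E_{\mathcal{L}} (f(X, \mathcal{L}) - E_{\mathcal{L}} f(X, \mathcal{L}))^2 \big) \\
   &= \sigma^2 + E_X \big( Bias_{\mathcal{L}}(X)^2 + Variance_{\mathcal{L}}(X) \big).
\end{aligned}$$
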
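A sketch of why the decomposition holds, assuming that the new pair $(Y, X)$ is independent of the training sample and that $E(\epsilon \mid X) = 0$. The shorthand $\bar f(x) = E_{\mathcal{L}} f(x, \mathcal{L})$ is introduced here only for this derivation.

```latex
\begin{aligned}
Y - f(X, \mathcal{L})
  &= \underbrace{\big( Y - f(X) \big)}_{\epsilon}
   + \underbrace{\big( f(X) - \bar f(X) \big)}_{\text{bias}}
   + \big( \bar f(X) - f(X, \mathcal{L}) \big), \\
E_{\mathcal{L}} E_{(Y,X)} \big( Y - f(X, \mathcal{L}) \big)^2
  &= E(\epsilon^2)
   + E_X \big( f(X) - \bar f(X) \big)^2
   + E_X E_{\mathcal{L}} \big( f(X, \mathcal{L}) - \bar f(X) \big)^2 .
\end{aligned}
```

The three cross terms vanish: those involving $\epsilon$ because $\epsilon$ has conditional mean zero and is independent of the training sample, and the remaining one because $E_{\mathcal{L}} \big( \bar f(x) - f(x, \mathcal{L}) \big) = 0$ for every fixed $x$.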
In general, as the model gets more complicated, the bias decreases and the variance increases.
Example: k-NN method
- $f(x, \mathcal{L}) = \sum_{l=1}^k (f(x_{(l)}) + \epsilon_{(l)})/k$, where the subscript $(l)$ indicates the $l$-th nearest neighbor of $x$.
- Then
$$Bias_{\mathcal{L}}(x) = f(x) - \frac{1}{k} \sum_{l=1}^k f(x_{(l)}) \quad\text{and}\quad Variance_{\mathcal{L}}(x) = \frac{\sigma^2}{k}.$$
- For $k = 1$, the bias is the smallest and the variance is the largest, while for $k = n$, the bias is the largest and the variance is the smallest.

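The k-NN bias and variance formulas can be checked numerically when the inputs are held fixed and only the noise is redrawn. Everything specific in this sketch is an assumption: the true function x(1 - x), the evenly spaced design on [-3, 3], the target point 0.5, the seed, and the number of replications.

```python
import random

def f(x):
    return x * (1 - x)  # assumed true regression function

rng = random.Random(0)
n, sigma, x0, reps = 100, 1.0, 0.5, 2000
xs = [-3 + 6 * i / (n - 1) for i in range(n)]  # fixed design (assumed)

# With fixed inputs the neighbors of x0 do not change between replications.
order = sorted(range(n), key=lambda i: abs(xs[i] - x0))

summary = {}
for k in (1, 5, 25, 100):
    neighbors = order[:k]
    fits = []
    for _ in range(reps):
        # Redraw only the noise at the k neighbors and average the responses.
        fits.append(sum(f(xs[i]) + rng.gauss(0, sigma) for i in neighbors) / k)
    mean_fit = sum(fits) / reps
    bias = f(x0) - mean_fit
    variance = sum((v - mean_fit) ** 2 for v in fits) / reps
    summary[k] = (bias, variance)
    print(f"k = {k:3d}   bias = {bias:7.3f}   variance = {variance:.3f}   sigma^2/k = {sigma**2 / k:.3f}")
```

The simulated variance tracks sigma^2/k, while the bias grows in magnitude as k approaches n and the neighborhood covers the whole design.
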
Plot: test error and training error against model complexity. Low complexity corresponds to high bias and low variance; high complexity corresponds to low bias and high variance.

6. Four situations in supervised learning
1. $p$ is small and $\mathcal{F}$ is parametric.
   - Standard regression and classification problems
   - MLE, least squares, robust estimators etc.
2. $p$ is large and $\mathcal{F}$ is parametric.
   - Develop efficient methods for small and moderate samples
   - Variable selection, shrinkage, Bayesian methods etc.
3. $p$ is small and $\mathcal{F}$ is nonparametric.
   - Nonparametric regression
   - Kernel, spline, wavelet, mixture model etc.
4. $p$ is large and $\mathcal{F}$ is nonparametric.
   - Main playground of Data Mining
   - Decision tree, projection pursuit, MARS, neural network etc.