Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff. Reading: Chapter 2. STATS 202: Data mining and analysis. September 27, 2017. 1 / 20
Supervised vs. unsupervised learning
In unsupervised learning we seek to understand the relationships between variables in a data matrix (samples or observations in the rows, variables or factors in the columns):
- Quantitative variables, e.g. weight, height, number of children, ...
- Qualitative variables, e.g. college major, profession, gender, ...
2 / 20
Supervised vs. unsupervised learning
In unsupervised learning we seek to understand the relationships between variables in a data matrix. Our goal is to:
- Find meaningful relationships between the variables: correlation analysis.
- Find low-dimensional representations of the data which make it easy to visualize the variables: PCA, ICA, isomap, locally linear embeddings, etc.
- Find meaningful groupings of the data: clustering.
3 / 20
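As a minimal sketch of the low-dimensional-representation idea, here is PCA computed via the SVD on a made-up toy data matrix (the data and group structure are illustrative, not from the lecture):

```python
import numpy as np

# Toy data matrix: 6 samples (rows) x 3 variables (columns),
# made up so that the samples fall into two obvious groups.
X = np.array([[2.0, 1.0, 0.5],
              [1.5, 0.8, 0.4],
              [3.0, 1.6, 0.9],
              [8.0, 7.5, 6.9],
              [7.2, 7.0, 6.5],
              [8.5, 8.1, 7.3]])

# Center each variable, then take the SVD; the right singular
# vectors are the principal directions, and U * S gives the scores.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S  # low-dimensional representation of the samples

# Keeping only the first column of `scores` projects the data onto
# the direction of greatest variance, which here separates the
# two groups of samples.
first_pc = scores[:, 0]
```

Projecting onto the first one or two columns of `scores` is exactly the kind of visualization-friendly representation described above.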
Supervised vs. unsupervised learning
In supervised learning, there are input variables and output variables (samples in the rows; input variables and an output variable in the columns):
- If the output is quantitative, we say this is a regression problem.
- If the output is qualitative (categorical), we say this is a classification problem.
4 / 20
Supervised vs. unsupervised learning
Let X be the vector of inputs for a particular sample. The output variable is modeled by:
Y = f(X) + ε, where ε is a random error.
- The function f captures the systematic relationship between X and Y; f is fixed and unknown.
- ε represents the unpredictable "noise" in the problem.
- Our goal is to learn the function f, using a set of training samples.
5 / 20
Supervised vs. unsupervised learning
Motivations for modeling Y = f(X) + ε:
- Prediction: useful when the input variables are readily available, but the output variable is not. Example: predict stock prices next month using data from last year.
- Inference: a model for f can help us understand the structure of the data. Which variables influence the output, and which don't? What is the relationship between each variable and the output, e.g. linear or non-linear? Example: what is the influence of genetic variations on the incidence of heart disease?
6 / 20
Parametric and nonparametric methods
Most supervised learning methods fall into one of two classes:
- Parametric methods: we assume that f takes a specific form; for example, a linear form f(X) = X₁β₁ + ··· + X_p β_p with parameters β₁, ..., β_p. Using the training data, we try to fit the parameters.
- Non-parametric methods: we don't make any assumptions on the form of f, but we restrict how "wiggly" or rough the function can be.
7 / 20
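The two classes can be sketched side by side on simulated data (the simulated function and the k-nearest-neighbors rule are illustrative choices, not the lecture's examples): a linear least-squares fit estimates two parameters, while a k-NN regression makes no assumption on the form of f.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training data: y = f(x) + noise, with a nonlinear f.
x = np.sort(rng.uniform(0, 10, 50))
y = np.sin(x) + 0.1 * rng.normal(size=50)

# Parametric fit: assume f is linear and estimate its two
# parameters (intercept, slope) by least squares.
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

def linear_predict(x0):
    return beta[0] + beta[1] * x0

# Non-parametric fit: k-nearest-neighbors regression; here the
# choice of k restricts how rough the fitted function can be.
def knn_predict(x0, k=5):
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()
```

The linear fit is summarized entirely by `beta`; the k-NN fit has no such finite summary and must keep the training data around.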
Parametric vs. nonparametric prediction
[Figures 2.4 and 2.5: Income as a function of Years of Education and Seniority, fit by a linear surface and by a rough spline surface.]
- Parametric methods have a limit on fit quality. Non-parametric methods keep improving as we add more data to fit.
- Parametric methods are often simpler to interpret.
8 / 20
Prediction error
Training data: (x₁, y₁), (x₂, y₂), ..., (x_n, y_n). Predicted function: f̂, which is based on these data. What is a good choice for f̂?
Our goal in supervised learning is to minimize the prediction error: for a new data point (x₀, y₀) (not used in training) we want the squared error (y₀ − f̂(x₀))² to be small.
Given test data {(xᵢ, yᵢ); i = 1, ..., m} which were not used to fit the model, a common measure of the quality of f̂ is the test mean squared error (MSE):
MSE_test(f̂) = (1/m) Σᵢ₌₁^m (yᵢ − f̂(xᵢ))².
9 / 20
Prediction error
What if we don't have any test data? It is tempting to use instead the training MSE:
MSE_training(f̂) = (1/n) Σᵢ₌₁^n (yᵢ − f̂(xᵢ))².
The main challenge of statistical learning is that a low training MSE does not imply a low test MSE.
10 / 20
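The gap between training and test MSE is easy to reproduce in simulation (the sinusoidal f, noise level, and polynomial family here are illustrative choices, not the lecture's): increasing flexibility always drives training MSE down, but not test MSE.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    # Data from a known f with additive noise.
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

x_train, y_train = simulate(30)
x_test, y_test = simulate(200)

def poly_mse(degree):
    # Fit a polynomial of the given degree to the training data,
    # then return (training MSE, test MSE).
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    return train_mse, test_mse

train1, test1 = poly_mse(1)    # rigid fit
train15, test15 = poly_mse(15) # very flexible fit
# train15 < train1 always, but test15 is typically far worse.
```

This is precisely the pattern in Figure 2.9: the gray training-MSE curve decreases monotonically with flexibility while the red test-MSE curve is U-shaped.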
Figure 2.9
[Left panel: circles are simulated data from the black curve f, with X on the horizontal axis and Y on the vertical axis; three estimates f̂ are shown: 1. linear regression, 2. a very smooth spline, 3. a quite rough spline. Right panel: mean squared error as a function of flexibility; red line: test MSE, gray line: training MSE.]
In this artificial example, we know what f is.
11 / 20
Figure 2.10
[Same setup as Figure 2.9, but the function f is now almost linear.]
12 / 20
Figure 2.11
[Same setup as Figure 2.9.] When the noise ε has small variance, the third (rough) method does well.
13 / 20
The bias-variance decomposition
Let x₀ be a fixed test point, y₀ = f(x₀) + ε₀, and let f̂ be estimated from n training samples (x₁, y₁), ..., (x_n, y_n). Let E denote the expectation over y₀ and the training data. Then the expected test MSE at x₀ can be decomposed:
E(y₀ − f̂(x₀))² = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε₀).
- Var(ε₀) is the irreducible error.
- Var(f̂(x₀)) = E[f̂(x₀) − E(f̂(x₀))]² is the variance of the estimate of Y: it measures how much the estimate f̂(x₀) changes when we sample new training data.
- [Bias(f̂(x₀))]² = [E(f̂(x₀)) − f(x₀)]² is the squared bias of the estimate of Y: it measures the deviation of the average prediction E(f̂(x₀)) from the truth f(x₀).
14 / 20
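The decomposition can be checked by simulation (the choice f = sin, the noise level, and the linear estimator below are illustrative assumptions): estimate the variance and squared bias of f̂(x₀) over many training sets and compare their sum, plus σ², against the empirical expected MSE.

```python
import numpy as np

rng = np.random.default_rng(2)
f = np.sin     # the true f (known here because we simulate)
sigma = 0.3    # noise standard deviation, so Var(eps_0) = sigma**2
x0 = 1.0       # fixed test point

def fit_and_predict(degree):
    # Draw one training set of n = 20 points and return the
    # polynomial fit's prediction f_hat(x0).
    x = rng.uniform(0, 3, 20)
    y = f(x) + sigma * rng.normal(size=20)
    return np.polyval(np.polyfit(x, y, degree), x0)

# Repeat over many training sets to estimate variance and bias
# of a linear (degree-1) estimator at x0.
preds = np.array([fit_and_predict(degree=1) for _ in range(2000)])
variance = preds.var()
bias_sq = (preds.mean() - f(x0)) ** 2

# By the decomposition, the expected test MSE at x0 should equal
# variance + bias^2 + irreducible error.
expected_mse = variance + bias_sq + sigma ** 2
```

With a rigid degree-1 fit of a nonlinear f, `bias_sq` dominates `variance`, matching the rule of thumb on the next slides.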
15 / 20
Implications of the bias-variance decomposition
E(y₀ − f̂(x₀))² = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε₀).
- The expected MSE is never smaller than the irreducible error.
- Both small bias and small variance are needed for a small expected MSE.
- It is easy to find a zero-variance procedure with high bias (predict a constant value), and a very low-bias procedure with high variance (choose f̂ passing through every training point).
- In practice, the best expected MSE is achieved by incurring some bias to decrease variance, and vice versa: this is the bias-variance trade-off. Rule of thumb: more flexible methods → higher variance and lower bias. Example: a linear fit may not change substantially with the training data, but has high bias if f is very non-linear.
16 / 20
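The two extreme procedures above can be simulated directly (the quadratic f, noise level, and 1-nearest-neighbor rule are illustrative assumptions): a constant predictor has zero variance but high bias, while an interpolating nearest-neighbor predictor has low bias but high variance.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: x ** 2   # the true f (known because we simulate)
x0 = 0.5               # fixed test point; f(x0) = 0.25

def predictions(estimator, reps=2000):
    # Collect the estimator's prediction at x0 over many
    # freshly drawn training sets of 10 noisy points.
    out = []
    for _ in range(reps):
        x = rng.uniform(0, 1, 10)
        y = f(x) + 0.2 * rng.normal(size=10)
        out.append(estimator(x, y))
    return np.array(out)

# Zero-variance, high-bias: always predict the constant 0,
# so the bias at x0 is f(x0) = 0.25.
constant = predictions(lambda x, y: 0.0)

# Low-bias, high-variance: predict with the single nearest
# training point (an interpolating 1-nearest-neighbor rule).
nearest = predictions(lambda x, y: y[np.argmin(np.abs(x - x0))])
```

Neither extreme minimizes expected MSE; the trade-off is resolved somewhere in between.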
Figure 2.12
[Three panels, each plotting MSE, squared bias, and variance against flexibility: squiggly f with high noise; linear f with high noise; squiggly f with low noise.]
17 / 20
Classification problems
In a classification setting, the output takes values in a discrete set. For example, if we are predicting the brand of a car based on a number of variables, the function f takes values in the set {Ford, Toyota, Mercedes-Benz, ...}.
We adopt some additional notation:
- P(X, Y): joint distribution of (X, Y);
- P(Y | X): conditional distribution of Y given X;
- ŷᵢ: prediction for xᵢ.
18 / 20
Loss function for classification
There are many ways to measure the error of a classification prediction. One of the most common is the 0-1 loss: 1(y₀ ≠ ŷ₀).
As with squared error, we can compute the average test prediction error (called the test error rate under 0-1 loss) using previously unseen test data {(xᵢ, yᵢ); i = 1, ..., m}:
(1/m) Σᵢ₌₁^m 1(yᵢ ≠ ŷᵢ).
Similarly, we can compute the (usually optimistic) training error rate:
(1/n) Σᵢ₌₁^n 1(yᵢ ≠ ŷᵢ).
19 / 20
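The error rate under 0-1 loss is just the fraction of mismatched labels; a tiny sketch on made-up labels (reusing the car-brand example):

```python
import numpy as np

# Hypothetical true labels and predictions for 5 test samples.
y_true = np.array(["Ford", "Toyota", "Ford", "Mercedes-Benz", "Toyota"])
y_pred = np.array(["Ford", "Ford", "Ford", "Mercedes-Benz", "Toyota"])

# Test error rate under 0-1 loss: the average of 1(y_i != yhat_i),
# i.e. the fraction of mismatches.  One of five is wrong here.
error_rate = np.mean(y_true != y_pred)  # -> 0.2
```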
Bayes classifier
[Figure 2.13: a two-class example in the (X1, X2) plane, with the Bayes decision boundary.]
In practice, we never know the joint probability P. However, we can assume that it exists.
The Bayes classifier assigns:
ŷᵢ = argmax_j P(Y = j | X = xᵢ).
It can be shown that this is the best classifier under the 0-1 loss.
20 / 20
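Since the Bayes classifier requires the (normally unknown) conditional distribution, it can only be written down on a toy problem where we choose P ourselves; the table below is a made-up example with three input values and two classes:

```python
import numpy as np

# Made-up conditional distribution: X takes values 0, 1, 2 and
# Y takes classes 0, 1.  Row x holds P(Y = 0 | X = x), P(Y = 1 | X = x).
p_y_given_x = np.array([[0.9, 0.1],
                        [0.4, 0.6],
                        [0.2, 0.8]])

def bayes_classifier(x):
    # Assign the class j maximizing P(Y = j | X = x).
    return int(np.argmax(p_y_given_x[x]))

# x = 0 -> class 0; x = 1 or 2 -> class 1.
preds = [bayes_classifier(x) for x in (0, 1, 2)]
```

Real methods (covered in later lectures) approximate this rule by estimating P(Y | X) from training data.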