Lecture 5: Clustering, Linear Regression
Reading: Chapter 10, Sections 3.1-3.2
STATS 202: Data mining and analysis
October 4, 2017
Hierarchical clustering

Most algorithms for hierarchical clustering are agglomerative.

[Figure: nine points in the (X1, X2) plane merged pairwise, step by step, into larger clusters, shown alongside the resulting tree.]

The output of the algorithm is a dendrogram.

We must be careful about how we interpret the dendrogram.
Notion of distance between clusters

At each step, we link the 2 clusters that are closest to each other. Hierarchical clustering algorithms are classified according to the notion of distance between clusters.

- Complete linkage: The distance between 2 clusters is the maximum distance between any pair of samples, one in each cluster.
- Single linkage: The distance between 2 clusters is the minimum distance between any pair of samples, one in each cluster.
- Average linkage: The distance between 2 clusters is the average of all pairwise distances between samples, one in each cluster.
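As a concrete sketch (not from the slides; the two toy clusters below are made up), the three linkage rules can be computed directly in Python:

```python
from itertools import product

# Two hypothetical clusters of 2-D points.
cluster1 = [(0.0, 0.0), (1.0, 0.0)]
cluster2 = [(3.0, 0.0), (4.0, 0.0)]

def dist(a, b):
    """Euclidean distance between two points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

# All pairwise distances, one sample from each cluster.
pairs = [dist(a, b) for a, b in product(cluster1, cluster2)]

complete = max(pairs)              # complete linkage: maximum pairwise distance
single = min(pairs)                # single linkage: minimum pairwise distance
average = sum(pairs) / len(pairs)  # average linkage: mean pairwise distance

print(complete, single, average)  # 4.0 2.0 3.0
```

Whichever rule is used, the pair of clusters minimizing that distance is merged at each step.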
Example

[Figure 10.12: dendrograms obtained on the same data with average, complete, and single linkage.]
Clustering is riddled with questions and choices

- Is clustering appropriate? i.e., could a sample belong to more than one cluster? (Mixture models, soft clustering, and topic models address this.)
- How many clusters are appropriate? Often chosen subjectively, depending on the inference sought. There are also formal methods based on gap statistics, mixture models, etc.
- Are the clusters robust?
  - Run the clustering on different random subsets of the data. Is the structure preserved?
  - Try different clustering algorithms. Are the conclusions consistent?
  - Most important: temper your conclusions.
Clustering is riddled with questions and choices

- Should we scale the variables before doing the clustering? Variables with larger variance have a larger effect on the Euclidean distance between two samples.

                Area in acres   Price in US$   Number of houses
    Property 1       10            450,000            4
    Property 2        5            300,000            1

- Does Euclidean distance capture dissimilarity between samples?
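To see the effect, here is a small Python sketch using the two properties from the table. The scaling step divides each variable by its standard deviation; this is one common choice, not prescribed by the slides:

```python
import numpy as np

# The two properties from the table above: area, price, number of houses.
props = np.array([[10.0, 450_000.0, 4.0],
                  [5.0, 300_000.0, 1.0]])

# Raw Euclidean distance is driven almost entirely by the price variable.
raw = np.linalg.norm(props[0] - props[1])
print(raw)  # ≈ 150000

# After scaling each variable to unit standard deviation, all three
# variables contribute equally to the squared distance.
scaled = (props - props.mean(axis=0)) / props.std(axis=0)
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])
print(scaled_dist)  # ≈ 3.4641
```

Whether scaling is appropriate depends on whether a unit of price should matter as much as a unit of area for the question at hand.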
Correlation distance

Example: Suppose that we want to cluster customers at a store for market segmentation.

- Samples are customers.
- Each variable corresponds to a specific product and measures the number of items bought by the customer during a year.

[Figure: yearly purchase counts of 3 customers (observations 1-3) across 20 products.]
Correlation distance

- Euclidean distance would cluster all customers who purchase few things (orange and purple).
- Perhaps we want to cluster customers who purchase similar things (orange and teal).
- Then, the correlation distance may be a more appropriate measure of dissimilarity between samples.

[Figure: purchase profiles of the 3 customers; two share the same pattern at very different volumes.]
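A small Python sketch with made-up purchase counts illustrates the contrast; `corr_dist` is a hypothetical helper defined as one minus the sample correlation:

```python
import numpy as np

# Hypothetical yearly purchase counts for three customers over 4 products.
light_a = np.array([1.0, 2.0, 1.0, 2.0])      # buys little
light_b = np.array([2.0, 1.0, 2.0, 1.0])      # buys little, opposite pattern
heavy   = np.array([10.0, 20.0, 10.0, 20.0])  # buys a lot, same pattern as light_a

def euclid(x, y):
    return float(np.linalg.norm(x - y))

def corr_dist(x, y):
    """One minus the sample correlation of the two profiles."""
    return 1.0 - float(np.corrcoef(x, y)[0, 1])

# Euclidean distance groups the two light buyers together...
print(euclid(light_a, light_b) < euclid(light_a, heavy))        # True
# ...while correlation distance groups customers with similar patterns.
print(corr_dist(light_a, heavy) < corr_dist(light_a, light_b))  # True
```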
Simple linear regression

y_i = β_0 + β_1 x_i + ε_i,   ε_i ~ N(0, σ²) i.i.d.

[Figure 3.1: Sales regressed on the TV advertising budget.]

The estimates β̂_0 and β̂_1 are chosen to minimize the residual sum of squares (RSS):

RSS = Σ_{i=1}^n (y_i − ŷ_i)²
    = Σ_{i=1}^n (y_i − β̂_0 − β̂_1 x_i)².
Estimates β̂_0 and β̂_1

A little calculus shows that the minimizers of the RSS are:

β̂_1 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²,
β̂_0 = ȳ − β̂_1 x̄.
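These closed-form estimates are easy to compute directly. A Python sketch on made-up data (the slides themselves use R for computation):

```python
# Hypothetical data lying exactly on the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# beta1_hat = sum (x_i - x_bar)(y_i - y_bar) / sum (x_i - x_bar)^2
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
        sum((xi - x_bar) ** 2 for xi in x)
# beta0_hat = y_bar - beta1_hat * x_bar
beta0 = y_bar - beta1 * x_bar

print(beta0, beta1)  # 0.0 2.0
```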
Assessing the accuracy of β̂_0 and β̂_1

The standard errors for the parameters are:

SE(β̂_0)² = σ² [ 1/n + x̄² / Σ_{i=1}^n (x_i − x̄)² ],
SE(β̂_1)² = σ² / Σ_{i=1}^n (x_i − x̄)².

[Figure 3.3: least-squares lines fit to repeated simulated data sets, showing the variability of the fitted line.]

The approximate 95% confidence intervals are:

β̂_0 ± 2 SE(β̂_0),   β̂_1 ± 2 SE(β̂_1).
Hypothesis test

H_0: There is no relationship between X and Y, i.e. β_1 = 0.
H_a: There is some relationship between X and Y, i.e. β_1 ≠ 0.

Test statistic:

t = (β̂_1 − 0) / SE(β̂_1).

Under the null hypothesis, this has a t-distribution with n − 2 degrees of freedom.
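A Python sketch of the test statistic on made-up data, estimating σ² by RSS/(n − 2) as is standard for simple regression:

```python
import math

# Tiny hypothetical data set.
x = [1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Least-squares estimates.
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
beta0 = y_bar - beta1 * x_bar

# Estimate sigma^2 by RSS / (n - 2), then form SE(beta1) and the t-statistic.
rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
sigma2_hat = rss / (n - 2)
se_beta1 = math.sqrt(sigma2_hat / sxx)

t = (beta1 - 0.0) / se_beta1
print(round(t, 4))  # 0.5774
```

Comparing t to the t-distribution with n − 2 degrees of freedom gives the p-value.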
Interpreting the hypothesis test

- If we reject the null hypothesis, can we conclude that there is significant evidence of a linear relationship? No. A quadratic relationship may be a better fit, for example.
- If we don't reject the null hypothesis, can we assume there is no relationship between X and Y? No. This test is only powerful against certain monotone alternatives; there could be more complex non-linear relationships.
Multiple linear regression

Y = β_0 + β_1 X_1 + ⋯ + β_p X_p + ε,   ε ~ N(0, σ²) i.i.d.

[Figure 3.4: least-squares plane fit to Y as a function of X_1 and X_2.]

or, in matrix notation:

E[y] = Xβ,

where y = (y_1, ..., y_n)ᵀ, β = (β_0, ..., β_p)ᵀ, and X is our usual data matrix with an extra column of ones on the left to account for the intercept.
Multiple linear regression answers several questions

- Is at least one of the variables X_i useful for predicting the outcome Y?
- Which subset of the predictors is most important?
- How good is a linear model for these data?
- Given a set of predictor values, what is a likely value for Y, and how accurate is this prediction?
The estimates β̂

Our goal again is to minimize the RSS:

RSS = Σ_{i=1}^n (y_i − ŷ_i)²
    = Σ_{i=1}^n (y_i − β̂_0 − β̂_1 x_{i,1} − ⋯ − β̂_p x_{i,p})².

One can show that this is minimized by the vector β̂:

β̂ = (XᵀX)⁻¹ Xᵀ y.
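The matrix formula can be checked numerically. A minimal Python/NumPy sketch with made-up data (the slides use R; this is only an illustration):

```python
import numpy as np

# Hypothetical data: 4 samples, 2 predictors, exact linear response
# y = 1 + 2*x1 + 3*x2 with no noise.
predictors = np.array([[1.0, 2.0],
                       [2.0, 1.0],
                       [3.0, 4.0],
                       [4.0, 3.0]])
y = 1.0 + 2.0 * predictors[:, 0] + 3.0 * predictors[:, 1]

# Prepend a column of ones to account for the intercept.
X = np.column_stack([np.ones(len(predictors)), predictors])

# beta_hat = (X^T X)^{-1} X^T y, solved as a linear system for stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)  # ≈ [1. 2. 3.]
```

In practice one would use a library routine (e.g. `numpy.linalg.lstsq` or R's `lm`) rather than forming XᵀX explicitly.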
Which variables are important?

Consider the hypothesis:

H_0: The last q predictors have no relation with Y, i.e. β_{p−q+1} = β_{p−q+2} = ⋯ = β_p = 0.

Let RSS_0 be the residual sum of squares for the model which excludes these variables. The F-statistic is defined by:

F = [ (RSS_0 − RSS) / q ] / [ RSS / (n − p − 1) ].

Under the null hypothesis, this has an F-distribution with q and n − p − 1 degrees of freedom.

Example: If q = p, we test whether any of the variables is important. In that case,

RSS_0 = Σ_{i=1}^n (y_i − ȳ)².
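A sketch of the computation with made-up RSS values (the sample size and RSS numbers below are hypothetical):

```python
# F-statistic for dropping the last q of p predictors.
n, p, q = 100, 5, 2
rss_full = 400.0      # RSS of the model with all p predictors
rss_reduced = 430.0   # RSS_0 of the model excluding the last q predictors

# F = [(RSS_0 - RSS)/q] / [RSS/(n - p - 1)]
f_stat = ((rss_reduced - rss_full) / q) / (rss_full / (n - p - 1))
print(round(f_stat, 3))  # 3.525
```

The resulting value is compared to the F-distribution with q and n − p − 1 degrees of freedom.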
Which variables are important?

A multiple linear regression in R has the following output:

[R summary output shown on slide: a coefficient table with estimates, standard errors, t-statistics, and p-values.]
Which variables are important?

- The t-statistic associated with the i-th predictor is the square root of the F-statistic for the null hypothesis which sets only β_i = 0.
- A low p-value indicates that the predictor is important.
- Warning: If there are many predictors, even under the null hypothesis, some of the t-tests will have low p-values just by chance.
How many variables are important?

When we select a subset of the predictors, we have 2^p choices. A way to simplify the choice is to greedily add variables (or to remove them from a baseline model). This creates a sequence of models, from which we can select the best.

- Forward selection: Starting from a null model (the intercept), include variables one at a time, minimizing the RSS at each step.
- Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.
- Mixed selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable rises above a threshold, eliminate that variable.

Choosing one model in the range produced is a form of tuning.
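A minimal Python sketch of forward selection on simulated data (the data and helper below are hypothetical, and a real analysis would also need a stopping rule for choosing where along the path to stop):

```python
import numpy as np

# Simulated data: only predictors 1 and 3 actually enter the response.
rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 1] - 1.0 * X[:, 3] + rng.normal(scale=0.1, size=n)

def rss_of(cols):
    """RSS of the least-squares fit of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

selected, remaining = [], list(range(p))
while remaining:
    # At each step, greedily add the variable that most reduces the RSS.
    best = min(remaining, key=lambda j: rss_of(selected + [j]))
    selected.append(best)
    remaining.remove(best)

print(selected)  # the truly relevant predictors (1 and 3) enter first
```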
How good is the fit?

To assess the fit, we focus on the residuals.

- The RSS always decreases as we add more variables.
- The residual standard error (RSE) corrects this:

  RSE = √( RSS / (n − p − 1) ).

- Visualizing the residuals can reveal phenomena that are not accounted for by the model; e.g. synergies or interactions.

[Figure: residuals of Sales regressed on TV and Radio, suggesting an interaction.]
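For example, with made-up numbers:

```python
import math

# RSE from the formula above (hypothetical values).
n, p = 100, 3     # samples and predictors
rss = 388.0       # hypothetical residual sum of squares

rse = math.sqrt(rss / (n - p - 1))
print(round(rse, 4))  # 2.0104
```

Unlike the RSS, the RSE can increase when an added variable reduces the RSS by less than the lost degree of freedom warrants.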
How good are the predictions?

The function predict in R outputs predictions from a linear model:

- Confidence intervals reflect the uncertainty in β̂.
- Prediction intervals reflect the uncertainty in β̂, as well as the irreducible error ε.
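The contrast can be made concrete with the standard simple-regression formulas for the two standard errors (these formulas come from the textbook treatment, not from this slide; the data are made up and σ² is pretended known for the sketch):

```python
import math

# Hypothetical design points.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

sigma2 = 1.0   # pretend the error variance is known
x0 = 3.0       # point at which we predict

# Standard error of the fitted mean at x0 (drives the confidence interval).
se_mean = math.sqrt(sigma2 * (1 / n + (x0 - x_bar) ** 2 / sxx))
# The prediction interval also absorbs the irreducible error sigma^2.
se_pred = math.sqrt(sigma2 * (1 + 1 / n + (x0 - x_bar) ** 2 / sxx))

print(se_mean < se_pred)  # True: prediction intervals are always wider
```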