Lecture 5: Clustering, Linear Regression

Lecture 5: Clustering, Linear Regression Reading: Chapter 10, Sections 3.1-3.2 STATS 202: Data mining and analysis October 4, 2017 1 / 22

Hierarchical clustering Most algorithms for hierarchical clustering are agglomerative. [Figure: scatter plots of observations 1-9 in the (X1, X2) plane, merged step by step, alongside the resulting dendrogram.] The output of the algorithm is a dendrogram. We must be careful about how we interpret the dendrogram. 2 / 22

Notion of distance between clusters At each step, we link the 2 clusters that are closest to each other. Hierarchical clustering algorithms are classified according to the notion of distance between clusters. Complete linkage: The distance between 2 clusters is the maximum distance between any pair of samples, one in each cluster. 3 / 22

Notion of distance between clusters At each step, we link the 2 clusters that are closest to each other. Hierarchical clustering algorithms are classified according to the notion of distance between clusters. Single linkage: The distance between 2 clusters is the minimum distance between any pair of samples, one in each cluster. 3 / 22

Notion of distance between clusters At each step, we link the 2 clusters that are closest to each other. Hierarchical clustering algorithms are classified according to the notion of distance between clusters. Average linkage: The distance between 2 clusters is the average of all pairwise distances. 3 / 22

Example [Figure 10.12: dendrograms obtained with average, complete, and single linkage on the same data.] 4 / 22
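
The linkage choice can be explored directly in R with hclust(). A minimal sketch on a small synthetic data set (the data behind Figure 10.12 are not reproduced here, so the dendrograms will differ):

```r
# Hierarchical clustering with the three linkage rules discussed above.
set.seed(1)
x <- matrix(rnorm(9 * 2), ncol = 2)   # 9 synthetic observations in 2 dimensions

d <- dist(x)                          # Euclidean distances between observations

hc_average  <- hclust(d, method = "average")    # mean of all pairwise distances
hc_complete <- hclust(d, method = "complete")   # maximum pairwise distance
hc_single   <- hclust(d, method = "single")     # minimum pairwise distance

par(mfrow = c(1, 3))
plot(hc_average,  main = "Average Linkage")
plot(hc_complete, main = "Complete Linkage")
plot(hc_single,   main = "Single Linkage")
```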

Clustering is riddled with questions and choices Is clustering appropriate? i.e. Could a sample belong to more than one cluster? Mixture models, soft clustering, topic models. 5 / 22

Clustering is riddled with questions and choices Is clustering appropriate? i.e. Could a sample belong to more than one cluster? Mixture models, soft clustering, topic models. How many clusters are appropriate? Chosen subjectively; it depends on the inference sought. There are formal methods based on gap statistics, mixture models, etc. 5 / 22

Clustering is riddled with questions and choices Is clustering appropriate? i.e. Could a sample belong to more than one cluster? Mixture models, soft clustering, topic models. How many clusters are appropriate? Chosen subjectively; it depends on the inference sought. There are formal methods based on gap statistics, mixture models, etc. Are the clusters robust? Run the clustering on different random subsets of the data. Is the structure preserved? Try different clustering algorithms. Are the conclusions consistent? Most important: temper your conclusions. 5 / 22

Clustering is riddled with questions and choices Should we scale the variables before doing the clustering? 6 / 22

Clustering is riddled with questions and choices Should we scale the variables before doing the clustering? Variables with larger variance have a larger effect on the Euclidean distance between two samples.
(Area in acres, Price in US$, Number of houses)
Property 1: (10, 450,000, 4)
Property 2: (5, 300,000, 1)
6 / 22

Clustering is riddled with questions and choices Should we scale the variables before doing the clustering? Variables with larger variance have a larger effect on the Euclidean distance between two samples.
(Area in acres, Price in US$, Number of houses)
Property 1: (10, 450,000, 4)
Property 2: (5, 300,000, 1)
Does Euclidean distance capture dissimilarity between samples? 6 / 22
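
A minimal R sketch of the scaling issue, using the two hypothetical properties above: without scaling, the price column dominates the Euclidean distance; scale() puts each variable on a comparable footing first.

```r
# Two properties described by (area in acres, price in US$, number of houses).
props <- rbind(c(10, 450000, 4),
               c(5,  300000, 1))
colnames(props) <- c("area", "price", "houses")

dist(props)          # distance driven almost entirely by the price variable
dist(scale(props))   # each column centered and scaled to unit variance first
```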

Correlation distance Example: Suppose that we want to cluster customers at a store for market segmentation. Samples are customers. Each variable corresponds to a specific product and measures the number of items bought by the customer during a year. [Figure: items purchased per product (Variable Index) for Observations 1, 2, and 3.] 7 / 22

Correlation distance Euclidean distance would cluster all customers who purchase few things (orange and purple). 8 / 22

Correlation distance Euclidean distance would cluster all customers who purchase few things (orange and purple). Perhaps we want to cluster customers who purchase similar things (orange and teal). 8 / 22

Correlation distance Euclidean distance would cluster all customers who purchase few things (orange and purple). Perhaps we want to cluster customers who purchase similar things (orange and teal). Then, the correlation distance may be a more appropriate measure of dissimilarity between samples. 8 / 22
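
One common way to build such a dissimilarity in R is shown below; this is a generic sketch (the purchase matrix is simulated, not the data in the figure).

```r
# Correlation-based dissimilarity between customers.
# Rows = customers, columns = products (items bought during a year).
set.seed(1)
purchases <- matrix(rpois(3 * 20, lambda = 5), nrow = 3)

# cor() correlates columns, so transpose to compare customers (rows).
d_cor <- as.dist(1 - cor(t(purchases)))

# Ordinary Euclidean distance between the same customers, for comparison.
d_euc <- dist(purchases)

hclust(d_cor)   # groups customers with similar purchase patterns
hclust(d_euc)   # groups customers with similar purchase volumes
```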

Simple linear regression $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma)$ i.i.d. [Figure 3.1: Sales plotted against TV.] 9 / 22

Simple linear regression $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma)$ i.i.d. [Figure 3.1: Sales plotted against TV.] The estimates $\hat\beta_0$ and $\hat\beta_1$ are chosen to minimize the residual sum of squares (RSS): $\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2$. 9 / 22

Simple linear regression $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma)$ i.i.d. [Figure 3.1: Sales plotted against TV.] The estimates $\hat\beta_0$ and $\hat\beta_1$ are chosen to minimize the residual sum of squares (RSS): $\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$. 9 / 22

Estimates $\hat\beta_0$ and $\hat\beta_1$ A little calculus shows that the minimizers of the RSS are: $\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}$, $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. 10 / 22
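
These closed-form estimates can be checked against lm() in R; a minimal sketch on simulated data (the lecture's Advertising example is not reproduced here):

```r
# Closed-form least squares estimates vs. lm() on simulated data.
set.seed(1)
x <- runif(100, 0, 300)
y <- 7 + 0.05 * x + rnorm(100, sd = 3)

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))          # agrees with the formulas above
```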

Assessing the accuracy of $\hat\beta_0$ and $\hat\beta_1$ The standard errors for the parameters are: $\mathrm{SE}(\hat\beta_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right]$, $\mathrm{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}$. [Figure 3.3: Y plotted against X.] 11 / 22

Assessing the accuracy of $\hat\beta_0$ and $\hat\beta_1$ The standard errors for the parameters are: $\mathrm{SE}(\hat\beta_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right]$, $\mathrm{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}$. [Figure 3.3: Y plotted against X.] The 95% confidence intervals: $\hat\beta_0 \pm 2\,\mathrm{SE}(\hat\beta_0)$, $\hat\beta_1 \pm 2\,\mathrm{SE}(\hat\beta_1)$. 11 / 22
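
In R, these standard errors and intervals are reported by summary() and confint(); a minimal sketch, continuing the simulated x and y from the previous block:

```r
fit <- lm(y ~ x)

summary(fit)$coefficients    # estimates, standard errors, t-values, p-values
confint(fit, level = 0.95)   # exact t-based 95% confidence intervals
                             # (the +/- 2 SE rule above is an approximation)
```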

Hypothesis test $H_0$: There is no relationship between X and Y. $H_a$: There is some relationship between X and Y. 12 / 22

Hypothesis test $H_0: \beta_1 = 0$. $H_a: \beta_1 \neq 0$. 12 / 22

Hypothesis test $H_0: \beta_1 = 0$. $H_a: \beta_1 \neq 0$. Test statistic: $t = \frac{\hat\beta_1 - 0}{\mathrm{SE}(\hat\beta_1)}$. Under the null hypothesis, this has a t-distribution with $n - 2$ degrees of freedom. 12 / 22

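A minimal sketch of this t-test, continuing the simulated fit from above (the slides' own numbers come from the Advertising data, which are not reproduced here):

```r
s <- summary(fit)$coefficients
t_stat <- s["x", "Estimate"] / s["x", "Std. Error"]   # (beta1_hat - 0) / SE(beta1_hat)

n <- length(y)
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(t_stat, p_value)   # matches the t value and Pr(>|t|) reported by summary(fit)
```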

Interpreting the hypothesis test If we reject the null hypothesis, can we conclude that there is significant evidence of a linear relationship? 13 / 22

Interpreting the hypothesis test If we reject the null hypothesis, can we conclude that there is significant evidence of a linear relationship? No. A quadratic relationship may be a better fit, for example. 13 / 22

Interpreting the hypothesis test If we reject the null hypothesis, can we conclude that there is significant evidence of a linear relationship? No. A quadratic relationship may be a better fit, for example. If we don't reject the null hypothesis, can we assume there is no relationship between X and Y? 13 / 22

Interpreting the hypothesis test If we reject the null hypothesis, can we conclude that there is significant evidence of a linear relationship? No. A quadratic relationship may be a better fit, for example. If we don't reject the null hypothesis, can we assume there is no relationship between X and Y? No. This test is only powerful against certain monotone alternatives. There could be more complex non-linear relationships. 13 / 22

Multiple linear regression $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$, $\varepsilon \sim N(0, \sigma)$ i.i.d. [Figure 3.4: Y as a function of X1 and X2.] 14 / 22

Multiple linear regression $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$, $\varepsilon \sim N(0, \sigma)$ i.i.d. [Figure 3.4: Y as a function of X1 and X2.] or, in matrix notation: $\mathrm{E}\,y = X\beta$, where $y = (y_1, \ldots, y_n)^T$, $\beta = (\beta_0, \ldots, \beta_p)^T$, and $X$ is our usual data matrix with an extra column of ones on the left to account for the intercept. 14 / 22

Multiple linear regression answers several questions Is at least one of the variables $X_i$ useful for predicting the outcome Y? 15 / 22

Multiple linear regression answers several questions Is at least one of the variables $X_i$ useful for predicting the outcome Y? Which subset of the predictors is most important? 15 / 22

Multiple linear regression answers several questions Is at least one of the variables $X_i$ useful for predicting the outcome Y? Which subset of the predictors is most important? How good is a linear model for these data? 15 / 22

Multiple linear regression answers several questions Is at least one of the variables $X_i$ useful for predicting the outcome Y? Which subset of the predictors is most important? How good is a linear model for these data? Given a set of predictor values, what is a likely value for Y, and how accurate is this prediction? 15 / 22

The estimates $\hat\beta$ Our goal again is to minimize the RSS: $\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2$ 16 / 22

The estimates $\hat\beta$ Our goal again is to minimize the RSS: $\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i,1} - \cdots - \hat\beta_p x_{i,p})^2$. 16 / 22

The estimates $\hat\beta$ Our goal again is to minimize the RSS: $\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i,1} - \cdots - \hat\beta_p x_{i,p})^2$. One can show that this is minimized by the vector $\hat\beta$: $\hat\beta = (X^T X)^{-1} X^T y$. 16 / 22
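
A minimal R sketch of this closed-form solution on simulated predictors (the variable names x1, x2, x3 are assumptions, not the lecture's data); model.matrix() supplies the extra column of ones:

```r
set.seed(1)
n <- 100
X_raw <- matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, c("x1", "x2", "x3")))
y <- drop(2 + X_raw %*% c(1, -0.5, 0) + rnorm(n))
dat <- data.frame(y = y, X_raw)

X <- model.matrix(y ~ x1 + x2 + x3, data = dat)    # data matrix with intercept column

beta_hat <- solve(t(X) %*% X, t(X) %*% y)          # (X^T X)^{-1} X^T y
cbind(beta_hat, coef(lm(y ~ x1 + x2 + x3, data = dat)))   # identical up to rounding
```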

Which variables are important? Consider the hypothesis: $H_0$: The last q predictors have no relation with Y. 17 / 22

Which variables are important? Consider the hypothesis: $H_0: \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0$. 17 / 22

Which variables are important? Consider the hypothesis: $H_0: \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0$. Let $\mathrm{RSS}_0$ be the residual sum of squares for the model which excludes these variables. 17 / 22

Which variables are important? Consider the hypothesis: $H_0: \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0$. Let $\mathrm{RSS}_0$ be the residual sum of squares for the model which excludes these variables. The F-statistic is defined by: $F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n - p - 1)}$. 17 / 22

Which variables are important? Consider the hypothesis: $H_0: \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0$. Let $\mathrm{RSS}_0$ be the residual sum of squares for the model which excludes these variables. The F-statistic is defined by: $F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n - p - 1)}$. Under the null hypothesis, this has an F-distribution. 17 / 22

Which variables are important? Consider the hypothesis: $H_0: \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0$. Let $\mathrm{RSS}_0$ be the residual sum of squares for the model which excludes these variables. The F-statistic is defined by: $F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n - p - 1)}$. Under the null hypothesis, this has an F-distribution. Example: If q = p, we test whether any of the variables is important. In that case, $\mathrm{RSS}_0 = \sum_{i=1}^n (y_i - \bar y)^2$. 17 / 22
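
In R, this F-test for a group of coefficients can be carried out by comparing nested fits with anova(); a minimal sketch, continuing the simulated data frame dat from the previous block:

```r
fit_full    <- lm(y ~ x1 + x2 + x3, data = dat)   # full model, p = 3 predictors
fit_reduced <- lm(y ~ x1, data = dat)             # drops the last q = 2 predictors

anova(fit_reduced, fit_full)   # F = ((RSS0 - RSS)/q) / (RSS/(n - p - 1))

# With q = p (reduced model = intercept only), this reproduces the overall
# F-statistic printed at the bottom of summary(fit_full).
anova(lm(y ~ 1, data = dat), fit_full)
```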

Which variables are important? A multiple linear regression in R has the following output: [Table of R regression output not reproduced in this transcript.] 18 / 22

Which variables are important? The t-statistic associated with the $i$th predictor is the square root of the F-statistic for the null hypothesis which sets only $\beta_i = 0$. 19 / 22

Which variables are important? The t-statistic associated with the $i$th predictor is the square root of the F-statistic for the null hypothesis which sets only $\beta_i = 0$. A low p-value indicates that the predictor is important. 19 / 22

Which variables are important? The t-statistic associated with the $i$th predictor is the square root of the F-statistic for the null hypothesis which sets only $\beta_i = 0$. A low p-value indicates that the predictor is important. Warning: If there are many predictors, even under the null hypothesis, some of the t-tests will have low p-values just by chance. 19 / 22
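
A small simulation illustrates that warning (purely illustrative, not from the slides): with 100 pure-noise predictors, a handful of t-tests come out "significant" at the 5% level by chance alone.

```r
set.seed(1)
n <- 200
X_noise <- matrix(rnorm(n * 100), n, 100)
y_noise <- rnorm(n)                       # response is independent of every predictor

fit_noise <- lm(y_noise ~ X_noise)
p_values  <- summary(fit_noise)$coefficients[-1, "Pr(>|t|)"]   # drop the intercept

sum(p_values < 0.05)   # typically around 5 predictors look "significant" by chance
```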

How many variables are important? When we select a subset of the predictors, we have $2^p$ choices. 20 / 22

How many variables are important? When we select a subset of the predictors, we have $2^p$ choices. A way to simplify the choice is to greedily add variables (or to remove them from a baseline model). This creates a sequence of models, from which we can select the best. 20 / 22

How many variables are important? When we select a subset of the predictors, we have $2^p$ choices. A way to simplify the choice is to greedily add variables (or to remove them from a baseline model). This creates a sequence of models, from which we can select the best. Forward selection: Starting from a null model (the intercept), include variables one at a time, minimizing the RSS at each step. 20 / 22

How many variables are important? When we select a subset of the predictors, we have $2^p$ choices. A way to simplify the choice is to greedily add variables (or to remove them from a baseline model). This creates a sequence of models, from which we can select the best. Forward selection: Starting from a null model (the intercept), include variables one at a time, minimizing the RSS at each step. Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step. 20 / 22

How many variables are important? When we select a subset of the predictors, we have $2^p$ choices. A way to simplify the choice is to greedily add variables (or to remove them from a baseline model). This creates a sequence of models, from which we can select the best. Forward selection: Starting from a null model (the intercept), include variables one at a time, minimizing the RSS at each step. Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step. Mixed selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable. 20 / 22

How many variables are important? When we select a subset of the predictors, we have $2^p$ choices. A way to simplify the choice is to greedily add variables (or to remove them from a baseline model). This creates a sequence of models, from which we can select the best. Forward selection: Starting from a null model (the intercept), include variables one at a time, minimizing the RSS at each step. Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step. Mixed selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable. Choosing one model in the range produced is a form of tuning. 20 / 22
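
A minimal sketch of greedy selection in R with step(), continuing the simulated fit_full from above; note that step() ranks candidate moves by AIC rather than raw RSS or p-values, so it is a close cousin of, not identical to, the procedures described above:

```r
null_model <- lm(y ~ 1, data = dat)                # intercept-only starting point

# Forward selection: add one variable at a time, starting from the null model.
forward_fit <- step(null_model, scope = formula(fit_full),
                    direction = "forward", trace = 0)

# Backward selection: drop one variable at a time, starting from the full model.
backward_fit <- step(fit_full, direction = "backward", trace = 0)

# direction = "both" alternates adding and dropping, in the spirit of mixed selection.
mixed_fit <- step(null_model, scope = formula(fit_full),
                  direction = "both", trace = 0)
```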

How good is the fit? To assess the fit, we focus on the residuals. 21 / 22

How good is the fit? To assess the fit, we focus on the residuals. The RSS always decreases as we add more variables. 21 / 22

How good is the fit? To assess the fit, we focus on the residuals. The RSS always decreases as we add more variables. The residual standard error (RSE) corrects this: $\mathrm{RSE} = \sqrt{\frac{1}{n - p - 1}\,\mathrm{RSS}}$. 21 / 22

How good is the fit? To assess the fit, we focus on the residuals. The RSS always decreases as we add more variables. The residual standard error (RSE) corrects this: $\mathrm{RSE} = \sqrt{\frac{1}{n - p - 1}\,\mathrm{RSS}}$. Visualizing the residuals can reveal phenomena that are not accounted for by the model; e.g., synergies or interactions. [Figure: Sales as a function of TV and Radio.] 21 / 22
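
A minimal sketch of the RSE and a residual plot in R, continuing fit_full and dat from above (sigma() reports exactly this residual standard error):

```r
rss <- sum(residuals(fit_full)^2)
p   <- length(coef(fit_full)) - 1                 # number of predictors
rse <- sqrt(rss / (nrow(dat) - p - 1))

c(rse, sigma(fit_full))                           # the two values agree

plot(fitted(fit_full), residuals(fit_full))       # residual plot used to assess the fit
```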

How good are the predictions? The function predict in R outputs predictions from a linear model. Confidence intervals reflect the uncertainty on $\hat\beta$. 22 / 22

How good are the predictions? The function predict in R outputs predictions from a linear model. Confidence intervals reflect the uncertainty on $\hat\beta$. Prediction intervals reflect uncertainty on $\hat\beta$ and the irreducible error $\varepsilon$ as well. 22 / 22
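
A minimal sketch of the distinction, continuing fit_full from above (the new predictor values are arbitrary):

```r
new_obs <- data.frame(x1 = 0.5, x2 = -1, x3 = 0)

predict(fit_full, newdata = new_obs, interval = "confidence")   # uncertainty in beta_hat only
predict(fit_full, newdata = new_obs, interval = "prediction")   # also includes the irreducible
                                                                # error; always wider
```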