
Lecture 5: Clustering, Linear Regression

Reading: Chapter 10, Sections 3.1-2

STATS 202: Data mining and analysis
Sergio Bacallado
September 19, 2018

Announcements

- Starting next week, Julia Fukuyama will be having her office hours after business hours for SCPD students (through Skype). The times will be updated on the website as usual.
- Probability problems: Homework 2 will go out this afternoon. It's a long one.

Hierarchical clustering

Most algorithms for hierarchical clustering are agglomerative.

[Figure: nine samples plotted in the (X1, X2) plane, and the dendrogram produced by agglomerative clustering.]

The output of the algorithm is a dendrogram. We must be careful about how we interpret the dendrogram.

Notion of distance between clusters

At each step, we link the 2 clusters that are closest to each other. Hierarchical clustering algorithms are classified according to the notion of distance between clusters (a sketch in R follows the list):

- Complete linkage: The distance between 2 clusters is the maximum distance between any pair of samples, one in each cluster.
- Single linkage: The distance between 2 clusters is the minimum distance between any pair of samples, one in each cluster.
- Average linkage: The distance between 2 clusters is the average of all pairwise distances.
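These linkages correspond to the method argument of hclust in base R. A minimal sketch, using a small hypothetical data matrix:

```r
# Hypothetical data: 9 points in 2 dimensions
set.seed(1)
X <- matrix(rnorm(18), ncol = 2, dimnames = list(NULL, c("X1", "X2")))

d <- dist(X)  # Euclidean distances between samples

hc_complete <- hclust(d, method = "complete")  # maximum pairwise distance
hc_single   <- hclust(d, method = "single")    # minimum pairwise distance
hc_average  <- hclust(d, method = "average")   # mean of all pairwise distances

plot(hc_complete)  # draw the dendrogram
```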

Example

[Figure 10.12: dendrograms for the same data under average, complete, and single linkage.]

Clustering is riddled with questions and choices

- Is clustering appropriate? I.e., could a sample belong to more than one cluster? (Mixture models, soft clustering, topic models.)
- How many clusters are appropriate? Choose subjectively; it depends on the inference sought. There are formal methods based on gap statistics, mixture models, etc.
- Are the clusters robust?
  - Run the clustering on different random subsets of the data. Is the structure preserved?
  - Try different clustering algorithms. Are the conclusions consistent?
- Most important: temper your conclusions.

Clustering is riddled with questions and choices

- Should we scale the variables before doing the clustering? Variables with larger variance have a larger effect on the Euclidean distance between two samples (see the sketch after this list).

                Area in acres   Price in US$   Number of houses
    Property 1       10            450,000            4
    Property 2        5            300,000            1

- Does Euclidean distance capture dissimilarity between samples?
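A minimal sketch of why scaling matters, in R, using the two hypothetical properties above: the price column dominates the Euclidean distance unless the variables are standardized with scale().

```r
# Two hypothetical properties: area (acres), price (US$), number of houses
properties <- rbind(c(10, 450000, 4),
                    c( 5, 300000, 1))
colnames(properties) <- c("area", "price", "houses")

dist(properties)         # dominated by the price column
dist(scale(properties))  # each variable centered and scaled to unit variance
```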

Correlation distance

Example: Suppose that we want to cluster customers at a store for market segmentation.

- Samples are customers.
- Each variable corresponds to a specific product and measures the number of items bought by the customer during a year.

[Figure: purchase counts of Observations 1, 2, and 3 plotted against the variable index.]

Correlation distance

Euclidean distance would cluster all customers who purchase few things (orange and purple). Perhaps we want to cluster customers who purchase similar things (orange and teal). Then, the correlation distance may be a more appropriate measure of dissimilarity between samples.

[Figure: the same three observations plotted against the variable index.]
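A minimal sketch of a correlation-based dissimilarity in R, assuming a samples-by-variables matrix of purchase counts (the data here are made up); cor() works on columns, so the matrix is transposed to correlate samples.

```r
# Hypothetical purchase counts: 3 customers (rows) x 20 products (columns)
set.seed(2)
purchases <- matrix(rpois(60, lambda = 5), nrow = 3)

# Correlation between samples, converted to a dissimilarity
d_cor <- as.dist(1 - cor(t(purchases)))

hclust(d_cor, method = "average")  # cluster on correlation distance
```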

Simple linear regression

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma) \ \text{i.i.d.} \]

[Figure 3.1: Sales versus TV advertising, with the least squares fit.]

The estimates $\hat\beta_0$ and $\hat\beta_1$ are chosen to minimize the residual sum of squares (RSS):

\[ \mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2. \]

Estimates $\hat\beta_0$ and $\hat\beta_1$

A little calculus shows that the minimizers of the RSS are:

\[ \hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x. \]
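A minimal sketch checking these formulas in R against lm(); the data are simulated and the coefficient values are arbitrary.

```r
set.seed(3)
x <- runif(100, 0, 300)
y <- 7 + 0.05 * x + rnorm(100, sd = 2)

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))  # should agree with the closed-form estimates
```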

Assessing the accuracy of $\hat\beta_0$ and $\hat\beta_1$

[Figure 3.3: simulated data sets with the population regression line and the corresponding least squares fits.]

The standard errors of the parameters are:

\[ \mathrm{SE}(\hat\beta_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right], \qquad \mathrm{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}. \]

The 95% confidence intervals:

\[ \hat\beta_0 \pm 2\,\mathrm{SE}(\hat\beta_0), \qquad \hat\beta_1 \pm 2\,\mathrm{SE}(\hat\beta_1). \]
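In R, the standard errors and confidence intervals come straight from the fitted model; a minimal sketch continuing the simulated example above.

```r
fit <- lm(y ~ x)

summary(fit)$coefficients[, "Std. Error"]  # SE(beta0_hat) and SE(beta1_hat)
confint(fit, level = 0.95)                 # roughly beta_hat +/- 2 * SE
```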

Hypothesis test

H_0: There is no relationship between X and Y.
H_a: There is some relationship between X and Y.

Equivalently,

H_0: $\beta_1 = 0$.
H_a: $\beta_1 \neq 0$.

Test statistic:

\[ t = \frac{\hat\beta_1 - 0}{\mathrm{SE}(\hat\beta_1)}. \]

Under the null hypothesis, this has a t-distribution with n - 2 degrees of freedom.
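summary() of a fitted lm reports exactly this t-test; a minimal sketch reusing the hypothetical fit above.

```r
# "t value" is Estimate / Std. Error; "Pr(>|t|)" is the two-sided p-value
summary(fit)$coefficients["x", ]

# Equivalent manual computation for beta1
tstat <- coef(fit)["x"] / summary(fit)$coefficients["x", "Std. Error"]
2 * pt(abs(tstat), df = fit$df.residual, lower.tail = FALSE)  # df = n - 2
```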

Interpreting the hypothesis test

- If we reject the null hypothesis, can we assume there is a linear relationship? No. A quadratic relationship may be a better fit, for example.
- If we don't reject the null hypothesis, can we assume there is no relationship between X and Y? No. This test is only powerful against certain monotone alternatives; there could be more complex non-linear relationships.

Multiple linear regression

\[ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon, \qquad \varepsilon \sim N(0, \sigma) \ \text{i.i.d.} \]

or, in matrix notation:

\[ \mathrm{E}\,\mathbf{y} = \mathbf{X}\boldsymbol\beta, \]

where $\mathbf{y} = (y_1, \dots, y_n)^T$, $\boldsymbol\beta = (\beta_0, \dots, \beta_p)^T$ and $\mathbf{X}$ is our usual data matrix with an extra column of ones on the left to account for the intercept.

[Figure 3.4: with two predictors X1 and X2, the least squares fit is a plane.]

Multiple linear regression answers several questions

- Is at least one of the variables $X_i$ useful for predicting the outcome Y?
- Which subset of the predictors is most important?
- How good is a linear model for these data?
- Given a set of predictor values, what is a likely value for Y, and how accurate is this prediction?

The estimates $\hat\beta$

Our goal again is to minimize the RSS:

\[ \mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i,1} - \dots - \beta_p x_{i,p})^2. \]

One can show that this is minimized by the vector $\hat\beta$:

\[ \hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}. \]
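A minimal sketch of this matrix formula in R with simulated data (all names are hypothetical); model.matrix() adds the column of ones for the intercept, and the result matches coef(lm()).

```r
set.seed(4)
n  <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
y  <- 1 + 2 * X1 - 0.5 * X2 + rnorm(n)

X        <- model.matrix(~ X1 + X2)       # n x (p + 1) matrix, first column all ones
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

cbind(beta_hat, coef(lm(y ~ X1 + X2)))    # the two columns should agree
```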

Which variables are important?

Consider the hypothesis:

H_0: The last q predictors have no relation with Y, i.e. $\beta_{p-q+1} = \beta_{p-q+2} = \dots = \beta_p = 0$.

Let RSS_0 be the residual sum of squares for the model which excludes these variables. The F-statistic is defined by:

\[ F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n - p - 1)}. \]

Under the null hypothesis, this has an F-distribution.

Example: If q = p, we test whether any of the variables is important. In that case,

\[ \mathrm{RSS}_0 = \sum_{i=1}^n (y_i - \bar y)^2. \]
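In R, this F-test is the comparison of two nested fits with anova(); a minimal sketch reusing the simulated predictors above.

```r
fit_full    <- lm(y ~ X1 + X2)  # full model
fit_reduced <- lm(y ~ X1)       # drops the last q = 1 predictor

anova(fit_reduced, fit_full)    # F = ((RSS0 - RSS)/q) / (RSS/(n - p - 1))

# The q = p case: compare against the intercept-only model
anova(lm(y ~ 1), fit_full)
```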

Which variables are important?

A multiple linear regression in R has the following output:

[R regression summary output shown on the slide.]
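That output comes from summary() of a fitted model. A minimal sketch, assuming the slide uses the textbook's Advertising data (the file name and variable names are assumptions):

```r
Advertising <- read.csv("Advertising.csv")  # hypothetical path to the data

adv_fit <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)
summary(adv_fit)  # coefficient table (estimate, SE, t value, p-value),
                  # residual standard error, R-squared, and F-statistic
```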

Which variables are important?

The t-statistic associated with the i-th predictor is the square root of the F-statistic for the null hypothesis which sets only $\beta_i = 0$.

A low p-value indicates that the predictor is important.

Warning: If there are many predictors, even under the null hypothesis, some of the t-tests will have low p-values.

How many variables are important?

When we select a subset of the predictors, we have $2^p$ choices. A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.

- Forward selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step.
- Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.
- Mixed selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.

Choosing one model in the range produced is a form of tuning; a sketch of stepwise selection in R follows below.
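R's step() performs a closely related stepwise search, scored by AIC rather than by raw RSS or p-values; a minimal sketch reusing the hypothetical simulated predictors above.

```r
null_model <- lm(y ~ 1)
full_model <- lm(y ~ X1 + X2)

# Forward selection: grow from the null model toward the full model
step(null_model, scope = formula(full_model), direction = "forward")

# Backward selection: shrink from the full model
step(full_model, direction = "backward")
```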

How good is the fit?

To assess the fit, we focus on the residuals. The RSS always decreases as we add more variables. The residual standard error (RSE) corrects for this:

\[ \mathrm{RSE} = \sqrt{\frac{1}{n - p - 1}\,\mathrm{RSS}}. \]

Visualizing the residuals can reveal phenomena that are not accounted for by the model, e.g. synergies or interactions.

[Figure: Sales as a function of TV and Radio advertising.]
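In R, the RSE is reported by summary() and returned by sigma(); a minimal sketch with the hypothetical two-predictor fit above.

```r
fit_full <- lm(y ~ X1 + X2)

sigma(fit_full)                                          # residual standard error (RSE)
sqrt(sum(residuals(fit_full)^2) / fit_full$df.residual)  # same value by hand; df = n - p - 1
```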

How good are the predictions?

The function predict in R outputs predictions from a linear model:

- Confidence intervals reflect the uncertainty on $\hat\beta$.
- Prediction intervals reflect uncertainty on $\hat\beta$ and the irreducible error $\varepsilon$ as well.
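A minimal sketch of both interval types, reusing the hypothetical two-predictor fit; the new predictor values are made up.

```r
fit_full <- lm(y ~ X1 + X2)
new_obs  <- data.frame(X1 = 0.5, X2 = -1)

predict(fit_full, newdata = new_obs, interval = "confidence")  # uncertainty in beta_hat only
predict(fit_full, newdata = new_obs, interval = "prediction")  # also adds the irreducible error
```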