Simple Linear Regression (single variable)

Simple Linear Regression (single variable). Introduction to Machine Learning. Marek Petrik, January 31, 2017. Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Last Class 1. Basic machine learning framework: $Y = f(X)$ 2. Prediction vs. inference: predict Y vs. understand f 3. Parametric vs. non-parametric: linear regression vs. k-NN 4. Classification vs. regression: k-NN vs. linear regression 5. Why we need to have a test set: overfitting

What is Machine Learning? Discover an unknown function f: X = set of features, or inputs; Y = target, or response; $Y = f(X)$. (Figure: Sales plotted against TV, Radio, and Newspaper advertising budgets.) Sales = f(TV, Radio, Newspaper)

Errors in Machine Learning: World is Noisy. The world is too complex to model precisely. Many features are not captured in data sets. Need to allow for errors $\varepsilon$ in f: $Y = f(X) + \varepsilon$

How Good are Predictions? Learned function $\hat f$. Test data: $(x_1, y_1), (x_2, y_2), \ldots$ Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat f(x_i))^2$. This is the estimate of: $\mathrm{MSE} = \mathbb{E}\big[(Y - \hat f(X))^2\big] = \frac{1}{|\Omega|} \sum_{\omega \in \Omega} \big(Y(\omega) - \hat f(X(\omega))\big)^2$. Important: Samples $x_i$ are i.i.d.
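As a concrete illustration (not part of the original slides), here is a minimal R sketch of estimating test MSE; the `ads` data frame and its coefficients are simulated stand-ins for the Advertising data used in the lecture:

```r
# Minimal sketch: test MSE of a learned model on held-out data.
# The data below are simulated; they only stand in for the Advertising set.
set.seed(1)
n   <- 200
ads <- data.frame(TV = runif(n, 0, 300))
ads$Sales <- 7 + 0.05 * ads$TV + rnorm(n, sd = 2)

train <- sample(n, n / 2)
fit   <- lm(Sales ~ TV, data = ads, subset = train)   # learned f-hat

test <- ads[-train, ]
mean((test$Sales - predict(fit, newdata = test))^2)   # test MSE estimate
```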

Do We Need Test Data? Why not just test on the training data? (Figure: left panel, data Y vs. X with fits of varying flexibility; right panel, Mean Squared Error vs. Flexibility.) Flexibility is the degree of the polynomial being fit. Gray line: training error, red line: testing error.
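A minimal R sketch of the same idea, assuming simulated data rather than the figure's: training MSE keeps falling as the polynomial degree grows, while test MSE eventually rises.

```r
# Minimal sketch: training vs. test MSE as flexibility (polynomial degree) grows.
set.seed(2)
x <- runif(200, 0, 100)
y <- 5 * sin(x / 15) + 7 + rnorm(200)
train <- sample(200, 100)

degrees <- 1:10
errs <- t(sapply(degrees, function(d) {
  fit  <- lm(y ~ poly(x, d), subset = train)
  pred <- predict(fit, newdata = data.frame(x = x))
  c(train = mean((y[train] - pred[train])^2),
    test  = mean((y[-train] - pred[-train])^2))
}))
cbind(degree = degrees, errs)   # training error only decreases; test error turns back up
```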

Types of Function f. Regression: continuous target, $f : \mathcal{X} \to \mathbb{R}$. (Figure: Income as a function of Years of Education and Seniority.) Classification: discrete target, $f : \mathcal{X} \to \{1, 2, 3, \ldots, k\}$. (Figure: two-class regions in the $X_1$-$X_2$ plane.)

Bias-Variance Decomposition. $Y = f(X) + \varepsilon$. Mean Squared Error can be decomposed as: $\mathrm{MSE} = \mathbb{E}\big[(Y - \hat f(X))^2\big] = \underbrace{\mathrm{Var}(\hat f(X))}_{\text{Variance}} + \underbrace{\big(\mathbb{E}[\hat f(X)] - f(X)\big)^2}_{\text{Bias}^2} + \mathrm{Var}(\varepsilon)$. Bias: how well the method would work with infinite data. Variance: how much the output changes with different data sets.
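A small simulation can make the decomposition concrete. This sketch (my addition, with an assumed true function and noise level) refits a linear model on many data sets and estimates the variance and squared bias of its prediction at a single test point:

```r
# Minimal sketch: variance + squared bias + noise variance at one test point x0.
set.seed(3)
f     <- function(x) sin(2 * x)   # assumed true regression function
x0    <- 1                        # test point
sigma <- 0.3                      # noise standard deviation

preds <- replicate(2000, {
  x <- runif(50, 0, 3)
  y <- f(x) + rnorm(50, sd = sigma)
  predict(lm(y ~ x), data.frame(x = x0))   # a linear fit to a nonlinear f
})
c(variance = var(preds),
  bias_sq  = (mean(preds) - f(x0))^2,
  noise    = sigma^2)   # their sum approximates the expected test MSE at x0
```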

Today: Basics of linear regression. Why linear regression. How to compute it. Why compute it.

Simple Linear Regression. We have only one feature: $Y \approx \beta_0 + \beta_1 X$, $Y = \beta_0 + \beta_1 X + \varepsilon$. Example (Figure: Sales vs. TV with a fitted line): $\mathrm{Sales} \approx \beta_0 + \beta_1 \cdot \mathrm{TV}$

How To Estimate Coefficients. No line will have no errors on data $x_i$. Prediction: $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$. Errors ($y_i$ are true values): $e_i = y_i - \hat y_i$. (Figure: Sales vs. TV with the fitted line and residuals.)

Residual Sum of Squares. $\mathrm{RSS} = e_1^2 + e_2^2 + e_3^2 + \cdots + e_n^2 = \sum_{i=1}^{n} e_i^2$. Equivalently: $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$

Minimizing Residual Sum of Squares. $\min_{\beta_0, \beta_1} \mathrm{RSS} = \min_{\beta_0, \beta_1} \sum_{i=1}^{n} e_i^2 = \min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$. (Figures: RSS over $\beta_0$ and $\beta_1$ shown as a 3D surface and as a contour plot.)

Solving for Minimal RSS. $\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$. RSS is a convex function of $\beta_0, \beta_1$. Minimum achieved when (recall the chain rule): $\frac{\partial \mathrm{RSS}}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0$ and $\frac{\partial \mathrm{RSS}}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0$

Linear Regression Coefficients. Solution of $\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$: $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$ and $\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \frac{\sum_{i=1}^{n} x_i (y_i - \bar y)}{\sum_{i=1}^{n} x_i (x_i - \bar x)}$, where $\bar x = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar y = \frac{1}{n} \sum_{i=1}^{n} y_i$.
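A minimal R check of these formulas against `lm()`, using simulated data as a stand-in for the Advertising set:

```r
# Minimal sketch: closed-form simple-regression coefficients vs. lm().
set.seed(4)
tv    <- runif(200, 0, 300)
sales <- 7 + 0.05 * tv + rnorm(200, sd = 2)

beta1 <- sum((tv - mean(tv)) * (sales - mean(sales))) / sum((tv - mean(tv))^2)
beta0 <- mean(sales) - beta1 * mean(tv)

c(beta0 = beta0, beta1 = beta1)
coef(lm(sales ~ tv))   # matches up to floating-point error
```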

Why Minimize RSS? 1. It maximizes likelihood when $Y = \beta_0 + \beta_1 X + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ 2. Best Linear Unbiased Estimator (BLUE): Gauss-Markov Theorem (ESL 3.2.2) 3. It is convenient: can be solved in closed form

Bias in Estimation. Assume a true value $\mu$. An estimate $\hat\mu$ is unbiased when $\mathbb{E}[\hat\mu] = \mu$. The standard mean estimate is unbiased (e.g. $X \sim \mathcal{N}(0, 1)$): $\mathbb{E}\big[\frac{1}{n} \sum_{i=1}^{n} X_i\big] = 0$. The standard variance estimate is biased (e.g. $X \sim \mathcal{N}(0, 1)$): $\mathbb{E}\big[\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^2\big] \neq 1$
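A quick simulation (my addition) showing the bias of the 1/n variance estimator:

```r
# Minimal sketch: the 1/n variance estimator is biased; the 1/(n-1) one is not.
set.seed(5)
n <- 10
est <- replicate(10000, {
  x <- rnorm(n)                       # true variance is 1
  c(biased   = mean((x - mean(x))^2), # divides by n
    unbiased = var(x))                # divides by n - 1
})
rowMeans(est)   # biased estimator averages about (n - 1) / n = 0.9; unbiased about 1
```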

Linear Regression is Unbiased. (Figure: two panels of Y vs. X showing least-squares fits on simulated data sets.) Gauss-Markov Theorem (ESL 3.2.2)

Solution of Linear Regression. (Figure: Sales vs. TV with the fitted least-squares line.)

How Good is the Fit? How well is linear regression predicting the training data? Can we be sure that TV advertising really influences the sales? What is the probability that we just got lucky?

$R^2$ Statistic. $R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{\sum_{i=1}^{n} (y_i - \bar y)^2}$. RSS: residual sum of squares, TSS: total sum of squares. $R^2$ measures the goodness of the fit as a proportion: the proportion of data variance explained by the model. Extreme values: 0: model does not explain data; 1: model explains data perfectly.
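A minimal R sketch computing $R^2$ from the RSS/TSS formula and comparing it with what `lm()` reports (simulated stand-in data):

```r
# Minimal sketch: R^2 from RSS and TSS vs. summary(lm())$r.squared.
set.seed(6)
tv    <- runif(200, 0, 300)
sales <- 7 + 0.05 * tv + rnorm(200, sd = 2)

fit <- lm(sales ~ tv)
rss <- sum(residuals(fit)^2)
tss <- sum((sales - mean(sales))^2)
1 - rss / tss              # manual R^2
summary(fit)$r.squared     # same value
```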

Example: TV Impact on Sales. (Figure: Sales vs. TV with fitted line.) $R^2 = 0.61$

Example: Radio Impact on Sales. (Figure: Sales vs. Radio with fitted line.) $R^2 = 0.33$

Example: Newspaper Impact on Sales. (Figure: Sales vs. Newspaper with fitted line.) $R^2 = 0.05$

Correlation Coefficient. Measures dependence between two random variables X and Y: $r = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)} \sqrt{\mathrm{Var}(Y)}}$. The correlation coefficient r lies in $[-1, 1]$: 0: variables are not related; 1: variables are perfectly related (same); $-1$: variables are negatively related (different). For simple linear regression, $R^2 = r^2$.
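A one-line check of this identity in R (simulated data):

```r
# Minimal sketch: in simple linear regression, R^2 equals the squared correlation.
set.seed(7)
x <- runif(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.5)

summary(lm(y ~ x))$r.squared
cor(x, y)^2                   # same value
```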

Hypothesis Testing. Null hypothesis $H_0$: there is no relationship between X and Y ($\beta_1 = 0$). Alternative hypothesis $H_1$: there is some relationship between X and Y ($\beta_1 \neq 0$). Seek to reject hypothesis $H_0$ with a small probability (p-value) of making a mistake. Important topic, but beyond the scope of the course.

Multiple Linear Regression. Usually more than one feature is available: $\mathrm{sales} = \beta_0 + \beta_1\,\mathrm{TV} + \beta_2\,\mathrm{radio} + \beta_3\,\mathrm{newspaper} + \varepsilon$. In general: $Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$

Multiple Linear Regression. (Figure: regression plane for Y over two features $X_1$ and $X_2$.)

Estimating Coefficients. Prediction: $\hat y_i = \hat\beta_0 + \sum_{j=1}^{p} \hat\beta_j x_{ij}$. Errors ($y_i$ are true values): $e_i = y_i - \hat y_i$. Residual Sum of Squares: $\mathrm{RSS} = e_1^2 + e_2^2 + e_3^2 + \cdots + e_n^2 = \sum_{i=1}^{n} e_i^2$. How to minimize RSS? Linear algebra!
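A minimal sketch of that linear algebra: the least-squares coefficients solve the normal equations $(X^\top X)\beta = X^\top y$. The data here are simulated stand-ins, and the explicit solve is for illustration only (`lm()` uses a more stable QR decomposition):

```r
# Minimal sketch: multiple regression via the normal equations vs. lm().
set.seed(8)
n     <- 200
tv    <- runif(n, 0, 300)
radio <- runif(n, 0, 50)
sales <- 3 + 0.045 * tv + 0.19 * radio + rnorm(n)

X <- cbind(1, tv, radio)                  # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% sales)
drop(beta_hat)
coef(lm(sales ~ tv + radio))              # same coefficients
```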

Inference from Linear Regression. 1. Are predictors $X_1, X_2, \ldots, X_p$ really predicting Y? 2. Is only a subset of predictors useful? 3. How well does the linear model fit the data? 4. What Y should be predicted, and how accurate is it?

Inference 1: Are predictors $X_1, X_2, \ldots, X_p$ really predicting Y? Null hypothesis $H_0$: there is no relationship between X and Y ($\beta_1 = \beta_2 = \cdots = \beta_p = 0$). Alternative hypothesis $H_1$: there is some relationship between X and Y (at least one $\beta_j \neq 0$). Seek to reject hypothesis $H_0$ with a small probability (p-value) of making a mistake. See ISL 3.2.2 on how to compute the F-statistic and reject $H_0$.
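In R, these tests come for free from `summary()` on a fitted model; the sketch below uses simulated data in which newspaper truly has no effect:

```r
# Minimal sketch: per-coefficient t-tests and the overall F-statistic.
set.seed(9)
n         <- 200
tv        <- runif(n, 0, 300)
radio     <- runif(n, 0, 50)
newspaper <- runif(n, 0, 100)
sales     <- 3 + 0.045 * tv + 0.19 * radio + rnorm(n)   # newspaper has no effect

fit <- lm(sales ~ tv + radio + newspaper)
summary(fit)   # t-statistics and p-values per coefficient; F-statistic at the bottom
```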

Inference 2: Is only a subset of predictors useful? Compare prediction accuracy with only a subset of features. RSS always decreases with more features! Other measures control for the number of variables: 1. Mallows $C_p$ 2. Akaike information criterion 3. Bayesian information criterion 4. Adjusted $R^2$. Testing all subsets of features is impractical: $2^p$ options! More on how to do this later.
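A minimal R sketch of why the penalized criteria matter, on simulated data where newspaper is pure noise:

```r
# Minimal sketch: RSS never increases with more features, but adjusted R^2,
# AIC, and BIC penalize model size and can prefer the smaller model.
set.seed(10)
n         <- 200
tv        <- runif(n, 0, 300)
radio     <- runif(n, 0, 50)
newspaper <- runif(n, 0, 100)
sales     <- 3 + 0.045 * tv + 0.19 * radio + rnorm(n)

fits <- list(lm(sales ~ tv),
             lm(sales ~ tv + radio),
             lm(sales ~ tv + radio + newspaper))
data.frame(rss    = sapply(fits, function(f) sum(residuals(f)^2)),
           adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
           aic    = sapply(fits, AIC),
           bic    = sapply(fits, BIC))
```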

Inference 3: How well does the linear model fit the data? $R^2$ also always increases with more features (like RSS). Is the model linear? Plot it. (Figure: Sales as a function of TV and Radio.) More on this later.

Inference 4: What Y should be predicted, and how accurate is it? The linear model is used to make predictions: $y_{\mathrm{predicted}} = \hat\beta_0 + \hat\beta_1 x_{\mathrm{new}}$. Can also predict a confidence interval (based on an estimate of $\varepsilon$). Example: spent $100,000 on TV and $20,000 on Radio advertising. Confidence interval: predict $f(X)$ (the average response): $f(X) \in [10.985, 11.528]$. Prediction interval: predict $f(X) + \varepsilon$ (response plus possible noise): $f(X) + \varepsilon \in [7.930, 14.580]$.
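Both intervals are available from `predict.lm()`; this sketch uses simulated stand-in data, so the numbers differ from the slide's example:

```r
# Minimal sketch: confidence interval for f(X) vs. prediction interval for f(X) + eps.
set.seed(11)
n     <- 200
tv    <- runif(n, 0, 300)
radio <- runif(n, 0, 50)
sales <- 3 + 0.045 * tv + 0.19 * radio + rnorm(n)
fit   <- lm(sales ~ tv + radio)

new_ad <- data.frame(tv = 100, radio = 20)
predict(fit, new_ad, interval = "confidence")   # interval for the average response
predict(fit, new_ad, interval = "prediction")   # wider: also accounts for the noise
```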

R notebook