LECTURE 15: SIMPLE LINEAR REGRESSION I


David Youngberg
BSAD 210
Montgomery College

I. From Correlation to Regression
   a. Recall last class, when we discussed the two basic types of correlation (positive and negative).
   b. While it's usually clear from a scatter plot whether two variables are correlated (and in which direction they are correlated), we often want more than that. In a world of cost-benefit analysis, correlation is not enough; we also need the level of influence.
   c. This is why we run regressions: they tell us how much one variable influences another.
      i. A regression gives a best estimate, an estimate because there will always be some things we cannot predict. At the very least, no sample is perfectly precise.
   d. It is thus important to remember that when you construct a regression, you are making a causal claim. You are claiming that one thing (x) causes another thing (y): if x increases, y will change, and y cannot change without x changing; y cannot change independently.
      i. This is why we call y the dependent variable (it depends on x), and x the independent variable (changes to it happen independently of the model).
      ii. We also call the independent variable(s) the explanatory variable(s), because they explain what the dependent variable is.
   e. When choosing variables, be sure that your explanatory variable(s) logically match your dependent variable. Some common mistakes (a small per-capita sketch follows this section):
      i. One variable adjusts for population and the other doesn't (e.g. population density and total number of crimes).
      ii. Adjusting a variable for population when that doesn't make sense, either because:
         1. It already adjusts for population (e.g. percent of people in a country who are teenagers, or average life expectancy), or
         2. Population isn't relevant (e.g. average rainfall, or whether a state voted for a Democrat or Republican in the last presidential election).
      iii. The same can be said for other adjustments, such as land area.
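As flagged in I.e above, here is a minimal sketch of the per-capita adjustment idea. All numbers are hypothetical, invented only to illustrate why raw counts mislead when populations differ:

# Hypothetical example: total crimes are misleading when populations differ,
# so adjust to crimes per 1,000 residents before comparing places.
places = {
    # name: (population, total_crimes) -- made-up numbers for illustration
    "Town A": (20_000, 180),
    "City B": (900_000, 5_400),
}

for name, (population, total_crimes) in places.items():
    per_1000 = total_crimes / population * 1_000
    print(f"{name}: {total_crimes} total crimes, {per_1000:.1f} per 1,000 residents")

# City B has 30x the total crimes of Town A but a lower rate per resident;
# the rate, not the raw count, is the comparable quantity across places.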

II. Basics
   a. Least Squares Regression: the line which minimizes the sum of squared deviations between the constructed line and the actual data points.
      i. It is also referred to as Ordinary Least Squares (OLS).
      ii. Here, we're determining the line: HEIGHTᵢ = β₀ + β₁*AGEᵢ + εᵢ. The ε is the residual, the distance between what's predicted and what's observed. Sometimes it's called the error term, but that's a bit deceiving: it's not suggesting anyone did anything wrong. Still, many sources (including your book) refer to it as error, so I mention that here to avoid future confusion.
      iii. The subscript, i, refers to a particular observation. For example, i=1 is the first observation, i=2 is the second, and so on. Note the order of observations doesn't matter; it's just for differentiating one observation from another.
         1. Note that i repeats for both variables. That's because for any observation, we know that observation's height and age. Repeating i means those values are for the same observation: the ith observation.
      iv. This line is determined by minimizing the sum of the squared vertical distances between the line and the data points. This is built to minimize this value (the Residual Sum of Squares):

         RSS = Σ (yᵢ − ŷᵢ)², where the sum runs from i = 1 to n

         1. Where ŷᵢ is the estimated value based on the regression line;
         2. yᵢ is a particular observation; and
         3. n is the sample size.

      v. Note this line is not a perfect fit. That's because other factors influence height besides age, such as genetics, diet, and exercise. These unmeasured factors are captured in εᵢ. The i subscript means that ε changes from observation to observation. Its placement is for technical reasons, creating the connection between how much the y value should be and how much the y value actually is.
   b. β₁ is the slope of the line. It tells us how much age matters to height. Suppose the line is HEIGHTᵢ = 80 + 5.6*AGEᵢ + εᵢ.
      i. We can estimate that someone who is 8 years old is probably 80 + 5.6(8) = 124.8 cm tall.
      ii. For every year someone ages, they get 5.6 cm taller.

III. Why Square the Difference?
   a. The best fit line relies on squaring the vertical difference between what the line predicts a value should be and what it actually is. Why not just take the absolute value?
   b. Consider this hypothetical data set and two different regression lines: [scatterplot of six data points in three vertical pairs, with a red candidate line and a blue candidate line]
   c. Sitting halfway between the three pairs of data points, the blue line is clearly the better fit. But if you use absolute value, there's no difference in quality. Suppose each vertical pair is a distance of 2 apart, with the blue line halfway between each pair.
   d. For the red line, there are three vertical distances of 2 and three vertical distances of 0.
      i. Absolute value method: 2+2+2+0+0+0 = 6
      ii. Squared method: 2² + 2² + 2² + 0² + 0² + 0² = 12

   e. For the blue line, there are six vertical distances of 1 each.
      i. Absolute value method: 1+1+1+1+1+1 = 6
      ii. Squared method: 1² + 1² + 1² + 1² + 1² + 1² = 6
   f. Using absolute value, there's no difference between the two lines. But under the correct method, summing the squares, the blue line is better. (A quick numeric check of this comparison appears after the model specification below.)

IV. Basic Regression
   a. Open Data Set 5; you'll find data on Montgomery College professors from Rate My Professors.
      i. Like before, this data includes every MC professor with at least 25 ratings, gathered in July 2014. We have data on their department, the number of ratings, the overall quality, the level of difficulty, and whether users rate that professor hot or not.
      ii. There are 2 observations (professors).
   b. Suppose we want to tell a story that an easy professor will lead a student to rate that professor well on overall teaching. (Perhaps, because the professor is easy, students think they've learned a lot and thus rate the professor as quite skilled in pedagogy.)
      i. Thus our causal claim: Difficulty causes Quality. QUALITYᵢ = β₀ + β₁*DIFFICULTYᵢ + εᵢ
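As flagged in III.f above, here is a minimal numeric check of the squared-versus-absolute comparison. It hard-codes the six hypothetical vertical distances from the example (red line: three 2s and three 0s; blue line: six 1s); the numbers are illustrative, not from any real data set:

# Compare two candidate lines by their vertical distances to the data.
# Distances are taken from the hypothetical example in section III.
red_distances = [2, 2, 2, 0, 0, 0]   # red line: hits three points, misses three by 2
blue_distances = [1, 1, 1, 1, 1, 1]  # blue line: halfway between each vertical pair

for name, distances in [("red", red_distances), ("blue", blue_distances)]:
    abs_total = sum(abs(d) for d in distances)  # absolute value method
    sq_total = sum(d ** 2 for d in distances)   # squared method (RSS)
    print(f"{name}: absolute = {abs_total}, squared = {sq_total}")

# Output:
#   red: absolute = 6, squared = 12
#   blue: absolute = 6, squared = 6
# The absolute totals tie, but squaring penalizes the red line's big misses.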

   c. To run a regression in Excel, go to Data >>> Data Analysis >>> Regression. You'll get a window that looks like this: [screenshot of Excel's Regression dialog, with its options labeled A through H]
      i. A red letter means this option can be ignored for purposes of this class.
   d. I filled in the box as follows:
      A: Where the range for your dependent variable goes.
      B: Where the range for your explanatory variable goes.
      C: If you check this box, Excel will assume the first row of your data is the label for that column. It is useful to use this option, as we'll see soon.
      D: Excel will output the confidence interval of your explanatory variable's coefficient, defaulting to 95% confidence. Check this box to change the confidence level.
      E: Check this box if you want to force the intercept (β₀) to be zero. You won't need this option for this class.
      F: As before, this is how you tell Excel where you want the results. I usually select the first option and pick an out-of-the-way cell.
      G: Excel can give you information on the residuals for each observation.
      H: Used to analyze the data to see how it deviates from a normal distribution.
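For reference, the same simple regression can be run in code. The sketch below uses Python's statsmodels package; the file name "dataset5.csv" and the column names "Quality" and "Difficulty" are assumptions (Data Set 5 is an Excel file, so adapt the names to however you export the data):

# Sketch: the same simple regression Excel runs, via statsmodels.
# Assumes the professor data has been exported to a CSV with columns
# "Quality" and "Difficulty" -- file and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("dataset5.csv")

X = sm.add_constant(df["Difficulty"])  # adds the intercept term (beta_0)
y = df["Quality"]

model = sm.OLS(y, X).fit()
print(model.summary())  # coefficients, standard errors, t-stats, p-values, CIs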

   e. And got this result: [Excel's regression output table]

V. Interpretation
   a. For now, focus only on the end of the regression output (we'll talk about the rest of it later).

   b. For each variable in the regression (and it's possible to have many, which we will discuss later), Excel will tell you the following:
      i. Coefficient: this is the beta-value for the variable; the slope.
      ii. Standard Error: this is the dispersion of the coefficient. If you draw multiple unbiased samples, this gives an idea of how much the coefficient would change.
      iii. t-statistic: the ratio of the estimated coefficient to the standard error of the estimated coefficient (coefficient divided by standard error).
      iv. p-value: tells you the threshold of significance you achieve for a particular t-statistic. (Remember, critical t values change based on degrees of freedom.) If the p-value is below 0.05, it's significant at the 5% (95% confidence) level. If below 0.01, it's significant at the 1% level, etc. It's basically the α.
      v. Confidence interval: describes the range in which the true value of the parameter could fall with a certain level of certainty (usually 95%). Excel outputs this result twice; the second one is for whatever confidence level you customized Excel to use (e.g. 97% rather than 95%).
   c. The intercept is β₀; it's not really a variable, and the t-stat and other information don't matter too much. But the coefficient does. That number (6.22) is β₀. Our estimated line is thus: QUALITYᵢ = 6.22 − 0.8625*DIFFICULTYᵢ + εᵢ
      i. Note as well this result is statistically significant. The t-stat is huge and p is functionally zero.
   d. Increasing DIFFICULTY by one point decreases QUALITY by 0.8625 points.
   e. A professor with a DIFFICULTY of 3 is expected to have a QUALITY of about 3.6 (with the rounded coefficients above, 6.22 − 0.8625×3 = 3.6325).
      i. If the professor is actually above or below that predicted value, you can infer that there is something special (good or bad) about his or her teaching.
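A minimal sketch tying these pieces together, using the estimated line above. The coefficients come from the output quoted in the notes; the standard error is a made-up placeholder, since the full Excel output isn't reproduced here:

# Worked check of the interpretation above, using the estimated line
# QUALITY = 6.22 - 0.8625 * DIFFICULTY.
beta_0 = 6.22      # intercept from the Excel output
beta_1 = -0.8625   # DIFFICULTY coefficient from the Excel output

# Predicted quality for a professor with difficulty 3:
difficulty = 3
predicted_quality = beta_0 + beta_1 * difficulty
print(f"predicted QUALITY: {predicted_quality:.4f}")  # 3.6325

# The t-statistic is just coefficient / standard error. The standard
# error below is a hypothetical placeholder, not the actual Excel value.
std_err = 0.05
t_stat = beta_1 / std_err
print(f"t-statistic: {t_stat:.2f}")  # a large |t| means a significant result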