Regression [M&M 2.3 and 10]

Uses
- Curve fitting
- Summarization ('model')
- Description
- Prediction
- Explanation
- Adjustment for 'confounding' variables

[Table: percentiles of birth weight (g) by gestational age (weeks), males and females -- for each sex, the number of births and the 10th, 50th and 90th percentile weights at each week of gestation.]

Technical meaning

[originally] simply a line of 'best fit' to data points
[nowadays] the regression line is the LINE that connects the CENTRES of the distributions of Y's at each X value
- not necessarily a straight line; it could be curved, as with growth charts
- not necessarily the µ_{Y|X}'s used as CENTRES; one could use medians, etc.
- strictly speaking, we haven't completed the description unless we characterize the variation around the centres of the Y distributions at each X
- the inference is not restricted to the distributions of Y's for which we make some observations; it applies to the distributions of Y's at all unobserved X values in between

[Figure: birth-weight distributions and medians (50th percentiles) for males and females, plotted against gestational age (weeks), with reference lines at 1000g, 2000g, 3000g and 4000g. Live singleton births, Canada 1986. Source: Arbuckle & Sherman, CMAJ 1989.]

Examples (with appropriate caveats)
- Birth weight (Y) in relation to gestational age (X)
- Blood pressure (Y) in relation to age (X)
- Cardiovascular mortality (Y) in relation to water hardness (X)?
- Cancer incidence (Y) in relation to some exposure (X)?
- Scholastic performance (Y) vis-a-vis amount of TV watched (X)

Caveat: there is no guarantee that a simple straight-line relationship will be adequate. Also, in some instances the relationship might change with the type of X and Y variables used to measure the two phenomena being studied; the relationship may also be more artifact than real (see later, under inference).
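As a quick illustration of the 'centres of the Y distributions at each X' idea, the sketch below uses hypothetical birth-weight data (made-up values, not the Arbuckle & Sherman figures) and prints, for each gestational week, the mean, the median and the 10th percentile -- any of these could serve as the 'centre' (or percentile curve) traced out across X.

```python
# Sketch (hypothetical data): the "centre of the Y distribution at each X" idea.
import numpy as np

rng = np.random.default_rng(1)
weeks = np.repeat(np.arange(36, 43), 200)                            # gestational age, X
weight = 3300 + 150 * (weeks - 40) + rng.normal(0, 450, weeks.size)  # birth weight, Y

for wk in np.unique(weeks):
    w = weight[weeks == wk]
    # either the mean or the median could be used as the "centre" at this X
    print(wk, round(w.mean()), round(np.median(w)), round(np.percentile(w, 10)))
```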

Simple Linear Regression (one X) (straight line)

Equation:   µ_{Y|X} = α + β·X,   or   ∆µ_{Y|X} / ∆X = β = "rise"/"run"

In practice, one rarely sees an exact straight-line relationship in health-science applications:

1 - While physicists are often able to examine the relationship between Y and X in a laboratory with all other things being equal (i.e. controlled or held constant), medical investigators largely are not. The universe of (X,Y) pairs is very large, and any 'true' relationship is disturbed by countless uncontrollable (and sometimes un-measurable) factors. In any particular sample of (X,Y) pairs these distortions will surely be operating.

2 - The true relationship (even if we could measure it exactly) may not be a simple straight line.

3 - The measuring instruments may be faulty or inexact (using 'instruments' in the broadest sense).

One always tries to have the investigation sufficiently controlled that the 'real' relationship won't be 'swamped' by factors 1 and 3, and that the background "noise" will be small enough that alternative models (e.g. curvilinear relationships) can be distinguished from one another.

Linear here means linear in the parameters. The equation y = B·C^x can be made linear in the parameters by taking logs, i.e. log[y] = log[B] + x·log[C]; y = a + b·x + c·x² is already linear in the parameters a, b and c. The following model cannot be made linear in the parameters α, β, γ:

    proportion dying = α + (1 − α) / (1 + exp{β − γ·log[dose]})

Fitting a straight line to data - the Least Squares Method

The most common method is that of Least Squares. Note that least squares can be thought of as just a curve-fitting method and doesn't have to be thought of in a statistical (or random-variation or sampling-variation) context. Other, more statistically-oriented methods include the method of minimum Chi-Square (matching observed and expected counts according to a measure of discrepancy) and the Method of Maximum Likelihood (finding the parameters that made the data most likely). Each has a different criterion of "best fit". There are also several theoretical advantages to least-squares estimates over others based, for example, on least absolute deviations: they are the most precise of all the possible estimates one could get by taking linear combinations of y's.

Least Squares approach: Consider a candidate slope (b) and intercept (a) and predict that the Y value accompanying any X = x is ŷ = a + b·x. The observed y value will deviate from this "predicted" or "fitted" value by an amount d = y − ŷ. We wish to keep this deviation as small as possible, but we must try to strike a balance over all the data points. Again, just as when calculating variances, it is easier to work with squared deviations: d² = (y − ŷ)². We weight all deviations equally (whether they be the ones in the middle or at the extremes of the x range), using Σ d² = Σ (y − ŷ)² to measure the overall (or average) discrepancy of the points from the line.
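A minimal sketch of the least-squares criterion on made-up (x, y) values: the function below evaluates Σ(y − ŷ)² for any candidate intercept and slope, and numpy's polyfit is used only to locate the pair that minimizes it.

```python
# Sketch (assumed data, not from the notes): the least-squares criterion
# sum of squared deviations (y - yhat)^2 for candidate lines.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

def ssq(a, b):
    """Overall discrepancy of the points from the line yhat = a + b*x."""
    yhat = a + b * x
    return np.sum((y - yhat) ** 2)

# a few candidate (intercept, slope) pairs vs the least-squares pair
print(ssq(1.0, 0.8), ssq(1.0, 1.2), ssq(0.9, 1.0))
b_ls, a_ls = np.polyfit(x, y, 1)      # polyfit returns [slope, intercept] for degree 1
print(a_ls, b_ls, ssq(a_ls, b_ls))    # the minimum of the criterion
```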

From all the possible candidates for slope (b) and intercept (a), we choose the particular values a and b which make this sum of squares (sum of squared deviations of 'fitted' from 'observed' Y's) a minimum, i.e. we search for the a and b that give us the least-squares fit. Fortunately, we don't have to use trial and error to arrive at the 'best' a and b. Instead, it can be shown by calculus, or algebraically, that the a and b which minimize Σ d² are:

    b = β̂ = Σ{x_i − x̄}{y_i − ȳ} / Σ{x_i − x̄}² = r_xy · s_y / s_x

    a = α̂ = ȳ − b·x̄

[Note that a least-squares fit of the regression line of X on Y would give a different set of values for the slope and intercept: the slope of the line of x on y is r_xy · s_x / s_y. One needs to be careful, when using a calculator or computer program, to specify which is the explanatory variable (X) and which is the predicted variable (Y).]

Meaning of the intercept parameter (a): Unlike the slope parameter (which represents the increase/decrease in µ_{Y|X} for every unit increase in x), the intercept does not always have a 'natural' interpretation. It depends on where the x-values lie in relation to x = 0, and may represent part of what is really the mean Y. For example, the regression line for fuel economy of cars (Y) in relation to their weight (x) might be

    µ_{Y|weight} = 60 mpg − 0.01 × weight in lbs   [slope: −0.01 mpg/lb]

but there are no cars weighing 0 lbs. It would be better to write the equation in relation to some 'central' value for weight, e.g. 3500 lbs; then the same equation can be cast as

    µ_{Y|weight} = 25 − 0.01 × (weight − 3500).

It is helpful, for testing whether there is evidence of a non-zero slope, to think of the simplest of all regression models, namely the one which is a horizontal straight line: µ_{Y|X} = α + 0·X = the constant α. This is a re-statement of the fact that the sum of squared deviations around a constant horizontal line at height 'a' is smallest when a = the mean.

[We don't always use the mean as the best 'centre' of a set of numbers. Imagine waiting for one of several elevators with doors in a row along one wall; you do not know which one will arrive next, and so want to stand in the 'best' place no matter which one comes next. Where to stand depends on the criterion being optimized: if you want to minimize the maximum distance, stand midway between the one on the extreme left and the one on the extreme right; if you wish to minimize the average distance, where do you stand? If, for some reason, you want to minimize the average squared distance, where do you stand? If the elevator doors are not equally spaced from each other, what then?]
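The closed-form estimates above are easy to verify numerically. The sketch below (made-up data again) computes b two ways -- from Σ{x − x̄}{y − ȳ}/Σ{x − x̄}² and from r·s_y/s_x -- and also shows that the slope of the regression of X on Y is r·s_x/s_y, not the reciprocal of b.

```python
# Sketch (assumed data): closed-form least-squares estimates, and the
# different slope obtained by regressing X on Y.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

xbar, ybar = x.mean(), y.mean()
b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
a = ybar - b * xbar

r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(ddof=1), y.std(ddof=1)
print(b, r * sy / sx)          # same number: two expressions for the slope of Y on X
print(a)

b_x_on_y = r * sx / sy         # slope of the OTHER regression (X on Y) -- not 1/b
print(b_x_on_y, 1 / b)
```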

The anatomy of a slope: some re-expressions

Consider the formula:   slope = b = Σ{x − x̄}{y − ȳ} / Σ{x − x̄}²

Without loss of generality, and for simplicity, assume ȳ = 0.

If we have 3 x's, 1 unit apart (e.g. x1 = 1; x2 = 2; x3 = 3), then

    x1 − x̄ = −1;  x2 − x̄ = 0;  x3 − x̄ = +1

so   slope = b = [{−1}·y1 + {0}·y2 + {+1}·y3] / [{−1}² + {0}² + {+1}²]

i.e.  slope = (y3 − y1) / (x3 − x1)

Note that y2 contributes to ȳ, and thus to an estimate of the average y (i.e. the level), but not to the slope.

If 4 x's, 1 unit apart (e.g. x1 = 1; x2 = 2; x3 = 3; x4 = 4), then

    x1 − x̄ = −1.5;  x2 − x̄ = −0.5;  x3 − x̄ = +0.5;  x4 − x̄ = +1.5

and so   slope = b = [{−1.5}·y1 + {−0.5}·y2 + {+0.5}·y3 + {+1.5}·y4] / [{−1.5}² + {−0.5}² + {+0.5}² + {+1.5}²]

i.e.  slope = [1.5·{y4 − y1} + 0.5·{y3 − y2}] / 5

i.e.  slope = (9/10)·(y4 − y1)/(x4 − x1) + (1/10)·(y3 − y2)/(x3 − x2)

i.e. a weighted average of the slope from datapoints 1 and 4 and that from datapoints 2 and 3, with weights proportional to the squares of their distances on the x axis, {x4 − x1}² and {x3 − x2}².

Another way to think of the slope: rewrite

    b = Σ{x − x̄}{y − ȳ} / Σ{x − x̄}²  =  Σ [ {x − x̄}² · {y − ȳ}/{x − x̄} ] / Σ {x − x̄}²

i.e. a weighted average of the 'slopes' {y − ȳ}/{x − x̄}, with weight {x − x̄}² for each such estimate.

Yet another way to think of the slope: b is a weighted average of all the pairwise slopes (yi − yj)/(xi − xj), with weights proportional to {xi − xj}². e.g. if the 4 x's are 1 unit apart, and we denote by b_{1&2} the slope obtained from {x1,y1} & {x2,y2}, etc., then

    b = [1·b_{1&2} + 4·b_{1&3} + 9·b_{1&4} + 1·b_{2&3} + 4·b_{2&4} + 1·b_{3&4}] / 20

jh 6/94
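The 'weighted average of all pairwise slopes' identity can be checked directly; the sketch below uses arbitrary made-up data and weights (xi − xj)².

```python
# Sketch: numerical check that the least-squares slope equals a weighted average
# of all pairwise slopes (yi - yj)/(xi - xj) with weights (xi - xj)^2.  (Assumed data.)
import numpy as np
from itertools import combinations

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.3, 2.0, 2.2, 3.9])

b_ls = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

num = den = 0.0
for i, j in combinations(range(len(x)), 2):
    w = (x[i] - x[j]) ** 2                       # weight for this pair
    num += w * (y[i] - y[j]) / (x[i] - x[j])     # weight * pairwise slope
    den += w
print(b_ls, num / den)                           # identical
```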

Inferences regarding Simple Linear Regression

How reliable are

(i) the (estimated) slope
(ii) the (estimated) intercept
(iii) the predicted mean Y at a given X
(iv) the predicted y for a (future) individual with a given X

when they are based on data from a sample? i.e. how much would these estimated quantities change if they were based on a different random sample [with the same x values]?

We can use the concept of sampling variation to (i) describe the 'uncertainty' in our estimates via CONFIDENCE INTERVALS or (ii) carry out TESTS of significance on the parameters (slope, intercept, predicted mean).

We can describe the degree of reliability of (or, conversely, the degree of uncertainty in) an estimated quantity by the standard deviation of the possible estimates produced by different random samples of the same size from the same x's. We call this (obviously conceptual) S.D. the standard error of the estimated quantity (just like the standard error of the mean when estimating µ).

The size of the standard error will depend on
1. how 'spread apart' the x's are;
2. how good a fit the regression line really is (i.e. how small the unexplained variation about the line is);
3. how large the sample size, n, is.

Factors affecting reliability (in more detail)

1. The spread of the X's: The best way to get a reliable estimate of the slope is to take Y readings at X's that are quite a distance from each other. E.g. in estimating the "per-year increase in BP over the 30-50 yr age range", it would be better to take X = 30, 35, 40, 45, 50 than to take X = 38, 39, 40, 41, 42. Any individual fluctuations will 'throw off' the slope much less if the X's are far apart. [It is helpful to think of the slope as an average difference in means for 2 groups that are 1 x-unit apart.]

[Figure: BP vs AGE, panels (a) and (b). Thick line: the real (true) relation between average BP at age X and X; thin lines: possible apparent relationships, because of individual variation, when we study 1 individual at each of two ages (a) spaced closer together, (b) spaced further apart.]

Notes: The regression line refers to the relationship between the average Y at a given X and the X, and not to individual Y's vs X. Obviously, of course, if the individual Y's are close to the average Y, so much the better! The above argument would suggest studying individuals at the extremes of the X values of interest. We do this if we are sure that the relationship is a linear one. If we are not sure, it is wiser -- if we have a choice in the matter -- to take a 3-point distribution. There is a common misapprehension that a Gaussian distribution of X values is desirable for estimating a regression slope of Y on X. In fact, the 'inverted U' shape of the Gaussian is the least desirable!
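The effect of the spread of the X's can be seen in a small simulation (the true line, residual SD and age designs below are assumed values, not from the notes): the fitted slope varies much less across repeated samples when the ages are 30, 35, ..., 50 than when they are 38, 39, ..., 42.

```python
# Sketch: reliability factor 1 -- same n, but spread-out vs bunched x's.  (Assumed values.)
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, sigma = 100.0, 0.8, 10.0          # true BP = 100 + 0.8*age, SD about the line

def sd_of_slope(ages, n_sims=5000):
    slopes = []
    for _ in range(n_sims):
        bp = alpha + beta * ages + rng.normal(0, sigma, ages.size)
        slopes.append(np.polyfit(ages, bp, 1)[0])
    return np.std(slopes)

spread  = np.array([30.0, 35.0, 40.0, 45.0, 50.0])
bunched = np.array([38.0, 39.0, 40.0, 41.0, 42.0])
print("SD of slope, spread-out ages:", sd_of_slope(spread))
print("SD of slope, bunched ages:   ", sd_of_slope(bunched))
```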

Factors affecting reliability (continued)

2. The (vertical) variation about the regression line: Again, consider BP and age, and suppose that indeed the average BP of all persons aged X + 1 is β units higher than the average BP of all persons aged X, and that this linear relationship

    average BP of persons aged x = α + β·X
    (average of Y's at a given x = intercept + slope × X)

holds over the age span of interest. Obviously, everybody aged x = 32 won't have the exact same BP; some will be above the average of 32-year-olds, some below. Likewise for the other ages x = 30, 31, ... In other words, at any x there will be a distribution of y's about the average for age X. Obviously, how wide this distribution is about α + β·X will have an effect on what slopes one could find in different samples (we measure the vertical spread around the line by σ).

[Figure: possible apparent relationships, because of individual variation, when we study 1 individual at each of two ages, when the within-age distributions have (a) a narrow spread, (b) a wider spread.]

NOTE: For unweighted regression, one should have roughly the same spread of Y's at each X.

3. Sample size (n): A larger n will make it more difficult for the types of extreme and misleading estimates caused by (1) poor X spread and (2) large variation in Y about µ_{Y|X} to occur. Clearly, it may be possible to spread the x's out so as to maximize their variance (and thus reduce the n required), but it may not be possible to change the magnitude of the variation about µ_{Y|X} (unless there are other known factors influencing BP). Thus the need for a reasonably stable estimated ŷ [i.e. estimate of µ_{Y|X}].

[Figure: BP vs AGE, panels (a) and (b). Thick line: the real (true) relation between average BP at age X and X; thin lines: possible apparent relationships because of individual variation.]
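Factors 2 and 3 can be simulated the same way (again with assumed values): increasing the vertical scatter σ inflates the variability of the fitted slope, and increasing n pulls it back down.

```python
# Sketch: reliability factors 2 (sigma) and 3 (n), continuing the assumed BP-vs-age setting.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 100.0, 0.8

def sd_of_slope(n, sigma, n_sims=5000):
    ages = np.linspace(30, 50, n)                      # fixed design: n ages from 30 to 50
    slopes = []
    for _ in range(n_sims):
        bp = alpha + beta * ages + rng.normal(0, sigma, n)
        slopes.append(np.polyfit(ages, bp, 1)[0])
    return np.std(slopes)

print("n=10, sigma=5 :", sd_of_slope(10, 5.0))
print("n=10, sigma=15:", sd_of_slope(10, 15.0))        # wider scatter -> less reliable slope
print("n=40, sigma=15:", sd_of_slope(40, 15.0))        # larger n recovers some reliability
```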

Standard Errors

    SE(b) = SE(β̂) = σ / √[ Σ{x_i − x̄}² ]

    SE(a) = SE(α̂) = σ · √[ 1/n + x̄² / Σ{x_i − x̄}² ]

(Note: there is a negative correlation between a and b.) We don't usually know σ, so we estimate it from the data, using the scatter of the y's from the fitted line (i.e. the SD of the residuals).

If we examine the structure of SE(b), we see that it reflects the 3 factors discussed above:

(i) a large spread of the x's makes the contribution of each observation to Σ{x_i − x̄}² large, and since this is in the denominator, it reduces the SE;

(ii) a small vertical scatter is reflected in a small σ, and since this is in the numerator, it also reduces the SE of the estimated slope;

(iii) a large sample size means that Σ{x_i − x̄}² is larger, and like (i) this reduces the SE. The formula, as written, tends to hide this last factor; note that Σ{x_i − x̄}² is what we use to compute the spread of a set of x's -- we simply divide it by n − 1 to get a variance and then take the square root to get the SD. To make the point here, simplify n − 1 to n and write Σ{x_i − x̄}² ≈ n·var(x), so that √[Σ{x_i − x̄}²] ≈ √n·sd(x), and the equation for the SE simplifies to

    SE(b) ≈ σ / [√n · sd(x)] = (SD_{y|x} / SD_x) / √n

with √n in its familiar place in the denominator of the SE (even in more complex SE's, this is where n is usually found!).

The structure of SE(a): In addition to the factors mentioned above, all of which come in again in the expected way, there is the additional factor of x̄²; since this is in the numerator, it increases the SE. This is natural in that if the data, and thus x̄, are far from x = 0, then any imprecision in the estimate of the slope will project backwards to a large imprecision in the estimated intercept. Also, if one uses 'centered' x's, so that x̄ = 0, the formula for the SE reduces to SE(a) = σ·√(1/n) = σ/√n, and we recognize this as SE(ȳ) -- not surprisingly, since ȳ is the 'intercept' for centered data.

CI's and Tests of Significance for β̂ and α̂ are based on the t distribution (or on Gaussian Z's if n is large):

    β̂ ± t_{n−2} · SE(β̂)        H0: β = 0:   t_{n−2} = β̂ / SE(β̂)

    α̂ ± t_{n−2} · SE(α̂)        H0: α = 0:   t_{n−2} = α̂ / SE(α̂)
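A sketch of these computations on made-up data, estimating σ by the residual SD with n − 2 degrees of freedom; scipy is assumed to be available for the t quantile.

```python
# Sketch (assumed data): SE(b), SE(a), t-based confidence intervals and the test of H0: beta = 0.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.8, 4.1, 4.6, 6.2, 6.4])
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

resid = y - (a + b * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))     # residual SD, df = n - 2

sxx = np.sum((x - x.mean()) ** 2)
se_b = sigma_hat / np.sqrt(sxx)
se_a = sigma_hat * np.sqrt(1 / n + x.mean() ** 2 / sxx)

t_crit = stats.t.ppf(0.975, df=n - 2)
print("b:", b, "95% CI:", (b - t_crit * se_b, b + t_crit * se_b))
print("t for H0: beta = 0 :", b / se_b)
print("a:", a, "95% CI:", (a - t_crit * se_a, a + t_crit * se_a))
```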

Standard Error for the Estimated µ_{Y|X}, or 'average Y at X'

We estimate the 'average Y at X', or µ_{Y|X}, by α̂ + β̂·X. Since the estimate is based on two estimated quantities, each of which is subject to sampling variation, it contains the uncertainty of both:

    SE(estimated average Y at X) = σ · √[ 1/n + {X − x̄}² / Σ{x_i − x̄}² ]

Again, we must use an estimate σ̂ of σ.

First-time users of this formula suspect that it has something missing, or an x instead of an x̄, or some other typographical error. There is no typographical error, and indeed if one examines it closely, it makes sense. X refers to the x-value at which one is estimating the mean -- it has nothing to do with the actual x's in the study which generated the estimated coefficients, except that the closer X is to the centre of the data, the smaller the quantity {X − x̄}, and thus the quantity {X − x̄}², and thus the SE, will be. Indeed, if we estimate the average Y right at X = x̄, the estimate is simply ȳ (since the fitted line goes through [x̄, ȳ]) and its SE will be σ·√[1/n + {x̄ − x̄}²/Σ{x_i − x̄}²] = σ·√(1/n) = σ/√n = SE(ȳ).

Confidence Interval for an individual Y at X

A certain percentage P% of individuals are within t_P·σ of the mean µ_{Y|X} = α + β·X, where t_P is a multiple, depending on P, from the t (or, if n is large, the Z) table. However, we are not quite certain where exactly the mean α + β·X is -- the best we can do is estimate, with a certain P% confidence, that it is within t_P·SE(α̂ + β̂·X) of the point estimate α̂ + β̂·X. The uncertainty concerning the mean and the natural variation of individuals around the mean -- wherever it is -- combine in the expression for the estimated P% range of individual variation, which is as follows:

    α̂ + β̂·X  ±  t_P · σ̂ · √[ 1 + 1/n + {X − x̄}² / Σ{x_i − x̄}² ]

Both the CI for the estimated mean and the CI for individuals (i.e. the estimated percentiles of the distribution) are bow-shaped when drawn as a function of X. They are narrowest at X = x̄, and fan out from there. One needs to be careful not to confuse the much narrower CI for the mean with the much wider CI for individuals. If one can see the raw data, it is usually obvious which is which -- the CI for individuals is almost as wide as the raw data themselves. cf. data on sleeping through the night; alcohol levels and eye speed.
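Continuing the same made-up data as above: the sketch below evaluates the interval for the mean at a chosen X and the much wider interval for an individual, both of which are narrowest at X = x̄ and fan out from there.

```python
# Sketch (assumed data): CI for the mean at X vs the interval for an individual Y at X.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.8, 4.1, 4.6, 6.2, 6.4])
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
sigma_hat = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
sxx = np.sum((x - x.mean()) ** 2)
t_crit = stats.t.ppf(0.975, df=n - 2)

for X in (x.mean(), 6.0):                       # at the centre of the data and near the edge
    fit = a + b * X
    se_mean = sigma_hat * np.sqrt(1 / n + (X - x.mean()) ** 2 / sxx)
    se_indiv = sigma_hat * np.sqrt(1 + 1 / n + (X - x.mean()) ** 2 / sxx)
    print(f"X={X:4.1f}  mean: {fit - t_crit*se_mean:.2f}..{fit + t_crit*se_mean:.2f}"
          f"   individual: {fit - t_crit*se_indiv:.2f}..{fit + t_crit*se_indiv:.2f}")
```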
