
Comment: these notes are adapted from BIOL 214/312.

I. Correlation

A) Correlation is used when we want to examine the relationship between two continuous variables. We are not interested in prediction, and we don't consider one variable independent and the other dependent. All that we're interested in is:

B) Graphical description: Do they vary together? Does x go up as y goes up? Or does x go down as y goes up?

1) Illustrate fig. 19.1, p. 381 on board (include perfect correlation and 0 correlation).

2) Problem: just describing the graphs is subjective, so we describe the correlation using what is called a correlation coefficient.

3) The correlation coefficient is designated by the Greek letter ρ (rho) and is estimated by r.

C) Calculation of r

1) Here's the formula (r estimates ρ):

    r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}

2) How does it work?

For the numerator:

a) First notice that the point (x̄, ȳ) lies somewhere inside the cloud of points on the graph (illustrate).

b) If x_i < x̄ and y_i < ȳ, the point is below and to the left of (x̄, ȳ), and for that value of i the term in the sum is positive (you're multiplying two negative numbers).

c) Similarly, if x_i > x̄ and y_i > ȳ, the point is above and to the right, and the term is again positive.

- So if all the points pair up below-left and above-right of (x̄, ȳ), r will be positive (you're adding up a bunch of positive numbers).

d) Now what happens if x_i < x̄ and y_i > ȳ? The point is above and to the left of (x̄, ȳ), and the term is negative (you're multiplying a (+) and a (-) number).

e) Similarly, if x_i > x̄ and y_i < ȳ, the point is below and to the right, and again the term is negative.

- If all the points are above-left and below-right of (x̄, ȳ), r will be negative.

- If you have a mix of points, the sign depends on where most of the points are and how far away they are from (x̄, ȳ).

For the denominator:

a) This is basically a scaling factor. It makes sure that r stays between -1 and 1.

b) This is a little like dividing by the standard deviation to get normal scores (when calculating z).
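To make the formula concrete, here is a minimal R sketch that computes r directly from the numerator and denominator above and checks the result against R's built-in cor(). The data values are made up purely for illustration.

    # Hypothetical data, just to exercise the formula
    x <- c(1, 2, 3, 4, 5)
    y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

    # Numerator: sum of cross-products of deviations from the means
    num <- sum((x - mean(x)) * (y - mean(y)))

    # Denominator: the scaling factor that keeps r between -1 and 1
    den <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

    num / den    # r computed from the formula
    cor(x, y)    # R's built-in Pearson r; the two should agree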

3) Okay, now you have r. What do you do next?

- Important: people often use r even when it is not significant. r is often used descriptively, sort of like reporting that the average of something is y without saying whether y means anything other than an average.

- If you wish to test whether the relationship between x and y is significant, then you carry out a statistical test.

D) Testing for significance of r:

a) Set up your hypotheses:

    H_0: ρ = 0 (this implies that there is no relationship between x and y)
    H_1: ρ ≠ 0 (either positive or negative)

or, of course, H_1: ρ < 0 or H_1: ρ > 0.

b) Decide on α.

c) Calculate r.

d) Calculate t* (yes, we're back to using t) as follows:

    t^* = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}

Your text does this just a little bit differently, but the formula is actually the same.

e) Get your t from the t-tables with α and d.f. = ν = n - 2 (n - 2, NOT n - 1)!!

f) Compare as usual (compare t* to the table t) and make your conclusion.

g) A one-sided test is carried out the same way as with a t-test:

- compare the absolute value of t* with the one-sided value from the t-table, and if |t*| ≥ t_table, reject.

- just make sure you verify that your data agree with H_1 before you proceed.
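As a quick sketch of steps d) and e), here is the t* formula written as a small R function, along with the critical value and p-value from R's t distribution. The function name t_star and the example values r = 0.8, n = 13 are my own choices for illustration, not from the text.

    # t* for testing H0: rho = 0, given r and the sample size n
    t_star <- function(r, n) r * sqrt(n - 2) / sqrt(1 - r^2)

    t_star(0.8, 13)                          # observed t* for hypothetical r and n
    qt(0.975, df = 13 - 2)                   # two-sided critical t at alpha = 0.05
    2 * pt(-abs(t_star(0.8, 13)), df = 11)   # two-sided p-value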

E) An example from a different text.

1) A plant physiologist compared the total leaf area with the total dry weight of the plant for 13 plants and got the following results:

    Leaf area (X)    Dry weight (Y)
         411              2.00
         550              2.46
         471              2.11
         393              1.89
         427              2.05
         431              2.30
         492              2.46
         371              2.06
         470              2.25
         419              2.07
         407              2.17
         489              2.32
         439              2.12
    mean 443.8            2.174

    SS_x  = 28,465.7
    SS_y  = 0.363708
    SS_cp = 82.8977

(The book also gives SS_r, which we haven't discussed yet; we'll need it for regression.)
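If you want to verify the book's summary quantities yourself, here is a minimal R sketch that enters the data as two vectors (the names leafarea and dryweight are my choice) and recomputes the sums of squares and cross-products:

    leafarea  <- c(411, 550, 471, 393, 427, 431, 492, 371, 470, 419, 407, 489, 439)
    dryweight <- c(2.00, 2.46, 2.11, 1.89, 2.05, 2.30, 2.46, 2.06, 2.25, 2.07, 2.17, 2.32, 2.12)

    sum((leafarea - mean(leafarea))^2)        # SS_x:  about 28,465.7
    sum((dryweight - mean(dryweight))^2)      # SS_y:  about 0.363708
    sum((leafarea - mean(leafarea)) *
        (dryweight - mean(dryweight)))        # SS_cp: about 82.8977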

a) Before you go on, what do you think should happen (what kind of alternative hypothesis makes sense)?

b) So let's set up our hypotheses:

    H_0: ρ = 0
    H_1: ρ > 0

c) Decide on α; let's pick 0.05.

d) Calculate r. The book does most of the work for us:

    numerator   = 82.8977
    denominator = √(28,465.7 × 0.363708) = 101.75

So we have:

    r = 82.8977 / 101.75 = 0.8147

(Check: since r > 0, which agrees with our H_1, we proceed.)

e) Figure out t*:

    t^* = \frac{0.8147 \sqrt{13 - 2}}{\sqrt{1 - 0.8147^2}} = 4.66

f) We look up the tabulated t for 11 d.f. and α = 0.05 and get 1.796.

g) Since 4.66 ≥ 1.796, we reject H_0 and conclude that leaf area and dry weight increase together.

F) Concluding remarks:

1) There may or may not be a direct relationship between r and significance:

a) r might be 0.23 and be highly significant.
b) r might be 0.95 and not be significant.

2) r is very sensitive to extreme points (illustrate; so is regression). A short demonstration follows this list.

3) While r itself doesn't have any assumptions (i.e., you can always calculate r, just as you can always calculate ȳ), the t-test based on r does (normal, random data). (Caution: obviously, if you use r tables instead, you still have the same assumptions.)

4) Because r can be sensitive to extreme points, and because sometimes you can't meet the assumptions, there are alternatives:

- Spearman's rank correlation is easy to learn and doesn't need any assumptions:
- rank the data in each column (not like in the KW test, where you rank all columns at once).
- then use the above formula for r on the ranks (a quick check of this recipe appears at the end of these notes).
- for a hypothesis test you need tables listing Spearman's rank critical values.
- See section 19.9, p. 398 of your text.

5) r, even if significant, DOES NOT IMPLY that one variable is causing the effect in the other. There is not necessarily any causation.

a) In Europe, a strong positive correlation (significant) has been found between the number of storks and the number of babies. This is absolutely true!
b) But, obviously (I hope!), storks don't bring babies, so what is going on?
c) Be patient; to be discussed in class.
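Here is the demonstration promised in point 2): a rough R sketch (entirely made-up data) showing how a single extreme point can manufacture a large r, and how the rank-based alternative from point 4) is much less affected.

    set.seed(1)                  # any seed works; the point is qualitative
    x <- 1:20
    y <- rnorm(20)               # y is unrelated to x, so r should be small
    cor(x, y)

    # Add one wild point far outside the rest of the data
    x2 <- c(x, 100)
    y2 <- c(y, 50)
    cor(x2, y2)                          # now close to 1
    cor(x2, y2, method = "spearman")     # the ranks tame the outlier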

G) Doing correlations in R: This is not difficult. Arrange your data in two columns; for the plant example above, we arrange the data exactly as given (keep in mind that R doesn't like spaces in variable names). With the data in a data frame called plant, with columns leafarea and dryweight, we simply do:

    cor(plant$leafarea, plant$dryweight)

And R gives us:

    [1] 0.8147143

To get the actual correlation test, do:

    cor.test(plant$leafarea, plant$dryweight)

And the result is:

        Pearson's product-moment correlation

    data:  plant$leafarea and plant$dryweight
    t = 4.6599, df = 11, p-value = 0.0006938
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.4785463 0.9425797
    sample estimates:
          cor
    0.8147143

And just for completeness' sake, here's Spearman's rank correlation:

    cor.test(plant$leafarea, plant$dryweight, method = "spearman")

        Spearman's rank correlation rho

    data:  plant$leafarea and plant$dryweight
    S = 74.6022, p-value = 0.00116
    alternative hypothesis: true rho is not equal to 0
    sample estimates:
          rho
    0.7950489

    Warning message:
    In cor.test.default(plant$leafarea, plant$dryweight, method = "spearman") :
      Cannot compute exact p-values with ties
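Finally, as a check on the "rank each column, then use the Pearson formula" recipe from the concluding remarks: Pearson's r computed on the ranks reproduces the Spearman rho above. (R's rank() assigns midranks to the tied 2.46 values by default, which is also what cor() does for the Spearman method, and is what triggers the ties warning in the test output.)

    leafarea  <- c(411, 550, 471, 393, 427, 431, 492, 371, 470, 419, 407, 489, 439)
    dryweight <- c(2.00, 2.46, 2.11, 1.89, 2.05, 2.30, 2.46, 2.06, 2.25, 2.07, 2.17, 2.32, 2.12)

    cor(rank(leafarea), rank(dryweight))           # Pearson's formula on the ranks
    cor(leafarea, dryweight, method = "spearman")  # same thing: about 0.795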