4th Indian Institute of Astrophysics - PennState Astrostatistics School, July 2013, Vainu Bappu Observatory, Kavalur. Correlation and Regression

Rahul Roy, Indian Statistical Institute, Delhi.

Correlation

Consider a two-dimensional random vector $(X, Y)$. As seen earlier, the random vector is governed by a joint probability mass function

$$p(x, y) = \begin{cases} P(X = x_i, Y = y_j) & \text{for } x = x_i,\ y = y_j, \\ 0 & \text{otherwise,} \end{cases}$$

where $(X, Y)$ takes discrete values $(x_i, y_j)$ for $i = 1, \ldots, m$, $j = 1, \ldots, n$; or by a joint probability density function $f(x, y)$ satisfying

$$P(X \le a,\ Y \le b) = \int_{-\infty}^{a} \int_{-\infty}^{b} f(x, y)\, dy\, dx,$$

where $(X, Y)$ takes continuous values.

We restrict ourselves to the random vector $(X, Y)$ taking discrete values $(x_i, y_j)$ for $i = 1, \ldots, m$, $j = 1, \ldots, n$. Also, as seen earlier, the marginal probability mass functions of $X$ and $Y$ are, respectively,

$$p_X(x) = \begin{cases} \sum_{j=1}^{n} p(x_i, y_j) & \text{for } x = x_i, \\ 0 & \text{otherwise,} \end{cases} \qquad p_Y(y) = \begin{cases} \sum_{i=1}^{m} p(x_i, y_j) & \text{for } y = y_j, \\ 0 & \text{otherwise,} \end{cases}$$

and the means and variances of the marginals $X$ and $Y$ are

$$\mu_X = \sum_{i=1}^{m} x_i\, p_X(x_i), \qquad \mathrm{Var}(X) = \sum_{i=1}^{m} (x_i - \mu_X)^2\, p_X(x_i),$$

$$\mu_Y = \sum_{j=1}^{n} y_j\, p_Y(y_j), \qquad \mathrm{Var}(Y) = \sum_{j=1}^{n} (y_j - \mu_Y)^2\, p_Y(y_j).$$

Covariance and Correlation Coefficient

The covariance of $X$ and $Y$ is given by

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = \sum_{i=1}^{m} \sum_{j=1}^{n} (x_i - \mu_X)(y_j - \mu_Y)\, p(x_i, y_j) = \sum_{i=1}^{m} \sum_{j=1}^{n} x_i y_j\, p(x_i, y_j) - \mu_X \mu_Y.$$

Note that in the above the expectation is taken with respect to the joint distribution of $(X, Y)$.

The correlation coefficient between $X$ and $Y$ is given by

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} x_i y_j\, p(x_i, y_j) - \mu_X \mu_Y}{\sqrt{\left(\sum_{i=1}^{m} x_i^2\, p_X(x_i) - \mu_X^2\right) \left(\sum_{j=1}^{n} y_j^2\, p_Y(y_j) - \mu_Y^2\right)}}.$$
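To make these definitions concrete, here is a minimal sketch in Python (with numpy) that computes the marginals, means, variances, covariance, and $\rho_{XY}$ for a small joint pmf. The 2 x 3 table of probabilities is a made-up example, not data from the lecture.

```python
import numpy as np

# Hypothetical joint pmf: p[i, j] = P(X = x_i, Y = y_j); values are made up.
x = np.array([0.0, 1.0])                 # values x_1, ..., x_m
y = np.array([0.0, 1.0, 2.0])            # values y_1, ..., y_n
p = np.array([[0.10, 0.20, 0.10],
              [0.05, 0.25, 0.30]])       # all entries sum to 1

p_X = p.sum(axis=1)                      # marginal pmf of X
p_Y = p.sum(axis=0)                      # marginal pmf of Y

mu_X = (x * p_X).sum()
mu_Y = (y * p_Y).sum()
var_X = ((x - mu_X) ** 2 * p_X).sum()
var_Y = ((y - mu_Y) ** 2 * p_Y).sum()

# Cov(X, Y) = sum_{i,j} x_i y_j p(x_i, y_j) - mu_X mu_Y
cov_XY = (np.outer(x, y) * p).sum() - mu_X * mu_Y
rho_XY = cov_XY / np.sqrt(var_X * var_Y)

print(cov_XY, rho_XY)
```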

Linear regression

Suppose we have two-dimensional data regarding the following:

(i) age and height of children, or
(ii) amount of exercise and the heart rate of people,
(iii) income and age at death of people,
(iv) level of education and caste bias.

Note that the data now comes in pairs, e.g. the (height, weight) of the same person. Thus, unlike earlier, we do not have $(x_i, y_j)$ as a data point, but just $(x_i, y_i)$. In all these examples we expect a relation between the two variables: if a child is much shorter than another child, it is more likely that she is younger than the other one, etc. (Although in the last example, caste bias vis-a-vis level of education, the relationship is not that simple.) We want to use statistical techniques to determine whether there is a mathematical equation connecting the variables.

Such mathematical equations are important because they would allow us to predict and plan. For example, if we knew that a group of 9- and 10-year-olds were coming for a summer camp at this observatory, then we could have a fair idea of the lengths of beds and mattresses needed for their sleep. We first collect the data, tabulate it, and express it as a scatter diagram.

[Scatter diagram 1: y vs. x; x axis 0 to 12, y axis 0 to 25]

[Scatter diagram 2: y vs. x; x axis 0 to 12, y axis 0 to 450]

Linear regression

The first chart shown suggests a linear relationship between the X and Y variables.

[Two scatter diagrams of Y plotted against X]

[Diagram: Y against X, showing the fitted line with predicted point $(x_i, \hat{y}_i)$, observed point $(x_i, y_i)$, and error $E_i = |\hat{y}_i - y_i|$]

Note that here we took the error as $E_i = |\hat{y}_i - y_i|$, the absolute value being important because if the observed point were above the predicted point, we would not want a negative error. So clearly a criterion to obtain the line giving the predicted linear relation between the $x$ and the $y$ variables is that the sum of the errors must be minimized, i.e.,

$$\text{the line should minimize } \sum_{i=1}^{n} |\hat{y}_i - y_i|.$$

Unfortunately, minimizing $\sum_{i=1}^{n} |\hat{y}_i - y_i|$ is difficult and can only be done through numerical methods. A simpler method is to use calculus. First note that we took the absolute error so as not to have negative errors. Another way to avoid negative errors is to take the square, i.e., $E_i = (\hat{y}_i - y_i)^2$. In this case, our criterion should be that

$$\text{the line should minimize } \sum_{i=1}^{n} (\hat{y}_i - y_i)^2.$$
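As an aside, the "numerical methods" route for the absolute-error criterion is easy to sketch. The snippet below uses scipy's Nelder-Mead minimizer, which needs no derivatives, on made-up data; it is purely an illustration of how this non-differentiable criterion can be handled numerically.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up illustrative data, not from the lecture.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def total_abs_error(params):
    b, c = params
    return np.abs(b * x + c - y).sum()   # sum_i |yhat_i - y_i|

# Nelder-Mead copes with the non-differentiable objective.
res = minimize(total_abs_error, x0=[0.0, 0.0], method="Nelder-Mead")
b_hat, c_hat = res.x
print(b_hat, c_hat)
```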

[Diagram: Y against X, showing the fitted line with predicted point $(x_i, \hat{y}_i)$, observed point $(x_i, y_i)$, and squared error $E_i = (\hat{y}_i - y_i)^2$]

To do this is easy because the equation of a straight line is $y = bx + c$, where $b$ is the slope and $c$ the intercept. So $\hat{y}_i = bx_i + c$, and we have to minimize

$$E(b, c) := \sum_{i=1}^{n} (bx_i + c - y_i)^2.$$

So we need to set

$$\frac{\partial E(b, c)}{\partial b} = 0 \quad \text{and} \quad \frac{\partial E(b, c)}{\partial c} = 0$$

and solve for $b$ and $c$.

For the data $\{(x_i, y_i) : i = 1, \ldots, n\}$, calculus yields (Homework)

$$\hat{b} = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}, \qquad \hat{c} = \bar{y} - \hat{b}\, \bar{x}.$$
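These closed-form formulas translate directly into code. The following is a minimal sketch in Python with numpy; the function name fit_line is my own, not from the slides.

```python
import numpy as np

def fit_line(x, y):
    """Return (b_hat, c_hat) minimizing E(b, c) = sum_i (b*x_i + c - y_i)**2."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    b_hat = ((x * y).sum() - n * x_bar * y_bar) / ((x ** 2).sum() - n * x_bar ** 2)
    c_hat = y_bar - b_hat * x_bar
    return b_hat, c_hat
```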

x     y
1.2   101
0.8    92
1.0   110
1.3   120
0.7    90
0.8    82
1.0    93
0.6    75
0.9    91
1.1   105

       x      y      x²     xy      y²
       1.2    101    1.44   121.2   10201
       0.8     92    0.64    73.6    8464
       1.0    110    1.00   110.0   12100
       1.3    120    1.69   156.0   14400
       0.7     90    0.49    63.0    8100
       0.8     82    0.64    65.6    6724
       1.0     93    1.00    93.0    8649
       0.6     75    0.36    45.0    5625
       0.9     91    0.81    81.9    8281
       1.1    105    1.21   115.5   11025
Sum    9.4    959    9.28   924.8   93569

This gives $\hat{b} = 52.568$ and $\hat{c} = 46.486$, so the fitted line is $\hat{y} = 46.486 + 52.568x$.
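Applying the fit_line sketch above to this data reproduces the tabled result (assuming numpy, as before):

```python
x = np.array([1.2, 0.8, 1.0, 1.3, 0.7, 0.8, 1.0, 0.6, 0.9, 1.1])
y = np.array([101, 92, 110, 120, 90, 82, 93, 75, 91, 105], dtype=float)

b_hat, c_hat = fit_line(x, y)
print(round(b_hat, 3), round(c_hat, 3))   # 52.568 46.486, matching the slide
```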

Linear Correlation analysis

Suppose the data comes in pairs as in the case of linear regression, i.e., $\{(x_i, y_i) : i = 1, \ldots, n\}$, and we assign equal probability to each data point: $P(X = x_i, Y = y_i) = 1/n$. Note: this takes care of the inherent distribution of the data points, because if the values of the random variable $X$ are more concentrated around a given value $a$, say, then there will be more data points $x_i$ around the value $a$.

For example, if we are measuring the height and weight of individuals, then we will obtain more data points with the height value around 165 cm rather than around 180 cm, because there are more people whose heights are concentrated around 165 cm than there are basketball players.

In this case the correlation coefficient becomes

$$\rho_{XY} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} x_i y_j\, p(x_i, y_j) - \mu_X \mu_Y}{\sqrt{\left(\sum_{i=1}^{m} x_i^2\, p_X(x_i) - \mu_X^2\right) \left(\sum_{j=1}^{n} y_j^2\, p_Y(y_j) - \mu_Y^2\right)}} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n \bar{x}^2\right) \left(\sum_{i=1}^{n} y_i^2 - n \bar{y}^2\right)}}.$$

To emphasize that this is a sample correlation coefficient we write

$$r_{X,Y} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n \bar{x}^2\right) \left(\sum_{i=1}^{n} y_i^2 - n \bar{y}^2\right)}}.$$

Recall from regression analysis

$$\hat{b} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2},$$

while the correlation coefficient is

$$r_{X,Y} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n \bar{x}^2\right) \left(\sum_{i=1}^{n} y_i^2 - n \bar{y}^2\right)}}.$$

The denominator of $\hat{b}$, being $n$ times the sample variance of the $x$ values, is positive, while we take the positive square root in the denominator of $r_{X,Y}$. The numerator is the same for both. Thus the sign of the slope $\hat{b}$ is the same as the sign of the correlation coefficient $r_{X,Y}$. For the data given earlier we can now compute the correlation coefficient (Homework).
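A quick numerical cross-check of both claims (the homework value of $r_{X,Y}$ and the sign agreement with $\hat{b}$) can be sketched as follows, again assuming numpy; np.corrcoef serves as an independent check of the hand formula.

```python
import numpy as np

x = np.array([1.2, 0.8, 1.0, 1.3, 0.7, 0.8, 1.0, 0.6, 0.9, 1.1])
y = np.array([101, 92, 110, 120, 90, 82, 93, 75, 91, 105], dtype=float)

n = len(x)
num = (x * y).sum() - n * x.mean() * y.mean()
den = np.sqrt(((x ** 2).sum() - n * x.mean() ** 2) *
              ((y ** 2).sum() - n * y.mean() ** 2))
r_xy = num / den

b_hat = num / ((x ** 2).sum() - n * x.mean() ** 2)
print(r_xy, np.corrcoef(x, y)[0, 1])      # the two values agree
print(np.sign(b_hat) == np.sign(r_xy))    # True: same sign, as argued above
```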

Correlation and regression analysis are related: both deal with relationships among variables. The correlation coefficient is a measure of linear association between two variables. Values of the correlation coefficient are always between $-1$ and $+1$.

$\rho_{XY} = +1$ implies the variables are perfectly correlated in the positive linear sense. $\rho_{XY} = -1$ implies the variables are perfectly correlated in the negative linear sense. $\rho_{XY} = 0$ implies the variables are uncorrelated. This does not mean that the variables are independent.

Neither regression nor correlation analyses establish cause-and-effect relationships. They indicate only how, or to what extent, variables are associated with each other. The correlation coefficient measures only the degree of linear association between two variables. Any conclusion about a cause-and-effect relationship must be based on the judgment of the analyst.