Statistics I Chapter 3: Bivariate data analysis

Contents

3.1 Two-way tables
    Bivariate data
    Definition of a two-way table
    Joint absolute/relative frequency distribution
    Marginal frequency distribution
    Conditional frequency distribution
3.2 Correlation
    Graphical display: scatterplot
    Types of relationships between numerical variables
    Measures of linear association

Recommended reading

    Peña, D., Romo, J., Introducción a la Estadística para las Ciencias Sociales, Chapters 7, 8, 9
    Newbold, P., Estadística para los Negocios y la Economía (2009), Chapter 12

Bivariate data

Bivariate data are obtained when we observe two characteristics (numerical or categorical) of each individual/object.

Notation: a sample of n bivariate observations is written as n pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.

Summarizing bivariate data: two-way table

To summarize the data, we create a two-dimensional table with the classes/class intervals of one variable in the rows and the classes/class intervals of the other variable in the columns. The corresponding frequencies (absolute or relative) appear in the cells at the intersection of a given row and a given column. When one variable is qualitative, the two-way table is called a contingency table.

Two-way table: joint absolute frequency distribution

Two-way table with k rows and m columns:

                          Y
             y_1    ...   y_j    ...   y_m    | Total
      x_1    n_11   ...   n_1j   ...   n_1m   | n_1.
      ...
  X   x_i    n_i1   ...   n_ij   ...   n_im   | n_i.
      ...
      x_k    n_k1   ...   n_kj   ...   n_km   | n_k.
      Total  n_.1   ...   n_.j   ...   n_.m   | n

Notation:
    n_ij: absolute frequency of cell (i, j)
    n: sample size
    Row i total (n_i. in the table): $n_{i\cdot} = n_{i1} + n_{i2} + \dots + n_{im}$
    Column j total (n_.j in the table): $n_{\cdot j} = n_{1j} + n_{2j} + \dots + n_{kj}$

Joint absolute frequency distribution

Example: Sixty-four men were selected and the following two variables were considered:

    Y = Color of the man's mother's eyes {Bright, Dark}
    X = Color of the man's eyes {Bright, Dark}

$(x_1, y_1) = (B, B),\ (x_2, y_2) = (B, D),\ \dots,\ (x_{64}, y_{64}) = (D, D)$

                      Y
               Bright   Dark   | Total
  X   Bright     23      12    |   35
      Dark       17      12    |   29
      Total      40      24    |   64

    $(x_2, y_2) = (B, D)$ means that the 2nd man had Bright eyes and his mother had Dark eyes.
    23 of the men had Bright eyes and their mothers also had Bright eyes.
    29 of the men had Dark eyes.

Two-way table: relative joint frequency distribution

$f_{ij} = n_{ij} / n$ is the relative frequency of cell (i, j).

                          Y
             y_1    ...   y_j    ...   y_m    | Total
      x_1    f_11   ...   f_1j   ...   f_1m   | f_1.
      ...
  X   x_i    f_i1   ...   f_ij   ...   f_im   | f_i.
      ...
      x_k    f_k1   ...   f_kj   ...   f_km   | f_k.
      Total  f_.1   ...   f_.j   ...   f_.m   | 1

Notation:
    f_ij: relative frequency of cell (i, j)
    n: sample size
    Row i total (f_i. in the table): $f_{i\cdot} = f_{i1} + f_{i2} + \dots + f_{im}$
    Column j total (f_.j in the table): $f_{\cdot j} = f_{1j} + f_{2j} + \dots + f_{kj}$
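As a small illustration of the definition f_ij = n_ij / n, the following Python sketch converts the absolute frequencies of the eye-color example into relative frequencies. The variable names (`counts`, `rel`) are chosen here for illustration and are not part of the original slides.

```python
# Absolute joint frequencies from the eye-color example.
# Rows: X (man's eyes), columns: Y (mother's eyes); order is Bright, Dark.
counts = [
    [23, 12],  # X = Bright
    [17, 12],  # X = Dark
]

# Sample size n is the sum of all cells.
n = sum(sum(row) for row in counts)

# Relative joint frequencies: f_ij = n_ij / n.
rel = [[n_ij / n for n_ij in row] for row in counts]

for label, row in zip(["Bright", "Dark"], rel):
    print(label, [round(f, 3) for f in row])

# The relative frequencies of all cells add up to 1.
print("total:", round(sum(sum(row) for row in rel), 3))
```

The same two lines (`n = ...` and `rel = ...`) work for any table stored as a list of rows.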

Joint frequency distribution: absolute and relative

Example:

    Y = Weekly number of visits to the cinema
    X = Weekly number of visits to the theater

Each cell shows the absolute frequency followed by the relative frequency in parentheses.

                                      Y
            0             1             2             3             4
      0   12 (0.279)    5 (0.116)     4 (0.093)     2 (0.047)     1 (0.023)
  X   1    4 (0.093)    3 (0.070)     2 (0.047)     1 (0.023)     0 (0.000)
      2    3 (0.070)    3 (0.070)     2 (0.047)     0 (0.000)     0 (0.000)
      3    1 (0.023)    0 (0.000)     0 (0.000)     0 (0.000)     0 (0.000)

Marginal frequency distribution: absolute and relative

The marginal frequencies take us back to the univariate case (we ignore the other variable). We obtain them by looking at the margins of the joint frequency table (absolute or relative).

Specifically, the relative marginal frequencies of X are defined as

$f_{i\cdot} = f_{i1} + \dots + f_{ij} + \dots + f_{im}$

and those of Y are

$f_{\cdot j} = f_{1j} + \dots + f_{ij} + \dots + f_{kj}$

Absolute marginal frequencies are obtained analogously.
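The marginal definitions can be checked on the cinema/theater example. This is a minimal Python sketch that assumes the table is stored as a list of rows (X in rows, Y in columns); the names `row_totals`, `col_totals`, `f_x` and `f_y` are illustrative.

```python
# Absolute joint frequencies from the cinema/theater example.
# Rows: X = weekly theater visits (0..3); columns: Y = weekly cinema visits (0..4).
counts = [
    [12, 5, 4, 2, 1],
    [4, 3, 2, 1, 0],
    [3, 3, 2, 0, 0],
    [1, 0, 0, 0, 0],
]

n = sum(sum(row) for row in counts)  # sample size

# Absolute marginal frequencies: sum across each row (X) or each column (Y).
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# Relative marginal frequencies: divide the totals by n.
f_x = [t / n for t in row_totals]
f_y = [t / n for t in col_totals]

print("n =", n)
print("marginal of X:", [round(f, 3) for f in f_x])
print("marginal of Y:", [round(f, 3) for f in f_y])
```

Summing the rounded cell values printed in a table can differ from these results in the last digit, so it is safer to compute marginals from the absolute frequencies when they are available.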

Marginal frequency distribution

Example:

    Y = Number of workers
    X = Number of sales

                                  Y
                 1-24     25-49    50-74    75-99   | Total
      1-100      0.293    0.122    0.098    0.049   | 0.561
  X   101-200    0.098    0.073    0.049    0.024   | 0.244
      201-300    0.073    0.073    0.049    0.000   | 0.195
      Total      0.463    0.268    0.195    0.073   | 1

Marginal (relative) frequencies of Y:

    Workers    1-24     25-49    50-74    75-99
    f_.j       0.463    0.268    0.195    0.073

Marginal (relative) frequencies of X:

    Sales      1-100    101-200  201-300
    f_i.       0.561    0.244    0.195

Conditional frequency distribution

The conditional distribution of Y given that X = x_i (notation: f(Y | X = x_i)) is obtained by calculating, for the j-th class/class interval of Y,

$f_{ij} / f_{i\cdot}$

(we narrow our data set to the observations satisfying the condition).

The conditional distribution of X given that Y = y_j is obtained by calculating, for the i-th class/class interval of X,

$f_{ij} / f_{\cdot j}$

Exactly the same conditional distribution is obtained if we replace the relative frequencies by the absolute frequencies in the above definitions.

Conditional frequencies always add up to 1!

Conditional frequency distribution

Example:

    Y = Number of workers
    X = Number of sales

                                  Y
                 1-24     25-49    50-74    75-99   | Total
      1-100      0.293    0.122    0.098    0.049   | 0.561
  X   101-200    0.098    0.073    0.049    0.024   | 0.244
      201-300    0.073    0.073    0.049    0.000   | 0.195
      Total      0.463    0.268    0.195    0.073   | 1

Find the conditional frequency distribution for sales given that the number of workers is between 50 and 74.

The conditional frequency distribution f(X | 50 ≤ Y ≤ 74), where each value is the cell frequency divided by the column total (the figures agree up to rounding of the table entries):

    Sales                  1-100                  101-200                201-300
    f(x | 50 ≤ y ≤ 74)     0.500 (= 0.098/0.195)  0.250 (= 0.049/0.195)  0.250 (= 0.049/0.195)

Find the conditional frequency distribution for workers when the number of sales is between 101 and 200.

The conditional frequency distribution f(Y | 101 ≤ X ≤ 200), where each value is the cell frequency divided by the row total 0.244:

    Workers                1-24     25-49    50-74    75-99
    f(y | 101 ≤ x ≤ 200)   0.402    0.299    0.201    0.098
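The two conditional distributions in this example can be reproduced with a short Python sketch. It assumes the joint relative frequencies are typed in as printed (rounded to three decimals), and the helper names `conditional_x_given_y` and `conditional_y_given_x` are hypothetical, introduced only for this illustration.

```python
# Joint relative frequencies from the workers/sales example.
# Rows: X = sales (1-100, 101-200, 201-300).
# Columns: Y = workers (1-24, 25-49, 50-74, 75-99).
joint = [
    [0.293, 0.122, 0.098, 0.049],
    [0.098, 0.073, 0.049, 0.024],
    [0.073, 0.073, 0.049, 0.000],
]


def conditional_x_given_y(table, j):
    """Distribution of X given the j-th class of Y: f_ij / f_.j."""
    column = [row[j] for row in table]
    total = sum(column)  # marginal frequency of the j-th class of Y
    return [f / total for f in column]


def conditional_y_given_x(table, i):
    """Distribution of Y given the i-th class of X: f_ij / f_i."""
    row = table[i]
    total = sum(row)  # marginal frequency of the i-th class of X
    return [f / total for f in row]


# Sales given 50-74 workers (column index 2).
x_given_y = conditional_x_given_y(joint, 2)
# Workers given 101-200 sales (row index 1).
y_given_x = conditional_y_given_x(joint, 1)

print([round(f, 3) for f in x_given_y])
print([round(f, 3) for f in y_given_x])
```

Because each helper divides by the sum of the selected row or column, the resulting frequencies add up to 1 by construction.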

Graphics for bivariate numerical data: scatterplot

    size (m^2)   price (euro)
       107         162657
       114         165554
        91         154506
       100         162103
        96         158271
       107         166925
       104         161917
       100         161149
        80         152263
        81         151878
       105         165678
       111         166696
       108         165387
        97         161806
       106         163824

[Figure: scatterplot of the price of a house (euro) against the size of a house (m^2)]

Types of relationships

There are different ways in which two variables may be related:
    (i) absence of relationship
    (ii) negative linear relationship
    (iii) positive linear relationship
    (iv) nonlinear relationship

[Figure: four scatterplot panels titled NO RELATIONSHIP, NEGATIVE LINEAR, POSITIVE LINEAR and NONLINEAR]

Measures of linear association: covariance

We look for a descriptive measure that indicates whether there is a linear relationship between x and y.

Sample covariance is defined as

$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

where the sum can also be computed as $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}$.

Covariance and linear transformation: if we transform the variable Y according to Z = a + bY, then the covariance between X and the new variable Z is $s_{xz} = b\, s_{xy}$.

Properties of the covariance

    If the covariance is much larger than 0, there is a positive linear relationship between the variables.
    If the covariance is much smaller than 0, there is a negative linear relationship.
    If the covariance is small, then either (i) there is no linear relationship, or (ii) the relationship is nonlinear.

Drawbacks: What do "large" and "small" mean? What are the units of the covariance?
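The covariance formula, its shortcut form and the linear-transformation property can be checked numerically on the house size/price data from the scatterplot section. This is a sketch in plain Python; the function names and the constants `a` and `b` are assumptions made for the illustration, not values from the slides.

```python
# House data from the scatterplot example: size in m^2, price in euro.
size = [107, 114, 91, 100, 96, 107, 104, 100, 80, 81, 105, 111, 108, 97, 106]
price = [162657, 165554, 154506, 162103, 158271, 166925, 161917, 161149,
         152263, 151878, 165678, 166696, 165387, 161806, 163824]


def mean(values):
    return sum(values) / len(values)


def sample_cov(x, y):
    """s_xy = 1/(n-1) * sum of (x_i - x_bar)(y_i - y_bar)."""
    x_bar, y_bar = mean(x), mean(y)
    total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return total / (len(x) - 1)


def sample_cov_shortcut(x, y):
    """Same quantity via sum of x_i*y_i minus n*x_bar*y_bar."""
    n = len(x)
    total = sum(xi * yi for xi, yi in zip(x, y)) - n * mean(x) * mean(y)
    return total / (n - 1)


s_xy = sample_cov(size, price)
s_xy_shortcut = sample_cov_shortcut(size, price)

# Linear transformation Z = a + b*Y (a and b are arbitrary illustrative values).
a, b = 1000.0, 0.001
z = [a + b * p for p in price]
s_xz = sample_cov(size, z)

print("s_xy          =", s_xy)
print("s_xy shortcut =", s_xy_shortcut)
print("s_xz          =", s_xz, " b * s_xy =", b * s_xy)
```

The constant `a` has no effect on the covariance, while the factor `b` rescales it, which is why the value of the covariance depends on the units of measurement.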

Measures of linear association: correlation

Correlation overcomes the issues with covariance.

Sample correlation is defined as

$r_{(x,y)} = \frac{s_{xy}}{s_x s_y}$

Properties:

    The correlation is bounded: $-1 \le r_{(x,y)} \le 1$, so the terms "large" and "small" now make sense.
    It is unitless.

Measures of linear association: correlation

Example: Three variables are measured over 91 countries: female life expectancy, male life expectancy and gross national product. The covariances between the three pairs of variables are 105.15, 50066.04 and 57917.93, respectively. On the other hand, the correlations between the three pairs of variables are 0.98, 0.64 and 0.65, respectively. Therefore, even though the covariances between each life expectancy and gross national product are larger than the covariance between male and female life expectancies, the correlation is largest for the two life expectancies.
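To illustrate that the correlation is bounded and unitless, the following Python sketch computes r for the house size/price data and then recomputes it after changing the units of both variables. The conversion to square feet and to thousands of euro is an assumption made purely for the illustration, as are the function names.

```python
import math

# House data from the scatterplot example: size in m^2, price in euro.
size = [107, 114, 91, 100, 96, 107, 104, 100, 80, 81, 105, 111, 108, 97, 106]
price = [162657, 165554, 154506, 162103, 158271, 166925, 161917, 161149,
         152263, 151878, 165678, 166696, 165387, 161806, 163824]


def mean(values):
    return sum(values) / len(values)


def sample_cov(x, y):
    """Sample covariance with divisor n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return total / (len(x) - 1)


def sample_std(x):
    """Sample standard deviation: square root of the covariance of x with itself."""
    return math.sqrt(sample_cov(x, x))


def correlation(x, y):
    """r = s_xy / (s_x * s_y)."""
    return sample_cov(x, y) / (sample_std(x) * sample_std(y))


r = correlation(size, price)

# Change of units: m^2 -> square feet, euro -> thousands of euro.
size_sqft = [s * 10.7639 for s in size]
price_thousands = [p / 1000 for p in price]
r_rescaled = correlation(size_sqft, price_thousands)

print("r (original units) =", round(r, 3))
print("r (rescaled units) =", round(r_rescaled, 3))
```

Both printed values coincide, whereas the covariance of the rescaled data would differ from the original one by the product of the two scale factors.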