MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 5: Bivariate Correspondence Analysis


Correspondence analysis is a PCA-type method appropriate for analyzing categorical variables. The aim in bivariate correspondence analysis is to describe dependencies (correspondences) in a two-way contingency table.

Example, Education and Salary

In this lecture, we consider an example where we examine the dependencies between the categorical variables education and salary.

Contingency Tables

We consider a sample of size $n$ described by two qualitative variables, $x$ with categories $A_1, \ldots, A_J$ and $y$ with categories $B_1, \ldots, B_K$. The number of individuals having the modality (category) $A_j$ for the variable $x$ and the modality $B_k$ for the variable $y$ is denoted by $n_{jk}$. The number of individuals having the modality $A_j$ for the variable $x$ is given by
\[ n_{j.} = \sum_{k=1}^{K} n_{jk}, \]
the number of individuals having the modality $B_k$ for the variable $y$ is given by
\[ n_{.k} = \sum_{j=1}^{J} n_{jk}, \]
and
\[ n = \sum_{j=1}^{J} \sum_{k=1}^{K} n_{jk}. \]

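Contingency Tables

The data is often displayed as a two-way contingency table.

         $B_1$      $B_2$      ...   $B_K$      Total
$A_1$    $n_{11}$   $n_{12}$   ...   $n_{1K}$   $n_{1.}$
$A_2$    $n_{21}$   $n_{22}$   ...   $n_{2K}$   $n_{2.}$
...      ...        ...        ...   ...        ...
$A_J$    $n_{J1}$   $n_{J2}$   ...   $n_{JK}$   $n_{J.}$
Total    $n_{.1}$   $n_{.2}$   ...   $n_{.K}$   $n$

Table: Contingency table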

Example, Education and Salary

We consider a sample of size 1000 of two categorical variables. The variable $x$, Education, is divided into the categories $A_1$ (Primary School), $A_2$ (High School), and $A_3$ (University). The variable $y$, Salary, is divided into the categories $B_1$ (low), $B_2$ (average), and $B_3$ (high).

Example, Education and Salary

We display the Education and Salary data as a two-way contingency table.

         $B_1$   $B_2$   $B_3$   Total
$A_1$    150     40      10      200
$A_2$    190     350     60      600
$A_3$    10      110     80      200
Total    350     500     150     1000

Table: Contingency table

In this sample of 1000 observations, there are 150 individuals who have Primary School education and low salary.
In this sample of 1000 observations, there are 10 individuals who have Primary School education and high salary.
In this sample of 1000 observations, there are 110 individuals who have University education and average salary.
...
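The row totals $n_{j.}$, column totals $n_{.k}$, and grand total $n$ of the example can be recomputed with a few lines of Python. This is a minimal sketch using only the standard library; the names `counts` and `margins` are illustrative and do not come from the lecture.

```python
# Observed counts from the Education and Salary example
# (rows: education A_1..A_3, columns: salary B_1..B_3).
counts = [
    [150, 40, 10],
    [190, 350, 60],
    [10, 110, 80],
]


def margins(table):
    """Return (row totals n_j., column totals n_.k, grand total n)."""
    row_totals = [sum(row) for row in table]
    # zip(*table) iterates over the columns of the table.
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals, sum(row_totals)


row_totals, col_totals, n = margins(counts)
print(row_totals)  # [200, 600, 200]
print(col_totals)  # [350, 500, 150]
print(n)           # 1000
```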

Contingency Tables

The magnitude of the counts $n_{jk}$ depends on the total number of observations $n$. It is therefore preferable to analyze the contingency table in the form of joint relative frequencies. From the contingency table, it is straightforward to compute the associated relative frequency table $F$, where the elements of the contingency table are divided by the number of individuals $n$, leading to
\[ f_{jk} = \frac{n_{jk}}{n}. \]
The marginal relative frequencies are computed as
\[ f_{j.} = \sum_{k=1}^{K} f_{jk} \quad \text{and} \quad f_{.k} = \sum_{j=1}^{J} f_{jk}. \]

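Contingency Tables

         $B_1$      $B_2$      ...   $B_K$      Total
$A_1$    $f_{11}$   $f_{12}$   ...   $f_{1K}$   $f_{1.}$
$A_2$    $f_{21}$   $f_{22}$   ...   $f_{2K}$   $f_{2.}$
...      ...        ...        ...   ...        ...
$A_J$    $f_{J1}$   $f_{J2}$   ...   $f_{JK}$   $f_{J.}$
Total    $f_{.1}$   $f_{.2}$   ...   $f_{.K}$   1

Table: Frequency table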

Example, Education and Salary

         $B_1$   $B_2$   $B_3$   Total
$A_1$    0.15    0.04    0.01    0.20
$A_2$    0.19    0.35    0.06    0.60
$A_3$    0.01    0.11    0.08    0.20
Total    0.35    0.50    0.15    1

Table: Frequency table

In this sample, 15% of individuals have Primary School education and low salary.
In this sample, 1% of individuals have Primary School education and high salary.
In this sample, 11% of individuals have University education and average salary.
...
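The relative frequency table and its margins can be sketched as follows. The function name `relative_frequencies` is illustrative. As a design choice of this sketch, the margins are computed from the integer totals ($f_{j.} = n_{j.}/n$), which equals the sum of the $f_{jk}$ mathematically but avoids accumulating floating-point rounding error.

```python
# Observed counts from the Education and Salary example
# (rows: education A_1..A_3, columns: salary B_1..B_3).
counts = [
    [150, 40, 10],
    [190, 350, 60],
    [10, 110, 80],
]


def relative_frequencies(table):
    """Return (F, row margins f_j., column margins f_.k)."""
    n = sum(sum(row) for row in table)
    # Joint relative frequencies f_jk = n_jk / n.
    f = [[n_jk / n for n_jk in row] for row in table]
    # Margins from integer totals: f_j. = n_j. / n and f_.k = n_.k / n.
    f_row = [sum(row) / n for row in table]
    f_col = [sum(col) / n for col in zip(*table)]
    return f, f_row, f_col


f, f_row, f_col = relative_frequencies(counts)
for row in f:
    print(row)
print(f_row)  # [0.2, 0.6, 0.2]
print(f_col)  # [0.35, 0.5, 0.15]
```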

Frequencies

The frequency $f_{jk}$ is the estimate of
\[ p_{jk} = P(x \in A_j, y \in B_k), \]
and $f_{j.}$ and $f_{.k}$ are the estimates of
\[ p_{j.} = P(x \in A_j) \quad \text{and} \quad p_{.k} = P(y \in B_k), \]
respectively.

Tables of Conditional Frequencies

The proportions of individuals that belong to category $B_k$ for the variable $y$ among the individuals that have the modality $A_j$ for the variable $x$ form the so-called table of row profiles. The conditional frequencies for fixed $j$ and all $k$ are
\[ f_{k|j} = \frac{n_{jk}}{n_{j.}} = \frac{n_{jk}/n}{n_{j.}/n} = \frac{f_{jk}}{f_{j.}}. \]
The frequency $f_{k|j}$ is the estimate of $p_{k|j} = P(y \in B_k \mid x \in A_j)$.

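         $B_1$             $B_2$             ...   $B_K$             Total
$A_1$    $f_{11}/f_{1.}$   $f_{12}/f_{1.}$   ...   $f_{1K}/f_{1.}$   1
$A_2$    $f_{21}/f_{2.}$   $f_{22}/f_{2.}$   ...   $f_{2K}/f_{2.}$   1
...      ...               ...               ...   ...               ...
$A_J$    $f_{J1}/f_{J.}$   $f_{J2}/f_{J.}$   ...   $f_{JK}/f_{J.}$   1

Table: Row profiles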

Example, Education and Salary

         $B_1$   $B_2$   $B_3$   Total
$A_1$    0.75    0.20    0.05    1
$A_2$    0.32    0.58    0.10    1
$A_3$    0.05    0.55    0.40    1

Table: Row profiles

In this sample, 75% of the individuals that have Primary School education have low salary.
In this sample, 5% of the individuals that have Primary School education have high salary.
In this sample, 55% of the individuals that have University education have average salary.
...
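Row profiles amount to dividing each row of the contingency table by its own total. The following is a hedged sketch with an illustrative function name `row_profiles`; values are rounded to two decimals only for display.

```python
# Observed counts from the Education and Salary example
# (rows: education A_1..A_3, columns: salary B_1..B_3).
counts = [
    [150, 40, 10],
    [190, 350, 60],
    [10, 110, 80],
]


def row_profiles(table):
    """Conditional frequencies f_{k|j} = n_jk / n_j. for every row j."""
    return [[n_jk / sum(row) for n_jk in row] for row in table]


profiles = row_profiles(counts)
for row in profiles:
    # Each unrounded row sums to 1 up to floating-point error.
    print([round(p, 2) for p in row])
```

Working from the counts $n_{jk}$ rather than from the rounded relative frequencies avoids compounding rounding errors in the profiles.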

Tables of Conditional Frequencies

The proportions of individuals that belong to category $A_j$ for the variable $x$ among the individuals that have the modality $B_k$ for the variable $y$ form the table of column profiles. The conditional frequencies for fixed $k$ and all $j$ are
\[ f_{j|k} = \frac{n_{jk}}{n_{.k}} = \frac{n_{jk}/n}{n_{.k}/n} = \frac{f_{jk}}{f_{.k}}. \]
The frequency $f_{j|k}$ is the estimate of $p_{j|k} = P(x \in A_j \mid y \in B_k)$.

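         $B_1$             $B_2$             ...   $B_K$
$A_1$    $f_{11}/f_{.1}$   $f_{12}/f_{.2}$   ...   $f_{1K}/f_{.K}$
$A_2$    $f_{21}/f_{.1}$   $f_{22}/f_{.2}$   ...   $f_{2K}/f_{.K}$
...      ...               ...               ...   ...
$A_J$    $f_{J1}/f_{.1}$   $f_{J2}/f_{.2}$   ...   $f_{JK}/f_{.K}$
Total    1                 1                 ...   1

Table: Column profiles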

Example, Education and Salary

         $B_1$   $B_2$   $B_3$
$A_1$    0.43    0.08    0.07
$A_2$    0.54    0.70    0.40
$A_3$    0.03    0.22    0.53
Total    1       1       1

Table: Column profiles

In this sample, 43% of the individuals that have low salary have Primary School education.
In this sample, 7% of the individuals that have high salary have Primary School education.
In this sample, 22% of the individuals that have average salary have University education.
...
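Column profiles are the row profiles of the transposed table, which the sketch below exploits. The helper names `row_profiles` and `column_profiles` are illustrative, not part of the lecture.

```python
# Observed counts from the Education and Salary example
# (rows: education A_1..A_3, columns: salary B_1..B_3).
counts = [
    [150, 40, 10],
    [190, 350, 60],
    [10, 110, 80],
]


def row_profiles(table):
    """Divide every row by its own total."""
    return [[value / sum(row) for value in row] for row in table]


def column_profiles(table):
    """Conditional frequencies f_{j|k} = n_jk / n_.k for every column k."""
    # Transpose, take row profiles, and transpose back.
    transposed = [list(col) for col in zip(*table)]
    profiles_t = row_profiles(transposed)
    return [list(row) for row in zip(*profiles_t)]


col_prof = column_profiles(counts)
for row in col_prof:
    # Rounded for display; each unrounded column sums to 1
    # up to floating-point error.
    print([round(p, 2) for p in row])
```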

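The variables $x$ and $y$ are independent if and only if for all $j, k$ it holds that
\[ P(x \in A_j, y \in B_k) = P(x \in A_j) P(y \in B_k), \]
\[ P(x \in A_j \mid y \in B_k) = P(x \in A_j), \]
and
\[ P(y \in B_k \mid x \in A_j) = P(y \in B_k). \]
In the sample, these equalities correspond to the approximations
\[ f_{jk} \approx f_{j.} f_{.k}, \]
\[ f_{j|k} = \frac{f_{jk}}{f_{.k}} \approx f_{j.}, \]
and
\[ f_{k|j} = \frac{f_{jk}}{f_{j.}} \approx f_{.k}, \]
respectively.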
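The link between the product condition and the conditional conditions follows from the definition of conditional probability. A brief worked step, assuming $P(y \in B_k) > 0$ (an assumption that the text above does not state explicitly):

```latex
\[
P(x \in A_j \mid y \in B_k)
  = \frac{P(x \in A_j, y \in B_k)}{P(y \in B_k)}
  = \frac{P(x \in A_j)\, P(y \in B_k)}{P(y \in B_k)}
  = P(x \in A_j).
\]
```

The same argument with the roles of $x$ and $y$ exchanged gives the condition for $P(y \in B_k \mid x \in A_j)$.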

We can now define the theoretical relative frequencies $f_{jk}^{*}$ and the theoretical frequencies $n_{jk}^{*}$ under the assumption of independence as follows:
\[ f_{jk}^{*} = f_{j.} f_{.k} \quad \text{and} \quad n_{jk}^{*} = \frac{n_{j.} n_{.k}}{n} = f_{jk}^{*} n. \]

Example, Education and Salary

         $B_1$   $B_2$   $B_3$   Total
$A_1$    150     40      10      200
$A_2$    190     350     60      600
$A_3$    10      110     80      200
Total    350     500     150     1000

Table: Observed frequencies

         $B_1$   $B_2$   $B_3$   Total
$A_1$    70      100     30      200
$A_2$    210     300     90      600
$A_3$    70      100     30      200
Total    350     500     150     1000

Table: Theoretical frequencies under independence

Example, Education and Salary

         $B_1$   $B_2$   $B_3$   Total
$A_1$    0.15    0.04    0.01    0.20
$A_2$    0.19    0.35    0.06    0.60
$A_3$    0.01    0.11    0.08    0.20
Total    0.35    0.50    0.15    1

Table: Observed relative frequencies

         $B_1$   $B_2$   $B_3$   Total
$A_1$    0.07    0.10    0.03    0.20
$A_2$    0.21    0.30    0.09    0.60
$A_3$    0.07    0.10    0.03    0.20
Total    0.35    0.50    0.15    1

Table: Theoretical relative frequencies under independence
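The theoretical frequencies under independence depend only on the margins of the observed table: each cell is the product of its row total and column total divided by $n$. A minimal sketch, with the illustrative function name `expected_counts`:

```python
# Observed counts from the Education and Salary example
# (rows: education A_1..A_3, columns: salary B_1..B_3).
counts = [
    [150, 40, 10],
    [190, 350, 60],
    [10, 110, 80],
]


def expected_counts(table):
    """Theoretical frequencies under independence: n_j. * n_.k / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(counts)
for row in expected:
    print(row)
# [70.0, 100.0, 30.0]
# [210.0, 300.0, 90.0]
# [70.0, 100.0, 30.0]
```

Dividing every entry of `expected` by $n = 1000$ gives the theoretical relative frequencies.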

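The elements of the attraction repulsion matrix $D$ are given by
\[ d_{jk} = \frac{n_{jk}}{n_{jk}^{*}} = \frac{f_{jk}}{f_{jk}^{*}} = \frac{f_{jk}}{f_{j.} f_{.k}}, \]
where $n_{jk}^{*}$ and $f_{jk}^{*}$ denote the theoretical frequencies and theoretical relative frequencies under independence. Note that
\[ d_{jk} > 1 \iff f_{jk} > f_{j.} f_{.k} \iff f_{j|k} > f_{j.} \text{ and } f_{k|j} > f_{.k}, \]
\[ d_{jk} < 1 \iff f_{jk} < f_{j.} f_{.k} \iff f_{j|k} < f_{j.} \text{ and } f_{k|j} < f_{.k}. \]
If $d_{jk} > 1$, then the modalities (categories) $A_j$ and $B_k$ are said to be attracted to each other. If $d_{jk} < 1$, then the modalities $A_j$ and $B_k$ are said to repulse each other.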

Example, Education and Salary

         $B_1$   $B_2$   $B_3$
$A_1$    2.14    0.40    0.33
$A_2$    0.90    1.17    0.67
$A_3$    0.14    1.10    2.67

Table: Attraction repulsion indices

High salary is more frequent for people with University education ($d_{33} = 2.67 > 1$).
High salary is less frequent for people with Primary School education ($d_{13} = 0.33 < 1$).
Low salary is less frequent for people with University education ($d_{31} = 0.14 < 1$).
...
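The attraction repulsion indices can be computed directly from the counts as $d_{jk} = n_{jk} \, n / (n_{j.} n_{.k})$, which is the ratio of observed to theoretical frequencies. The sketch below uses an illustrative function name `attraction_repulsion` and rounds to two decimals only for display.

```python
# Observed counts from the Education and Salary example
# (rows: education A_1..A_3, columns: salary B_1..B_3).
counts = [
    [150, 40, 10],
    [190, 350, 60],
    [10, 110, 80],
]


def attraction_repulsion(table):
    """Indices d_jk = n_jk / n*_jk, where n*_jk = n_j. * n_.k / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [
        [n_jk * n / (r * c) for n_jk, c in zip(row, col_totals)]
        for row, r in zip(table, row_totals)
    ]


d = attraction_repulsion(counts)
for row in d:
    print([round(value, 2) for value in row])

# Cells with d_jk > 1 indicate attraction, cells with d_jk < 1 repulsion.
attracted = [(j + 1, k + 1) for j, row in enumerate(d)
             for k, value in enumerate(row) if value > 1]
print(attracted)  # [(1, 1), (2, 2), (3, 2), (3, 3)]
```

The pairs in `attracted` are 1-based indices $(j, k)$, so `(3, 3)` stands for University education and high salary.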

Next Week

Next week we will continue the discussion of correspondence analysis.
