Relate Attributes and Counts

Similar documents
Correspondence Analysis

Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution

Frequency Distribution Cross-Tabulation

Topic 21 Goodness of Fit

Statistics for Managers Using Microsoft Excel

Multivariate T-Squared Control Chart

Discrete Multivariate Statistics

Nominal Data. Parametric Statistics. Nonparametric Statistics. Parametric vs Nonparametric Tests. Greg C Elvers

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami

The entire data set consists of n = 32 widgets, 8 of which were made from each of q = 4 different materials.

Lecture 9. Selected material from: Ch. 12 The analysis of categorical data and goodness of fit tests

Lecture 41 Sections Mon, Apr 7, 2008

Chapter 10. Chapter 10. Multinomial Experiments and. Multinomial Experiments and Contingency Tables. Contingency Tables.

Textbook Examples of. SPSS Procedure

Lecture 8: Summary Measures

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

LOOKING FOR RELATIONSHIPS

Statistics 3858 : Contingency Tables

Describing Contingency tables

Department of Economics. Business Statistics. Chapter 12 Chi-square test of independence & Analysis of Variance ECON 509. Dr.

Retrieve and Open the Data

Statistical Analysis for QBIC Genetics Adapted by Ellen G. Dow 2017

Research Methodology: Tools

Stat 101 Exam 1 Important Formulas and Concepts 1

CDA Chapter 3 part II

Chapter 8 Student Lecture Notes 8-1. Department of Economics. Business Statistics. Chapter 12 Chi-square test of independence & Analysis of Variance

Principal Components. Summary. Sample StatFolio: pca.sgp

Basic Business Statistics, 10/e

STAT Chapter 13: Categorical Data. Recall we have studied binomial data, in which each trial falls into one of 2 categories (success/failure).

Canonical Correlations

11-2 Multinomial Experiment

Biostatistics Presentation of data DR. AMEER KADHIM HUSSEIN M.B.CH.B.FICMS (COM.)

Didacticiel - Études de cas. In this tutorial, we show how to use TANAGRA ( and higher) for measuring the association between ordinal variables.

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty.

13.1 Categorical Data and the Multinomial Experiment

How To: Analyze a Split-Plot Design Using STATGRAPHICS Centurion

Chapte The McGraw-Hill Companies, Inc. All rights reserved.

8 Nominal and Ordinal Logistic Regression

The goodness-of-fit test Having discussed how to make comparisons between two proportions, we now consider comparisons of multiple proportions.

FREQUENCY DISTRIBUTIONS AND PERCENTILES

Example. χ 2 = Continued on the next page. All cells

Section 2.1 ~ Data Types and Levels of Measurement. Introduction to Probability and Statistics Spring 2017

Chi-Squared Tests. Semester 1. Chi-Squared Tests

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that?

A SHORT INTRODUCTION TO PROBABILITY

The scatterplot is the basic tool for graphically displaying bivariate quantitative data.

NON-PARAMETRIC STATISTICS * (

STAT 705: Analysis of Contingency Tables

PS5: Two Variable Statistics LT3: Linear regression LT4: The test of independence.

CHI SQUARE ANALYSIS 8/18/2011 HYPOTHESIS TESTS SO FAR PARAMETRIC VS. NON-PARAMETRIC

Nonlinear Regression. Summary. Sample StatFolio: nonlinear reg.sgp

Chapter 26: Comparing Counts (Chi Square)

Distribution Fitting (Censored Data)

ij i j m ij n ij m ij n i j Suppose we denote the row variable by X and the column variable by Y ; We can then re-write the above expression as

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Vehicle Freq Rel. Freq Frequency distribution. Statistics

How To: Deal with Heteroscedasticity Using STATGRAPHICS Centurion

Readings Howitt & Cramer (2014) Overview

An area chart emphasizes the trend of each value over time. An area chart also shows the relationship of parts to a whole.

Psych Jan. 5, 2005

ST3241 Categorical Data Analysis I Two-way Contingency Tables. 2 2 Tables, Relative Risks and Odds Ratios

Inference for Categorical Data. Chi-Square Tests for Goodness of Fit and Independence

Factor Analysis. Summary. Sample StatFolio: factor analysis.sgp

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

Readings Howitt & Cramer (2014)

Lecture 28 Chi-Square Analysis

Data Mining 4. Cluster Analysis

7 Ordered Categorical Response Models

Hypothesis Testing: Chi-Square Test 1

Means or "expected" counts: j = 1 j = 2 i = 1 m11 m12 i = 2 m21 m22 True proportions: The odds that a sampled unit is in category 1 for variable 1 giv

Hypothesis Testing hypothesis testing approach

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Categorical Data Analysis Chapter 3

Last Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics

Is there a connection between gender, maths grade, hair colour and eye colour? Contents

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Optimal exact tests for complex alternative hypotheses on cross tabulated data

Ridge Regression. Summary. Sample StatFolio: ridge reg.sgp. STATGRAPHICS Rev. 10/1/2014

Principal Component Analysis for Mixed Quantitative and Qualitative Data

11 CHI-SQUARED Introduction. Objectives. How random are your numbers? After studying this chapter you should

Multiple Choice. Chapter 2 Test Bank

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

Psych 230. Psychological Measurement and Statistics

Exam details. Final Review Session. Things to Review

Introduction to Statistics

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

Module 10: Analysis of Categorical Data Statistics (OA3102)

Statistics lecture 3. Bell-Shaped Curves and Other Shapes

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Chapter 19. Agreement and the kappa statistic

Chi square test of independence

8/4/2009. Describing Data with Graphs

Lecture 25: Models for Matched Pairs

For instance, we want to know whether freshmen with parents of BA degree are predicted to get higher GPA than those with parents without BA degree.

4 Multicategory Logistic Regression

Tests for Two Correlated Proportions in a Matched Case- Control Design

ST3241 Categorical Data Analysis I Two-way Contingency Tables. Odds Ratio and Tests of Independence

Transcription:

Relate Attributes and Counts This procedure is designed to summarize data that classifies observations according to two categorical factors. The data may consist of either: 1. Two Attribute variables. In such cases, the procedure will identify all unique values in both variables and calculate c i, j = frequency of occurrence of category i for first variable and category j for second variable, i = 1, 2,, r and j = 1, 2,, c It will then construct an r by c contingency table containing the frequencies. The leftmost variable selected defines the rows, while the other variable defines the columns. 2. c Count columns, each containing r counts. These counts are used to create the entries in the table directly. In this case, the column names will be used as labels for each column of the two-way table. A Labels column may also be selected to provide labels for each row of the table. The charts created are: Clustered Barchart plots the frequencies as a clustered barchart, with one bar for each cell in the table. Stacked Barchart plots the frequencies as a stacked barchart, with one bar for each column in the table. Percentage Barchart plots the frequencies as a stacked barchart, subdividing each bar according to the percentages represented by each column. Mosaic Plot plots the frequencies in each cell in a manner that makes the area of each bar proportional to the cell frequency. Access Highlight: two or more Count columns or two Attributes column. If Count columns are selected, a Labels column may also be selected to supply labels for the row categories. Select: Relate from the main menu. Output Page 1: A clustered barchart. Output Page 2: A stacked barchart. Output Page 3: A percentage barchart. Output Page 4: A mosaic plot. 2006 by StatPoint, Inc. Describe Attributes and Counts - 1

Sample Data #1: Untabulated Data STATGRAPHICS Mobile Rev. 4/27/2006 The file opinion.sgm contains information about 50 individuals who were each asked to sample a product and rate it as Excellent, Good, Fair or Poor. The gender of each individual was also recorded. A portion of the data is shown below: Row Gender Rating 1 Male Excellent 2 Female Good 3 Male Fair 4 Male Good 5 Female Poor 6 Female Good 7 Female Excellent 8 Female Fair IMPORTANT NOTE: If an Attribute variable is Character, its categories will be ordered according to their first occurrence in the data column. If order is important for a variable, be certain that the values appear in the desired order. Note that in the above table, the values for Rating appear in the order Excellent, Good, Fair and then Poor, which is the desired order. If an Attribute variable is numeric, the numeric values will be used to determine the order. Clustered Barchart If two Attribute columns are selected, the procedure begins by finding all unique values in both columns. It then calculates the frequency of occurrence of each pair of values. If there are r unique values in the first column and c unique values in the second column, the frequencies c i, j can be thought of as forming an r by c contingency table. A clustered barchart plots the frequencies, grouping together all values for each value in the first variable. 2006 by StatPoint, Inc. Describe Attributes and Counts - 2

At the bottom of the display are values of the uncertainty coefficient. This coefficient measures the proportional reduction in entropy (variance) of one classification variable if the value of the other is known. It ranges from a low of 0 (complete independence between rows and columns) to 1 (no conditional variation). Three values are calculated: 1. Rows dep.: calculated on the assumption that the first classification variable (Gender) is dependent on the second (Rating). 2. Cols dep.: calculated on the assumption that the second classification variable (Rating) is dependent on the first (Gender). 3. Symmetric: a combination of the two measures above. In the current example, the Cols dep. coefficient would be most appropriate, since Gender might have an effect on Rating. The reduction in entropy, however, is only about 4%. The uncertainty coefficient is typically used to provide a measure of association when the variables are nominal (have no natural ordering). 2006 by StatPoint, Inc. Describe Attributes and Counts - 3

Stacked Barchart The stacked barchart places the bars on top of each other, rather than in clusters. The output also displays the value of a statistic called Gamma. Gamma is appropriate when both classification variables are ordinal (have a natural ordering). It is based on calculating the number of discordant and concordant pairs, where a pair of observations are concordant if a member of the pair has a higher rating on both classification variables and discordant if one member is higher on one variable and lower on the other. Gamma ranges between -1 (all pairs discordant) to +1 (all pairs concordant). A value close to 0 implies independence between the two classification factors. 2006 by StatPoint, Inc. Describe Attributes and Counts - 4

Percentage Barchart The percentage barchart is similar to a stacked barchart. However, the vertical axis runs from 0% to 100%. The height of the bars within each column represents the conditional distribution of the second variable within the first. The results of a chi-squared test for independence between rows and columns is also shown. A small P-value implies that the two classification variables are not independent. NOTE: The P-value may not be reliable for sparse data. i.e., data with small expected counts in many cells. 2006 by StatPoint, Inc. Describe Attributes and Counts - 5

Mosaic Plot The mosaic plot is similar to a percentage barchart. However, the width of each column of bars is scaled according to the relative percentage of data in each category of the first variable. The result is a display in which the area of each bar is proportional to the frequencies c,. i j Since there are more males than females in the sample data, the first bar is wider than the second. 2006 by StatPoint, Inc. Describe Attributes and Counts - 6

Sample Data #2: Tabulated Data When the data to be analyzed has already been tabulated, the two-way table of counts can be placed into adjacent columns of the datasheet. For example, the file table.sgm contains the same data as analyzed above, except it is placed in four Count variables. Row Gender Excellent Good Fair Poor 1 Male 12 25 5 3 2 Female 5 14 10 6 Select the columns containing the counts, and a Labels column containing the row labels. The select Relate from the main menu. The same output will be generated as in the first sample above. Calculations r = number of categories in first variable (rows) c = number of categories in second variable (columns) c, = frequency of occurrence of category i for first variable and category j for second variable, i i j = 1, 2,, r and j = 1, 2,, c Row Totals c R i = c ij j= 1 (1) Column Totals r C j = c ij i= 1 (2) Total Count n = r c c ij i= 1 j= 1 (3) 2006 by StatPoint, Inc. Describe Attributes and Counts - 7

Uncertainty Coefficient STATGRAPHICS Mobile Rev. 4/27/2006 Rows dependent: U U ( X ) + U ( Y ) U ( XY ) = (4) U ( X ) Columns dependent: U U ( X ) + U ( Y ) U ( XY ) = (5) U ( Y ) U ( X ) + U ( Y ) U ( XY ) Symmetric: U = 2 (6) U ( X ) + U ( Y ) where Gamma r Ri U ( X ) = n c C j U ( Y ) = n Ri log i= 1 n C log j j= 1 n r c cij U ( XY ) = n c log ij i= 1 j= 1 n for c ij > 0 (9) (7) (8) ( P Q) γ = (10) P + Q where P = number of concordant pairs, Q = number of discordant pairs Chi-Squared Χ 2 = r c ( cij eij ) i= 1 j= 1 e ij 2 (11) where RiC j eij = n (12) Degrees of freedom: v = (r-1)(c-1) (13) 2006 by StatPoint, Inc. Describe Attributes and Counts - 8