Conceptual Models for Visualizing Contingency Table Data

Size: px
Start display at page:

Download "Conceptual Models for Visualizing Contingency Table Data"

Transcription

1 Conceptual Models for Visualizing Contingency Table Data Michael Friendly York University 1 Introduction For some time I have wondered why graphical methods for categorical data are so poorly developed and little used compared with methods for quantitative data. For quantitative data, graphical methods are commonplace adjuncts to all aspects of statistical analysis, from the basic display of data in a scatterplot, to diagnostic methods for assessing assumptions and finding transformations, to the final presentation of results. In contrast, graphical methods for categorical data are still in infancy. There are not many methods, and those that are available in the literature are not accessible in common statistical software; consequently, they are not widely used. What has made this contrast puzzling is the fact that the statistical methods for categorical data are in many respects discrete analogs of corresponding methods for quantitative data: log-linear models and logistic regression, for example, are such close parallels of analysis of variance and regression models that they can all be seen as special cases of generalized linear models. Several possible explanations for this apparent puzzle may be suggested. First, it may just be that those who have worked with and developed methods for categorical data are more comfortable with tabular data, or that frequency tables, representing sums over all cases in a dataset, are more easily apprehended in tables than quantitative data. Second, it may be argued that graphical methods for quantitative data are easily generalized so, for example, the scatterplot for two variables provides the basis for visualizing any number of variables in a scatterplot matrix; available graphical methods for categorical data tend to be more specialized. However, a more fundamental reason may be, as I will try to show here, that quantitative data display relies on e well-known natural visual mapping in which a magnitude is depicted by length or position along a scale; for categorical data, it will be seen that a count is more naturally displayed by an area or by the visual density of an area. Author s address: Psychology Department, York University, Toronto, Ontario, Canada M3J 1P3. friendly@yorku.ca 1

2 2 SOME GRAPHICAL METHODS FOR CONTINGENCY TABLES 2 2 Some graphical methods for contingency tables Several schemes for representing contingency tables graphically are based on the fact that when the row and column variables are independent, the estimated expected frequencies, m ij, are products of the row and column totals (divided by the grand total). Then each cell can be represented by a rectangle whose area shows the cell frequency, n ij, or deviation from independence. 2.1 Sieve diagrams Table 1 shows data on the relation between hair color and eye color among 592 subjects (students in a statistics course) collected by Snee (1974). The Pearson χ 2 for these data is with nine degrees of freedom, indicating substantial departure from independence. The question is how to understand the nature of the association between hair and eye color. [Table 1 about here.] For any two-way table, the expected frequencies under independence can be represented by rectangles whose widths are proportional to the total frequency in each column, n +j, and whose heights are proportional to the total frequency in each row, n i+ ; the area of each rectangle is then proportional to m ij. Figure 1 shows the expected frequencies for the hair and eye color data. [Figure 1 about here.] Riedwyl and Schüpbach (1983, 1994) proposed a sieve diagram (later called a parquet diagram) based on this principle. In this display the area of each rectangle is proportional to the expected frequency and the observed frequency is shown by the number of squares in each rectangle. Hence, the difference between observed and expected frequencies appears as the density of shading, using color to indicate whether the deviation from independence is positive or negative. (In monochrome versions, positive residuals are shown by solid lines, negative by broken lines.) The sieve diagram for hair color and eye color is shown in Figure 2. [Figure 2 about here.] 2.2 Mosaic displays for n-way tables The mosaic display, proposed by Hartigan & Kleiner (1981) and extended by Friendly (1994a), represents the counts in a contingency table directly by tiles whose area is proportional to the cell frequency. This display generalizes readily to n-way tables and can be used to display the residuals from various log-linear models. One form of this plot, called the condensed mosaic display, is similar to a divided bar chart. The width of each column of tiles in Figure 3 is proportional to the marginal frequency of hair colors; the height of each tile is determined by the conditional probabilities of eye color in each column. Again, the area of each box is proportional to the cell frequency, and complete independence is shown when the tiles in each row all have the same height.

3 2 SOME GRAPHICAL METHODS FOR CONTINGENCY TABLES 3 [Figure 3 about here.] Enhanced mosaics The enhanced mosaic display (Friendly, 1992b, 1994a) achieves greater visual impact by using color and shading to reflect the size of the residual from independence and by reordering rows and columns to make the pattern more coherent. The resulting display shows both the observed frequencies and the pattern of deviations from a specified model. Figure 4 gives the extended the mosaic plot, showing the standardized (Pearson) residual from independence, d ij = (n ij m ij )/ m ij by the color and shading of each rectangle: cells with positive residuals are outlined with solid lines and filled with slanted lines; negative residuals are outlined with broken lines and filled with grayscale. The absolute value of the residual is portrayed by shading density: cells with absolute values less than 2 are empty; cells with d ij 2 are filled; those with d ij 4 are filled with a darker pattern. 1 Under the assumption of independence, these values roughly correspond to two-tailed probabilities p <.05 and p <.0001 that a given value of d ij exceeds 2 or 4. For exploratory purposes, we do not usually make adjustments (e.g., Bonferroni) for multiple tests because the goal is to display the pattern of residuals in the table as a whole. However, the number and values of these cutoffs can be easily set by the user. [Figure 4 about here.] When the row or column variables are unordered, we are also free to rearrange the corresponding categories in the plot to help show the nature of association. For example, in Figure 4, the eye color categories have been permuted so that the residuals from independence have an opposite-corner pattern, with positive values running from bottom-left to top-right corners, negative values along the opposite diagonal. Coupled with size and shading of the tiles, the excess in the black-brown and blond-blue cells, together with the underrepresentation of brown-eyed blonds and people with black hair and blue eyes is now quite apparent. Alhough the table was reordered on the basis of the d ij values, both dimensions in Figure 4 are ordered from dark to light, suggesting an explanation for the association. (In this example the eye-color categories could be reordered by inspection. A general method (Friendly, 1994a) uses category scores on the largest correspondence analysis dimension.) Multi-way tables Like the scatterplot matrix for quantitative data, the mosaic plot generalizes readily to the display of multi-dimensional contingency tables. Imagine that each cell of the two-way table for hair and eye color is further classified by one or more additional variables sex and level of education, for example. Then each rectangle can be subdivided horizontally to show the proportion of males and females in that cell, and each of those horizontal portions can be subdivided vertically to show the proportions of people at each educational level in the hair-eye-sex group. 1 Color versions use blue and red at varying lightness to portray both sign and magnitude of residuals.

4 2 SOME GRAPHICAL METHODS FOR CONTINGENCY TABLES 4 Fitting models When three or more variables are represented in the mosaic, we can fit several different models of independence and display the residuals from each model. We treat these models as null or baseline models, which may not fit the data particularly well. The deviations of observed frequencies from expected ones, displayed by shading, will often suggest terms to be added to to an explanatory model that achieves a better fit. Complete independence: The model of complete independence asserts that all joint probabilities are products of the one-way marginal probabilities: π ijk = π i++ π +j+ π ++k (1) for all i,j,k in a three-way table. This corresponds to the log-linear model [A][B][C]. Fitting this model puts all higher terms, and hence all association among the variables, into the residuals. Joint independence: Another possibility is to fit the model in which variable C is jointly independent of variables A and B, π ijk = π ij+ π ++k. (2) This corresponds to the log-linear model [AB][C]. Residuals from this model show the extent to which variable C is related to the combinations of variables A and B but they do not show any association between A and B. For example, with the data from Table 1 broken down by sex, fitting the model [Hair- Eye][Sex] allows us to see the extent to which the joint distribution of hair-color and eye-color is associated with sex. For this model, the likelihood-ratio G 2 is on 15 df (p =.178), indicating an acceptable overall fit. The three-way mosaic, shown in Figure 5, highlights two cells: among blue-eyed blonds, there are more females (and fewer males) than would be the case if hair color and eye color were jointly independent of sex. Except for these cells hair color and eye color appear unassociated with sex. [Figure 5 about here.] 2.3 Fourfold Display A third graphical method based on the use of area as the visual mapping of cell frequency is the fourfold display (Friendly, 1994b, 1994c) designed for the display of 2 2 (or 2 2 k) tables. In this display the frequency n ij in each cell of a fourfold table is shown by a quarter circle, whose radius is proportional to n ij, so the area is proportional to the cell count. For a single 2 2 table the fourfold display described here also shows the frequencies by area, but scaled in a way that depicts the sample odds ratio, ˆθ = (n 11 /n 12 ) (n 21 /n 22 ). An association between the variables (θ 1) is shown by the tendency of diagonally opposite cells in one direction to differ in size from those in the opposite direction, and the display uses color or shading to show this direction. Confidence rings for the observed θ allow a

5 2 SOME GRAPHICAL METHODS FOR CONTINGENCY TABLES 5 visual test of the hypothesis H 0 : θ = 1. They have the property that the rings for adjacent quadrants overlap iff the observed counts are consistent with the null hypothesis. As an example, Figure 6 shows aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex. At issue is whether the data show evidence of sex bias in admission practices (Bickel et al., 1975). The figure shows the cell frequencies numerically in the corners of the display. Thus there were 2691 male applicants, of whom 1193 (44.4%) were admitted, compared with 1855 female applicants of whom 557 (30.0%) were admitted. Hence the sample odds ratio, Odds (Admit Male) / (Admit Female) is 1.84 indicating that males were almost twice as likely to be admitted. [Figure 6 about here.] The frequencies displayed graphically by shaded quadrants in Figure 6 are not the raw frequencies. Instead, the frequencies have been standardized (by iterative proportional fitting) so that all table margins are equal, while preserving the odds ratio. Each quarter circle is then drawn to have an area proportional to this standardized cell frequency. This makes it easier to see the association between admission and sex without being influenced by the overall admission rate or the differential tendency of males and females to apply. With this standardization the four quadrants will align when the odds ratio is 1, regardless of the marginal frequencies. The shaded quadrants in Figure 6 do not align and the 99% confidence rings around each quadrant do not overlap, indicating that the odds ratio differs significantly from 1. The width of the confidence rings gives a visual indication of the precision of the data. Multiple strata In the case of a 2 2 k table, the last dimension typically corresponds to strata or populations, and it is typically of interest to see if the association between the first two variables is homogeneous across strata. The fourfold display allows easy visual comparison of the pattern of association between two dichotomous variables across two or more populations. For example, the admissions data shown in Figure 6 were obtained from a sample of six departments; Figure 7 displays the data for each department. The departments are labeled so that the overall acceptance rate is highest for Department A and decreases steadily to Department F. Again, each panel is standardized to equate the marginals for sex and admission. This standardization also equates for the differential total applicants across departments, facilitating visual comparison. [Figure 7 about here.] Figure 7 shows that, for five of the six departments, the odds of admission is approximately the same for both men and women applicants. Department A appears to differs from the others, with women approximately 2.86 (= (313/19)/(512/89)) times as likely to gain admission. This appearance is confirmed by the confidence rings, which in Figure 7 are joint 99% intervals for θ c, c = 1,...,k. This result, which contradicts the display for the aggregate data in Figure 6, is a nice example of Simpson s paradox. The resolution of this contradiction can be found in the

6 3 CONCEPTUAL MODELS FOR VISUAL DISPLAYS 6 large differences in admission rates among departments. Men and women apply to different departments differentially, and in these data women apply in larger numbers to departments that have a low acceptance rate. The aggregate results are misleading because they falsely assume men and women are equally likely to apply in each field. (This explanation ignores the possibility of structural bias against women, e.g., lack of resources allocated to departments that attract women applicants.) 3 Conceptual Models for Visual Displays Visual representation of data depends fundamentally on an appropriate visual scheme for mapping numbers into graphic patterns (Bertin 1983). The widespread use of graphical methods for quantitative data relies on the availability of a natural visual mapping: magnitude can be represented by length, as in a bar chart, or by position along a scale, as in dot charts and scatterplots. One reason for the relative paucity of graphical methods for categorical data may be that a natural visual mapping for frequency data is not so apparent. And, as I have just shown, the mapping of frequency to area appears to work well for categorical data. Closely associated with the idea of a visual metaphor is a conceptual model that helps you interpret what is shown in a graph. A good conceptual model for a graphical display will have deeper connections with underlying statistical ideas as well. In this section I will describe conceptual models for both quantitative and frequency data that have these properties and help to elucidate the differences between their graphical displays. The discussion borrows from Sall (1991a, 1991b), Farebrother (1987) and Friendly (1995). 3.1 Quantitative Data The simplest conceptual model for quantitative data is the balance beam, often used in introductory statistics texts to illustrate the sample mean as the point along an axis where the positive and negative deviations balance. Springs A more powerful model (Sall, 1991a) likens observations to fixed points connected to a movable junction by springs of equal spring constant, k 1/σ. By this model, each observation exerts a force proportional to (X µ)/σ on the junction, and the sample mean is seen as the point which not only balances the forces, but also minimizes the total potential energy in the system. The spring model is more powerful because it provides a basis for understanding a wide class of both graphical displays and statistical principles for quantitative data. For example, least squares regression can be represented as shown in Figure 8, where the points are again fixed and attached to a movable rod by unit length, equally stiff springs. If the springs are constrained to be kept vertical, the rod, when released, moves to the position of balance and

7 3 CONCEPTUAL MODELS FOR VISUAL DISPLAYS 7 minimum potential energy, the least square solution. The normal equations, ei = xi e i = n (y i a bx i ) = 0 (3) i=1 n (y i a bx i )x i = 0 (4) i=1 are seen as conditions that the vertical forces balance (Eq. (3)), and the rotational moments about the intercept (0, a) balance (Eq. (4)). Letting X = [1, x], the normal equations provide the derivation X e = X (y Xb) = 0 X y = X Xb b = (X X) 1 X y, but springs get it right without inverting a matrix. [Figure 8 about here.] Neat explanations by springs The appeal of the spring model lies in the intuitive explanations it provides for many statistical phenomena and the understanding it can bring to our perception of graphical displays. Without much explanation here (but see Sall, 1991a and Farebrother, 1987) I will simply list some of these. least squares minimum energy orthogonality balancing of forces hypothesis testing: To test the hypothesis H 0 : β = 0, simply force the rod to the horizontal position, measuring the additional energy required to make it obey the hypothesis. This is the regression sum of squares, SSR (X). sample size and power: Adding more points (more springs) causes the position of the rod (and thus the coefficient estimates) to be held more tightly. It is therefore easier to reject the null hypothesis. error variance: The spring constant is inversely proportional to σ, so smaller error variance means stiffer springs, which require more force to impose the null hypothesis. outliers: Points far from the regression line pull the line towards themselves in proportion to the square of their vertical distance from the line. leverage: Observations exert a force on the regression line proportional to the square of their horizontal distance from x. Influence is the product of the force of leverage and the vertical force from the residual. multiple regression: With two (or more) predictors, fit a plane (or hyperplane) to points with vertical springs. partial F tests: With two predictors, fit a plane to (X 1,X 2 ). How much more energy is required to make the plane flat in the x 2 direction? This is the extra sum of squares, SSR(X 2 X 1 )

8 3 CONCEPTUAL MODELS FOR VISUAL DISPLAYS 8 robust estimation: Use springs which deform (so they exert little or no force) when stretched past some limit. smoothing splines: Make the regression line from a material which is flexible but stiff. principal components: If we relax the restriction that the springs are constrained to remain vertical, the rod moves to the position of the first principal component. 3.2 Categorical Data For categorical data, we need a visual analog for the sample frequency in k mutually exclusive and exhaustive categories. Consider first the one-way marginal frequencies of hair color from Table 1. Urn model The simplest physical model represents the hair color categories by urns containing marbles representing the observations (Figure 9). This model is sometimes used in texts to describe multinomial sampling, and provides a visual representation that equates the count n i with the area filled in each urn, as in the familiar bar chart. (When the urns are of equal width, count is also reflected by height, but in the general case, count is proportional to area.) However, the urn model is a static one and provides no further insights. It does not relate to the concept of likelihood or to the constraint that the probabilities sum to 1. [Figure 9 about here.] Pressure and Energy A dynamic model gives each observation a force (Figure 10). Consider the observations in a given category (red hair, say) as molecules of an ideal gas confined to a cylinder whose volume can be varied with a movable piston (Sall, 1991b), set up so that a probability of 1.0 corresponds to ambient pressure, with no force exerted on the piston. An actual probability of red hair equal to p means that the same number of observations are squeezed down to a chamber of height p. By Boyle s law (that pressure volume is constant) the pressure is proportional to 1/p. In the figure, pressure is shown by observation density, the number of observations per unit area. Hence, the graphical metaphor is that a count can be represented visually by observation density when the count is fixed and area is varied (or by area when the observation density is fixed as in Figure 9.) The work done on the gas (or potential energy imparted to it) by compressing a small distance δy is the force on the piston times δy, which equals the pressure times the change in volume. Hence, the potential energy of a gas at height=p is 1 (1/y)dy, which is log(p), p so the energy in this model corresponds to negative log likelihood. [Figure 10 about here.]

9 3 CONCEPTUAL MODELS FOR VISUAL DISPLAYS 9 Fitting probabilities: Minimum energy, balanced forces Maximum likelihood estimation means literally finding the values, ˆπ i, of the parameters under which the observed data would have the highest probability of occurrence. We take derivatives of the (log-) likelihood function with respect to the parameters, set these to zero, and solve: log L π i = 0 n 1 π 1 = n 2 π 2 = = n c π c ˆπ i = n i n = p i As with the spring model, setting derivatives to zero means minimizing the potential energy; the maximum likelihood solution simply sets parameter values equal to corresponding sample quantities, where the forces are also balanced. In the mechanical model (Figure 11) this corresponds to stacking the gas containers with movable partitions between them, with one end of the bottom and top containers fixed at 0 and 1. The observations exert pressure on the partitions, the likelihood equations are precisely the conditions for the forces to balance, and the partitions move so that each chamber is of size p i = n i /n. Each chamber has potential energy of log p i, and the total energy, c i n i log p i is minimized. The constrained top and bottom force the probability estimates to sum to 1, and the number of movable partitions is literally and statistically the degrees of freedom of the system. [Figure 11 about here.] Testing a hypothesis This mechanical model also explains how we test hypotheses about the true probabilities (Figure 12). To test the hypothesis that the four hair color categories are equally probable, H 0 : π 1 = π 2 = π 3 = π 4 = 1 4 simply force the partitions to move to the hypothesized values and measure how much energy is required to force the constraint. Some of the chambers will then exert more pressure, some less than when the forces are allowed to balance without these additional restraints. The change in energy in each compartment is then (log p i log π i ) = log(p i /π i ), the change in negative log-likelihood. Sum these up and multiply by 2 to get the likelihood ratio G 2. [Figure 12 about here.] The pressure model also provides simple explanations of other results. For example, increased sample size increases power, because more observations means more pressure in each compartment, so it takes more energy to move the partitions and the test is sensitive to smaller differences between observed and hypothesized probabilities. Multi-way Tables The dynamic pressure model extends readily to multi-way tables. For a two-way table of hair color and eye color, partition the sample space according to the marginal proportions of eye

10 3 CONCEPTUAL MODELS FOR VISUAL DISPLAYS 10 color, and then partition the observations for each eye color according to hair color as before (Figure 13). Within each column the forces balance as before, so that the height of each chamber is n ij /n i+. Then the area of each cell is proportional to the maximum-likelihood estimate (MLE) of the cell probabilities, (n i+ /n) (n ij /n i+ ) = n ij /n = p ij,, which again is the sample cell proportion. [Figure 13 about here.] For a three-way table, the physical model is a cube with its third dimension partitioned according to conditional frequencies of the third variable, given the first two. If the third dimension is represented instead by partitioning a two-dimensional graph, the result is the mosaic display. Testing Independence For a two-way table of size I J independence is formally the same as the hypothesis that conditional probabilities (of hair color) are the same in all strata (eye colors). To test this hypothesis, force the partitions to align and measure the total additional energy required to effect the change (Figure 14). The degrees of freedom for the test is again the number of movable partitions, (I 1)(J 1). [Figure 14 about here.] Each log-linear model for three-way tables can be interpreted analogously. For example, the log-linear model [A][B][C] (complete independence), corresponds to the cube in which all chambers are forced to conform to the one way marginals, π ijk = π i++ π +j+ π ++k for all i,j,k. G 2 is again the total additional energy required to move the partitions from their positions in the saturated model in which the volume of each cell is p ijk = n ijk /n (so the pressures balance), to the positions where each cell is a cube of size π i++ π +j+ π ++k. Other models have a similar representation in the pressure model. Iterative Proportional Fitting For three-way (and higher) tables some log-linear models have closed-form solutions for expected cell frequencies. The cases in which direct estimates exist are analogous to the twoway case, where the estimates under the hypothesized model are products of the sufficient marginals. Here we see that the partitions in the observation space can be moved directly in planar slices to their positions under the hypothesis, so that iteration is unnecessary. When direct estimates do not exist, the MLEs can be estimated by iterative proportional fitting (IPF). This process simply matches the partitions corresponding to each of the sufficient marginals of the fitted frequencies to the same marginals of the data. For example, for the log-linear model [AB][BC][AC], the sufficient statistics are n ij+, n i+k, and n +jk. The conditions that the fitted margins must equal these observed margins are n ij+ ˆm ij+ = n i+k ˆm i+k = n +jk ˆm +jk = 1, (5)

11 4 CONCLUSION 11 which is equivalent to balancing the forces in each fitted marginal. The steps in IPF follow directly from Equation (5). For example, the first step in cycle t + 1 of IPF matches the frequencies in the [AB] marginal table, ˆm (t+1) ijk = ˆm (t) ijk ( n ij+ ˆm (t) ij+ ), (6) which makes the forces balance when Equation (6) is summed over variable C: ˆm (t+1) ij+ = n ij+. The other steps in each cycle make the forces balance in the [BC], and [AC] margins. The iterative process can be shown visually (Friendly, 1995), in a way that is graphically exact, by drawing chambers whose area is proportional to the fitted frequencies, ˆm ijk, and which are filled with a number of points equal to the observed n ijk. Such a figure will then show equal densities of points in cells that are fit well, but relatively high or low densities where n ijk > ˆm ijk or n ijk < ˆm ijk, respectively. The IPF algorithm can in fact be animated, by drawing one such frame for each step in the iterative process. When this is done, it is remarkable how quickly IPF converges, at least for small tables. Likewise, numerical methods for minimizing the negative log likelihood directly can also be interpreted in terms of the dynamic model (Farebrother, 1988; Friendly, 1995). For example, in steepest descent and Newton-Raphson iteration, the update step changes the estimated model parameters β (t+1) in proportion to the score vector f (t) of derivatives of the likelihood function, f (t) = log L/ β = X (n m (t) ) to give β (t+1) = β (t) + λf (t). But f (t) is just the vector of forces in the mechanical model attributed to the differences between n and m (t) as a function of the model parameters. 4 Conclusion I began this paper with the puzzling contrast in use and generality between graphical methods for quantitative data and those for categorical data, despite strong formal similarities in their underlying methods. The explanation I believe I have demonstrated is that categorical data require a different graphic metaphor, and hence a different visual representation (count area) from that which has been useful for quantitative data (magnitude position along a scale). The sieve diagram, mosaic, and the fourfold display all show frequencies in this way, and are valuable tools for both the analysis and presentation of categorical data. In the second part of this paper I have outlined concrete, physical models for both quantitative and categorical data and their graphic representation and have shown these to yield a wide range of interpretation for statistical principles and phenomena. Although the spring and pressure models differ fundamentally in their mechanics, both can be understood in terms of balancing of forces and the minimization of energy. The recognition of these conceptual models can make a graphical display a tool for thinking, as well as a tool for data summarization and exposure. Finally, as I look to the future development of graphical methods for categorical data, I see two areas where our report card, perhaps reflected in this volume, may be marked needs improvement : First, much of the power of graphical methods for quantitative data

12 REFERENCES 12 stems from the availability of tools that generalize readily to multivariable data and can make important contributions to model building, model criticism, and model interpretation. The mosaic display possesses some of these properties, and other papers here attest to the widespread utility of biplots and correspondence analysis. However, I believe there is need for further development of such methods, particularly as tools for constructing models and communicating their import. Second, I am reminded of the statement (Tukey, 1959, attributed to Churchill Eisenhart) that the practical power of any statistical tool is the product of its statistical power times its probability of use. It follows that statistical and graphical methods are of practical value to the extent that they are implemented in standard software, available, and easy to use. Statistical methods for categorical data analysis have nearly reached that point. Graphical methods still have some way to go. About the Author Michael Friendly is Associate Professor of Psychology and Coordinator of the Statistical Consulting Service at York University. He is an Associate Editor of the Journal of Computational and Graphical Statistics, and has been working on the development of graphical methods for categorical data. For further information, see References [1] Bertin, J. (1983), Semiology of Graphics (trans. W. Berg). Madison, WI: University of Wisconsin Press. [2] Bickel, P. J., Hammel, J. W. & O Connell, J. W. (1975). Sex bias in graduate admissions: data from Berkeley. Science, 187, [3] Farebrother, R. W. (1987), Mechanical representations of the L 1 and L 2 estimation problems, In Y. Dodge (ed.) Statistical data analysis based on the L1 norm and related methods, Amsterdam: North-Holland., [4] Farebrother, R. W. (1988), On an analogy between classical mechanics and maximum likelihood estimation, Osterrische Zeitschrift fur Statistik und Informatik, 18, [5] Friendly, M. (1992a), Graphical methods for categorical data. Proceedings of the SAS User s Group International Conference, 17, [6] Friendly, M. (1992b), Mosaic Displays for Loglinear Models. American Statistical Association, Proceedings of the Statistical Graphics Section, [7] Friendly, M. (1994a), Mosaic displays for multi-way contingency tables. Journal of the American Statistical Association, 89,

13 REFERENCES 13 [8] Friendly, M. (1994b), A fourfold display for 2 by 2 by k tables. Department of Psychology Reports, No. 217, York University. [9] Friendly, M. (1994c). SAS/IML graphics for fourfold displays. Observations, 3(4), [10] Friendly, M. (1995). Conceptual and visual models for categorical data. American Statistician, 1995, 49, [11] Hartigan, J. A., and Kleiner, B. (1981). Mosaics for contingency tables. In W. F. Eddy (Ed.), Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface. New York: Springer-Verlag, [12] Riedwyl, H., & Schüpbach, M. (1983). Siebdiagramme: Graphische Darstellung von Kontingenztafeln. Technical Report No. 12, Institute for Mathematical Statistics, University of Bern, Bern, Switzerland. [13] Riedwyl, H., and Schüpbach, M. (1994). Parquet diagram to plot contingency tables. In Softstat 93: Advances in Statistical Software, F. Faulbaum (Ed.). New York: Gustav Fischer, [14] Sall, J. (1991a), The conceptual model behind the picture, ASA Statistical Computing and Statistical Graphics Newsletter, 2 (April), 5 8. [15] Sall, J. (1991b), The conceptual model for categorical responses, ASA Statistical Computing and Statistical Graphics Newsletter, 3 (November), [16] Snee, R. D. (1974), Graphical display of two-way contingency tables, The American Statistician, 28, [17] Tukey, J. W. (1959), A quick, compact, two sample test to Duckworth s specifications. Technometrics, 1,

14 LIST OF TABLES 14 List of Tables 1 Hair-color eye-color data

15 TABLES 15 Table 1: Hair-color eye-color data. Hair Color Eye Color BLACK BROWN RED BLOND Total Green Hazel Blue Brown Total

16 TABLES 16 List of Figures 1 Expected frequencies under independence Sieve diagram for hair-color, eye-color data Condensed mosaic for Hair-color, Eye-color data Condensed mosaic, reordered and shaded Three-way mosaic, joint independence Four-fold display for Berkeley admissions Fourfold display of Berkeley admissions, by department Spring model for least squares regression Urn model for multinomial sampling Pressure model for categorical data Fitting probabilities for a one-way table Testing a hypothesis Two-way tables Testing independence

17 FIGURES 17 Green Hazel Eye Colour Blue Brown Black Brown Red Blond Hair Colour Figure 1: Expected frequencies under independence. Each box has area equal to its expected frequency, and is cross-ruled proportionally to the expected frequency.

18 FIGURES 18 Green Hazel Eye Colour Blue Brown Black Brown Red Blond Hair Colour Figure 2: Sieve diagram for hair-color, eye-color data. Observed frequencies are equal to the number squares in each cell, so departure from independence appears as variations in shading density.

19 FIGURES 19 Eye Colour Brown Blue Hazel Green Black Brown Red Blond Hair colour Figure 3: Condensed mosaic for Hair-color, Eye-color data. Each column is divided according to the conditional frequency of eye color given hair color. The area of each rectangle is proportional to observed frequency in that cell.

20 FIGURES 20 Eye Colour Brown Hazel Green Blue Standardized residuals: <-4-4:-2-2:-0 0:2 2:4 >4 Black Brown Red Blond Hair colour Figure 4: Condensed mosaic, reordered and shaded. Deviations from independence are shown by color and shading. The two levels of shading density correspond to standardized deviations greater than 2 and 4 in absolute value. This form of the display generalizes readily to multi-way tables.

21 FIGURES 21 Eye Colour Brown Hazel Green Blue Standardized residuals: <-4-4:-2-2:-0 0:2 2:4 >4 M F M F M F M F Black Brown Red Blond Figure 5: Three-way mosaic display for hair color, eye color, and sex. Residuals from the model of joint independence, [HE][S] are shown by shading. G 2 = on 15 df. The only lack of fit is an overabundance of females among blue-eyed blonds.

22 FIGURES 22 Sex: Male Admit?: Yes Admit?: No Sex: Female Figure 6: Four-fold display for Berkeley admissions: Evidence for sex bias? The area of each shaded quadrant shows the frequency, standardized to equate the margins for sex and admission. Circular arcs show the limits of a 99% confidence interval for the odds ratio.

23 FIGURES 23 Department: A Sex: Male Department: B Sex: Male Admit?: Yes Admit?: No Admit?: Yes Admit?: No Sex: Female Department: C Sex: Male Sex: Female Department: D Sex: Male Admit?: Yes Admit?: No Admit?: Yes Admit?: No Sex: Female Department: E Sex: Male Sex: Female Department: F Sex: Male Admit?: Yes Admit?: No Admit?: Yes Admit?: No Sex: Female Sex: Female Figure 7: Fourfold display of Berkeley admissions, by department. In each panel the confidence rings for adjacent quadrants overlap if the odds ratio for admission and sex does not differ significantly from 1. The data in each panel have been standardized as in Figure 6.

24 FIGURES y x Figure 8: Spring model for least squares regression. Fixed data points are connected to a movable rod by springs constrained to remain vertical. The least squares line is the position of balanced forces.

25 FIGURES Black 108 Brown 286 Red 71 Blond 127 Figure 9: Urn model for multinomial sampling. Each observation in Table 1 is represented by a token classified by hair color into the appropriate urn. This model provides a basis for the bar chart, but does not yield any further insights.

26 FIGURES pressure(1) = 0.8 energy = p 1 1 / height = - log( p ).6 pressure <-> obs. density energy <-> -log likelihood.4 pressure(p) = ( 1 / p ).2.0 pressure(0) = Figure 10: Pressure model for categorical data. Frequency of observations corresponds to pressure of gas in a chamber, shown visually as observation density; negative log likelihood corresponds to the energy required to compress the gas to a height p.

27 FIGURES Blond Red 71.6 df = # of movable.4 Brown 286 partitions.2 Black Figure 11: Fitting probabilities for a one-way table. The movable partitions naturally adjust to positions of balanced forces, which is the minimum energy configuration.

28 FIGURES G 2 components Blond Red Brown Black = G 2 Figure 12: Testing a hypothesis. The likelihood ratio G 2 measures how much energy is required to move the partitions to constrain the data to the hypothesized probabilities. The components of G 2 indicate the degree to which each chamber has low or high pressure, relative to the balanced state.

29 FIGURES Eye Colour Blond.8 Red.6.4 Brown.2 Black.0 Brown 220 Blue 215 Hazel 93 Green 64 Figure 13: Two-way tables. For multiple samples, the model represents each sample by a stack of pressure chambers whose width is proportional to the marginal frequencies of one variable.

30 FIGURES Eye Colour Blond.8 Red.6.4 Brown.2 Black.0 Brown 220 Blue 215 Hazel Green Figure 14: Testing independence. The chambers are forced to align with both sets of marginal frequencies, and the likelihood ratio G 2 again measures the additional energy required.

Conceptual and Visual Models for Categorical Data

Conceptual and Visual Models for Categorical Data Michael FRIENDLY Conceptual and Visual Models for Categorical Data A dynamic conceptual model for categorical data is described that likens observations to gas molecules in a pressure chamber. In this

More information

Extended Mosaic and Association Plots for Visualizing (Conditional) Independence. Achim Zeileis David Meyer Kurt Hornik

Extended Mosaic and Association Plots for Visualizing (Conditional) Independence. Achim Zeileis David Meyer Kurt Hornik Extended Mosaic and Association Plots for Visualizing (Conditional) Independence Achim Zeileis David Meyer Kurt Hornik Overview The independence problem in -way contingency tables Standard approach: χ

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Visualizing Independence Using Extended Association and Mosaic Plots. Achim Zeileis David Meyer Kurt Hornik

Visualizing Independence Using Extended Association and Mosaic Plots. Achim Zeileis David Meyer Kurt Hornik Visualizing Independence Using Extended Association and Mosaic Plots Achim Zeileis David Meyer Kurt Hornik Overview The independence problem in 2-way contingency tables Standard approach: χ 2 test Alternative

More information

MATH 1150 Chapter 2 Notation and Terminology

MATH 1150 Chapter 2 Notation and Terminology MATH 1150 Chapter 2 Notation and Terminology Categorical Data The following is a dataset for 30 randomly selected adults in the U.S., showing the values of two categorical variables: whether or not the

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Review of Statistics

Review of Statistics Review of Statistics Topics Descriptive Statistics Mean, Variance Probability Union event, joint event Random Variables Discrete and Continuous Distributions, Moments Two Random Variables Covariance and

More information

Stat 101 Exam 1 Important Formulas and Concepts 1

Stat 101 Exam 1 Important Formulas and Concepts 1 1 Chapter 1 1.1 Definitions Stat 101 Exam 1 Important Formulas and Concepts 1 1. Data Any collection of numbers, characters, images, or other items that provide information about something. 2. Categorical/Qualitative

More information

Correspondence Analysis

Correspondence Analysis Correspondence Analysis Q: when independence of a 2-way contingency table is rejected, how to know where the dependence is coming from? The interaction terms in a GLM contain dependence information; however,

More information

Testing Independence

Testing Independence Testing Independence Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM 1/50 Testing Independence Previously, we looked at RR = OR = 1

More information

Discrete Multivariate Statistics

Discrete Multivariate Statistics Discrete Multivariate Statistics Univariate Discrete Random variables Let X be a discrete random variable which, in this module, will be assumed to take a finite number of t different values which are

More information

Residual-based Shadings for Visualizing (Conditional) Independence

Residual-based Shadings for Visualizing (Conditional) Independence Residual-based Shadings for Visualizing (Conditional) Independence Achim Zeileis David Meyer Kurt Hornik http://www.ci.tuwien.ac.at/~zeileis/ Overview The independence problem in 2-way contingency tables

More information

Review of Multiple Regression

Review of Multiple Regression Ronald H. Heck 1 Let s begin with a little review of multiple regression this week. Linear models [e.g., correlation, t-tests, analysis of variance (ANOVA), multiple regression, path analysis, multivariate

More information

ColourMatrix: White Paper

ColourMatrix: White Paper ColourMatrix: White Paper Finding relationship gold in big data mines One of the most common user tasks when working with tabular data is identifying and quantifying correlations and associations. Fundamentally,

More information

INTRODUCTION TO LOG-LINEAR MODELING

INTRODUCTION TO LOG-LINEAR MODELING INTRODUCTION TO LOG-LINEAR MODELING Raymond Sin-Kwok Wong University of California-Santa Barbara September 8-12 Academia Sinica Taipei, Taiwan 9/8/2003 Raymond Wong 1 Hypothetical Data for Admission to

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression Rebecca Barter April 20, 2015 Fisher s Exact Test Fisher s Exact Test

More information

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i B. Weaver (24-Mar-2005) Multiple Regression... 1 Chapter 5: Multiple Regression 5.1 Partial and semi-partial correlation Before starting on multiple regression per se, we need to consider the concepts

More information

Solution to Tutorial 7

Solution to Tutorial 7 1. (a) We first fit the independence model ST3241 Categorical Data Analysis I Semester II, 2012-2013 Solution to Tutorial 7 log µ ij = λ + λ X i + λ Y j, i = 1, 2, j = 1, 2. The parameter estimates are

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878 Contingency Tables I. Definition & Examples. A) Contingency tables are tables where we are looking at two (or more - but we won t cover three or more way tables, it s way too complicated) factors, each

More information

STAT 525 Fall Final exam. Tuesday December 14, 2010

STAT 525 Fall Final exam. Tuesday December 14, 2010 STAT 525 Fall 2010 Final exam Tuesday December 14, 2010 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points will

More information

Multinomial Logistic Regression Models

Multinomial Logistic Regression Models Stat 544, Lecture 19 1 Multinomial Logistic Regression Models Polytomous responses. Logistic regression can be extended to handle responses that are polytomous, i.e. taking r>2 categories. (Note: The word

More information

Describing Contingency tables

Describing Contingency tables Today s topics: Describing Contingency tables 1. Probability structure for contingency tables (distributions, sensitivity/specificity, sampling schemes). 2. Comparing two proportions (relative risk, odds

More information

Categorical Data Analysis Chapter 3

Categorical Data Analysis Chapter 3 Categorical Data Analysis Chapter 3 The actual coverage probability is usually a bit higher than the nominal level. Confidence intervals for association parameteres Consider the odds ratio in the 2x2 table,

More information

Why Is It There? Attribute Data Describe with statistics Analyze with hypothesis testing Spatial Data Describe with maps Analyze with spatial analysis

Why Is It There? Attribute Data Describe with statistics Analyze with hypothesis testing Spatial Data Describe with maps Analyze with spatial analysis 6 Why Is It There? Why Is It There? Getting Started with Geographic Information Systems Chapter 6 6.1 Describing Attributes 6.2 Statistical Analysis 6.3 Spatial Description 6.4 Spatial Analysis 6.5 Searching

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

Three-Way Contingency Tables

Three-Way Contingency Tables Newsom PSY 50/60 Categorical Data Analysis, Fall 06 Three-Way Contingency Tables Three-way contingency tables involve three binary or categorical variables. I will stick mostly to the binary case to keep

More information

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS Ravinder Malhotra and Vipul Sharma National Dairy Research Institute, Karnal-132001 The most common use of statistics in dairy science is testing

More information

ij i j m ij n ij m ij n i j Suppose we denote the row variable by X and the column variable by Y ; We can then re-write the above expression as

ij i j m ij n ij m ij n i j Suppose we denote the row variable by X and the column variable by Y ; We can then re-write the above expression as page1 Loglinear Models Loglinear models are a way to describe association and interaction patterns among categorical variables. They are commonly used to model cell counts in contingency tables. These

More information

REVIEW 8/2/2017 陈芳华东师大英语系

REVIEW 8/2/2017 陈芳华东师大英语系 REVIEW Hypothesis testing starts with a null hypothesis and a null distribution. We compare what we have to the null distribution, if the result is too extreme to belong to the null distribution (p

More information

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric Assumptions The observations must be independent. Dependent variable should be continuous

More information

Yu Xie, Institute for Social Research, 426 Thompson Street, University of Michigan, Ann

Yu Xie, Institute for Social Research, 426 Thompson Street, University of Michigan, Ann Association Model, Page 1 Yu Xie, Institute for Social Research, 426 Thompson Street, University of Michigan, Ann Arbor, MI 48106. Email: yuxie@umich.edu. Tel: (734)936-0039. Fax: (734)998-7415. Association

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Statistics: revision

Statistics: revision NST 1B Experimental Psychology Statistics practical 5 Statistics: revision Rudolf Cardinal & Mike Aitken 29 / 30 April 2004 Department of Experimental Psychology University of Cambridge Handouts: Answers

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

Log-linear Models for Contingency Tables

Log-linear Models for Contingency Tables Log-linear Models for Contingency Tables Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Log-linear Models for Two-way Contingency Tables Example: Business Administration Majors and Gender A

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

Hypothesis Tests and Confidence Intervals. in Multiple Regression

Hypothesis Tests and Confidence Intervals. in Multiple Regression ECON4135, LN6 Hypothesis Tests and Confidence Intervals Outline 1. Why multipple regression? in Multiple Regression (SW Chapter 7) 2. Simpson s paradox (omitted variables bias) 3. Hypothesis tests and

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) (b) (c) (d) (e) In 2 2 tables, statistical independence is equivalent

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

 M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2 Notation and Equations for Final Exam Symbol Definition X The variable we measure in a scientific study n The size of the sample N The size of the population M The mean of the sample µ The mean of the

More information

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels.

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels. Contingency Tables Definition & Examples. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels. (Using more than two factors gets complicated,

More information

Mathematical Notation Math Introduction to Applied Statistics

Mathematical Notation Math Introduction to Applied Statistics Mathematical Notation Math 113 - Introduction to Applied Statistics Name : Use Word or WordPerfect to recreate the following documents. Each article is worth 10 points and should be emailed to the instructor

More information

Hypothesis Testing hypothesis testing approach

Hypothesis Testing hypothesis testing approach Hypothesis Testing In this case, we d be trying to form an inference about that neighborhood: Do people there shop more often those people who are members of the larger population To ascertain this, we

More information

Chapter 26: Comparing Counts (Chi Square)

Chapter 26: Comparing Counts (Chi Square) Chapter 6: Comparing Counts (Chi Square) We ve seen that you can turn a qualitative variable into a quantitative one (by counting the number of successes and failures), but that s a compromise it forces

More information

An Introduction to Mplus and Path Analysis

An Introduction to Mplus and Path Analysis An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression

More information

Psych 230. Psychological Measurement and Statistics

Psych 230. Psychological Measurement and Statistics Psych 230 Psychological Measurement and Statistics Pedro Wolf December 9, 2009 This Time. Non-Parametric statistics Chi-Square test One-way Two-way Statistical Testing 1. Decide which test to use 2. State

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

Lecture 5: ANOVA and Correlation

Lecture 5: ANOVA and Correlation Lecture 5: ANOVA and Correlation Ani Manichaikul amanicha@jhsph.edu 23 April 2007 1 / 62 Comparing Multiple Groups Continous data: comparing means Analysis of variance Binary data: comparing proportions

More information

Correlation & Simple Regression

Correlation & Simple Regression Chapter 11 Correlation & Simple Regression The previous chapter dealt with inference for two categorical variables. In this chapter, we would like to examine the relationship between two quantitative variables.

More information

One-Way ANOVA. Some examples of when ANOVA would be appropriate include:

One-Way ANOVA. Some examples of when ANOVA would be appropriate include: One-Way ANOVA 1. Purpose Analysis of variance (ANOVA) is used when one wishes to determine whether two or more groups (e.g., classes A, B, and C) differ on some outcome of interest (e.g., an achievement

More information

Sections 2.3, 2.4. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis 1 / 21

Sections 2.3, 2.4. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis 1 / 21 Sections 2.3, 2.4 Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 21 2.3 Partial association in stratified 2 2 tables In describing a relationship

More information

Generalized Linear Models: An Introduction

Generalized Linear Models: An Introduction Applied Statistics With R Generalized Linear Models: An Introduction John Fox WU Wien May/June 2006 2006 by John Fox Generalized Linear Models: An Introduction 1 A synthesis due to Nelder and Wedderburn,

More information

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Multilevel Models in Matrix Form Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Today s Lecture Linear models from a matrix perspective An example of how to do

More information

Analytical Graphing. lets start with the best graph ever made

Analytical Graphing. lets start with the best graph ever made Analytical Graphing lets start with the best graph ever made Probably the best statistical graphic ever drawn, this map by Charles Joseph Minard portrays the losses suffered by Napoleon's army in the Russian

More information

Sleep data, two drugs Ch13.xls

Sleep data, two drugs Ch13.xls Model Based Statistics in Biology. Part IV. The General Linear Mixed Model.. Chapter 13.3 Fixed*Random Effects (Paired t-test) ReCap. Part I (Chapters 1,2,3,4), Part II (Ch 5, 6, 7) ReCap Part III (Ch

More information

Optimization Problems

Optimization Problems Optimization Problems The goal in an optimization problem is to find the point at which the minimum (or maximum) of a real, scalar function f occurs and, usually, to find the value of the function at that

More information

Sociology 6Z03 Review I

Sociology 6Z03 Review I Sociology 6Z03 Review I John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review I Fall 2016 1 / 19 Outline: Review I Introduction Displaying Distributions Describing

More information

Unconstrained Ordination

Unconstrained Ordination Unconstrained Ordination Sites Species A Species B Species C Species D Species E 1 0 (1) 5 (1) 1 (1) 10 (4) 10 (4) 2 2 (3) 8 (3) 4 (3) 12 (6) 20 (6) 3 8 (6) 20 (6) 10 (6) 1 (2) 3 (2) 4 4 (5) 11 (5) 8 (5)

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 1: August 22, 2012

More information

Lecture 9: Linear Regression

Lecture 9: Linear Regression Lecture 9: Linear Regression Goals Develop basic concepts of linear regression from a probabilistic framework Estimating parameters and hypothesis testing with linear models Linear regression in R Regression

More information

STAT 7030: Categorical Data Analysis

STAT 7030: Categorical Data Analysis STAT 7030: Categorical Data Analysis 5. Logistic Regression Peng Zeng Department of Mathematics and Statistics Auburn University Fall 2012 Peng Zeng (Auburn University) STAT 7030 Lecture Notes Fall 2012

More information

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES BIOL 458 - Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES PART 1: INTRODUCTION TO ANOVA Purpose of ANOVA Analysis of Variance (ANOVA) is an extremely useful statistical method

More information

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS Duration - 3 hours Aids Allowed: Calculator LAST NAME: FIRST NAME: STUDENT NUMBER: There are 27 pages

More information

Statistical Distribution Assumptions of General Linear Models

Statistical Distribution Assumptions of General Linear Models Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population

More information

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Chapter Fifteen Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Copyright 2010 Pearson Education, Inc. publishing as Prentice Hall 15-1 Internet Usage Data Table 15.1 Respondent Sex Familiarity

More information

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

Chapter 2: Tools for Exploring Univariate Data

Chapter 2: Tools for Exploring Univariate Data Stats 11 (Fall 2004) Lecture Note Introduction to Statistical Methods for Business and Economics Instructor: Hongquan Xu Chapter 2: Tools for Exploring Univariate Data Section 2.1: Introduction What is

More information

Various Issues in Fitting Contingency Tables

Various Issues in Fitting Contingency Tables Various Issues in Fitting Contingency Tables Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Complete Tables with Zero Entries In contingency tables, it is possible to have zero entries in a

More information

UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics Thursday, August 30, 2018

UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics Thursday, August 30, 2018 UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics Thursday, August 30, 2018 Work all problems. 60 points are needed to pass at the Masters Level and 75

More information

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

Statistics 3858 : Contingency Tables

Statistics 3858 : Contingency Tables Statistics 3858 : Contingency Tables 1 Introduction Before proceeding with this topic the student should review generalized likelihood ratios ΛX) for multinomial distributions, its relation to Pearson

More information

Chapter 8. The analysis of count data. This is page 236 Printer: Opaque this

Chapter 8. The analysis of count data. This is page 236 Printer: Opaque this Chapter 8 The analysis of count data This is page 236 Printer: Opaque this For the most part, this book concerns itself with measurement data and the corresponding analyses based on normal distributions.

More information

California Common Core State Standards for Mathematics Standards Map Mathematics I

California Common Core State Standards for Mathematics Standards Map Mathematics I A Correlation of Pearson Integrated High School Mathematics Mathematics I Common Core, 2014 to the California Common Core State s for Mathematics s Map Mathematics I Copyright 2017 Pearson Education, Inc.

More information

Unit 27 One-Way Analysis of Variance

Unit 27 One-Way Analysis of Variance Unit 27 One-Way Analysis of Variance Objectives: To perform the hypothesis test in a one-way analysis of variance for comparing more than two population means Recall that a two sample t test is applied

More information

INTERVAL ESTIMATION AND HYPOTHESES TESTING

INTERVAL ESTIMATION AND HYPOTHESES TESTING INTERVAL ESTIMATION AND HYPOTHESES TESTING 1. IDEA An interval rather than a point estimate is often of interest. Confidence intervals are thus important in empirical work. To construct interval estimates,

More information

3 Way Tables Edpsy/Psych/Soc 589

3 Way Tables Edpsy/Psych/Soc 589 3 Way Tables Edpsy/Psych/Soc 589 Carolyn J. Anderson Department of Educational Psychology I L L I N O I S university of illinois at urbana-champaign c Board of Trustees, University of Illinois Spring 2017

More information

An Introduction to Path Analysis

An Introduction to Path Analysis An Introduction to Path Analysis PRE 905: Multivariate Analysis Lecture 10: April 15, 2014 PRE 905: Lecture 10 Path Analysis Today s Lecture Path analysis starting with multivariate regression then arriving

More information

(Where does Ch. 7 on comparing 2 means or 2 proportions fit into this?)

(Where does Ch. 7 on comparing 2 means or 2 proportions fit into this?) 12. Comparing Groups: Analysis of Variance (ANOVA) Methods Response y Explanatory x var s Method Categorical Categorical Contingency tables (Ch. 8) (chi-squared, etc.) Quantitative Quantitative Regression

More information

Diagnostics and Remedial Measures

Diagnostics and Remedial Measures Diagnostics and Remedial Measures Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Diagnostics and Remedial Measures 1 / 72 Remedial Measures How do we know that the regression

More information

Analytical Graphing. lets start with the best graph ever made

Analytical Graphing. lets start with the best graph ever made Analytical Graphing lets start with the best graph ever made Probably the best statistical graphic ever drawn, this map by Charles Joseph Minard portrays the losses suffered by Napoleon's army in the Russian

More information

Bivariate data analysis

Bivariate data analysis Bivariate data analysis Categorical data - creating data set Upload the following data set to R Commander sex female male male male male female female male female female eye black black blue green green

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model EPSY 905: Multivariate Analysis Lecture 1 20 January 2016 EPSY 905: Lecture 1 -

More information

Statistical Process Control for Multivariate Categorical Processes

Statistical Process Control for Multivariate Categorical Processes Statistical Process Control for Multivariate Categorical Processes Fugee Tsung The Hong Kong University of Science and Technology Fugee Tsung 1/27 Introduction Typical Control Charts Univariate continuous

More information

Chapte The McGraw-Hill Companies, Inc. All rights reserved.

Chapte The McGraw-Hill Companies, Inc. All rights reserved. er15 Chapte Chi-Square Tests d Chi-Square Tests for -Fit Uniform Goodness- Poisson Goodness- Goodness- ECDF Tests (Optional) Contingency Tables A contingency table is a cross-tabulation of n paired observations

More information

Generalized Linear Models

Generalized Linear Models York SPIDA John Fox Notes Generalized Linear Models Copyright 2010 by John Fox Generalized Linear Models 1 1. Topics I The structure of generalized linear models I Poisson and other generalized linear

More information

Multiple Linear Regression. Chapter 12

Multiple Linear Regression. Chapter 12 13 Multiple Linear Regression Chapter 12 Multiple Regression Analysis Definition The multiple regression model equation is Y = b 0 + b 1 x 1 + b 2 x 2 +... + b p x p + ε where E(ε) = 0 and Var(ε) = s 2.

More information

STAT 200 Chapter 1 Looking at Data - Distributions

STAT 200 Chapter 1 Looking at Data - Distributions STAT 200 Chapter 1 Looking at Data - Distributions What is Statistics? Statistics is a science that involves the design of studies, data collection, summarizing and analyzing the data, interpreting the

More information

Common Core State Standards for Mathematics Integrated Pathway: Mathematics I

Common Core State Standards for Mathematics Integrated Pathway: Mathematics I A CORRELATION OF TO THE Standards for Mathematics A Correlation of Table of Contents Unit 1: Relationships between Quantities... 1 Unit 2: Linear and Exponential Relationships... 4 Unit 3: Reasoning with

More information

Analysis of Categorical Data Three-Way Contingency Table

Analysis of Categorical Data Three-Way Contingency Table Yu Lecture 4 p. 1/17 Analysis of Categorical Data Three-Way Contingency Table Yu Lecture 4 p. 2/17 Outline Three way contingency tables Simpson s paradox Marginal vs. conditional independence Homogeneous

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification, Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability

More information

Mathematics for Economics MA course

Mathematics for Economics MA course Mathematics for Economics MA course Simple Linear Regression Dr. Seetha Bandara Simple Regression Simple linear regression is a statistical method that allows us to summarize and study relationships between

More information

Practice exam questions

Practice exam questions Practice exam questions Nathaniel Higgins nhiggins@jhu.edu, nhiggins@ers.usda.gov 1. The following question is based on the model y = β 0 + β 1 x 1 + β 2 x 2 + β 3 x 3 + u. Discuss the following two hypotheses.

More information

Psychology Seminar Psych 406 Dr. Jeffrey Leitzel

Psychology Seminar Psych 406 Dr. Jeffrey Leitzel Psychology Seminar Psych 406 Dr. Jeffrey Leitzel Structural Equation Modeling Topic 1: Correlation / Linear Regression Outline/Overview Correlations (r, pr, sr) Linear regression Multiple regression interpreting

More information

STAC51: Categorical data Analysis

STAC51: Categorical data Analysis STAC51: Categorical data Analysis Mahinda Samarakoon January 26, 2016 Mahinda Samarakoon STAC51: Categorical data Analysis 1 / 32 Table of contents Contingency Tables 1 Contingency Tables Mahinda Samarakoon

More information