Psych Jan. 5, 2005



Week 1: Introductory Notes on Variables and Probability Distributions (1/5/05)

(Reading: Aron & Aron, Chaps. 1, 14, and this Handout.) All handouts are available outside Mija's office. Lecture notes and overheads are available on the website.

1. Data collection in class

Let us collect some data, which we can use to illustrate a few concepts. Try to imagine the development of a hypothetical person from the age of 1 year up to age 41. In the Table below please write down (i) the height (in inches) at ages 1, 11,..., 41, and (ii) the likeableness at these ages. Likeableness is measured on a 100-point scale, where 0 represents a most obnoxious person and 100 represents a most pleasing person. (Ignore columns S and P.)

[Table to fill in: one row per Age (years); columns Height (inches) and Likeableness; columns S and P, completed for age 41.]

(iii) In column S write male or female.
(iv) Consider a 3-point scale: 1 = thin, 2 = medium, 3 = non-thin. In column P write the appropriate number for your imagined person.
(v) Using the same 1-3 scale as in (iv), state your own body type, P*.

These data can be used to answer many questions about imaginary (or real) persons, as now illustrated.

(a) What is the average height of 11-year-olds? How does Height change with Age? We can compare the answers provided by our data with the answers provided by official data, such as census data. Indeed, the graph on this page gives the height of British boys and girls from around [...]. Our class data come from a different country and a generation later. Therefore, we might expect our data to differ (In what ways, and why?) but, at least, such a comparison between our data and a normative data set is a useful starting point in understanding our data.

(b) How many students answered Question (iii) above? Why might this number differ from the total number of persons in the room?

(c) Is the likeableness of females the same as that of males? Is likeableness related to P or to P*? etc.

The individuals whose data we have constitute a sample.
Occasionally, our questions concern only the individuals in the sample. Typically, however, our questions concern the total set of individuals, of which the sample is only a subset. This total set is called the population. We make inferences about a population from the sample data. Later on we will use symbols such as $M$, $\bar{x}$, $\bar{X}$, $\bar{y}_1$, etc., to denote sample averages or means, and the symbol $\mu$ to denote a population mean. The distinction between sample and population is an important one.

Questions about data usually involve variables, e.g., height and age. A variable is something which can vary (sic!) from person to person in a population. Each person is characterised by a particular value (e.g., tall, 60 ins.; young, 10 years) of the variable. We must learn to distinguish between words or symbols that refer to variables, on the one hand, and words or symbols that refer to values of these variables, on the other.

2. Types of variables (A&A, Chap. 1)

2.1. Qualitative and quantitative variables.

In determining what are valid uses for numbers in a study, the first question that has to be answered is: From what type of scale do the measurements come?

(i) The variable, S, in the Table on p. 1 is a nominal or qualitative variable with values Female and Male. These 2 values differ in name and in quality, not in quantity. In other words, even if we coded our data as Female = 1 and Male = 2, we would not be saying that a "2" has more of anything than a "1"; simply that a "2" is different from a "1".

(ii) P is an ordinal variable with values 1, 2, and 3, that connote an increasing quantity on a size dimension. It is not the case that the difference in size between a "2" and a "1" is the same as that between a "3" and a "2"; all that is implied is that a "3" has more size than a "2", which has more than a "1".

(iii) Likeableness is on an interval scale (not unquestionably), meaning, e.g., that the difference in likeableness between persons with 10 and 20 units is the same as that between persons with 70 and 80 units. There is no meaningful zero when a variable is measured on an interval scale (or on a nominal or ordinal scale).

(iv) Age and height are on ratio scales, meaning, e.g., that a person aged 20 is twice as old as one aged 10, etc. On a ratio scale, there is a meaningful zero (e.g., zero height is meaningful, whereas zero likeableness is hard to interpret).

(v) Finally, the number of students in Psych 10 is on an absolute scale. The absolute scale has no physical units (e.g., years or inches); the values on this scale are pure numbers.
In statistical practice, the key distinction is that between nominal or qualitative variables, on the one hand, and quantitative variables (i.e., ordinal, interval, ratio and absolute), on the other. We will learn the statistical methods that are appropriate to each type of variable (e.g., chi-square methods for qualitative variables, and t-tests, correlation, etc. for quantitative variables).

2.2. Discrete and continuous variables.

A second question concerning type of scale is whether the variable is discrete (taking on values that are not arbitrarily close together) or continuous. A crude check is to ask: Can the variable take on a value in its range such as 1.335? If no, the variable is probably discrete; if yes, the variable is probably continuous. The variables S and P are discrete; age and height are continuous, in principle. In practice, it is often convenient to regard continuous variables as discrete (e.g., height, which is continuous, is usually measured to the nearest inch, which is a discrete scale); and to regard discrete variables as continuous (e.g., number of correct answers, which is discrete, is often assumed to be approximately Normally distributed, and the Normal distribution refers to continuous variables).

2.3. Random variables.

We will be dealing often with random variables. Notice that the values of Age were determined beforehand in the above study, whereas the values of height and likeableness were not: you gave the values for height and likeableness, and no-one could have set them or predicted them exactly beforehand. Thus Age is a non-random variable (also called a fixed factor), whereas Height is a random variable. The word random connotes our ignorance about, our inability to predict, etc., but we shall find that in most cases we do have information about the behavior of random variables, namely the information contained in the probability distribution of that random variable.

2.4. Probability distributions; percentiles.
Let us consider the quantitative variable, height of 14-year-olds. Through extensive observation, we may be able to find the values, $x_P$, of height, such that a proportion, $P$, of 14-year-olds is shorter than $x_P$. This information, $\{P, x_P\}$, for $P$ = .05, .25, .50,..., .95, is an example of a probability distribution. $x_{.05}$ is called the 5th percentile, $x_{.50}$ is called the 50th percentile (also called the median), etc. The accompanying graph shows the distribution of height of U.S. boys between ages 2-18 around [...]. We now illustrate the sort of interesting information that can be extracted from this probability distribution.

(i) The median increases from about 34 ins. at age 2 to about 70 ins. at age 18.

(ii) The spread or range or variability or dispersion of the distribution increases with age. There are many ways to measure variability. In this case, a simple index of variability is the interval between the 5th and 95th percentiles; this interval increases from about 4 ins. at age 2 to about 9 ins. at age [...].

2.5. Small and large values of a variable.

In the Table below we show the distribution of boys' height at certain ages; these data were read off the graph on this page.

[Table: rows are Age (yrs); columns are probability- (or p-) values; entries are the corresponding percentiles of height.]

When would we say that a person has a small value of height (i.e., is "short") or a large value of height (i.e., is "tall")? Ans. It depends on the age of the person and on the probability distribution of height for that age. For example, 44.3 ins. is tall for a 4-year-old, but short for an 8-year-old. (Why?) In general, a small value of a random variable is a value such that most values of the variable are larger than it; and a large value of a random variable is a value such that few values of the variable are larger than it. Let us adopt the convention that a small value is a value that is less than or equal to the 5th percentile of the distribution, and a large value is a value that is greater than or equal to the 95th percentile of the distribution.

Exercises. (a) Is 54.1 ins. short for a 12-yr-old? (b) Is 62.7 ins. tall for a 12-yr-old; short for a 16-yr-old? (c) Is 54.0 ins. tall for an 8-yr-old?

2.6. Validity and reliability.

The validity of a measure, such as Likeableness, tells us how well that measure reflects the concept it is supposed to be measuring. Related to this is the question: What is the most valid behavioral manifestation of a concept? For example, what do you understand by likeableness? Also, a measure is reliable if repeated values of it do not vary by much (assuming the underlying concept stays constant). Note that if a measure is very unreliable one would suspect its validity.

2.7. Independent and dependent variables.
If two variables X and Y are causally related, it is sometimes possible to say that X causes Y. In such a case, Y is said to be the dependent variable (since it depends on X), and X is the independent variable.

The summation notation (A&A, pp. [...]).

A convenient shorthand exists for expressing sums. Given numbers $x_1, x_2, \ldots, x_k$, we express their sum, $S$, as

$$S = x_1 + x_2 + \cdots + x_k = \sum_{i=1}^{k} x_i.$$

Exercises. Suppose $x_1 = 2$, $x_2 = 4$, $x_3 = 1$ and $x_4 = 2$. Show (after brushing up on your high-school Algebra!) that (i) $\sum x_i = 7$, (ii) $3 \sum x_i = 18$, (iii) $\sum x_i - \sum x_i = 0$, (iv) $\sum x_i = 25$ [the limits of summation, and any exponents, are missing from this copy; two of the sums start at $i = 2$]; (v) $\sum_{i=1}^{3} a = a + a + a = 3a$, and (vi) $\sum_{i=1}^{n} \frac{1}{n} = \frac{1}{n} + \cdots + \frac{1}{n} = n \cdot \frac{1}{n} = 1$.

Probability distribution of a discrete variable (cf. A&A, Chap. 1)

Referring to the item about P in the data collected in class, let us count how many persons responded P = 1, 2 or 3. Let $f_i$ be the frequency (or count) of persons who responded P = $i$, $i$ = 1, 2, 3; let $N = f_1 + f_2 + f_3 = \sum_{i=1}^{3} f_i$ be the total number of persons who responded; and let $rf_i = f_i / N$ be the relative frequency or proportion of persons responding P = $i$.

[Table: rows are the values of P; columns give $f_i$ and $rf_i$ for sample (1), and $rf_i$ for samples (2) and (3).]

The list of possible values of the variable, together with the frequency (or relative frequency) of each value, is called the frequency (or relative frequency) histogram of the variable. Histogram is a synonym for distribution, and relative frequency is a synonym for proportion, which has almost the same meaning as probability. If we could observe an entire population (instead of just $N$ persons), then the relative frequency of value $i$ would be the probability of observing value $i$, denoted by $p_i$. The $\{p_i\}$ form the probability distribution of the variable, and they are estimated by the $\{rf_i\}$. Three examples, (1), (2) and (3), of distributions are given in the above Table. The histogram or distribution is an excellent tool for summarising the mass of data from a sample.

Descriptive statistics obtainable from an observed distribution

Often we wish to summarise the information contained in a distribution (which is itself a summary of the raw data), especially when the number of values of the variable is large. The most important summaries of a distribution are (i) a measure of location (Where on the scale are most persons located?), and (ii) a measure of variability or dispersion (How spread apart are the persons on the scale?).

Location (Chap. 2). We have already mentioned the mean or average as a measure of location; however, we can calculate the mean only if the variable is measured on an interval, ratio or absolute scale.
We have also mentioned the median (as the middle score), but the variable has to be on at least an ordinal scale for us to be able to calculate the median. A third measure of location is the mode, defined as the most frequently occurring value of the variable. This index can be defined for qualitative and for quantitative variables. We can now compare any two distributions with respect to location. Samples (1) and (2) in the Table above have different modes (the values 2 and 1, respectively), and (1) and (3) have the same mode.

Dispersion. We have mentioned, as a measure of variability or dispersion, the interval between the 5th and 95th percentiles of a distribution; but percentiles can be calculated only for variables on an ordinal (or higher) scale, and not for nominal variables. For nominal variables, the relative frequency at the mode is a good measure of concentration, which is the inverse of variability. If the rf at the mode is low (high), the dispersion is relatively high (low). (Percentile and rf at mode are not discussed in A&A.) In comparing samples (1)-(3) above, we see that sample (1) is the least dispersed, and sample (2) is the most dispersed (because 0.78 > 0.6 > 0.45). To sum up, samples (1) and (3) have the same mode, but (1) is less dispersed than (3); (2) differs from (1) in both location and dispersion.

Inferential statistics related to a univariate distribution (Chap. 14)

Suppose we have a sample that gives us the frequency distribution of a variable. We might wish to use these data to see if the population distribution from which we obtained our sample is the same as, or different from, another known population distribution. Usually, it is not possible to observe the entire population, and we can only observe small subsets known as samples. This process of using sample data (a particular subset) to infer something about a population (the entire set) is called statistical inference.

The sample frequency distribution is $\{f_i\}$, $i = 1, 2, \ldots, k$, where $\sum_{i=1}^{k} f_i = N$. Let us denote the known probability distribution by $\{p_i\}$; i.e., $p_i$ is the proportion of $i$'s in the entire known population, and $\sum_{i=1}^{k} p_i = 1$. If our sample was drawn from a population with distribution $\{p_i\}$, then we should find that $f_i \approx N p_i$. But how close do the $f_i$ have to be to the $N p_i$ for us to conclude that the sample was drawn from the known population? We answer this question in the following stylised way.

First, we state the information about the known population as a null hypothesis, $H_0$. For example, $H_0$: $p_1 = 0.22$, $p_2 = 0.68$, $p_3 = 0.10$. The question to be answered is whether $H_0$ provides a good fit to the observed frequencies in sample (1), given on p. 4. Let us denote the observed frequencies as $O_i$ (instead of $f_i$), and recall that $N = \sum_{i=1}^{k} O_i$ is the total number of observations.

Second, we calculate what the frequencies are expected to be if $H_0$ is true, i.e., if our sample were drawn from a population as given in $H_0$. These expected frequencies are denoted by $E_i$, and are given by the formula $E_i = N p_i$, which implies

$$\sum_{i=1}^{k} E_i = \sum_{i=1}^{k} N p_i = N \sum_{i=1}^{k} p_i = N.$$

Third, we calculate an index, known as chi-square or $\chi^2$, of the distance between the 2 sets of frequencies, $\{O_i\}$ and $\{E_i\}$:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}. \qquad (1)$$

A large value of $\chi^2$ would indicate a poor fit; in this case, we would reject $H_0$, and conclude that our sample was drawn from a population different from that described in the null hypothesis.
A small value of $\chi^2$ would indicate a good fit; in this case, we would retain $H_0$, and conclude that our sample was drawn from a population that is no different from that described in the null hypothesis. But what is a large value of $\chi^2$?

Fourth, we need to calculate what a large value of $\chi^2$ is. It is clear that, even if $H_0$ is true, the larger $k$ (the number of values) is, the larger $\chi^2$ will tend to be. Therefore, the definition of large depends on $k$. (Recall that, for a similar reason, the definition of a large height depends on the age, the analogue of $k$, of the person.) More precisely, the definition of large depends on the number of independent terms in the sum that is $\chi^2$. There are $k$ terms in the sum, but only $k-1$ of them are independent, i.e., are free to take on any value. Once $k-1$ of the terms in the sum are known, the $k$th term is known because the terms, $O_i - E_i$, satisfy the constraint

$$\sum_{i=1}^{k} (O_i - E_i) = \sum_{i=1}^{k} O_i - \sum_{i=1}^{k} E_i = N - N = 0.$$

The degrees of freedom (df) of $\chi^2$ is the number of independent terms in the sum that defines $\chi^2$; for this problem the df = $k-1$. Given the df, we can consult the Statistical Tables for the probability distribution of $\chi^2$ with the stated df to get the 95th percentile. Recall that, earlier, we adopted the convention that a large value of a quantitative variable is any value greater than or equal to the 95th percentile. We refer to this 95th percentile as the critical value of $\chi^2$. Table 10 gives these percentiles for df = 1,..., 8; the percentiles in this Table are derived from mathematical arguments beyond our scope.
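As an illustration of the goodness-of-fit steps, including the comparison of $\chi^2$ with its critical value, here is a minimal Python sketch. Several things in it are assumptions rather than part of the handout: the observed counts are hypothetical (they are not the class data), the function name `goodness_of_fit` is our own, $p_3 = 0.10$ is taken from the requirement that the $p_i$ sum to 1, and the critical values are the standard 95th percentiles of $\chi^2$ for df = 1 to 8.

```python
# 95th percentiles of the chi-square distribution for df = 1..8
# (standard statistical-table values, rounded to 2 decimals).
CRITICAL_95 = {1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49,
               5: 11.07, 6: 12.59, 7: 14.07, 8: 15.51}


def goodness_of_fit(observed, null_probs):
    """Return (chi2, df, reject) for observed counts O_i and H0 probabilities p_i."""
    n = sum(observed)                          # N = sum of O_i
    expected = [n * p for p in null_probs]     # E_i = N * p_i
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                     # k - 1 independent terms
    return chi2, df, chi2 > CRITICAL_95[df]


# Hypothetical observed frequencies for P = 1, 2, 3 (NOT the class data).
observed = [30, 60, 10]
# H0 from the text; p_3 = 0.10 is inferred so that the p_i sum to 1.
null_probs = [0.22, 0.68, 0.10]

chi2, df, reject = goodness_of_fit(observed, null_probs)
print(f"chi-square = {chi2:.2f}, df = {df}, reject H0: {reject}")
```

With these made-up counts the expected frequencies are 22, 68 and 10, giving $\chi^2 \approx 3.85$ on 2 df; this is below the critical value of 5.99, so $H_0$ would be retained.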

Table 10. Distribution of Chi-Square for Given Probability Levels
[Table: rows are df = 1,..., 8; columns are probability levels; entries are the corresponding percentiles of $\chi^2$.]

Fifth, if the $\chi^2$ goodness-of-fit index, as computed using the formula in Eq. (1) above, is greater than the critical value obtained from the Statistical Tables, we reject $H_0$. Otherwise, we retain $H_0$. Please go to the accompanying Problem Set Handout for Exercises on $\chi^2$.

Test of contingency between 2 discrete variables (Chap. 14)

Is there a relationship or contingency between the gender of a student (SG) and the gender of the target person imagined by the student (TG)? (TG refers to the imaginary person in the Class Project from the first lecture.) We answer this question by stating the null hypothesis:

$H_0$: There is no relationship or contingency between SG and TG.

Given the data, should we reject or retain $H_0$? In class we will code both variables using the values F and M, and will arrange the data in a contingency table (or bivariate frequency distribution) with 2 rows and 2 columns (i.e., a 2x2 table). Let us now use the data from a previous class.

                 TG
SG        M            F            Total
M         18 (9.9)     5 (13.1)     23
F         36 (44.1)    67 (58.9)    103
Total     54           72           126

Among 23 males in this previous class, 18 (78%) imagined a Male target and 5 imagined a Female target. Among 103 females in this class, 36 (35%) imagined a Male target and 67 imagined a Female target. The numbers in parentheses are expected frequencies, to be defined below.

There are two interesting sets of frequencies in the above contingency table.

(a) One is the set of marginal frequencies, i.e., the frequencies in the margins of the table:

(ai) the Row marginal frequencies, 23 and 103, give the (univariate) frequency distribution of the variable SG; this distribution tells us that there are about 4.5 times as many F's as M's in the class.
(aii) the Column marginal frequencies, 54 and 72, give the (univariate) frequency distribution of the variable TG; this distribution is not of primary interest, since it depends on the number of F's and M's in the class, and on their tendency to imagine female vs. male.

(b) The other set of interesting frequencies are the cell frequencies (also called the joint frequencies), 18, 5, 36 and 67; it is these frequencies that are most relevant to deciding if there is a contingency between SG and TG.

It is clear from this table (without doing a statistical test) that there is a relationship or contingency between SG and TG: female students tend to imagine the target as female, and male students tend to imagine the target as male. The pattern that would be consistent with no relation between SG and TG would be (i) 9.9 out of 23 males (43%) imagining Male, and (ii) 44.1 out of 103 females (43%) imagining Male. These expected frequencies are shown in parentheses in the table above. Note that the marginal frequencies derived from the expected frequencies are the same as those based on the observed joint frequencies.

Let cell($i$, $j$) be the cell at Row $i$ and Column $j$; let $R_i$ be the total frequency in Row $i$, and let $C_j$ be the total frequency in Column $j$. Let the expected frequency in cell($i$, $j$) be $E_{ij}$. If there is no relationship between the Row and Column variables, then the relative frequencies within any row should be the same as the relative frequencies of the Column totals; that is, for each $i$,

$$\frac{E_{ij}}{R_i} = \frac{C_j}{N}, \quad \text{implying that} \quad E_{ij} = \frac{R_i C_j}{N}.$$

To test the null hypothesis of no contingency, we need to assess the goodness-of-fit between the observed and expected joint frequencies. We again use the chi-square goodness-of-fit index:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}.$$

However, for this problem, the degrees of freedom of $\chi^2$ is now $(r-1)(c-1)$, where $r$ and $c$ are the number of rows and columns, respectively, of the contingency table. (The justification for this formula will be given in the next Handout.) In the present data set, $r = c = 2$, implying that the df = 1. The observed

$$\chi^2 = \frac{(18 - 9.9)^2}{9.9} + \frac{(5 - 13.1)^2}{13.1} + \frac{(36 - 44.1)^2}{44.1} + \frac{(67 - 58.9)^2}{58.9} = 14.4$$

(using unrounded expected frequencies). With 1 df, the critical value of $\chi^2$ is 3.84 (from the Statistical Tables). Since 14.4 > 3.84, the observed $\chi^2$ is large, and we must reject $H_0$ and conclude that there is a relationship between SG and TG.
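The contingency calculation above can be checked with a short Python sketch. It uses the observed joint frequencies from the previous class (18, 5, 36, 67) and the formula $E_{ij} = R_i C_j / N$; the function names `expected_frequencies` and `chi_square` are our own, chosen for illustration, and the critical value 3.84 is the 95th percentile of $\chi^2$ with 1 df quoted in the text.

```python
def expected_frequencies(observed):
    """Expected cell counts under no contingency: E_ij = R_i * C_j / N."""
    row_totals = [sum(row) for row in observed]          # R_i
    col_totals = [sum(col) for col in zip(*observed)]    # C_j
    n = sum(row_totals)                                  # N
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(observed, expected):
    """Chi-square index: sum over all cells of (O_ij - E_ij)^2 / E_ij."""
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))


# Rows: student gender SG (M, F); columns: target gender TG (M, F).
observed = [[18, 5],
            [36, 67]]

expected = expected_frequencies(observed)
chi2 = chi_square(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)        # (r-1)(c-1)
CRITICAL_95_DF1 = 3.84                                   # from the text, 1 df

print([[round(e, 1) for e in row] for row in expected])
print(f"chi-square = {chi2:.1f}, df = {df}, reject H0: {chi2 > CRITICAL_95_DF1}")
```

The expected frequencies come out as 9.9, 13.1, 44.1 and 58.9, and $\chi^2 \approx 14.4$ on 1 df, in agreement with the values in the table and the conclusion to reject $H_0$.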
