CSS 211: Statistical Methods I

CSS 211: Statistical Methods I Zhaoxian Zhou School of Computing University of Southern Mississippi Zhaoxian.Zhou@usm.edu January 11, 2018 Zhaoxian Zhou (USM) CSS 211 January 11, 2018 1 / 227

Overview
1. Looking at Data Distributions
2. Looking at Data Relationships
3. Producing Data
4. Probability: the Study of Randomness
5. Sampling Distributions
6. Introduction to Inference
7. Inference for Distributions

Objectives
Displaying distributions with graphs: Variables; Types of variables; Graphs for categorical variables (Bar graphs, Pie charts); Graphs for quantitative variables (Histograms, Stemplots, Stemplots versus histograms); Interpreting histograms; Time plots
Describing distributions with numbers: Measures of center (mean, median); Mean versus median; Measures of spread (quartiles, standard deviation); Five-number summary and boxplot; Choosing among summary statistics; Changing the unit of measurement
Density curves and Normal distributions: Density curves; Measuring center and spread for density curves; Normal distributions; The 68-95-99.7 rule; Standardizing observations; Using the standard Normal table; Inverse Normal calculations; Normal quantile plots

Basic Concepts
Statistics: the science of learning from data.
Cases: the objects described by a set of data. Cases can be individuals, companies, animals, plants, or any other object of interest.
Variable: a characteristic of a case.
  Categorical: a variable that places a case into one of several categories. Examples: blood type, hair color, first language.
  Quantitative: a variable that takes numerical values for which arithmetic operations, such as adding and averaging, make sense. Examples: age, height, blood pressure.
Choose a variable that measures what you want it to measure, e.g., a rate versus a count of occurrences.
Label: a special variable used in some data sets to distinguish the different cases.
The distribution of a variable tells us what values the variable takes and how often it takes these values.

Displaying Distributions with Graphs
Ways to chart categorical data:
  Bar graph: each category is represented by a bar.
  Pie chart: the slices must represent the parts of one whole.
Ways to chart quantitative data:
  Stemplot, also called a stem-and-leaf plot: each observation is split into a stem, consisting of all digits except the final one, and a leaf, the final digit.
  Histogram: breaks the range of values of a variable into classes and displays only the count or percent of the observations that fall into each class.
  Time plot: plots each observation against the time at which it was measured.

Bar Graphs and Pie Charts for Categorical Variables
Bar graph: each category is represented by a bar.
Pie chart: the slices must represent the parts of one whole, and all percents add up to 100.

Stemplots for Quantitative Variables
Leaf: the final digit. Stem: all other digits.
Write the stems in a vertical column with the smallest at the top, and draw a vertical line to the right of this column.
Write each leaf in the row to the right of its stem, in increasing order out from the stem.

Stemplot Variations

Histograms
Histograms are good for large data sets.
A histogram breaks the range of values into classes and shows the number of individual data points that fall in each class.

Examining Distributions
Look for the overall pattern and for striking deviations from the pattern.
Describe the overall pattern by shape, center, and spread.
Look for outliers: individual values that fall outside the overall pattern. A large gap in the distribution is typically a sign of an outlier.
Explain any outliers: errors in recording data? Equipment failure?
Modes: major peaks.
A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
A distribution is skewed to the right if the right side of the histogram extends much farther out than the left side.

Time Plots
A time plot shows each observation against the time at which it was measured.
It reveals trends or other changes over time, despite small irregularities.
Time goes on the horizontal scale.

Time Plots: An Example

Describing Distributions with Numbers: Measuring Center
The mean or average: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$, or $\bar{x} = \frac{1}{n}\sum_i x_i$.
The median M: the midpoint of a distribution, the number such that half of the observations are smaller and half are larger.
  Sort all observations.
  Find the midpoint value. If n is odd, $M = x_{(n+1)/2}$; if n is even, M is the mean of $x_{n/2}$ and $x_{n/2+1}$.
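
The two measures of center can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the two small data sets are assumptions chosen only to show the odd and even cases of the median rule.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(values) / len(values)


def median(values):
    """Midpoint of the sorted observations, following the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th sorted value (index mid, counting from 0).
        return ordered[mid]
    # Even n: the mean of the (n / 2)-th and (n / 2 + 1)-th sorted values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data sets, used only for illustration.
odd_sample = [7, 1, 5, 3, 9]
even_sample = [7, 1, 5, 3, 9, 11]

mean_odd = mean(odd_sample)
median_odd = median(odd_sample)
median_even = median(even_sample)
```

Python's standard `statistics` module provides `mean` and `median` with the same behavior; the hand-written versions are shown only to mirror the steps on the slide.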

Comparing the Mean and the Median
The median is a measure of center that is resistant to skew and outliers. The mean is not.
Example: a Powerball jackpot of $1.5B and its effect on the average and the median household income of the US ($43,585 in 2012).
Symmetric distribution: the mean equals the median.
Skewed distribution: the mean is farther out in the long tail than the median.

Describing Distributions with Numbers: Measuring Spread or Variability
The 50th percentile: the median.
The upper quartile: the median of the upper half of the data.
The lower quartile: the median of the lower half of the data.
The p-th percentile of a distribution is the value that has p percent of the observations fall at or below it.
The quartiles $Q_1$ and $Q_3$:
  Sort the observations in increasing order.
  Find the median M.
  Find $Q_1$: the median of the left half of the data.
  Find $Q_3$: the median of the right half of the data.
Interquartile range: $IQR = Q_3 - Q_1$.

Five-Number Summary and Boxplot
The five-number summary of a set of observations is: Minimum, $Q_1$, $M$, $Q_3$, Maximum.
A boxplot is a graph of the five-number summary.

Suspected Outliers and the Modified Boxplot
The $1.5 \times IQR$ rule: an observation is a suspected outlier if it falls more than $1.5 \times IQR$ above the third quartile or below the first quartile.
A modified boxplot: the lines extend out from the center box only to the smallest and largest observations that are not flagged by the $1.5 \times IQR$ rule.
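
The quartile and outlier rules can be sketched as follows. This is an illustration under stated assumptions: the data set is hypothetical, the function names are made up for this example, and when n is odd the median itself is left out of both halves (a common convention that the slides do not spell out).

```python
from statistics import median


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves method.

    Assumption: for odd n, the median observation is excluded from
    both the lower and the upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


def suspected_outliers(values):
    """Flag observations more than 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical data with one unusually large value.
data = [1, 3, 3, 3, 4, 4, 5, 5, 6, 6, 20]
q1, m, q3 = quartiles(data)
flagged = suspected_outliers(data)
```

Statistical software often uses slightly different quartile definitions, so the quartiles it reports may differ a little from the hand method.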

Standard Deviation
The variance ($s^2$) is defined as $s^2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2$.
The standard deviation ($s$) is defined as $s = \sqrt{\frac{1}{n-1}\sum_i (x_i - \bar{x})^2}$.
The number $n - 1$ is the degrees of freedom of the variance or standard deviation.
$s$ measures the spread about the mean.
$s = 0$ means there is no spread.
$s$ is not resistant to outliers and has the same units as the original observations.

Example: Calculate the Mean and the Standard Deviation
Given a data set: x = [1, 3, 5, 4, 6, 3, 6, 3, 5, 4]
Calculate the mean: $\bar{x} = \frac{1+3+5+4+6+3+6+3+5+4}{10} = \frac{40}{10} = 4$.
Calculate the standard deviation:
$x$                  1   3   5   4   6   3   6   3   5   4
$x - \bar{x}$       -3  -1   1   0   2  -1   2  -1   1   0
$(x - \bar{x})^2$    9   1   1   0   4   1   4   1   1   0
Variance: $\sigma^2 = \frac{9+1+1+0+4+1+4+1+1+0}{10-1} = \frac{22}{9}$.
Standard deviation: $\sigma = \sqrt{\frac{22}{9}} = 1.5635$.
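
The same calculation can be checked step by step in Python. This is a minimal sketch that mirrors the rows of the table; the variable names are chosen for this illustration only.

```python
from math import sqrt

x = [1, 3, 5, 4, 6, 3, 6, 3, 5, 4]
n = len(x)

x_bar = sum(x) / n                          # mean
deviations = [xi - x_bar for xi in x]       # row "x - x_bar"
squared = [d ** 2 for d in deviations]      # row "(x - x_bar)^2"

variance = sum(squared) / (n - 1)           # divide by n - 1, not n
std_dev = sqrt(variance)
```

The standard library functions `statistics.variance` and `statistics.stdev` also divide by n - 1, so they should agree with this hand calculation.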

Linear Transformation of Data
A linear transformation is $x_{new} = a + bx$, where a shifts the value of x upward or downward and b changes the size of the unit of measurement.
A linear transformation does not change the shape of a distribution.
Multiplication only: the measures of center (mean and median) are multiplied by b, and the measures of spread (interquartile range and standard deviation) are multiplied by $|b|$.
Addition only: a is added to the measures of center and to the percentiles, but the measures of spread do not change.
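
A short numerical check of these rules, as a sketch only: the raw values and the constants a = 20 and b = 1.2 below are hypothetical and serve just to show how center and spread respond to a linear transformation.

```python
from statistics import mean, median, stdev

# Hypothetical raw values and transformation constants.
raw = [40, 45, 50, 55, 60]
a, b = 20, 1.2

transformed = [a + b * x for x in raw]

# Measures of center follow a + b * (original value).
mean_new = mean(transformed)
median_new = median(transformed)

# Measures of spread are only rescaled by |b|; the shift a has no effect.
stdev_new = stdev(transformed)
```

Here `mean_new` should match `a + b * mean(raw)`, while `stdev_new` should match `abs(b) * stdev(raw)`.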

Linear Transformation Example 1
Fahrenheit is a temperature scale on which the freezing point of water is 32 degrees and the boiling point is 212 degrees at standard atmospheric pressure.
The Celsius and Fahrenheit scales are related by $F = 32 + 1.8C$.
[Figure: distributions of the same set of temperatures, in degrees C and in degrees F]

Linear Transformation Example 2
Curved scores y and raw scores x are related by $y = 1.2x + 20$.
Suppose the mean and the standard deviation of the raw scores are $\bar{x} = 50$, $\sigma_x = 10$.
The mean and standard deviation of the curved scores are $\bar{y} = ??$; $\sigma_y = ??$
How about the median and interquartile range of the curved scores?

Density Curves
A density curve is a smooth curve fitted to the data, a mathematical model of a distribution.
The area between the curve and the horizontal axis is 1.
The area under the curve and above any range of values is the proportion of all observations that fall in that range.

Center of Density Curves
The median of a density curve is the equal-areas point: the point that divides the area under the curve in half.
The mean of a density curve is the balance point: the point at which the curve would balance if made of solid material.
The median and mean are the same for a symmetric density curve; the mean of a skewed curve is pulled in the direction of the long tail.

Normal Curves
The Normal (Gaussian) distribution: symmetric, unimodal, and bell-shaped.
All Normal curves $N(\mu, \sigma)$ have the same overall shape: the height of the curve at point x is given by
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$.

The 68-95-99.7 Rule
In the Normal distribution with mean $\mu$ and standard deviation $\sigma$:
Approximately 68% of the observations fall within $\sigma$ of the mean $\mu$.
Approximately 95% of the observations fall within $2\sigma$ of the mean $\mu$.
Approximately 99.7% of the observations fall within $3\sigma$ of the mean $\mu$.
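
The three percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `proportion_within` is an assumption made for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def proportion_within(k):
    """Area under the standard Normal curve between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


# Proportion of observations within 1, 2, and 3 standard deviations.
coverage = {k: proportion_within(k) for k in (1, 2, 3)}
```

Because every Normal distribution looks the same once measured in units of $\sigma$ about $\mu$, checking the standard Normal curve is enough.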

The Standard Normal Curve
Because all Normal distributions share the same properties, we can standardize our data to transform any Normal curve $N(\mu, \sigma)$ into the standard Normal curve $N(0, 1)$.

Standardizing and z Scores
All Normal distributions $N(\mu, \sigma)$ are the same if measured in units of size $\sigma$ about the mean $\mu$.
The standard Normal distribution is the Normal distribution $N(0, 1)$.
z score: the standardized value of x is $z = \frac{x - \mu}{\sigma}$, which can be positive, zero, or negative.
Example: The heights of young women are Normal with $\mu = 64.5$ inches and $\sigma = 2.5$ inches. The z score is $z = \frac{x - 64.5}{2.5}$.
For height x = 68 inches, the z score is 1.4.
For height x = 60 inches, the z score is -1.8.

Normal Distributions
Cumulative proportion: the proportion of observations in a distribution that lie at or below a given value.
When the distribution is given by a density curve, the cumulative proportion is the area under the curve to the left of a given value.

Using the Standard Normal Table

Normal Distribution Calculations: Illustrations

Normal Distribution Calculations: Example
Question: SAT scores follow the Normal distribution N(1026, 209). What proportion of students score in the range (720, 820)?
Solution:
Step 1: standardize. $720 \le x \le 820$ becomes $\frac{720 - 1026}{209} \le Z \le \frac{820 - 1026}{209}$, or $-1.46 \le Z \le -0.99$.
Step 2: use the standard Normal table: $0.1611 - 0.0721 = 0.0890 \approx 9\%$.
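
The table lookup can be cross-checked in software. This sketch uses `statistics.NormalDist` from the Python standard library in place of the printed table, so the result differs slightly from the table answer because the z values are not rounded to two decimals.

```python
from statistics import NormalDist

sat = NormalDist(mu=1026, sigma=209)

# Step 1: standardize the two endpoints.
z_low = (720 - 1026) / 209
z_high = (820 - 1026) / 209

# Step 2: cumulative proportion below 820 minus cumulative proportion below 720.
proportion = sat.cdf(820) - sat.cdf(720)
```

The computed proportion should be close to the 9% found with the table.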

Normal Distribution Calculations: Inverse Problem
Question: SAT scores follow N(505, 110). What score places a student in the top 10%?
Solution:
Step 1: being in the top 10% means scoring higher than 90% of students.
Step 2: use the standard Normal table to find that for a cumulative proportion of 0.9, the corresponding z is 1.28.
Step 3: unstandardize: $\frac{x - 505}{110} = 1.28 \Rightarrow x = 505 + 1.28 \times 110 = 645.8$.
The general rule for unstandardizing is $x = \mu + z\sigma$.
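
The inverse calculation can also be sketched in Python. `NormalDist.inv_cdf` from the standard library plays the role of reading the table backwards; the variable names are illustrative assumptions.

```python
from statistics import NormalDist

mu, sigma = 505, 110

# Step 2: z value whose cumulative proportion is 0.90.
z = NormalDist().inv_cdf(0.90)

# Step 3: unstandardize with x = mu + z * sigma.
cutoff = mu + z * sigma

# The same answer, asking the N(505, 110) distribution directly.
cutoff_direct = NormalDist(mu, sigma).inv_cdf(0.90)
```

The software value of z has more decimals than the table value 1.28, so the cutoff comes out marginally above 645.8.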

Objectives
Scatterplots: Scatterplots; Explanatory and response variables; Interpreting scatterplots; Outliers; Categorical variables in scatterplots; Scatterplot smoothers
Correlation: The correlation coefficient; Influential points
Least-squares regression: Regression lines; Prediction and extrapolation; Correlation and $r^2$
Cautions about correlation and regression: Residuals; Outliers and influential points; Lurking variables; Correlation/regression using averages
Data analysis for two-way tables: Two-way tables; Joint distributions; Marginal distributions; Conditional distributions; Simpson's paradox

Introduction
In the last chapter we discussed the distribution of a single variable. In this chapter we discuss relationships between pairs of variables.
Are both variables quantitative? If yes, we use scatterplots for graphical display.
Are the variables associated? If yes, we examine correlation (calculate the correlation coefficient).
For mathematical modelling, we use least-squares regression to fit the data.
Are both variables categorical? If yes, we do data analysis for two-way tables.

Examining Relationships
Most statistical studies involve more than one variable.
Questions:
  What cases do the data describe?
  What variables are present and how are they measured?
  Are all of the variables quantitative?
  Do some of the variables explain or even cause changes in other variables?

Looking at Relationships
Start with a graph.
Look for an overall pattern and deviations from the pattern.
Use numerical descriptions of the data and overall pattern (if appropriate).

Scatterplots
Each axis represents one of the variables, and the data are plotted as points on the graph.
Describe the relationship by examining the form, direction, and strength of the association.
  Form: linear, curved, clusters, no pattern
  Direction: positive, negative, no direction
  Strength: how closely the points fit the form
  Deviations from the pattern: outliers

Form of an Association

Direction of an Association

Strength of an Association
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.

Outliers
An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected).

Example

Scatterplot Smoothers
When an association is more complex than linear, we can still describe the overall pattern by smoothing the scatterplot.
The simplest smoother averages the y values separately for each x value.
When a data set does not have many y values for a given x, software smoothers form an overall pattern by looking at the y values for points in the neighborhood of each x value.
Smoothers are resistant to outliers.

Correlation Coefficient: Definition
$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{\sigma_x}\right)\left(\frac{y_i - \bar{y}}{\sigma_y}\right)$,
where the first factor is the z score for $x_i$ and the second factor is the z score for $y_i$.

Correlation Coefficient: Example
Given the data set (1,2), (2,5), (3,6), (4,8), (5,11), (6,12), (7,16), (8,18), (9,19), (10,20), which has n = 10 pairs of data.
Step 1: Calculate the means and standard deviations for x and y.
Mean of the x data: $\bar{x} = \frac{1+2+3+4+5+6+7+8+9+10}{10} = 5.5$. Similarly, the mean of the y data is $\bar{y} = 11.7$.
Standard deviation of the x data: $\sigma_x = \sqrt{\frac{1}{10-1}\left[(1-5.5)^2 + \cdots + (10-5.5)^2\right]} = 3.0277$. Similarly for the y data, $\sigma_y = 6.3779$.

Example (Cont'd)
Step 2: Calculate the z-scores of $x_i$ and $y_i$, and the product of the z-scores for each data point:
Point 1: $z_x = \frac{1-5.5}{3.0277} = -1.4863$; $z_y = \frac{2-11.7}{6.3779} = -1.5209$; $z_x z_y = 2.2605$
Point 2: $z_x = \frac{2-5.5}{3.0277} = -1.1560$; $z_y = \frac{5-11.7}{6.3779} = -1.0505$; $z_x z_y = 1.2144$
Continue until the last point.
Point 10: $z_x = \frac{10-5.5}{3.0277} = 1.4863$; $z_y = \frac{20-11.7}{6.3779} = 1.3014$; $z_x z_y = 1.9342$
Step 3: Calculate the correlation coefficient by adding all products and dividing by $n - 1$:
$r = \frac{2.2605 + 1.2144 + \cdots + 1.9342}{10-1} = 0.9926$
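
The three steps can be reproduced in Python. This is a minimal sketch with function names chosen for this illustration; it follows the z-score definition of r rather than any particular library routine.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Standard deviation with n - 1 in the denominator."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Sum of products of z-scores, divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 5, 6, 8, 11, 12, 16, 18, 19, 20]

sd_x = sample_sd(xs)
sd_y = sample_sd(ys)
r = correlation(xs, ys)
```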

Correlation Coefficient: Properties
The correlation coefficient is a measure of the direction (sign) and strength (absolute value) of a linear relationship.
It is calculated using the mean and the standard deviation of both the x and y variables.
Correlation can only be used to describe quantitative variables. Categorical variables do not have means and standard deviations.
The correlation coefficient treats x and y symmetrically; it does not distinguish between x and y.
The correlation coefficient is unitless. This allows us to compare correlations between data sets where variables are measured in different units or when the variables are different.
Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers.

Range of Correlation Coefficient

Regression Line
Correlation tells us about the strength (scatter) and direction of the linear relationship between two quantitative variables.
We would like to have a numerical description of how both variables vary together.
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
Use a regression line to predict the value of y for a given value of x. How does y change as x changes?
In regression, the distinction between explanatory and response variables is important.

Concept of Least-Squares Regression
There are many lines that fit the data. But which line best describes the data?
The least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is as small as possible (least).
The method is exceptionally helpful in statistics.

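Least-Squares Regression Line
The least-squares regression line is given by $\hat{y} = b_0 + b_1 x$, where
$b_1 = r\frac{\sigma_y}{\sigma_x} = \frac{\frac{1}{n-1}\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1}\sum_i (x_i - \bar{x})^2} = \frac{\sigma_{xy}}{\sigma_x^2}$; $b_0 = \bar{y} - b_1\bar{x}$.
r is the correlation, $\sigma_x$ and $\sigma_y$ are the standard deviations, $\sigma_{xy}$ is the sample covariance of x and y, and $\bar{x}$ and $\bar{y}$ are the means.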

Example: Least-Squares Regression Line
Data set: (1,2), (2,5), (3,6), (4,8), (5,11), (6,12), (7,16), (8,18), (9,19), (10,20)
Step 1: Calculate (as in the previous example) $\bar{x} = 5.5$; $\bar{y} = 11.7$; $\sigma_x = 3.0277$; $\sigma_y = 6.3779$; $r = 0.9926$.
Step 2: Calculate the slope of the least-squares regression line: $b_1 = r\frac{\sigma_y}{\sigma_x} = 0.9926 \times \frac{6.3779}{3.0277} = 2.0909$.
Step 3: Calculate the y-intercept of the least-squares regression line: $b_0 = \bar{y} - b_1\bar{x} = 11.7 - 2.0909 \times 5.5 = 0.2000$.
[Figure: scatterplot of the data with the least-squares regression line $y = 2.0909x + 0.2000$]
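
The slope and intercept can be checked with a short Python sketch. It uses the sum form of the slope, which is algebraically the same as $r\sigma_y/\sigma_x$; the function name is an assumption made for this illustration.

```python
def least_squares_line(xs, ys):
    """Return (intercept b0, slope b1) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx           # the 1 / (n - 1) factors cancel in the ratio
    b0 = y_bar - b1 * x_bar    # the line passes through (x_bar, y_bar)
    return b0, b1


xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 5, 6, 8, 11, 12, 16, 18, 19, 20]

b0, b1 = least_squares_line(xs, ys)

# Prediction for an x value inside the observed range (interpolation).
y_hat_at_5_5 = b0 + b1 * 5.5
```

Predicting at $x = \bar{x} = 5.5$ returns $\bar{y} = 11.7$, since the least-squares line always passes through the point of means.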

Making Predictions: Interpolation
Interpolation: the equation of the least-squares regression line allows you to predict y for any x within the range studied.
Example: nobody in the study drank 6.5 beers, but by finding the value of $\hat{y}$ from the regression line for x = 6.5 we would expect a blood alcohol content of $\hat{y} = 0.0144 \times 6.5 + 0.0008 = 0.094$ mg/ml.

Making Predictions: Extrapolation
Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line.
Extrapolated predictions can be badly wrong, because the linear pattern may not hold outside the observed range.

Correlation and Regression
Correlation measures the spread (scatter) in both the x and y directions in the linear relationship. It is computed from products of z-scores.
Regression examines the variation in the response variable (y) given a change in the explanatory variable (x).
If the y intercept is zero, recall that $b_1 = r\frac{\sigma_y}{\sigma_x}$, so $y = b_1 x \Rightarrow \frac{y}{\sigma_y} = r\frac{x}{\sigma_x}$.

Coefficient of Determination
Coefficient of determination ($r^2$): the square of the correlation coefficient.
$r^2$ represents the fraction of the variance in y that can be explained by changes in x through the regression line, or
$r^2 = \frac{\text{variance of the predicted values } \hat{y}}{\text{variance of the observed values } y}$.

Coefficient of Determination: Example
Assume the first data set is (1,1), (2,2), (3,3). Then $\bar{x} = 2$, $\sigma_x = 1$; $z_x = -1, 0, 1$; $\bar{y} = 2$, $\sigma_y = 1$; $z_y = -1, 0, 1$; $r = 1$.
$b_1 = r\frac{\sigma_y}{\sigma_x} = 1$; $b_0 = \bar{y} - b_1\bar{x} = 0$, so the least-squares regression line is $\hat{y} = x$.
Assume the second data set is (1,1), (2,5), (3,3). Then $\bar{x} = 2$, $\sigma_x = 1$; $z_x = -1, 0, 1$; $\bar{y} = 3$, $\sigma_y = 2$; $z_y = -1, 1, 0$; $r = 0.5$.
$b_1 = r\frac{\sigma_y}{\sigma_x} = 1$; $b_0 = \bar{y} - b_1\bar{x} = 1$, so the least-squares regression line is $\hat{y} = x + 1$.
For the second data set, the predicted values are 2, 3, 4:
$\sigma^2_{pred} = \frac{(2-3)^2 + (3-3)^2 + (4-3)^2}{3-1} = 1$; $\sigma^2_{obsv} = 4$; $r^2 = 0.25$.
How about the first data set?
[Figure: scatterplots of the two data sets with their regression lines]
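
The variance-ratio reading of $r^2$ can be checked on the second data set. This is a minimal sketch using only the Python standard library; variable names are illustrative.

```python
from statistics import mean, variance

xs = [1, 2, 3]
ys = [1, 5, 3]

x_bar, y_bar = mean(xs), mean(ys)

# Least-squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# Predicted values on the regression line.
predicted = [b0 + b1 * x for x in xs]

# r^2 as the ratio of the two sample variances (both use n - 1).
r_squared = variance(predicted) / variance(ys)
```

With $r = 0.5$ for this data set, the ratio should equal $0.5^2 = 0.25$.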

Coefficient of Determination: Examples

Residuals and Residual Plots
Residual: the vertical distance from a data point to the least-squares regression line, that is, the observed y minus the predicted $\hat{y}$.
It is the contribution of an individual data point to the overall pattern of scatter.
If the residuals are scattered randomly around 0, the data fit a linear model, the residuals are approximately Normally distributed, and there are no outliers.

Examples of Residual Plots

Outliers and Influential Points
Outlier: an observation that lies outside the overall pattern of observations.
Influential individual: an observation that markedly changes the regression if removed. This is often an outlier in the x direction.

Two-Way Tables
Two-way tables organize data about two categorical variables obtained from a two-way, or block, design. (There are now two ways to group the data.)
In the example table, we call education the row variable and age group the column variable.
Each combination of values for these two variables is called a cell.
For each cell, we can compute a proportion by dividing the cell entry by the total sample size. The collection of these proportions is the joint distribution of the two variables.

Marginal Distributions
The marginal distribution, expressed in counts or percentages, is the distribution of a single categorical variable in a two-way table.
The marginal distributions can be displayed on separate bar graphs, typically expressed as percents instead of raw counts. Each graph represents only one of the two variables, completely ignoring the second one.

Conditional Distributions
Conditional distribution: the distribution of one variable given a particular value of the other variable.
The conditional distributions can be compared graphically using side-by-side bar graphs of one variable for each value of the other variable.

Conditional vs. Marginal Distributions
A conditional distribution is a probability distribution for a sub-population. In other words, it shows the probability that a randomly selected item in a sub-population has a characteristic you're interested in.
Marginal distributions are the totals for the probabilities. They are found in the margins of the table.
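
The joint, marginal, and conditional distributions of a two-way table can be sketched as follows. The counts are hypothetical and the category names are assumptions made only for this illustration; they are not data from the course.

```python
# Hypothetical counts: education (row variable) by age group (column variable).
counts = {
    "no college": {"25-34": 30, "35-54": 50},
    "college": {"25-34": 70, "35-54": 50},
}

total = sum(sum(row.values()) for row in counts.values())

# Joint distribution: each cell count divided by the total sample size.
joint = {r: {c: n / total for c, n in row.items()} for r, row in counts.items()}

# Marginal distribution of the row variable: row totals over the grand total.
row_marginal = {r: sum(row.values()) / total for r, row in counts.items()}

# Marginal distribution of the column variable: column totals over the grand total.
col_totals = {}
for row in counts.values():
    for c, n in row.items():
        col_totals[c] = col_totals.get(c, 0) + n
col_marginal = {c: n / total for c, n in col_totals.items()}

# Conditional distribution of education given each age group:
# each cell count divided by its column total.
conditional = {
    c: {r: counts[r][c] / col_totals[c] for r in counts} for c in col_totals
}
```

Each conditional distribution sums to 1 within its own age group, whereas the joint proportions sum to 1 over the whole table.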

Simpson's Paradox
Simpson's paradox: an association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group, usually because of a lurking variable.
Lurking variable: a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.

Objectives
Design of experiments: Anecdotal and available data; Comparative experiments; Randomization; Randomized comparative experiments; Cautions about experimentation; Matched pairs designs; Block designs
Sampling designs; Toward statistical inference: Sampling methods; Simple random samples; Stratified samples; Caution about sampling surveys; Population versus sample; Toward statistical inference; Sampling variability; Capture-recapture sampling
Ethics: Institutional review boards; Informed consent; Confidentiality; Clinical trials; Behavioral and social science experiments

Obtaining Data
Available data: data that were produced in the past for some other purpose but that may help answer a present question inexpensively. The library and the Internet are sources of available data.
Anecdotal evidence: evidence based on haphazardly selected individual cases, which we tend to remember because they are unusual in some way. These cases may not be representative of any larger group of cases.
Example: In the US in 2013, the one-year odds of death from all motor vehicle accidents were one in 8,938; the one-year odds of death from lightning were one in 13,744,732. The odds of winning the Powerball jackpot are one in 292 million.
Some questions require data produced specifically to answer them. This leads to designing observational or experimental studies.

Basic Concepts
Population: the entire group of individuals in which we are interested but usually can't assess directly. Example: how long can LED bulbs last?
Sample: the part of the population we actually examine and for which we do have data. How well the sample represents the population depends on the sample design.
A parameter is a number describing a characteristic of the population. A statistic is a number describing a characteristic of a sample.
Observational study: record data on individuals without attempting to influence the responses. Example: what is the average life span of the items in the sample?
Experimental study: deliberately impose a treatment on individuals and record their responses. Influential factors can be controlled. Example: what is the effect of a drug on a disease?

Observational Studies vs. Experiments
Observational studies are essential sources of data on a variety of topics. However, when our goal is to understand cause and effect, experiments are the only source of fully convincing data.
Two variables are confounded when their effects on a response variable cannot be distinguished from each other.
Example: If we simply observe cell phone use and brain cancer, any effect of radiation on the occurrence of brain cancer is confounded with lurking variables such as age, occupation, and place of residence.
Well-designed experiments take steps to defeat confounding.

Terminology
The individuals in an experiment are the experimental units. If they are human, we call them subjects.
In an experiment, we do something to the subject and measure the response. The something we do is called a treatment, or factor.
If the experiment involves giving two different doses of a drug, we say that we are testing two levels of the factor.
A response to a treatment is statistically significant if it is larger than you would expect by chance (due to random variation among the subjects).

Comparative Experiments
Experiments are comparative in nature: we compare the response to a treatment with the response to
  another treatment,
  no treatment (a control),
  a placebo,
  or any combination of the above.
A control is a situation where no treatment is administered. It serves as a reference mark for an actual treatment (e.g., a group of subjects does not receive any drug or pill of any kind).
A placebo is a fake treatment, such as a sugar pill. This is to test the hypothesis that the response to the actual treatment is due to the actual treatment and not the subject's apparent treatment.
The placebo effect is an improvement in health not due to any treatment, but only to the patient's belief that he or she will improve.

Designing Controlled Experiments
Example: an experiment to evaluate the success of various fertilizer treatments was worthless because of poor experimental design:
  Fertilizer had been applied to a field one year and not another, in order to compare the yield of grain produced in the two years.
  Fertilizer was applied to one field and not to a nearby field in the same year.
Fisher's solution: randomized comparative experiments.

Randomization
One way to randomize an experiment is to rely on random digits to make choices in a neutral way. We can use a table of random digits or the random sampling function of statistical software.
To randomly choose n individuals from a group of N:
1. Label each of the N individuals with a number (typically from 1 to N, or 0 to N - 1).
2. Parse a list of random digits into numbers with the same number of digits as N.
3. Read the parsed list in sequence and select the first n numbers that correspond to a label in the group of N.
4. The n individuals with these labels constitute the selection.

Randomization Example
Problem: randomly select five students from a class of 20.
List and number all students as 01, 02, ..., 20.
The number 20 is two digits long, so parse the list of random digits into numbers that are two digits long. Here we chose to start with line 103 of the random digit table for no particular reason.
Randomly choose five students by reading through the list of two-digit random numbers, starting with line 103 and continuing on. The first five random numbers that match the numbers assigned to students make our selection.
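
The same selection can be made with software instead of a table of random digits. This sketch swaps in Python's `random.sample` for the table-reading procedure; the function name and the seed value are assumptions chosen only so the example is repeatable.

```python
import random


def choose_simple_random_sample(labels, n, seed=None):
    """Choose n distinct labels so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(labels, n)


# Label the 20 students 01, 02, ..., 20.
students = [f"{i:02d}" for i in range(1, 21)]

# The seed is arbitrary; it plays the role of the starting line in the table.
chosen = choose_simple_random_sample(students, 5, seed=103)
```

Like the table method, `random.sample` draws without replacement, so no student can be selected twice.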

Principles of Experimental Design
Three big ideas of experimental design:
  Control the effects of lurking variables on the response, simply by comparing two or more treatments.
  Randomize: use impersonal chance to assign subjects to treatments.
  Replicate each treatment on enough subjects to reduce chance variation in the results.
Statistical significance: an observed effect so large that it would rarely occur by chance is called statistically significant.
Completely randomized experimental designs: individuals are randomly assigned to groups, then the groups are randomly assigned to treatments.

Biased Design
The design of a study is biased if it systematically favors certain outcomes.

Caution about Experimentation
Ways to remove bias:
  Randomize the design: both the individuals and treatments are assigned randomly.
  Use a double-blind experiment: neither the subjects nor the experimenter knows which individuals got which treatment until the experiment is completed. The goal is to avoid forms of placebo effects and biases based on interpretation.
  Replicate your experiment: this ensures that particular results are not due to uncontrolled factors or errors of manipulation.
Lack of realism is a serious weakness of experimentation. The subjects or treatments or setting of an experiment may not realistically duplicate the conditions we really want to study. In that case, we cannot generalize the conclusions of the experiment.
Example: studying the effects of hair spray on rats to determine what will happen to women with big hair.

Block Designs
In a block, or stratified, design, subjects are divided into groups, or blocks, prior to the experiment, to test hypotheses about differences between the groups.

Matched Pairs Designs
In a matched pairs design, choose pairs of subjects that are closely matched, e.g., same sex, height, weight, age, and race. Within each pair, randomly assign who will receive which treatment.
It is also possible to use a single person and give the two treatments to this person over time in random order. In this case, the matched pair is the same person at different points in time.

Sampling Methods
Convenience sampling: just ask whoever is around. Bias: opinions are limited to the individuals present.
Voluntary response sampling: often presented as public opinion polls, these are not considered valid or scientific, because different people are motivated to respond or not.
Probability or random sampling: individuals are randomly selected. No one group should be over-represented. Sampling randomly gets rid of bias.
A simple random sample (SRS) is made of randomly selected individuals. Each individual in the population has the same probability of being in the sample, and all possible samples of size n have the same chance of being drawn.

Sampling Methods: Stratified Samples
A stratified random sample is essentially a series of SRSs performed on subgroups of a given population. The subgroups are chosen to contain all the individuals with a certain characteristic. Examples:
Divide the population of USM students into males and females.
Divide the population of California by major ethnic group.
The SRSs taken within each group in a stratified random sample need not be of the same size. For example:
A stratified random sample of 100 male and 150 female USM students.
A stratified random sample of a total of 100 Californians, representing proportionately the major ethnic groups.
Multistage samples use multiple stages of stratification.

Caution about Sampling Surveys
Nonresponse: people who feel they have something to hide or who don't like their privacy being invaded probably won't answer. Yet they are part of the population.
Response bias: a fancy term for lying when you think you should not tell the truth, or for forgetting. This is particularly important when the questions are very personal (e.g., "How much do you drink?") or related to the past.
Wording effects: questions worded like "Do you agree that it is awful that ..." prompt you to give a particular response.
Undercoverage: occurs when parts of the population are left out in the process of choosing the sample.

Toward Statistical Inference
The techniques of inferential statistics allow us to draw inferences or conclusions about a population from a sample.
Your estimate of the population is only as good as your sampling design, so work hard to eliminate biases.
Your sample is only an estimate: if you randomly sampled again, you would probably get a somewhat different result.
The bigger the sample, the better.

Sampling Variability
Each time we take a random sample from a population, we are likely to get a different set of individuals and a different statistic. This is called sampling variability.
The good news is that, if we take lots of random samples of the same size from a given population, the variation from sample to sample (the sampling distribution) will follow a predictable pattern. All of statistical inference is based on this knowledge.
The variability of a statistic is described by the spread of its sampling distribution. This spread depends on the sampling design and the sample size n, with larger sample sizes leading to lower variability.
Statistics from large samples are almost always close estimates of the true population parameter. However, this only applies to random samples.

Capture-Recapture Sampling
Repeated sampling can be used to estimate the size N of a population.
Example: What is the number of birds of one species (least flycatcher) migrating along a major route? Least flycatchers are caught in nets, tagged, and released. The following year, birds are caught again and the numbers tagged versus not tagged are recorded.
Solution: the proportion of tagged birds in the sample should be a reasonable estimate of the proportion of tagged birds in the population.
This works well if both samples are SRSs from the population and the population remains unchanged between samples. In practice, however, some of the birds tagged last year died before this year's migration.
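
Setting the two proportions equal gives a simple estimate of N. The sketch below is illustrative only: the function name and the bird counts are hypothetical (the slide gives no numbers), and it assumes the idealized conditions stated above (both samples are SRSs, population unchanged).

```python
def estimate_population(tagged_first, caught_second, tagged_in_second):
    """Estimate N from (tagged_in_second / caught_second) ~= (tagged_first / N)."""
    if tagged_in_second == 0:
        raise ValueError("no tagged animals recaptured; the estimate is undefined")
    return tagged_first * caught_second / tagged_in_second

# Hypothetical counts: 200 birds tagged in year 1; in year 2, 120 caught, 12 of them tagged.
n_hat = estimate_population(tagged_first=200, caught_second=120, tagged_in_second=12)
print(n_hat)
```

If tagged birds die between the two years, fewer tags are recaptured than the model expects, so this estimate tends to overstate N.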

Ethics: Institutional Review Boards
The organization that carries out the study must have an institutional review board that reviews all planned studies in advance in order to protect the subjects from possible harm.
The purpose of an institutional review board is to protect the rights and welfare of human subjects (including patients) recruited to participate in research activities.
The institutional review board:
reviews the plan of study
can require changes
reviews the consent form
monitors progress at least once a year

Ethics: Informed Consent
All subjects must give their informed consent before data are collected. Subjects must be informed in advance about the nature of a study and any risk of harm it might bring. Subjects must then consent in writing.
Who can't give informed consent?
prison inmates
very young children
people with mental disorders

Ethics: Confidentiality
All individual data must be kept confidential. Only statistical summaries may be made public.
Confidentiality is not the same as anonymity. Anonymity prevents follow-ups to improve non-response or inform subjects of results.
Separate the identity of the subjects from the rest of the data immediately!
Example: Citizens are required to give information to the government (tax returns, social security contributions). Some people feel that individuals should be able to forbid any other use of their data, even with all identification removed.

Ethics: Clinical Trials
Clinical trials study the effectiveness of medical treatments on actual patients; these treatments can harm as well as heal.
Points for a discussion:
Randomized comparative experiments are the only way to see the true effects of new treatments.
Most benefits of clinical trials go to future patients. We must balance future benefits against present risks.
The interests of the subject must always prevail over the interests of science and society.
In the 1930s, the Public Health Service Tuskegee study recruited 399 poor Black men with syphilis and 201 without the disease in order to observe how syphilis progressed without treatment. The Public Health Service prevented any treatment until word leaked out and forced an end to the study in the 1970s.

Ethics: Behavioral and Social Science Experiments
Many behavioral experiments rely on hiding the true purpose of the study. Subjects would change their behavior if told in advance what investigators were looking for.
The Ethical Principles of the American Psychological Association require consent unless a study merely observes behavior in a public space.

Objectives
Randomness; Probability models: Probability and randomness; Sample spaces; Probability rules; Assigning probabilities: finite number of outcomes; Assigning probabilities: equally likely outcomes; Independence and multiplication rule
Random variables; Means and variances of random variables: Discrete random variables; Continuous random variables; Normal probability distributions; Mean of a random variable; Law of large numbers; Variance of a random variable; Rules for means and variances
General probability rules: General addition rules; Conditional probability; General multiplication rules; Tree diagrams; Bayes's rule

Basic Concepts
A phenomenon is random if individual outcomes are uncertain, but there is nonetheless a regular distribution of outcomes in a large number of repetitions.
The probability of any outcome of a random phenomenon can be defined as the proportion of times the outcome would occur in a very long series of repetitions.
Two events are independent if the probability that one event occurs on any given trial of an experiment is not affected or changed by the occurrence of the other event.

Example: Coin Toss
The result of any single coin toss is random. But the result over many tosses is predictable, as long as the trials are independent (i.e., the outcome of a new coin flip is not influenced by the result of the previous flip).

Understanding Probability: Sometimes Not Obvious
Example: the Monty Hall problem (three-door game) is a counter-intuitive statistics puzzle.
There are 3 doors, behind which are two goats and a car. You pick a door (1), hoping for the car. Monty Hall, who knows where the car is, opens one of the other two doors (2 or 3) to reveal a goat.
Here's the game: Do you stick with door 1 (your original guess) or switch to the other unopened door? Does it matter?
Calculation: sticking with door 1 wins with probability 1/3 and loses with probability 2/3. Whenever sticking loses, switching wins, so switching wins with probability 2/3.
Still cannot imagine why you need to switch? How about 1000 doors?
Is it clear now? Explain why switching does NOT win with probability 1/3.
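
A quick simulation can make the 1/3 versus 2/3 split easier to believe. This is a minimal sketch under stated assumptions: the host always opens every other door except one and never reveals the car; the function names `play` and `win_rate`, the trial count, and the seed are all illustrative choices, not part of the slides.

```python
import random

def play(switch, rng, n_doors=3):
    """Play one game; return True if the player ends up with the car."""
    car = rng.randrange(n_doors)
    pick = rng.randrange(n_doors)
    if pick == car:
        # Host leaves one arbitrary goat door closed.
        remaining = rng.choice([d for d in range(n_doors) if d != pick])
    else:
        # Host must leave the car's door closed, since only goats are revealed.
        remaining = car
    final = remaining if switch else pick
    return final == car

def win_rate(switch, trials=100_000, n_doors=3, seed=0):
    """Estimate the winning probability by repeating the game many times."""
    rng = random.Random(seed)
    return sum(play(switch, rng, n_doors) for _ in range(trials)) / trials

print("stick :", win_rate(switch=False))
print("switch:", win_rate(switch=True))
print("switch, 1000 doors:", win_rate(switch=True, trials=20_000, n_doors=1000))
```

With 1000 doors the player's first pick is right only 1 time in 1000, so the single door the host leaves closed almost always hides the car.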

Probability Models
Probability models describe, mathematically, the outcome of random processes. They consist of two parts:
S = sample space: the set, or list, of all possible outcomes of a random process. An event is a subset of the sample space.
A probability for each possible event in the sample space S.
Example 1: Probability model for a coin toss: S = {Head, Tail}; probability of heads = 0.5; probability of tails = 0.5.
Example 2: Probability model for a two-coin toss event: Ordered? Non-ordered?

Probability Rules: Addition Rule for Disjoint Events
Probabilities range from 0 (no chance of the event) to 1 (the event has to happen). For any event A, $0 \le P(A) \le 1$.
Because some outcome must occur on every trial, the sum of the probabilities for all possible outcomes (the sample space) must be exactly 1: P(sample space) = 1.
Two events A and B are disjoint if they have no outcomes in common and can never happen together. The probability that A or B occurs is then the sum of their individual probabilities:
$P(A \text{ or } B) = P(A \cup B) = P(A) + P(B)$.
This is the addition rule for disjoint events.

Probability Rules: Complement Rule
The complement of any event A is the event that A does not occur, written as $A^c$.
The complement rule states that the probability of an event not occurring is 1 minus the probability that it does occur:
$P(\text{not } A) = P(A^c) = 1 - P(A)$.
Example 1: $\text{Tail}^c$ = not Tail = Head; $P(\text{Tail}^c) = P(\text{Head}) = 0.5$.
Example 2: if $P(\text{score} \ge 80) = 0.6$, then $P(\text{score} < 80) = 0.4$.
Venn diagram: the sample space is made up of an event A and its complement $A^c$, i.e., everything that is not A.

Probabilities: Finite Number of Outcomes
Finite sample spaces deal with discrete data: data that can only take on a limited number of values.
The individual outcomes of a random phenomenon are always disjoint.
Addition rule: the probability of any event is the sum of the probabilities of the outcomes making up the event.
Example: M&M candies. If you draw an M&M candy at random from a bag, the candy will have one of six colors. Assume the probabilities of brown, red, yellow, green, and orange are given (the table of values is not recovered here).
The probability that an M&M chosen at random is blue:
P(blue) = 1 - [P(brown) + P(red) + P(yellow) + P(green) + P(orange)] = 0.1
The probability that a random M&M is either red, yellow, or orange:
P(red or yellow or orange) = P(red) + P(yellow) + P(orange) = 0.5

Probabilities: Equally Likely Outcomes
We can assign probabilities either:
empirically: from our knowledge of numerous similar past events, or
theoretically: from our understanding of the phenomenon and symmetries in the problem.
If a random phenomenon has k equally likely possible outcomes, then each individual outcome has probability 1/k. And, for any event A:
$P(A) = \dfrac{\text{count of outcomes in } A}{\text{count of outcomes in } S}$.
Example: Toss two dice.
P(the roll of two dice sums to 5) = P(1,4) + P(2,3) + P(3,2) + P(4,1) = 4/36 = 0.111
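
The counting rule for equally likely outcomes can be checked by listing the sample space. A small sketch (the helper name `prob` is an illustrative choice):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of two dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(A) = (count of outcomes in A) / (count of outcomes in S)."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

p_sum5 = prob(lambda o: sum(o) == 5)
print(p_sum5, float(p_sum5))
```

Using `Fraction` keeps the answer exact (4/36 reduces to 1/9) before converting to a decimal.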

Discussion: Do You Want to Play the Roulette?
The payout, for American and European roulette, can be calculated by
$\text{payout} = \dfrac{36}{n} - 1$,
where n is the number of squares the player is betting on. The initial bet is returned in addition to the mentioned payout.
The house average or house edge (expected value) is the amount the player loses, on average, relative to any bet made.
The hold is the average percentage of the money originally brought to the table that the player loses before he leaves.
The average win/hold for double zero wheels is between 21% and 30%, significantly more than the 5.26% house edge.
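
The 5.26% figure follows from the payout formula and the number of pockets on the wheel. The sketch below assumes a double-zero (American) wheel with 38 pockets and a single-zero (European) wheel with 37; the function name is illustrative, and only bets that follow the 36/n - 1 payout are considered.

```python
from fractions import Fraction

def expected_net_per_unit(n, pockets=38):
    """Expected net gain per unit staked on n squares, with payout 36/n - 1 to 1."""
    payout = Fraction(36, n) - 1
    p_win = Fraction(n, pockets)
    # Win `payout` with probability p_win; lose the 1-unit stake otherwise.
    return p_win * payout - (1 - p_win)

for n in (1, 2, 3, 4, 6, 12, 18):
    print(n, float(expected_net_per_unit(n)))          # double-zero wheel
print("single zero:", float(expected_net_per_unit(1, pockets=37)))
```

Under these assumptions the expected loss per unit is the same for every n, which is why the house edge is quoted as a single number per wheel type.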

Probability Rules: Multiplication Rule for Independent Events
Two events A and B are independent if knowing that one occurs does not change the probability that the other occurs.
Multiplication rule for independent events: if A and B are independent, $P(A \text{ and } B) = P(A)\,P(B)$.
Example: Two consecutive coin tosses: P(first Tail and second Tail) = P(first Tail) × P(second Tail) = 0.5 × 0.5 = 0.25.
Venn diagram: event A and event B. The intersection represents the event {A and B}, the outcomes common to both A and B.

Example: Application of Addition/Multiplication Rules
A couple wants three children. What are the arrangements of boys (B) and girls (G)? Assume that the probability that a baby is a boy or a girl is the same, 0.5.
Sample space: BBB, BBG, BGB, GBB, GGB, GBG, BGG, GGG. All eight outcomes in the sample space are equally likely. The probability of each is thus 1/8.
Each birth is independent of the next, so we can use the multiplication rule. Example: $P(BBB) = P(B)\,P(B)\,P(B) = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8}$.
A couple wants three children. What are the numbers of girls (X) they could have? The same genetic laws apply. We can use the probabilities above and the addition rule for disjoint events to calculate the probabilities for X.
Sample space: 0, 1, 2, 3
P(X = 0) = P(BBB) = 1/8
P(X = 1) = P(BBG or BGB or GBB) = P(BBG) + P(BGB) + P(GBB) = 3/8
and so on:
Value of X     0     1     2     3
Probability    1/8   3/8   3/8   1/8
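
The distribution of X can be reproduced by listing the eight birth orders and counting girls in each. A minimal sketch (variable names are illustrative):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Eight equally likely birth orders, e.g. "BBG".
births = ["".join(b) for b in product("BG", repeat=3)]

# Addition rule: P(X = x) adds up the 1/8 probabilities of the orders with x girls.
counts = Counter(order.count("G") for order in births)
dist = {x: Fraction(c, len(births)) for x, c in sorted(counts.items())}

for x, p in dist.items():
    print(x, p)
```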

Discrete Random Variables
A random variable is a variable whose value is a numerical outcome of a random phenomenon.
A discrete random variable X has a finite number of possible values.
The probability distribution of a random variable X lists the values and their probabilities. The probabilities $p_i$ must add up to 1.
Value of X     $x_1$   $x_2$   $x_3$   ...   $x_k$
Probability    $p_1$   $p_2$   $p_3$   ...   $p_k$
The probability of any event is the sum of the probabilities $p_i$ of the values of X that make up the event.
A coin is tossed twice. What is the probability of getting at least 1 head?
A coin is tossed five times. What is the probability of getting at least 3 heads?
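
The two coin questions can be answered by summing the probabilities of the values of X (number of heads) that make up each event. The sketch assumes a fair coin and independent tosses; `p_at_least` is an illustrative helper name.

```python
from fractions import Fraction
from math import comb

def p_at_least(k, n):
    """P(at least k heads in n fair tosses): count sequences with j >= k heads out of 2**n."""
    favorable = sum(comb(n, j) for j in range(k, n + 1))
    return Fraction(favorable, 2 ** n)

print(p_at_least(1, 2))   # at least 1 head in 2 tosses
print(p_at_least(3, 5))   # at least 3 heads in 5 tosses
```

The first case can also be checked with the complement rule: 1 - P(no heads) = 1 - 1/4.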

Continuous Random Variables
A continuous random variable X takes all values in an interval.
How do we assign probabilities to events in an infinite sample space? Use density curves and compute probabilities for intervals. The probability of any event is the area under the density curve for the values of X that make up the event.
A single value has probability zero for a continuous random variable. Only intervals can have a non-zero probability, represented by the area under the density curve for that interval.
The shaded area under a density curve shows the proportion of individuals in a population with values of X between $x_1$ and $x_2$. Because the probability of drawing one individual at random depends on the frequency of this type of individual in the population, the probability is also the shaded area under the curve.

Normal Probability Distributions
The probability distribution of many random variables is a normal distribution. It shows what values the random variable can take and is used to assign probabilities to those values.
To calculate probabilities with the normal distribution, we standardize the random variable (z-score) and use Table A.

Example: Normal Probability Distributions
What is the probability, if we pick one woman at random, that her height will be some value X? For instance, between 68 and 70 inches: P(68 < X < 70)? Because the woman is selected at random, X is a random variable.
z-scores: $z(68) = \dfrac{68 - 64.5}{2.5} = 1.4$; $z(70) = \dfrac{70 - 64.5}{2.5} = 2.2$.
The area under the curve for the interval is 0.9861 - 0.9192 = 0.0669. Thus, the probability that a randomly chosen woman falls into this range is 6.69%, or P(68 < X < 70) = 6.69%.
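
The Table A lookup can be cross-checked numerically. This sketch assumes heights follow a normal distribution with mean 64.5 and standard deviation 2.5 inches (the values used in the z-scores above); `normal_cdf` is an illustrative helper built on the standard library's error function.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma), where sigma is the standard deviation."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

p = normal_cdf(70, 64.5, 2.5) - normal_cdf(68, 64.5, 2.5)
print(round(p, 4))
```

Small differences from the table-based answer can arise because Table A rounds both z and the areas.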

Mean of a Random Variable
The mean $\bar{x}$ of a set of observations is their arithmetic average.
The mean $\mu$ of a random variable X (also called the expected value of X) is a weighted average of the possible values of X, reflecting the fact that all outcomes might not be equally likely.
For a discrete random variable X with a given probability distribution,
$\mu_X = x_1 p_1 + x_2 p_2 + \cdots + x_k p_k = \sum_i x_i p_i$.
Example: the probability distribution is
Value of X     0       1       2       3
Probability    0.027   0.189   0.441   0.343
Then the mean $\mu$ of X is
$\mu_X = 0 \times 0.027 + 1 \times 0.189 + 2 \times 0.441 + 3 \times 0.343 = 2.1$.

Law of Large Numbers
The law of large numbers: as the number of randomly drawn observations (n) in a sample increases, the mean of the sample ($\bar{x}$) gets closer and closer to the population mean $\mu$. It is valid for any population.
We often intuitively expect predictability over a few random observations, but this is wrong. Example: if the first toss is a head, will the second one be a tail? The law of large numbers only applies to really large numbers.

Variance of a Random Variable
The variance and the standard deviation are the measures of spread that accompany the choice of the mean to measure center.
The variance $\sigma_X^2$ of a random variable is a weighted average of the squared deviations $(X - \mu_X)^2$ of the variable X from its mean $\mu_X$. Each outcome is weighted by its probability in order to take into account outcomes that are not equally likely.
The larger the variance of X, the more scattered the values of X on average. The positive square root of the variance gives the standard deviation $\sigma$ of X.
For a discrete random variable X, the variance is
$\sigma_X^2 = \sum_i (x_i - \mu_X)^2 p_i$.
Example: the probability distribution is
Value of X     0       1       2       3
Probability    0.027   0.189   0.441   0.343
Then the mean of X is
$\mu_X = 0 \times 0.027 + 1 \times 0.189 + 2 \times 0.441 + 3 \times 0.343 = 2.1$.
The variance of X is
$\sigma_X^2 = 0.027\,(0 - 2.1)^2 + 0.189\,(1 - 2.1)^2 + 0.441\,(2 - 2.1)^2 + 0.343\,(3 - 2.1)^2 = 0.63$.
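
The weighted-average definitions of the mean and variance translate directly into code. The sketch below uses the example distribution with probability 0.343 for X = 3, the value that appears in the variance calculation and makes the probabilities sum to 1; variable names are illustrative.

```python
from math import sqrt

values = [0, 1, 2, 3]
probs = [0.027, 0.189, 0.441, 0.343]   # must add up to 1

# Mean: probability-weighted average of the values.
mean = sum(x * p for x, p in zip(values, probs))

# Variance: probability-weighted average of squared deviations from the mean.
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
std_dev = sqrt(variance)

print(round(mean, 3), round(variance, 3), round(std_dev, 3))
```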

Rules for Means and Variances
If X is a random variable and a and b are fixed numbers, then
$\mu_{a+bX} = a + b\,\mu_X$; $\sigma_{a+bX}^2 = b^2 \sigma_X^2$.
If X and Y are two independent random variables, then
$\mu_{X \pm Y} = \mu_X \pm \mu_Y$; $\sigma_{X \pm Y}^2 = \sigma_X^2 + \sigma_Y^2$.
If X and Y are NOT independent but have correlation r, then
$\mu_{X \pm Y} = \mu_X \pm \mu_Y$; $\sigma_{X \pm Y}^2 = \sigma_X^2 + \sigma_Y^2 \pm 2 r \sigma_X \sigma_Y$.

Example: Investment
You invest 20% of your funds in Treasury bills and 80% in an index fund that represents all U.S. common stocks. Your rate of return over time is a weighted combination of the return on T-bills (X) and on the index fund (Y): R = 0.2X + 0.8Y.
Based on annual returns between 1950 and 2003:
Annual return on T-bills: $\mu_X = 5.0\%$ and $\sigma_X = 2.9\%$.
Annual return on stocks: $\mu_Y = 13.2\%$ and $\sigma_Y = 17.6\%$.
Correlation between X and Y: $r = -0.11$ (the negative sign is what yields the variance below).
Solution:
$\mu_R = 0.2\mu_X + 0.8\mu_Y = 0.2 \times 5 + 0.8 \times 13.2 = 11.56\%$.
$\sigma_R^2 = \sigma_{0.2X}^2 + \sigma_{0.8Y}^2 + 2r\,\sigma_{0.2X}\,\sigma_{0.8Y} = 0.2^2 \sigma_X^2 + 0.8^2 \sigma_Y^2 + 2r\,(0.2\sigma_X)(0.8\sigma_Y) = 196.786$.
$\sigma_R = \sqrt{196.786} = 14.03\%$.
The portfolio has a smaller mean return than an all-stock portfolio, but it is also less risky.
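
The portfolio arithmetic can be verified step by step. This sketch assumes the correlation is r = -0.11, the value that reproduces the variance of 196.786 quoted in the solution; variable names are illustrative.

```python
from math import sqrt

w_x, w_y = 0.2, 0.8            # portfolio weights for T-bills and stocks
mu_x, sd_x = 5.0, 2.9          # T-bill mean and standard deviation, in percent
mu_y, sd_y = 13.2, 17.6        # stock mean and standard deviation, in percent
r = -0.11                      # assumed correlation (negative sign inferred from 196.786)

# Rule for means: mean of a linear combination.
mu_r = w_x * mu_x + w_y * mu_y

# Rule for variances of correlated variables, applied to 0.2X and 0.8Y.
var_r = (w_x * sd_x) ** 2 + (w_y * sd_y) ** 2 + 2 * r * (w_x * sd_x) * (w_y * sd_y)
sd_r = sqrt(var_r)

print(round(mu_r, 2), round(var_r, 3), round(sd_r, 2))
```

With r = +0.11 the cross term would add rather than subtract, giving a variance near 200.4 instead.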

General Addition Rule
Recall the addition rule for disjoint events: $P(A \text{ or } B) = P(A \cup B) = P(A) + P(B)$.
General addition rule for any two events A and B: the probability that A occurs, B occurs, or both events occur is
$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$.

Example: General Addition Rule
Question: what is the probability of randomly drawing either an ace or a heart from a deck of 52 playing cards?
Facts: there are 4 aces in the pack and 13 hearts. However, 1 card is both an ace and a heart.
Solution: $P(\text{ace}) = \frac{4}{52}$; $P(\text{heart}) = \frac{13}{52}$; $P(\text{ace and heart}) = \frac{1}{52}$;
$P(\text{ace or heart}) = P(\text{ace}) + P(\text{heart}) - P(\text{ace and heart}) = \frac{4}{52} + \frac{13}{52} - \frac{1}{52} = \frac{16}{52} = \frac{4}{13}$.

Conditional Probability
Example 1: the probability that a cloudy day will result in rain is different if you live in Los Angeles than if you live in Seattle.
Example 2: what is the probability that you win the jackpot? How about if I tell you that the winner is inside the classroom?
Conditional probabilities reflect how the probability of an event can change if we know that some other event has occurred or is occurring.
Our brains effortlessly calculate conditional probabilities, updating our degree of belief with each new piece of evidence.
The conditional probability of event B given event A is (provided that $P(A) \ne 0$):
$P(B \mid A) = \dfrac{P(A \text{ and } B)}{P(A)}$.

Example: Conditional Probability
The conditional probability of event B given event A is (provided that $P(A) \ne 0$):
$P(B \mid A) = \dfrac{P(A \text{ and } B)}{P(A)}$.
Example: assume a number set includes the whole numbers from 1 to 100. Define event A as "the number is even" and event B as "the number is in the fourth quartile".
Facts (by direct counting): $P(A) = \frac{50}{100} = 0.5$; $P(B) = \frac{25}{100} = 0.25$; $P(B \text{ and } A) = \frac{13}{100}$; $P(B \mid A) = \frac{13}{50}$; $P(A \mid B) = \frac{13}{25}$.
Calculations:
$P(B \mid A) = \dfrac{P(A \text{ and } B)}{P(A)} = \dfrac{13/100}{50/100} = \dfrac{13}{50}$;
$P(A \mid B) = \dfrac{P(B \text{ and } A)}{P(B)} = \dfrac{13/100}{25/100} = \dfrac{13}{25}$.
The formula is verified.
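
The counts in this example can be confirmed with sets. The sketch assumes "fourth quartile" means the top 25 numbers, 76 through 100; the helper `P` is an illustrative name.

```python
from fractions import Fraction

numbers = range(1, 101)
A = {n for n in numbers if n % 2 == 0}   # even numbers
B = {n for n in numbers if n >= 76}      # assumed fourth quartile: 76..100

def P(event):
    """Probability of an event when all 100 numbers are equally likely."""
    return Fraction(len(event), 100)

p_b_given_a = P(A & B) / P(A)
p_a_given_b = P(A & B) / P(B)
print(P(A & B), p_b_given_a, p_a_given_b)
```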

Example: Understanding Conditional Probability
Bertrand's box paradox: There are three boxes, each with one drawer on each of two sides. Each drawer contains a coin. One box has a gold coin on each side (GG), one a silver coin on each side (SS), and the other a gold coin on one side and a silver coin on the other (GS). A box is chosen at random, a random drawer is opened, and a gold coin is found inside it. What is the chance of the coin on the other side being gold?
Reasoning 1: Originally, all three boxes were equally likely to be chosen. The chosen box cannot be box SS. So it must be box GG or GS. The two remaining possibilities are equally likely. So the probability that the box is GG, and the other coin is also gold, is 1/2.
Reasoning 2: Originally, all six coins were equally likely to be chosen. The chosen coin cannot be from drawer S of box GS, or from either drawer of box SS. So it must come from the G drawer of box GS, or either drawer of box GG. The three remaining possibilities are equally likely, so the probability that the drawer is from box GG is 2/3.
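
One way to decide between the two lines of reasoning is to enumerate the six equally likely (box, drawer) choices and condition on seeing gold. The sketch below does exactly that; the data layout and names are illustrative.

```python
from fractions import Fraction

boxes = {"GG": ("G", "G"), "SS": ("S", "S"), "GS": ("G", "S")}

# Six equally likely choices: pick a box, then one of its two drawers.
draws = [(name, drawer) for name in boxes for drawer in range(2)]

# Condition on the event "the opened drawer holds a gold coin".
gold_seen = [(name, drawer) for name, drawer in draws if boxes[name][drawer] == "G"]

# Among those, count the cases where the other drawer is also gold.
other_gold = [(name, drawer) for name, drawer in gold_seen if boxes[name][1 - drawer] == "G"]

p_other_gold = Fraction(len(other_gold), len(gold_seen))
print(p_other_gold)
```

The enumeration agrees with Reasoning 2: after seeing gold, boxes GG and GS are not equally likely, because GG shows gold twice as often as GS.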

General Multiplication Rules
The general multiplication rule: the probability that any two events, A and B, both occur is
$P(A \text{ and } B) = P(A)\,P(B \mid A)$.
Example: what is the probability of randomly drawing two hearts from a deck of 52 playing cards? There are 13 hearts in the pack. Let A and B be the events that the first and second cards drawn are hearts, respectively. Assume that the first card is not replaced before the second card is drawn.
$P(A) = \frac{13}{52} = \frac{1}{4}$; $P(B \mid A) = \frac{12}{51}$.
$P(\text{two hearts}) = P(A)\,P(B \mid A) = \frac{1}{4} \times \frac{12}{51} = \frac{1}{17}$.
Notice that the probability of a heart on the second draw depends on which card was removed on the first draw.
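
The multiplication rule can be compared with a brute-force count over all ordered pairs of distinct cards. A minimal sketch (the deck representation and names are illustrative):

```python
from fractions import Fraction
from itertools import permutations

suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for suit in suits for rank in range(1, 14)]

# All ordered draws of two different cards (no replacement): 52 * 51 pairs.
pairs = list(permutations(deck, 2))
both_hearts = sum(1 for first, second in pairs
                  if first[1] == "hearts" and second[1] == "hearts")
p_enumerated = Fraction(both_hearts, len(pairs))

# General multiplication rule: P(A) * P(B | A).
p_rule = Fraction(13, 52) * Fraction(12, 51)
print(p_enumerated, p_rule)
```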

Independent Events
Two events A and B that both have positive probability are independent if
$P(B \mid A) = P(B)$.
If A and B are independent, then $P(A \text{ and } B) = P(A)\,P(B)$. (A and B are independent when they have no influence on each other's occurrence.)
Example: what is the probability of randomly drawing two hearts from a deck of 52 playing cards if the first card (event A) is replaced (and the cards re-shuffled) before the second card (event B) is drawn?
$P(A) = \frac{1}{4}$; $P(B) = \frac{1}{4}$; $P(B \mid A) = \frac{1}{4}$. Because $P(B) = P(B \mid A)$, the two draws are independent events.
$P(A \text{ and } B) = P(A)\,P(B) = \frac{1}{4} \times \frac{1}{4} = \frac{1}{16}$.

Probability Trees
Conditional probabilities can get complex, and it is often a good strategy to build a probability tree that represents all possible outcomes graphically and assigns conditional probabilities to subsets of events.
Example: tree diagram for chat room habits for three adult age groups:
P(chatting) = 0.136 + 0.099 + 0.017 = 0.252
About 25% of all adult Internet users visit chat rooms.

Example: Probability Trees
If a woman in her 20s gets screened for breast cancer and receives a positive test result, what is the probability that she does have breast cancer?
Solution: the possible outcomes given the positive diagnosis are a positive test and breast cancer, or a positive test but no cancer (false positive). With c = cancer, nc = no cancer, and p = positive test:
$P(c \mid p) = \dfrac{P(c \text{ and } p)}{P(c \text{ and } p) + P(nc \text{ and } p)} = \dfrac{0.0004 \times 0.8}{0.0004 \times 0.8 + 0.9996 \times 0.1} \approx 0.3\%$.
This value is called the positive predictive value, or PV+.

Bayes's Rule
An important application of conditional probabilities is Bayes's rule. It is the foundation of many modern statistical applications beyond the scope of this textbook.
If a sample space is decomposed into k disjoint events $A_1, A_2, \ldots, A_k$, none with zero probability and with $P(A_1) + P(A_2) + \cdots + P(A_k) = 1$, and if C is any other event such that P(C) is not 0 or 1, then
$P(A_i \mid C) = \dfrac{P(C \mid A_i)\,P(A_i)}{P(C \mid A_1)\,P(A_1) + P(C \mid A_2)\,P(A_2) + \cdots + P(C \mid A_k)\,P(A_k)}$.

Example: Bayes's Rule
If a woman in her 20s gets screened for breast cancer and receives a positive test result, what is the probability that she does have breast cancer?
Solution: $A_1$ is cancer (c), $A_2$ is no cancer (nc), and C is a positive test result (p). Use Bayes's rule:
$P(c \mid p) = \dfrac{P(p \mid c)\,P(c)}{P(p \mid c)\,P(c) + P(p \mid nc)\,P(nc)} = \dfrac{0.8 \times 0.0004}{0.8 \times 0.0004 + 0.1 \times 0.9996} \approx 0.3\%$.
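
Bayes's rule for a two-way split (cancer versus no cancer) is short enough to compute directly. In this sketch the inputs are read from the calculation above: prevalence 0.0004, P(positive | cancer) = 0.8, and P(positive | no cancer) = 0.1; the function and parameter names are illustrative.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """P(disease | positive test) via Bayes's rule with two disjoint events."""
    true_pos = p_pos_given_disease * prior
    false_pos = p_pos_given_healthy * (1.0 - prior)
    return true_pos / (true_pos + false_pos)

ppv = posterior(prior=0.0004, p_pos_given_disease=0.8, p_pos_given_healthy=0.1)
print(f"{ppv:.2%}")
```

The result is small because false positives from the very large healthy group far outnumber true positives from the very small group with cancer.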

Objectives
Sampling distribution of a sample mean: The mean and standard deviation of $\bar{x}$; For normally distributed populations; The central limit theorem; Weibull distributions
Sampling distributions for counts and proportions: Binomial distributions for sample counts; Binomial distributions in statistical sampling; Binomial mean and standard deviation; Sample proportions; Normal approximation; Binomial formulas

Reminder: The Two Types of Data
Quantitative: something that can be counted or measured and then averaged across individuals in the population (e.g., your height, your age, your IQ score).
Categorical: something that falls into one of several categories. What can be counted is the proportion of individuals in each category (e.g., your gender, your hair color, your blood type: A, B, AB, O).
How do you figure it out? Ask:
What are the n individuals/units in the sample (of size n)?
What is being recorded about those n individuals/units?
Is that a number (quantitative) or a statement (categorical)?

Sampling Distribution of the Sample Mean
We take many random samples of a given size n from a population with mean $\mu$ and standard deviation $\sigma$. Some sample means will be above the population mean $\mu$ and some will be below, making up the sampling distribution.

Sampling Distribution
The sampling distribution of a statistic (e.g., the mean) is the distribution of all possible values taken by the statistic when all possible samples (how many?) of a fixed size n (e.g., 20) are taken from the population (e.g., USM, 15000).
It is a theoretical idea; we do not actually build it. Why? The number of all different samples is
$\binom{15000}{20} = \dfrac{15000!}{14980!\,20!} \approx 1.35 \times 10^{65}$.
The sampling distribution of a statistic is the probability distribution of that statistic.
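
The count of possible samples is a binomial coefficient, which Python's standard library evaluates exactly. A one-step check (the variable name is illustrative):

```python
from math import comb

# Number of distinct samples of size 20 from a population of 15000.
n_samples = comb(15000, 20)
print(f"{n_samples:.3e}")
```

A number of this size makes it clear why the full sampling distribution is studied through theory or simulation rather than enumerated.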

Center of the Sampling Distribution of the Sample Mean
For any population with mean $\mu$ and standard deviation $\sigma$:
Mean of the sampling distribution of $\bar{x}$: the mean, or center, of the sampling distribution of $\bar{x}$ is equal to the population mean $\mu$: $\mu_{\bar{x}} = \mu$.
There is no tendency for a sample mean to fall systematically above or below $\mu$, even if the distribution of the raw data is skewed. Thus, the sample mean $\bar{x}$ is an unbiased estimate of the population mean $\mu$: it will be correct on average over many samples.
Example: Assume student scores in all USM classes follow a distribution (Normal or not) with a mean of 80 and a standard deviation of 10. Treating each student's scores as a sample of size 4, the mean of the mean scores of all students is approximately ???

Spread of the Sampling Distribution of the Sample Mean
For any population with mean $\mu$ and standard deviation $\sigma$:
Standard deviation of the sampling distribution of $\bar{x}$: the standard deviation of the sampling distribution is $\sigma_{\bar{x}} = \sigma / \sqrt{n}$, where n is the sample size.
The standard deviation of the sampling distribution measures how much the sample statistic varies from sample to sample. Meaning: averages of samples are less variable than individual observations.
Example: Assume student scores in all USM classes follow a distribution (Normal or not) with a mean of 80 and a standard deviation of 10. Treating each student's scores as a sample of size 4, the standard deviation of the mean scores of all students is approximately ???
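
A simulation can suggest answers to the two questions about center and spread. This is a sketch under an added assumption: scores are drawn from a Normal population with mean 80 and standard deviation 10 (the slides allow any shape); the function name, number of repetitions, and seed are illustrative.

```python
import random
from math import sqrt

def simulate_sample_means(mu=80.0, sigma=10.0, n=4, reps=20_000, seed=1):
    """Draw `reps` samples of size n and return the list of sample means."""
    rng = random.Random(seed)
    return [sum(rng.gauss(mu, sigma) for _ in range(n)) / n for _ in range(reps)]

xbars = simulate_sample_means()
center = sum(xbars) / len(xbars)
spread = sqrt(sum((x - center) ** 2 for x in xbars) / len(xbars))

print("mean of sample means:", round(center, 2))
print("sd of sample means  :", round(spread, 2))
print("theory sigma/sqrt(n):", 10.0 / sqrt(4))
```

The simulated values should land close to the theoretical center (the population mean) and spread (sigma divided by the square root of n), up to simulation noise.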

For Normally Distributed Populations
When a variable in a population is normally distributed, the sampling distribution of $\bar{x}$ for all possible samples of size $n$ is also normally distributed.

Example
Hypokalemia is diagnosed when blood potassium levels are below 3.5 mEq/dl. Assume that we know a patient whose measured potassium levels vary daily according to a normal distribution N(3.8, 0.2).
If only one measurement is made, what is the probability that this patient will be misdiagnosed with hypokalemia?
$z = \frac{x - \mu}{\sigma} = \frac{3.5 - 3.8}{0.2} = -1.5$; $P(z < -1.5) \approx 7\%$.
Instead, if measurements are taken on 4 separate days, what is the probability of a misdiagnosis if the average is used?
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{3.5 - 3.8}{0.2/\sqrt{4}} = -3.0$; $P(z < -3.0) \approx 0.1\%$.
Note: Make sure to standardize ($z$) using the standard deviation of the sampling distribution.
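The two probabilities in this example can be reproduced without a Normal table. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 3.8, 0.2   # patient's daily potassium distribution N(3.8, 0.2)
cutoff = 3.5           # diagnosis threshold

# One measurement: use the population standard deviation
p_single = NormalDist(mu, sigma).cdf(cutoff)

# Average of n = 4 measurements: use sigma / sqrt(n)
n = 4
p_mean = NormalDist(mu, sigma / n ** 0.5).cdf(cutoff)

print(f"P(misdiagnosis, 1 measurement):   {p_single:.4f}")
print(f"P(misdiagnosis, mean of 4 days):  {p_mean:.4f}")
```

Averaging four measurements halves the standard deviation, which moves the cutoff from 1.5 to 3 standard deviations below the mean.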

The Central Limit Theorem
Central Limit Theorem: When randomly sampling from any population with mean $\mu$ and standard deviation $\sigma$, when $n$ is large enough, the sampling distribution of $\bar{x}$ is approximately normal: $N(\mu, \sigma/\sqrt{n})$.

Linear Combination of Independent Normal Random Variables
Any linear combination of independent normal random variables is also normally distributed.
Example: assume $X$ follows N(20, 4) and $Y$ follows N(18, 8). Then the difference $X - Y$ is also Normally distributed.
Its mean is $\mu_{X-Y} = \mu_X - \mu_Y = 20 - 18 = 2$.
Its variance is $\sigma^2_{X-Y} = \sigma^2_X + \sigma^2_Y = 4^2 + 8^2 = 80$.
Therefore the difference $X - Y$ follows N(2, 8.94).
The probability that $X < Y$ is $P(X < Y) = P(X - Y < 0)$, and with $z = \frac{0 - 2}{8.94} = -0.22$, $P(X < Y) = P(z < -0.22) = 0.4129$.
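The same calculation can be sketched with `statistics.NormalDist`, which supports subtraction of independent Normal variables. The result differs slightly from the table-based value because the slide rounds $z$ to two decimals before the lookup.

```python
from statistics import NormalDist

X = NormalDist(20, 4)   # N(mean, standard deviation)
Y = NormalDist(18, 8)

# Difference of two independent Normal variables:
# means subtract, variances add
D = X - Y

p_x_less_than_y = D.cdf(0)   # P(X - Y < 0)

print(f"mean = {D.mean:.2f}, sd = {D.stdev:.2f}")
print(f"P(X < Y) = {p_x_less_than_y:.4f}")
```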

Further Properties
More generally, the central limit theorem is valid as long as we are sampling many small random events, even if the events have different distributions (as long as no one random event dominates the others).
It explains why the normal distribution is so common.
Example: Height seems to be determined by a large number of genetic and environmental factors, like nutrition. The individuals are genes and environmental factors. Your height is a mean.

Weibull Distributions
Weibull distributions are used to model time to failure/product lifetime and are common in engineering to study product reliability. Product lifetimes can be measured in units of time, distance, or number of cycles, for example.
Some applications include:
Quality control (breaking strength of products and parts, food shelf life)
Maintenance planning (scheduled car revision, airplane maintenance)
Cost analysis and control (number of returns under warranty, delivery time)
Research (materials properties, microbial resistance to treatment)

Examples of Weibull Distributions
Density curves of three members of the Weibull family, each describing a different type of product time to failure in manufacturing.
Infant mortality: Many products fail immediately and the remainder last a long time. Manufacturers only ship the products after inspection.
Early failure: Products usually fail shortly after they are sold. The design or production must be fixed.
Old-age wear out: Most products wear out over time, and many fail at about the same age. This should be disclosed to customers.

Binomial Distributions
Binomial distributions are models for some categorical variables, typically representing the number of successes in a series of $n$ trials.
The observations must meet these requirements:
The total number of observations $n$ is fixed in advance.
Each observation falls into 1 of 2 categories: success and failure.
The outcomes of all $n$ observations are statistically independent.
All $n$ observations have the same probability of success, $p$.
Example: I have 10 coins to toss at the same time. What is the distribution of the number $X$ of heads? Here $n = 10$; the two categories are head (success) and tail (failure); the outcome of each toss is independent; $p = 0.5$.

Binomial Distributions for Sample Counts
Express a binomial distribution for the count $X$ of successes among $n$ observations as a function of the parameters $n$ and $p$: $B(n, p)$, where $p$ is the probability of success on each observation.
Example 1: coin tossing. The binomial distribution for the count $X$ of heads is $B(10, \frac{1}{2})$.
Example 2: record the next 50 births at a local hospital. Each newborn is either a boy or a girl; each baby is either born on a Sunday or not. What is the binomial distribution for the count of boys? What is the binomial distribution for the count of Sunday births?

Binomial Distribution in Statistical Sampling
Choosing a simple random sample (SRS) from any population is not quite a binomial setting. However, when the population is large, removing a few items has a very small effect on the composition of the remaining population: successive observations are very nearly independent.
A population contains a proportion $p$ of successes. If the population is much larger than the sample, the count $X$ of successes in an SRS of size $n$ has approximately the binomial distribution $B(n, p)$.
The $n$ observations will be nearly independent when the size of the population is much larger than the size of the sample. As a rule of thumb, the binomial sampling distribution for counts can be used when the population is at least 20 times as large as the sample.

Calculations for Binomial Probabilities
The binomial probability $P(X = k)$ is the binomial coefficient multiplied by the probability of any specific arrangement of the $k$ successes:
$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k} = \frac{n!}{k!\,(n-k)!}\, p^k (1 - p)^{n-k}$.
The probability that a binomial random variable takes any range of values is the sum of the probabilities of getting exactly each number of successes in that range in $n$ observations.
Example: $P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2)$.

Example: Color Blindness
The frequency of color blindness (dyschromatopsia) in the Caucasian American male population is estimated to be about 8%. We take a random sample of size 25 from this population.
The population is definitely larger than 20 times the sample size, thus we can approximate the sampling distribution by $B(n = 25, p = 0.08)$.
What is the probability that exactly five will be color blind?
$P(X = 5) = \binom{n}{k} p^k (1 - p)^{n-k} = \frac{25!}{5!\,20!}\, 0.08^5\, (1 - 0.08)^{20} = 0.0329$.
What is the probability that five individuals or fewer in the sample are color blind?
What is the probability that more than five will be color blind?
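The binomial formula translates directly into code, which also gives a way to work out the two questions left open above. This is a minimal sketch with the standard library; `binom_pmf` is a helper name introduced here for illustration.

```python
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ B(n, p): binomial coefficient times p^k (1-p)^(n-k)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 25, 0.08

p_exactly_5 = binom_pmf(5, n, p)
# P(X <= 5) is the sum of P(X = 0), ..., P(X = 5)
p_at_most_5 = sum(binom_pmf(k, n, p) for k in range(6))
# P(X > 5) is the complement
p_more_than_5 = 1 - p_at_most_5

print(f"P(X = 5)  = {p_exactly_5:.4f}")
print(f"P(X <= 5) = {p_at_most_5:.4f}")
print(f"P(X > 5)  = {p_more_than_5:.4f}")
```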

Binomial Mean and Standard Deviation
The center and spread of the binomial distribution for a count $X$ are defined by the mean $\mu$ and standard deviation $\sigma$:
$\mu = np$; $\sigma = \sqrt{np(1 - p)} = \sqrt{npq}$, where $q = 1 - p$.
Example: the effect of changing $p$ when $n$ is fixed at $n = 10$: $p = 0.25$, $p = 0.5$, and $p = 0.75$.
For small samples, binomial distributions are skewed when $p$ is different from 0.5.

Example: Binomial Mean and Standard Deviation
The mean and standard deviation of the count of color blind individuals in the SRS of 25 Caucasian American males:
$\mu = np = 25 \times 0.08 = 2$; $\sigma = \sqrt{np(1 - p)} = \sqrt{25 \times 0.08 \times 0.92} = 1.36$.
When the sample size is 10, $\mu = 0.8$; $\sigma = 0.86$.
When the sample size is 75, $\mu = 6$; $\sigma = \sqrt{75 \times 0.08 \times 0.92} = 2.35$.
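The mean and standard deviation for the three sample sizes can be recomputed with a short helper. This is a sketch only; `binom_mean_sd` is a name introduced here for illustration.

```python
import math

def binom_mean_sd(n: int, p: float) -> tuple[float, float]:
    """Mean np and standard deviation sqrt(np(1-p)) of a B(n, p) count."""
    return n * p, math.sqrt(n * p * (1 - p))

p = 0.08  # frequency of color blindness used in the example
for n in (10, 25, 75):
    mean, sd = binom_mean_sd(n, p)
    print(f"n = {n:2d}: mean = {mean:.2f}, sd = {sd:.2f}")
```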

Sample Proportions
The proportion of successes can be more informative than the count. In statistical sampling the sample proportion of successes, $\hat{p}$, is used to estimate the proportion $p$ of successes in a population.
For any SRS of size $n$, the sample proportion of successes is
$\hat{p} = \frac{\text{count of successes in the sample}}{n} = \frac{X}{n}$.
If the sample size is much smaller than the size of a population with proportion $p$ of successes, then the mean and standard deviation of $\hat{p}$ are
$\mu_{\hat{p}} = p$; $\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$.
The sample proportion in an SRS is an unbiased estimator of the population proportion $p$.
The variability decreases as the sample size increases, so larger samples usually give closer estimates of the population proportion $p$.

Normal Approximation
If $n$ is large and $p$ is not too close to 0 or 1, the binomial distribution can be approximated by the normal distribution $N(\mu = np, \sigma^2 = np(1 - p))$. Practically, the Normal approximation can be used when both $np \ge 10$ and $n(1 - p) \ge 10$.
If $X$ is the count of successes in the sample and $\hat{p} = X/n$ is the sample proportion of successes, their sampling distributions for large $n$ are:
$X$ is approximately $N(\mu = np, \sigma^2 = np(1 - p))$.
$\hat{p}$ is approximately $N(\mu = p, \sigma^2 = \frac{p(1 - p)}{n})$.

Normal Approximation Cont'd
The sampling distribution of $\hat{p}$ is never exactly normal. But as the sample size increases, the sampling distribution of $\hat{p}$ becomes approximately normal.
The normal approximation is most accurate for any fixed $n$ when $p$ is close to 0.5, and least accurate when $p$ is near 0 or near 1.

Normal Approximation: Continuity Correction
Assume the frequency of color blindness in the Caucasian American male population is about 8%. Take a random sample of size 125 from this population. What is the probability that six individuals or fewer in the sample are color blind?
The sampling distribution of the count $X$ is binomial, $B(n = 125, p = 0.08)$, so $P(X \le 6) = 0.1198$.
Normal approximation for the count $X$: $N(np, \sqrt{np(1 - p)})$, or $N(10, 3.033)$, so
$z = \frac{x - \mu}{\sigma} = \frac{6 - 10}{3.033} = -1.32 \Rightarrow P(X \le 6) \approx 0.0934$.

Normal Approximation: Continuity Correction Cont'd
A binomial random variable is a discrete variable that can only take whole numerical values. In contrast, a normal random variable is a continuous variable that can take any numerical value.
The normal distribution is a better approximation of the binomial distribution with a continuity correction: $x' = x + 0.5$ is substituted for $x$, and $P(X \le x)$ is replaced by $P(X \le x + 0.5)$.
In this example, $P(X \le 6.5) = 0.1243$, which approximates the binomial probability better than the Normal distribution without continuity correction does.
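The three numbers in this example (exact binomial, plain Normal approximation, and Normal approximation with continuity correction) can be compared side by side. A minimal sketch using only the standard library; the variable names are illustrative.

```python
import math
from statistics import NormalDist

n, p, x = 125, 0.08, 6

# Exact binomial probability P(X <= 6)
exact = sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x + 1))

# Normal approximation N(np, sqrt(np(1-p)))
approx = NormalDist(n * p, math.sqrt(n * p * (1 - p)))
plain = approx.cdf(x)            # no continuity correction
corrected = approx.cdf(x + 0.5)  # continuity correction: x + 0.5

print(f"exact binomial:        {exact:.4f}")
print(f"normal, uncorrected:   {plain:.4f}")
print(f"normal, corrected:     {corrected:.4f}")
```

The corrected value lands noticeably closer to the exact binomial probability than the uncorrected one.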

Objectives
Estimating with confidence: Statistical confidence; Confidence intervals; Confidence interval for a population mean; How confidence intervals behave; Choosing the sample size
Tests of significance: The reasoning of significance tests; Stating hypotheses; The P-value; Statistical significance; Tests for a population mean; Confidence intervals to test hypotheses
Use and abuse of tests; Power and inference as a decision: Cautions about significance tests; Power of a test; Type I and II errors; Error probabilities

Overview of Inference
Methods for drawing conclusions about a population from sample data are called statistical inference.
Methods:
Confidence intervals: estimating a value of a population parameter. A range of values with an associated confidence level C.
Tests of significance: assessing evidence for a claim about a population. Significance level: the largest P-value tolerated for rejecting a true null hypothesis.
Inference is appropriate when data are produced by either a random sample or a randomized experiment.

Statistical Confidence
Although the sample mean $\bar{x}$ is a unique number for any particular sample, if you pick a different sample you will probably get a different sample mean.
In fact, there are many different values for the sample mean, and virtually none of them would actually equal the true population mean $\mu$.
But the sampling distribution is narrower than the population distribution, by a factor of $\sqrt{n}$. Thus the estimates gained from our samples are always relatively close to the population parameter $\mu$.

Discussion: How to Obtain a Population Parameter from Sample Statistics
Number of samples: Do we need multiple samples? How about only one sample?
Type of approximation:
Point estimates are the single, most likely value of a parameter. For example, the point estimate of the population mean (the parameter) is the sample mean (the parameter estimate).
Confidence intervals are a range of values likely to contain the population parameter.

The Essence of Statistical Inference
If we know the population parameter $\mu$: 95% of all sample means will be within roughly 2 standard deviations ($2\sigma/\sqrt{n}$) of the population parameter $\mu$.
If we DO NOT know the population parameter $\mu$: distances are symmetrical, which implies that the population parameter $\mu$ must be within roughly 2 standard deviations of the sample average $\bar{x}$ in 95% of all samples.

Example: Statistical Confidence
The weight of single eggs of the brown variety is normally distributed N(65 g, 5 g). Think of a carton of 12 brown eggs as an SRS of size 12.
The distribution of the sample means $\bar{x}$ is $N(\mu, \sigma/\sqrt{n}) = N(65\,\text{g}, 1.44\,\text{g})$.
The middle 95% of the sample means distribution is roughly within $\pm 2\sigma/\sqrt{n}$ of the mean, or 65 g ± 2.88 g.
You buy a carton of 12 white eggs instead. The box weighs 770 g. The average egg weight from that SRS is thus $\bar{x} = 64.2$ g.
Knowing that the standard deviation of egg weight is 5 g, what can you infer about the mean $\mu$ of the white egg population?
We are 95% confident that the population mean $\mu$ is within 64.2 g ± 2.88 g, that is, roughly within $\pm 2\sigma/\sqrt{n}$ of $\bar{x}$.
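The arithmetic behind the egg-carton interval is sketched below, using the "roughly 2 standard deviations" rule from the slide rather than an exact critical value. The variable names are illustrative.

```python
import math

sigma = 5.0            # known standard deviation of egg weight, in grams
n = 12                 # eggs per carton (treated as an SRS)
carton_weight = 770.0  # total weight of the white-egg carton, in grams

x_bar = carton_weight / n                # sample mean
std_error = sigma / math.sqrt(n)         # sd of the sampling distribution
margin = 2 * std_error                   # "roughly 2 standard deviations"
interval = (x_bar - margin, x_bar + margin)

print(f"x_bar = {x_bar:.1f} g, sigma/sqrt(n) = {std_error:.2f} g")
print(f"approx. 95% interval: {interval[0]:.1f} g to {interval[1]:.1f} g")
```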

Confidence Intervals
The confidence interval is a range of values with an associated confidence level C.
Confidence intervals for means are intervals constructed using a procedure that will contain the population mean a specified proportion (C) of the time.
$\bar{x} \pm 4.2$ is a 95% confidence interval for the population parameter $\mu$. This expression says that in 95% of the cases, the actual value of $\mu$ will be within 4.2 units of the value of $\bar{x}$.

Implications
We don't need to take a lot of random samples to rebuild the sampling distribution and find $\mu$ at its center.
All we need is one SRS of size $n$; we then rely on the properties of the sample means distribution to infer the population mean $\mu$.

Confidence Intervals Cont'd
With 95% confidence, we can say that $\mu$ should be within roughly 2 standard deviations ($2\sigma/\sqrt{n}$) of our sample mean $\bar{x}$.
In 95% of all possible samples of this size $n$, $\mu$ will indeed fall in our confidence interval.
In only 5% of samples would $\bar{x}$ be farther from $\mu$.

Understanding Confidence Intervals
Can we interpret a 95% confidence interval as an interval with a 0.95 probability of containing the population mean?
Question: Strictly speaking, what is the best interpretation of a 95% confidence interval for the mean?
If repeated samples were taken and the 95% confidence interval was computed for each sample, 95% of the intervals would contain the population mean.
A 95% confidence interval has a 0.95 probability of containing the population mean.
95% of the population distribution is contained in the confidence interval.

Confidence Intervals Cont'd
A confidence interval can be expressed as:
Mean ± m, where m is called the margin of error, i.e., $\mu$ is within $\bar{x} \pm m$. Example: 120 ± 6.
Two endpoints of an interval: $\mu$ is within $(\bar{x} - m)$ to $(\bar{x} + m)$. Example: 114 to 126.
A confidence level C (in %) indicates that if we were to repeat the whole experiment N times, under the same conditions, we would have N different confidence intervals. The confidence level is the proportion of these intervals which contain the true mean of the population. It represents the area under the normal curve within ±m of the center of the curve.

Review: Standardizing the Normal Curve Using z
$\sigma$ is the standard deviation of the original population. Here, we work with the sampling distribution, and $\sigma/\sqrt{n}$ is its standard deviation (spread).

Varying Confidence Levels
Confidence intervals contain the population mean $\mu$ in C% of samples. Different areas under the curve give different confidence levels C.
Practical use of $z$: $z^*$
$z^*$ is related to the chosen confidence level C.
C is the area under the standard normal curve between $-z^*$ and $z^*$.
The margin of error and confidence interval are thus $z^*\sigma/\sqrt{n}$ and $\bar{x} \pm z^*\sigma/\sqrt{n}$.
Example: For an 80% confidence level C, 80% of the normal curve's area is contained in the interval. Use Table D to find specific $z^*$ values.

Link Between Confidence Level and Margin of Error
The confidence level C determines the value of $z^*$. The margin of error also depends on $z^*$.
Higher confidence C implies a larger margin of error m (thus less precision in our estimates).
A lower confidence level C produces a smaller margin of error m (thus better precision in our estimates).

Example: Different Confidence Intervals for the Same Set of Measurements
Density of bacteria in solution: the measurement equipment has standard deviation $\sigma = 1 \times 10^6$ bacteria/ml fluid. Three measurements: 24, 29, and 31 $\times 10^6$ bacteria/ml fluid. The mean is $28 \times 10^6$ bacteria/ml. Find the 96% and 70% CI.
96% confidence interval for the true density: $z^* = 2.054$, and
$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} = 28 \pm 2.054 \times \frac{1}{\sqrt{3}} = (28 \pm 1.19) \times 10^6$ bacteria/ml.
70% confidence interval for the true density: $z^* = 1.036$, and
$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} = 28 \pm 1.036 \times \frac{1}{\sqrt{3}} = (28 \pm 0.60) \times 10^6$ bacteria/ml.
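Instead of a table lookup, the critical value $z^*$ can be obtained from the inverse Normal CDF. The sketch below reproduces the two intervals (in units of $10^6$ bacteria/ml); `z_star` and `z_interval` are helper names introduced here for illustration.

```python
import math
from statistics import NormalDist, fmean

def z_star(confidence: float) -> float:
    """Critical value with central area `confidence` under the standard normal curve."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

def z_interval(data, sigma: float, confidence: float) -> tuple[float, float]:
    """Return (sample mean, margin of error) for a known population sigma."""
    margin = z_star(confidence) * sigma / math.sqrt(len(data))
    return fmean(data), margin

measurements = [24, 29, 31]  # in units of 10^6 bacteria/ml
sigma = 1.0                  # known equipment sd, same units

for level in (0.96, 0.70):
    x_bar, m = z_interval(measurements, sigma, level)
    print(f"{level:.0%} CI: {x_bar:.0f} +/- {m:.2f} (z* = {z_star(level):.3f})")
```

Lowering the confidence level from 96% to 70% roughly halves the margin of error for the same three measurements.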

Properties of Confidence Intervals
The user chooses the confidence level; the margin of error follows from this choice.
The margin of error, $z^*\sigma/\sqrt{n}$, gets smaller when $z^*$ (and thus the confidence level C) gets smaller, $\sigma$ is smaller, or $n$ is larger.
The spread in the sampling distribution of the mean is a function of the number of individuals per sample. The larger the sample size, the smaller the standard deviation (spread) of the sample mean distribution. But the spread only decreases at a rate equal to $\sqrt{n}$.

Sample Size and Experimental Design
You may need a certain margin of error (e.g., drug trial, manufacturing specs). In many cases, the population variability ($\sigma$) is fixed, but we can choose the number of measurements ($n$). So plan ahead what sample size to use to achieve that margin of error:
$m = z^* \frac{\sigma}{\sqrt{n}} \iff n = \left(\frac{z^* \sigma}{m}\right)^2$.
Example: The measurement equipment has standard deviation $\sigma = 1 \times 10^6$ bacteria/ml fluid. How many measurements should you make to obtain a margin of error of at most $0.5 \times 10^6$ bacteria/ml with a confidence level of 95%?
For a 95% confidence interval, $z^* = 1.96$:
$n = \left(\frac{z^* \sigma}{m}\right)^2 = \left(\frac{1.96 \times 1}{0.5}\right)^2 = 15.3664$.
Therefore, we need at least 16 measurements.
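The sample-size formula can be wrapped in a small function that rounds up to the next whole measurement. This is a sketch under the slide's assumptions (known $\sigma$, Normal sampling distribution); `required_sample_size` is a name introduced here for illustration.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma: float, margin: float, confidence: float) -> int:
    """Smallest n with z* * sigma / sqrt(n) <= margin."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)

# Units of 10^6 bacteria/ml: sigma = 1, desired margin of error = 0.5
n_needed = required_sample_size(sigma=1.0, margin=0.5, confidence=0.95)
print(n_needed)
```

Always round up: rounding 15.37 down to 15 would give a margin of error slightly larger than the target.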

Interpretation of Confidence Intervals
Conditions under which an inference method is valid are never fully met in practice. Exploratory data analysis and judgment should be used when deciding whether or not to use a statistical procedure.
Any individual confidence interval either will or will not contain the true population mean. It is wrong to say that the probability is 95% that the true mean falls in the confidence interval.
The correct interpretation of a 95% confidence interval is that we are 95% confident that the true mean falls within the interval. The confidence interval was calculated by a method that gives correct results in 95% of all possible samples.
In other words, if many such confidence intervals were constructed, 95% of these intervals would contain the true mean.

Reasoning of Significance Tests
We have seen that the properties of the sampling distribution of $\bar{x}$ help us estimate a range of likely values for the population mean $\mu$. We can also rely on the properties of the sampling distribution to test hypotheses.
Example: You are in charge of quality control in your food company. You randomly sample four packs of cherry tomatoes, each labeled 0.5 lb. (227 g). The average weight from your four packs is 222 g. Obviously, we cannot expect packs filled with whole tomatoes to all weigh exactly half a pound. Thus:
Is the somewhat smaller weight simply due to chance variation?
Is it evidence that the calibrating machine that sorts cherry tomatoes into packs needs revision?

Hypotheses
A test of statistical significance tests a specific hypothesis using sample data to decide on the validity of the hypothesis.
In statistics, a hypothesis is an assumption or a theory about the characteristics of one or more variables in one or more populations.
In the example above:
What you want to know: does the calibrating machine that sorts cherry tomatoes into packs need revision?
The same question reframed statistically: is the population mean $\mu$ for the distribution of weights of cherry tomato packages equal to 227 g (i.e., half a pound)?
Another example: Assume the USM average GPA is 2.5. We found that the average GPA of our class is 3.9. Is this normal? In other words, having found that our mean GPA is 3.9, can we still conclude that the USM mean GPA is 2.5?

Stating Hypotheses
The null hypothesis is a very specific statement about a parameter of the population(s). It is labeled $H_0$.
The alternative hypothesis is a more general statement about a parameter of the population(s) that is exclusive of the null hypothesis. It is labeled $H_a$.
Example: weight of cherry tomato packs:
$H_0: \mu = 227$ g ($\mu$ is the average weight of the population of packs)
$H_a: \mu \ne 227$ g ($\mu$ is either larger or smaller).
Example: USM GPA:
$H_0: \mu = 2.5$ ($\mu$ is the mean GPA of the USM population)
$H_a: \mu > 2.5$ ($\mu$ is larger).

One-Sided and Two-Sided Tests
A two-tail or two-sided test of the population mean has these null and alternative hypotheses:
$H_0: \mu =$ [a specific number]; $H_a: \mu \ne$ [a specific number]
A one-tail or one-sided test of a population mean has these null and alternative hypotheses:
$H_0: \mu =$ [a specific number]; $H_a: \mu <$ [a specific number]
or
$H_0: \mu =$ [a specific number]; $H_a: \mu >$ [a specific number]
Example: The FDA tests whether a generic drug has an absorption extent similar to the known absorption extent of the brand-name drug it is copying. Higher or lower absorption would both be problematic, thus we test:
$H_0: \mu_{\text{generic}} = \mu_{\text{brand}}$; $H_a: \mu_{\text{generic}} \ne \mu_{\text{brand}}$.

The P-Value
Example: The packaging process has a known standard deviation $\sigma = 5$ g. The null and alternative hypotheses are: $H_0: \mu = 227$ g; $H_a: \mu \ne 227$ g. The average weight from your four random packs is 222 g. What is the probability of drawing a random sample such as yours if $H_0$ is true (which means the mean of the population is indeed 227 g)?
Tests of statistical significance quantify the chance of obtaining a particular random sample result if the null hypothesis were true. This quantity is the P-value.
This is a way of assessing the believability of the null hypothesis, given the evidence provided by a random sample.
Example: The odds of winning the Powerball grand prize are 1 in 292,201,338. A random person (you) won the jackpot. What is $H_0$? What is $H_a$? What is the P-value?

The P-Value: Rejecting the Null Hypothesis
We know that the null hypothesis and the sample statistic do not match all the time. You ask: could random variation alone account for the difference between the null hypothesis and observations from a random sample?
A small P-value implies that random variation due to the sampling process alone is not likely to account for the observed difference.
With a small P-value we reject $H_0$, which means that the true property of the population is significantly different from what was stated in $H_0$.
Thus, small P-values are strong evidence AGAINST $H_0$. In the Powerball example, what is your conclusion?

Significant P-Value
When the shaded area is small, the probability of drawing such a sample at random is very slim. Oftentimes, a P-value of 0.05 or less is considered significant: the phenomenon observed is unlikely to be entirely due to chance from the random sampling.

Testing of the Null Hypothesis
To test the hypothesis $H_0: \mu = \mu_0$ based on an SRS of size $n$ from a Normal population with unknown mean $\mu$ and known standard deviation $\sigma$, we rely on the properties of the sampling distribution $N(\mu, \sigma/\sqrt{n})$.
The P-value is the area under the sampling distribution for values at least as extreme, in the direction of $H_a$, as that of our random sample.
z-score: $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
The P-value of a one-sided test for $H_a: \mu > \mu_0$ is $P(Z \ge z)$, and for $H_a: \mu < \mu_0$ it is $P(Z \le z)$.
The P-value of a two-sided test for $H_a: \mu \ne \mu_0$ is $2P(Z \ge |z|)$.

Example: Does the Packaging Machine Need Revision?
$H_0: \mu = 227$ g versus $H_a: \mu \ne 227$ g.
What is the probability of drawing a random sample such as yours if $H_0$ is true?
$\bar{x} = 222$ g; $\sigma = 5$ g; $n = 4$, then $z = \frac{222 - 227}{5/\sqrt{4}} = -2$.
P-value of the two-sided test $= 2 \times P(Z \ge 2) = 2 \times 0.0228 = 4.56\%$.
The probability of getting a random sample average so different from $\mu$ is so low that we reject $H_0$. The machine does need recalibration.
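The z test for a mean with known $\sigma$ can be sketched as a small function covering the one-sided and two-sided cases. The function name `z_test` and its `alternative` argument are introduced here for illustration; the two-sided P-value comes out marginally below the slide's 4.56% because no table rounding is involved.

```python
import math
from statistics import NormalDist

def z_test(x_bar: float, mu0: float, sigma: float, n: int,
           alternative: str = "two-sided") -> tuple[float, float]:
    """Return (z, P-value) for H0: mu = mu0 with known sigma."""
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    std_normal = NormalDist()
    if alternative == "greater":      # Ha: mu > mu0
        p = 1 - std_normal.cdf(z)
    elif alternative == "less":       # Ha: mu < mu0
        p = std_normal.cdf(z)
    else:                             # Ha: mu != mu0
        p = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p

z, p_value = z_test(x_bar=222, mu0=227, sigma=5, n=4)
print(f"z = {z:.2f}, two-sided P-value = {p_value:.4f}")
```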

Summary: Steps for Tests of Significance
State the null hypothesis $H_0$ and the alternative hypothesis $H_a$. $H_0$ represents a theory that has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved. The alternative hypothesis, $H_a$, is a statement of what a statistical hypothesis test is set up to establish.
Calculate the value of the test statistic.
Determine the P-value for the observed data.
State a conclusion. We either "reject $H_0$ in favor of $H_a$" or "do not reject $H_0$". We never conclude "reject $H_a$", or even "accept $H_a$".

Understanding the P-Value
Reference: https://en.wikipedia.org/wiki/p-value
The p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. It is not connected to either.
The p-value is not the probability that a finding is merely a fluke.
The p-value is not the probability of falsely rejecting the null hypothesis.
The p-value is not the probability that replicating the experiment would yield the same conclusion.
The significance level, such as 0.05, is not determined by the p-value.
The p-value does not indicate the size or importance of the observed effect.
The concept of p-value is far from perfect.

The Significance Level: $\alpha$
The significance level, $\alpha$, is the largest P-value tolerated for rejecting a true null hypothesis (how much evidence against $H_0$ we require). This value is decided arbitrarily before conducting the test.
If the P-value is equal to or less than $\alpha$ ($P \le \alpha$), then we reject $H_0$.
If the P-value is greater than $\alpha$ ($P > \alpha$), then we fail to reject $H_0$.
Example: Does the packaging machine need revision? Answer: two-sided test; the P-value is 4.56%.
If $\alpha$ had been set to 5%, then the P-value would be significant.
If $\alpha$ had been set to 1%, then the P-value would not be significant.

The Significance Level Cont'd
When the z score falls within the rejection region (shaded area on the tail side), the P-value is smaller than $\alpha$ and you have shown statistical significance.

Rejection Region for a Two-Tail Test of $\mu$ with $\alpha = 0.05$
A two-sided test means that $\alpha$ is spread between both tails of the curve, thus a middle area C of $1 - \alpha = 95\%$, and an upper tail area of $\alpha/2 = 0.025$.

Confidence Intervals to Test Hypotheses
Because a two-sided test is symmetrical, we can also use a confidence interval to test a two-sided hypothesis.
In a two-sided test, $C = 1 - \alpha$, where C is the confidence level and $\alpha$ is the significance level.
Example: Packs of cherry tomatoes ($\sigma = 5$ g): $H_0: \mu = 227$ g versus $H_a: \mu \ne 227$ g. Sample average 222 g.
95% CI for $\mu$: $222 \pm 1.96 \times 5/\sqrt{4} = 222$ g $\pm$ 4.9 g.
227 g does not belong to the 95% CI (217.1 to 226.9 g). Thus, we reject $H_0$.
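The confidence-interval version of the two-sided test amounts to checking whether $\mu_0$ lies inside the level $C = 1 - \alpha$ interval. A minimal sketch for the cherry tomato example; the variable names are illustrative.

```python
import math
from statistics import NormalDist

x_bar, sigma, n = 222.0, 5.0, 4   # sample mean (g), known sd (g), sample size
mu0 = 227.0                       # value claimed by H0
alpha = 0.05                      # significance level, so C = 1 - alpha = 95%

z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
margin = z_crit * sigma / math.sqrt(n)
low, high = x_bar - margin, x_bar + margin

# Reject H0 exactly when mu0 falls outside the confidence interval
reject_h0 = not (low <= mu0 <= high)

print(f"95% CI: {low:.1f} g to {high:.1f} g")
print(f"reject H0: {reject_h0}")
```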

Logic of Confidence Interval Test
A confidence interval gives a black and white answer: reject or don't reject $H_0$. But it also estimates a range of likely values for the true population mean $\mu$.
A P-value quantifies how strong the evidence is against $H_0$. But if you reject $H_0$, it doesn't provide any information about the true population mean $\mu$.
Example: a sample gives a 99% confidence interval of $\bar{x} \pm m = 0.84 \pm 0.0101$. With 99% confidence, could samples be from populations with $\mu = 0.86$? $\mu = 0.85$?

Choosing the Significance Level $\alpha$
Factors often considered:
What are the consequences of rejecting the null hypothesis (e.g., global warming, convicting a person for life with DNA evidence)?
Are you conducting a preliminary study? If so, you may want a larger $\alpha$ so that you will be less likely to miss an interesting result.
Some conventions:
We typically use the standards of our field of work.
There are no sharp cutoffs: e.g., 4.9% versus 5.1%.
It is the order of magnitude of the P-value that matters: somewhat significant, significant, or very significant.

Practical Significance
Statistical significance only says whether the effect observed is likely to be due to chance alone because of random sampling.
Statistical significance may not be practically important. That is because statistical significance doesn't tell you about the magnitude of the effect, only that there is one.
An effect could be too small to be relevant. And with a large enough sample size, significance can be reached even for the tiniest effect.
Example: a drug to lower temperature is found to reproducibly lower patient temperature by 0.4 °C (P-value < 0.01). But clinical benefits of temperature reduction only appear for a decrease of 1 °C or larger.

Don't Ignore Lack of Significance
Failing to find statistical significance in results means not rejecting the null hypothesis. This is very different from actually accepting it. The sample size, for instance, could be too small to overcome large variability in the population.
When comparing two populations, lack of significance does not imply that the two samples come from the same population. They could represent two very distinct populations with similar mathematical properties.
Consider this provocative title from the British Medical Journal: "Absence of evidence is not evidence of absence."
Having no proof of who committed a murder does not imply that the murder was not committed.

Interpreting Effect Size: It's All About Context
There is no consensus on how big an effect has to be in order to be considered meaningful. In some cases, effects that may appear to be trivial can be very important.
Example: Improving the format of a computerized test reduces the average response time by about 2 seconds. Although this effect is small, it is important since the test is taken millions of times a year. The cumulative time savings of using the better format is gigantic.
Always think about the context. Try to plot your results, and compare them with a baseline or results from similar studies.

Objectives Inference for the mean of a population: The t distributions; The one-sample t confidence interval; The one-sample t test; Matched pairs t procedures; Robustness. Comparing two means: Two-sample z statistic; Two-sample t procedures; Two-sample t significance test; Two-sample t confidence interval; Robustness.

An Example: Sweetening Colas Cola manufacturers want to test how much the sweetness of a new cola drink is affected by storage. The sweetness loss due to storage was evaluated by 10 professional tasters (by comparing the sweetness before and after storage). Sweetness loss by taster: 1: 2.0; 2: 0.4; 3: 0.7; 4: 2.0; 5: 0.4; 6: 2.2. We want to test whether storage results in a loss of sweetness, thus: $H_0: \mu = 0$ versus $H_a: \mu > 0$. This looks familiar. However, here we do not know the population parameter σ. The population of all cola drinkers is too large, and since this is a new cola recipe, we have no population data. This situation is very common with real data.

When σ is Unknown The sample standard deviation s provides an estimate of the population standard deviation σ. When the sample size is large, the sample is likely to contain elements representative of the whole population. Then s is a good estimate of σ. But when the sample size is small, the sample contains only a few individuals. Then s is a mediocre estimate of σ.

Standard Deviation s and Standard Error $s/\sqrt{n}$ For a sample of size n, the sample standard deviation s is $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, where $n - 1$ is the degrees of freedom. The value $s/\sqrt{n}$ is called the standard error of the mean (SEM). Scientists often present sample results as mean ± SEM. Example: A study examined the effect of a new medication on the seated systolic blood pressure. The results, presented as mean ± SEM for 25 patients, are 113.5 ± 8.9. What is the standard deviation s of the sample data? Solution: SEM $= s/\sqrt{n}$, so $s = 8.9 \times \sqrt{25} = 44.5$.
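
As a small sketch of the two formulas on this slide, the Python below computes a sample standard deviation with $n - 1$ in the denominator and recovers s from a reported SEM. The function names `sample_sd` and `sd_from_sem` are illustrative, not part of the course material.

```python
from math import sqrt

def sample_sd(xs):
    """Sample standard deviation with n - 1 degrees of freedom."""
    n = len(xs)
    mean = sum(xs) / n
    return sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def sd_from_sem(sem, n):
    """Recover s from a reported standard error of the mean: s = SEM * sqrt(n)."""
    return sem * sqrt(n)

# Blood pressure example: mean ± SEM = 113.5 ± 8.9 for n = 25 patients.
s = sd_from_sem(sem=8.9, n=25)
print(f"s = {s:.1f}")
```

Note how much larger s is than the SEM: the ± 8.9 describes the precision of the mean, not the spread of individual patients.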

The t Distributions Suppose that an SRS of size n is drawn from an $N(\mu, \sigma)$ population. When σ is known, the sampling distribution of $\bar{x}$ is $N(\mu, \sigma/\sqrt{n})$. When σ is estimated by the sample standard deviation s, the sampling distribution follows a t distribution $t(\mu, s/\sqrt{n})$ with $n - 1$ degrees of freedom, and $t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$ is the one-sample t statistic. When n is very large, s is a very good estimate of σ, and the corresponding t distributions are very close to the normal distribution. The t distributions become wider for smaller sample sizes, reflecting the lack of precision in estimating σ from s.
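
The claim that t distributions are wider for small samples and approach the normal for large ones can be checked numerically. The sketch below uses only the Python standard library; it integrates the t density with Simpson's rule, which is an implementation choice made here for self-containment (a statistics package would normally provide this), and the helper names are illustrative.

```python
from math import exp, lgamma, pi, sqrt
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * (log_df_pi(df))
    return exp(log_c) * (1 + x * x / df) ** (-(df + 1) / 2)

def log_df_pi(df):
    """log(df * pi), kept separate so t_pdf stays readable."""
    from math import log
    return log(df * pi)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: one half minus the Simpson's-rule integral over [0, t]."""
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

for df in (2, 9, 29, 1000):
    print(f"df = {df:>4}: P(T > 2) = {t_upper_tail(2.0, df):.4f}")
print(f"normal   : P(Z > 2) = {1 - NormalDist().cdf(2.0):.4f}")
```

The tail area beyond 2 shrinks toward the normal value as the degrees of freedom grow, which is the "wider for smaller sample sizes" behavior described above.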

Standardizing t Distribution As with the normal distribution, the first step is to standardize the data. Then we can use Table D to obtain the area under the curve. Here, µ is the mean (center) of the sampling distribution, and the standard error of the mean $s/\sqrt{n}$ is its standard deviation (width). You obtain s, the standard deviation of the sample, with your calculator.

Table A and Table D

The One-Sample t Confidence Interval The level C confidence interval is an interval with confidence C of containing the true population parameter. We have a data set from a population with both µ and σ unknown. We use $\bar{x}$ to estimate µ and s to estimate σ, using a t distribution with $df = n - 1$. Practical use of t: C is the area between $-t^*$ and $t^*$. We find $t^*$ in the line of Table D for $df = n - 1$ and confidence level C. The margin of error m is $m = t^* s/\sqrt{n}$.

Example: Red Wine To see if moderate red wine consumption increases the average blood level of polyphenols, a group of nine randomly selected healthy men were assigned to drink half a bottle of red wine daily for two weeks. Their blood polyphenol levels were assessed before and after the study, and the percent changes are 0.7, 3.5, 4, 4.9, 5.5, 7, 7.4, 8.1, 8.4. Are the data approximately normal? Yes, there is a low value, but overall the data can be considered reasonably normal.

Example: Red Wine Cont'd What is the 95% confidence interval for the average percent change? Sample average $\bar{x} = 5.5$; $s = 2.517$; $df = n - 1 = 8$. The sampling distribution is a t distribution with $n - 1$ degrees of freedom. For df = 8 and C = 95%, $t^* = 2.306$, so the margin of error is $m = t^* s/\sqrt{n} = 2.306 \times 2.517/\sqrt{9} \approx 1.9$. With 95% confidence, the population average percent increase in polyphenol blood levels of healthy men drinking half a bottle of red wine daily is between 3.6% and 7.4%. Important: The confidence interval shows how large the increase is, but not whether it can have an impact on men's health.
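
The red wine interval can be reproduced from the nine percent changes listed in the example. This is a minimal sketch using the Python standard library; the critical value $t^* = 2.306$ is taken from Table D as quoted in the example rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

changes = [0.7, 3.5, 4, 4.9, 5.5, 7, 7.4, 8.1, 8.4]  # percent change, from the example
t_star = 2.306                                        # Table D: df = 8, C = 95%

n = len(changes)
x_bar = mean(changes)
s = stdev(changes)                 # sample standard deviation (n - 1 in the denominator)
margin = t_star * s / sqrt(n)
low, high = x_bar - margin, x_bar + margin

print(f"mean = {x_bar:.2f}, s = {s:.3f}, m = {margin:.2f}")
print(f"95% CI: {low:.1f} to {high:.1f}")
```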

The One-Sample t-test As in the previous chapter, a test of hypotheses requires a few steps: stating the null and alternative hypotheses ($H_0$ versus $H_a$); deciding on a one-sided or two-sided test; choosing a significance level α; calculating t and its degrees of freedom; finding the area under the curve with Table D; stating the p-value and interpreting the result.

The One-Sample t-test Cont'd The p-value is the probability, if $H_0$ is true, of randomly drawing a sample like the one obtained or more extreme, in the direction of $H_a$. Equivalently, it is the probability that random fluctuations alone could have generated results that differ from $H_0$, in the direction of $H_a$, by at least as much as what you observed in your data. The p-value is calculated as the corresponding area under the curve, one-tailed or two-tailed depending on $H_a$.

Example: Sweetening Colas Cont'd Is there evidence that storage results in sweetness loss for the new cola recipe at the 0.05 level of significance (α = 5%)? $H_0: \mu = 0$ versus $H_a: \mu > 0$ (one-sided test). $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1.02 - 0}{1.196/\sqrt{10}} = 2.70$. With df = 9, the critical value is $t_\alpha = 1.833$; $t > t_\alpha$, thus the result is significant. Alternatively, $2.398 < t = 2.70 < 2.821$, thus $0.02 > p > 0.01$; $p < \alpha$, thus the result is significant. We reject $H_0$: there is a significant loss of sweetness, on average, following storage.
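
The cola calculation can be sketched from the summary statistics given in the example ($\bar{x} = 1.02$, $s = 1.196$, $n = 10$). The helper name `one_sample_t` is illustrative, and the critical value 1.833 is the Table D entry quoted above rather than a computed quantity.

```python
from math import sqrt

def one_sample_t(x_bar, mu0, s, n):
    """One-sample t statistic and its degrees of freedom."""
    return (x_bar - mu0) / (s / sqrt(n)), n - 1

t, df = one_sample_t(x_bar=1.02, mu0=0.0, s=1.196, n=10)
t_crit = 1.833                      # Table D: df = 9, one-sided alpha = 0.05

print(f"t = {t:.2f}, df = {df}, reject H0: {t > t_crit}")
```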

Matched Pairs t Procedures Sometimes we want to compare treatments or conditions at the individual level. These situations produce two samples that are not independent: they are related to each other. The members of one sample are identical to, or matched (paired) with, the members of the other sample. Example: Pre-test and post-test studies look at data collected on the same sample elements before and after some experiment is performed. Example: Twin studies often try to sort out the influence of genetic factors by comparing a variable between sets of twins. Example: Using people matched for age, sex, and education in social studies allows canceling out the effect of these potential lurking variables. In these cases, we use the paired data to test the difference in the two population means. The variable studied becomes $X_{\text{difference}} = X_1 - X_2$, and $H_0: \mu_{\text{difference}} = 0$; $H_a: \mu_{\text{difference}} > 0$ (or $< 0$, or $\neq 0$). Conceptually, this is not different from tests on one population.
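
To make the reduction to a one-sample problem concrete, here is a minimal Python sketch. The before/after scores are hypothetical, invented purely for illustration; the point is only that the paired analysis is a one-sample t computation on the differences.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical scores for five subjects (illustration only, not course data).
before = [12, 15, 9, 14, 11]
after = [14, 18, 9, 17, 12]

diffs = [a - b for a, b in zip(after, before)]   # X_difference for each pair
n = len(diffs)
t = mean(diffs) / (stdev(diffs) / sqrt(n))        # tests H0: mu_difference = 0

print(f"differences = {diffs}, t = {t:.2f}, df = {n - 1}")
```

Only the list of differences enters the test; the two original columns are never compared as independent samples.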

Example: Does Lack of Caffeine Increase Depression? Individuals diagnosed as caffeine-dependent are deprived of caffeine-rich foods and assigned to receive daily pills. Sometimes the pills contain caffeine, and other times they contain a placebo. Depression was assessed. There are 2 data points for each subject, but we'll only look at the difference. For each individual in the sample, we have calculated a difference in depression score (placebo minus caffeine). There were 11 difference points, thus $df = n - 1 = 10$. $\bar{x} = 7.36$; $s = 6.92$. $H_0: \mu_{\text{difference}} = 0$; $H_a: \mu_{\text{difference}} > 0$. $t = \frac{\bar{x} - 0}{s/\sqrt{n}} = 3.53$. For df = 10, $3.169 < t = 3.53 < 3.581$, so $0.005 > p > 0.0025$. Caffeine deprivation causes a significant increase in depression.

Robustness The t procedures are exactly correct when the population is distributed exactly normally. However, most real data are not exactly normal. The t procedures are robust to small deviations from normality: the results will not be affected too much. Factors that strongly matter: Random sampling: the sample must be an SRS from the population. Outliers and skewness: they strongly influence the mean and therefore the t procedures. However, their impact diminishes as the sample size gets larger because of the Central Limit Theorem. Specifically: When $n < 15$, the data must be close to normal and without outliers. When $15 \le n \le 40$, mild skewness is acceptable but not outliers. When $n > 40$, the t statistic will be valid even with strong skewness.

Power of the t-test The power or sensitivity of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis ($H_0$) when the alternative hypothesis ($H_a$) is true. The power of the one-sample t-test for a specific alternative value of the population mean µ, assuming a fixed significance level α, is the probability that the test will reject the null hypothesis when the alternative value of the mean is true. Calculation of the exact power of the t-test is a bit complex. But an approximate calculation that acts as if σ were known is almost always adequate for planning a study. This calculation is very much like that for the z-test. When guessing σ, it is always better to err on the side of a standard deviation that is a little larger rather than smaller. We want to avoid failing to find an effect because we did not have enough data.
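
The approximate power calculation that treats σ as known can be sketched as follows. This assumes a one-sided test with $H_a: \mu > \mu_0$ and uses the normal distribution in place of t; the planning values (alternative mean 1.0, guessed σ of 1.2) are hypothetical, and `approx_power` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(mu0, mu_alt, sigma, n, alpha=0.05):
    """Approximate power of a one-sided test (H_a: mu > mu0), treating sigma as known."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)                 # rejection cutoff on the z scale
    shift = (mu_alt - mu0) / (sigma / sqrt(n))     # alternative mean in standard errors
    return 1 - z.cdf(z_alpha - shift)

# Hypothetical planning values; sigma is a deliberately generous guess.
for n in (5, 10, 20):
    power = approx_power(mu0=0.0, mu_alt=1.0, sigma=1.2, n=n)
    print(f"n = {n:>2}: approximate power = {power:.3f}")
```

Guessing a larger σ lowers the computed power for a given n, so it pushes the planned sample size up, which is the safer direction to err in.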

Inference for Non-Normal Distributions What if the population is clearly non-normal and your sample is small? If the data are skewed, you can attempt to transform the variable to bring it closer to normality (e.g., a logarithm transformation). The t procedures applied to transformed data are quite accurate even for moderate sample sizes. A distribution other than the normal distribution might describe your data well; many non-normal models have been developed that provide inference procedures too. You can always use a distribution-free (nonparametric) inference procedure that does not assume any specific distribution for the population, but it is usually less powerful than distribution-driven tests (e.g., the t test).

Nonparametric Method: the Sign Test for Median A distribution-free test usually makes a statement of hypotheses about the median rather than the mean. A simple distribution-free test is the sign test for matched pairs. Assume that our random variable X is a continuous random variable with unknown median m. Upon taking a random sample $X_1, X_2, \ldots, X_n$, we'll be interested in testing whether the median m takes on a particular value $m_0$: $H_0: m = m_0$; $H_a: m > m_0$ or $H_a: m < m_0$ or $H_a: m \neq m_0$. Consider the quantities $X_i - m_0$ for $i = 1, 2, \ldots, n$. If the null hypothesis is true, that is, $m = m_0$, then we should expect about half of the $x_i - m_0$ quantities obtained to be positive and half to be negative. This analysis of $X_i - m_0$ under the three situations $m = m_0$, $m > m_0$, and $m < m_0$ suggests that a reasonable test for the value of a median m should depend on the signs of $X_i - m_0$.

The Sign Test for a Median: Steps Calculate the matched difference for each individual in the sample. Ignore pairs with difference 0; the number of trials n is the count of the remaining pairs. Record $N^-$ = the number of negative signs and $N^+$ = the number of positive signs. If the null hypothesis is true ($m = m_0$), then $N^-$ and $N^+$ both follow a binomial distribution with parameters n and $p = 1/2$. Calculate the p-value and state the conclusion: For $H_a: m > m_0$, reject $H_0$ if the observed $n^-$ is too small, or alternatively, if the p-value $P(N^- \le n^-)$ is small. For $H_a: m < m_0$, reject $H_0$ if the observed $n^+$ is too small, or alternatively, if the p-value $P(N^+ \le n^+)$ is small. For $H_a: m \neq m_0$, reject $H_0$ if $\min(n^-, n^+)$ is too small, or alternatively, if the p-value $2P(N_{\min} \le \min(n^-, n^+))$ is small.

Example: the Sign Test for a Median Question: A random sample of 20 numbers is 9.4, 13.4, 15.6, 16.2, 16.4, 16.8, 18.1, 18.7, 18.9, 19.1, 19.3, 20.1, 20.4, 21.6, 21.9, 23.4, 23.5, 24.8, 24.9, 26.8. Is there sufficient evidence to conclude that the median is smaller than 22? Solution: We test the null hypothesis $H_0: m = 22$ against the alternative hypothesis $H_a: m < 22$. First calculate $x_i - 22$. The observed number of positive signs is $n^+ = 5$. Therefore, we need to calculate how likely it would be to observe as few as 5 positive signs if the null hypothesis were true. The p-value is $P(X \le 5) = P(X = 5) + P(X = 4) + P(X = 3) + P(X = 2) + P(X = 1) + P(X = 0) = 0.0207 < 0.05$. There is sufficient evidence, at the 0.05 level, to conclude that the median is smaller than 22.
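
The sign test in this example needs nothing more than counting and a binomial sum, so it can be sketched directly with the Python standard library. The data are the 20 numbers from the example; the variable names are illustrative.

```python
from math import comb

data = [9.4, 13.4, 15.6, 16.2, 16.4, 16.8, 18.1, 18.7, 18.9, 19.1,
        19.3, 20.1, 20.4, 21.6, 21.9, 23.4, 23.5, 24.8, 24.9, 26.8]
m0 = 22   # hypothesized median under H0

n_plus = sum(1 for x in data if x > m0)    # positive signs of x_i - m0
n_minus = sum(1 for x in data if x < m0)   # negative signs of x_i - m0
n = n_plus + n_minus                       # values equal to m0 would be dropped

# Under H0 the number of positive signs is Binomial(n, 1/2); H_a: m < m0
# is supported by few positive signs, so the p-value is P(X <= n_plus).
p_value = sum(comb(n, k) for k in range(n_plus + 1)) / 2 ** n

print(f"n+ = {n_plus}, n = {n}, p-value = {p_value:.4f}")
```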

Comparing Two Samples Independent samples: subjects in one sample are completely unrelated to subjects in the other sample. We often compare two treatments used on independent samples. Is the difference between the two treatments due only to variations from the random sampling, or does it reflect a true difference in population means?

Two-Sample z Statistic We have two independent SRSs (simple random samples), possibly coming from two distinct populations with $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$. We use $\bar{x}_1$ and $\bar{x}_2$ to estimate the unknown $\mu_1$ and $\mu_2$. When both populations are normal, the sampling distribution of $(\bar{x}_1 - \bar{x}_2)$ is also normal, with standard deviation $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$. Then the two-sample z statistic $z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$ has the standard normal $N(0, 1)$ sampling distribution. The null hypothesis is typically that both population means $\mu_1$ and $\mu_2$ are equal, thus their difference is equal to zero: $H_0: \mu_1 = \mu_2 \Leftrightarrow \mu_1 - \mu_2 = 0$, with either a one-sided or a two-sided alternative hypothesis.

Two Independent Samples t Distribution We have two independent SRSs (simple random samples), possibly coming from two distinct populations with $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$ unknown. We use $(\bar{x}_1, s_1)$ and $(\bar{x}_2, s_2)$ to estimate $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$, respectively. To compare the means, both populations should be normally distributed. However, in practice, it is enough that the two distributions have similar shapes and that the sample data contain no strong outliers. The two-sample t statistic follows approximately the t distribution with a standard error SE (spread) reflecting variation from both samples: $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$. Conservatively, the degrees of freedom is equal to the smaller of $n_1 - 1$ and $n_2 - 1$.

Two-Sample t Significance Test The null hypothesis is that both population means $\mu_1$ and $\mu_2$ are equal, thus their difference is equal to zero: $H_0: \mu_1 = \mu_2 \Leftrightarrow \mu_1 - \mu_2 = 0$, with either a one-sided or a two-sided alternative hypothesis. We find how many standard errors (SE) away from $(\mu_1 - \mu_2)$ the observed $(\bar{x}_1 - \bar{x}_2)$ is by standardizing with t: $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE}$. Because in a two-sample test $H_0$ poses $\mu_1 - \mu_2 = 0$, we simply use $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, with $df = \min(n_1 - 1, n_2 - 1)$.

Example: Two-Sample t Significance Test We want to know whether parental smoking decreases children's lung capacity as measured by the forced vital capacity (FVC) test. Is the mean FVC lower in the population of children exposed to parental smoking? FVC summary by parental smoking ($\bar{x}$, s, n): Yes: 75.5, 9.3, 30; No: 88.2, 15.1, 30. $H_0: \mu_{\text{smoke}} = \mu_{\text{no}} \Leftrightarrow \mu_{\text{smoke}} - \mu_{\text{no}} = 0$; $H_a: \mu_{\text{smoke}} < \mu_{\text{no}} \Leftrightarrow \mu_{\text{smoke}} - \mu_{\text{no}} < 0$ (one-sided). The difference in sample averages follows approximately the t distribution $t\left(0, \sqrt{\frac{s_{\text{smoke}}^2}{n_{\text{smoke}}} + \frac{s_{\text{no}}^2}{n_{\text{no}}}}\right)$ with df = 29. We calculate the t statistic: $t = \frac{\bar{x}_{\text{smoke}} - \bar{x}_{\text{no}}}{\sqrt{\frac{s_{\text{smoke}}^2}{n_{\text{smoke}}} + \frac{s_{\text{no}}^2}{n_{\text{no}}}}} = \frac{75.5 - 88.2}{\sqrt{\frac{9.3^2}{30} + \frac{15.1^2}{30}}} = -3.9$. In Table D, for df = 29 we find $|t| > 3.659$, so $p < 0.0005$ (one-sided). The difference is highly significant and we reject $H_0$: lung capacity is significantly impaired in children of smoking parents.
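
The FVC computation can be sketched from the summary statistics in the example. The function name `two_sample_t` is illustrative; it uses the unpooled standard error and the conservative degrees of freedom described in the preceding slides. The statistic comes out negative because the exposed group has the lower mean; its magnitude is the 3.9 used with Table D.

```python
from math import sqrt

def two_sample_t(x1, s1, n1, x2, s2, n2):
    """Two-sample t statistic, its standard error, and the conservative df."""
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (x1 - x2) / se, se, min(n1, n2) - 1

# Group 1: children exposed to parental smoking; group 2: not exposed.
t, se, df = two_sample_t(75.5, 9.3, 30, 88.2, 15.1, 30)

print(f"SE = {se:.2f}, t = {t:.2f}, df = {df}")
```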

Two-Sample t Confidence Interval Because we have two independent samples, we use the difference between both sample averages $(\bar{x}_1 - \bar{x}_2)$ to estimate $(\mu_1 - \mu_2)$. Practical use of t: C is the area between $-t^*$ and $t^*$. We find $t^*$ in the line of Table D for $df = \min(n_1 - 1, n_2 - 1)$ and the column for confidence level C. The margin of error m is $m = t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = t^* \cdot SE$.

Common Mistake A common mistake is to calculate a one-sample confidence interval for $\mu_1$ and then check whether $\mu_2$ falls within that confidence interval, or vice versa. This is WRONG because the variability in the sampling distribution for two independent samples is more complex and must take into account variability coming from both samples. Hence the more complex formula for the standard error: $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$.

Example: Two-Sample t Confidence Interval Can directed reading activities in the classroom help improve reading ability? A class of 21 third-graders participates in these activities for 8 weeks while a control classroom of 23 third-graders follows the same curriculum without the activities. After 8 weeks, all children take a reading test (scores in table). 95% confidence interval for $(\mu_1 - \mu_2)$, with df = 20 conservatively, $t^* = 2.086$: CI: $(\bar{x}_1 - \bar{x}_2) \pm m$; $m = t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = 2.086 \times 4.31 = 8.99$. With 95% confidence, $(\mu_1 - \mu_2)$ falls within 9.96 ± 8.99, or 1.0 to 18.9.
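
The interval arithmetic in this example can be sketched as below. Since the score table itself is not reproduced here, the sketch starts from the quantities the example does state: the difference in sample means (9.96), the standard error (4.31), and the Table D value $t^* = 2.086$. The function name `two_sample_ci` is illustrative.

```python
def two_sample_ci(diff, se, t_star):
    """Confidence interval (diff - m, diff + m) with margin m = t_star * se."""
    m = t_star * se
    return diff - m, diff + m, m

low, high, m = two_sample_ci(diff=9.96, se=4.31, t_star=2.086)

print(f"m = {m:.2f}, CI = ({low:.2f}, {high:.2f})")
```

The whole interval lies above zero, which is consistent with the activities improving reading scores at the 95% confidence level, though the lower end shows the improvement could be quite small.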