Math 494: Mathematical Statistics

Math 494: Mathematical Statistics
Instructor: Jimin Ding, Department of Mathematics, Washington University in St. Louis
Class materials are available on the course website ( jmding/math494/ )
Spring 2018
Jimin Ding, Math WUSTL

Introduction to Statistical Inferences

Statistical Problem
A typical statistical problem can be summarized in the following flow:
random experiments → data → analysis → inferences
The random experiments may come from the natural sciences, social sciences, or engineered systems. Sometimes they are carefully designed to control the factors of interest (experimental design), and sometimes they come from real-world processes (observational studies).
The collected data could be scalars (exam responses), vectors (stock prices), matrices (digital images), arrays (contingency tables), characters (text mining), or functions (time series, fMRI).

Statistical Analysis
There are two ways of analyzing data:
Descriptive data analysis (exploratory data analysis)
summarizes data into statistics such as the mean, median, range, standard deviation, ...
visualizes features of data with histograms, pie charts, boxplots, ...
Inferential statistical analysis
assumes a probability model on the collected data (sample)
investigates features of the distribution, or of a family of distributions, to infer features of the entire group of individuals (population) that may be impossible or too expensive to examine.
We will focus on the latter in this course.

Example 1: Quality Control
Consider a population of N elements, for instance, a shipment of manufactured items. An unknown number Nθ of these elements are defective. It is expensive to examine all items when N is large (and impossible if the inspection is destructive). To learn θ, one may randomly draw a sample of n items without replacement and inspect them. (Assume all items have the same probability of being defective.)
Population: N items, Nθ of them defective
Sample (data): n sampled items, X defective items in the sample
Probability model:
P(X = k) = C(Nθ, k) C(N − Nθ, n − k) / C(N, n),  k = 0, 1, ..., min(Nθ, n),
where C(a, b) denotes the binomial coefficient. This is the hypergeometric distribution, H(Nθ, N, n), a family of distributions indexed by θ. Here θ is the unknown parameter that we want to estimate.
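
The hypergeometric pmf can be sketched numerically with Python's math.comb; the shipment size N = 20, defective count Nθ = 4, and sample size n = 5 below are hypothetical values chosen only for illustration.

```python
from math import comb

def hypergeom_pmf(k, N, D, n):
    """P(X = k) when n items are drawn without replacement
    from N items, D = N*theta of which are defective."""
    return comb(D, k) * comb(N - D, n - k) / comb(N, n)

# Hypothetical shipment: N = 20 items, D = 4 defective; sample n = 5.
N, D, n = 20, 4, 5
pmf = [hypergeom_pmf(k, N, D, n) for k in range(min(D, n) + 1)]

# The probabilities over k = 0, ..., min(D, n) sum to 1.
total = sum(pmf)
```

After observing X = k defectives in the sample, a natural first guess for θ is the sample fraction k / n.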

Example 2: Measurement Problem
An experimenter makes n independent determinations of a physical constant µ.
Population: all possible measurements from such physical experiments.
Sample (data): observed measurements X_1, ..., X_n from the recorded experiments, subject to measurement errors.
Probability model: X_i = µ + ɛ_i, 1 ≤ i ≤ n.
Different distributional assumptions can be considered for ɛ:
1. the ɛ_i are independent;
2. the ɛ_i are identically distributed;
3. the distribution of ɛ_i does not depend on µ;
4. (1) ɛ_i iid N(0, σ²); here θ = (µ, σ²) ∈ Θ, with Θ = R × R⁺ ⊂ R²; or
   (2) ɛ_i iid F, where F is a cdf with expectation 0 and finite variance σ².
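
A quick simulation of this model shows how the sample mean recovers µ; the true values µ = 9.81 and σ = 0.05 (think of repeated measurements of a physical constant) are hypothetical choices for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical truth, for illustration only.
mu, sigma, n = 9.81, 0.05, 50

# Model: X_i = mu + eps_i with eps_i iid N(0, sigma^2).
x = [mu + random.gauss(0.0, sigma) for _ in range(n)]

mu_hat = statistics.mean(x)       # estimates mu
sigma_hat = statistics.stdev(x)   # estimates sigma
```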

Example 3: Treatment Effect
To compare the treatment effects of drugs A and B for a given disease, m subjects were assigned to drug A and n were assigned to drug B. Let X_1, ..., X_m be the responses of the m subjects receiving drug A, and Y_1, ..., Y_n the responses of the n subjects receiving drug B. If drug A is a placebo, then X_1, ..., X_m are referred to as control observations and Y_1, ..., Y_n as treatment observations.
Population: responses of all subjects receiving drug A or B.
Sample (data): X_1, ..., X_m for A; Y_1, ..., Y_n for B.
Probability model: different models can be considered:
1. X_i iid F, Y_j iid G; then the model is (F, G).
2. F = N(µ, σ²) and G = N(µ + ∆, σ²); then the parameter is (∆, µ, σ²).

Two Types of Questions
1. Given X_1, ..., X_n iid ~ F (or f), referred to as a random sample of size n from the distribution F (or f), how do we infer F or f?
F = {f(x) : f > 0, ∫ f(x) dx = 1} or
F = {F(x) : F(−∞) = 0, F(+∞) = 1, F is monotone increasing and right continuous}.
Nonparametric model.
2. Given X_1, ..., X_n iid ~ F(x; θ) or f(x; θ), how do we infer θ?
F = {f(x; θ) : θ ∈ Θ}, where θ is an unknown parameter that can take values in the parameter space Θ.
Parametric model.

Types of Statistical Models
Nonparametric model: the goal is F or f.
Parametric model: the goal is θ.
Since the second question has a smaller space of candidates, we start with question 2 (the simpler one).
Sometimes θ is a vector but we are only interested in one (or some) of its components. In this case, the remaining parameters are referred to as nuisance parameters.
A semiparametric model is a combination of parametric and nonparametric models: the goal is f(x; θ, g), which is only partially specified by the parameter θ, with g an unspecified (infinite-dimensional) function.

What is Statistical Inference?
Statistical inference is the process of using data to infer the distribution that generated the data.
Three components of statistical inference:
Point Estimation
Confidence Interval
Hypothesis Testing

Point Estimation

Point Estimation
Parameter of interest: a fixed and unknown population parameter θ, or a function of the model parameter, g(θ). E.g.: population mean µ, population standard deviation σ, ...
Point estimator: a statistic from a sample that is used to estimate the parameter of interest, ˆθ (which is a random variable). E.g.: the sample mean X̄ = Σ_i X_i / n for µ; the sample standard deviation S = sqrt( Σ_i (X_i − X̄)² / (n − 1) ) for σ.
Estimate: the numerical value of an estimator in an observed sample. E.g.: the observed sample mean x̄ = Σ_i x_i / n for µ; the observed sample standard deviation s = sqrt( Σ_i (x_i − x̄)² / (n − 1) ) for σ.
Point estimation: the process of providing a point estimator.

Example 1: Poisson Model (Ex 4.1.3)
Suppose the number of customers X that enter a store during 9–10 am follows a Poisson distribution with parameter θ. A random sample of the number of customers that enter the store during 9–10 am on 10 days results in the values:
9, 7, 9, 15, 10, 13, 11, 7, 2, 12.
What is a good guess of θ?
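
Since E[X] = θ for a Poisson(θ) distribution, the sample mean of the ten recorded counts is a natural guess:

```python
# Counts from the ten recorded days.
counts = [9, 7, 9, 15, 10, 13, 11, 7, 2, 12]

# For Poisson(theta), E[X] = theta, so the sample mean is a natural guess.
theta_guess = sum(counts) / len(counts)
print(theta_guess)  # 9.5
```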

Maximum Likelihood Estimation (MLE)
The observed values in a random sample, x_1, x_2, ..., x_n, are called realizations of the sample.
Likelihood function: the joint pdf of the realizations (observed data),
L(θ) = f_1(x_1; θ) f_2(x_2; θ) ··· f_n(x_n; θ) = Π_{i=1}^n f(x_i; θ) (if iid).
The larger L(θ) is, the more likely it is that we would observe these realizations x_1, x_2, ..., x_n. The θ that maximizes the likelihood function L(θ) is referred to as the maximum likelihood estimator (MLE):
ˆθ = arg max_{θ∈Θ} L(θ) = arg max_{θ∈Θ} log L(θ).

Example 2: Normal Model (Example 4.1.3)
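
The slide's worked derivation is not reproduced in this transcript. As a sketch of its standard result: for X_1, ..., X_n iid N(µ, σ²), maximizing the log-likelihood gives the sample mean and the divide-by-n sample variance. The data below are hypothetical.

```python
import statistics

# Hypothetical sample for illustration.
x = [4.9, 5.1, 5.0, 4.8, 5.2]
n = len(x)

# Maximizing log L(mu, sigma^2) for iid N(mu, sigma^2) data yields
# mu_hat = xbar and sigma2_hat = (1/n) * sum (x_i - xbar)^2
# (note the divisor n, not n - 1).
mu_hat = statistics.mean(x)
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n
```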

Example 3: Uniform Distribution (Example 4.1.4)
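
Again as a sketch (the slide's derivation is not in the transcript): for X_1, ..., X_n iid Uniform(0, θ), the likelihood θ^(−n) 1{max_i x_i ≤ θ} is decreasing in θ on [max_i x_i, ∞), so the MLE is the sample maximum. The sample below is hypothetical.

```python
# Hypothetical sample, modeled as Uniform(0, theta).
x = [0.7, 2.3, 1.1, 2.9, 0.4]

# L(theta) = theta**(-n) for theta >= max(x) and 0 otherwise,
# which is decreasing in theta, so the MLE is the sample maximum.
theta_hat = max(x)
print(theta_hat)  # 2.9
```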

Properties of MLE
Plug-in theorem: if ˆθ is the MLE of θ, then g(ˆθ) is the MLE of g(θ).
The MLE is consistent.
The MLE is asymptotically unbiased.
The MLE is efficient.
The MLE is often asymptotically normal.
The MLE is equivalent to the LSE in regression under the normality assumption.

Nonparametric MLE: for the CDF
Empirical CDF for F:
F̂_X(x) = (1/n) Σ_{i=1}^n 1(X_i ≤ x).
This stepwise function is called the empirical CDF. It is an unbiased estimator of the CDF. Because it can be viewed as an average of Bernoulli random variables, it obeys the LLN as n → ∞. It can be proved to be a √n-consistent estimator of the CDF, and it is asymptotically normal. It can further be proved to be a uniformly consistent estimator of the CDF.
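
The empirical CDF is straightforward to code; here is a minimal sketch with a made-up sample:

```python
def ecdf(sample):
    """Return the empirical CDF F_hat(x) = (1/n) * #{i : X_i <= x}."""
    n = len(sample)
    def F_hat(x):
        return sum(1 for xi in sample if xi <= x) / n
    return F_hat

# Made-up sample of size n = 5.
F_hat = ecdf([3, 1, 4, 1, 5])
# F_hat is a step function that jumps by 1/n at each observation.
```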

Nonparametric MLE: for the PDF or PMF
Discrete case: if the sample space (the possible outcomes of X) is finite, one may list all possible values {a_1, a_2, ..., a_m} and define
f̂(a_j) = (1/n) Σ_{i=1}^n 1(X_i = a_j), for j = 1, ..., m.
If the sample space is infinite, one may keep the values with higher probabilities and group the others: {a_1, a_2, ..., a_m, ã_{m+1}}, where ã_{m+1} = {a_{m+1}, a_{m+2}, ...}. Rule of thumb: select m such that f̂(a_m) > f̂(ã_{m+1}).
Continuous case: for a given x, consider the window (x − h, x + h) for some h > 0, and set
f̂(x) = Σ_{i=1}^n 1{X_i ∈ (x − h, x + h)} / (2hn).
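
The continuous-case window estimator can be coded directly; the sample and window half-width h below are hypothetical:

```python
def density_estimate(sample, x, h):
    """f_hat(x) = #{i : X_i in (x - h, x + h)} / (2 * h * n),
    the window estimator from the continuous case above."""
    n = len(sample)
    count = sum(1 for xi in sample if x - h < xi < x + h)
    return count / (2 * h * n)

# Hypothetical sample and window half-width.
sample = [0.2, 0.4, 0.5, 0.6, 0.8]
f_hat = density_estimate(sample, 0.5, 0.25)  # 3 points fall in (0.25, 0.75)
```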

Confidence Interval

Motivation for Confidence Intervals
In the previous normal model example, ˆµ_ML = X̄ maximizes the likelihood of observing the recorded data. But P(ˆµ = µ) = ?
So we need an estimate of the estimation error. In this example, we want to estimate SD(ˆµ) or Var(ˆµ) to understand by how much ˆµ misses µ. In general, let θ be the parameter of interest and ˆθ an estimator of θ. An estimator of the standard deviation of ˆθ is called the standard error of ˆθ, denoted se(ˆθ):
se(ˆθ) := an estimate of SD(ˆθ).

Confidence Interval
Interval estimate: an interval bounded by two values and used to estimate the population parameter of interest. (A bound could be infinite.)
Some intervals are more useful than others: we want intervals that 1. have small length and 2. have a high chance of capturing the true parameter. When we widen an interval we always become more confident that the true parameter is inside it, because we have included more values.
Confidence level/coefficient: the probability that an interval estimate captures the population parameter of interest, often denoted by 1 − α for some α ∈ (0, 1).
Confidence interval (CI) estimate: an interval estimate (L, U) that has a specified (and justified) level of confidence, say 100(1 − α)%:
1 − α = P(θ ∈ (L, U)).

How to Interpret Confidence Intervals
Since the sample is random, so is the confidence interval, just like an estimator. If I collect many different samples, and hence many different 100(1 − α)% CIs, then I expect 100(1 − α)% of these intervals to capture the true parameter.
Practical interpretation: I am 100(1 − α)% confident that the interval contains/captures the population parameter (mean, variance, median, ...). The interpretation rests on the idea of repeated samples, just like sampling distributions.
Note: once the sample is drawn, the realized value of the confidence interval is (l, u), an interval of real numbers, which is no longer random. It either contains θ or it does not.

Example: CI for µ Under Normality
Case 1: σ² is known.
Case 2: σ² is unknown. (Example 4.2.1)
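
Both cases can be sketched on hypothetical data, using statistics.NormalDist for the z critical value; the t critical value t_{0.025, 9} ≈ 2.262 is a tabled constant.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical measurements.
x = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.1, 4.7, 5.0, 4.9]
n, xbar, alpha = len(x), mean(x), 0.05

# Case 1: sigma known. CI = xbar -/+ z_{alpha/2} * sigma / sqrt(n).
sigma = 0.2                                # assumed known (hypothetical)
z = NormalDist().inv_cdf(1 - alpha / 2)    # approx. 1.96
ci_z = (xbar - z * sigma / sqrt(n), xbar + z * sigma / sqrt(n))

# Case 2: sigma unknown. CI = xbar -/+ t_{alpha/2, n-1} * S / sqrt(n).
t = 2.262                                  # t_{0.025, 9}, from a t table
s = stdev(x)
ci_t = (xbar - t * s / sqrt(n), xbar + t * s / sqrt(n))
```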

Pivotal Statistics
In the construction of the above CIs, we used the statistics
Z = (X̄ − µ) / (σ/√n) and T = (X̄ − µ) / (S/√n).
These statistics are called pivotal statistics. They are random variables with the following two properties:
I. Each is a function of the unknown parameter of interest, µ, but contains no other unknown parameters.
II. The distribution of a pivotal statistic is known and free of any unknown parameters.
Using pivotal statistics to construct confidence intervals and other statistical inference procedures is a common technique in statistics.

Example: CI for µ Without Normality
Recall the CLT: let X_1, X_2, ..., X_n be iid with mean µ and variance σ² < ∞. Then
(X̄ − µ) / (σ/√n) →_D N(0, 1), as n → ∞.
The result still holds if σ is replaced by a consistent estimator ˆσ (to be proved in Chapter 5), such as the sample standard deviation ˆσ = S.
Case 1: large sample (Example 4.2.2)

Example: CI for µ Without Normality
Case 2: large sample of Bernoulli random variables (Example 4.2.3)
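
The Bernoulli case can be sketched with the large-sample interval p̂ ± z_{α/2} √(p̂(1 − p̂)/n); the counts below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical: 540 successes in n = 1000 Bernoulli trials.
n, successes = 1000, 540
p_hat = successes / n
alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)

# Large-sample CI: p_hat -/+ z * sqrt(p_hat * (1 - p_hat) / n).
half_width = z * sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - half_width, p_hat + half_width)
```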

Example: CI for µ Without Normality
Case 3: small sample (Poisson model)

Example: CI for µ_1 − µ_2
Case 1: large sample

Example: CI for µ_1 − µ_2
Case 2: under normality (Example 4.2.4)

Example: CI for p_1 − p_2 in Bernoulli/Binomial

Hypothesis Testing

Motivation
Estimation was concerned with taking the data and making guesses about what the parameter could be. What if we first guess what the parameter is, and then decide whether the data support that guess?
Our initial guess is called the null hypothesis, denoted by H_0. The alternative hypothesis, denoted by H_1, is the opposite of the null hypothesis.
Hypothesis testing: a process for making a decision, based on data, between two opposing hypotheses. Statistical hypotheses are explicit statements about population parameters.

Hypothesis Testing
Hypothesis testing is an inference tool that uses sample data to examine whether statistical hypotheses are true. It is usually done in the following 5 steps.
I. State the null hypothesis, H_0, and the alternative, H_1.
H_0: neutral / no difference / no effect.
H_1: a contradictory claim to H_0. It depends on the questions we want to answer.
II. Specify the significance level: a pre-determined small positive number, denoted by α, that bounds the probability of falsely rejecting H_0.
III. Compute the test statistic: a numerical summary of the data that measures the deviation between H_0 and H_1, usually a pivotal statistic.
IV. Derive the rejection region, or find the p-value: the probability, computed under H_0, of observing a test statistic at least as extreme as the one obtained.
V. Draw conclusions: reject H_0 if the p-value is smaller than α or if the observed data fall into the rejection region.
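
The five steps can be sketched for a one-sample z-test with σ assumed known; the data, null value µ_0, and σ below are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

# Step I: H0: mu = 5.0 vs H1: mu != 5.0 (two-sided), sigma known.
# Step II: significance level alpha = 0.05.
x = [5.12, 5.30, 4.95, 5.21, 5.18, 5.35, 5.04, 5.27, 5.10, 5.22]
mu0, sigma, alpha = 5.0, 0.2, 0.05

# Step III: pivotal test statistic Z = (Xbar - mu0) / (sigma / sqrt(n)).
n = len(x)
z_obs = (mean(x) - mu0) / (sigma / sqrt(n))

# Step IV: two-sided p-value under H0.
p_value = 2 * (1 - NormalDist().cdf(abs(z_obs)))

# Step V: reject H0 if the p-value is below alpha.
reject = p_value < alpha
```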

77 Decisions in Hypothesis Testing Let X 1,, X n iid f(x; θ), θ Θ. Consider to test H 0 : θ Θ 0, v.s. H 1 : θ Θ 1, where Θ 0 Θ 1 = Θ and Θ 0 Θ 1 = Ø. There are two types of wrong decisions in hypothesis testing. When H 0 is true, we reject H 0. (False positive, Type I error) When H 1 is true, we fail to reject H 0. (False negative, Type II error) H 0 is true H 1 is true Decision: Type I error Correct Decision Reject H 0 Probability=α Power (1 β) (False Positive) (Correct Positive) Decision: Correct Decision Type II error Fail to Reject H 0 Probability= 1 α Probability=β (Accept H 0 ) (Correct Negative) (False Negative) Jimin Ding, Math WUSTL Math 494 Spring / 44

Decision Rule

If (X_1, ..., X_n) ∈ C, then reject H_0 (C is the rejection/critical region);
if (X_1, ..., X_n) ∈ C^c, then fail to reject H_0 (C^c is the acceptance region).

Goal: select C so that the probabilities of making errors are minimized. Typically the Type I error is considered more serious, so we first bound the probability of a Type I error and then minimize the probability of a Type II error, which is denoted by β. Power: the probability of making a correct rejection, 1 − β. Generally, the probability of making a Type I error increases as the probability of making a Type II error decreases.
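The trade-off between the two error probabilities can be seen numerically. The following is a minimal sketch, not taken from the slides: a one-sided test of a normal mean with known σ that rejects H_0 when the sample mean exceeds a cutoff c. The values μ_0 = 0, μ_1 = 1, σ = 1, n = 9 and the cutoffs are illustrative choices.

```python
import math

def phi(z):
    """Standard normal c.d.f., via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Illustrative one-sided test: H0: mu <= 0 vs H1: mu > 0,
# rejecting H0 when the sample mean exceeds a cutoff c.
mu0, mu1, sigma, n = 0.0, 1.0, 1.0, 9
se = sigma / math.sqrt(n)

rows = []
for c in [0.3, 0.5, 0.7]:
    alpha = 1.0 - phi((c - mu0) / se)  # P(reject H0) at the H0 boundary
    beta = phi((c - mu1) / se)         # P(fail to reject H0) at mu = mu1
    rows.append((c, alpha, beta))
    print(f"c = {c:.1f}: alpha = {alpha:.3f}, beta = {beta:.3f}")
```

Raising the cutoff c shrinks the rejection region, so α falls while β rises, exactly the trade-off described above.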

Size of the Test and Power

We say a rejection/critical region C is of size α if

α = max_{θ ∈ Θ_0} P_θ{(X_1, ..., X_n) ∈ C},

which is the upper bound of the probability of a false rejection. Furthermore, the power at θ ∈ Θ_1 is defined as

power = P_θ{(X_1, ..., X_n) ∈ C},  θ ∈ Θ_1,

which is 1 minus the probability of making a Type II error.

We see the power of a test depends on its critical region (rule). Denote the power function by

r_C(θ) = P_θ{(X_1, ..., X_n) ∈ C},  θ ∈ Θ.

Given two critical regions C_1 and C_2, both of size α, we say C_1 is better than C_2 if r_{C_1}(θ) ≥ r_{C_2}(θ) for all θ ∈ Θ_1.

Example 1: Binomial Model
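The worked example on this slide is not reproduced in the transcription. A hedged sketch of the kind of computation involved (the numbers n = 20, θ_0 = 0.5, α = 0.05 and the alternative θ = 0.7 are illustrative, not the slide's): test H_0: θ ≤ 0.5 vs. H_1: θ > 0.5 for X ~ Binomial(n, θ), rejecting when X ≥ c.

```python
import math

def binom_tail(n, theta, c):
    """P(X >= c) for X ~ Binomial(n, theta)."""
    return sum(math.comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(c, n + 1))

n, theta0 = 20, 0.5
# For theta <= theta0 the tail probability P(X >= c) is largest at
# theta = theta0, so the size is attained at the boundary.  Take the
# smallest c whose size does not exceed 0.05.
c = min(k for k in range(n + 1) if binom_tail(n, theta0, k) <= 0.05)
size = binom_tail(n, theta0, c)      # exact size of the test
power = binom_tail(n, 0.7, c)        # power at the alternative theta = 0.7
print(c, round(size, 4), round(power, 4))
```

Because X is discrete, the exact size (about 0.021 here) is strictly below the nominal 0.05; this is typical of binomial tests.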


Example 2: Normal Model
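This slide's worked example is likewise blank in the transcription. A minimal sketch under assumed numbers (μ_0 = 0, σ = 2, n = 25, α = 0.05 are illustrative): a one-sided z-test with known σ, rejecting H_0: μ ≤ μ_0 when x̄ ≥ μ_0 + z_α σ/√n, together with its power function r_C(μ).

```python
import math

def phi(z):
    """Standard normal c.d.f., via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Illustrative one-sided z-test (known sigma), not the slide's numbers:
# H0: mu <= mu0 vs H1: mu > mu0.
mu0, sigma, n, z_alpha = 0.0, 2.0, 25, 1.645   # z_alpha for alpha = 0.05
cutoff = mu0 + z_alpha * sigma / math.sqrt(n)

def power(mu):
    """P_mu(Xbar >= cutoff): the power function of the test."""
    return 1.0 - phi((cutoff - mu) / (sigma / math.sqrt(n)))

print(round(power(mu0), 4), round(power(1.0), 4))
```

At the boundary μ = μ_0 the power function equals the size α, and it increases toward 1 as μ moves into the alternative.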

P-value

So far, we have derived decision rules and rejection regions using α. We see they depend on the choice of significance level α, but no data information, except the sample size, was used in the decision rules.

In practice, however, the data are often already collected, and one may not know a good choice of α in advance, or may want to know the decisions for a set of values of α. For example, given the observed sample mean x̄ = 5, should we reject H_0 for α = 0.01, or α = 0.05, or α = 0.10?

In this case, a p-value instead of a rejection region is often used to provide more information. Precisely, the p-value is the probability, computed under H_0, of observing data at least as extreme as the observed data in the direction of H_1. The p-value can be viewed as an observed significance level.
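The point that one p-value answers the question for every α at once can be sketched as follows. The observed mean x̄ = 5 is the slide's; the remaining numbers (μ_0 = 4, σ = 3 known, n = 36) are assumptions made only so the sketch runs.

```python
import math

def phi(z):
    """Standard normal c.d.f., via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Assumed setup (only xbar = 5 comes from the slide):
# H0: mu <= 4 vs H1: mu > 4, sigma known.
mu0, sigma, n, xbar = 4.0, 3.0, 36, 5.0
z = (xbar - mu0) / (sigma / math.sqrt(n))   # observed test statistic
p_value = 1.0 - phi(z)                      # one-sided p-value

for alpha in [0.01, 0.05, 0.10]:
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    print(f"alpha = {alpha}: {decision}")
```

Here p ≈ 0.023, so H_0 is rejected at α = 0.05 and α = 0.10 but not at α = 0.01; the single number p_value encodes the decision for every level.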

Example: Zea mays Growth (Examples 4.5.1, 4.5.5)

In 1878, Darwin recorded the heights of Zea mays plants to study the difference between cross- and self-fertilization. On each of 15 plots, one cross-fertilized plant and one self-fertilized plant were grown and then measured. The height difference between the cross-fertilized plant and the self-fertilized plant in the same plot was recorded. For the 15 plots, the mean difference is x̄ = 2.62 with sample standard deviation s. Assume the differences in height are independent and normally distributed. Are cross-fertilized plants taller than self-fertilized plants?

Population
Sample
Probability model
Parameter of interest
Null and alternative hypotheses
Test statistic
P-value
Conclusion
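A sketch of the test-statistic step for this example. The value of s was cut off in this transcription; s = 4.72 below is an assumed value (the one usually quoted for Darwin's Zea mays differences), and the critical value 1.761 is the tabulated upper 5% point of the t distribution with 14 degrees of freedom.

```python
import math

# Paired differences (cross - self): test H0: mu <= 0 vs H1: mu > 0.
# xbar = 2.62 is from the slide; s = 4.72 is an assumed value, since the
# slide's s was truncated in this transcription.
n, xbar, s = 15, 2.62, 4.72
t_stat = xbar / (s / math.sqrt(n))   # one-sample t statistic, 14 d.f.

t_crit = 1.761                       # upper 5% point of t_14 (from tables)
print(round(t_stat, 3), t_stat > t_crit)
```

Under these assumptions t ≈ 2.15 > 1.761, so at level 0.05 the data favor the conclusion that cross-fertilized plants are taller.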

Example 3: F-test for Variances

Recall that in testing two sample means, to improve power, we assumed the variances of the two normal samples were the same. Is this assumption supported by the data? Can we test the assumption formally?

Simpson, Olsen, and Eden (1975) describe an experiment in which a random sample of 26 clouds was seeded with silver nitrate to see if they produced more rain than 26 unseeded clouds. Suppose that, on a log scale, the rainfall in both samples is approximately normal, with sample means x̄ = 5.13 and ȳ = 3.99 and sample variances s²_x = 63.96 and s²_y.
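A sketch of the F statistic and a simulated p-value for this example. The slide's value of s²_y was cut off in this transcription, so the 27.86 below is a placeholder chosen only to make the sketch run; the Monte Carlo step approximates the null distribution of S²_x/S²_y under equal variances using only the standard library.

```python
import random
import statistics

n, m = 26, 26
s2_x, s2_y = 63.96, 27.86      # s2_y is hypothetical (truncated on the slide)
F = s2_x / s2_y                # compare to F with (n-1, m-1) = (25, 25) d.f.

# Monte Carlo approximation of P(S2_x/S2_y >= F) under sigma1 = sigma2.
random.seed(0)
sims = 10000
null_ratios = []
for _ in range(sims):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    y = [random.gauss(0.0, 1.0) for _ in range(m)]
    null_ratios.append(statistics.variance(x) / statistics.variance(y))
p_value = sum(r >= F for r in null_ratios) / sims
print(round(F, 3), round(p_value, 3))
```

With the placeholder s²_y the observed ratio is about 2.3, whose upper-tail probability under F(25, 25) is roughly 0.02, so equal variances would be rejected at the 5% level in this hypothetical version of the data.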

Level, Power, P-value of the F-test in a One-sided Test

Consider H_0: σ_1² ≤ σ_2² vs. H_1: σ_1² > σ_2².

Decision rule: reject H_0 in favor of H_1 if S²_x/S²_y ≥ c, where c is chosen such that

α = max_{σ_1² ≤ σ_2²} P(S²_x/S²_y ≥ c ; σ_1², σ_2²).

Proposition: Let V = S²_x/S²_y, let c be the upper α quantile of the F_{n−1,m−1} distribution, and let G_{n−1,m−1} be the c.d.f. of the F_{n−1,m−1} distribution (that is, 1 − α = G_{n−1,m−1}(c)). Then the power function of the test, P(V ≥ c), satisfies:

1. P(V ≥ c) = 1 − G_{n−1,m−1}((σ_2²/σ_1²) c).
2. P(V ≥ c) = α, if σ_1² = σ_2².
3. P(V ≥ c) < α, if σ_1² < σ_2².
4. P(V ≥ c) > α, if σ_1² > σ_2².
5. P(V ≥ c) → 0, if σ_1²/σ_2² → 0.
6. P(V ≥ c) → 1, if σ_1²/σ_2² → ∞.
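The proposition's properties can be checked by simulation. A stdlib-only sketch with illustrative choices (n = m = 26, α = 0.05, and variance ratio σ_1²/σ_2² = 2 for the alternative): the critical value c is approximated by the empirical upper-α quantile of simulated null ratios rather than taken from an F table.

```python
import math
import random
import statistics

random.seed(1)
n, m, alpha = 26, 26, 0.05

def var_ratio(sigma1, sigma2):
    """One draw of V = S2_x / S2_y with the given population s.d.'s."""
    x = [random.gauss(0.0, sigma1) for _ in range(n)]
    y = [random.gauss(0.0, sigma2) for _ in range(m)]
    return statistics.variance(x) / statistics.variance(y)

# Approximate the upper-alpha quantile c of V under sigma1 = sigma2
# by the empirical 95th percentile of simulated null ratios.
sims = 10000
null_ratios = sorted(var_ratio(1.0, 1.0) for _ in range(sims))
c = null_ratios[int((1 - alpha) * sims)]

# Property 4: power at sigma1^2 = 2 * sigma2^2 should exceed alpha.
power = sum(var_ratio(math.sqrt(2.0), 1.0) >= c
            for _ in range(sims)) / sims
print(round(c, 2), round(power, 3))
```

The simulated c lands near the tabulated upper 5% point of F(25, 25) (about 1.96), and the estimated power at σ_1²/σ_2² = 2 is well above α, consistent with properties 2 and 4.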


More information

Review and continuation from last week Properties of MLEs

Review and continuation from last week Properties of MLEs Review and continuation from last week Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that

More information

Better Bootstrap Confidence Intervals

Better Bootstrap Confidence Intervals by Bradley Efron University of Washington, Department of Statistics April 12, 2012 An example Suppose we wish to make inference on some parameter θ T (F ) (e.g. θ = E F X ), based on data We might suppose

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Chapter 8 of Devore , H 1 :

Chapter 8 of Devore , H 1 : Chapter 8 of Devore TESTING A STATISTICAL HYPOTHESIS Maghsoodloo A statistical hypothesis is an assumption about the frequency function(s) (i.e., PDF or pdf) of one or more random variables. Stated in

More information

ST 371 (IX): Theories of Sampling Distributions

ST 371 (IX): Theories of Sampling Distributions ST 371 (IX): Theories of Sampling Distributions 1 Sample, Population, Parameter and Statistic The major use of inferential statistics is to use information from a sample to infer characteristics about

More information

Introduction to Statistical Inference

Introduction to Statistical Inference Introduction to Statistical Inference Dr. Fatima Sanchez-Cabo f.sanchezcabo@tugraz.at http://www.genome.tugraz.at Institute for Genomics and Bioinformatics, Graz University of Technology, Austria Introduction

More information

Some Assorted Formulae. Some confidence intervals: σ n. x ± z α/2. x ± t n 1;α/2 n. ˆp(1 ˆp) ˆp ± z α/2 n. χ 2 n 1;1 α/2. n 1;α/2

Some Assorted Formulae. Some confidence intervals: σ n. x ± z α/2. x ± t n 1;α/2 n. ˆp(1 ˆp) ˆp ± z α/2 n. χ 2 n 1;1 α/2. n 1;α/2 STA 248 H1S MIDTERM TEST February 26, 2008 SURNAME: SOLUTIONS GIVEN NAME: STUDENT NUMBER: INSTRUCTIONS: Time: 1 hour and 50 minutes Aids allowed: calculator Tables of the standard normal, t and chi-square

More information

Two hours. To be supplied by the Examinations Office: Mathematical Formula Tables THE UNIVERSITY OF MANCHESTER. 21 June :45 11:45

Two hours. To be supplied by the Examinations Office: Mathematical Formula Tables THE UNIVERSITY OF MANCHESTER. 21 June :45 11:45 Two hours MATH20802 To be supplied by the Examinations Office: Mathematical Formula Tables THE UNIVERSITY OF MANCHESTER STATISTICAL METHODS 21 June 2010 9:45 11:45 Answer any FOUR of the questions. University-approved

More information

Primer on statistics:

Primer on statistics: Primer on statistics: MLE, Confidence Intervals, and Hypothesis Testing ryan.reece@gmail.com http://rreece.github.io/ Insight Data Science - AI Fellows Workshop Feb 16, 018 Outline 1. Maximum likelihood

More information

Parametric Models: from data to models

Parametric Models: from data to models Parametric Models: from data to models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Jan 22, 2018 Recall: Model-based ML DATA MODEL LEARNING MODEL MODEL INFERENCE KNOWLEDGE Learning:

More information

Statistics 135 Fall 2008 Final Exam

Statistics 135 Fall 2008 Final Exam Name: SID: Statistics 135 Fall 2008 Final Exam Show your work. The number of points each question is worth is shown at the beginning of the question. There are 10 problems. 1. [2] The normal equations

More information

Robustness and Distribution Assumptions

Robustness and Distribution Assumptions Chapter 1 Robustness and Distribution Assumptions 1.1 Introduction In statistics, one often works with model assumptions, i.e., one assumes that data follow a certain model. Then one makes use of methodology

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

Psychology 282 Lecture #4 Outline Inferences in SLR

Psychology 282 Lecture #4 Outline Inferences in SLR Psychology 282 Lecture #4 Outline Inferences in SLR Assumptions To this point we have not had to make any distributional assumptions. Principle of least squares requires no assumptions. Can use correlations

More information

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology Singapore University of Design and Technology Lecture 9: Hypothesis testing, uniformly most powerful tests. The Neyman-Pearson framework Let P be the family of distributions of concern. The Neyman-Pearson

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

An interval estimator of a parameter θ is of the form θl < θ < θu at a

An interval estimator of a parameter θ is of the form θl < θ < θu at a Chapter 7 of Devore CONFIDENCE INTERVAL ESTIMATORS An interval estimator of a parameter θ is of the form θl < θ < θu at a confidence pr (or a confidence coefficient) of 1 α. When θl =, < θ < θu is called

More information

The Components of a Statistical Hypothesis Testing Problem

The Components of a Statistical Hypothesis Testing Problem Statistical Inference: Recall from chapter 5 that statistical inference is the use of a subset of a population (the sample) to draw conclusions about the entire population. In chapter 5 we studied one

More information

Non-parametric Inference and Resampling

Non-parametric Inference and Resampling Non-parametric Inference and Resampling Exercises by David Wozabal (Last update. Juni 010) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend surfing

More information

TUTORIAL 8 SOLUTIONS #

TUTORIAL 8 SOLUTIONS # TUTORIAL 8 SOLUTIONS #9.11.21 Suppose that a single observation X is taken from a uniform density on [0,θ], and consider testing H 0 : θ = 1 versus H 1 : θ =2. (a) Find a test that has significance level

More information

Ch. 1: Data and Distributions

Ch. 1: Data and Distributions Ch. 1: Data and Distributions Populations vs. Samples How to graphically display data Histograms, dot plots, stem plots, etc Helps to show how samples are distributed Distributions of both continuous and

More information

Cherry Blossom run (1) The credit union Cherry Blossom Run is a 10 mile race that takes place every year in D.C. In 2009 there were participants

Cherry Blossom run (1) The credit union Cherry Blossom Run is a 10 mile race that takes place every year in D.C. In 2009 there were participants 18.650 Statistics for Applications Chapter 5: Parametric hypothesis testing 1/37 Cherry Blossom run (1) The credit union Cherry Blossom Run is a 10 mile race that takes place every year in D.C. In 2009

More information

New Bayesian methods for model comparison

New Bayesian methods for model comparison Back to the future New Bayesian methods for model comparison Murray Aitkin murray.aitkin@unimelb.edu.au Department of Mathematics and Statistics The University of Melbourne Australia Bayesian Model Comparison

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

Estimation of Quantiles

Estimation of Quantiles 9 Estimation of Quantiles The notion of quantiles was introduced in Section 3.2: recall that a quantile x α for an r.v. X is a constant such that P(X x α )=1 α. (9.1) In this chapter we examine quantiles

More information

F79SM STATISTICAL METHODS

F79SM STATISTICAL METHODS F79SM STATISTICAL METHODS SUMMARY NOTES 9 Hypothesis testing 9.1 Introduction As before we have a random sample x of size n of a population r.v. X with pdf/pf f(x;θ). The distribution we assign to X is

More information