Introduction to Statistical Analysis. Cancer Research UK 12 th of February 2018 D.-L. Couturier / M. Eldridge / M. Fernandes [Bioinformatics core]

Size: px

Start display at page:

Download "Introduction to Statistical Analysis. Cancer Research UK 12 th of February 2018 D.-L. Couturier / M. Eldridge / M. Fernandes [Bioinformatics core]"

August Fowler
5 years ago
Views:

1 Introduction to Statistical Analysis Cancer Research UK 12 th of February 2018 D.-L. Couturier / M. Eldridge / M. Fernandes [Bioinformatics core]

2 2 Timeline 9:30 Morning I I 45mn Lecture: data type, summary statistics and graphical displays 15mn Quiz 10:30 15mn Co ee & Tea break I 60mn Lecture: some statistical distributions + CLT I 15mn Exercises & discussion 12:00 Lunch break 13:00 Afternoon I 45mn Lecture: Parametric tests for the mean I 30mn Exercises with shiny apps & discussion I I 30mn Lecture: Non-parametric tests for the mean 15mn Exercises & discussion 15:00 15mn Co ee & Tea break I 15mn Lecture: Tests for categorical variables I 15mn Exercises & discussion 16:00 Group based exercises I 60mn

3 Grand Picture of Statistics Population Sample Data (x 1,x 2,...,x n ) Parameters µ 2 Statistics bµ b 2 b 3

4 4 Data Types x 1 x 2 x 3 x n Cancer status C C C C Nucleic acid sequence C T T A 5-level pain score #ofdailyadmissionsata&e Gene expression intensity

5 5 Summary statistics and plots for qualitative data 5-level answers of 21 patients to the question How much did pain due to your ureteric stones interfere with your day to day activities? : 3, 1, 5, 3, 1, 1, 1, 5, 1, 3, 4, 1, 1, 4, 5, 5, 5, 5, 5, 4, 4, where I 1= Notatall, I 2= Alittlebit, I 3 = Somewhat, I 4= Quiteabit, I 5= Verymuch.

6 6 Summary statistics and plots for quantative data Gene expression values of gene CCND3 Cyclin D3 from 27 patients diagnosed with acute lymphoblastic leukaemia: x (1) x (2) x (3) x (4) x (5) x (6) x (7) x (8) x (9) x (10) x (11) x (12) x (13) x (14) x (15) x (16) x (17) x (18) x (19) x (20) x (21) x (22) x (23) x (24) x (25) x (26) x (27)

7 6 Summary statistics and plots for quantative data Gene expression values of gene CCND3 Cyclin D3 from 27 patients diagnosed with acute lymphoblastic leukaemia: x (1) x (2) x (3) x (4) x (5) x (6) x (7) x (8) x (9) x (10) x (11) x (12) x (13) x (14) x (15) x (16) x (17) x (18) x (19) x (20) x (21) x (22) x (23) x (24) x (25) x (26) x (27)

8 6 Summary statistics and plots for quantative data Gene expression values of gene CCND3 Cyclin D3 from 27 patients diagnosed with acute lymphoblastic leukaemia: x (1) x (2) x (3) x (4) x (5) x (6) x (7) x (8) x (9) x (10) x (11) x (12) x (13) x (14) x (15) x (16) x (17) x (18) x (19) x (20) x (21) x (22) x (23) x (24) x (25) x (26) x (27)

9 7 Two-sample case: independent versus paired samples Permeability constants of a placental membrane at term (X) and between 12 to 26 weeks gestational age (Y) X Y Hamilton depression scale factor measurements in 9 patients with mixed anxiety and depression, taken at the first (X) and second (Y) visit after initiation of a therapy (administration of a tranquilizer) X Y Y X

10 8 Quiz Time Sections 1 to 4 -ISfSCGp_EIVPPI_mnrJHttaKxln8vVoyjJFvS8BL1w/viewform

11 9 Some parametric distributions: Bernoulli distribution x 1 x 2 x 3 x n Cancer status C C C C If I n independent experiments, I outcome of each experiment is dichotomous (success/failure), I the probability of success is the same for all experiments, then, each dichotomous experiment, X i, follows a Bernoulli distribution with parameter : X i Bernoulli( ) P (X i =1)= P (X i =0)=1

12 Some parametric distributions: Binomial distribution 10 If I n independent experiments, I outcome of each experiment is dichotomous (success/failure), I the probability of success is the same for all experiments, then, I the number of successes out of n trials (experiments), Y = P n i=1 Xi, follows a binomial distribution with parameters n and : Y Bin(n, ), I the probability of observing exactly y successes out of n experiments, is given by P (Y = y n, ) = n! (n y)!y! y (1 ) n y.

13 Some parametric distributions: Binomial distribution If I n independent experiments, I outcome of each experiment is dichotomous (success/failure), I the probability of success is the same for all experiments, then, I the number of successes out of n trials (experiments), Y = P n i=1 Xi, follows a binomial distribution with parameters n and : 0.25 Y Bin(n, ), 0.20 probability P (Y = y n, ) = n! (n y)!y! y (1 ) n y Number of successes out of 50 experiments

14 Some parametric distributions: Poisson distribution 11 If, during a time interval or in a given area, I events occur independently, I at the same rate, I and the probability of an event to occur in a small interval (area) is proportional to the length of the interval (size of the area), then, I the number of events occurring in a fixed time interval or in a given area, X, may be modelled by means of a Poisson distribution with parameter : X Poisson( ), I the probability of observing x during a fixed time interval or in a given area is given by P (X = x )= x e x!.

15 Some parametric distributions: Poisson distribution If, during a time interval or in a given area, I events occur independently, I at the same rate, I and the probability of an event to occur in a small interval (area) is proportional to the length of the interval (size of the area), then, I the number of events occurring in a fixed time interval or in a given area, X, may be modelled by means of a Poisson distribution with parameter : X Poisson( ), 0.25 Probability P (X = x )= x e x! Number of chronic conditions per patient (US National Medical Expenditure Survey)

16 Some parametric distributions: Continuous distrib. 12 Density ex Gaussian skew t Gamma Inverse Gamma Gaussian

17 Some parametric distributions: Normal distribution 13 X N(µ, 2 ), f X (x) = 1 p 2 2 (x µ)2 e 2 2 E[X] =µ, Var[X] = 2, Z = X µ N(0, 1), f Z (z) = 1 p 2 e x2 2. Probability density function, f Z (z), of a standard normal: Density

18 Some parametric distributions: Normal distribution 13 X N(µ, 2 ), f X (x) = 1 p 2 2 (x µ)2 e 2 2 E[X] =µ, Var[X] = 2, Z = X µ N(0, 1), f Z (z) = 1 p 2 e x2 2. Probability density function, f Z (z), of a standard normal: µ 3 µ 2 µ µ µ + µ +2 µ % 95.45% 99.73%

19 Some parametric distributions: Normal distribution 13 X N(µ, 2 ), f X (x) = 1 p 2 2 (x µ)2 e 2 2 E[X] =µ, Var[X] = 2, Z = X µ N(0, 1), f Z (z) = 1 p 2 e x2 2. (i) Suitable modelling for a lot of variables

20 Some parametric distributions: Normal distribution 13 X N(µ, 2 ), f X (x) = 1 p 2 2 (x µ)2 e 2 2 E[X] =µ, Var[X] = 2, Z = X µ N(0, 1), f Z (z) = 1 p 2 e x2 2. (i) Suitable modelling for a lot of variables

21 Some parametric distributions: Normal distribution 13 X N(µ, 2 ), f X (x) = 1 p 2 2 (x µ)2 e 2 2 E[X] =µ, Var[X] = 2, Z = X µ N(0, 1), f Z (z) = 1 p 2 e x2 2. (i) Suitable modelling for a lot of variables: IQ % 95.45% 99.73%

22 Some parametric distributions: Normal distribution 13 X N(µ, 2 ), f X (x) = 1 p 2 2 E[X] =µ, Var[X] = 2, (x µ)2 e 2 2 Z = X µ N(0, 1), f Z (z) = 1 p 2 e x2 2. (ii) Central limit theorem (Lindeberg-Lévy CLT). Let (X 1,...,X n ) be n independent and identically distributed (iid) random variables drawn from distributions of expected values given by µ and finite variances given by 2,. then If X i N(µ, P n i=1 bµ = X = X i n d! N µ, 2 n. 2 ),thisresultistrueforallsamplesizes.

23 14 Central limit theorem shiny app: Distribution of the mean

24 95% Confidence interval for µ, the population mean, when X i N(µ, 2 ) 15 I if X N(µ, I if X N(µ, 2 ),thenx N µ, 2 n, 2 ),thenz = X µ N(0, 1), P < <! =0.95

25 95% Confidence interval for µ, the population mean, when X i N(µ, 2 ) I if X N(µ, 2 ),thenx N µ, 2 n, I if X N(µ, 2 ),thenz = X µ N(0, 1), I if unknown, then T = X µ St s n 1. P < <! =0.95 Density X N(0, 1) T St30 T St10 T St2 ft (t n 1) = ( n 2 2 ) ( n 1 2 )[(n 1) ] 1/2 1+ t2 n 1 n % % % %

26 95% Confidence interval for µ, the population mean, when X i N(µ, 2 ) I if X N(µ, 2 ),thenx N µ, 2 n, I if X N(µ, 2 ),thenz = X µ N(0, 1), I if unknown, then T = X µ St s n 1. P < <! =

27 95% Confidence interval for µ, the population mean, when X i iid(µ, 2 ) I CLT: X d! N µ, 2 n, I if X N(µ, 2 ),thenz = X µ N(0, 1), I if unknown, then T = X µ St s n 1. P < <! = Probability Number of chronic conditions per patient (US National Medical Expenditure Survey: n = 4406, bµ = 1.55)

28 95% Confidence interval for µ Y µ X,thedi erence between population means 17 If we have I X i iid(µ X, 2 X ), i =1,...,n X, I Y i iid(µ Y, 2 Y ), i =1,...,n Y, then I if 2 X = Y 2 [Student s t-test equation], q. CI (µ Y µ X, 0.95) = (Y X) ± t 1 2,n X +n Y 2s p I if 2 X 6= 2 Y where s p = (n X 1)s 2 X +(n Y 1)s 2 Y n X +n Y 2, [Welch-Satterthwaite s t-test equation],. CI (µ Y µ X, 0.95) = (Y X) ± t 1 df = s 2 2 X n + s2 Y X ny s 2 2 X n X n X 1 + s 2 2 Y n Y n Y 1. q s 2 2,df X nx 1 n X + 1 n Y + s2 Y n Y,where

29 18 Central limit theorem shiny app: Coverage of Student s asymptotic confidence intervals

30 Quiz Time Practical IntroductionToStats/practical.html

31 PART II: Parametric tests Cancer Research UK 24 th of April 2017 D.-L. Couturier / M. Dunning / M. Eldridge [Bioinformatics core]

32 Grand Picture of Statistics Population Sample Data (x 1,x 2,...,x n ) Parameters µ 2 Statistics bµ b 2 b 21

33 Statistical hypothesis testing 22 A hypothesis test describes a phenomenon by means of two non-overlapping idealised models/descriptions: I the null hypothesis H0, generally assumed to be true until evidence indicates otherwise I the alternative hypothesis H1. The aim of the test is to reject the null hypothesis in favour of the alternative hypothesis, and conclude, with a probability of being wrong, that the idealised model/description of H1 is true. Theory 1: Dieters lose more fat than the exercisers Theory 2: There is no majority for Brexit now Theory 3: Serum vitamin C is reduced in patients

34 Statistical hypothesis testing 23 Several-step process: I Define H0 and H1 according to a theory I Set, the probability of rejecting H0 when it is true (type I error), I Define n, the sample size, allowing you to reject H0 when H1 is true with a probability 1 (Power), I Determine the test statistic to be used, I Collect the data, I Perform the statistical test, define the p-value, and reject (or not) the null hypothesis.

35 24 Statistical hypothesis testing Example: One-sample two-sided t-test We test: H0: µ IQ =100, H1: µ IQ > We have X i N(µ, We know I X N µ, 2 ),i=1,...,n, 2 n, I Z = X µ p n N(0, 1), Thus, if H0 is true, we have: I Z = X µ 0 p n N(0, 1). Define the p-value: I p value = P ( T >T obs ) Density % 95.45% 99.73%

36 Statistical hypothesis testing 4possibleoutcomes 25 Conclude: I if p-value >! do not reject H0. I if p-value <! reject H0 in favour of H1. Unknown Truth Test Outcome H0 not rejected H1 accepted H0 true 1 H1 true 1 where I is the type I error, I is the type II error.

37 Statistical hypothesis testing Example: One-sided binomial exact test We test: H0: =5%, H1: >5%. We have X i Bernoulli( ),i=1,...,n, We know I Y = P n i=1 Xi Binomial (, n), Thus, if H0 is true, we have: I Y = P n i=1 Xi Binomial (5%,n), Define the p-value: I p value = P (Y > Y obs ) Probability n = 25 n =

38 Two-sample two-sided Student-s & Welch s t-tests Acute myeloid leukemia (AML) n=11 Acute lymphoblastic leukemia (ALL) n=27 Intensity expression of gene 'CCND3 Cyclin D3' We test H0: µ Y µ X =0 against H1: µ Y µ X 6=0. We know: I Student s t-test [assume 2 X = Y 2 ]: (Y X) (µ Y µ X ) q s 1 t 1 p + n 1 X n Y 2,n X +n Y 2 I Welch s t-test [assume 2 X 6= Y 2 ]: (Y X) (µ Y µ X ) r t s 2 1 X + s2 2,df Y n X ny Two Sample t-test 27 data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"] t = , df = 36, p-value = 6.046e-08 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: sample estimates: mean of x mean of y

39 Two-sample two-sided Student-s & Welch s t-tests Acute myeloid leukemia (AML) n=11 Acute lymphoblastic leukemia (ALL) n=27 Intensity expression of gene 'CCND3 Cyclin D3' We test H0: µ Y µ X =0 against H1: µ Y µ X 6=0. We know: I Student s t-test [assume 2 X = Y 2 ]: (Y X) (µ Y µ X ) q s 1 t 1 p + n 1 X n Y 2,n X +n Y 2 I Welch s t-test [assume 2 X 6= Y 2 ]: (Y X) (µ Y µ X ) r t s 2 1 X + s2 2,df Y n X ny Welch Two Sample t-test 27 data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"] t = , df = , p-value = 9.871e-06 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: sample estimates: mean of x mean of y

40 F-test of equality of variances 28 Acute myeloid leukemia (AML) n=11 Acute lymphoblastic leukemia (ALL) n=27 Intensity expression of gene 'CCND3 Cyclin D3' We test H0: 2 Y = 2 X against H1: 2 Y 6= 2 X. We know: I F-test [assume X i N(µ X, X) and Y i N(µ Y, Y )]: s 2 Y s 2 X F ny 1,n X 1 F test to compare two variances data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"] F = , num df = 26, denom df = 10, p-value = alternative hypothesis: true ratio of variances is not equal to 1 95 percent confidence interval: sample estimates: ratio of variances

41 Multiplicity correction For each test, the probability of rejecting H0 (and accept H1) when H0 is true equals. For k tests, the probability of rejecting H0 (and accept H1) at least 1 time when H0 is true, k, is given by k =1 (1 ) k. Thus, for =0.05, I if k =1, 1 =1 (1 ) 1 =0.05, I if k =2, 2 =1 (1 ) 2 =0.0975, I if k = 10, 10 =1 (1 ) 10 = Idea: change the level of each test so that k =0.05: I Bonferroni correction : = k k, I Dunn-Sidak correction: =1 (1 k ) 1/k. 29

42 Introduction to Shiny Apps and Exercises 30

43 PART III: Non-parametric tests Cancer Research UK 24 th of April 2017 D.-L. Couturier / M. Dunning / M. Eldridge [Bioinformatics core]

44 Parametric or non-parametric? 32 T-test Outcome(s) normally distributed Yes Mildly No Small Sample size Medium Large Situations which may suggest the use of non-parametric statistics: I When there is a small sample size or very unequal groups, I When the data has notable outliers, I When one outcome has a distribution other than normal, I When the data are ordered with many ties or are rank ordered.

45 Sign test 33 A location model is assumed for X i, i =1,...,n: where e i iid(µ e =0, 2 e). X i = + e i, Interest for H0: = 0 against H1: < 0 or 6= 0 or > 0. Test statistics: S = P n i=1 (Xi 0 > 0). Distribution of S under H0: Exact binomial test S Binomial(0.5,n). data: 21 and 27 number of successes = 21, number of trials = 27, p-value = alternative hypothesis: true probability of success is not equal to percent confidence interval: sample estimates: probability of success Probability Number of successes out of 27 experiments

46 Wilcoxon sign-rank test 34 A location model is assumed for X i, i =1,...,n: where e i iid(µ e =0, 2 e). X i = + e i, Interest for H0: = 0 against H1: < 0 or 6= 0 or > 0. Test statistics : W + = P n i=1 (Xi 0 > 0) Rank( Xi 0 ). Distribution of W under H0: W + has no closed-form distribution. Wilcoxon signed rank test data: golub[1042, gol.fac == "ALL"] V = 268, p-value = alternative hypothesis: true location is not equal to percent confidence interval: sample estimates: (pseudo)median

47 Mann-Whitney-Wilcoxon test: Shift in location Let I Xi iid(µ X, 2 ), i =1,...,n X, I Y i iid(µ X +, 2 ), i =1,...,n Y. Acute myeloid leukemia (AML) n=11 Acute lymphoblastic leukemia (ALL) n=27 Intensity expression of gene 'CCND3 Cyclin D3' Interest for H0: = 0 against H1: < 0 or 6= 0 or > 0. Standardised test statistic: z = P ny i=1 R(Y i) [n Y (n X +n Y +1)/2] p n X n Y (n X +n Y +1)/ where R(Y i) denotes the rank of Y i amongst the combined samples, i.e., amongst (X 1,...,X nx,y 1,...,Y ny )., Distribution of Z under H0: Z N(0, 1) Implementation 1: statistic = , p-value = e Implementation 2: W = 284, p-value = 6.15e-07 alternative hypothesis: true location shift is not equal to 0 95 percent confidence interval: sample estimates: difference in location Density

48 Non-parametric is not assumption free Shift in location tests when H0 is true 36 Simulate 2500 samples with I X i Uniform(1.5, 2.5), i =1,...,n X, I Y i Uniform(0, 4), i =1,...,n Y, so that E[X i]=e[y i]=2(i.e., same mean, same median). Assume I X i iid(µ X, 2 ), i =1,...,n X, I Y i iid(µ X +, 2 ), i =1,...,n Y. Test H0: = 0 against H1: 6= 0,atthe5%level,bymeansof I Mann-Whitney-Wilcoxon test (MWW), I T-test, I Welch-test. Sample size b Tests MWW Student s t-test Welch s test n X =200,n Y = n X =20,n Y =

49 Exercises 37

50 PART IV: Tests for categorical variables Cancer Research UK 24 th of April 2017 D.-L. Couturier / M. Dunning / M. Eldridge [Bioinformatics core]

51 39 2 goodness-of-fit test A trial to assess the e ectiveness of a new treatment versus a placebo in reducing tumour size in patients with ovarian cancer: Observed frequencies Binary outcome Group Tumour did not shrink Tumour did shrink Treatment (84) Placebo (40) (68) (56) (124) I H0 : No association between treatment group and tumour shrinkage, I H1 : Some association. Expected frequencies under H0 Binary outcome Group Treatment Placebo Tumour did not shrink = = Tumour did shrink = (84) = (40) (68) (56) (124) We have 2 categorical variables with a total of J =4cells (categories). I H0 : j = j0,j =1,...,J, I H1 : j 6= j0,j =1,...,J. JP 2 (O -test: j E j ) 2 E j 2 (J 1). j=1

52 2 goodness-of-fit test A trial to assess the e ectiveness of a new treatment versus a placebo in reducing tumour size in patients with ovarian cancer: Observed frequencies Binary outcome Group Tumour did not shrink Tumour did shrink Treatment (84) Placebo (40) (68) (56) (124) 39 I H0 : No association between treatment group and tumour shrinkage, I H1 : Some association. Expected frequencies under H0 Binary outcome Group Tumour did not shrink Tumour did shrink Treatment (84) Placebo (40) (68) (56) (124) We have 2 categorical variables with a total of J =4cells (categories). I H0 : j = j0,j =1,...,J, I H1 : j 6= j0,j =1,...,J. JP 2 (O -test: j E j ) 2 E j 2 (J 1). j=1 Pearson s Chi-squared test with Yates continuity correction data: M X-squared = , df = 1, p-value =

53 Fisher s exact test of independence 2 goodness-of-fit test not suitable when I n is small I E j < 5 for at least one cell. Observed frequencies Group Binary outcome Tumour did not shrink Tumour did shrink Treatment (84) Placebo (40) (68) (56) (124) Fisher showed that, under H0 (independence), P (observed table H0) =P (X = a) and X Hypergeometric(n, a + c, a + b). To compute the Fisher s test: I Define P (X = a) for all possible tables having the observed marginal counts, I Calculate the p value by defining the percentage of these tables that get a probability equal to or smaller than the one observed. Fisher s Exact Test for Count Data 40 data: M p-value = alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: sample estimates: odds ratio

Analysis of Variance (ANOVA) Cancer Research UK 10 th of May 2018 D.-L. Couturier / R. Nicholls / M. Fernandes

Analysis of Variance (ANOVA) Cancer Research UK 10 th of May 2018 D.-L. Couturier / R. Nicholls / M. Fernandes 2 Quick review: Normal distribution Y N(µ, σ 2 ), f Y (y) = 1 2πσ 2 (y µ)2 e 2σ 2 E[Y ] =