An Informal Introduction to Statistics in 2h. Tim Kraska

1 An Informal Introduction to Statistics in 2h Tim Kraska

2 Goal of this Lecture: This is not a replacement for a proper introduction to probability and statistics. Instead, it only tries to convey the very basic intuition behind some of the ideas. The risk of this lecture: half-knowledge can be dangerous. Most slides are based on CS 155 (big thanks to Eli).

3 The Very Basics

4 Statistics vs. Probability. Probability: the mathematical theory that describes uncertainty. Statistics: a set of techniques for extracting useful information from data.

5 Probability Space

6 Probability Function

7 Tossing a (Fair) Coin: $\Omega = \{H, T\}$; $\mathcal{F} = 2^{\Omega}$, with $2^2 = 4$ events: $\mathcal{F} = \{\{\}, \{H\}, \{T\}, \{H, T\}\}$; $\Pr(\{\}) = 0$, $\Pr(\{H\}) = 0.5$, $\Pr(\{T\}) = 0.5$, $\Pr(\{H, T\}) = 1$.

8 Rolling a Die: $\Omega = \{1, 2, 3, 4, 5, 6\}$; $\mathcal{F} = 2^{\Omega}$, with $2^6$ events; $\Pr(\{\}) = 0$; $\Pr(\{1\}) = \Pr(\{2\}) = \Pr(\{3\}) = \Pr(\{4\}) = \Pr(\{5\}) = \Pr(\{6\}) = \frac{1}{6}$; $\Pr(\{1,2\}) = \Pr(\{1,3\}) = \Pr(\{1,4\}) = \Pr(\{1,5\}) = \Pr(\{1,6\}) = \frac{2}{6}$; ...
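
The die example can be checked by enumerating the event space directly. The following is a minimal sketch, assuming all outcomes are equally likely; the helper names `power_set` and `pr` are illustrative and do not come from the slides.

```python
from fractions import Fraction
from itertools import chain, combinations


def power_set(omega):
    """Return every subset of omega as a frozenset (the event space F = 2^Omega)."""
    items = sorted(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


def pr(event, omega):
    """Probability of an event when every outcome in omega is equally likely."""
    return Fraction(len(event & omega), len(omega))


omega = frozenset({1, 2, 3, 4, 5, 6})  # sample space of one die
events = power_set(omega)              # all 2^6 events

print(len(events))                     # number of events in F
print(pr(frozenset(), omega))          # the impossible event
print(pr(frozenset({1}), omega))       # a single face
print(pr(frozenset({1, 2}), omega))    # a two-face event
print(pr(omega, omega))                # the certain event
```

Using `Fraction` keeps the probabilities exact, so they can be compared with the values on the slide without rounding.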

9 Independent Events

10 Tossing a (Fair) Coin Twice: $\Omega = \{HH, HT, TH, TT\}$; $\mathcal{F} = 2^{\Omega}$, with $2^4$ events; $\Pr(\{\}) = 0$; $\Pr(\{HH\}) = \Pr(H)\Pr(H) = 0.5 \cdot 0.5 = 0.25$; $\Pr(\{HT\}) = \Pr(\{TH\}) = \Pr(\{TT\}) = 0.25$; $\Pr(\{HT, TT\}) = \Pr(\{HH, TH\}) = 0.5$; $\Pr(\{HH, HT\}) = \Pr(\{TH, TT\}) = 0.5$; ...
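
Independence of the two tosses can be verified by enumeration: two events A and B are independent when Pr(A and B) = Pr(A) Pr(B). The sketch below assumes a fair coin and equally likely outcomes; the event names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Sample space of two tosses: HH, HT, TH, TT
omega = [a + b for a, b in product("HT", repeat=2)]


def pr(event):
    """Probability of an event (a set of outcomes) under equally likely outcomes."""
    return Fraction(len(set(event) & set(omega)), len(omega))


first_heads = {o for o in omega if o[0] == "H"}      # {HH, HT}
second_heads = {o for o in omega if o[1] == "H"}     # {HH, TH}
at_least_one_heads = {o for o in omega if "H" in o}  # {HH, HT, TH}

# Independent: the first toss says nothing about the second.
both = first_heads & second_heads
independent = pr(both) == pr(first_heads) * pr(second_heads)

# Not independent: "first toss is heads" implies "at least one heads".
overlap = first_heads & at_least_one_heads
dependent = pr(overlap) != pr(first_heads) * pr(at_least_one_heads)

print(pr(both), independent, dependent)
```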

11 Conditional Probability

12 Computing Conditional Probabilities

13 Example - a posteriori probability

15 Law of Total Probability

16 In-Class Exercises: 1. A fair coin was tossed 10 times and landed heads every time. What is the probability that it will land tails next? 2. Stan has two kids. One of his kids is a boy. What is the probability that the other one is also a boy?

17 Bayesian Statistics

18 Bayes Law

19 Bayes Theorem: $P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}$. Likelihood $P(D \mid H)$: the probability of collecting this data when our hypothesis is true. Prior $P(H)$: the probability of the hypothesis being true before collecting data. Posterior $P(H \mid D)$: the probability of our hypothesis being true given the data collected. Marginal $P(D)$: the probability of collecting this data under all possible hypotheses.

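20 Deriving Bayes Law: $P(A \cap B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$, hence $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$. (Figure: Venn diagram of A, B, their complements ~A and ~B, and the intersection $A \cap B$.)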

21 Application: Finding a Biased Coin

23 Class Example: Drug Test. 0.4% of the Rhode Island population uses marijuana*. Drug test: the test produces 99% true positive results for drug users and 99% true negative results for non-drug users. If a randomly selected individual tests positive, what is the probability that he or she is a user?

24 Class Example: Drug Test. 0.4% of the Rhode Island population uses marijuana*. Drug test: the test produces 99% true positive results for drug users and 99% true negative results for non-drug users. If a randomly selected individual tests positive, what is the probability that he or she is a user? $P(\text{User} \mid +) = \frac{P(+ \mid \text{User})\, P(\text{User})}{P(+)} = \frac{P(+ \mid \text{User})\, P(\text{User})}{P(+ \mid \text{User})\, P(\text{User}) + P(+ \mid \neg\text{User})\, P(\neg\text{User})} = \frac{0.99 \times 0.004}{0.99 \times 0.004 + 0.01 \times 0.996} \approx 28.4\%$
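
The drug-test calculation is a direct application of Bayes' law with the law of total probability in the denominator. A minimal sketch, where the function name `posterior` and its parameter names are illustrative choices rather than anything from the slides:

```python
def posterior(prior, sensitivity, specificity):
    """P(user | positive test) via Bayes' law.

    prior        -- P(user)
    sensitivity  -- P(positive | user), the true positive rate
    specificity  -- P(negative | non-user), the true negative rate
    """
    true_pos = sensitivity * prior                    # P(+ | user) P(user)
    false_pos = (1.0 - specificity) * (1.0 - prior)   # P(+ | non-user) P(non-user)
    return true_pos / (true_pos + false_pos)


p_user_given_positive = posterior(prior=0.004, sensitivity=0.99, specificity=0.99)
print(f"{p_user_given_positive:.1%}")  # prints 28.4%
```

Even with a 99% accurate test, most positives come from the much larger group of non-users, which is why the posterior stays well below 50%.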

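25 Spam Filtering with Naïve Bayes: $P(\text{spam} \mid \text{words}) = \frac{P(\text{spam})\, P(\text{words} \mid \text{spam})}{P(\text{words})}$; $P(\text{spam} \mid \text{viagra}, \text{rich}, \ldots, \text{friend}) = \frac{P(\text{spam})\, P(\text{viagra}, \text{rich}, \ldots, \text{friend} \mid \text{spam})}{P(\text{viagra}, \text{rich}, \ldots, \text{friend})}$; $P(\text{spam} \mid \text{words}) \approx \frac{P(\text{spam})\, P(\text{viagra} \mid \text{spam})\, P(\text{rich} \mid \text{spam}) \cdots P(\text{friend} \mid \text{spam})}{P(\text{viagra}, \text{rich}, \ldots, \text{friend})}$ (9/12/13, Bill Howe, UW)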
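
The naive factorisation can be sketched in a few lines. All probabilities below are hypothetical numbers chosen for illustration; they are not from the slides. The sketch also assumes only two classes (spam and non-spam, called "ham" here), so that P(words) can be obtained by summing over both.

```python
import math

# Hypothetical numbers for illustration only; they are not from the slides.
P_SPAM = 0.4
P_WORD_GIVEN_SPAM = {"viagra": 0.30, "rich": 0.20, "friend": 0.05}
P_WORD_GIVEN_HAM = {"viagra": 0.001, "rich": 0.02, "friend": 0.20}


def spam_posterior(words):
    """P(spam | words), treating words as independent given the class."""
    # Work in log space so that long messages do not underflow.
    log_spam = math.log(P_SPAM)
    log_ham = math.log(1.0 - P_SPAM)
    for word in words:
        log_spam += math.log(P_WORD_GIVEN_SPAM[word])
        log_ham += math.log(P_WORD_GIVEN_HAM[word])
    # Normalise: P(words) = P(words, spam) + P(words, ham).
    shift = max(log_spam, log_ham)
    spam = math.exp(log_spam - shift)
    ham = math.exp(log_ham - shift)
    return spam / (spam + ham)


print(spam_posterior(["viagra", "rich"]))
print(spam_posterior(["friend"]))
```

Words missing from the tables would raise a `KeyError`; a practical filter would need smoothing for unseen words, which is outside the scope of this sketch.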

26 Bayesian Inference: $P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$; $P(\Theta \mid E, \alpha) = \frac{P(E \mid \Theta, \alpha)\, P(\Theta \mid \alpha)}{P(E \mid \alpha)}$. $H$: hypothesis. $P(H)$: prior probability. $P(H \mid E)$: posterior probability. $P(E \mid H)$: probability of observing E given H (likelihood). $P(E)$: model evidence (marginal likelihood).

27 Random Variables

28 How to Model a Simple Game: either I get $5 from you, or you get $10 from me.

29 Random Variables

30 Independence

31 Expectation µ

32 Linearity of Expectation

33 How to Model a Simple Game: either I get $5 from you, or you get $10 from me. Would you play this game?

34 Variance

35 Variance

36 So far we knew the distribution. What if we do not?

37 Red/Blue/Green Lottery Population N

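38 Empirical Probability, Population N: $f_i = \frac{n_i}{N} = \frac{n_i}{\sum_i n_i}$; $f_{\text{green}} = \frac{4}{20}$, $f_{\text{red}} = \frac{6}{20}$, and $f_{\text{blue}} = 1 - f_{\text{green}} - f_{\text{red}} = \frac{10}{20}$.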

39 Population. Mean: $\mu = \frac{\sum_i x_i}{N}$. Variance: $\sigma^2 = \frac{\sum_i (x_i - \mu)^2}{N}$.

40 Red/Blue/Green Lottery Population N Sample n

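41 Population vs. Sample. Population (parameter): mean $\mu = \frac{\sum_i x_i}{N}$, variance $\sigma^2 = \frac{\sum_i (x_i - \mu)^2}{N}$. Sample (statistic, which estimates the parameter): mean $\bar{x} = \frac{\sum_i x_i}{n}$; biased variance estimate $S_N^2 = \frac{\sum_i (x_i - \bar{x})^2}{n}$; unbiased variance estimate $S_{N-1}^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$.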

42 Big Data: How to Calculate the Variance in 1 Pass. $S_{N-1}^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \left( \sum_i x_i^2 - \frac{1}{n} \left( \sum_i x_i \right)^2 \right)$
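
The one-pass formula only needs three running totals: the count, the sum, and the sum of squares. A minimal sketch, with an illustrative function name and made-up data, checked against Python's two-pass `statistics.variance`:

```python
import statistics


def one_pass_variance(values):
    """Unbiased sample variance from a single pass over the data."""
    n = 0
    total = 0.0      # running sum of x_i
    total_sq = 0.0   # running sum of x_i^2
    for x in values:  # single pass; works for streams and generators too
        n += 1
        total += x
        total_sq += x * x
    if n < 2:
        raise ValueError("need at least two observations")
    return (total_sq - total * total / n) / (n - 1)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up example data
print(one_pass_variance(data))
print(statistics.variance(data))  # two-pass reference value
```

One caveat worth keeping in mind: subtracting two large, nearly equal totals can lose precision in floating point when the mean is large relative to the spread, so this form is best suited to well-scaled data.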

43 Law of Large Numbers

44 Law of Large Numbers: Draw independent observations at random from any population with finite mean μ. As the number of observations increases, the sample mean $\bar{x}$ approaches the mean μ of the population. The more variation in the outcomes, the more trials are needed to ensure that $\bar{x}$ is close to μ.
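
The statement above can be illustrated with a small simulation. The sketch below assumes a fair six-sided die (population mean 3.5) and uses a fixed seed so that runs are repeatable; the function name is illustrative.

```python
import random


def running_means(n_rolls, seed=0):
    """Sample mean after each of n_rolls rolls of a fair six-sided die."""
    rng = random.Random(seed)
    total = 0
    means = []
    for i in range(1, n_rolls + 1):
        total += rng.randint(1, 6)
        means.append(total / i)
    return means


means = running_means(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    # The sample mean should drift toward the population mean 3.5.
    print(n, round(means[n - 1], 4))
```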

45 Weak law of large numbers: $\bar{X}_n \xrightarrow{P} \mu$ as $n \to \infty$. Strong law of large numbers: $\Pr\left( \lim_{n \to \infty} \bar{X}_n = \mu \right) = 1$.

46 Central Limit Theorem

47 Law of Large Numbers (Coin)

48 Convolution: Tossing 2 Dice. Die 1: $X$. Die 2: $Y$. Dice 1+2: $Z = X + Y$, with $P(Z = z) = \sum_{k = -\infty}^{\infty} P(X = k)\, P(Y = z - k)$.
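
The convolution sum can be computed directly for two fair dice. A minimal sketch, with illustrative names, representing each distribution as a dictionary from value to probability:

```python
from fractions import Fraction


def convolve(pmf_x, pmf_y):
    """Distribution of Z = X + Y for independent X and Y.

    Adds P(X = x) P(Y = y) to the entry for z = x + y, which is the same
    as summing P(X = k) P(Y = z - k) over k for every z.
    """
    pmf_z = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            pmf_z[x + y] = pmf_z.get(x + y, Fraction(0)) + px * py
    return pmf_z


die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve(die, die)

for z in sorted(two_dice):
    print(z, two_dice[z])
```

Calling `convolve(two_dice, die)` gives the distribution of the sum of three dice, and so on for larger sums.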

49-66 (Figures: distribution of $X_1$ for a single die, of $S_2$ for the sum of 2 dice, and of the sum $S$ for increasing numbers of summands; the sequence is shown for three different starting distributions of $X$.)

67 Normal Distribution $\mathcal{N}(\mu, \sigma^2)$. Probability density function: $f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-(x - \mu)^2 / (2\sigma^2)}$. (Figures: probability density function (PDF) and cumulative distribution function (CDF).)

68 The Central Limit Theorem: 1. The distribution of means will be approximately a normal distribution for larger sample sizes. 2. The mean of the distribution of means approaches the population mean, μ, for large sample sizes. 3. The standard deviation of the distribution of means approaches $\sigma / \sqrt{n}$ for large sample sizes, where σ is the standard deviation of the population and n is the sample size.
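
Points 2 and 3 can be checked empirically. The sketch below assumes a fair die as the population, a sample size of 50, and 5,000 repeated samples; these numbers and the function name are arbitrary choices for illustration.

```python
import random
import statistics


def sample_means(sample_size, n_samples, seed=0):
    """Means of n_samples independent samples of fair die rolls."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.randint(1, 6) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


faces = range(1, 7)
mu = statistics.fmean(faces)       # population mean
sigma = statistics.pstdev(faces)   # population standard deviation

n = 50
means = sample_means(sample_size=n, n_samples=5_000)

print("mean of sample means:", statistics.fmean(means), "vs mu:", mu)
print("std of sample means: ", statistics.stdev(means), "vs sigma/sqrt(n):", sigma / n ** 0.5)
```

A histogram of `means` would show the bell shape from point 1, even though a single die roll is uniformly distributed.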

69 The Central Limit Theorem, Side Notes: 1. For practical purposes, the distribution of means will be nearly normal if the sample size is larger than … 2. If the original population is normally distributed, then the sample means will remain normally distributed for any sample size n, and their distribution will become narrower as n increases. 3. The original variable can have any distribution; it does not have to be a normal distribution.

70 Shapes of Distributions as Sample Size Increases

71 Testing

72 Hypothesis Testing: The FDA or science needs to decide on a new theory, drug, or treatment. $H_0$, the null hypothesis: the current theory, drug, or treatment is as good or better. $H_1$, the alternative hypothesis: the new theory, drug, or treatment should replace the old one. Researchers do not know which hypothesis is true. They must make a decision on the basis of the evidence presented.

73 What is a Hypothesis? A hypothesis is a claim (assumption) about a population parameter. Population mean, example: the mean monthly cell phone bill of this city is μ = $42. Population proportion, example: the proportion of adults in this city with cell phones is p = 0.68. (Statistics for Business and Economics, 6e, 2007 Pearson Education, Inc., Chap 10-73)

74 The Null Hypothesis, $H_0$. Example: $H_0: \mu = 3$. The null hypothesis is a statement about a population parameter ($H_0: \mu = 3$), not about a sample statistic ($H_0: \bar{X} = 3$).

75 Hypothesis Testing Process: Claim: the population mean age is 50 (null hypothesis $H_0: \mu = 50$). Now select a random sample from the population. Suppose the sample mean age is 20: $\bar{X} = 20$. Is $\bar{X} = 20$ likely if μ = 50? If not likely, reject the null hypothesis.

76 Reason for Rejecting $H_0$

77 Outcomes and Probabilities: Possible Hypothesis Test Outcomes. Key: Outcome (Probability)
Decision            Actual situation: H0 true    Actual situation: H0 false
Do not reject H0    No error (1 - α)             Type II error (β)
Reject H0           Type I error (α)             No error (1 - β)

78 Level of Significance and the Rejection Region. Level of significance = α. Two-tail test: $H_0: \mu = 3$, $H_1: \mu \neq 3$; rejection region of α/2 in each tail. Upper-tail test: $H_0: \mu \le 3$, $H_1: \mu > 3$; rejection region of α in the upper tail. Lower-tail test: $H_0: \mu \ge 3$, $H_1: \mu < 3$; rejection region of α in the lower tail. (Figure: the rejection regions are shaded and their boundaries represent the critical values.)

79 p-value Approach to Testing. p-value: the probability of obtaining a test statistic more extreme (≤ or ≥) than the observed sample value, given that $H_0$ is true. It is also called the observed level of significance, and it is the smallest value of α for which $H_0$ can be rejected.
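
As an illustration of the definition, the sketch below computes a two-sided p-value for a test of a population mean. It assumes a normally distributed test statistic with known population standard deviation (a z-test), which is one possible setting and not necessarily the one used in the lecture; the sample numbers are hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(sample_mean, mu0, sigma, n):
    """P(test statistic at least as extreme as observed | H0: mu = mu0)."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)  # standardised sample mean
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # both tails


# Hypothetical numbers: H0 says mu = 50, and 100 observations average 52.
p = two_sided_p_value(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)
alpha = 0.05
reject = p < alpha

print(round(p, 4), "reject H0" if reject else "do not reject H0")
```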

80 p-value: Calculate the p-value and compare it to α. If the p-value is smaller than α, reject $H_0$; otherwise, do not reject $H_0$. (Figure: rejection regions in both tails for α = 0.10, with the "do not reject $H_0$" region in between.)

81 Jonah Lehrer, 2010, The New Yorker, "The Truth Wears Off". John Davis, University of Illinois: "Davis has a forthcoming analysis demonstrating that the efficacy of antidepressants has gone down as much as threefold in recent decades." Anders Pape Møller, 1991: female barn swallows were far more likely to mate with male birds that had long, symmetrical feathers. "Between 1992 and 1997, the average effect size shrank by eighty per cent." Jonathan Schooler, 1990: subjects shown a face and asked to describe it were much less likely to recognize the face when shown it later than those who had simply looked at it. The effect became increasingly difficult to measure. Joseph Rhine, 1930s, coiner of the term "extrasensory perception": tested individuals with card-guessing experiments. A few students achieved multiple low-probability streaks. But there was a "decline effect": their performance became worse over time.

82 Reason 1: Publication Bias. "In the last few years, several meta-analyses have reappraised the efficacy and safety of antidepressants and concluded that the therapeutic value of these drugs may have been significantly overestimated." "Although publication bias has been documented in the literature for decades and its origins and consequences debated extensively, there is evidence suggesting that this bias is increasing." "A case in point is the field of biomedical research in autism spectrum disorder (ASD), which suggests that in some areas negative results are completely absent" (emphasis mine). "... a highly significant correlation ($R^2 = 0.13$, p < 0.001) between impact factor and overestimation of effect sizes has been reported." Source: Ridha Joober, Norbert Schmitz, Lawrence Annable, and Patricia Boksa, "Publication bias: What are the challenges and can they be overcome?", J Psychiatry Neurosci May; 37(3): doi: /jpn

83 Publication Bias: decline effect

84 decline effect = publication bias!

85 Background: Effect Size. $\text{Effect size} = \frac{[\text{mean of experimental group}] - [\text{mean of control group}]}{\text{standard deviation}}$. Expressed in relevant units. Not just "significant", but "how significant?" Used prolifically in meta-analysis to combine results from multiple studies. But be careful: averaging results from different experiments can produce nonsense. Caveat: other definitions of effect size exist, such as the odds ratio and the correlation coefficient. Source: Robert Coe, 2002, "It's the Effect Size, Stupid: What effect size is and why it is important", Annual Conference of the British Educational Research Association.
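
The effect-size ratio can be computed in a few lines. The slide leaves the choice of standard deviation open; the sketch below assumes the pooled sample standard deviation of both groups, which is one common choice, and uses made-up data. The function name is illustrative.

```python
import statistics


def standardized_mean_difference(experimental, control):
    """(mean of experimental - mean of control) / pooled standard deviation."""
    n1, n2 = len(experimental), len(control)
    v1 = statistics.variance(experimental)  # unbiased sample variances
    v2 = statistics.variance(control)
    # Assumption: pool the two variances, weighted by degrees of freedom.
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.fmean(experimental) - statistics.fmean(control)) / pooled_sd


# Made-up measurements for illustration only.
experimental = [5.0, 6.0, 7.0, 8.0, 9.0]
control = [3.0, 4.0, 5.0, 6.0, 7.0]

d = standardized_mean_difference(experimental, control)
print(round(d, 4))
```

Because the result is divided by a standard deviation, it is unit-free and can be compared across studies that measure the same outcome on different scales.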

86 Effect Size: Standardized Mean Difference. There are lots of ways to estimate the pooled standard deviation, e.g., Glass, 1976; Hartung et al.

87 Effect Size: Cohen's Heuristic. Standardized mean difference effect size: small = 0.20, medium = 0.50, large = 0.80.

88 Reason 3: Multiple Hypothesis Testing. If you perform experiments over and over, you're bound to find something. This is a bit different from the publication bias problem: same sample, different hypotheses. The significance level must be adjusted down when performing multiple hypothesis tests.

89 Familywise Error Rate, with α = 0.05 and k independent experiments: P(detecting an effect when there is none) = α = 0.05. P(not detecting an effect when there is none) = $1 - \alpha$. P(not detecting an effect when there is none, on every experiment) = $(1 - \alpha)^k$. P(detecting an effect when there is none, on at least one experiment) = $1 - (1 - \alpha)^k$.
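
To see how quickly the familywise error rate grows, the expression 1 - (1 - α)^k can be evaluated for a few values of k. A minimal sketch, assuming the k tests are independent; the function name is illustrative.

```python
def familywise_error_rate(alpha, k):
    """P(at least one false positive in k independent tests at level alpha)."""
    return 1.0 - (1.0 - alpha) ** k


for k in (1, 5, 10, 20, 100):
    print(k, round(familywise_error_rate(0.05, k), 3))
```

With 20 tests at α = 0.05, the chance of at least one false positive is already well above one half.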

90 Familywise Error Rate Corrections. Bonferroni correction: just divide α by the number of hypotheses. Šidák correction: asserts independence.
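
Both corrections replace α with a smaller per-test level. The sketch below uses the standard textbook forms, α/k for Bonferroni and 1 - (1 - α)^(1/k) for Šidák; the formulas are not legible in the slide text, so treat them as an assumption about what the slide showed. Function names are illustrative.

```python
def bonferroni(alpha, k):
    """Per-test significance level: divide alpha by the number of hypotheses."""
    return alpha / k


def sidak(alpha, k):
    """Per-test level that gives a familywise rate of exactly alpha
    when the k tests are independent."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)


alpha, k = 0.05, 10
print("Bonferroni:", bonferroni(alpha, k))
print("Sidak:     ", sidak(alpha, k))

# Familywise error rate when each of k independent tests uses the Sidak level.
print("FWER with Sidak:", 1.0 - (1.0 - sidak(alpha, k)) ** k)
```

Bonferroni is slightly stricter than Šidák but does not rely on the tests being independent.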

91 Summary: Stochastic Variables; Basics in Statistics; Bayes Law; Central Limit Theorem; Law of Large Numbers; Testing
