Probability and Statistics
Joyeeta Dutta Moscato
June 30, 2014

Terms and concepts

Descriptive Statistics
Sample vs population
Central tendency: Mean, median, mode
Variance, standard deviation
Normal distribution
Cumulative distribution

Statistical Hypothesis Testing
Hypothesis
Null hypothesis (H0)
Alternate hypothesis (HA)
Significance
P-value
Confidence Interval

Probability: How likely is it?
How likely is a certain observation?

Possible Outcomes
Coin: Head, Tail
P(Head) = ?  P(Tail) = ?
Die: 1, 2, 3, 4, 5, 6
P(1) = ?  P(2) = ? ... P(6) = ?

Probability of Multiple Events
Toss a coin twice. How likely are you to observe 2 Heads?
P(2 Heads) = P(Head) x P(Head)
Key condition: INDEPENDENCE
What is the DISTRIBUTION of outcomes?
P(2 Heads) = ¼
P(2 Tails) = ¼
P(1 Head) = P(1 Head, 1 Tail) + P(1 Tail, 1 Head) = ¼ + ¼ = ½
Key condition: Must sum to 1

[Figure: Histogram of outcomes of 10 tosses]
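
The two-toss distribution can be checked by enumerating every sequence of tosses. The sketch below is only an illustration: the function name `heads_distribution` and the default of a fair coin are assumptions, not part of the slides. It relies on the independence condition to multiply per-toss probabilities.

```python
from fractions import Fraction
from itertools import product


def heads_distribution(n_tosses, p_head=Fraction(1, 2)):
    """Return {number of heads: probability} for n independent tosses.

    Assumes every toss has the same probability of Head (fair coin by default).
    """
    dist = {}
    for outcome in product("HT", repeat=n_tosses):
        # Independence lets us multiply the per-toss probabilities.
        prob = Fraction(1)
        for face in outcome:
            prob *= p_head if face == "H" else 1 - p_head
        heads = outcome.count("H")
        dist[heads] = dist.get(heads, Fraction(0)) + prob
    return dist


two_tosses = heads_distribution(2)
for heads in sorted(two_tosses):
    print(heads, "head(s):", two_tosses[heads])
print("total:", sum(two_tosses.values()))
```

Calling `heads_distribution(10)` gives the probabilities behind a histogram of 10 tosses; the values still sum to 1.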

Normal Distribution
As the number of independent (random) events grows, the distribution of their sum or average approaches a NORMAL or GAUSSIAN distribution.
This property is often used in statistics and science.
http://www.mathsisfun.com/data/standard-normal-distribution.html

Cumulative Distribution
The probability distribution shows the probability of the value X.
The cumulative distribution shows the probability of a value less than or equal to X.
Wikipedia: http://en.wikipedia.org/wiki/cumulative_distribution_function
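
To see both ideas together, the sketch below compares the exact cumulative distribution of the number of heads in 100 fair tosses with the cumulative distribution of a normal curve that has the same mean and standard deviation. The choice of 100 tosses, the function names, and the half-unit continuity correction are assumptions made for this illustration.

```python
import math


def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n independent tosses."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(x, n, p=0.5):
    """Probability of x heads or fewer (the cumulative distribution)."""
    return sum(binomial_pmf(k, n, p) for k in range(0, x + 1))


def normal_cdf(x, mean, sd):
    """Cumulative distribution of a normal with the given mean and sd."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


n, p = 100, 0.5  # illustrative values
mean, sd = n * p, math.sqrt(n * p * (1 - p))
for x in (40, 50, 60):
    exact = binomial_cdf(x, n, p)
    # Adding 0.5 is a continuity correction for approximating a discrete count.
    approx = normal_cdf(x + 0.5, mean, sd)
    print(x, round(exact, 4), round(approx, 4))
```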

Statistical Hypothesis Testing
You are running experiments to test the effect of a drug on subjects.
How likely is it that the effect would be observed even if no real relation exists?
If this likelihood is sufficiently small (e.g. < 1%), the effect is taken as evidence that a real relation exists. Otherwise, any observed effect may simply be due to chance.
H0: Null hypothesis. No relation exists.
HA: Alternate hypothesis. There is some sort of relation.

Statistical Hypothesis Testing
The SIGNIFICANCE LEVEL is decided a priori and determines whether H0 is rejected or not. (E.g. 0.1, 0.05, 0.01)
If P-VALUE < significance level, then H0 is rejected, i.e. the result is considered STATISTICALLY SIGNIFICANT.
Wikipedia: http://en.wikipedia.org/wiki/p-value
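
As a small worked case, take the coin from the earlier slides: H0 says the coin is fair, and we observe 9 heads in 10 tosses. The sketch below computes a two-sided exact binomial p-value. The observed counts and the function name are hypothetical, and "two-sided" here means summing every outcome that is no more likely than the observed one under H0, which is one common convention.

```python
import math


def two_sided_binomial_p_value(heads, n_tosses):
    """P-value for H0: the coin is fair, given `heads` in `n_tosses`."""

    def pmf(k):
        return math.comb(n_tosses, k) / 2**n_tosses

    observed = pmf(heads)
    # Sum the probability of every outcome at least as unlikely as the
    # observed one (small tolerance guards against rounding).
    return sum(
        pmf(k) for k in range(n_tosses + 1) if pmf(k) <= observed * (1 + 1e-12)
    )


p_value = two_sided_binomial_p_value(heads=9, n_tosses=10)
for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"p = {p_value:.4f}, significance level {alpha}: {decision}")
```

The same data can be significant at one level and not at another, which is why the significance level has to be fixed before looking at the result.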

Error reporting
How reliable is the measurement? (How reliable is the estimate?)
E.g. 95% CONFIDENCE INTERVAL: We are 95% confident that the true value is within this interval.
STANDARD ERROR can be used to approximate confidence intervals.
Standard error = Standard deviation of the sampling distribution

Correlation
When we say that two genes are correlated, we mean that they vary together. But how do we quantify the degree of correlation?
Pearson's r measures the extent to which two random variables are linearly related.
A value of 1 indicates a perfect positive correlation (that is, as one variable increases, the other increases proportionally in linear fashion).
A value of -1 indicates a perfect negative correlation.
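
Returning to the standard error from the error-reporting slide: for a sample mean, the standard error is usually estimated as the sample standard deviation divided by the square root of the sample size. The sketch below uses that estimate with the normal-approximation multiplier 1.96 to get a rough 95% confidence interval. The measurements are made up for illustration, and for a sample this small a t-based multiplier would normally be preferred.

```python
import math
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Return (mean, standard error, approximate confidence interval).

    z=1.96 corresponds to 95% under a normal approximation (an assumption).
    """
    mean = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, std_error, (mean - z * std_error, mean + z * std_error)


# Hypothetical repeated measurements of the same quantity.
measurements = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
mean, se, (low, high) = mean_confidence_interval(measurements)
print(f"mean = {mean:.3f}, standard error = {se:.3f}")
print(f"approximate 95% CI: ({low:.3f}, {high:.3f})")
```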

[Figures: scatter plots illustrating Positive Correlations and Negative Correlations]
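
Pearson's r can be computed directly from its definition: the covariance of the two variables divided by the product of their standard deviations. The sketch below is a minimal version; the gene names and expression values are invented for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression levels for four genes across five samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0 * x + 1.0 for x in gene_a]  # rises linearly with gene_a
gene_c = [10.0 - x for x in gene_a]       # falls linearly with gene_a
gene_d = [2.0, 1.0, 4.0, 3.0, 5.0]        # rises with gene_a, but noisily

print("a vs b:", pearson_r(gene_a, gene_b))
print("a vs c:", pearson_r(gene_a, gene_c))
print("a vs d:", pearson_r(gene_a, gene_d))
```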

What do correlations tell us?
Interesting site: http://www.tylervigen.com/
So how do we make statements of causality?
- Can ask the question: How likely is event X given an event Y?

Back to Probability
0 ≤ P(A) ≤ 1
P(A) = 1 - P(A^C)   [A^C = Complement of A]
If events A and B are independent (event B has no effect on the probability of event A), then:
P(A, B) = P(A) P(B)
If they are not independent, then:
P(A, B) = P(A | B) P(B)
P(A, B) = JOINT PROBABILITY of A and B
P(A | B) = CONDITIONAL PROBABILITY of A given B
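
These rules can be checked on the six-sided die from the earlier slide by counting outcomes. The events chosen below (even roll, roll above three, roll of at most two) and all function names are assumptions picked for illustration.

```python
from fractions import Fraction

outcomes = range(1, 7)  # one roll of a fair six-sided die


def prob(event):
    """Probability of an event, given as a predicate over single outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), 6)


def is_even(o):
    return o % 2 == 0


def above_three(o):
    return o > 3


def at_most_two(o):
    return o <= 2


p_a = prob(is_even)
p_b = prob(above_three)
p_joint = prob(lambda o: is_even(o) and above_three(o))  # P(A, B)
p_a_given_b = p_joint / p_b                               # P(A | B)

print("P(A, B) =", p_joint, " P(A) P(B) =", p_a * p_b)
print("P(A | B) =", p_a_given_b)

# An independent pair for comparison: "even" and "at most two".
p_c = prob(at_most_two)
p_joint_ac = prob(lambda o: is_even(o) and at_most_two(o))
print("P(A, C) =", p_joint_ac, " P(A) P(C) =", p_a * p_c)
```

Here "even" and "above three" are not independent, so the joint probability differs from the plain product, while "even" and "at most two" do satisfy P(A, C) = P(A) P(C).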

Exercise 1
We are given 2 urns, each containing a collection of colored balls. Urn 1 contains 2 white and 3 blue balls; Urn 2 contains 3 white and 4 blue balls. A ball is drawn at random from urn 1 and put into urn 2, and then a ball is picked at random from urn 2 and examined. What is the probability that the ball is blue?

Bayes Theorem
P(A | B) = P(B | A) P(A) / P(B)

How?
P(A, B) = P(A | B) P(B)
so P(A | B) = P(A, B) / P(B)
Also, P(A, B) = P(B, A)
and P(B, A) = P(B | A) P(A)
or P(A | B) = P(B | A) P(A) / P(B)

This is equivalent to:
P(A | B) = P(B | A) P(A) / [ P(B | A) P(A) + P(B | A^C) P(A^C) ]
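
One way to check an answer to Exercise 1, and to see the expanded denominator of Bayes theorem at work, is to condition on the colour of the ball that was moved. The sketch below does that with exact fractions. The follow-up question it answers (how likely it is that the moved ball was blue, given that the examined ball is blue) is an added illustration and not part of the exercise.

```python
from fractions import Fraction

# Step 1: colour of the ball moved from urn 1 (2 white, 3 blue).
p_moved_blue = Fraction(3, 5)
p_moved_white = Fraction(2, 5)

# Step 2: urn 2 starts with 3 white and 4 blue, so it holds 8 balls
# after the transfer, with 5 blue or 4 blue depending on the moved ball.
p_blue_given_moved_blue = Fraction(5, 8)
p_blue_given_moved_white = Fraction(4, 8)

# Denominator of Bayes theorem: P(B) = P(B | A) P(A) + P(B | A^C) P(A^C).
p_blue = (
    p_blue_given_moved_blue * p_moved_blue
    + p_blue_given_moved_white * p_moved_white
)

# Bayes theorem: P(moved blue | examined ball is blue).
p_moved_blue_given_blue = p_blue_given_moved_blue * p_moved_blue / p_blue

print("P(examined ball is blue) =", p_blue)
print("P(moved ball was blue | examined ball is blue) =", p_moved_blue_given_blue)
```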

Contingency Table
Courtesy: Rich Tsui, PhD

Contingency Table
You have developed a test to detect a certain disease.
(TP, FP, TN, FN = numbers of true positives, false positives, true negatives, false negatives)
What are the True Positive Rate (TPR) and True Negative Rate (TNR) of this test?
Sensitivity = TPR = TP / (TP + FN) = P(Test+ | Disease+)
Specificity = TNR = TN / (TN + FP) = P(Test- | Disease-)
What are the Positive Predictive Value (PPV) and Negative Predictive Value (NPV)?
PPV = TP / (TP + FP) = P(Disease+ | Test+)
NPV = TN / (TN + FN) = P(Disease- | Test-)
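
The four formulas translate directly into code. In the sketch below the function name `diagnostic_metrics` and the counts are hypothetical; they are not taken from the contingency table on the slide.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute the four rates from the cells of a 2 x 2 contingency table."""
    return {
        "sensitivity": tp / (tp + fn),  # P(Test+ | Disease+)
        "specificity": tn / (tn + fp),  # P(Test- | Disease-)
        "ppv": tp / (tp + fp),          # P(Disease+ | Test+)
        "npv": tn / (tn + fn),          # P(Disease- | Test-)
    }


# Hypothetical counts: 50 diseased and 150 healthy subjects.
metrics = diagnostic_metrics(tp=45, fp=20, fn=5, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that sensitivity and specificity divide by column totals (disease status), while PPV and NPV divide by row totals (test result), which is why the latter two depend on how common the disease is in the sample.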

Sensitivity (TPR): The probability that a sick person is correctly identified as having the condition.
Specificity (TNR): The probability that a healthy person is correctly identified as not having the condition.
Positive predictive value (PPV): Given that you test positive, the probability that you actually have the condition.
Negative predictive value (NPV): Given that you test negative, the probability that you actually do not have the condition.

Exercise 2
The results of a hypothetical study to measure test performance of the PCR test for HIV are shown in the 2 x 2 table in Table 1.
(a) Calculate the sensitivity, specificity, disease prevalence, positive predictive value (PV+), and negative predictive value (PV-).
(b) Use the TPR and TNR calculated in part (a) to fill the 2 x 2 table in Table 2. Calculate the disease prevalence, positive predictive value (PV+), and negative predictive value (PV-).
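
Part (b) asks for the reverse direction: starting from TPR and TNR and filling in a table. A minimal sketch of that step follows. Since Tables 1 and 2 are not reproduced here, the rates, prevalences, and cohort size below are purely illustrative assumptions, as is the function name.

```python
def expected_table(tpr, tnr, prevalence, n_subjects):
    """Expected 2 x 2 cell counts for a test with the given TPR and TNR."""
    diseased = prevalence * n_subjects
    healthy = n_subjects - diseased
    tp = tpr * diseased  # diseased subjects who test positive
    fn = diseased - tp   # diseased subjects who test negative
    tn = tnr * healthy   # healthy subjects who test negative
    fp = healthy - tn    # healthy subjects who test positive
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


def predictive_values(table):
    """Return (PPV, NPV) computed from the cell counts."""
    ppv = table["TP"] / (table["TP"] + table["FP"])
    npv = table["TN"] / (table["TN"] + table["FN"])
    return ppv, npv


# Same hypothetical test applied to populations with different prevalence.
for prevalence in (0.1, 0.01):
    table = expected_table(tpr=0.9, tnr=0.9, prevalence=prevalence, n_subjects=1000)
    ppv, npv = predictive_values(table)
    print(f"prevalence {prevalence}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

With TPR and TNR held fixed, the positive predictive value drops as the disease becomes rarer.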

Recall
Test question: The prevalence of a particular disease is 1/10. A test for this disease provides a correct diagnosis in 90% of cases (i.e. if you have the disease, 90% of the time you will test positive, and if you do not have the disease, 90% of the time you will test negative). Given that you test positive for the disease, what is the probability that you actually have the disease?

T+ Test positive
T- Test negative
D+ Disease present
D- Disease absent
Prevalence = Prior probability in population

Solution:
P(D+) = 0.1
P(T+ | D+) = 0.9
P(T- | D-) = 0.9, therefore P(T+ | D-) = 1 - 0.9 = 0.1

P(D+ | T+) = P(T+ | D+) P(D+) / [ P(T+ | D+) P(D+) + P(T+ | D-) P(D-) ]
           = (0.9)(0.1) / [ (0.9)(0.1) + (0.1)(0.9) ]
           = 0.5

Assessing quality of the predictive model
ROC - AUROC
The area under the curve is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
[Figure: ROC curves]
Q: Why is the blue curve worthless?
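
The ranking interpretation of the AUROC can be computed directly by comparing every positive instance with every negative one. The sketch below is a minimal version under two assumptions: the scores are invented, and tied scores are given half credit, which is a common convention.

```python
def auroc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked in the right order."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half (assumed convention)
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical classifier scores for diseased and healthy subjects.
positives = [0.9, 0.8, 0.6, 0.55]
negatives = [0.7, 0.4, 0.3, 0.2]
print("AUROC:", auroc(positives, negatives))

# A classifier that gives everyone the same score ranks no better than chance.
print("constant scores:", auroc([0.5, 0.5], [0.5, 0.5]))
```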