Probability theory and inference statistics Dr. Paola Grosso SNE research group p.grosso@uva.nl paola.grosso@os3.nl (preferred)

Roadmap
Lecture 1: Monday Sep. 22nd
- Collecting data
- Presenting data
- Descriptive statistics
- Basic probability theory
Lecture 2: Thursday Sep. 25th
- Probability distributions (cont.)
- Parameter estimation
- Confidence intervals, limits, significance
- Hypothesis testing

Last time
What I think you learned:
- How to present data
- How to make simple statements about your data
- Basics of probability theory
- Discrete variable distributions
What you have really learned/remember: http://goo.gl/w1ufzz

The binomial distribution
A discrete random variable R follows the binomial distribution if:
- there is a fixed number of trials n;
- only two outcomes (success or failure) are possible at each trial;
- the trials are independent;
- there is a constant probability p of success at each trial;
- the random variable R is the number of successes in n trials.
Then:
$P(R = r) = \frac{n!}{r!\,(n-r)!}\; p^r (1-p)^{n-r}$
where $p^r (1-p)^{n-r}$ is the probability of a specific outcome and $\frac{n!}{r!\,(n-r)!}$ is the number of equivalent permutations for that outcome.

Hands-on #5 Five percent of the switches produced by a company are defective or do not operate. What is the probability that out of thirty switches you have to install, one will be defective? And the probability that at most one is defective? Hint: look at dbinom and pbinom and qbinom
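
One possible way to approach this hands-on with the functions named in the hint. It is a sketch that assumes each switch is defective independently with probability 0.05; the variable names are illustrative only.

```r
# Binomial model: n = 30 switches, each defective with probability p = 0.05
n <- 30
p <- 0.05

# P(R = 1): exactly one defective switch
p_exactly_one <- dbinom(1, size = n, prob = p)

# P(R <= 1): at most one defective switch (the CDF evaluated at 1)
p_at_most_one <- pbinom(1, size = n, prob = p)

# Cross-check against the formula P(R = r) = choose(n, r) p^r (1 - p)^(n - r)
p_formula <- choose(n, 1) * p^1 * (1 - p)^(n - 1)

# qbinom inverts the CDF: smallest r such that P(R <= r) >= 0.95
r_95 <- qbinom(0.95, size = n, prob = p)

print(c(exactly_one = p_exactly_one, at_most_one = p_at_most_one))
```

With these inputs the two probabilities come out at roughly 0.34 and 0.55.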

Probability distributions (continued)

The Poisson distribution
It determines the probability of a specified event occurring during a specific period of time (or volume, distance or length).
The events occur with a known average rate and independently of the time since the last event.
λ is the expected number of occurrences in this interval:
$P(r;\lambda) = \frac{e^{-\lambda}\,\lambda^{r}}{r!}$

Properties of the Poisson distribution

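More properties of the Poisson distribution
Mean: $\langle r \rangle = \lambda$
Variance: $V(r) = \lambda$
Standard deviation: $\sigma = \sqrt{\lambda}$
$P(r;\lambda) = \frac{e^{-\lambda}\,\lambda^{r}}{r!}$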

Hands-on #6
The annual failure rate of two-year-old hard disks is 8%. You maintain a pool of 100 nodes with these two-year-old hard disks installed. What is the probability that one will fail today? And the probability that one will fail this week?
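
A possible sketch of this hands-on with the Poisson functions in R. It rests on assumptions that the exercise does not state: one disk per node, independent failures at a constant rate through the year, and an 8% annual failure rate read as 0.08 expected failures per disk per year. "One will fail" is computed both as exactly one and as at least one.

```r
# Assumed setup (not given explicitly in the exercise)
n_disks     <- 100
annual_rate <- 0.08                      # expected failures per disk per year

# Expected number of failures in the pool per day and per week
lambda_day  <- n_disks * annual_rate / 365
lambda_week <- 7 * lambda_day

# Exactly one failure: P(r = 1; lambda)
p_one_today <- dpois(1, lambda_day)
p_one_week  <- dpois(1, lambda_week)

# At least one failure: 1 - P(r = 0; lambda)
p_any_today <- 1 - ppois(0, lambda_day)
p_any_week  <- 1 - ppois(0, lambda_week)

print(c(lambda_day = lambda_day, lambda_week = lambda_week))
print(c(one_today = p_one_today, one_week = p_one_week,
        any_today = p_any_today, any_week = p_any_week))
```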

Continuous random variable PDFs

Probability of a continuous random variable
The probability $\Pr(2 \le x \le 4)$ is the area under the curve:
$\Pr(a \le x \le b) = \int_a^b f(x)\,dx$
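
The area interpretation can be checked numerically. The sketch below uses an exponential density with rate 0.5 as a stand-in for f(x); that choice is purely hypothetical, since the slide does not say which curve is plotted.

```r
# Hypothetical density f(x): exponential with rate 0.5 (an assumption)
f <- function(x) dexp(x, rate = 0.5)

# Pr(2 <= x <= 4) as the area under f between 2 and 4
area <- integrate(f, lower = 2, upper = 4)$value

# The same probability from the CDF: F(4) - F(2)
via_cdf <- pexp(4, rate = 0.5) - pexp(2, rate = 0.5)

# A density integrates to 1 over its whole support
total <- integrate(f, lower = 0, upper = Inf)$value

print(c(area = area, via_cdf = via_cdf, total = total))
```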

The Gaussian distribution
Look at the Poisson distribution in the limit of large N: you get the familiar Gaussian distribution (approximation reasonable for N > 10).
plot(x,dnorm(x,25,5),type='l',col='red',lwd=3)
$P(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\; e^{-(x-\mu)^2 / 2\sigma^2}$
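
A small sketch of the comparison behind this slide, assuming a Poisson mean of 25 so that the matching Gaussian has mean 25 and standard deviation sqrt(25) = 5, the same parameters as the dnorm call above. The grid r and the other variable names are illustrative.

```r
lambda <- 25
r <- 0:50                                   # grid of counts (assumed range)

# Poisson probabilities and the Gaussian density with the same mean and variance
poisson_p <- dpois(r, lambda)
gauss_p   <- dnorm(r, mean = lambda, sd = sqrt(lambda))

# Largest pointwise difference between the two curves
max_diff <- max(abs(poisson_p - gauss_p))
print(max_diff)

# Overlay: Poisson as vertical bars, Gaussian as a red line
plot(r, poisson_p, type = "h", xlab = "r", ylab = "probability")
lines(r, gauss_p, col = "red", lwd = 3)
```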

Properties of the Gaussian distribution
Mean: $\langle x \rangle = \int_{-\infty}^{+\infty} x\,P(x;\mu,\sigma)\,dx = \mu$
Variance: $V(x) = \int_{-\infty}^{+\infty} (x-\mu)^2\,P(x;\mu,\sigma)\,dx = \sigma^2$
Standard deviation: $\sqrt{V(x)} = \sigma$
The mother of all distributions:
- The binomial distribution B(n, p) is approximately normal N(np, np(1 − p)) for large n and for p not too close to zero or one.
- The Poisson(λ) distribution is approximately normal N(λ, λ) for large values of λ.
- The chi-squared distribution χ²(k) is approximately normal N(k, 2k) for large k.
- The Student's t-distribution t(ν) is approximately normal N(0, 1) when ν is large.
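
As an illustration of the first approximation, the sketch below compares the exact binomial CDF with the normal CDF for hypothetical values n = 100 and p = 0.3. The half-unit continuity correction is an addition that the slide does not mention; it is commonly used when a discrete distribution is approximated by a continuous one.

```r
n <- 100                                   # hypothetical number of trials
p <- 0.3                                   # hypothetical success probability
k <- 0:n

# Exact binomial CDF
exact <- pbinom(k, size = n, prob = p)

# Normal approximation N(np, np(1 - p)); note that pnorm takes the standard
# deviation, not the variance. The + 0.5 is a continuity correction (assumption).
approx <- pnorm(k + 0.5, mean = n * p, sd = sqrt(n * p * (1 - p)))

# Largest absolute error of the approximation over all k
max_err <- max(abs(exact - approx))
print(max_err)
```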

Hands-on #7
Run the script you find online: http://goo.gl/nrkmhm
    par(mfrow = c(3,3))                        # 3 x 3 grid of plots
    prob <- seq(0,1,0.01)
    # Row 1: standard normal N(0,1): density, CDF, quantile function
    x1 <- seq(-5,5,0.01)
    plot(x1,dnorm(x1),type='l')
    plot(x1,pnorm(x1),type='l')
    plot(prob,qnorm(prob),type='l')
    # Row 2: normal with mean 12 and standard deviation 0.5
    x2 <- seq(5,15,0.01)
    plot(x2,dnorm(x2,12,0.5),type='l')
    plot(x2,pnorm(x2,12,0.5),type='l')
    plot(prob,qnorm(prob,12,0.5),type='l')
    # Row 3: the range of x2 standardised as (x - 12)/0.5
    x3 <- seq((5-12)/0.5,(15-12)/0.5,0.01)
    plot(x3,dnorm(x3),type='l')
    plot(x3,pnorm(x3),type='l')
    plot(prob,qnorm(prob),type='l')
What have you done? What are you looking at?

PDF, CDF and quantile function
- Choose x. The PDF returns the probability density at x: for a discrete variable, the probability that we will observe the value x during one observation of the random variable X; for a continuous variable, probabilities are obtained as areas under the PDF.
- Choose x. The CDF returns the probability that we will observe a value equal to or lower than x during one observation of the random variable X.
- Choose a probability p. The quantile function returns the value which the random variable will be at, or below, with that probability.
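
In R these three functions are the d, p and q variants of each distribution. A minimal sketch for the standard normal, with an arbitrarily chosen x = 1.5:

```r
x <- 1.5                                   # arbitrary value chosen for illustration

density_at_x <- dnorm(x)                   # PDF: density at x
prob_below_x <- pnorm(x)                   # CDF: P(X <= x)
x_back       <- qnorm(prob_below_x)        # quantile function: inverts the CDF

# The PDF is the slope of the CDF; check with a central finite difference
h     <- 1e-5
slope <- (pnorm(x + h) - pnorm(x - h)) / (2 * h)

print(c(density = density_at_x, cdf = prob_below_x,
        quantile = x_back, cdf_slope = slope))
```

The quantile of the CDF value returns the original x, and the finite-difference slope of the CDF agrees with the density.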

Intervals
Can you show this with R?
68.27% within 1σ      90%   → 1.645σ
95.45% within 2σ      95%   → 1.96σ
99.73% within 3σ      99%   → 2.58σ
                      99.9% → 3.29σ
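
One way to show this with R, using only pnorm and qnorm on the standard normal (variable names are illustrative):

```r
# Probability content within k standard deviations of the mean
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)
print(round(100 * coverage, 2))            # 68.27 95.45 99.73

# Half-width, in units of sigma, of the central interval for a given level
levels <- c(0.90, 0.95, 0.99, 0.999)
width  <- qnorm(1 - (1 - levels) / 2)
print(round(width, 3))                     # 1.645 1.960 2.576 3.291
```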

Estimates and confidence intervals

Estimation
Estimation is the process of using an estimator obtained from a sample to produce an estimate of a parameter. There are two types of estimates:
- a point estimate, which is a single number or value used to estimate a population parameter; for example, $\bar{x}$ for µ;
- an interval estimate, which is a spread of values used to estimate a population parameter; for example, a < µ < b.

Confidence intervals
Three components in a confidence interval:
1. A confidence level: describes the uncertainty of a sampling method.
2. A sample statistic: a characteristic of a sample. Generally, a statistic is used to estimate the value of a population parameter.
3. A margin of error: the range of values above and below the sample statistic.
If we select different samples and compute different interval estimates using the same sampling method, the true population mean would fall within the range defined by "sample statistic ± margin of error" in (confidence level)% of the cases.

Populations and samples
Population:
- N: number of observations in the population
- P: proportion of successes in the population
- P_i: proportion of successes in population i
- μ: population mean
- σ: population standard deviation
- σ_p: standard deviation of p
- σ_x̄: standard deviation of $\bar{x}$
Sample:
- n: number of observations in the sample
- p: proportion of successes in the sample
- p_i: proportion of successes in sample i
- $\bar{x}$: sample estimate of the population mean
- s: sample estimate of σ
- SE_p: standard error of p
- SE_x̄: standard error of $\bar{x}$

Standard deviation and standard error
To calculate the confidence interval of a statistic you need to know either the standard deviation or the standard error of that statistic. Let's say you have measured a mean $\bar{x}$ or a proportion p.
The standard deviations are:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \qquad \sigma_p = \sqrt{\frac{P(1-P)}{n}}$
Note: you need to know σ and P from the population.
The standard errors are:
$SE_{\bar{x}} = \frac{s}{\sqrt{n}} \qquad SE_p = \sqrt{\frac{p(1-p)}{n}}$
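
A short sketch of the two standard errors in R. The measurements and the success count below are made up for illustration and do not come from the lecture.

```r
# Made-up sample of measurements (illustrative data only)
x <- c(12.1, 11.8, 12.4, 12.0, 11.7, 12.3, 12.2, 11.9)
n <- length(x)

# Standard error of the mean: s / sqrt(n); sd() returns the sample standard deviation s
se_mean <- sd(x) / sqrt(n)

# Made-up proportion: 42 successes out of 120 trials
successes <- 42
trials    <- 120
p_hat     <- successes / trials

# Standard error of the proportion: sqrt(p (1 - p) / n)
se_prop <- sqrt(p_hat * (1 - p_hat) / trials)

print(c(mean = mean(x), se_mean = se_mean, p = p_hat, se_prop = se_prop))
```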

Margin of error
In a confidence interval, the range of values above and below the sample statistic is called the margin of error:
Margin of error = Critical value × Standard deviation of the statistic
Margin of error = Critical value × Standard error of the statistic

Confidence level and critical value
1. You choose a confidence level (99%, 95%).
2. You calculate the parameters α and p*:
$\alpha = 1 - \frac{\text{confidence level}}{100} \qquad p^* = 1 - \frac{\alpha}{2}$
The critical value is the value of z (z score) or t (t score) whose cumulative probability (from the CDF) is equal to p*.
- z is the random variable that follows a standard normal distribution (µ = 0, σ = 1).
- t is the random variable that follows the Student's t-distribution with DF = n − 1 degrees of freedom.

z score and t score
Confidence level   α                     p*                      z score   t score
80%                1-(80/100) = 0.2      1-(0.2/2)  = 0.9        1.28      qt(0.9, n-1)
90%                1-(90/100) = 0.1      1-(0.1/2)  = 0.95       1.645     qt(0.95, n-1)
95%                1-(95/100) = 0.05     1-(0.05/2) = 0.975      1.96      qt(0.975, n-1)
98%                1-(98/100) = 0.02     1-(0.02/2) = 0.99       2.33      qt(0.99, n-1)
99%                1-(99/100) = 0.01     1-(0.01/2) = 0.995      2.58      qt(0.995, n-1)
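
The t score depends on the sample size while the z score does not. The sketch below evaluates both for a 95% confidence level and a few hypothetical sample sizes, to show how the t score approaches the z score as n grows.

```r
conf_level <- 95
alpha  <- 1 - conf_level / 100
p_star <- 1 - alpha / 2

# z score: quantile of the standard normal at p*
z_crit <- qnorm(p_star)

# t score: quantile of Student's t with n - 1 degrees of freedom
n      <- c(5, 10, 30, 100, 1000)          # hypothetical sample sizes
t_crit <- qt(p_star, df = n - 1)

print(data.frame(n = n,
                 t_score = round(t_crit, 3),
                 z_score = round(z_crit, 3)))
```

For small samples the t score is noticeably larger than the z score, which widens the confidence interval.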

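Hands-on #8
A sample of 200 elements has a mean $\bar{x}$ of 34 and a sample standard deviation s of 3. What is the 99% confidence interval for the mean?
34 ± ??? → ??? < µ < ???
Reminders:
Margin of error = Critical value × Standard deviation of the statistic
Margin of error = Critical value × Standard error of the statistic
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \qquad SE_{\bar{x}} = \frac{s}{\sqrt{n}} \qquad \sigma_p = \sqrt{\frac{P(1-P)}{n}} \qquad SE_p = \sqrt{\frac{p(1-p)}{n}}$
Confidence level   z score   t score
80%                1.28      qt(0.9, n-1)
90%                1.645     qt(0.95, n-1)
95%                1.96      qt(0.975, n-1)
98%                2.33      qt(0.99, n-1)
99%                2.58      qt(0.995, n-1)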

Example
A sample of 200 elements has a mean $\bar{x}$ of 34 and a sample standard deviation s of 3. What is the 99% confidence interval for the mean?
SE = 3/sqrt(200) ≈ 0.212
α = 1 − (99/100) = 0.01
p* = 1 − 0.01/2 = 0.995
z score = qnorm(0.995) ≈ 2.576
Margin of error = 2.576 × 0.212 ≈ 0.55
34 ± 0.55 → 33.45 < µ < 34.55
Note, with 90% confidence: 34 ± 0.35 → 33.65 < µ < 34.35
With a sample of 400 (and 99% confidence): SE = 0.15 → 34 ± 0.39
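
The same calculation can be wrapped in a small helper. The function ci_mean below is a hypothetical name introduced for this sketch, not part of R or of the lecture material. It works without intermediate rounding, so the last digit can differ slightly from a calculation done by hand with rounded values.

```r
# Hypothetical helper: confidence interval for a mean from summary statistics
ci_mean <- function(xbar, s, n, conf_level = 99, use_t = FALSE) {
  alpha  <- 1 - conf_level / 100
  p_star <- 1 - alpha / 2
  se     <- s / sqrt(n)                              # standard error of the mean
  crit   <- if (use_t) qt(p_star, df = n - 1) else qnorm(p_star)
  margin <- crit * se                                # margin of error
  c(lower = xbar - margin, upper = xbar + margin, margin = margin)
}

ci_99   <- ci_mean(34, 3, 200)                       # 99%, n = 200, z score
ci_90   <- ci_mean(34, 3, 200, conf_level = 90)      # lower confidence, narrower
ci_400  <- ci_mean(34, 3, 400)                       # larger sample, narrower
ci_99_t <- ci_mean(34, 3, 200, use_t = TRUE)         # t score instead of z score

print(round(rbind(ci_99, ci_90, ci_400, ci_99_t), 3))
```

Lowering the confidence level or increasing the sample size both shrink the margin of error, while the t score gives a slightly wider interval than the z score.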

Statistical hypothesis

Statistical hypothesis
A statistical hypothesis is an assumption about a population parameter.
H0 = the null hypothesis.
Ha = the alternative hypothesis.
If the sample data are not consistent with the statistical hypothesis H0, the hypothesis is rejected and the alternative is accepted.
Examples:
- Do the data from two samples belong to the same population?
- Do the data follow a Poisson distribution?

Hypothesis testing
A statistical hypothesis test is a method of making decisions using experimental data: you measure statistical significance. It works as a proof by contradiction, in four steps:
1. State the hypotheses. H0 is usually the hypothesis that the sample observations result purely from chance; it fixes the value of a parameter.
2. Formulate an analysis plan: find a statistic that takes on extreme values when the assumed hypothesis is false.
3. Analyze the sample data: calculate the value of this statistic in the collected data.
4. Interpret the result: reject or fail to reject the null hypothesis.

Test statistics and P-value
During the analysis phase you will define a test statistic (assuming that your data is normally distributed):
Test statistic = (Statistic − Parameter from H0) / Standard deviation of statistic
or
Test statistic = (Statistic − Parameter from H0) / Standard error of statistic

Significance level (p-value)
The P-value is the probability of observing a sample statistic at least as extreme as the test statistic, assuming the null hypothesis is true.
p-value                  Evidence against H0
< 0.01-ish               Very strong
0.01-ish to 0.05-ish     Moderate
0.05-ish to 0.10-ish     Weak
> 0.10-ish               Practically none

An example
25% of eligible jurors are black. In a random sample of 1050 people, 177 were black. Is there a sign of discrimination?
H0: P = 0.25
Ha: P < 0.25
Sample proportion: p = 177/1050 ≈ 0.1686
Test statistic: z = (0.1686 − 0.25) / sqrt(0.25(1 − 0.25)/1050) ≈ −6.09
P-value: pnorm(−6.09) ≈ 5.5e-10
The p-value is approximately 0, so we reject the null hypothesis. It is very unlikely that we would observe a sample percentage of 16.86% or smaller if the true percentage was 25%. The data suggest that black jurors were indeed selected less frequently than would have been expected. The data provide some evidence of discrimination.
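
The test can be reproduced in R as sketched below, evaluating 177/1050 directly without rounding intermediate values. The exact binomial test at the end is an addition for comparison and is not part of the slide; it does not rely on the normal approximation.

```r
P0 <- 0.25                                 # proportion under H0
n  <- 1050                                 # sample size
x  <- 177                                  # observed count

p_hat <- x / n                             # sample proportion
sd_p  <- sqrt(P0 * (1 - P0) / n)           # standard deviation of p under H0

# Test statistic and one-sided (lower-tail) p-value for Ha: P < 0.25
z       <- (p_hat - P0) / sd_p
p_value <- pnorm(z)

# Cross-check with the exact binomial test (an addition, for comparison)
exact_p <- binom.test(x, n, p = P0, alternative = "less")$p.value

print(c(p_hat = p_hat, z = z, p_value = p_value, exact_p = exact_p))
```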

Statistical significance
You can use strict cut-offs for the p-value: the significance levels. The significance level is denoted with the letter α.
For example, α = 0.05: reject the null hypothesis when the p-value is less than 0.05. Otherwise, do not reject it.
You cannot rely blindly on cut-offs:
- Not significant ≠ unimportant
- Statistical significance ≠ practical significance

Common critical values
Significance   Two-tail   One-tail
0.01           2.576      2.33
0.05           1.96       1.645
0.10           1.645      1.28
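
These critical values follow from the standard normal quantile function. The sketch below recomputes them and then shows, for a hypothetical observed statistic z_obs = 2.1, that comparing |z| with the two-tail critical value gives the same decision as comparing the two-sided p-value with α.

```r
alpha <- c(0.01, 0.05, 0.10)

# Two-tail: alpha is split over both tails; one-tail: all of alpha in one tail
two_tail <- qnorm(1 - alpha / 2)
one_tail <- qnorm(1 - alpha)

print(data.frame(significance = alpha,
                 two_tail = round(two_tail, 3),
                 one_tail = round(one_tail, 3)))

# Hypothetical observed test statistic (illustrative value only)
z_obs <- 2.1
p_two_sided <- 2 * pnorm(-abs(z_obs))

# Two equivalent decision rules, evaluated for each alpha
reject_by_p <- p_two_sided < alpha
reject_by_z <- abs(z_obs) > two_tail
print(rbind(reject_by_p, reject_by_z))
```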

Type I and type II error
Type I error (false positive): the researcher rejects H0 when H0 is true. The probability of committing a Type I error is the significance level α.
Type II error (false negative): the researcher fails to reject H0 when Ha is true. The probability of committing a Type II error is β. The probability of not committing a Type II error is called the power of the test: 1 − β.
The chance of making a Type I error does not depend on sample size (sample sizes are incorporated into test statistics).
The chance of making a Type II error decreases as sample size increases (power analysis).
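
A minimal sketch of a power calculation, under assumptions not stated on the slide: a one-sided z-test of a mean with known σ, where H0 says µ = µ0 and the true mean is µ0 + delta. The function name power_z and the numbers used are hypothetical.

```r
# Power of a one-sided z-test (H0: mu = mu0, true mean = mu0 + delta, known sigma).
# Under the alternative the test statistic is N(sqrt(n) * delta / sigma, 1),
# so the power is P(Z > z_alpha) = pnorm(sqrt(n) * delta / sigma - z_alpha).
power_z <- function(n, delta, sigma, alpha = 0.05) {
  z_alpha <- qnorm(1 - alpha)
  pnorm(sqrt(n) * delta / sigma - z_alpha)
}

n     <- c(10, 25, 50, 100)                # hypothetical sample sizes
power <- power_z(n, delta = 0.5, sigma = 2)
beta  <- 1 - power                         # probability of a Type II error

print(data.frame(n = n, power = round(power, 3), beta = round(beta, 3)))

# With delta = 0 (H0 true) the rejection probability equals alpha for any n
type_I <- power_z(n, delta = 0, sigma = 2)
print(type_I)
```

In this setting β shrinks as n grows, while the Type I error rate stays at α regardless of n.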

Summary You have learned a lot. I have two hopes: 1. You can use this in your RP project, future projects, research. (R as open-source alternative) 2. You are now curious about statistics and are eager to learn more by yourself.