Introduction to Probability and Statistics


1 Introduction to Probability and Statistics Xi Kathy Zhou, PhD Division of Biostatistics and Epidemiology Department of Public Health Feb. 2008

2 Overview Statistics: "The mathematics of the collection, organization and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling" (definition in the American Heritage Dictionary). Why statistics: by studying the characteristics of a small collection of observations, proper inference about the entire population can be drawn. Probability theory is the basic tool for statistical inference.

3 Outline Basic concepts in probability: events and random variables; probability and probability distributions; means, variance and moments; joint, marginal and conditional probabilities; dependence and independence. Basic concepts in statistics: data; descriptive statistics; statistical inference (estimation); statistical inference (hypothesis testing).

4 Probability a measure of uncertainty Example: random experiments and possible outcomes: toss a coin {H}, {T}; roll a 6-sided die {3}, {5}, {1,2,3}. What do you think you'll get in the above experiments? How sure are you? Why? Each outcome is equally probable. Probability is used as a way to measure uncertainty.

5 Events Definitions: Random experiment: an experiment which can result in different outcomes, and for which the outcome is unknown in advance. Sample space Ω: the set of all possible elementary outcomes of an experiment. Event: a subset of the sample space Ω. Examples: toss a coin, sample space {H, T}, events {H}, {T}; roll a 6-sided die, sample space {1,2,3,4,5,6}, events {3}, {5}, {1,2,3}.

6 Probability measure Sigma field F: a collection of subsets of Ω that satisfies the following: 1. Ø ∈ F; 2. if A ∈ F, then Aᶜ ∈ F; 3. if A, B ∈ F, then A ∪ B ∈ F and A ∩ B ∈ F. Probability measure P on (Ω, F): a function P: F → [0,1] satisfying the following properties (Ø denotes the empty set): 1. P(A) ≥ 0 for any A ∈ F; 2. P(Ω) = 1; 3. if A, B ∈ F and A ∩ B = Ø, then P(A ∪ B) = P(A) + P(B). The 6-sided die example: sigma field {Ø, {1}, …, {6}, {1,2}, …, {1,2,3,4,5,6}} (all subsets); sigma field {Ø, {1,2,3}, {4,5,6}, {1,2,3,4,5,6}}.

7 Probability measure - some properties Comparing the uncertainty of events: if A, B ⊆ Ω and A ⊆ B, then P(A) ≤ P(B). Assessing the uncertainty associated with other events: P(Aᶜ) = 1 − P(A) for A ⊆ Ω; P(A₁ ∪ A₂ ∪ … ∪ A_k) = P(A₁) + … + P(A_k) for pairwise disjoint A₁, …, A_k ⊆ Ω; P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any A, B ⊆ Ω. (Venn diagram of Ω, A−B, A ∩ B, B−A illustrating the last rule.) Example (rolling a 6-sided die): if we know P({1}), …, P({6}), we can derive the uncertainty of more complicated events such as P({1,2,4}).
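The rules above can be checked by brute-force enumeration for a finite sample space. A minimal sketch for the die example (the events A and B are illustrative choices, not from the slides; note that `|` and `&` below are Python's set union and intersection):

```python
from fractions import Fraction

# Sample space for one roll of a fair 6-sided die; each outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    return Fraction(len(event & omega), len(omega))

A = {1, 2, 4}
B = {2, 3}

# Additivity over elementary outcomes: P({1,2,4}) = P({1}) + P({2}) + P({4})
assert P(A) == P({1}) + P({2}) + P({4})

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
assert P(A | B) == P(A) + P(B) - P(A & B)

# Complement rule: P(A^c) = 1 − P(A)
assert P(omega - A) == 1 - P(A)
```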

8 Probabilities of events - Examples Experiment: randomly picking a DNA sequence of length 3. Event A: the picked sequence is ATG. P(A) = 1/4³ = 1/64. Experiment: randomly taking a DNA subsequence of length 20 from a length-100 sequence containing 20 A's. Event A: the drawn sequence consists of the 20 A's. What is the sample space, and what is the probability of event A? Answer: with all (100 choose 20) choices of 20 positions equally likely, P(A) = 1 / (100 choose 20).
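The first example can be verified by enumerating the whole sample space, a sketch of the "equally likely outcomes" counting argument:

```python
from fractions import Fraction
from itertools import product

# Enumerate all DNA sequences of length 3 over the alphabet {A, C, G, T}.
seqs = [''.join(s) for s in product('ACGT', repeat=3)]

# Exactly one of the 4**3 = 64 equally likely outcomes is ATG.
p_atg = Fraction(seqs.count('ATG'), len(seqs))
print(p_atg)  # 1/64, matching 1/4^3
```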

9 Relationship of two events Conditional probability: let Ω be an event space and let P be a probability measure on Ω. Let B ⊆ Ω be an event (on which we want to condition) with P(B) > 0. The function P(· | B): F → [0,1], P(A | B) = P(A ∩ B) / P(B), A ⊆ Ω, defines a probability measure on Ω, the conditional probability given B. (Proof as exercise.) Independence: let Ω be an event space and let P be a probability measure on Ω. Two events A, B ⊆ Ω with P(A) > 0, P(B) > 0 are called (stochastically) independent if one of the following equivalent conditions holds: P(A ∩ B) = P(A) P(B); P(A | B) = P(A); P(B | A) = P(B). Example: throwing a six-sided fair die, event A = even number, B = value smaller than 5, C = prime number, D = value smaller than 4. P(A) = ?, P(B) = ?, P(A ∩ B) = ? Are A and B independent? P(C) = ?, P(D) = ?, P(C ∩ D) = ? Are C and D independent?
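The die example can be worked out by counting; a sketch (taking "smaller than 5" for B, as the garbled "B=<5" appears to mean):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    return Fraction(len(event), len(omega))

A = {2, 4, 6}        # even number
B = {1, 2, 3, 4}     # value smaller than 5
C = {2, 3, 5}        # prime number
D = {1, 2, 3}        # value smaller than 4

def cond(a, b):
    # P(a | b) = P(a ∩ b) / P(b)
    return P(a & b) / P(b)

# A and B are independent: P(A ∩ B) = 1/3 = P(A) P(B), and P(A | B) = P(A).
assert P(A & B) == P(A) * P(B)
assert cond(A, B) == P(A)

# C and D are not: P(C ∩ D) = 1/3, but P(C) P(D) = 1/4.
assert P(C & D) != P(C) * P(D)
```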

10 Random variable Random variable: a function X: Ω → R with the property that {ω ∈ Ω : X(ω) ≤ x} ∈ F for each x ∈ R. A more common description of the results of a random experiment: it takes on values from a set of mutually exclusive and collectively exhaustive states and represents each with a number. Usually denoted by capital letters, e.g. X, Y, Z, etc. Realizations of random variables are usually denoted in lower case, e.g. x, y, z, etc. Can be discrete or continuous.

11 Types of Random Variables Discrete random variable: any variable whose possible values either form a finite set or can be listed in a countably infinite sequence. Continuous random variable: any variable whose possible values consist of an entire interval on the number line.

12 Probability distributions Definition: the probability distribution of a random variable X is the function F: R → [0,1] given by F(x) = P(X ≤ x). It characterizes the uncertainty of a random experiment before the experiment is conducted, i.e. it tells us that some results are more likely than others.

13 Discrete Random Variable Probability mass function (pmf) A discrete random variable X with values x₁, x₂, …, x_k, … has probability mass function P(X = x_i) = p_i, i = 1, 2, …, k, …, where the probabilities p_i satisfy 0 ≤ p_i ≤ 1 and p₁ + p₂ + … + p_k + … = 1. Range of this random variable: {x₁, x₂, …, x_k, …}.

14 Discrete Random Variable Cumulative distribution function (cdf) The cumulative distribution function F(x) of a discrete random variable X is defined by F(x) = P(X ≤ x) = ∑_{i: x_i ≤ x} p_i. Properties of the cdf: 0 ≤ F(x) ≤ 1; if x ≤ y then F(x) ≤ F(y). Discrete case: a step function, continuous from the right, with jump discontinuities at x₁, x₂, …, x_k, … of heights p₁, p₂, …, p_k, …

15 Discrete Random Variable pmf and cdf example Random experiment: roll 2 dice. Random variable X: sum of both values. (Plots of the probability mass function and the cumulative distribution function.)
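The pmf and cdf of this example can be built by enumerating all 36 equally likely ordered outcomes, a minimal sketch:

```python
from fractions import Fraction
from itertools import product

# pmf of X = sum of two fair dice, by enumerating all 36 ordered outcomes.
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    pmf[a + b] = pmf.get(a + b, 0) + Fraction(1, 36)

def cdf(x):
    # F(x) = P(X <= x): cumulative sum of the pmf
    return sum((p for s, p in pmf.items() if s <= x), Fraction(0))

print(pmf[7])    # 1/6 — seven is the most likely sum
print(cdf(12))   # 1 — the cdf reaches 1 at the largest value
```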

16 Discrete Random Variable Probability calculation rules From the cdf, probabilities of other events follow, e.g. P(X > x) = 1 − F(x) and P(a < X ≤ b) = F(b) − F(a).

17 Discrete random variables Examples of common distributions Discrete uniform distribution (rolling a fair 6-sided die). Geometric distribution (repeat a Bernoulli experiment until the first success, i.e. the first occurrence of an event A).

18 Discrete random variable Discrete uniform distribution A discrete random variable X is called uniformly distributed on the range {x₁, x₂, …, x_k} if P(X = x_i) = 1/k for all i = 1, …, k. Example: roll a fair die. (Plot of the probability distribution of a uniform distribution.)

19 Discrete Random Variable Geometric distribution (1) Random experiment: repeat a Bernoulli experiment until the first success. Events: {H, TH, TTH, …}. Probability of a success: P(H) = π. Random variable X: number of trials until the first success, with range {1, 2, …}. X has a geometric distribution with parameter π. The probability mass function has the form P(X = k) = (1 − π)^(k−1) π, and the cumulative distribution function has the form F(k) = 1 − (1 − π)^k.
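These two formulas can be implemented and cross-checked directly (the cdf must equal the cumulative sum of the pmf); a sketch for a fair coin, π = 0.5:

```python
# Geometric distribution with success probability pi:
#   P(X = k) = (1 - pi)**(k - 1) * pi,   F(k) = 1 - (1 - pi)**k
pi = 0.5  # e.g. waiting for the first head of a fair coin

def geom_pmf(k, pi):
    return (1 - pi) ** (k - 1) * pi

def geom_cdf(k, pi):
    return 1 - (1 - pi) ** k

# Consistency check: the cdf is the cumulative sum of the pmf.
assert abs(geom_cdf(10, pi) - sum(geom_pmf(k, pi) for k in range(1, 11))) < 1e-12

print(geom_pmf(3, pi))  # P(TTH) = 0.125
```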

20 Discrete random variable Geometric distribution (2)

21 Discrete Random variable Geometric distribution (3)

22 Discrete random variable - Mean The value you expect to get in a random experiment is the mean. Example: if you toss a coin 10 times, you expect to get 5 heads and 5 tails, because the probability of getting "heads" is 0.5 and 10 × 0.5 = 5. Definition: the mean of a discrete random variable with values x₁, x₂, …, x_k, … and probability distribution p₁, p₂, …, p_k, … is E(X) = ∑_i x_i p_i. Note that E(X) characterizes the random experiment.

23 Discrete random variable Mean (Examples) Binary random variable X: assume P(X=1) = π and P(X=0) = 1 − π; then E(X) = 0·P(X=0) + 1·P(X=1) = π. Toss a fair coin, X = gain/loss of a monetary unit: if P(X=1) = P(X=0), E(X) = ? Roll a fair die once, X = value showing: E(X) = ? Roll it twice, X = sum of values: E(X) = ?
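The die questions can be answered by applying the definition E(X) = ∑ x_i p_i directly; a sketch:

```python
from fractions import Fraction

def mean(pmf):
    # E(X) = sum_i x_i * p_i for a pmf given as {value: probability}
    return sum(x * p for x, p in pmf.items())

# One roll of a fair die: E(X) = (1 + 2 + ... + 6)/6 = 7/2
die = {x: Fraction(1, 6) for x in range(1, 7)}
print(mean(die))  # 7/2

# Sum of two rolls: E(X + Y) = E(X) + E(Y) = 7 (the linearity rule below)
two = {}
for a in range(1, 7):
    for b in range(1, 7):
        two[a + b] = two.get(a + b, 0) + Fraction(1, 36)
assert mean(two) == 7
```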

24 Discrete random variable Variance and Standard deviation The variance of a discrete random variable is Var(X) = ∑_i (x_i − E(X))² p_i = E[(X − E(X))²]. The standard deviation is the positive square root, σ = √Var(X).

25 Discrete random variables Independence Definition: two discrete random variables X with range {x₁, x₂, …, x_k, …} and Y with range {y₁, y₂, …, y_l, …} are called independent if P(X = x_i, Y = y_j) = P(X = x_i) P(Y = y_j) for all i and j. More generally, n discrete random variables X₁, X₂, …, X_n are called independent if for arbitrary values x₁, x₂, …, x_n in their respective ranges, P(X₁ = x₁, …, X_n = x_n) = P(X₁ = x₁) ⋯ P(X_n = x_n).

26 Discrete Random Variable Properties of the mean and calculation rules (1) Linear transformations: E(aX + b) = a E(X) + b. Nonlinear transformations: for a real function g, E(g(X)) = ∑_i g(x_i) p_i. Example: E(X²) = ∑_i x_i² p_i. Note: in general E(g(X)) ≠ g(E(X)). Example: E(X²) ≠ (E(X))² unless Var(X) = 0.

27 Discrete Random Variable Properties of the mean and calculation rules (2) Linearity rule, mean of a sum of (discrete) random variables: E(X + Y) = E(X) + E(Y), and E(a₁X₁ + … + a_nX_n) = a₁E(X₁) + … + a_nE(X_n). Product rule for independent (discrete) random variables: if X, Y are independent, then E(XY) = E(X)E(Y). Example: roll a die twice. What is the mean of the product of the values?

28 Discrete random variable Properties of the variance Linear transformations: Var(aX + b) = a² Var(X). For independent random variables X, Y and X₁, …, X_n respectively, we can show Var(X + Y) = Var(X) + Var(Y), and for any constants a₁, …, a_n, Var(a₁X₁ + … + a_nX_n) = a₁² Var(X₁) + … + a_n² Var(X_n).

29 Discrete random variable Variance (Examples) Binary random variable: Var(X) = π(1 − π) (proof as exercise). Roll a fair die once, X = value showing: Var(X) = ? Roll a fair die twice, X = sum of the values: Var(X) = ?
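Both the binary formula and the die questions can be checked against the definition Var(X) = ∑ (x_i − E(X))² p_i; a sketch:

```python
from fractions import Fraction

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    # Var(X) = sum_i (x_i - E(X))**2 * p_i
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

# One roll of a fair die: Var(X) = 35/12.
die = {x: Fraction(1, 6) for x in range(1, 7)}
print(var(die))  # 35/12

# Binary X with P(X=1) = pi: Var(X) = pi (1 - pi), as the slide claims.
pi = Fraction(1, 3)
binary = {1: pi, 0: 1 - pi}
assert var(binary) == pi * (1 - pi)
```

For the sum of two independent rolls, the rule of the previous slide gives Var = 35/12 + 35/12 = 35/6.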

30 Discrete Random variable Independence (Example) Random experiment: roll two dice. For all 1 ≤ i, j ≤ 6, P(X=i, Y=j) = 1/36 = 1/6 × 1/6 = P(X=i) P(Y=j), so the two values are independent. Random experiment: roll a die; Y = 1 if the value is a prime number, Z = 1 if the value is smaller than four. Are these two events independent? No: Y=1 and Z=1 means the value is 2 or 3, so P(Y=1, Z=1) = 2/6 ≠ 1/2 × 1/2 = P(Y=1) P(Z=1). Or equivalently: is P(Y=1 | Z=1) = P(Y=1)?

31 Continuous Random Variables

32 Continuous random variable Probability distribution Definition. If X: Ω → R is a random variable, the function F: R → R, F(x) = P(X ≤ x) is called the distribution function of X. If X is a continuous random variable with density f, the distribution function F can be expressed as F(x) = ∫_{−∞}^{x} f(t) dt. This formula is the continuous analogue of the discrete case, in which the distribution function was defined as F(x) = ∑_{x_j ≤ x} f(x_j).

33 Continuous random variables mean and variance The statistics mean and variance, already defined for discrete random variables, can be defined in an analogous way for continuous random variables. X discrete with values x_j ∈ {x₁, x₂, …}: pmf P(X = x_j), cdf P(X ≤ x_j); mean E(X) = ∑_j x_j P(X = x_j); variance Var(X) = ∑_j (x_j − E(X))² P(X = x_j). X continuous with density f: density function f(x), distribution function F(x); mean E(X) = ∫ x f(x) dx; variance Var(X) = ∫ (x − E(X))² f(x) dx.

34 Continuous random variables Example 1 Uniform distribution. A continuous random variable X is called uniform or uniformly distributed (on the interval [a,b]) if it has a density function of the form f(x) = 1/(b − a) for x ∈ [a, b], 0 otherwise, for some real values a < b. This is denoted by X ~ U(a,b). (Plots of the density f and the distribution function F on [a, b].)

35 Continuous random variables Example 2 Exponential distribution. A continuous random variable having a density f(x) = λ exp(−λx) for x ≥ 0, 0 otherwise, for some real parameter λ > 0 is called exponentially distributed, denoted X ~ Ex(λ). The corresponding distribution function is F(x) = 1 − exp(−λx) for x ≥ 0, 0 otherwise. (Density and distribution function of an exponentially distributed random variable X ~ Ex(λ) for λ = 0.3, 1, 3.)

36 Continuous random variables Example 3 Normal distribution. A continuous random variable X is called normally distributed or Gaussian (with mean μ and standard deviation σ > 0), written X ~ N(μ, σ²), if it has a density function of the form f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)). There is no closed form for the distribution function F of such a variable; it has to be computed numerically. The standard normal distribution has μ = 0, σ = 1. (Density and distribution function of some normally distributed random variables X ~ N(μ, σ²).)

37 Two more continuous distributions The χ²-distribution. If X₁, …, X_n are independent random variables that are N(0,1)-distributed, then the random variable Z = X₁² + … + X_n² is said to be chi-squared distributed with n degrees of freedom, for short Z ~ χ²(n). Student t-distribution (t-distribution). If X ~ N(0,1) and Z ~ χ²(n) are independent, then the random variable T = X / √(Z/n) is said to have a t-distribution with n degrees of freedom, for short T ~ t(n). This list of continuous random variables is by no means complete. For a survey, consult the statistics literature given in the reference list to this lecture series.

38 Continuous random variables Independence Definition. Let Ω be a probability space with probability measure P. Let X: Ω → R and Y: Ω → R be continuous random variables. X and Y are called independent if P(X ≤ x, Y ≤ y) = P(X ≤ x) P(Y ≤ y) = F_X(x) F_Y(y) for all x, y ∈ R. Corollary. If the continuous random variables X and Y are independent, then P(a₁ ≤ X ≤ a₂, b₁ ≤ Y ≤ b₂) = P(a₁ ≤ X ≤ a₂) P(b₁ ≤ Y ≤ b₂) for all real values a₁ < a₂, b₁ < b₂.

39 Continuous random variables Joint and marginal probability distributions Let X and Y be two random variables on the same probability space Ω. If there exists a function f: R × R → R such that P(a₁ ≤ X ≤ a₂, b₁ ≤ Y ≤ b₂) = ∫_{a₁}^{a₂} ∫_{b₁}^{b₂} f(x, y) dy dx for all real values a₁ < a₂, b₁ < b₂, then X and Y are said to have a continuous joint (multivariate) distribution, and f is called their joint density. We will consider only this case here. The marginal distribution of X is given by P(a₁ ≤ X ≤ a₂) = ∫_{a₁}^{a₂} ∫_{−∞}^{∞} f(x, y) dy dx = ∫_{a₁}^{a₂} f_X(x) dx, where f_X(x) = ∫ f(x, y) dy is the density of the marginal distribution of X.

40 Continuous Random Variable Conditional probability distributions The conditional distribution of X, given Y = b, is given by P(a₁ ≤ X ≤ a₂ | Y = b) = ∫_{a₁}^{a₂} f_X(x | Y = b) dx, where f_X(x | Y = b) = f(x, b) / ∫ f(t, b) dt is the density of the conditional distribution of X, given Y = b. We mention equivalent conditions for independence: the random variables X and Y are independent if 1. f(x,y) = f_X(x) f_Y(y) for all x, y ∈ R; 2. f_X(x | Y=b) = f_X(x) for all x, b ∈ R; 3. f_Y(y | X=a) = f_Y(y) for all a, y ∈ R.

41 Basic Concepts in Statistics

42 Data, sampling and statistical inference Data: characteristics/properties of a random sample from a population, for example y₁, …, y_n (n realizations of a random variable Y). Sampling: ways to select the subjects for which the characteristics/properties of interest will be assessed; examples: SRS (simple random sampling), stratified, clustered. Statistical inference: learning from data, i.e. assuming these data are n draws from a distribution f_θ, what do we know about the population parameter? Probability theory reasons from f to y: if the experiment is like …, then f will be …, and (y₁, …, y_n) will be like …, or E(Y) must be … Statistics reasons from y to f: since (y₁, …, y_n) turned out to be …, it seems that f is likely to be …, or the parameter is likely to be around …

43 Types of Data There are different types of data: numerical data (discrete, continuous), categorical data (ordered, non-ordered), and mixtures of both. If the properties consist of multiple features (like Signal, Detection, p-value in the example), the data is called multivariate; otherwise it is called univariate. Example (Affymetrix GeneId / Signal / Detection): BioB-5_at 258 P; BioB-M_at 470 P; BioB-3_at 247 P; BioC-5_at 787 P; BioC-3_at 695 P; BioDn-5_at 939 P; BioDn-3_at 4356 P; CreX-5_at 9992 P; CreX-3_at P; DapX-5_at 5 N; DapX-M_at 14 N; DapX-3_at 1 N; LysX-5_at 4 N; LysX-M_at 3 N; LysX-3_at 2 N; PheX-5_at 14 N; PheX-M_at 65 N; PheX-3_at 9 N; ThrX-5_at 25 N; ThrX-M_at 4 N; ThrX-3_at 118 N.

44 Steps in statistical analysis of data Describe the data (descriptive statistics). Propose a reasonable probabilistic model. Make inference about parameters in the model. Check the model fit/assumptions. Report results.

45 Describing univariate categorical data Frequency table: simply list all object-property pairs in a table; count the number of objects in each category and display the result in a table; calculate the relative size of each category and display it in the table. (Example table.)

46 Describing univariate numerical data Assume we have a dataset with objects 1, …, n and their real-valued properties x₁, …, x_n. Histogram: choose intervals C₁ = [a₁,a₂), C₂ = [a₂,a₃), …, C_k = [a_k,a_{k+1}), with a₁ < a₂ < … < a_{k+1} (this process is called binning). Let y_j = C_i iff x_j ∈ C_i. Display the categorical dataset y₁, …, y_n as a bar plot, with the width of the bars proportional to the length of the intervals. Example dataset: the heights of a population of 10,000 people; histograms were plotted with k equidistant bins, k = 8, 16, 32, 64. A local maximum of the abundance distribution is called a mode, x_mode. Distributions with only one mode are called unimodal; distributions with more modes are called multimodal.

47 Descriptive statistics 1 The second and by far the most important way is to summarize the data by appropriate statistics. A statistic is a rule that assigns a number to a dataset; this number is meant to tell us something about the underlying dataset. Example: arithmetic mean. Given x₁, …, x_n, calculate the arithmetic mean as x̄ = (1/n) ∑_{j=1}^{n} x_j. The arithmetic mean is one of many statistics that aim to describe where the centre of the data is. It minimizes the sum of the quadratic distances to the data points, namely x̄ = argmin_x ∑_{j=1}^{n} (x − x_j)².

48 Descriptive statistics 2 Median. Let x₁, …, x_n be given in ascending order. The median x_med is defined as x_med = x_{(n+1)/2} if n is odd, and x_med = (x_{n/2} + x_{n/2+1})/2 if n is even. The median is a value such that the number of data points smaller than x_med equals the number of data points greater than x_med. Like the arithmetic mean, the median is a location measure for the centre of the data. (Figure: mean, median and mode of a unimodal distribution.)
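A sketch of both location measures, using an invented toy dataset with one extreme value to show why the median is the more robust of the two:

```python
def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                       # x_((n+1)/2) for odd n
    return (s[n // 2 - 1] + s[n // 2]) / 2     # average of the two middle values

data = [1, 2, 2, 3, 100]        # one extreme value
print(arithmetic_mean(data))    # 21.6 — pulled toward the outlier
print(median(data))             # 2 — unaffected by the outlier
```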

49 Descriptive statistics 3 Symmetry. A frequency distribution is called symmetric if it has an axis of symmetry. Skewness. A frequency distribution is called skewed to the right if the right tail of the distribution falls off more slowly than the left tail; analogously for skewed to the left. Rules of thumb: left skew: x̄ < x_med < x_mode; symmetric: x̄ ≈ x_med ≈ x_mode; right skew: x̄ > x_med > x_mode. (Figure: mean, median and mode for left-skewed, symmetric and right-skewed distributions.)

50 Descriptive statistics 4 Quantiles. Let q ∈ (0,1). A q-quantile of a frequency distribution is a value x_q such that the fraction of data lying left of x_q is at least q, and the fraction lying right of x_q is at least 1 − q. If the data is ordered (x₁ ≤ x₂ ≤ … ≤ x_n), then x_q = x_{⌈qn⌉} if qn is not an integer, and x_q ∈ [x_{qn}, x_{qn+1}] if qn is an integer. Special quantiles are the quartiles x_{0.25}, x_{0.5}, x_{0.75} (which split up the data into four classes), and the quintiles x_{0.2}, x_{0.4}, x_{0.6}, x_{0.8}. They are frequently used to give a summary of the data distribution. (Figure: x_{0.05}, x_{0.25}, x_{0.5}, x_{0.75}, x_{0.95} marked on a distribution.)
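The quantile rule above translates directly into code; a sketch (when qn is an integer any value in [x_(qn), x_(qn+1)] qualifies, and this sketch arbitrarily takes the lower endpoint):

```python
import math

def quantile(xs, q):
    # q-quantile via the order-statistics rule above (1-indexed positions).
    s = sorted(xs)
    n = len(s)
    qn = q * n
    if qn == int(qn):
        return s[int(qn) - 1]          # any value in [x_(qn), x_(qn+1)] works
    return s[math.ceil(qn) - 1]        # x_(ceil(qn)) when qn is not an integer

data = list(range(1, 11))              # 1, 2, ..., 10
print(quantile(data, 0.25))  # qn = 2.5 is not an integer -> x_(3) = 3
print(quantile(data, 0.5))   # qn = 5 is an integer -> lower endpoint x_(5) = 5
```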

51 Descriptive statistics 5 Variance, standard deviation. The variance v = var(x₁, …, x_n) = Var(x) of a dataset x = (x₁, …, x_n) is defined as v = s² = (1/n) ∑_{j=1}^{n} (x_j − x̄)² (the average squared distance from the data points to x̄). The standard deviation s = s(x) is the positive square root of the variance, s² = v. The variance and the standard deviation are measures of the dispersion of the data. (Figure: relative-frequency histograms with small vs. high variance.)

52 Detailed description of univariate data Density plots. If the number of data points is large, it is often convenient to approximate a histogram (of the relative frequencies) by a density curve: a density function is a non-negative real-valued integrable function f such that ∫ f(x) dx = 1 (this condition says that the area enclosed by the graph of f and the x-axis is 1). Interpretation: the area of a segment enclosed by the x-axis, the graph of f and the vertical lines x = x₀ and x = x₁ equals the fraction of data points with values between x₀ and x₁.

53 Continuous univariate data An important distribution Normal distributions = Gaussian distributions. A very important family of density functions are the Gaussian densities, defined as f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)) with parameters μ and σ > 0. This distribution is symmetric (around x = μ), unimodal (with mode at x = μ) and shaped like a bell. The mean of Gaussian distributed data is μ, its variance is σ². The 68-95-99.7 rule: if a dataset has a Gaussian distribution with mean μ and variance σ², then 68% of the data lie within the interval [μ−σ, μ+σ], 95% within [μ−2σ, μ+2σ], and 99.7% within [μ−3σ, μ+3σ].
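The rule can be checked empirically by simulation; a sketch using arbitrarily chosen parameters μ = 10, σ = 2 and a fixed seed for reproducibility:

```python
import random

random.seed(0)
mu, sigma = 10.0, 2.0
data = [random.gauss(mu, sigma) for _ in range(100_000)]

def frac_within(k):
    # fraction of the sample inside [mu - k*sigma, mu + k*sigma]
    return sum(mu - k * sigma <= x <= mu + k * sigma for x in data) / len(data)

for k, target in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    print(k, round(frac_within(k), 3))  # close to 0.68, 0.95, 0.997
```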

54 Summary Frequency tables, bar plots, pie charts, histograms and density plots are possible ways to display statistical data. Mean, median and quantiles are measures of location for numerical data. The variance is a measure of variation for numerical data; it has pleasant transformation properties. The Gaussian distribution is a very important density function.

55 Multivariate descriptive statistics

56 Multidimensional data (1) In many applications a set of properties/features is measured. If we want to learn facts about a single property, we use univariate statistical measures, e.g. mean, median, variance, quantiles. If we want to learn how two or more properties depend on each other we need multivariate statistical measures. Examples (multidimensional data) measure age and gender of the same person. microarray gene expression data are multidimensional Ways to describe these data

57 Multidimensional data (2) For each object i, i = 1, …, n, we measure simultaneously several features X, Y, Z, … (multidimensional or multivariate data), obtaining the values (x_i, y_i, z_i) of the features for object i. In the following, we consider two features. Question X <--> Y: what does the correlation between X and Y look like? Correlation (association). Question X --> Y: how does X affect the feature Y (response)? Regression.

58 Discrete/grouped data If a feature has only a finite or countably infinite number of possible values, we call it discrete, e.g. the number of A's in a DNA sequence. If a feature's possible values range in an interval, we call it continuous, e.g. the weight of a person. To know: how to describe the distribution of two discrete features, and how to evaluate whether the two features are correlated. This also includes continuous features grouped into categories.

59 General description: Contingency table Absolute frequencies A (k × m) contingency table of absolute frequencies has entries h_ij, the number of objects with X = a_i and Y = b_j. The contingency table describes the joint distribution of X and Y in terms of absolute frequencies.

60 Contingency table Marginal frequencies The column and row sums of the contingency table are called the marginal frequencies of the features X and Y. We write h_i· = h_i1 + … + h_im, i = 1, …, k, and h_·j = h_1j + … + h_kj, j = 1, …, m. The resulting sums h_1·, …, h_k· and h_·1, …, h_·m describe the univariate distributions of the features X and Y. Such a distribution is also called a marginal distribution.

61 Contingency table Relative frequencies A (k × m) contingency table of relative frequencies has the same form, with each cell count divided by the total number n of objects. The contingency table describes the joint distribution of X and Y; the margins describe the marginal distributions of X and Y.

62 Contingency table Conditional frequencies By looking at the absolute or relative frequencies alone it is not immediately possible to decide whether there is a correlation between features. Therefore: look at conditional frequencies, i.e. the distribution of one feature for a fixed value of the second feature.

63 Contingency table Conditional frequency distribution (1) The conditional frequency distribution of Y under the condition X = a_i, also written Y | X = a_i, is given by h_ij / h_i·, j = 1, …, m. The conditional frequency distribution of X under the condition Y = b_j, also written X | Y = b_j, is given by h_ij / h_·j, i = 1, …, k.

64 Contingency table Conditional frequency distribution (2) Because relative frequencies are the absolute frequencies divided by n, the same formulas hold for relative frequencies. The conditional distributions are computed by dividing the joint frequencies by the appropriate marginal frequencies.

65 Contingency table χ² coefficients Starting point: what should the joint frequencies look like, so that we could empirically assume independence between X and Y (given the marginal distributions)?

66 Contingency table Empirical independence Idea: X and Y are empirically independent if and only if the conditional frequencies are equal in each sub-population X = a_i, i.e. independent of a_i.

67 Contingency table Assessing empirical independence Idea: compare for each cell (i,j) the observed frequency h_ij with the theoretical frequency under the assumption of independence, e_ij = h_i· h_·j / n. χ² coefficient: χ² = ∑_i ∑_j (h_ij − e_ij)² / e_ij.

68 Contingency table Properties of the χ² coefficient χ² = 0 <==> X and Y are empirically independent; large χ² <==> strong correlation; small χ² <==> weak correlation. Disadvantage: χ² depends on the dimension of the table.

69 Graphical representation of quantitative features Graphical representation of the values (x_i, y_i), i = 1, …, n, from two continuous features X and Y: the simplest representation of (x₁,y₁), …, (x_n,y_n) in a coordinate system is called a scatterplot.

70 Correlation of continuous features Aim: find a measure that describes the correlation of two continuous features X and Y. (Scatterplots: no or only weak correlation, strong positive correlation, strong negative correlation.)

71 Pearson's correlation coefficient (1) The Pearson correlation coefficient for the data (x_i, y_i), i = 1, …, n, is defined as r = ∑_i (x_i − x̄)(y_i − ȳ) / √(∑_i (x_i − x̄)² ∑_i (y_i − ȳ)²). The range of r is [−1,1]: r > 0 means positive correlation, a positive linear relationship, i.e. values lie around a straight line with positive slope; r < 0 means negative correlation, a negative linear relationship, i.e. values lie around a straight line with negative slope; r = 0 means no correlation (uncorrelated).
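The definition can be implemented in a few lines; a sketch, checked on invented data with an exact linear relationship (where r must be ±1):

```python
import math

def pearson_r(xs, ys):
    # r = sum (x_i - xbar)(y_i - ybar) / sqrt( sum (x_i - xbar)^2 * sum (y_i - ybar)^2 )
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2 * v + 1 for v in x]))   # 1.0: exact positive linear relationship
print(pearson_r(x, [-3 * v for v in x]))      # -1.0: exact negative linear relationship
```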

72 Pearson s correlation coefficient (2) The correlation coefficient r measures the strength of a linear relationship

73 Pearson's correlation coefficient (3) Rule of thumb: |r| < 0.5 weak correlation, 0.5 ≤ |r| < 0.8 medium correlation, |r| ≥ 0.8 strong correlation. Linear transformations: the correlation coefficient between a + bX and c + dY equals the correlation coefficient between X and Y if bd > 0, and equals its negative if bd < 0.

74 Equivalent forms of r Multiplying out (remember the formula for variances!) yields r = s_xy / (s_x s_y), i.e. r is the covariance divided by the product of the standard deviations, with covariance s_xy = (1/n) ∑_i (x_i − x̄)(y_i − ȳ) and standard deviations s_x, s_y.

75 Statistical Inference Estimation: finding approximations of the model parameters (point estimation); finding the uncertainty associated with the population parameter (interval estimation, i.e. finding confidence intervals). Hypothesis testing.

76 Point Estimation Finding θ̂(x₁, …, x_n). Desired properties of the estimator: unbiasedness (bias is measured as the expected difference between the estimator and the population parameter); efficiency (could be described by the inverse of the variance of the estimator); small mean squared error (MSE), E(θ̂ − θ)²; other: consistency, etc. Common methods to find estimators: method of moments; maximum likelihood estimation.

77 Estimation: Method of Moments Method of moments: match the first (E(X)), second (E(X²)), …, order moments to the parameters and solve the equation system. If E(X^k) = g(θ), then θ̂ = g⁻¹(sample k-th moment). Maximum Likelihood Estimation (MLE): assuming the data come from a parametric family indexed by a population parameter θ, i.e. X₁, …, X_n ~ i.i.d. f(x|θ), the joint density of the data is f(x₁, …, x_n | θ) = ∏_i f(x_i | θ). The probability of observing the data, viewed as a function of θ, is the likelihood function of the parameter θ under the assumed probabilistic model: Likelihood = f(x₁, …, x_n | θ) = ∏_i f(x_i | θ).

78 Example: Binomial data Data: 6, 3, 5, 6, 8, the numbers of successes in 5 repeated experiments of tossing a coin 10 times. Is this a fair coin? What is going to come up for the 11th toss? Assume a probabilistic model: X ~ Binom(10, π). Estimating π by MOM: because E(X) = 10π, the estimate of π is the sample mean divided by 10, i.e. π̂ = ((6+3+5+6+8)/5) / 10. MLE: L(π | data) = P(x₁=6, …, x₅=8 | π) = P(x₁=6 | π) … P(x₅=8 | π); then find the value of π that maximizes the likelihood function.
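Both estimators can be computed numerically; a sketch that finds the MLE by a simple grid search over (0,1) (a crude stand-in for the calculus, not the lecture's method) and confirms it agrees with the MOM estimate here:

```python
from math import comb, log

data = [6, 3, 5, 6, 8]      # successes in 5 experiments of n = 10 tosses each
n = 10

# MOM: E(X) = n*pi, so pi_hat = sample mean / n
pi_mom = sum(data) / len(data) / n
print(pi_mom)               # 0.56

def log_likelihood(pi):
    # log of prod_i C(n, x_i) pi**x_i (1-pi)**(n-x_i)
    return sum(log(comb(n, x)) + x * log(pi) + (n - x) * log(1 - pi) for x in data)

# Grid search over (0,1); for the binomial model the MLE coincides with the
# MOM estimate, which the search confirms numerically.
grid = [i / 1000 for i in range(1, 1000)]
pi_mle = max(grid, key=log_likelihood)
assert abs(pi_mle - pi_mom) < 1e-3
```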

79 Example: Normal data x₁, x₂, …, x_n ~ iid N(μ, σ²) with pdf f(x | μ, σ) = (1/(√(2π)σ)) exp(−(x − μ)²/(2σ²)). Joint pdf for the whole random sample: f(x₁, x₂, …, x_n | μ, σ) = f(x₁ | μ, σ) f(x₂ | μ, σ) … f(x_n | μ, σ). The likelihood function is basically the joint pdf for the fixed sample: l(μ, σ | x₁, x₂, …, x_n) = f(x₁ | μ, σ) f(x₂ | μ, σ) … f(x_n | μ, σ). The maximum likelihood estimates of the model parameters μ and σ² are the numbers that maximize this likelihood function: μ̂ = ∑ x_i / n and σ̂² = ∑ (x_i − μ̂)² / n.

80 Hypothesis Testing Making inference about the value of the population parameter based on the data. Start with hypotheses about the population parameter (null and alternative). Use the data to assess the sampling variability under the null hypothesis. Conclusion: reject H0 if the data are highly unlikely to be generated from the probabilistic model defined by H0; fail to reject H0 if the data are not highly unlikely under H0.

81 Example: Hypothesis Testing X: the expression level of gene A under condition 1. Y: the expression level of gene A under condition 2. To decide: whether the average expression levels are equal. Null hypothesis H₀: both expression levels are equal. Alternative hypothesis H₁: the expression levels are unequal. Specify a method for deciding between these two alternatives: choose an appropriate statistic D that is able to discriminate between the two hypotheses, and choose a rejection region in which H₀ is rejected. The selection of the statistic defines the test.

82 Hypothesis Testing (Example) The biologist may proceed in the following way. He has n_x replicate measurements of the gene of interest in condition X, (x₁, …, x_{n_x}), and n_y replicate measurements in condition Y, (y₁, …, y_{n_y}). He might divide the average of the X measurements by the average of the Y measurements and obtain the statistic D = ((x₁ + x₂ + … + x_{n_x}) / n_x) / ((y₁ + y₂ + … + y_{n_y}) / n_y). Then the biologist might define the acceptance region as [1/2, 2], i.e. if |log₂ D| > 1, he rejects the null hypothesis in favour of the alternative hypothesis (differential gene expression); if |log₂ D| ≤ 1, he does not reject H₀. This test is not optimal (see the exercises), but it is still used by many researchers. The great advantage of this approach is that the choice of the confidence interval can be done implicitly by prescribing a significance level.

83 Hypothesis Testing Significance level (Example) Let α ∈ (0,1) be given; usually α is a number close to zero. The statistic D can be interpreted as a random variable. If we assume the null hypothesis is valid, we can find a (not necessarily unique) interval J on the real line such that P(D ∉ J | H₀) = α. This means that, given the null hypothesis is valid, the probability of observing a value of D outside the interval J is α (and hence small, if α is small). The complement of J in R is then taken as the rejection region for the test. In the biologist example, there are better ways to design a test for differential gene expression. Under the assumption that the expression values for X and Y follow normal distributions, we can conduct a t-test:

84 The two-sample Student t-test Assume that X = (x₁, …, x_{n_x}) and Y = (y₁, …, y_{n_y}) are two samples of independent normally distributed random variables, X ~ N(μ_x, σ_x²) and Y ~ N(μ_y, σ_y²), with means μ_x and μ_y and standard deviations σ_x and σ_y. The null hypothesis can be stated as H₀: μ_x = μ_y. The T statistic T = (X̄ − Ȳ) / √(S_X²/n_x + S_Y²/n_y) is perfectly designed to answer this question: if the null hypothesis is true, i.e. μ_x − μ_y is near 0, then T should be close to 0 except for random outcomes that are pretty unusual. The T statistic is a random variable with a somewhat complicated distribution: T has approximately a t-distribution with d degrees of freedom, where d is the closest integer to (S_X²/n_x + S_Y²/n_y)² / [ (1/(n_x − 1))(S_X²/n_x)² + (1/(n_y − 1))(S_Y²/n_y)² ].

85 Student t-test The density of the T statistic tells us how far from 0 we should expect T to be most of the time, given the null hypothesis is true. E.g. for k = 8 degrees of freedom and significance level α = 5%, we would expect T to be above t(0.975; 8) = 2.306 or below t(0.025; 8) = −2.306 only 5% of the time. Thus a typical decision rule in this case would be to reject H₀ in favour of H₁ if |T| > t(0.975; 8) = 2.306. (Figure: density of the t-statistic for k = 8, with the symmetric 95% region and 2.5% in each tail.) P-values: the probability of observing values of D that are at least as extreme as the observed value d. Calculate the p-value p = P(|D| > |d|). Given a significance level α, we reject the null hypothesis if p < α.
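The T statistic and the degrees of freedom from the previous slide can be computed directly; a sketch on invented replicate measurements (the expression values below are hypothetical, not from the lecture):

```python
import math

def welch_t(x, y):
    # Two-sample t statistic T = (xbar - ybar) / sqrt(sx2/nx + sy2/ny)
    nx, ny = len(x), len(y)
    xbar, ybar = sum(x) / nx, sum(y) / ny
    sx2 = sum((v - xbar) ** 2 for v in x) / (nx - 1)   # sample variances
    sy2 = sum((v - ybar) ** 2 for v in y) / (ny - 1)
    se2 = sx2 / nx + sy2 / ny
    t = (xbar - ybar) / math.sqrt(se2)
    # Welch-Satterthwaite degrees of freedom, as on the previous slide
    d = se2 ** 2 / ((sx2 / nx) ** 2 / (nx - 1) + (sy2 / ny) ** 2 / (ny - 1))
    return t, d

# Hypothetical replicate expression measurements for one gene in two conditions.
x = [5.1, 4.8, 5.3, 5.0, 4.9]
y = [6.0, 6.2, 5.8, 6.1, 6.3]
t, d = welch_t(x, y)
print(round(t, 2), round(d))
# Reject H0 at alpha = 5% if |T| exceeds the t-quantile t(0.975; d).
```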

86 Hypothesis Testing Error types
If we reject the null hypothesis when it is actually true, we have made what is called a type I error or a false positive. (Example: falsely declaring a gene as differentially expressed.) If we accept the null hypothesis when it is actually false, then we have made a type II error or a false negative. (Example: failing to identify a truly differentially expressed gene.)

                          H₀ true                          H₀ not true
Hypothesis not rejected   True negatives                   Type II error (false negatives)
Hypothesis rejected       Type I error (false positives)   True positives

87 Hypothesis Testing Error Types (Cont.) In hypothesis testing, the probability of a Type I error is controlled to be at most the significance level of the test. It is harder to control the probability of a Type II error because we usually do not have a test statistic for the alternative hypothesis. The smaller the true difference in expression levels, the larger the probability of a Type II error. Given a statistical testing procedure, it is impossible to make both error probabilities arbitrarily small by selecting a special significance level. There is a trade-off between type I and type II errors, as depicted in the next figure.

88 Two types of tests
Parametric tests: a parametric distribution is assumed for the measured random variables. E.g. the t-test assumes that the variables are normally distributed. (If this is not the case, the test yields wrong p-values or wrong confidence intervals.)
Non-parametric tests: no parametric distribution is assumed for the measured random variables. They are used when the distribution of the measured variables is not known, or when there is no appropriate test that can deal with it. Non-parametric tests merely rely on the relative order of the values, or on some very mild constraints concerning the shape of the probability distributions (e.g. unimodality, symmetry).
Often, prior to computing a test statistic, the data are transformed in order to produce random variables that are easier to handle (e.g. to produce approximately normally distributed data). We mention one parametric and one non-parametric test which are commonly used.

89 Wilcoxon rank sum test Given two samples x = (x₁, …, x_n) and y = (y₁, …, y_m) drawn independently from the random variables X and Y respectively, we test whether the distributions of X and Y are identical. For large sample sizes it is almost as sensitive as the two-sample Student t-test. For small samples with unknown distributions this test can even be more sensitive than the Student t-test. The only requirement for the Wilcoxon test to be applicable is that the distributions are symmetric.

90 Wilcoxon rank sum test (Cont.)
1. State the hypotheses. Null hypothesis: the two variables X and Y have the same distribution. Alternative hypothesis: the two variables X and Y do not have the same distribution.
2. Choose a significance level α.
3. Compute the test statistic: rank order all N = n + m values from both samples combined; sum the ranks of the smaller sample and call this value w.
4. Calculate the p-value: look up the level of significance in a table using w, m and n. Calculating the exact p-value is based on enumerating all permutations of ranks over both samples. (This is infeasible for n, m > 10. Fortunately, there are approximations available, and implemented in R.)
5. Compare the p-value with α and state the conclusion: p-value < α: reject H₀; p-value ≥ α: fail to reject H₀.
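Steps 3 and 4 can be sketched directly (a toy implementation assuming no ties; as the slide notes, exact enumeration is only feasible for small n and m):

```python
from itertools import combinations

def rank_sum(x, y):
    """Rank all N = n + m pooled values (assuming no ties) and
    return the rank sum w of sample x."""
    rank = {v: i + 1 for i, v in enumerate(sorted(x + y))}
    return sum(rank[v] for v in x)

def wilcoxon_exact_p(x, y):
    """Exact two-sided p-value: the fraction of all C(N, n) possible
    rank assignments whose rank sum lies at least as far from its
    null mean n(N + 1)/2 as the observed w."""
    n, N = len(x), len(x) + len(y)
    w = rank_sum(x, y)
    mu = n * (N + 1) / 2
    hits = total = 0
    for combo in combinations(range(1, N + 1), n):
        total += 1
        if abs(sum(combo) - mu) >= abs(w - mu):
            hits += 1
    return hits / total

p = wilcoxon_exact_p([1.2, 1.5], [3.1, 4.0])
# 2 of the 6 possible rank assignments are at least as extreme: p = 1/3
```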

91 Summary
Null hypothesis, test statistic
Significance level, rejection region, p-value
Type I and type II errors
5-step testing procedure
Parametric tests: t-test, χ²-test, ANOVA
Non-parametric tests: Wilcoxon rank sum test, Kruskal–Wallis test

92 Multiple hypothesis testing Golub et al. (1999) were interested in identifying genes that are differentially expressed in patients with two types of leukemia: acute lymphoblastic leukemia (ALL, class 0) and acute myeloid leukemia (AML, class 1). Gene expression levels were measured using Affymetrix chips containing g = 6817 human genes. n = 38 samples = 27 ALL cases + 11 AML cases.

93 Multiple hypothesis testing Following Golub et al., three preprocessing steps were applied to the normalized matrix of intensity values available on the website: (i) thresholding: floor of 100 and a ceiling; (ii) filtering: exclusion of genes with (max/min) ≤ 5 or (max − min) ≤ 500, where max and min refer respectively to the maximum and minimum intensities for a particular gene across all mRNA samples; (iii) base-10 logarithmic transformation. The data were then summarized by a 3051 × 38 matrix. A two-sample t-test was computed for each of the 3051 genes.
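The three preprocessing steps might be sketched as follows. The ceiling value did not survive transcription, so the 16000 below is only a placeholder assumption, as is the function name:

```python
import math

def preprocess(genes, floor=100.0, ceiling=16000.0):
    """Golub-style preprocessing, one expression row per gene:
    (i) clip each value to [floor, ceiling],
    (ii) drop low-variation genes (max/min <= 5 or max - min <= 500),
    (iii) take base-10 logarithms of the surviving rows.
    The ceiling default is a placeholder, not the slide's value."""
    kept = []
    for row in genes:
        clipped = [min(max(v, floor), ceiling) for v in row]
        hi, lo = max(clipped), min(clipped)
        if hi / lo <= 5 or hi - lo <= 500:
            continue  # gene filtered out in step (ii)
        kept.append([math.log10(v) for v in clipped])
    return kept

rows = preprocess([[50, 200, 1000], [100, 120, 130]])
# first gene survives (max/min = 10, max - min = 900); second is filtered
```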

94 Multiple hypothesis testing Did you expect that?
[Figures: histogram of the test statistics, and histogram of the p-values computed as 2·(1 − pnorm(abs(teststat))).]

95 Multiple Comparison
p-value: probability of finding a difference equal to or greater than the observed one just by chance under the null hypothesis. A measure of the false positive rate (F/m₀).
The commonly used significance level, 5% (±1.96 s.d.), is arbitrary.
In multiple comparisons, a 5% significance level for each comparison often results in a too large overall significance level.
p-values do not involve the alternative hypothesis.

                   Called significant   Called not significant   Total
Null true          F                    m₀ − F                   m₀
Alternative true   T                    m₁ − T                   m₁
Total              S                    m − S                    m

96 Multiple Comparison (Cont. 1)
Family-wise error rate (FWER): the probability of having at least one false positive in multiple comparisons. Many versions of controlling procedures exist: Bonferroni, Holm (1979), Hochberg (1988), Hommel (1988). Can be too conservative for genomic studies.

Table: FWER (expected number of false positives in parentheses) for different numbers of comparisons N at different α levels; for independent tests, FWER = 1 − (1 − α)^N and the expected count is αN.

α        N = 1          N = 5          N = 10         N = 50         N = 100       N = 1000
0.01     0.010 (0.01)   0.049 (0.05)   0.096 (0.1)    0.395 (0.5)    0.634 (1)     1.000 (10)
0.05     0.050 (0.05)   0.226 (0.25)   0.401 (0.5)    0.923 (2.5)    0.994 (5)     1.000 (50)
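Under independence of the tests, the FWER has the closed form above, which is easy to check; the Bonferroni rule shown alongside is the simplest of the controlling procedures listed:

```python
def fwer(alpha, n):
    """P(at least one false positive) among n independent tests,
    each performed at level alpha: 1 - (1 - alpha)^n."""
    return 1.0 - (1.0 - alpha) ** n

def bonferroni(alpha, n):
    """Per-test level that keeps the family-wise rate at most alpha."""
    return alpha / n

# With 10 tests at 5% each, the family-wise rate is already about 40%:
rate = fwer(0.05, 10)
```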

97 Multiple Comparison (Cont. 2)
False discovery rate (FDR / pFDR): the proportion of hits that are false (F/S). Several versions of controlling procedures exist (Benjamini & Hochberg (1995), Benjamini & Yekutieli (2001)).
A significance measure based on the pFDR: the q-value (Storey & Tibshirani (2003)). The q-value is the minimum false discovery rate that can be attained when calling a feature significant. It requires estimating the proportion of true nulls (m₀/m).
For FDRs estimated using Benjamini's and Storey's approaches, the same cut-off results in different numbers of significant genes. There is no single formula describing which quantities relate to the FDR and how they are related.
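A minimal sketch of the Benjamini–Hochberg step-up procedure cited above, assuming the p-values arrive as a plain list:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg (1995) step-up procedure: sort the m
    p-values, find the largest rank k with p_(k) <= (k/m) * q, and
    reject the hypotheses with the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    rejected = set(order[:k_max])
    return [i in rejected for i in range(m)]

flags = benjamini_hochberg([0.01, 0.02, 0.03, 0.5])
# rejects the three smallest p-values at q = 0.05
```

Note the step-up character: a p-value may exceed its own threshold yet still be rejected if some larger-ranked p-value passes.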

98 Summary This only provides some flavor of probability, statistics and their usage. To learn more, take a full course! Introduction to biostatistics for clinical investigators; Statistical methods for observational studies.

99 References and some useful info
Statistical Methods in Bioinformatics course slides developed by Dr. Christian Gieger and Dr. Achim Tresch
Statistical Methods in Bioinformatics by Warren Ewens and Gregory Grant
Introduction to Statistical Thought by Michael Lavine
The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman
Statistical software package and programming language R


More information

Review of Basic Probability Theory

Review of Basic Probability Theory Review of Basic Probability Theory James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 35 Review of Basic Probability Theory

More information

Statistical Distribution Assumptions of General Linear Models

Statistical Distribution Assumptions of General Linear Models Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions

More information

Master s Written Examination - Solution

Master s Written Examination - Solution Master s Written Examination - Solution Spring 204 Problem Stat 40 Suppose X and X 2 have the joint pdf f X,X 2 (x, x 2 ) = 2e (x +x 2 ), 0 < x < x 2

More information

2. Variance and Covariance: We will now derive some classic properties of variance and covariance. Assume real-valued random variables X and Y.

2. Variance and Covariance: We will now derive some classic properties of variance and covariance. Assume real-valued random variables X and Y. CS450 Final Review Problems Fall 08 Solutions or worked answers provided Problems -6 are based on the midterm review Identical problems are marked recap] Please consult previous recitations and textbook

More information

PCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities

PCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities PCMI 207 - Introduction to Random Matrix Theory Handout #2 06.27.207 REVIEW OF PROBABILITY THEORY Chapter - Events and Their Probabilities.. Events as Sets Definition (σ-field). A collection F of subsets

More information

High-Throughput Sequencing Course

High-Throughput Sequencing Course High-Throughput Sequencing Course DESeq Model for RNA-Seq Biostatistics and Bioinformatics Summer 2017 Outline Review: Standard linear regression model (e.g., to model gene expression as function of an

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

1 INFO Sep 05

1 INFO Sep 05 Events A 1,...A n are said to be mutually independent if for all subsets S {1,..., n}, p( i S A i ) = p(a i ). (For example, flip a coin N times, then the events {A i = i th flip is heads} are mutually

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adopted from Prof. H.R. Rabiee s and also Prof. R. Gutierrez-Osuna

More information

Discrete Probability Refresher

Discrete Probability Refresher ECE 1502 Information Theory Discrete Probability Refresher F. R. Kschischang Dept. of Electrical and Computer Engineering University of Toronto January 13, 1999 revised January 11, 2006 Probability theory

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

Distributions of Functions of Random Variables. 5.1 Functions of One Random Variable

Distributions of Functions of Random Variables. 5.1 Functions of One Random Variable Distributions of Functions of Random Variables 5.1 Functions of One Random Variable 5.2 Transformations of Two Random Variables 5.3 Several Random Variables 5.4 The Moment-Generating Function Technique

More information

Masters Comprehensive Examination Department of Statistics, University of Florida

Masters Comprehensive Examination Department of Statistics, University of Florida Masters Comprehensive Examination Department of Statistics, University of Florida May 6, 003, 8:00 am - :00 noon Instructions: You have four hours to answer questions in this examination You must show

More information

Review of Probabilities and Basic Statistics

Review of Probabilities and Basic Statistics Alex Smola Barnabas Poczos TA: Ina Fiterau 4 th year PhD student MLD Review of Probabilities and Basic Statistics 10-701 Recitations 1/25/2013 Recitation 1: Statistics Intro 1 Overview Introduction to

More information

Introduction to Statistical Analysis

Introduction to Statistical Analysis Introduction to Statistical Analysis Changyu Shen Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology Beth Israel Deaconess Medical Center Harvard Medical School Objectives Descriptive

More information

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics DETAILED CONTENTS About the Author Preface to the Instructor To the Student How to Use SPSS With This Book PART I INTRODUCTION AND DESCRIPTIVE STATISTICS 1. Introduction to Statistics 1.1 Descriptive and

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

1 Probability theory. 2 Random variables and probability theory.

1 Probability theory. 2 Random variables and probability theory. Probability theory Here we summarize some of the probability theory we need. If this is totally unfamiliar to you, you should look at one of the sources given in the readings. In essence, for the major

More information

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur Lecture No. # 36 Sampling Distribution and Parameter Estimation

More information

MIDTERM EXAMINATION (Spring 2011) STA301- Statistics and Probability

MIDTERM EXAMINATION (Spring 2011) STA301- Statistics and Probability STA301- Statistics and Probability Solved MCQS From Midterm Papers March 19,2012 MC100401285 Moaaz.pk@gmail.com Mc100401285@gmail.com PSMD01 MIDTERM EXAMINATION (Spring 2011) STA301- Statistics and Probability

More information