Probability
George Nicholson and Chris Holmes
31st October 2008
This practical focuses on understanding probabilistic and statistical concepts using simulation and plots in R. It begins with an introduction to writing loops and functions in R.

Loops in R

This section gives a very brief introduction to writing loops in R.

Type in the following code and see what it does.

sum=0
for(i in 1:10){
  sum=sum+i
  print(paste("loop ",i,", sum = ",sum,sep=""))
}

Now try this:

for(mychar in letters){
  print(paste("loop ",mychar,sep=""))
}

Define a vector of characters (e.g. cv=c("s","g","u")) and write a loop that will calculate the sum of the letter indices of cv (i.e. 19+7+21=47). (Think about using the function match() in your loop.) Can you think of an automated way of doing this calculation without the loop?

Note: you should try whenever possible to avoid using loops in R, as they are a relatively slow way to do computations. The way to work efficiently is to vectorise calculations (e.g. use matrix multiplication) and to use specialised R functions that can do iterative calculations fast (e.g. rowSums(), ifelse()).
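One possible answer to the letter-index exercise is sketched below, in both loop and vectorised form. The vector cv and the use of match() are as suggested in the text; the variable names are otherwise arbitrary choices.

```r
# Letter-index sum for cv = c("s","g","u"): s=19, g=7, u=21, so the total is 47.
cv <- c("s", "g", "u")

# Loop version, as asked for in the exercise.
total <- 0
for (ch in cv) {
  total <- total + match(ch, letters)
}
print(total)  # 47

# Vectorised version: match() accepts a whole vector, so no loop is needed.
print(sum(match(cv, letters)))  # 47
```

The vectorised call is both shorter and faster, which is exactly the point of the note about avoiding loops.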
Functions in R

A function in R is defined using the following syntax:

funname=function(argument1, argument2, etc.){
  ...code here...
  listout=list(outname1=object1, outname2=object2, etc.)
  return(listout)
}

You only need to create a list to return if you want to return more than one R object at a time (otherwise you can just use return(object)). First, let's create a function that takes, as arguments, two numbers, and returns the first number raised to the power of the second number:

pow=function(x,p){
  out=x^p
  return(out)
}

Play around with the function pow. What happens if the argument x is a vector? What about p?

Now create a function whose arguments are two numbers, m and n say; the function must return two objects, named div and rem, with the property that m divides n div times with remainder rem (e.g. 5 divides 7 once leaving remainder 2).

divide=function(m,n){
  div=floor(n/m)
  rem=n-m*div
  return(list(div=div,rem=rem))
}

Now write a function which takes a single argument, x say (a numeric vector). The function must calculate the sample mean and variance of x using the following formulae:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2

where n is the length of x. Return the objects mean and variance in a list.
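A possible solution to the mean/variance exercise is sketched below. The output names mean and variance are as requested in the text; the function name meanvar is my own choice.

```r
# Sample mean and variance of a numeric vector x, computed from the
# formulae above (note the n-1 denominator in the variance).
meanvar = function(x){
  n = length(x)
  xbar = sum(x)/n                     # sample mean
  s2 = sum((x - xbar)^2)/(n - 1)      # sample variance
  return(list(mean = xbar, variance = s2))
}

out = meanvar(c(2, 4, 6, 8))
print(out$mean)      # 5
print(out$variance)  # 20/3; compare with the built-ins mean() and var()
```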
Simulation and probability in R

These first few questions are designed to get you to think about natural random variation and how things change with sample size.

1. Generate n random draws from a standard Gaussian distribution. Plot a histogram of the data. Overlay the histogram with the density function of a standard Gaussian; first look at ?dnorm and understand what this function does.

xpl=seq(-5,5,l=10000)
lines(x=xpl,y=dnorm(xpl))

Repeat 6 times. Open a graphics device, and define a 2x3 layout using the argument mfrow to the par function. Plot the 6 histograms in this graphics device. Export the plots into a .pdf file using the savePlot() function. How does the appearance of the histograms change with n? For this question you should see that for small sample sizes there is much greater variation in the qualitative shapes of the distributions.

2. Repeat the previous task in its entirety, but now create Q-Q plots instead of histograms; superimpose the line y = x on each plot using abline().

3. Generate n draws from a standard Gaussian distribution; calculate the mean of the n values. Repeat the procedure of the previous sentence m times, storing the m means in a vector, mv say. Standardise mv (i.e. subtract its mean and divide by its standard deviation). Plot the standardised vector mv in a histogram and overlay with the density of a standard Gaussian. What do you observe? Vary n and m; what happens as they increase? Can you spot a pattern? What you'll see is that the standard deviation of the distribution of means decreases with sample size. In fact the standard deviation of the distribution of means decreases as σ/√n. That is, as the sample size increases, the variance of the distribution of means decreases. This is a very important point, as in hypothesis testing we'll be interested in answering questions of the form: what is the chance that the true (unknown) mean is zero?
So knowing this, as the sample size increases we can expect the estimated mean to be contained in a region which is shrinking around zero.

4. Repeat the previous exercise, but use the chi-squared distribution with one degree of freedom as the generating distribution (rchisq()). (You should remind yourself what this distribution looks like by plotting a histogram of some simulated data before setting out.)
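A minimal sketch of the simulation in questions 3 and 4 might look like the following. The names n, m and mv are as in the text; set.seed() is added here only to make the run reproducible, and the particular values of n and m are arbitrary.

```r
set.seed(1)
n <- 20    # sample size per draw
m <- 1000  # number of repeated samples

# m means of n standard-Gaussian draws (replicate() avoids an explicit loop).
mv <- replicate(m, mean(rnorm(n)))

# Standardise: subtract the mean and divide by the standard deviation.
mv_std <- (mv - mean(mv)) / sd(mv)

# Histogram on the density scale with the standard Gaussian density overlaid.
hist(mv_std, freq = FALSE, main = "Standardised sample means")
xpl <- seq(-5, 5, l = 10000)
lines(x = xpl, y = dnorm(xpl))

# For question 4, swap rnorm(n) for rchisq(n, df = 1) in the call above.
```

Note the freq = FALSE argument to hist(): without it the histogram is on the count scale and the overlaid density will not line up.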
This is a demonstration of the central limit theorem (CLT), an amazing result. The CLT states that it does not matter what the underlying probability distribution of the samples is: the distribution of the means of samples from the population tends to a normal density as the sample size increases. Please ask me if this is not clear.

5. Generate n draws from a standard Gaussian; save the 0.1 quantile (10th percentile); repeat this m times, storing the m values in a vector, mv say; plot a histogram of mv and add a line indicating the theoretical 0.1 quantile (find this using qnorm). Repeat the previous sentence's procedure for the median and for the mean. Boxplot the distribution of the means and the distribution of the medians. What do you see? We see that the distribution of the means and the distribution of the medians have the same central location. To put it another way, the median is an unbiased estimate of the mean. However, the distribution of medians has greater variance (around 1.25 times greater, as I recall). The distribution of the 0.1 quantile should have greater variability still, as there is less information to estimate it.

6. Write a function with arguments n, m and alpha. The function should generate a vector of length m, mv say, each element of which is the alpha quantile of a random sample of size n from a standard Gaussian. The function should create a histogram of mv, label it appropriately, and return mv. Note: you've just written a function to plot the distribution of the order statistics! Compare the empirical distribution of mv with the distribution of a Gaussian centred at the mean value and scaled by the standard deviation of your quantile estimates. What happens as n and m become large?

7. Generate a pair of independent draws from a standard Gaussian distribution. Write a function which simulates n pairs and then calculates the proportion of pairs for which (i) both members are smaller than alpha, (ii) either member is smaller than alpha.
Your function should return these two proportions in a list. Calculate the exact (theoretical) probability of these events and compare your accuracy as n changes. Amend your function so that the above is repeated m times, plots a histogram of each of the two vectors of length m, and returns these two vectors in a list. Finally, can you repeat this whole procedure, but now with triples rather than pairs? This example will start to use the notions of probability and probability calculus discussed in the lecture. Note: if two random variables are independent we know that

Pr(X & Y) = Pr(X | Y) Pr(Y) = Pr(X) Pr(Y)
We also have, from the third axiom of probability:

Pr(X or Y) = Pr(X) + Pr(Y) - Pr(X & Y)

8. Generate a pair of samples from a multivariate normal density with correlation rho=0.6. Scatterplot a large number and investigate what happens as rho changes. Write a function that uses simulation to estimate the conditional probability Pr(X > 0.6 | Y > 0.6) and returns the estimate, along with the marginal estimate of Pr(X > 0.6). Generalise your function to take rho, alpha and beta as arguments, and to estimate Pr(X > α | Y > β) and Pr(X > α), where each pair (X, Y) is drawn from a multivariate normal density with correlation rho. Investigate the dependence as rho is altered. Try negative values such as rho=-0.6. As the value of rho increases so does the level of dependence, and Pr(X | Y) should look increasingly different to Pr(X).

9. Using very large sample sizes (say 100,000 samples) draw correlated pairs of multivariate Gaussian observations, (X, Y). Store X only if Y is in some small range (e.g. store X only if 0.6 < Y < 0.61). Explore the resulting distribution of the stored X values. What do you see? Here we are attempting to generate the conditional distribution Pr(X | Y = 0.605) via simulation. Interestingly, this distribution should be Gaussian; that is, the conditional distribution of a multivariate Gaussian is also Gaussian. This result does not hold for all dependent (non-independent) multivariate distributions.

10. Generate n random draws from a standard Gaussian distribution. Store this in a vector z. Input

xpl=seq(-5,5,l=10000)
fn=ecdf(x=z)

Try to work out what the object fn is. What does fn(xpl) return? Plot the empirical CDF of the data:

plot(x=xpl,y=fn(xpl),type="l")

Superimpose the theoretical CDF of a standard Gaussian distribution (see pnorm). This question is to get you used to thinking about the distribution function F(x) = Pr(X <= x) and to see that, as for questions (1) and (2), it is affected by sample size.
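For question 8, correlated Gaussian pairs can be generated without any extra packages by mixing two independent standard normals. The sketch below is one possible approach, not the only one (MASS::mvrnorm is a common alternative); the function name cond_prob is my own, while rho, alpha and beta follow the text.

```r
set.seed(2)

# Estimate Pr(X > alpha | Y > beta) and Pr(X > alpha) by simulation,
# where (X, Y) are standard bivariate normal with correlation rho.
cond_prob <- function(n, rho, alpha, beta) {
  y <- rnorm(n)
  # X = rho*Y + sqrt(1-rho^2)*Z has cor(X, Y) = rho when Z is independent N(0,1).
  x <- rho * y + sqrt(1 - rho^2) * rnorm(n)
  conditional <- mean(x[y > beta] > alpha)  # Pr(X > alpha | Y > beta)
  marginal <- mean(x > alpha)               # Pr(X > alpha)
  return(list(conditional = conditional, marginal = marginal))
}

out <- cond_prob(n = 100000, rho = 0.6, alpha = 0.6, beta = 0.6)
print(out$conditional)  # noticeably larger than the marginal when rho > 0
print(out$marginal)     # close to 1 - pnorm(0.6), about 0.274
```

Setting rho = 0 makes the two estimates agree (up to simulation noise), which is the independence case from the note above.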
11. Write a function g that takes as arguments two numeric vectors, u and v say; it must return an object, w say, the same length as u, with the ith element of w equal to the proportion of elements in v that are less than or equal to the ith element of u. Generate n random draws from a standard Gaussian distribution and store in a vector z. Then run

gout=g(u=xpl,v=z)
fn=ecdf(x=z)
fout=fn(xpl)

Compare fout and gout. If they're the same, you've written a function to evaluate the empirical CDF of a sample of data, v, at a set of points, u!

Hypothesis Testing

In the examples above we've considered random variation that naturally occurs when we sample from a population. In this section we will look at random variation that occurs when we look for differences between samples drawn from two populations; that is, when testing for differences. The next couple of questions are aimed at getting you thinking about p-values under the null (when there is no change in distribution between two treatments/experiments/categories).

1. Generate two sets of 50 values each from a standard normal N(0,1); that is, they have the same distribution. Use t.test() to test for differences in the means. Write a for loop to repeat the test 1000 times (with different data sets drawn from rnorm() each time), storing the p-values from the 1000 tests. Plot a histogram of the p-values and Q-Q plot them against a uniform density (note: you can approximate the theoretical ith quantile of a uniform by i/(n+1)). What percentage of your p-values fall below 0.05? Is that what you expect?

2. Perform question (1) above but now using 100 samples for each set. Histogram the p-values from 1000 repeats and see how many fall below 0.05. What do you expect to see?

3. Repeat (1) and (2) but now using a chi-squared distribution to draw the samples. Is the t-test robust to changes in distribution?

4. Generate 50 points from N(0, 1) and 50 points from N(mu, 1) with, say, mu = 0.4: y <- rnorm(50) + mu.
Repeat question (1) above and plot the histograms of p-values. What percentage of p-values fall below 0.05? Compare your result with that given by power.t.test(100,mu,1,0.1).
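Question 1's simulation of p-values under the null might be sketched as follows. The sample sizes and the 1000 repeats come from the text; set.seed() is added only for reproducibility.

```r
set.seed(3)
nrep <- 1000
pvals <- numeric(nrep)

for (i in 1:nrep) {
  x <- rnorm(50)                      # both groups from the same N(0,1)...
  y <- rnorm(50)
  pvals[i] <- t.test(x, y)$p.value    # ...so the null is true by design
}

hist(pvals)         # should look roughly uniform on [0, 1]
mean(pvals < 0.05)  # should be close to 0.05

# Q-Q plot against the approximate uniform quantiles i/(n+1).
qqplot((1:nrep)/(nrep + 1), pvals); abline(0, 1)
```

The fact that about 5% of null p-values fall below 0.05 is exactly what the significance level promises: the test rejects a true null 5% of the time.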
5. Permutation testing is a great way to think about how we can explore the natural variation in a test result that occurs purely by chance when the null is true. To do a permutation we randomly swap points between the two sets of samples that we're testing. Having done this we know (by design) that there is no association between the class labels and the measurements (think about this!). Hence, any association we do see is purely by chance. To demonstrate the principle, perform the following. Generate 50 points from N(0, 1), x<-rnorm(50), and 50 points from N(mu, 1), y<-rnorm(50)+mu. Swap points from X to Y at random. Suppose you have data stored in x and y; then the following code will shuffle the points across to create xnew and ynew:

n <- length(x)
t <- rnorm(n)
indx <- t < 0
indx_2 <- t > 0
xnew <- x[indx]
ynew <- x[indx_2]
t <- rnorm(n)
indx <- t < 0
indx_2 <- t > 0
xnew <- c(xnew, y[indx])
ynew <- c(ynew, y[indx_2])

Use t.test to test for association. Repeat the above 1000 times and plot the distribution of the p-values.

6. Repeat the task in (5), but this time store the standardised difference between the sample means at each iteration. You can use the code below to calculate the standardised difference in means.

mu_x <- mean(xnew)
mu_y <- mean(ynew)
n <- length(xnew)
grouped_standard_deviation <- sqrt(((n-1)*var(xnew) + (n-1)*var(ynew))/(2*n-2))
mu_dif <- (mu_x - mu_y) / (grouped_standard_deviation * sqrt(2/n))

Plot the distribution of standardised mean differences using histogram and Q-Q plots. What is the distribution?
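An alternative, and arguably simpler, way to permute the labels is to pool the data and reallocate it with sample(). This is not the code given above, just one possible sketch of the repeated permutation test in question 5:

```r
set.seed(4)
mu <- 0.4
x <- rnorm(50)
y <- rnorm(50) + mu

pooled <- c(x, y)
nrep <- 1000
perm_pvals <- numeric(nrep)

for (i in 1:nrep) {
  shuffled <- sample(pooled)    # random reallocation of all 100 points
  xnew <- shuffled[1:50]
  ynew <- shuffled[51:100]
  perm_pvals[i] <- t.test(xnew, ynew)$p.value
}

hist(perm_pvals)  # roughly uniform: the shuffled labels carry no information
```

A design note: sample() guarantees the two permuted groups stay the same size (50 each), whereas the sign-of-rnorm trick in the text gives groups of random size.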
Multiple Testing

These questions will use R to explore issues in multiple testing by simulating repeated tests for association. The situation we consider is where we've performed T tests, say on T genetic markers, and we're interested in the evidence for association provided by the top hits (those with the lowest p-values). As before we need to look at the distribution of the test statistic when the global null holds (i.e. none of the effects are true), and as before we can simulate this situation using R.

1. Generate n = 50 points for two groups X and Y under the null that they have no difference in means. Use rnorm(n=50) for both X and Y. Repeat m = 1000 times, storing the difference in means (x̄ − ȳ) each time, and histogram the results.

2. Suppose we have now performed T = 1000 tests, each with n = 50 individuals in two groups X and Y. We're interested in the distribution of the mean differences under the global null. Generate 50 points for X and Y from a standard normal, rnorm(50), and calculate the difference in means. Store the result and repeat this T = 1000 times for the T tests (e.g. mimicking T genetic markers). Then store the MAXIMUM of the differences in means (across the T = 1000 experiments). Repeat this whole procedure m = 1000 times, storing the maximum each time. Histogram the m = 1000 maxima and compare to the distribution you get when you set T = 1, i.e. only a single experiment. What changes? Hint: you should write a function that takes inputs T, n and m, plots the histogram, and returns the vector (of length m) of maxima.

3. Repeat Q(2) above but now add in a true effect for one of the T = 1000 tests, say the first test. For the first test generate X from rnorm(n,0,1) and Y from rnorm(n,0.4,1) (i.e. µ = 0.4), with all other tests having no difference. (i) Calculate how often in the m = 1000 simulations the true effect is the top hit. (ii) Plot the distribution (histogram) of the ranking of the true effect in the sorted list of T = 1000 tests.
(iii) Investigate the dependency by changing µ, T and n. How do µ, T and n affect the multiple testing problem? Note: you should generalise your function in Q(2) to do this.

4. Repeat Q(3) but now store the difference in means for the true effect (x̄ − ȳ) as well as the maximum of the differences in means for the T − 1 null tests. Repeat m = 1000 times. Plot the histograms of the maxima and of the true effect. Explore what happens as T increases. Explore what happens as n, the sample size, increases.
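The function hinted at in Q(2) might be sketched as below. The name max_mean_diff and the use of replicate() are my own choices, and the argument is written Tn rather than T to avoid clashing with R's shorthand T for TRUE.

```r
set.seed(5)

# For each of m repetitions: run Tn null comparisons of two n-point Gaussian
# samples, record the difference in means for each, and keep the maximum.
max_mean_diff <- function(Tn, n, m) {
  maxima <- replicate(m, {
    diffs <- replicate(Tn, mean(rnorm(n)) - mean(rnorm(n)))
    max(diffs)
  })
  hist(maxima, main = paste("Maxima over", Tn, "tests"))
  return(maxima)
}

mv1 <- max_mean_diff(Tn = 1, n = 50, m = 1000)       # centred at zero
mv1000 <- max_mean_diff(Tn = 1000, n = 50, m = 200)  # shifted well above zero
```

The shift of the maxima away from zero as Tn grows is the multiple testing problem in miniature: the largest of many null statistics looks impressive even though no effect is real.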
EECS 70 Discrete Mathematics and Probability Theory Fall 204 Anant Sahai Note 5 Random Variables: Distributions, Independence, and Expectations In the last note, we saw how useful it is to have a way of
More informationMath 494: Mathematical Statistics
Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/
More informationChapter 27 Summary Inferences for Regression
Chapter 7 Summary Inferences for Regression What have we learned? We have now applied inference to regression models. Like in all inference situations, there are conditions that we must check. We can test
More information9/2/2010. Wildlife Management is a very quantitative field of study. throughout this course and throughout your career.
Introduction to Data and Analysis Wildlife Management is a very quantitative field of study Results from studies will be used throughout this course and throughout your career. Sampling design influences
More informationCS Homework 3. October 15, 2009
CS 294 - Homework 3 October 15, 2009 If you have questions, contact Alexandre Bouchard (bouchard@cs.berkeley.edu) for part 1 and Alex Simma (asimma@eecs.berkeley.edu) for part 2. Also check the class website
More informationGlossary for the Triola Statistics Series
Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling
More informationPhysics 509: Non-Parametric Statistics and Correlation Testing
Physics 509: Non-Parametric Statistics and Correlation Testing Scott Oser Lecture #19 Physics 509 1 What is non-parametric statistics? Non-parametric statistics is the application of statistical tests
More information6348 Final, Fall 14. Closed book, closed notes, no electronic devices. Points (out of 200) in parentheses.
6348 Final, Fall 14. Closed book, closed notes, no electronic devices. Points (out of 200) in parentheses. 0 11 1 1.(5) Give the result of the following matrix multiplication: 1 10 1 Solution: 0 1 1 2
More informationConfidence intervals
Confidence intervals We now want to take what we ve learned about sampling distributions and standard errors and construct confidence intervals. What are confidence intervals? Simply an interval for which
More informationDiscrete Mathematics for CS Spring 2006 Vazirani Lecture 22
CS 70 Discrete Mathematics for CS Spring 2006 Vazirani Lecture 22 Random Variables and Expectation Question: The homeworks of 20 students are collected in, randomly shuffled and returned to the students.
More informationSociology 6Z03 Review II
Sociology 6Z03 Review II John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review II Fall 2016 1 / 35 Outline: Review II Probability Part I Sampling Distributions Probability
More informationBootstrap. ADA1 November 27, / 38
The bootstrap as a statistical method was invented in 1979 by Bradley Efron, one of the most influential statisticians still alive. The idea is nonparametric, but is not based on ranks, and is very computationally
More informationBootstrap tests. Patrick Breheny. October 11. Bootstrap vs. permutation tests Testing for equality of location
Bootstrap tests Patrick Breheny October 11 Patrick Breheny STA 621: Nonparametric Statistics 1/14 Introduction Conditioning on the observed data to obtain permutation tests is certainly an important idea
More information3. Shrink the vector you just created by removing the first element. One could also use the [] operators with a negative index to remove an element.
BMI 713: Computational Statistical for Biomedical Sciences Assignment 1 September 9, 2010 (due Sept 16 for Part 1; Sept 23 for Part 2 and 3) 1 Basic 1. Use the : operator to create the vector (1, 2, 3,
More informationCHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)
FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter
More informationFundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur
Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new
More informationInference for Single Proportions and Means T.Scofield
Inference for Single Proportions and Means TScofield Confidence Intervals for Single Proportions and Means A CI gives upper and lower bounds between which we hope to capture the (fixed) population parameter
More informationMath 180B Problem Set 3
Math 180B Problem Set 3 Problem 1. (Exercise 3.1.2) Solution. By the definition of conditional probabilities we have Pr{X 2 = 1, X 3 = 1 X 1 = 0} = Pr{X 3 = 1 X 2 = 1, X 1 = 0} Pr{X 2 = 1 X 1 = 0} = P
More informationStat 5101 Lecture Notes
Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random
More informationAsymptotic Statistics-VI. Changliang Zou
Asymptotic Statistics-VI Changliang Zou Kolmogorov-Smirnov distance Example (Kolmogorov-Smirnov confidence intervals) We know given α (0, 1), there is a well-defined d = d α,n such that, for any continuous
More informationNaive Bayes classification
Naive Bayes classification Christos Dimitrakakis December 4, 2015 1 Introduction One of the most important methods in machine learning and statistics is that of Bayesian inference. This is the most fundamental
More informationDiscrete Mathematics and Probability Theory Fall 2013 Vazirani Note 12. Random Variables: Distribution and Expectation
CS 70 Discrete Mathematics and Probability Theory Fall 203 Vazirani Note 2 Random Variables: Distribution and Expectation We will now return once again to the question of how many heads in a typical sequence
More informationEE/CpE 345. Modeling and Simulation. Fall Class 9
EE/CpE 345 Modeling and Simulation Class 9 208 Input Modeling Inputs(t) Actual System Outputs(t) Parameters? Simulated System Outputs(t) The input data is the driving force for the simulation - the behavior
More informationSection 3. Measures of Variation
Section 3 Measures of Variation Range Range = (maximum value) (minimum value) It is very sensitive to extreme values; therefore not as useful as other measures of variation. Sample Standard Deviation The
More information2.830J / 6.780J / ESD.63J Control of Manufacturing Processes (SMA 6303) Spring 2008
MIT OpenCourseWare http://ocw.mit.edu 2.830J / 6.780J / ESD.63J Control of Processes (SMA 6303) Spring 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationStat 135, Fall 2006 A. Adhikari HOMEWORK 6 SOLUTIONS
Stat 135, Fall 2006 A. Adhikari HOMEWORK 6 SOLUTIONS 1a. Under the null hypothesis X has the binomial (100,.5) distribution with E(X) = 50 and SE(X) = 5. So P ( X 50 > 10) is (approximately) two tails
More informationCOMP6463: λ-calculus
COMP6463: λ-calculus 1. Basics Michael Norrish Michael.Norrish@nicta.com.au Canberra Research Lab., NICTA Semester 2, 2015 Outline Introduction Lambda Calculus Terms Alpha Equivalence Substitution Dynamics
More informationCIS519: Applied Machine Learning Fall Homework 5. Due: December 10 th, 2018, 11:59 PM
CIS59: Applied Machine Learning Fall 208 Homework 5 Handed Out: December 5 th, 208 Due: December 0 th, 208, :59 PM Feel free to talk to other members of the class in doing the homework. I am more concerned
More informationFoundations of Probability and Statistics
Foundations of Probability and Statistics William C. Rinaman Le Moyne College Syracuse, New York Saunders College Publishing Harcourt Brace College Publishers Fort Worth Philadelphia San Diego New York
More informationContents. Acknowledgments. xix
Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables
More informationStatistics Introductory Correlation
Statistics Introductory Correlation Session 10 oscardavid.barrerarodriguez@sciencespo.fr April 9, 2018 Outline 1 Statistics are not used only to describe central tendency and variability for a single variable.
More informationQuick Sort Notes , Spring 2010
Quick Sort Notes 18.310, Spring 2010 0.1 Randomized Median Finding In a previous lecture, we discussed the problem of finding the median of a list of m elements, or more generally the element of rank m.
More informationCOMP6053 lecture: Sampling and the central limit theorem. Markus Brede,
COMP6053 lecture: Sampling and the central limit theorem Markus Brede, mb8@ecs.soton.ac.uk Populations: long-run distributions Two kinds of distributions: populations and samples. A population is the set
More informationLecture 10: Powers of Matrices, Difference Equations
Lecture 10: Powers of Matrices, Difference Equations Difference Equations A difference equation, also sometimes called a recurrence equation is an equation that defines a sequence recursively, i.e. each
More informationFall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.
1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n
More information1 A brief primer on probability distributions
Inference, Models and Simulation for Comple Systems CSCI 7- Lecture 3 August Prof. Aaron Clauset A brief primer on probability distributions. Probability distribution functions A probability density function
More information