probability George Nicholson and Chris Holmes 29th October 2008


This practical focuses on understanding probabilistic and statistical concepts using simulation and plots in R. It begins with an introduction to writing loops and functions in R.

Loops in R

This section gives a very brief introduction to writing loops in R [1]. Type in the following code and see what it does.

    sum=0
    for(i in 1:10){
      sum=sum+i
      print(paste("loop ",i,", sum = ",sum,sep=""))
    }

Now try this:

    for(mychar in letters){
      print(paste("loop ",mychar,sep=""))
    }

Define a vector of characters (e.g. cv=c("s","g","u")) and write a loop that will calculate the sum of the letter indices of cv (i.e. 19+7+21=47). (Think about using the function match() in your loop.) Can you think of an automated way of doing this calculation without the loop?

[1] You should try whenever possible to avoid using loops in R, as they are a relatively slow way to do computations. The way to work efficiently is to vectorise calculations (e.g. use matrix multiplication), and to use specialized R functions that can do iterative calculations fast (e.g. rowSums(), ifelse()).
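To make the advice about vectorising concrete, here is a minimal sketch comparing a loop with two vectorised alternatives for computing row sums. The matrix `mat` and the variable names are hypothetical and chosen only for illustration; the seed is arbitrary.

```r
# Hypothetical example data: a 1000 x 5 matrix of standard Gaussian draws
set.seed(1)
mat <- matrix(rnorm(5000), nrow = 1000, ncol = 5)

# Loop version: compute each row sum one at a time
loop_sums <- numeric(nrow(mat))
for (i in seq_len(nrow(mat))) {
  loop_sums[i] <- sum(mat[i, ])
}

# Vectorised versions: a specialised function, and matrix multiplication
vec_sums <- rowSums(mat)
mm_sums <- as.vector(mat %*% rep(1, ncol(mat)))

# Check that all three approaches agree up to floating-point error
print(all.equal(loop_sums, vec_sums))
print(all.equal(loop_sums, mm_sums))
```

Wrapping each version in system.time() is one way to see the difference in speed as the matrix grows.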

Functions in R

A function in R is defined using the following syntax:

    funname=function(argument1, argument2, etc.){
      ...code here...
      listout=list(outname1=object1, outname2=object2, etc.)
      return(listout)
    }

You only need to create a list to return if you want to return more than one R object at a time (otherwise you can just use return(object)). First, let's create a function that takes, as arguments, two numbers, and returns the first number raised to the power of the second number:

    pow=function(x,p){
      out=x^p
      return(out)
    }

Play around with the function pow. What happens if the argument x is a vector? What about p?

Now create a function whose arguments are two numbers, m and n say; the function must return two objects, named div and rem, with the property that m divides n div times with remainder rem (e.g. 5 divides 7 once leaving remainder 2). Because return() accepts only a single object, the two results are returned in a list:

    divide=function(m,n){
      div=floor(n/m)
      rem=n-m*div
      return(list(div=div,rem=rem))
    }

Now write a function which takes a single argument, x say (a numeric vector). The function must calculate the sample mean and variance of x using the following formulae:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

where n is the length of x. Return the objects mean and variance in a list.
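The two formulae can be translated into R almost term by term. The following is one possible sketch, not the only way to do it; the function name `mean_var` and the example vector are hypothetical, and the result is checked against R's built-in mean() and var().

```r
# One possible implementation of the sample mean and variance formulae.
# The name mean_var is hypothetical, not taken from the practical.
mean_var <- function(x) {
  n <- length(x)
  xbar <- sum(x) / n                   # sample mean
  s2 <- sum((x - xbar)^2) / (n - 1)    # sample variance, n - 1 denominator
  list(mean = xbar, variance = s2)
}

# Hypothetical example data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
out <- mean_var(x)
print(out$mean)
print(out$variance)

# Agreement with R's built-in functions
print(all.equal(out$mean, mean(x)))
print(all.equal(out$variance, var(x)))
```

Elements of the returned list are accessed by name with the `$` operator, as in out$mean.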

Simulation and probability in R

1. Generate n random draws from a standard Gaussian distribution. Plot a histogram of the data (on the density scale, e.g. with freq=FALSE, so that it is comparable with a density curve). Overlay the histogram with the density function of a standard Gaussian; first look at ?dnorm and understand what this function does.

    xpl=seq(-5,5,length.out=10000)
    lines(x=xpl,y=dnorm(xpl))

Repeat 6 times (for example, with different values of n). Open a graphics device, and define a 2 x 3 layout using the argument mfrow to the par function. Plot the 6 histograms in this graphics device. Export the plots into a .pdf file using the savePlot() function. How does the appearance of the histograms change with n?

2. Repeat the previous task in its entirety, but now create Q-Q plots instead of histograms; superimpose the line y = x on each plot using abline().

3. Generate n draws from a standard Gaussian distribution; calculate the mean of the n values. Repeat this procedure m times, storing the m means in a vector, mv say. Standardise mv (i.e. subtract its mean and divide by its standard deviation). Plot the standardised vector mv in a histogram and overlay with the density of a standard Gaussian. What do you observe? Vary n and m; what happens as they increase? Can you spot a pattern?

4. Repeat the previous exercise, but use the chi-squared distribution with one degree of freedom as the generating distribution (rchisq()). (You may want to remind yourself what this distribution looks like by plotting a histogram of some simulated data before setting out.)

5. Generate n draws from a standard Gaussian; save the 0.1 quantile (10th percentile); repeat this m times, storing the m values in a vector, mv say; plot a histogram of mv and add a line indicating the theoretical 0.1 quantile (find this using qnorm). Repeat this procedure for the median and for the mean. Boxplot the distribution of the means and the distribution of the medians. What do you see?

6. Write a function with arguments n, m and alpha. The function should generate a vector of length m, mv say, each element of which is the alpha quantile of a random sample of size n from a standard Gaussian. The function should create a histogram of mv, label it appropriately, and return mv. Note: you've just written a function to plot the distribution of the order statistics! Compare the empirical distribution of mv with the distribution of a standard Gaussian. What happens as n and m become large?
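As an illustration of the repeated-sampling idea in exercises 3 and 4, the sketch below simulates m standardised sample means and compares their empirical quantiles with those of a standard Gaussian, rather than plotting. The helper `sim_std_means`, its argument `rgen`, the chosen values of n and m, and the seed are all assumptions made for this example.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical helper: m standardised sample means, each from n draws of rgen
sim_std_means <- function(n, m, rgen = rnorm) {
  mv <- replicate(m, mean(rgen(n)))
  (mv - mean(mv)) / sd(mv)
}

# Chi-squared (1 degree of freedom) generating distribution, small and large n
rchisq1 <- function(k) rchisq(k, df = 1)
z_small <- sim_std_means(n = 5, m = 5000, rgen = rchisq1)
z_large <- sim_std_means(n = 500, m = 5000, rgen = rchisq1)

# Compare empirical quantiles with the standard Gaussian quantiles
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
print(round(rbind(gaussian = qnorm(probs),
                  n5 = quantile(z_small, probs),
                  n500 = quantile(z_large, probs)), 2))
```

The same vectors can be passed to hist() with freq=FALSE and overlaid with dnorm() to produce the plots the exercises ask for.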

7. Generate a pair of independent draws from a standard Gaussian distribution. Write a function which simulates n pairs and then calculates the proportion of pairs for which (i) both members are smaller than alpha, (ii) either member is smaller than alpha. Your function should return these two proportions in a list. Calculate the exact (theoretical) probability of these events and compare your accuracy as n changes. Amend your function so that the above is repeated m times, plots a histogram of each of the two vectors of length m, and returns these two vectors in a list. Finally, can you repeat this whole procedure, but now with triples rather than pairs?

8. Generate a pair of samples from a multivariate normal density with correlation rho=0.6. Scatterplot a large number of such pairs. Write a function that uses simulation to estimate the conditional probability Pr(X > 0.6 | Y > 0.6) and returns the estimate, along with the marginal estimate of Pr(X > 0.6). Generalise your function to take rho, alpha and beta as arguments, and to estimate Pr(X > α | Y > β) and Pr(X > α), where each pair (X, Y) is drawn from a multivariate normal density with correlation rho.

9. Using very large sample sizes (say 100,000 samples) draw correlated pairs of multivariate Gaussian observations, (X, Y). Store X only if Y is in some small range (e.g. store X only if 0.6 < Y < 0.61). Explore the resulting distribution of the stored X values. What do you see?

10. Generate n random draws from a standard Gaussian distribution. Store this in a vector z. Input

    xpl=seq(-5,5,length.out=10000)
    fn=ecdf(x=z)

Try to work out what the object fn is. What does fn(xpl) return? Plot the empirical CDF of the data:

    plot(x=xpl,y=fn(xpl),type="l")

Superimpose the theoretical CDF of a standard Gaussian distribution (see pnorm).

11. Write a function g that takes as arguments two numeric vectors, u and v say; it must return an object, w say, the same length as u, with the ith element of w equal to the proportion of elements in v that are less than or equal to the ith element of u. Generate n random draws from a standard Gaussian distribution and store them in a vector z. Then run

    gout=g(u=xpl,v=z)

    fn=ecdf(x=z)
    fout=fn(xpl)

Compare fout and gout. If they're the same, you've written a function to evaluate the empirical CDF of a sample of data, v, at a set of points, u!

Hypothesis Testing

In the examples above we've considered the random variation that naturally occurs when we sample from a population. In this section we will look at the random variation that occurs when we look for differences between samples drawn from two populations, that is, when testing for differences. The next couple of questions are aimed at getting you thinking about p-values under the null (when there is no change in distribution between two treatments/experiments/categories).

1. Generate two sets of 50 values each from a standard normal N(0, 1), so that they have the same distribution. Use t.test() to test for differences in the means. Write a for loop to repeat the test 1000 times (with different data sets drawn from rnorm() each time), storing the p-values from the 1000 tests. Plot a histogram of the p-values and Q-Q plot them against a uniform distribution (note: you can approximate the theoretical ith quantile of a uniform by i/(n+1)). What percentage of your p-values fall below 0.05? Is that what you expect?

2. Perform question (1) above but now using 100 samples for each set. Histogram the p-values from 1000 repeats and see how many fall below the same threshold. What do you expect to see?

3. Repeat (1) and (2) but now using a chi-squared distribution to draw the samples. Is the t-test robust to changes in distribution?

4. Generate 50 points from N(0, 1) and 50 points from N(µ, 1) with, say, µ = 0.4: y <- rnorm(50) + mu. Repeat question (1) above and plot the histograms of p-values. What percentage of p-values fall below 0.05? Compare your result with that given by power.t.test(100, mu, 1, 0.1). Note that the arguments of power.t.test are, in order, n (the number of observations per group), delta, sd and sig.level, so a call matching the simulation above would use n=50 and sig.level=0.05.

5. Permutation testing is a great way to think about how we can explore the natural variation in a test result that occurs purely by chance when the null is true. To do a permutation we randomly swap points between the two sets of samples that we're testing. Having done this we know (by design) that there is no association between the class labels and the measurements (think about this!). Hence, any association we do see is purely by chance. To demonstrate the principle, perform the following: generate 50 points from N(0, 1), x <- rnorm(50), and 50 points from N(µ, 1), y <- rnorm(50) + mu. Swap points from X to Y at random. Suppose you have data stored in x and y; then the following code will shuffle the points across to create xnew and ynew:

    n <- length(x)
    t <- rnorm(n)
    indx <- t < 0
    indx_2 <- t > 0
    xnew <- x[indx]
    ynew <- x[indx_2]
    t <- rnorm(n)
    indx <- t < 0
    indx_2 <- t > 0
    xnew <- c(xnew, y[indx])
    ynew <- c(ynew, y[indx_2])

Use t.test to test for association. Repeat the above 1000 times and plot the distribution of the p-values.

6. Repeat the task in (5) but this time store the standardised difference between the sample means at each repeat. You can use the code below to calculate the standardised difference in means. The shuffled groups need not be the same size, so the pooled standard deviation is written for unequal group sizes.

    mu_x <- mean(xnew)
    mu_y <- mean(ynew)
    n_x <- length(xnew)
    n_y <- length(ynew)
    grouped_standard_deviation <- sqrt(((n_x-1)*var(xnew)+(n_y-1)*var(ynew))/(n_x+n_y-2))
    mu_dif <- (mu_x - mu_y) / (grouped_standard_deviation * sqrt(1/n_x + 1/n_y))

Plot the distribution of standardised mean differences using histograms and Q-Q plots. What is the distribution?
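The permutation idea can also be sketched with sample(), which shuffles the pooled data in one step. This is a swapped-in alternative to the rnorm()-based shuffle used in the practical: unlike that code, it keeps the two group sizes fixed. The helper name `permute_pvalue`, the value of `mu` and the seed are assumptions made for this example.

```r
set.seed(2008)  # arbitrary seed, for reproducibility only

mu <- 0.4
x <- rnorm(50)
y <- rnorm(50) + mu

# Hypothetical helper: shuffle the pooled data, split it back into two
# groups of the original sizes, and return the t-test p-value
permute_pvalue <- function(x, y) {
  pooled <- sample(c(x, y))          # random permutation of all points
  xnew <- pooled[seq_along(x)]       # first length(x) shuffled points
  ynew <- pooled[-seq_along(x)]      # remaining points
  t.test(xnew, ynew)$p.value
}

pvals <- replicate(1000, permute_pvalue(x, y))

# After shuffling there is no real group difference, so the p-values
# should be roughly uniform on (0, 1)
print(mean(pvals < 0.05))
print(t.test(x, y)$p.value)  # p-value for the unshuffled data, for comparison
```

A histogram of pvals, or a Q-Q plot against (1:1000)/1001, gives the graphical check described in the exercises.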
