Lund University
Centre for Mathematical Sciences
Mathematical Statistics

Mathematical Statistics, General Course, MASA01:A, HT-15
Labs

General information on labs

During the first half of the course MASA01 we will have two labs. The laboratory work is based on programming in R, and will be done in groups of 2-4.

It is compulsory to submit a written report for the lab work in the week following the lab. The report should be written so that it can be read independently, without the need to consult the computer exercise. All graphs should have descriptive axis labels. All source code must contain comments and must be included in the report. The report will be due in 2 weeks' time.

Exercises marked with an asterisk (*) can be solved either partially or fully without R.

In case of doubt, please contact Debleena Thacker, debleena@maths.lth.se.
Computer Exercise 1

The purpose of this lab is to introduce R as a tool to investigate different densities.

1 Introduction to R

The following commands are useful for different distributions:

Distribution   Random numbers   Probability/density function   Distribution function   Quantile
Normal         rnorm            dnorm                          pnorm                   qnorm
Uniform        runif            dunif                          punif                   qunif
Binomial       rbinom           dbinom                         pbinom                  qbinom
Poisson        rpois            dpois                          ppois                   qpois
Exponential    rexp             dexp                           pexp                    qexp
Gamma          rgamma           dgamma                         pgamma                  qgamma

Quantile functions (fifth column) will be used only in the second part of the course.

A useful command in R is ? or help. Type

    help(sum)

or

    ?sum

An extended search is conducted through help.search:

    help.search("sum")

Typical examples

    1+2+3

A variable is defined by

    v <- 2
    w <- -7.5

The following is an example of how one can use this:

    u <- (v-w)/5
    exp(u)
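The naming scheme in the table of distribution commands (a prefix r, d, p or q followed by the distribution name) can be tried out directly. The following is only a minimal sketch; the variable names and the chosen arguments are illustrative assumptions and are not part of the lab.

```r
# Prefix + distribution name: d = density/probability function,
# p = distribution function, q = quantile, r = random numbers.
set.seed(1)                                   # arbitrary seed, for reproducibility only

dens  <- dnorm(0, mean = 0, sd = 1)           # density of N(0, 1) at 0, i.e. 1/sqrt(2*pi)
prob  <- pnorm(1.96, mean = 0, sd = 1)        # P(X <= 1.96) for X ~ N(0, 1)
quant <- qnorm(0.975, mean = 0, sd = 1)       # quantile: the inverse of pnorm
smpl  <- rnorm(5, mean = 0, sd = 1)           # five random numbers from N(0, 1)

# For a discrete distribution the d-function is the probability function.
pmf <- dbinom(0:3, size = 3, prob = 0.5)      # 0.125 0.375 0.375 0.125

# q and p are inverse to each other (up to rounding error).
pnorm(quant)                                  # 0.975
```

The same pattern applies to every row of the table, for example rexp, dexp, pexp and qexp for the exponential distribution.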
Matrices and vectors

Vectors can be defined using the function c():

    a <- c(12,-4.3,9,5/2)
    a

Sometimes the following are useful:

    b <- c(6:9)
    c <- rep(3,4)
    d <- seq(from = -2, to = 3, by = 0.3)

Indexing is done by

    d[4]

The length of a vector can be determined by the length() function:

    length(d)

Element-wise operations can be done as follows:

    a*b
    3*a-1
    f <- a+b

The scalar product of two vectors can be computed as:

    g <- a%*%b

or:

    sum(a*b)

One can use the built-in functions on vectors:

    exp(a)
    sin(3*pi)-exp(c)

Matrices can be defined as follows:

    fill_number <- c(1:10)
    m1 <- matrix(fill_number, ncol = 2)
    m2 <- matrix(fill_number, ncol = 2, byrow = TRUE)
    m1
    m2

What is the difference between m1 and m2? Also try

    m3 <- cbind(a,b)
    m4 <- rbind(a,b)

The dimensions of a matrix:

    dim(m3)
    nrow(m3)
    ncol(m3)

Matrices can be indexed by:

    m1[1,]
    m1[1,2]
    m1[,1]

The matrix product can be computed by %*%:

    m3%*%m4
    m3%*%t(m3)

Logical operations are defined element-wise according to the following rules:

    a <= b
    a[a<=b]

Often one wants to assign the values 1 and 0 to TRUE and FALSE. This can be done with as.numeric:

    as.numeric(a <= b)

This can be used to define indicator variables.

Matrix inversion can be done by solve:

    diag_m <- diag(3, nrow = 4)
    inv <- solve(diag_m)
    diag_m%*%inv

The type of a variable is given by the str function:

    names <- c("hugo", "Lasse", "Line")
    str(names)
    str(d)
    str(m1)

Graphics/Diagrams

    x <- seq(-10,10,0.1)
    n <- 15
    gaussians <- c(rnorm(100, mean = 0, sd = 1))
    y1 <- pnorm(x, mean = 0, sd = 1)   # distribution function
    y2 <- dnorm(x, mean = 0, sd = 1)   # density
    plot(x,y1, main = "Standard normal distribution", xlab = "function arguments", ylab = "function values",col = "red", type = "l", xlim = c(-5,5))
    lines(x,y2, col = "green", type = "l")
    # legend entries in the same order as the colours: red = y1, green = y2
    legend(-5,0.9, c("distr.function", "density"), col = c("red", "green"), text.col = "blue", lty = c(1, 1))
    points(gaussians, rep(0,length(gaussians)), pch = 4)
    mean(gaussians)
    sd(gaussians)
    var(gaussians)
    help(legend)

One can plot multiple figures in the same window in the following manner:

    par(mfrow = c(2,1))
    plot(x, sin(x), type = "b")
    plot(x, sin(x^2), type = "l")

One can remove the divisions in the window by:

    dev.off()

or

    par(mfrow = c(1,1))

One can finish the R session by typing:

    q()

at which point you will be prompted as to whether or not you want to save your workspace. If you do not save it, it will be lost. To save your workspace, type y.

2 Distributions

Problem* 2.1. What is the definition of the distribution of a random variable?

Problem* 2.2. Define the density and probability functions of a random variable.

Problem* 2.3. Let $X$ be a continuous random variable with density function $f_X(x) = \frac{1}{2}\, x\, 1_{[0,2]}(x)$. Determine $P(1 \le X \le 2)$. ($1_A(\cdot)$ denotes the indicator function of $A$, where $1_A(x) = 1$ for $x \in A$ and $1_A(x) = 0$ for $x \notin A$.)

Problem* 2.4. Let $X$ be a discrete random variable with probability function $p_X(k) = c \cdot 0.4^k$, $k = 0, 1, 2, \ldots$ Determine $c$. Calculate $P(1 \le X < 4)$.

Problem* 2.5. Give an example of an experiment whose outcome can be described by a Binomial random variable $X \sim \mathrm{Bin}(n, p)$.
Problem* 2.6. What is the difference between the hypergeometric and binomial distributions?
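The indicator notation $1_A(x)$ used in Problem 2.3 corresponds to the TRUE/FALSE to 1/0 conversion from the introduction. The sketch below is only an illustration under stated assumptions: the helper names indicator and f are hypothetical, and the code merely checks numerically that the function from Problem 2.3 integrates to 1, as a density must. It does not compute the probability asked for in the problem.

```r
# Indicator function of the interval [lower, upper]:
# a logical comparison converted to 1 (TRUE) and 0 (FALSE).
# 'indicator' is a hypothetical helper name, not an R built-in.
indicator <- function(x, lower, upper) {
  as.numeric(x >= lower & x <= upper)
}

# The density from Problem 2.3: f(x) = (1/2) * x * 1_[0,2](x)
f <- function(x) 0.5 * x * indicator(x, 0, 2)

# The indicator is 0 outside [0, 2] and 1 inside.
indicator(c(-1, 0, 1, 2, 3), 0, 2)

# A density must integrate to 1 over its support.
total <- integrate(f, lower = 0, upper = 2)$value
total
```

The same helper can be reused to evaluate other densities that are written with an indicator function.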
3 Practicals with R

Problem 3.1. Let $X$ be exponentially distributed with $E(X) = 4$. Plot (using R) the distribution function $F_X(\cdot)$ of $X$.

Problem 3.2. Determine P(X 4) where $X \sim \mathrm{Poisson}(3)$. Compute P(X 3) for the same $X$ using R.

Problem 3.3. Compute (using R) P(X 5) where $X \sim N(\mu = 2, \sigma^2 = 3)$. Note that the parameters in dnorm are the expectation $\mu$ and the standard deviation $\sigma$ (and not the variance).

Problem 3.4. Compute (using R) P(X 4) where $X \sim \mathrm{Bin}(10, 0.6)$.

Problem 3.5. Plot the probability functions for Poisson(10) and $\mathrm{Bin}(n, p_n)$ where $(n, p_n) \in \{(11, 10/11), (20, 1/2), (100, 0.1), (1000, 0.01)\}$. Plot all of them in the same window. For example, start with:

    x <- seq(0,20,1)
    plot(x, dpois(x, 10), ylim = c(0,0.5), col = "red", main = "comparison of densities", type = "l")

One can use the command lines to add a graph to a plot window. Use different colors for different densities. (Type colors() for a complete list of available colors.) What do you observe? Can you say something about when a Binomial distribution can be described with the help of a Poisson distribution?

Problem 3.6. Plot the densities of N(5, 2), unif(0, 10) and Exp(rate = 1/5) in the same window. Use different colors. Remember to label them.

Histogram

A histogram describes how the values in a data set are distributed over an interval. This interval is divided into an (optional) number of subintervals, and we count the number of values in each subinterval. The histogram can then be visualized as a bar chart.

Use the R function rexp to simulate a vector X of 1000 exponential random numbers with mean 3. To see how to use the function rexp and its parameters, type help(rexp) and read the help text.

    X <- rexp(1000, rate = 1/3)

We want to draw a histogram of X with 30 classes. Let's type
    h <- hist(X, breaks = 30, ylab = "counts", main = "Histogram of X ~ Exp(1/3)", col = "lightblue")

The break points of the subintervals used by the function hist can be obtained by

    bryt <- h$breaks

and the length of bryt by length(bryt). hist may not use exactly as many break points as you enter. Instead, it chooses break points in a "nice" way: roughly 30 (usually not exactly 30) equally spaced break points whose spacing is a round value. This is done by the function pretty; type help(pretty) if you want to know more. If you want control over the break points, the best way is to choose them yourself. You can for example write

    bins <- seq(0, max(X), length.out = 30)
    h2 <- hist(X, breaks = bins, ylab = "counts", col = "lightblue", main = "Histogram of random numbers with Exp(rate = 1/3)")

Type str(h) to get an overview of all values returned by hist. One can obtain the number of data points in each subinterval by

    h$counts

The histogram shows the number of data points in each subinterval. However, we are interested in comparing the histogram with the theoretical density function, and hence the histogram needs to be normalised.

Problem 3.7 (Plot normalised histogram). What are the characteristics of a probability function, and how can one scale a histogram in order to relate it to a density? Observe that hist(...)$counts gives the values of the histogram and hist(...)$breaks gives the subinterval break points. One can plot a bar diagram using barplot. Hint: the command diff(hist(...)$breaks) gives the differences between consecutive break points.

Problem 3.8. Generate a sample of size 1000 from $N(20, \sigma^2 = 6)$. Draw a normalised histogram of this sample. Plot the density of N(20, 6) in the same graph, and compare it with the histogram.

Problem 3.9.
An alternative way to plot a normalised histogram is to set freq = FALSE in the hist function:

    hist(X, 30, freq = FALSE, main = "normalised histogram")

Let $X$ be a Normal random variable with mean $\mu = 1$ and variance $\sigma^2 = 2$. Assume $Y = 2X + 3$. What is the distribution of $Y$?
Generate two samples of size 5000 each, one from the distribution of $X$, and another from the distribution of $Y$. Draw the respective normalised histograms using hist and freq = FALSE. Plot the densities of $X$ and $Y$ in the same figure.
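The normalisation discussed in the Histogram section (counts, break points, and freq = FALSE) can be checked numerically. The following is a minimal sketch, not a model solution; the variable names, the seed and the use of plot = FALSE (which suppresses the drawing and only returns the computed values) are assumptions made for illustration.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
sample_exp <- rexp(1000, rate = 1/3)         # exponential random numbers with mean 3

# Compute the histogram without drawing it.
h <- hist(sample_exp, breaks = 30, plot = FALSE)

# Width of each subinterval, from consecutive break points.
widths <- diff(h$breaks)

# Scale the counts so that the total bar area equals 1.
dens_manual <- h$counts / (sum(h$counts) * widths)

# Total area of the normalised bars.
area <- sum(dens_manual * widths)

# hist also returns normalised heights in h$density;
# this is what freq = FALSE draws.
max_diff <- max(abs(dens_manual - h$density))

area
max_diff
```

With the bars scaled this way, their heights are on the same scale as a density such as dexp(t, rate = 1/3), so the two can be drawn in the same plot.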