R STATISTICAL COMPUTING: Some R Examples. Dennis. Friday 2nd and Saturday 3rd May, 14.
Topics covered: vector and matrix operations; file operations; evaluation of probability density functions; testing of hypotheses (t, z, F, chi-square); linear regression.
Vectors I
A numeric vector is an ordered collection of numbers. Three functions commonly used to create vectors are c(), seq() and rep().
The function c(): the letter c means concatenate, i.e. to join together. R names are case-sensitive, so saturday.sales and Saturday.sales would be different variables.
> c(0, 270, 250, 300, 210) # Creating a numeric vector.
> saturday.sales <- c(0, 300, 3)
> Sunday.sales <- c(1, 230, 0) # Saving a vector in a variable.
> weekend.sales <- c(saturday.sales, Sunday.sales) # Combining two vectors.
> cities <- c("Kerala", "Delhi", "Chennai") # Creating a vector of characters.
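As a quick check of how c() behaves, the following minimal sketch rebuilds the vectors above and inspects the result. The name mixed is introduced here purely for illustration and is not part of the original example.

```r
saturday.sales <- c(0, 300, 3)
Sunday.sales <- c(1, 230, 0)
weekend.sales <- c(saturday.sales, Sunday.sales)

length(weekend.sales)   # 6: the two vectors are joined end to end
weekend.sales[4]        # 1: the first element of Sunday.sales

# 'mixed' is a hypothetical example: c() coerces mixed types to one common type
mixed <- c(1, "a")
class(mixed)            # "character"
```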
Vectors II
The functions cbind(), rbind() and seq()
> rows.sales <- cbind(saturday.sales, Sunday.sales) # Binds the vectors together as columns.
> columns.sales <- rbind(saturday.sales, Sunday.sales) # Binds the vectors together as rows.
> colnames(columns.sales) <- cities # Adds column titles to the matrix.
The abbreviation seq means sequence, i.e. it generates equidistant series of numbers.
> seq(4, 9) # Creates a sequence of numbers from 4 to 9 at intervals of 1.
> seq(3, 10, by = 2) # Creates a sequence from 3 toward 10 at intervals of 2, i.e. 3 5 7 9.
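The difference between cbind() and rbind() is easiest to see from the dimensions of the result. The sketch below uses the sales vectors from the previous slide; the names by.column and by.row are assumptions chosen for this illustration.

```r
saturday.sales <- c(0, 300, 3)
Sunday.sales <- c(1, 230, 0)

by.column <- cbind(saturday.sales, Sunday.sales)  # each vector becomes a column
by.row <- rbind(saturday.sales, Sunday.sales)     # each vector becomes a row

dim(by.column)   # 3 2
dim(by.row)      # 2 3

# seq() with an explicit step, and with a requested number of points
seq(3, 10, by = 2)          # 3 5 7 9
seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
```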
Vectors III
The function rep(): the abbreviation rep means replicate, i.e. it is used to generate repeated values. It is used in two variants, depending on whether the second argument is a single number or a vector. For example, to repeat a series of numbers;
> x = c(4, 5)
> rep(x, 3) # This will repeat x three times.
Consider the following R code;
> rep(1:2, c(10, 15)) # This means repeat 1 ten times and 2 fifteen times.
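The two variants of rep() described above, plus the each argument, can be compared side by side. This is a small sketch; the name groups is an assumption used only here.

```r
x <- c(4, 5)
rep(x, 3)          # 4 5 4 5 4 5  (whole vector repeated)
rep(x, each = 3)   # 4 4 4 5 5 5  (each element repeated in turn)

# A vector as second argument gives one repeat count per element
groups <- rep(1:2, c(10, 15))
length(groups)     # 25
table(groups)      # ten 1s and fifteen 2s
```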
Matrices I
Example: a matrix is a two-dimensional array of numbers.
> x <- 1:12
> dim(x) <- c(3, 4); x
The dim assignment function sets or changes the dimension attribute of x, causing R to treat the vector of 12 numbers as a 3 x 4 matrix, filled column by column.
Alternatively, the function matrix() can be used as follows;
> matrix(1:12, nrow = 3, byrow = TRUE)
With byrow = TRUE the matrix is filled row by row, so its entries are arranged differently from the dim() version.
Useful functions that operate on matrices include rownames(), colnames(), and the transposition function t().
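To make the fill order concrete, the following sketch compares the first row produced by each approach. The name m is an assumption introduced for the comparison.

```r
x <- 1:12
dim(x) <- c(3, 4)                           # column-wise fill
m <- matrix(1:12, nrow = 3, byrow = TRUE)   # row-wise fill

x[1, ]   # 1 4 7 10
m[1, ]   # 1 2 3 4

# Without byrow, matrix() fills column-wise, just like the dim() assignment
all(x == matrix(1:12, nrow = 3))   # TRUE
```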
Matrices II
Example
> x <- matrix(1:12, nrow = 3, byrow = TRUE)
> rownames(x) <- LETTERS[1:3]; x
> colnames(x) <- letters[1:4]; x # Naming rows and columns (x has four columns).
The character vector LETTERS is a built-in variable that contains the capital letters A-Z. Similar useful vectors are letters, month.name and month.abb, which refer to the lowercase letters a-z, month names, and abbreviated month names respectively.
Finding the transpose and the inverse of a matrix.
> y <- t(x); y # Gives the transpose of a matrix.
The function solve() requires a square, non-singular matrix, so the 3 x 4 matrix x cannot be inverted. Using an illustrative 2 x 2 matrix (not part of the example above):
> sq <- matrix(c(2, 1, 1, 3), nrow = 2)
> z <- solve(sq); z # Gives the inverse of a matrix.
Matrices III
Create two matrices, say, A and B. Matrix addition, subtraction and multiplication
> C = c(rep(0, 3), seq(1, 6, 1), rep(seq(1, 6, 1), 2))
> A = matrix(sample(C, 12), nrow = 3, byrow = TRUE); A
> B = matrix(sample(C, 12), nrow = 3, byrow = TRUE); B
> D = A + B; D # Adds the matrices.
> E = A - B; E # Subtracts matrix B from matrix A.
> F = A * B; F # Multiplies matrix A by matrix B element-wise.
> G = A %*% t(B); G # Computes the matrix product of A and the transpose of B.
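Because sample() gives different matrices on every run, the distinction between * and %*% is easier to verify on small fixed matrices. The sketch below uses two hypothetical 2 x 2 matrices chosen for illustration.

```r
A <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
B <- matrix(5:8, nrow = 2)   # columns are (5, 6) and (7, 8)

elementwise <- A * B    # multiplies matching entries
product <- A %*% B      # rows of A times columns of B

elementwise
#      [,1] [,2]
# [1,]    5   21
# [2,]   12   32

product
#      [,1] [,2]
# [1,]   23   31
# [2,]   34   46
```

Calling set.seed() before sample() would make the random matrices on the slide reproducible between runs.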
File operations
Reading data. The two common functions used in reading data are scan() and read.table(), i.e.
> Sales = scan(file = "filepath")
> SALES = read.table(file = "filepath", header = FALSE)
To merge two data frames according to their common column names or row names, we use the function merge(). It takes two data frames at a time, so three are merged by nesting the calls;
> AB = merge(x, y) # Merging two data frames.
> ABC = merge(merge(x, y), z) # Merging three data frames.
To write output into a file, use the function write(), i.e.
> write(x, file = "file path to save the data", ncolumns = 1, append = FALSE)
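The following sketch shows one possible write, read and merge round trip using a temporary file. The data frames sales and extra, their column names, and the use of write.table() in place of write() are all assumptions made for illustration; write.table() is swapped in because it stores column headers, which read.table() can then recover.

```r
# Hypothetical data frames sharing a 'city' key column
sales <- data.frame(city = c("Kerala", "Delhi", "Chennai"),
                    saturday = c(0, 300, 3))
extra <- data.frame(city = c("Delhi", "Chennai", "Kerala"),
                    sunday = c(230, 0, 1))

path <- tempfile(fileext = ".txt")            # temporary file, removed below
write.table(sales, file = path, row.names = FALSE)

# Read the file back; header = TRUE because the first line holds column names
sales.back <- read.table(path, header = TRUE, stringsAsFactors = FALSE)

# Match rows on the shared 'city' column regardless of row order
both <- merge(sales.back, extra, by = "city")
both

unlink(path)
```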
Some probability distributions I
Normal distribution. The full list of functions and options is found with the command
> help(Normal)
dnorm (the density function)
> x <- seq(,, by = .1)
> y <- dnorm(x)
> plot(x, y)
pnorm (the cumulative distribution function)
> x <- seq(,, by = .1)
> y <- pnorm(x, mean = 3, sd = 4)
> plot(x, y)
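As a numerical companion to the plots, the sketch below evaluates dnorm() and pnorm() on a grid. The range from -4 to 4 is an assumption made for this example, since the endpoints are not given above.

```r
x <- seq(-4, 4, by = 0.1)   # assumed range, not taken from the slide
y <- dnorm(x)               # standard normal density on the grid

dnorm(0)                    # peak height, equal to 1 / sqrt(2 * pi)
area <- sum(y) * 0.1        # crude Riemann sum, close to 1

pnorm(0)                    # 0.5: half the probability lies below the mean
pnorm(3, mean = 3, sd = 4)  # 0.5 again, for a normal centred at 3
```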
Some probability distributions II
qnorm
The next function we look at is qnorm, which is the inverse of pnorm. You give it a probability p, and it returns the quantile, i.e. the value q for which the cumulative probability P(X <= q) equals p.
> x <- seq(0, 1, by = .05)
> y <- qnorm(x) # qnorm(0) and qnorm(1) are -Inf and Inf.
> plot(x, y)
> y <- qnorm(x, mean = 3, sd = 2)
> plot(x, y)
rnorm (random draws from the normal distribution)
> y <- rnorm(0, mean = 2)
> hist(y)
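The inverse relationship between qnorm() and pnorm() can be checked directly. In this sketch the probabilities in p, the seed, and the sample size of 200 are assumptions chosen for illustration.

```r
p <- c(0.025, 0.5, 0.975)
q <- qnorm(p)        # roughly -1.96, 0, 1.96
pnorm(q)             # recovers 0.025, 0.5, 0.975

# Random draws: seed and sample size are illustrative assumptions
set.seed(1)
y <- rnorm(200, mean = 2)
mean(y)              # should be near 2
```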
t-distribution I
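t-distribution II
dt (density). Help is available with
> help(TDist)
> x <- seq(,, by = .5)
> y <- dt(x, df = 10)
pt (cumulative distribution function)
> x = c( 3, 4, 2, 1)
> pt((mean(x) - 2)/sd(x), df = )
qt (quantile function)
> v <- c(0.005, .025, .05)
> qt(v, df = 253)
rt (random draws)
> rt(3, df = 10)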
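One common use of pt() is computing the p-value of a one-sample t statistic by hand. The sketch below is an illustration under stated assumptions: it uses the sample values as printed above, takes 2 as the hypothesized mean, divides by the standard error sd(x)/sqrt(n) rather than by sd(x), and uses n - 1 degrees of freedom. The names mu0, t.stat and p.two are introduced here.

```r
x <- c(3, 4, 2, 1)    # sample values as printed on the slide
mu0 <- 2              # assumed hypothesized mean
n <- length(x)

# Standard one-sample t statistic (note the standard error in the denominator)
t.stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p.two <- 2 * pt(-abs(t.stat), df = n - 1)

# The built-in test should give the same p-value
builtin <- t.test(x, mu = mu0)$p.value
c(by.hand = p.two, builtin = builtin)
```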
Chi-square I
Chi-square II
dchisq (density). Help is available with
> help(Chisquare)
> x <- seq(,, by = .5)
> y <- dchisq(x, df = 10)
pchisq (cumulative distribution function)
> x = c(2, 4, 5, 6)
> pchisq(x, df = )
qchisq (quantile function)
> v <- c(0.005, .025, .05); qchisq(v, df = 253)
rchisq (random draws)
> rchisq(3, df = )
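The following sketch shows how the chi-square functions relate to one another. The degrees of freedom, the seed, and the number of draws are assumptions chosen for illustration, since the values above are incomplete.

```r
# Critical value for a 5% test with 1 degree of freedom (about 3.84)
crit <- qchisq(0.95, df = 1)
pchisq(crit, df = 1)         # 0.95: pchisq undoes qchisq

# The mean of a chi-square variable equals its degrees of freedom
set.seed(123)                        # arbitrary seed
draws <- rchisq(10000, df = 10)      # assumed df = 10
mean(draws)                          # close to 10
```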
Testing of Hypotheses I
In the one-sample t-test, we compare a sample mean to a known or hypothesized population mean. The null hypothesis is that the expected difference between the sample mean and the population mean is zero, or in other words, that the expected value of the sample mean is equal to the population mean.
t-test example 1.
> rnorm1 = rnorm(50, 500, 100); rnorm1 # Generate 50 random numbers from a normal distribution with µ = 500, σ = 100.
> summary(rnorm1) # Gives the descriptive summary.
In the run shown, the sample mean of 517.3 is higher than µ = 500 (the value changes from run to run because the sample is random), but is it significantly higher? To test this hypothesis, we use the function t.test().
> t.test(rnorm1, mu = 500)
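Since rnorm() produces a new sample on every call, fixing a seed makes the example repeatable, and the pieces of the test result can be extracted by name. The seed value and the names sample1 and result are assumptions made for this sketch.

```r
set.seed(2014)                                # arbitrary seed for repeatability
sample1 <- rnorm(50, mean = 500, sd = 100)

result <- t.test(sample1, mu = 500)

result$estimate    # the sample mean
result$statistic   # the t statistic
result$p.value     # reject H0 at the 5% level only if this is below 0.05
result$conf.int    # 95% confidence interval for the population mean
```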
Testing of Hypotheses II
t-test example 2 (quiz). Load the built-in library datasets.
> library(datasets)
> head(mtcars)
Assuming that the data in mtcars follow the normal distribution, find the 95% confidence interval estimate of the difference between the mean gas mileage of manual and automatic transmissions.
Testing of Hypotheses III
Solution
> L = mtcars$am == 0
> mpg.auto = mtcars[L, ]$mpg; mpg.auto # automatic transmission mileage.
> mpg.manual = mtcars[!L, ]$mpg; mpg.manual # manual transmission mileage.
> t.test(mpg.auto, mpg.manual)
Answer: in mtcars, the mean mileage of automatic transmission is 17.147 mpg and that of manual transmission is 24.392 mpg. The 95% confidence interval of the difference in mean gas mileage is between 3.2097 and 11.2802 mpg, with manual transmission giving the higher mileage.
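The same comparison can also be written with the formula interface of t.test(), which avoids the manual subsetting. This is an alternative sketch, not the solution's own method; the names result and ci are assumptions. Because am = 0 (automatic) is the first group, the interval is reported for automatic minus manual and is therefore negative.

```r
library(datasets)

# mpg split by transmission type: am = 0 automatic, am = 1 manual
result <- t.test(mpg ~ am, data = mtcars)

result$estimate    # group means: about 17.147 and 24.392
ci <- result$conf.int
ci                 # about -11.2802 to -3.2097 (automatic minus manual)
```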
Testing of Hypotheses IV
Chi-squared test. Consider the following example;
> female = c(18, 102)
> male = c(10, 110)
> migraine = cbind(female, male); migraine
> chisq.test(migraine)
From the results, the chi-square test fails to reject the null hypothesis that gender and migraine susceptibility are independent. If the ratio of female to male migraine sufferers is indeed 18:6, then our result is not what we had hoped for. We may have an unusual sample or we may simply need a larger sample to obtain statistical significance.
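To see where the conclusion comes from, the components of the test result can be inspected. The sketch below reuses the table from the example; the name result is an assumption. For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default.

```r
female <- c(18, 102)
male <- c(10, 110)
migraine <- cbind(female, male)

result <- chisq.test(migraine)   # Yates' correction is applied for 2 x 2 tables

result$expected    # counts expected under independence: 14 and 106 per column
result$statistic   # about 1.98
result$p.value     # above 0.05, so independence is not rejected
```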
Regression I
Simple Linear Regression. An example of grade point averages (GPAs) of students along with the number of hours each student studies per week. We use the lm() function to find the slope and intercept terms.
Hours: 10 12 10 15 14 12 13 15 16 14 13 12 11 10 13 13 14 18 17 14
GPA: 3.33 2.92 2.56 3.08 3.57 3.31 3.45 3.93 3.82 3.70 3.26 3.00 2.74 2.85 3.33 3.29 3.58 3.85 4.00 3.50
> results <- lm(GPA ~ Hours); summary(results)
The correct interpretation of the regression equation is that a student who did not study would have an estimated GPA of 1.3728, and that for every 1-hour increase in study time, the estimated GPA would increase by 0.1489 points.
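The lm() call above needs the two variables to exist as vectors first. The sketch below enters the data from the table and extracts the fitted coefficients; the prediction for 12 hours of study is an added illustration, not part of the original example.

```r
Hours <- c(10, 12, 10, 15, 14, 12, 13, 15, 16, 14,
           13, 12, 11, 10, 13, 13, 14, 18, 17, 14)
GPA <- c(3.33, 2.92, 2.56, 3.08, 3.57, 3.31, 3.45, 3.93, 3.82, 3.70,
         3.26, 3.00, 2.74, 2.85, 3.33, 3.29, 3.58, 3.85, 4.00, 3.50)

results <- lm(GPA ~ Hours)
coef(results)   # intercept about 1.3728, slope about 0.1489

# Illustrative prediction for a student who studies 12 hours per week
predict(results, newdata = data.frame(Hours = 12))
```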
Regression II
Example: consider the data frame on cars, mtcars, in library(datasets). Test whether mpg (miles per gallon) is affected by disp (displacement) and hp (horsepower).
> myvariables = c("mpg", "disp", "hp")
> mycars = mtcars[myvariables]
> mymodel = lm(mpg ~ disp + hp, data = mycars)
> summary(mymodel)
From the summary of the model, the estimated coefficients of both independent variables, disp (engine displacement) and hp (horsepower), are negative, i.e. holding the other variable fixed, an increase in either displacement or horsepower is associated with a decrease in miles per gallon. The p-values in the summary indicate whether each effect is statistically significant.
Another important graphical way to assess relationship between