probability George Nicholson and Chris Holmes 31st October 2008


This practical focuses on understanding probabilistic and statistical concepts using simulation and plots in R. It begins with an introduction to writing loops and functions in R.

Loops in R

This section gives a very brief introduction to writing loops in R. Type in the following code and see what it does.

    sum=0
    for(i in 1:10){
      sum=sum+i
      print(paste("loop ",i,", sum = ",sum,sep=""))
    }

Now try this:

    for(mychar in letters){
      print(paste("loop ",mychar,sep=""))
    }

Define a vector of characters (e.g. cv=c("s","g","u")) and write a loop that will calculate the sum of the letter indices of cv (i.e. 19+7+21=47). (Think about using the function match() in your loop.) Can you think of an automated way of doing this calculation without the loop?

Note: You should try whenever possible to avoid using loops in R, as they are a relatively slow way to do computations. The way to work efficiently is to vectorise calculations (e.g. use matrix multiplication), and to use specialised R functions that can do iterative calculations fast (e.g. rowSums(), ifelse()).
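As a small illustration of the advice on vectorising, the sketch below computes the same quantities with a loop and with a vectorised call. The variable names (total, m, row_totals_loop, and so on) are made up for this example and are not part of the practical.

```r
# Loop version: add up the integers 1 to 10 one at a time
total <- 0
for (i in 1:10) {
  total <- total + i
}

# Vectorised version: sum() works on the whole vector at once
vectorised_total <- sum(1:10)

# Row sums of a small matrix, first with a loop, then with rowSums()
m <- matrix(1:6, nrow = 2)          # filled column by column
row_totals_loop <- numeric(nrow(m))
for (r in seq_len(nrow(m))) {
  row_totals_loop[r] <- sum(m[r, ])
}
row_totals <- rowSums(m)

# ifelse() applies a test to every element without an explicit loop
signs <- ifelse(c(-2, 0, 3) > 0, "positive", "non-positive")

print(total)
print(vectorised_total)
print(row_totals)
print(signs)
```

On vectors this short the speed difference is negligible; it becomes noticeable once the loop runs over many thousands of elements.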

Functions in R

A function in R is defined using the following syntax:

    funname=function(argument1, argument2, etc.){
      ...code here...
      listout=list(outname1=object1, outname2=object2, etc.)
      return(listout)
    }

You only need to create a list to return if you want to return more than one R object at a time (otherwise you can just use return(object)). First, let's create a function that takes, as arguments, two numbers, and returns the first number raised to the power of the second number:

    pow=function(x,p){
      out=x^p
      return(out)
    }

Play around with the function pow. What happens if the argument x is a vector? What about p?

Now create a function whose arguments are two numbers, m and n say; the function must return two objects, named div and rem, with the property that m divides n div times with remainder rem (e.g. 5 divides 7 once leaving remainder 2).

    divide=function(m,n){
      div=floor(n/m)
      rem=n-m*div
      # return() accepts a single object, so the two results go in a named list
      return(list(div=div,rem=rem))
    }

Now write a function which takes a single argument, x say (a numeric vector). The function must calculate the sample mean and variance of x using the following formulae:

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

    s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

where n is the length of x. Return the objects mean and variance in a list.
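One possible sketch of such a function is given below, so that the two formulae can be checked against R's built-in mean() and var(). The function name sample_stats and the test vector are assumptions made for this illustration only.

```r
# Sample mean and variance computed directly from the formulae
sample_stats <- function(x) {
  n <- length(x)
  mean_x <- sum(x) / n                        # (1/n) * sum of x_i
  var_x <- sum((x - mean_x)^2) / (n - 1)      # 1/(n-1) * sum of squared deviations
  list(mean = mean_x, variance = var_x)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
out <- sample_stats(x)
print(out$mean)
print(out$variance)

# The built-in functions should agree up to rounding error
print(all.equal(out$mean, mean(x)))
print(all.equal(out$variance, var(x)))
```

Elements of the returned list are accessed by name with the $ operator, as in out$mean.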

Simulation and probability in R

These first few questions are designed to get you to think about natural random variation and how things change with sample size.

1. Generate n random draws from a standard Gaussian distribution. Plot a histogram of the data (on the density scale, e.g. with freq=FALSE, so that a density curve is comparable). Overlay the histogram with the density function of a standard Gaussian; first look at ?dnorm and understand what this function does.

    xpl=seq(-5,5,l=10000)
    lines(x=xpl,y=dnorm(xpl))

Repeat 6 times. Open a graphics device, and define a 2 x 3 layout using the argument mfrow to the par function. Plot the 6 histograms in this graphics device. Export the plots into a .pdf file using the savePlot() function. How does the appearance of the histograms change with n? For this question you should see that small sample sizes give much greater variation in the qualitative shapes of the distributions.

2. Repeat the previous task in its entirety, but now create Q-Q plots instead of histograms; superimpose the line y = x on each plot using abline().

3. Generate n draws from a standard Gaussian distribution; calculate the mean of the n values. Repeat the procedure of the previous sentence m times, storing the m means in a vector, mv say. Standardise mv (i.e. subtract its mean and divide by its standard deviation). Plot the standardised vector mv in a histogram and overlay with the density of a standard Gaussian. What do you observe? Vary n and m; what happens as they increase? Can you spot a pattern? What you'll see is that the standard deviation of the distribution of means decreases with sample size. In fact the standard deviation of the distribution of means decreases as σ/√n. That is, as the sample size increases the variance of the distribution of means decreases. This is a very important point, as in hypothesis testing we'll be interested in answering questions of the form, what is the chance that the true (unknown) mean is zero? It therefore helps to know that, as the sample size increases, we can expect the estimated mean to be contained in a region which is shrinking around zero.

4. Repeat the previous exercise, but use the chi-squared distribution with one degree of freedom as the generating distribution (rchisq()). (You SHOULD remind yourself what this distribution looks like by plotting a histogram of some simulated data before setting out.)

Question 4 is a demonstration of the central limit theorem (CLT), an amazing result. The CLT states that it does not matter what the underlying probability distribution of the samples is (provided it has finite variance): the distribution of the means of samples from the population tends to a normal density as the sample size increases. Please ask me if this is not clear.

5. Generate n draws from a standard Gaussian; save the 0.1 quantile (10th percentile); repeat this m times, storing the m values in a vector, mv say; plot a histogram of mv and add a line indicating the theoretical 0.1 quantile (find this using qnorm). Repeat the previous sentence's procedure for the median and for the mean. Boxplot the distribution of the means and distribution of the medians. What do you see? We see that the distribution of the means and the distribution of the medians have the same central location. To put it another way, for a symmetric distribution such as the Gaussian the median is an unbiased estimate of the mean. However, the distribution of medians is more variable: for large n its standard deviation is about 1.25 times that of the means (a variance ratio of about π/2 ≈ 1.57). The distribution of the 0.1 quantile should have greater variability still, as there's less information to estimate it.

6. Write a function with arguments n, m and alpha. The function should generate a vector of length m, mv say, each element of which is the alpha quantile of a random sample of size n from a standard Gaussian. The function should create a histogram of mv, label it appropriately, and return mv. Note: you've just written a function to plot the distribution of the order statistics! Compare the empirical distribution of mv with the distribution of a Gaussian centred at the mean value and scaled by the standard deviation of your quantile estimates. What happens as n and m become large?

7. Generate a pair of independent draws from a standard Gaussian distribution. Write a function which simulates n pairs and then calculates the proportion of pairs for which (i) both members are smaller than alpha (ii) either member is smaller than alpha. Your function should return these two proportions in a list. Calculate the exact (theoretical) probability of these events and compare your accuracy as n changes. Amend your function so that the above is repeated m times, plots a histogram of each of the two vectors of length m, and returns these two vectors in a list. Finally, can you repeat this whole procedure, but now with triples rather than pairs? This example will start to use the notions of probability and probability calculus discussed in the lecture. Note: if two random variables are independent we know that Pr(X & Y) = Pr(X | Y) Pr(Y) = Pr(X) Pr(Y).

We also have from the third axiom of probability: Pr(X or Y) = Pr(X) + Pr(Y) - Pr(X & Y).

8. Generate a pair of samples from a multivariate normal density with correlation rho=0.6. Scatterplot a large number of pairs and investigate what happens as rho changes. Write a function that uses simulation to estimate the conditional probability Pr(X > 0.6 | Y > 0.6) and returns the estimate, along with the marginal estimate of Pr(X > 0.6). Generalise your function to take rho, alpha and beta as arguments, and to estimate Pr(X > α | Y > β) and Pr(X > α), where each pair (X, Y) is drawn from a multivariate normal density with correlation rho. Investigate the dependence as rho is altered. Try negative values such as rho=-0.6. As the absolute value of rho increases so does the level of dependence, and Pr(X | Y) should look increasingly different to Pr(X).

9. Using very large sample sizes (say 100,000 samples) draw correlated pairs of multivariate Gaussian observations, (X, Y). Store X only if Y is in some small range (e.g. store X only if 0.6 < Y < 0.61). Explore the resulting distribution of the stored X values. What do you see? Here we are attempting to generate the conditional distribution Pr(X | Y = 0.605) via simulation. Interestingly this distribution should be Gaussian; that is, the conditional distribution of a multivariate Gaussian is also Gaussian. This result does not hold for all dependent (non-independent) multivariate distributions.

10. Generate n random draws from a standard Gaussian distribution. Store this in a vector z. Input

    xpl=seq(-5,5,l=10000)
    fn=ecdf(x=z)

Try to work out what the object fn is. What does fn(xpl) return? Plot the empirical CDF of the data:

    plot(x=xpl,y=fn(xpl),type="l")

Superimpose the theoretical CDF of a standard Gaussian distribution (see pnorm). This question is to get you used to thinking about the distribution function F(x) = Pr(X ≤ x) and to see that, as in questions (1) and (2), it is affected by sample size.
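The effect of sample size on the empirical CDF in question 10 can be put into numbers with a short sketch. It measures the largest gap between the empirical CDF and pnorm() over the grid xpl for a few sample sizes. The helper name max_gap, the chosen sample sizes and the seed are assumptions made for this illustration.

```r
set.seed(3)  # arbitrary seed so the run is reproducible

xpl <- seq(-5, 5, length.out = 10000)

# Largest absolute difference between the empirical and theoretical CDF on the grid
max_gap <- function(n) {
  z <- rnorm(n)
  fn <- ecdf(z)
  max(abs(fn(xpl) - pnorm(xpl)))
}

sample_sizes <- c(20, 200, 20000)
gaps <- sapply(sample_sizes, max_gap)
print(cbind(n = sample_sizes, max_gap = gaps))
```

The gap should shrink as n grows, which is the same sample-size effect seen in the histograms and Q-Q plots.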

11. Write a function g that takes as arguments two numeric vectors, u and v say; it must return an object, w say, the same length as u, with the ith element in w equal to the proportion of elements in v that are less than or equal to the ith element in u. Generate n random draws from a standard Gaussian distribution and store in a vector z. Then run

    gout=g(u=xpl,v=z)
    fn=ecdf(x=z)
    fout=fn(xpl)

Compare fout and gout. If they're the same, you've written a function to evaluate the empirical CDF of a sample of data, v, at a set of points, u!

Hypothesis Testing

In the examples above we've considered random variation that naturally occurs when we sample from a population. In this section we will look at random variation that occurs when we look for differences between samples drawn from two populations, that is, when testing for differences. The next couple of questions are aimed to get you thinking about p-values under the null (when there is no change in distribution between two treatments/experiments/categories).

1. Generate two sets of 50 values each from a standard normal N(0,1). That is, they have the same distribution. Use t.test() to test for differences in the means. Write a for loop to repeat the test 1000 times (with different data sets drawn from rnorm() each time), storing the p-values from the 1000 tests. Plot a histogram of the p-values and Q-Q plot them against a uniform density (note: you can approximate the theoretical ith quantile of a uniform by i/(n+1)). What percentage of your p-values fall below 0.05? Is that what you expect?

2. Perform question (1) above but now using 100 samples for each set. Histogram the p-values from 1000 repeats and see how many fall below 0.05. What do you expect to see?

3. Repeat (1) and (2) but now using a chi-squared distribution to draw the samples. Is the t-test robust to changes in distribution?

4. Generate 50 points from N(0, 1) and 50 points from N(µ, 1) with say µ = 0.4, y <- rnorm(50)+mu. Repeat question (1) above and plot the histograms of p-values. What percentage of p-values fall below 0.05? Compare your result with that given by power.t.test(n=50, delta=mu, sd=1, sig.level=0.05), where n is the number of observations per group.
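A minimal sketch of the comparison in question 4 follows, assuming 50 observations per group, µ = 0.4 and a 0.05 significance level. The number of repeats (reps), the seed and the use of var.equal = TRUE (chosen to match the equal-variance assumption behind power.t.test()) are choices made for this illustration.

```r
set.seed(2008)  # arbitrary seed so the run is reproducible

mu <- 0.4       # assumed true difference in means
n <- 50         # observations per group
reps <- 2000    # number of simulated experiments

# p-value from one two-sample t-test per simulated experiment
p_values <- replicate(reps, {
  x <- rnorm(n)
  y <- rnorm(n) + mu
  t.test(x, y, var.equal = TRUE)$p.value
})

# Proportion of experiments that reject at the 0.05 level
simulated_power <- mean(p_values < 0.05)

# Power of the same test computed analytically
theoretical_power <- power.t.test(n = n, delta = mu, sd = 1, sig.level = 0.05)$power

print(c(simulated = simulated_power, theoretical = theoretical_power))
```

The proportion of p-values below 0.05 is an estimate of the power of the test, so the two numbers should agree up to simulation error.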

5. Permutation testing is a great way to think about how we can explore the natural variation in a test result that occurs purely by chance when the null is true. To do a permutation we randomly swap points between the two sets of samples that we're testing. Having done this we know (by design) that there is no association between the class labels and the measurements (think about this!). Hence, any association we do see is purely by chance. To demonstrate the principle perform the following: Generate 50 points from N(0, 1), x<-rnorm(50), and 50 points from N(µ, 1), y<-rnorm(50) + mu. Swap points from X to Y at random. Suppose you have data stored in x and y; then the following code will shuffle the points across to create xnew and ynew:

    n <- length(x)
    # assign each point of x at random to xnew or ynew
    t <- rnorm(n)
    indx <- t < 0
    indx_2 <- t > 0
    xnew <- x[indx]
    ynew <- x[indx_2]
    # do the same for the points of y
    t <- rnorm(n)
    indx <- t < 0
    indx_2 <- t > 0
    xnew <- c(xnew, y[indx])
    ynew <- c(ynew, y[indx_2])

Use t.test to test for association. Repeat the above 1000 times and plot the distribution of the p-values.

6. Repeat the task in (5) but this time store the standardised difference between the sample means at each repeat. You can use the code below to calculate the standardised difference in means (note that xnew and ynew will generally have different lengths after the shuffle).

    mu_x <- mean(xnew)
    mu_y <- mean(ynew)
    n_x <- length(xnew)
    n_y <- length(ynew)
    # pooled standard deviation, built from the two sample variances
    grouped_standard_deviation <- sqrt(((n_x-1)*var(xnew)+(n_y-1)*var(ynew))/(n_x+n_y-2))
    # difference in means divided by its standard error
    mu_dif <- (mu_x - mu_y) / (grouped_standard_deviation * sqrt(1/n_x + 1/n_y))

Plot the distribution of standardised mean differences using histogram and Q-Q plots. What is the distribution?
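An alternative way to shuffle, sketched below, is to pool the two samples and permute the pooled vector with sample(), which keeps both group sizes fixed at their original values. This is a different technique from the random-assignment code given in question 5, and the helper name permute_once, the seed and the number of repeats are assumptions made for this illustration.

```r
set.seed(42)  # arbitrary seed so the run is reproducible

mu <- 0.4
x <- rnorm(50)
y <- rnorm(50) + mu

# Pool the data, shuffle it, and split it back into groups of the original sizes
permute_once <- function(x, y) {
  pooled <- sample(c(x, y))
  list(xnew = pooled[seq_along(x)],
       ynew = pooled[length(x) + seq_along(y)])
}

# p-values of the t-test on 1000 shuffled data sets
perm_pvalues <- replicate(1000, {
  p <- permute_once(x, y)
  t.test(p$xnew, p$ynew)$p.value
})

# Under shuffling there is no real association, so roughly 5% should fall below 0.05
frac_below <- mean(perm_pvalues < 0.05)
print(frac_below)
```

A call such as hist(perm_pvalues) then shows the roughly flat distribution of p-values expected when the labels carry no information.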

Multiple Testing

These questions will use R to explore issues in multiple testing by simulating repeated tests for association. The situation we consider is where we've performed T tests, say on T genetic markers, and we're interested in the evidence for association provided by the top hits (those with lowest p-values). As before we need to look at the distribution of the test statistic when the global null holds (i.e. none of the effects are true), and as before we can simulate this situation using R.

1. Generate n = 50 points for two groups X and Y under the null that they have no difference in means. Use rnorm(n=50) for both X and Y. Repeat m = 1000 times, storing the difference in means (x̄ - ȳ) each time, and histogram the results.

2. Suppose we have now performed T = 1000 tests each with n = 50 individuals in two groups X and Y. We're interested in the distribution of the mean differences under the global null: Generate 50 points for X and Y from a standard normal rnorm(50) and calculate the difference in means. Store the result and repeat this T = 1000 times for the T tests (e.g. mimicking T genetic markers). Then store the MAXIMUM of the differences of means (across the T = 1000 experiments). Repeat this whole procedure m = 1000 times, storing the maximum each time. Histogram the m = 1000 maxima and compare to the distribution you get when you set T = 1, i.e. only a single experiment. What changes? Hint: you should write a function that takes inputs T, n and m, plots the histogram, and returns the vector (of length m) of maximum mean differences.

3. Repeat Q(2) above but now add in a true effect for one of the T = 1000 tests; say the first test. For the first test generate X from rnorm(n,0,1) and Y from rnorm(n,mean=0.4,sd=1) (i.e. µ = 0.4), with all other tests having no difference. (i) Calculate how often in the m = 1000 simulations the true effect is the top hit. (ii) Plot the distribution (histogram) of the ranking of the true effect in the sorted list of T = 1000 tests. (iii) Investigate the dependency by changing µ, T, and n. How do µ, T and n affect the multiple testing problem? Note: you should generalise your function in Q(2) to do this.

4. Repeat Q(3) but now store the difference in means for the true effect (x̄ - ȳ) as well as the maximum of the differences in means for the remaining T - 1 null tests. Repeat m = 1000 times. Plot the histograms of the maxima and of the true effect. Explore what happens as T increases. Explore what happens as n, the sample size, increases.
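The effect of taking the maximum over many null tests, as in Q(2), can be sketched as follows. The function name max_mean_diff, the argument name n_tests (standing in for T, to avoid masking R's shorthand for TRUE), the seed and the reduced number of repeats m = 200 are assumptions made for this illustration. Plotting is left out so that the sketch only returns the simulated maxima.

```r
set.seed(7)  # arbitrary seed so the run is reproducible

# For each of m repeats: simulate n_tests null experiments with n points per group
# and keep the largest difference in sample means
max_mean_diff <- function(n_tests, n, m) {
  replicate(m, {
    x <- matrix(rnorm(n_tests * n), nrow = n_tests)  # one row per test
    y <- matrix(rnorm(n_tests * n), nrow = n_tests)
    max(rowMeans(x) - rowMeans(y))
  })
}

single <- max_mean_diff(n_tests = 1, n = 50, m = 200)     # a single experiment
many <- max_mean_diff(n_tests = 1000, n = 50, m = 200)    # top hit among 1000 tests

print(summary(single))
print(summary(many))
```

With a single test the differences are centred on zero, whereas the maximum over 1000 null tests is shifted well away from zero even though no effect is real. Histograms of single and many make the shift easy to see.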


Quiz 1. Name: Instructions: Closed book, notes, and no electronic devices. Quiz 1. Name: Instructions: Closed book, notes, and no electronic devices. 1.(10) What is usually true about a parameter of a model? A. It is a known number B. It is determined by the data C. It is an

More information

EE/CpE 345. Modeling and Simulation. Fall Class 10 November 18, 2002

EE/CpE 345. Modeling and Simulation. Fall Class 10 November 18, 2002 EE/CpE 345 Modeling and Simulation Class 0 November 8, 2002 Input Modeling Inputs(t) Actual System Outputs(t) Parameters? Simulated System Outputs(t) The input data is the driving force for the simulation

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Designing Information Devices and Systems II Fall 2018 Elad Alon and Miki Lustig Homework 9

Designing Information Devices and Systems II Fall 2018 Elad Alon and Miki Lustig Homework 9 EECS 16B Designing Information Devices and Systems II Fall 18 Elad Alon and Miki Lustig Homework 9 This homework is due Wednesday, October 31, 18, at 11:59pm. Self grades are due Monday, November 5, 18,

More information

IENG581 Design and Analysis of Experiments INTRODUCTION

IENG581 Design and Analysis of Experiments INTRODUCTION Experimental Design IENG581 Design and Analysis of Experiments INTRODUCTION Experiments are performed by investigators in virtually all fields of inquiry, usually to discover something about a particular

More information

FRANKLIN UNIVERSITY PROFICIENCY EXAM (FUPE) STUDY GUIDE

FRANKLIN UNIVERSITY PROFICIENCY EXAM (FUPE) STUDY GUIDE FRANKLIN UNIVERSITY PROFICIENCY EXAM (FUPE) STUDY GUIDE Course Title: Probability and Statistics (MATH 80) Recommended Textbook(s): Number & Type of Questions: Probability and Statistics for Engineers

More information

Stat 101: Lecture 12. Summer 2006

Stat 101: Lecture 12. Summer 2006 Stat 101: Lecture 12 Summer 2006 Outline Answer Questions More on the CLT The Finite Population Correction Factor Confidence Intervals Problems More on the CLT Recall the Central Limit Theorem for averages:

More information

Week 2: Review of probability and statistics

Week 2: Review of probability and statistics Week 2: Review of probability and statistics Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ALL RIGHTS RESERVED

More information

Assignments. Statistics Workshop 1: Introduction to R. Tuesday May 26, Atoms, Vectors and Matrices

Assignments. Statistics Workshop 1: Introduction to R. Tuesday May 26, Atoms, Vectors and Matrices Statistics Workshop 1: Introduction to R. Tuesday May 26, 2009 Assignments Generally speaking, there are three basic forms of assigning data. Case one is the single atom or a single number. Assigning a

More information

Statistical Inference

Statistical Inference Statistical Inference Bernhard Klingenberg Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Outline Estimation: Review of concepts

More information

Graphical Diagnosis. Paul E. Johnson 1 2. (Mostly QQ and Leverage Plots) 1 / Department of Political Science

Graphical Diagnosis. Paul E. Johnson 1 2. (Mostly QQ and Leverage Plots) 1 / Department of Political Science (Mostly QQ and Leverage Plots) 1 / 63 Graphical Diagnosis Paul E. Johnson 1 2 1 Department of Political Science 2 Center for Research Methods and Data Analysis, University of Kansas. (Mostly QQ and Leverage

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

STATISTICS/MATH /1760 SHANNON MYERS

STATISTICS/MATH /1760 SHANNON MYERS STATISTICS/MATH 103 11/1760 SHANNON MYERS π 100 POINTS POSSIBLE π YOUR WORK MUST SUPPORT YOUR ANSWER FOR FULL CREDIT TO BE AWARDED π YOU MAY USE A SCIENTIFIC AND/OR A TI-83/84/85/86 CALCULATOR ONCE YOU

More information

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations EECS 70 Discrete Mathematics and Probability Theory Fall 204 Anant Sahai Note 5 Random Variables: Distributions, Independence, and Expectations In the last note, we saw how useful it is to have a way of

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Chapter 27 Summary Inferences for Regression

Chapter 27 Summary Inferences for Regression Chapter 7 Summary Inferences for Regression What have we learned? We have now applied inference to regression models. Like in all inference situations, there are conditions that we must check. We can test

More information

9/2/2010. Wildlife Management is a very quantitative field of study. throughout this course and throughout your career.

9/2/2010. Wildlife Management is a very quantitative field of study. throughout this course and throughout your career. Introduction to Data and Analysis Wildlife Management is a very quantitative field of study Results from studies will be used throughout this course and throughout your career. Sampling design influences

More information

CS Homework 3. October 15, 2009

CS Homework 3. October 15, 2009 CS 294 - Homework 3 October 15, 2009 If you have questions, contact Alexandre Bouchard (bouchard@cs.berkeley.edu) for part 1 and Alex Simma (asimma@eecs.berkeley.edu) for part 2. Also check the class website

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

Physics 509: Non-Parametric Statistics and Correlation Testing

Physics 509: Non-Parametric Statistics and Correlation Testing Physics 509: Non-Parametric Statistics and Correlation Testing Scott Oser Lecture #19 Physics 509 1 What is non-parametric statistics? Non-parametric statistics is the application of statistical tests

More information

6348 Final, Fall 14. Closed book, closed notes, no electronic devices. Points (out of 200) in parentheses.

6348 Final, Fall 14. Closed book, closed notes, no electronic devices. Points (out of 200) in parentheses. 6348 Final, Fall 14. Closed book, closed notes, no electronic devices. Points (out of 200) in parentheses. 0 11 1 1.(5) Give the result of the following matrix multiplication: 1 10 1 Solution: 0 1 1 2

More information

Confidence intervals

Confidence intervals Confidence intervals We now want to take what we ve learned about sampling distributions and standard errors and construct confidence intervals. What are confidence intervals? Simply an interval for which

More information

Discrete Mathematics for CS Spring 2006 Vazirani Lecture 22

Discrete Mathematics for CS Spring 2006 Vazirani Lecture 22 CS 70 Discrete Mathematics for CS Spring 2006 Vazirani Lecture 22 Random Variables and Expectation Question: The homeworks of 20 students are collected in, randomly shuffled and returned to the students.

More information

Sociology 6Z03 Review II

Sociology 6Z03 Review II Sociology 6Z03 Review II John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review II Fall 2016 1 / 35 Outline: Review II Probability Part I Sampling Distributions Probability

More information

Bootstrap. ADA1 November 27, / 38

Bootstrap. ADA1 November 27, / 38 The bootstrap as a statistical method was invented in 1979 by Bradley Efron, one of the most influential statisticians still alive. The idea is nonparametric, but is not based on ranks, and is very computationally

More information

Bootstrap tests. Patrick Breheny. October 11. Bootstrap vs. permutation tests Testing for equality of location

Bootstrap tests. Patrick Breheny. October 11. Bootstrap vs. permutation tests Testing for equality of location Bootstrap tests Patrick Breheny October 11 Patrick Breheny STA 621: Nonparametric Statistics 1/14 Introduction Conditioning on the observed data to obtain permutation tests is certainly an important idea

More information

3. Shrink the vector you just created by removing the first element. One could also use the [] operators with a negative index to remove an element.

3. Shrink the vector you just created by removing the first element. One could also use the [] operators with a negative index to remove an element. BMI 713: Computational Statistical for Biomedical Sciences Assignment 1 September 9, 2010 (due Sept 16 for Part 1; Sept 23 for Part 2 and 3) 1 Basic 1. Use the : operator to create the vector (1, 2, 3,

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new

More information

Inference for Single Proportions and Means T.Scofield

Inference for Single Proportions and Means T.Scofield Inference for Single Proportions and Means TScofield Confidence Intervals for Single Proportions and Means A CI gives upper and lower bounds between which we hope to capture the (fixed) population parameter

More information

Math 180B Problem Set 3

Math 180B Problem Set 3 Math 180B Problem Set 3 Problem 1. (Exercise 3.1.2) Solution. By the definition of conditional probabilities we have Pr{X 2 = 1, X 3 = 1 X 1 = 0} = Pr{X 3 = 1 X 2 = 1, X 1 = 0} Pr{X 2 = 1 X 1 = 0} = P

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Asymptotic Statistics-VI. Changliang Zou

Asymptotic Statistics-VI. Changliang Zou Asymptotic Statistics-VI Changliang Zou Kolmogorov-Smirnov distance Example (Kolmogorov-Smirnov confidence intervals) We know given α (0, 1), there is a well-defined d = d α,n such that, for any continuous

More information

Naive Bayes classification

Naive Bayes classification Naive Bayes classification Christos Dimitrakakis December 4, 2015 1 Introduction One of the most important methods in machine learning and statistics is that of Bayesian inference. This is the most fundamental

More information

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 12. Random Variables: Distribution and Expectation

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 12. Random Variables: Distribution and Expectation CS 70 Discrete Mathematics and Probability Theory Fall 203 Vazirani Note 2 Random Variables: Distribution and Expectation We will now return once again to the question of how many heads in a typical sequence

More information

EE/CpE 345. Modeling and Simulation. Fall Class 9

EE/CpE 345. Modeling and Simulation. Fall Class 9 EE/CpE 345 Modeling and Simulation Class 9 208 Input Modeling Inputs(t) Actual System Outputs(t) Parameters? Simulated System Outputs(t) The input data is the driving force for the simulation - the behavior

More information

Section 3. Measures of Variation

Section 3. Measures of Variation Section 3 Measures of Variation Range Range = (maximum value) (minimum value) It is very sensitive to extreme values; therefore not as useful as other measures of variation. Sample Standard Deviation The

More information

2.830J / 6.780J / ESD.63J Control of Manufacturing Processes (SMA 6303) Spring 2008

2.830J / 6.780J / ESD.63J Control of Manufacturing Processes (SMA 6303) Spring 2008 MIT OpenCourseWare http://ocw.mit.edu 2.830J / 6.780J / ESD.63J Control of Processes (SMA 6303) Spring 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Stat 135, Fall 2006 A. Adhikari HOMEWORK 6 SOLUTIONS

Stat 135, Fall 2006 A. Adhikari HOMEWORK 6 SOLUTIONS Stat 135, Fall 2006 A. Adhikari HOMEWORK 6 SOLUTIONS 1a. Under the null hypothesis X has the binomial (100,.5) distribution with E(X) = 50 and SE(X) = 5. So P ( X 50 > 10) is (approximately) two tails

More information

COMP6463: λ-calculus

COMP6463: λ-calculus COMP6463: λ-calculus 1. Basics Michael Norrish Michael.Norrish@nicta.com.au Canberra Research Lab., NICTA Semester 2, 2015 Outline Introduction Lambda Calculus Terms Alpha Equivalence Substitution Dynamics

More information

CIS519: Applied Machine Learning Fall Homework 5. Due: December 10 th, 2018, 11:59 PM

CIS519: Applied Machine Learning Fall Homework 5. Due: December 10 th, 2018, 11:59 PM CIS59: Applied Machine Learning Fall 208 Homework 5 Handed Out: December 5 th, 208 Due: December 0 th, 208, :59 PM Feel free to talk to other members of the class in doing the homework. I am more concerned

More information

Foundations of Probability and Statistics

Foundations of Probability and Statistics Foundations of Probability and Statistics William C. Rinaman Le Moyne College Syracuse, New York Saunders College Publishing Harcourt Brace College Publishers Fort Worth Philadelphia San Diego New York

More information

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

Statistics Introductory Correlation

Statistics Introductory Correlation Statistics Introductory Correlation Session 10 oscardavid.barrerarodriguez@sciencespo.fr April 9, 2018 Outline 1 Statistics are not used only to describe central tendency and variability for a single variable.

More information

Quick Sort Notes , Spring 2010

Quick Sort Notes , Spring 2010 Quick Sort Notes 18.310, Spring 2010 0.1 Randomized Median Finding In a previous lecture, we discussed the problem of finding the median of a list of m elements, or more generally the element of rank m.

More information

COMP6053 lecture: Sampling and the central limit theorem. Markus Brede,

COMP6053 lecture: Sampling and the central limit theorem. Markus Brede, COMP6053 lecture: Sampling and the central limit theorem Markus Brede, mb8@ecs.soton.ac.uk Populations: long-run distributions Two kinds of distributions: populations and samples. A population is the set

More information

Lecture 10: Powers of Matrices, Difference Equations

Lecture 10: Powers of Matrices, Difference Equations Lecture 10: Powers of Matrices, Difference Equations Difference Equations A difference equation, also sometimes called a recurrence equation is an equation that defines a sequence recursively, i.e. each

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

1 A brief primer on probability distributions

1 A brief primer on probability distributions Inference, Models and Simulation for Comple Systems CSCI 7- Lecture 3 August Prof. Aaron Clauset A brief primer on probability distributions. Probability distribution functions A probability density function

More information