Lecture 09: Sep 19, 2018 Randomness Random Variables Sampling Probability Distributions Caching James Balamuta STAT 385 @ UIUC

Announcements
  hw03 is due on Friday, Sep 21st, 2018 @ 6:00 PM.
  Quiz 04 covers Week 3 contents @ CBTF. Window: Sep 18th - 20th.
  EC Opportunity: Attend the Reflections and Projections Conference. Write a 2-3 paragraph reflection about a talk and take a picture with the speaker for 1% extra credit.
  Fine Print: Got caught using GitHub's web interface in hw01 or hw02? Let's chat.

Last Time
  SLR and MLR
    Estimating a linear regression with 2 parameters vs. p parameters.
    Underlying design matrix construction.
  Factors
    Provide a mapping of levels to indicator variables inside the design matrix.

Topic Outline Tools for Reproducibility Working with Data Simulations Visual Exploratory Data Analysis Unstructured Data Data Wrangling Functions Modeling Data Interactive Interfaces High Performance Computing

Topic Outline http://r4ds.had.co.nz/introduction.html

Lecture Objectives
  Describe what a random variable is.
  Create and use a pseudo-random number generator.
  Generate random variables from probability distributions in R.
  Explain the differences between sampling with and without replacement.

Random Variables

Definition: Sample Space contains all possible outcomes that can be obtained. The outcomes can be either:
  1. discrete / finite in size, e.g. heads / tails, or
  2. continuous / infinite in possibilities, e.g. weight / height.
[Figure: discrete examples ("Heads" / "Tails" for 1 coin, 2 coins) and a continuous example (US height distribution)]

Definition: Random Variables are variables that take on an unknown value in a sample space.
  P(X = Heads) = 0.5
  P(X = Tails) = 0.5
  Assumes a "fair" coin.

Kinds of Randomness
  Pseudo-Random Number Generators (PRNGs)
    Deterministic
    Efficient
    Linear Congruential Method
  True Random Number Generators (TRNGs)
    Based on randomness from physical phenomena

Deterministic or Nondeterministic
  Deterministic: Same Input, Same Output
  Nondeterministic: Same Input, Different Outputs
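
The "same input, same output" behaviour of a PRNG can be sketched with R's built-in generator. The seed value 385 is an arbitrary choice for illustration, not something prescribed by the slides.

```r
# Seeding the generator fixes the stream of pseudo-random numbers
set.seed(385)
first_draw = runif(3)

# Re-using the same seed reproduces exactly the same values
set.seed(385)
second_draw = runif(3)

# Without re-seeding, the stream continues with new values
third_draw = runif(3)

identical(first_draw, second_draw)  # TRUE
identical(first_draw, third_draw)   # FALSE
```

The seed plays the role of the "input": fixing it makes the whole sequence reproducible.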

LCM PRNG: overview of the LCM algorithm
  $X_{i+1} = (a X_i + c) \bmod m$
  Seed: $X_0$, with $0 \le X_0 < m$
  Multiplier: $a$, with $0 \le a < m$
  Increment: $c$, with $0 \le c < m$
  Modulus: $m$, with $0 < m$
  Generated: $X_1, X_2, X_3, \ldots, X_n$, integers in the range $X_i \in [0, m - 1]$
  $R_i = X_i / m \in [0, 1)$, numeric; does not include 1

LCM Computations: computation dependency
  $X_{i+1} = (a X_i + c) \bmod m$, where $X_i$ is the prior generated value
  $X_0 = 20$, $a = 13$, $c = 3$, $m = 64$
  Simulated integers:
    $X_1 = (13(20) + 3) \bmod 64 = 263 \bmod 64 = 7$
    $X_2 = (13(7) + 3) \bmod 64 = 94 \bmod 64 = 30$
    $X_3 = (13(30) + 3) \bmod 64 = 393 \bmod 64 = 9$
    $X_4 = (13(9) + 3) \bmod 64 = 120 \bmod 64 = 56$
    $X_5 = (13(56) + 3) \bmod 64 = 731 \bmod 64 = 27$
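
The hand computation above can be reproduced in a few lines of R. This is a minimal sketch; `lcm_generate` is a hypothetical helper name introduced here for illustration and is not part of the original slides.

```r
# Hypothetical helper: generate n values from X_{i+1} = (a * X_i + c) mod m
lcm_generate = function(n, seed, a, c, m) {
  x = numeric(n)   # pre-allocate the output vector
  prior = seed     # X_0
  for (i in seq_len(n)) {
    x[i] = (a * prior + c) %% m  # each value depends on the prior one
    prior = x[i]
  }
  x
}

# Same parameters as the worked example: X_0 = 20, a = 13, c = 3, m = 64
x = lcm_generate(n = 5, seed = 20, a = 13, c = 3, m = 64)
x
# [1]  7 30  9 56 27

# Scale the integers to real numbers in [0, 1)
r = x / 64
```

The loop makes the dependency explicit: every new value needs the previously generated one, so the sequence cannot be computed out of order.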

Previously: Modulus Operator, a mod q
    12 %% 7
    # a = n*q + r  =>  12 = 1*7 + 5
    # [1] 5
    outer(1:9, 2:9, `%%`)
    # Compute the cross between X & Y

           mod 2  mod 3  mod 4  mod 5  mod 6  mod 7  mod 8  mod 9
    x = 1      1      1      1      1      1      1      1      1
    x = 2      0      2      2      2      2      2      2      2
    x = 3      1      0      3      3      3      3      3      3
    x = 4      0      1      0      4      4      4      4      4
    x = 5      1      2      1      0      5      5      5      5
    x = 6      0      0      2      1      0      6      6      6
    x = 7      1      1      3      2      1      0      7      7
    x = 8      0      2      0      3      2      1      0      8
    x = 9      1      0      1      4      3      2      1      0

How could we generate Xi so that Ri includes 1?

Sample Implementation: LCM Algorithm
    # Setup values
    a = 5181215173
    c = 12581
    m = 2^32
    # Initial seed
    seed = as.numeric(Sys.time()) * 1000
    # Pre-allocation of numeric vector
    x = numeric(length = 2)
    # Set initial value
    x[1] = seed
    x[2] = (a * x[1] + c) %% m
    # Real number (not including 1)
    r = x[2] / m

PRNG Periodicity
  a = 13, c = 0, m = 2^6 = 64

     i   X_0 = 1   X_0 = 2   X_0 = 3   X_0 = 4
     0         1         2         3         4
     1        13        26        39        52
     2        41        18        59        36
     3        21        42        63        20
     4        17        34        51         4
     5        29        58        23        52
     6        57        50        43        36
     7        37        10        47        20
     8        33         2        35         4
     9        45        26         7        52
    10         9        18        27        36
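
The period for each seed can also be found by iterating until the seed reappears. The following is a sketch under the same parameters (a = 13, c = 0, m = 64); `lcg_period` is a hypothetical helper name, and the cap of `m` iterations is an assumption added so the loop always terminates.

```r
# Hypothetical helper: number of steps until the sequence returns to its seed.
# Returns NA if the seed is not revisited within m steps.
lcg_period = function(seed, a, c, m) {
  x = seed
  for (steps in seq_len(m)) {
    x = (a * x + c) %% m
    if (x == seed) {
      return(steps)
    }
  }
  NA
}

# Periods for the seeds X_0 = 1, 2, 3, 4
periods = sapply(1:4, lcg_period, a = 13, c = 0, m = 64)
periods
# [1] 16  8 16  4
```

None of the four seeds comes close to visiting all 64 possible values, which illustrates how strongly the period depends on the seed and parameters.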

Quality of R.V.: uniqueness of numbers
  Independent: a = 1240814549, c = 12581, m = 2^32
  Dependent: a = 1229, c = 1, m = 2^11

Source XKCD 221: Random Number

Sampling

Definition: Sampling without Replacement involves picking values from a sample space and not adding them back for subsequent picks. In essence, a value may only be selected once.
[Figure: values move from "Available" to "Picked"]
https://www.youtube.com/watch?v=pmtr0ocx6og

Definition: Sampling with Replacement involves picking values from a sample space and adding each picked value back for subsequent picks. In essence, a value may be selected multiple times.

Sampling: picking r.v. values
  Sample Space: Either a vector or number to sample from.
  Sample Size: Number of values to pick from sample space.
  Replacement: Sample without/with replacement (FALSE/TRUE).
  Probabilities: Weighting for each observation. Optional.
    sample(x = <sample-space>, size = <nobs>, replace = FALSE, prob = NULL)

Sampling without Replacement: picking r.v. values once
  Set a seed to control reproducibility.
  Sample Space: Either a vector or number to sample from.
  Sample Size: Number of values to pick from sample space.
  Replacement: Sample without/with replacement (FALSE/TRUE).
    set.seed(90210)
    sample(x = 8, size = 4, replace = FALSE)
    # [1] 3 1 2 7
  NB: Given a single number n, sample() creates the vector seq(1, n) to sample from, e.g. for n = 10:
    seq(1, 10)
    # [1] 1 2 3 4 5 6 7 8 9 10
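
Two consequences of sampling without replacement can be checked directly: no value repeats, and the sample cannot be larger than the sample space. This is a small sketch; the variable names are illustrative, and no printed draws are shown because they depend on the R version's sampling algorithm.

```r
set.seed(90210)

# Pick 4 values out of 1:8 without putting any back
picked = sample(x = 8, size = 4, replace = FALSE)

# anyDuplicated() returns 0 when every value is unique
anyDuplicated(picked) == 0
# [1] TRUE

# Asking for more values than are available is an error
too_many = tryCatch(
  sample(x = 8, size = 9, replace = FALSE),
  error = function(e) "error"
)
too_many
# [1] "error"
```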

Sampling without Replacement: picking character r.v. values
  Set a seed to control reproducibility.
  Sample Space: Either a vector or number to sample from.
  Sample Size: Number of values to pick from sample space.
  Replacement: Sample without/with replacement (FALSE/TRUE).
    set.seed(42)
    sample(x = c("heads", "tails"), size = 1, replace = FALSE)
    # [1] "tails"

Your Turn
  Modify the sample code such that it can roll a six-sided die once.
  Set a seed to control reproducibility.
  Sample Space: Either a vector or number to sample from.
  Sample Size: Number of values to pick from sample space.
  Replacement: Sample without/with replacement (FALSE/TRUE).
  Probabilities: Weighting for each observation. Optional.
    set.seed(385)
    sample(x = <sample-space>, size = <nobs>, replace = FALSE, prob = NULL)

Sampling with Replacement: picking r.v. values
  Set a seed to control reproducibility.
  Sample Space: Either a vector or number to sample from.
  Sample Size: Number of values to pick from sample space.
  Replacement: Sample without/with replacement (FALSE/TRUE).
    set.seed(5318008)
    sample(x = 10, size = 20, replace = TRUE)
    # [1] 7 2 9 8 9 7 3 6 1 10 7 5 8 10 1 6 6 7 2 6

Sampling with Replacement: biased coin sample
  Set a seed to control reproducibility.
  Sample Space: Either a vector or number to sample from.
  Sample Size: Number of values to pick from sample space.
  Replacement: Sample without/with replacement (FALSE/TRUE).
  Probabilities: Weighting for each observation. Optional.
    set.seed(376006)
    y = sample(x = c("heads", "tails"), size = 10, replace = TRUE, prob = c(0.4, 0.6))
    y
    # [1] "tails" "heads" "tails" "tails" "tails" "tails" "tails" "heads" "tails" "tails"

Frequencies and Proportions: counting sampled data
    # Previously obtained sample under probability of 0.4/0.6
    y
    # [1] "tails" "heads" "tails" "tails" "tails" "tails" "tails" "heads" "tails" "tails"
    # Count observations and/or obtain the proportion
    table(y)
    # heads tails
    #     2     8
    prop.table(table(y))
    # heads tails
    #   0.2   0.8
    # Why doesn't this match the requested probability of 0.4 / 0.6?
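
One way to explore the question above is to repeat the biased coin flip many more times. The sketch below assumes a sample size of 100,000, chosen here only for illustration; with only 10 flips the observed proportions can be far from 0.4 / 0.6 by chance, while a large sample should land close to them.

```r
set.seed(376006)

# Same biased coin, but many more flips
big_sample = sample(
  x = c("heads", "tails"),
  size = 100000,
  replace = TRUE,
  prob = c(0.4, 0.6)
)

# Observed proportions of each outcome
props = prop.table(table(big_sample))
props

# Distance of the observed proportion of heads from the requested 0.4
abs(props[["heads"]] - 0.4)
```

The `prob` argument controls the long-run frequencies, not the exact counts in any single small sample.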

Your Turn
  Create a call to sample such that it can roll a "loaded" six-sided die twice with probabilities:
    1 := 0.1, 2 := 0.3, 3 := 0.1, 4 := 0.15, 5 := 0.15, 6 := 0.2
  Sample Space: Either a vector or number to sample from.
  Sample Size: Number of values to pick from sample space.
  Replacement: Sample without/with replacement (FALSE/TRUE).
  Probabilities: Weighting for each observation. Optional.
    sample(x = <sample-space>, size = <nobs>, replace = FALSE, prob = NULL)

Probability Distributions

Definition: A Probability Distribution pairs each outcome of a statistical event with the probability that it occurs.

Supported Distributions: possible distributions

    Distribution           Functions
    Beta                   pbeta      qbeta      dbeta      rbeta
    Binomial               pbinom     qbinom     dbinom     rbinom
    Cauchy                 pcauchy    qcauchy    dcauchy    rcauchy
    Chi-Squared            pchisq     qchisq     dchisq     rchisq
    Exponential            pexp       qexp       dexp       rexp
    F                      pf         qf         df         rf
    Gamma                  pgamma     qgamma     dgamma     rgamma
    Geometric              pgeom      qgeom      dgeom      rgeom
    Hypergeometric         phyper     qhyper     dhyper     rhyper
    Logistic               plogis     qlogis     dlogis     rlogis
    Log Normal             plnorm     qlnorm     dlnorm     rlnorm
    Negative Binomial      pnbinom    qnbinom    dnbinom    rnbinom
    Normal                 pnorm      qnorm      dnorm      rnorm
    Poisson                ppois      qpois      dpois      rpois
    Student's t            pt         qt         dt         rt
    Studentized Range      ptukey     qtukey     (none)     (none)
    Uniform                punif      qunif      dunif      runif
    Weibull                pweibull   qweibull   dweibull   rweibull
    Wilcoxon Rank Sum      pwilcox    qwilcox    dwilcox    rwilcox
    Wilcoxon Signed Rank   psignrank  qsignrank  dsignrank  rsignrank

  Base R provides only ptukey and qtukey for the Studentized Range.
  See the CRAN Task View on probability distributions for more.

Normal Distribution: an example of a probability distribution
  $f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
  $\mu$ is the mean
  $\sigma$ is the standard deviation
  $\sigma^2$ is the variance
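
The density formula can be checked against R's built-in `dnorm()`. This is a minimal sketch; `normal_pdf` is a hypothetical name for a direct transcription of the formula, and the evaluation points and parameters are arbitrary.

```r
# Direct transcription of the normal density formula
normal_pdf = function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x = c(-1, 0, 1)

# Standard normal: mean 0, standard deviation 1
all.equal(normal_pdf(x), dnorm(x))
# [1] TRUE

# A non-standard case: mean 2, standard deviation 3
all.equal(normal_pdf(x, mu = 2, sigma = 3), dnorm(x, mean = 2, sd = 3))
# [1] TRUE
```

Note that `dnorm()` is parameterized by the standard deviation `sd`, not the variance.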

Distribution Functions
  d: density, or the probability density function (PDF)
  p: probability, or the cumulative distribution function (CDF)
  q: quantile, or the inverse CDF
  r: random variable generation
    # Normal PDF
    dnorm(c(-1, 0, 1))
    # [1] 0.2420 0.3989 0.2420
    # Normal CDF
    pnorm(c(-1, 0, 1))
    # [1] 0.1587 0.5000 0.8413
    # Normal inverse CDF
    qnorm(c(0.15, 0.5, 0.84))
    # [1] -1.0364 0.0000 0.9945
    # Normal RNG
    set.seed(15)
    rnorm(2)
    # [1] 0.2588 1.8311

Cumulative and Density
  $F_X(x) = P(X \le x)$
    # Compute the probability
    pnorm(1.96)
    # [1] 0.9750021
  $f_X(x) = F_X'(x)$
    # Compute the density at a specific point (uh oh)
    dnorm(1.96)
    # [1] 0.05844094
    # Reading the density as a probability only works for discrete distributions. Why?
    # Integration required to match output of pnorm
    integrate(dnorm, lower = -10, upper = 1.96)
    # 0.9750021 with absolute error < 2e-07
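
For a discrete distribution the `d*` function returns an actual probability, P(X = x), so the CDF is a running sum instead of an integral. The sketch below uses a Binomial(10, 0.5) distribution, which is an arbitrary choice for illustration.

```r
# All possible outcomes of a Binomial(10, 0.5) random variable
x = 0:10

# d*: probability mass function, P(X = x)
pmf = dbinom(x, size = 10, prob = 0.5)

# The probabilities over the whole sample space sum to 1
sum(pmf)
# [1] 1

# Summing the PMF up to x reproduces the CDF, P(X <= x)
cdf_from_pmf = cumsum(pmf)
all.equal(cdf_from_pmf, pbinom(x, size = 10, prob = 0.5))
# [1] TRUE
```

In the continuous case a single point has probability zero, which is why the density must be integrated over an interval.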

Quantiles: finding the p-th quantile
  $F_X(x) = P(X \le x) = p$, so $x = F_X^{-1}(p)$ and $F_X\left( F_X^{-1}(p) \right) = p$
    # Obtain the x value for a probability of 0.975
    qnorm(0.975)
    # [1] 1.959964
    # Note, this is taking the inverse of pnorm, c.f.
    qnorm( pnorm(1.959964) )
    # [1] 1.959964
    pnorm( qnorm(0.975) )
    # [1] 0.975

Random Values: generating lots of numbers
    # Set a seed
    set.seed(1)
    # Generate 1000 values from a standard normal
    rnorm(1000)
    # [1] -0.626453811
    # [2] 0.183643324
    # [3] ...
  Box-Muller Transform
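
The Box-Muller transform named on the slide turns pairs of uniform values into standard normal values. The following is an illustrative sketch only: `box_muller` is a hypothetical helper, and this is not how `rnorm()` works by default (R's default normal generator uses inversion).

```r
# Hypothetical helper: n standard normal values via the Box-Muller transform
box_muller = function(n) {
  u1 = runif(n)  # uniform on (0, 1); runif() does not return exactly 0 or 1
  u2 = runif(n)
  sqrt(-2 * log(u1)) * cos(2 * pi * u2)
}

set.seed(1)
z = box_muller(10000)

# The sample mean should be near 0 and the standard deviation near 1
c(mean = mean(z), sd = sd(z))
```

The second value of each pair, `sqrt(-2 * log(u1)) * sin(2 * pi * u2)`, is also standard normal and independent of the first; it is dropped here to keep the sketch short.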

Generic Distribution Form: pattern in statistical distribution functions
    # * represents a supported distribution (e.g. norm, t, f, chisq)
    # Random Number Generation
    r*(n, param1 =, param2 = )
    # Density
    d*(x, param1 =, param2 =, log = FALSE)
    # Probability
    p*(q, param1 =, param2 =, lower.tail = TRUE, log.p = FALSE)
    # Quantile (inverse probability)
    q*(p, param1 =, param2 =, lower.tail = TRUE, log.p = FALSE)
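
The same pattern applies to any supported distribution. As a sketch, here it is for the exponential distribution, whose single parameter is `rate`; the rate of 2 and the evaluation point 1 are arbitrary choices for illustration.

```r
rate = 2

# Density at x = 1
density_at_1 = dexp(1, rate = rate)

# Probability P(X <= 1), and the upper tail P(X > 1) via lower.tail = FALSE
lower = pexp(1, rate = rate)
upper = pexp(1, rate = rate, lower.tail = FALSE)
lower + upper
# [1] 1

# Quantile: inverts the probability back to x = 1
qexp(lower, rate = rate)
# [1] 1

# Random number generation: 5 draws
set.seed(1)
draws = rexp(5, rate = rate)
```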

Your Turn Using the qnorm() function, determine what the quantiles are at probabilities 0.5, 0.95, and 0.975. Have you seen these values before?

Caching

Definition: Caching refers to storing data locally in order to speed up subsequent retrievals.

Cache in Action: large value calculation
```{r cache-large-computation, cache = TRUE}
x = rnorm(10000*10000)
system.time({ sorted_x = sort(x) })
```
#   user  system elapsed
# 10.564   1.071  11.713
The code chunk will be run once, the resulting objects are then stored within a file in the cache directory, and the stored data is then displayed on subsequent runs.

"There are only two hard things in Computer Science: cache invalidation and naming things." (Phil Karlton)

Warning Do not try the next example in an RMarkdown document with set.seed

Caution Required: randomness and storage
```{r bad-sampling}
x = rnorm(3)
x
```
# [1] 1.1754306 0.5250691 0.1464347
```{r cached-calc, cache = TRUE}
x
y = x + 2
y
```
# [1] 1.1754306 0.5250691 0.1464347
# [1] 3.175431 2.525069 2.146435

Redux Caution Required: re-running the aforementioned code
```{r bad-sampling}
x = rnorm(3)
x
```
# [1] -1.8372987 0.8720906 0.2886695
```{r cached-calc, cache = TRUE}
x
y = x + 2
y
```
# [1] 1.1754306 0.5250691 0.1464347
# [1] 3.175431 2.525069 2.146435
The cached chunk is not re-run, so it still shows the x and y from the earlier run.

Depends On: specifying dependency
```{r sampling, cache = TRUE}
x = rnorm(3)
x
```
# [1] 1.2988116 0.9563079 -0.6841705
```{r cached-calc, cache = TRUE, dependson = "sampling"}
x
y = x + 2
y
```
# [1] 1.2988116 0.9563079 -0.6841705
# [1] 3.298812 2.956308 1.315830

Recap
  Random Variables
    Variables that take on an unknown value.
    LCM is one way to generate random numbers.
  Distributions
    Probability Density/Mass Function (PDF/PMF) is d*.
    Cumulative Distribution Function (CDF) is p*.
    Inverse CDF is q*.
    Random Variables from Distribution is r*.
  Sampling
    Sample without replacement does not add the picked object back.
    Sample with replacement does add the picked object back.
  Caches
    Speed up computationally intensive reports by storing results and re-using them.

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.