Lecture 09: Sep 19, 2018
Randomness: Random Variables, Sampling, Probability Distributions, Caching
James Balamuta, STAT 385 @ UIUC

Announcements
* hw03 is due on Friday, Sep 21st, 2018 @ 6:00 PM.
* Quiz 04 covers Week 3 contents @ CBTF. Window: Sep 18th - 20th.
* EC Opportunity: Attend the Reflections and Projections Conference. Write a 2-3 paragraph reflection about a talk and take a picture with the speaker for 1% extra credit (see Fine Print).
* Got caught using GitHub's web interface in hw01 or hw02? Let's chat.

Last Time
* SLR and MLR: estimating a linear regression with 2 parameters vs. p parameters; underlying design matrix construction.
* Factors: provide a mapping of levels to indicator variables inside the design matrix.

Topic Outline
* Tools for Reproducibility
* Working with Data
* Simulations
* Visual Exploratory Data Analysis
* Unstructured Data
* Data Wrangling
* Functions
* Modeling Data
* Interactive Interfaces
* High Performance Computing

Topic Outline http://r4ds.had.co.nz/introduction.html

Lecture Objectives
* Describe what a random variable is.
* Create and use a pseudo-random number generator.
* Generate random variables from probability distributions in R.
* Explain the differences between sampling with and without replacement.

Random Variables

Definition: Sample Space contains all possible outcomes that can be obtained. Outcomes can be either:
1. discrete / finite in size, e.g. heads / tails, or
2. continuous / infinite in possibilities, e.g. weight / height.
[Figure: discrete examples ("Heads" and "Tails" outcomes for 1 coin and 2 coins) next to a continuous example (US height distribution)]

Definition: Random Variables are variables that take on an unknown value in a sample space.
Example, assuming a "fair" coin: P(X = Heads) = 0.5 and P(X = Tails) = 0.5.

Kinds of Randomness
* Pseudo-Random Number Generators (PRNGs): deterministic and efficient, e.g. the Linear Congruential Method (LCM).
* True Random Number Generators (TRNGs): based on randomness from physical phenomena.

Deterministic or Nondeterministic
* Deterministic: same input, same output.
* Nondeterministic: same input, different outputs.

LCM PRNG: overview of the LCM algorithm
$$X_{i+1} = (a X_i + c) \bmod m$$
* Seed: $X_0$, with $0 \le X_0 < m$
* Generated values: $X_1, X_2, X_3, \ldots, X_n$
* Multiplier: $a$, with $0 \le a < m$
* Increment: $c$, with $0 \le c < m$
* Modulus: $m$, with $0 < m$
* Integer range: $X_i \in [0, m - 1]$
* Numeric range: $R_i = X_i / m \in [0, 1)$, which does not include 1

LCM Computations: computation dependency
Each value depends on the prior generated value: $X_{i+1} = (a X_i + c) \bmod m$
With $X_0 = 20$, $a = 13$, $c = 3$, $m = 64$, the simulated integers are:
$X_1 = (13 \cdot 20 + 3) \bmod 64 = 263 \bmod 64 = 7$
$X_2 = (13 \cdot 7 + 3) \bmod 64 = 94 \bmod 64 = 30$
$X_3 = (13 \cdot 30 + 3) \bmod 64 = 393 \bmod 64 = 9$
$X_4 = (13 \cdot 9 + 3) \bmod 64 = 120 \bmod 64 = 56$
$X_5 = (13 \cdot 56 + 3) \bmod 64 = 731 \bmod 64 = 27$
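The recurrence above can be sketched in a few lines of R. This is a minimal illustration, not the lecture's own implementation; `lcm_generate` is a hypothetical helper name, and the parameter values are the ones from the worked example.

```r
# Linear congruential method: X[i+1] = (a * X[i] + c) mod m
# `lcm_generate` is a hypothetical helper, not part of base R.
lcm_generate = function(n, seed, a, c, m) {
  x = numeric(n)      # pre-allocate the output vector
  prev = seed         # X_0
  for (i in seq_len(n)) {
    prev = (a * prev + c) %% m   # each value depends on the prior one
    x[i] = prev
  }
  x
}

x = lcm_generate(n = 5, seed = 20, a = 13, c = 3, m = 64)
x
# [1]  7 30  9 56 27

# Scale to real numbers in [0, 1)
r = x / 64
```

Running it with the same seed always reproduces the same sequence, which is what makes the generator deterministic.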

Previously: Modulus Operator, a mod q
    12 %% 7 # a = n*q + r => 12 = 1*7 + 5
    # [1] 5
    outer(1:9, 2:9, `%%`) # Compute x mod q for every x in 1:9 and q in 2:9
          mod 2  mod 3  mod 4  mod 5  mod 6  mod 7  mod 8  mod 9
    x = 1     1      1      1      1      1      1      1      1
    x = 2     0      2      2      2      2      2      2      2
    x = 3     1      0      3      3      3      3      3      3
    x = 4     0      1      0      4      4      4      4      4
    x = 5     1      2      1      0      5      5      5      5
    x = 6     0      0      2      1      0      6      6      6
    x = 7     1      1      3      2      1      0      7      7
    x = 8     0      2      0      3      2      1      0      8
    x = 9     1      0      1      4      3      2      1      0

How could we generate Xi so that Ri includes 1?
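One possible answer, offered as a sketch rather than the lecture's intended solution: divide by m - 1 instead of m, so that the largest integer the generator can produce maps to exactly 1. The modulus m = 64 below is reused from the earlier worked example.

```r
m = 64
x = 0:(m - 1)            # every integer a generator with modulus m can produce

r_open   = x / m         # values in [0, 1): the maximum is (m - 1) / m
r_closed = x / (m - 1)   # values in [0, 1]: the maximum is exactly 1

range(r_open)
range(r_closed)
```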

Sample Implementation: LCM Algorithm
    # Setup values
    a = 5181215173
    c = 12581
    m = 2^32
    # Initial seed (current time in milliseconds)
    seed = as.numeric(Sys.time()) * 1000
    # Pre-allocation of numeric vector
    x = numeric(length = 2)
    # Set initial value
    x[1] = seed
    # NB: a * x[1] exceeds 2^53 here, so the product is not exact in double precision
    x[2] = (a * x[1] + c) %% m
    # Real number (not including 1)
    r = x[2]/m

PRNG Periodicity
Generated values $X_i$ for $a = 13$, $c = 0$, $m = 2^6$ under four different seeds:
     i   X_0 = 1   X_0 = 2   X_0 = 3   X_0 = 4
     0      1         2         3         4
     1     13        26        39        52
     2     41        18        59        36
     3     21        42        63        20
     4     17        34        51         4
     5     29        58        23        52
     6     57        50        43        36
     7     37        10        47        20
     8     33         2        35         4
     9     45        26         7        52
    10      9        18        27        36
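The period for each seed can be checked by counting steps until the generator returns to its starting value. This is a minimal sketch; `lcm_period` is a hypothetical helper, and it assumes the seed lies on a cycle, which holds here because a = 13 is odd and therefore invertible modulo 64.

```r
# Count how many steps it takes for the sequence to return to the seed.
# `lcm_period` is a hypothetical helper, not part of base R.
lcm_period = function(seed, a, c, m) {
  x = (a * seed + c) %% m
  steps = 1
  while (x != seed) {
    x = (a * x + c) %% m
    steps = steps + 1
  }
  steps
}

# Periods for the seeds X_0 = 1, 2, 3, 4 with a = 13, c = 0, m = 2^6
periods = sapply(1:4, lcm_period, a = 13, c = 0, m = 64)
periods
# [1] 16  8 16  4
```

Even the best seeds here only visit 16 of the 64 possible values, so the choice of a, c, m, and seed all affect how soon the sequence repeats.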

Quality of R.V.: uniqueness of numbers
[Figure: two panels of generated values. Independent: a = 1240814549, c = 12581, m = 2^32. Dependent: a = 1229, c = 1, m = 2^11.]

Source XKCD 221: Random Number

Sampling

Definition: Sampling without Replacement involves picking values from a sample space and not adding them back for subsequent picks. In essence, a value may only be selected once.
[Figure: values move from "Available" to "Picked"; video: https://www.youtube.com/watch?v=pmtr0ocx6og]

Definition: Sampling with Replacement involves picking values from a sample space and adding each picked value back for subsequent picks. In essence, a value may be selected multiple times.
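The two definitions can be contrasted with a short sketch. The five-value sample space and the seed are arbitrary choices made only for illustration.

```r
set.seed(385)   # arbitrary seed, used only for reproducibility
space = 1:5     # a small illustrative sample space

without_repl = sample(space, size = 5, replace = FALSE)  # a permutation of 1:5
with_repl    = sample(space, size = 5, replace = TRUE)   # repeats are allowed

anyDuplicated(without_repl)   # 0 means no value was picked twice

# Asking for more values than the sample space holds only works with replacement
too_many = tryCatch(
  sample(space, size = 6, replace = FALSE),
  error = function(e) "error"
)
too_many
```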

Sampling: picking r.v. values
    sample(x = <sample-space>, size = <nobs>, replace = FALSE, prob = NULL)
* Sample Space (x): either a vector or a number to sample from.
* Sample Size (size): number of values to pick from the sample space.
* Replacement (replace): sample without/with replacement (FALSE/TRUE).
* Probabilities (prob): weighting for each observation. Optional.

Sampling without Replacement: picking r.v. values once
    # Set a seed to control reproducibility
    set.seed(90210)
    sample(x = 8, size = 4, replace = FALSE)
    # [1] 3 1 2 7
Arguments: Sample Space (either a vector or number to sample from), Sample Size (number of values to pick from sample space), Replacement (sample without/with replacement, FALSE/TRUE).
NB: Given a single number n, sample() samples from the vector seq(1, n), so x = 8 above samples from seq(1, 8). For example, x = 10 would sample from:
    seq(1, 10)
    # [1] 1 2 3 4 5 6 7 8 9 10

Sampling without Replacement: picking character r.v. values
    # Set a seed to control reproducibility
    set.seed(42)
    sample(x = c("heads", "tails"), size = 1, replace = FALSE)
    # [1] "tails"
Arguments: Sample Space (either a vector or number to sample from), Sample Size (number of values to pick from sample space), Replacement (sample without/with replacement, FALSE/TRUE).

Your Turn
Modify the sample code such that it can roll a six-sided die once.
    # Set a seed to control reproducibility
    set.seed(385)
    sample(x = <sample-space>, size = <nobs>, replace = FALSE, prob = NULL)
Arguments: Sample Space (either a vector or number to sample from), Sample Size (number of values to pick from sample space), Replacement (sample without/with replacement, FALSE/TRUE), Probabilities (weighting for each observation; optional).

Sampling with Replacement: picking r.v. values
    # Set a seed to control reproducibility
    set.seed(5318008)
    sample(x = 10, size = 20, replace = TRUE)
    # [1] 7 2 9 8 9 7 3 6 1 10 7 5 8 10 1 6 6 7 2 6
Arguments: Sample Space (either a vector or number to sample from), Sample Size (number of values to pick from sample space), Replacement (sample without/with replacement, FALSE/TRUE).

Sampling with Replacement: biased coin sample
    # Set a seed to control reproducibility
    set.seed(376006)
    y = sample(x = c("heads", "tails"), size = 10, replace = TRUE, prob = c(0.4, 0.6))
    y
    # [1] "tails" "heads" "tails" "tails" "tails" "tails" "tails" "heads" "tails" "tails"
Arguments: Sample Space (either a vector or number to sample from), Sample Size (number of values to pick from sample space), Replacement (sample without/with replacement, FALSE/TRUE), Probabilities (weighting for each observation; optional).

Frequencies and Proportions: counting sampled data
    # Previously obtained sample under probability of 0.4/0.6
    y
    # [1] "tails" "heads" "tails" "tails" "tails" "tails" "tails" "heads" "tails" "tails"
    # Count observations and/or obtain the proportion
    table(y)
    # heads tails
    #     2     8
    prop.table(table(y))
    # heads tails
    #   0.2   0.8
    # Why doesn't this match the requested probability of 0.4 / 0.6?
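One way to explore the question above is to repeat the biased coin sample at larger sizes: with only 10 draws the observed proportion can sit far from 0.4, while larger samples tend to settle near it. The sizes below are illustrative assumptions, and exact proportions may vary with the R version, so no output is shown.

```r
set.seed(376006)
sizes = c(10, 1000, 100000)   # illustrative sample sizes

# Proportion of heads observed at each sample size
props = sapply(sizes, function(n) {
  y = sample(x = c("heads", "tails"), size = n,
             replace = TRUE, prob = c(0.4, 0.6))
  mean(y == "heads")
})

data.frame(size = sizes, prop_heads = props)
```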

Your Turn
Create a call to sample such that it can roll a "loaded" six-sided die twice with probabilities: 1 := 0.1, 2 := 0.3, 3 := 0.1, 4 := 0.15, 5 := 0.15, 6 := 0.2
    sample(x = <sample-space>, size = <nobs>, replace = FALSE, prob = NULL)
Arguments: Sample Space (either a vector or number to sample from), Sample Size (number of values to pick from sample space), Replacement (sample without/with replacement, FALSE/TRUE), Probabilities (weighting for each observation; optional).

Probability Distributions

Definition: Probability Distribution pairs each possible outcome of a statistical event with the probability that it occurs.

Supported Distributions: possible distributions
    Distribution           p           q           d           r
    Beta                   pbeta       qbeta       dbeta       rbeta
    Binomial               pbinom      qbinom      dbinom      rbinom
    Cauchy                 pcauchy     qcauchy     dcauchy     rcauchy
    Chi-Squared            pchisq      qchisq      dchisq      rchisq
    Exponential            pexp        qexp        dexp        rexp
    F                      pf          qf          df          rf
    Gamma                  pgamma      qgamma      dgamma      rgamma
    Geometric              pgeom       qgeom       dgeom       rgeom
    Hypergeometric         phyper      qhyper      dhyper      rhyper
    Logistic               plogis      qlogis      dlogis      rlogis
    Log Normal             plnorm      qlnorm      dlnorm      rlnorm
    Negative Binomial      pnbinom     qnbinom     dnbinom     rnbinom
    Normal                 pnorm       qnorm       dnorm       rnorm
    Poisson                ppois       qpois       dpois       rpois
    Student's t            pt          qt          dt          rt
    Studentized Range      ptukey      qtukey      (none)      (none)
    Uniform                punif       qunif       dunif       runif
    Weibull                pweibull    qweibull    dweibull    rweibull
    Wilcoxon Rank Sum      pwilcox     qwilcox     dwilcox     rwilcox
    Wilcoxon Signed Rank   psignrank   qsignrank   dsignrank   rsignrank
Note: base R provides only ptukey and qtukey for the Studentized Range distribution.
For more distributions, see the CRAN Task View on probability distributions.

Normal Distribution: an example of a probability distribution
$$f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right)$$
* $\mu$ is the mean
* $\sigma$ is the standard deviation
* $\sigma^2$ is the variance
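The density formula can be written out directly and compared against R's built-in dnorm(). This is a minimal sketch; `normal_pdf` is a hypothetical name introduced only for illustration.

```r
# Normal density written out from the formula.
# `normal_pdf` is a hypothetical helper; dnorm() is the built-in equivalent.
normal_pdf = function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x = c(-1, 0, 1)
normal_pdf(x)

# Agreement with dnorm() for a non-standard mean and standard deviation
all.equal(normal_pdf(x, mu = 2, sigma = 3), dnorm(x, mean = 2, sd = 3))
# [1] TRUE
```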

Distribution Functions
* d: density, or the probability density function (PDF)
* p: probability, or the cumulative distribution function (CDF)
* q: quantile, or the inverse CDF
* r: random variable generation
    # Normal PDF
    dnorm(c(-1, 0, 1))
    # [1] 0.2420 0.3989 0.2420
    # Normal CDF
    pnorm(c(-1, 0, 1))
    # [1] 0.1587 0.5000 0.8413
    # Normal inverse CDF
    qnorm(c(0.15, 0.5, 0.84))
    # [1] -1.0364 0.0000 0.9945
    # Normal RNG
    set.seed(15)
    rnorm(2)
    # [1] 0.2588 1.8311

Cumulative and Density
The CDF is $F_X(x) = P(X \le x)$ and the density is its derivative, $f_X(x) = F_X'(x)$.
    # Compute the probability
    pnorm(1.96)
    # [1] 0.9750021
    # Compute the density at a specific point (uh oh)
    dnorm(1.96)
    # [1] 0.05844094
    # Reading this value as a probability only works for discrete distributions.. Why?
    # Integration required to match output of pnorm
    # (integrate() takes the function itself, not a call such as dnorm(x))
    integrate(dnorm, lower = -10, upper = 1.96)
    # 0.9750021 with absolute error < 2e-07
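The relationship between the density and the CDF can be checked numerically in both directions. The sketch below approximates the derivative of pnorm() with a central difference (the step size h is an arbitrary assumption) and integrates dnorm() from negative infinity.

```r
x = 1.96
h = 1e-5   # arbitrary small step for the finite difference

# Slope of the CDF at x approximates the density at x
slope = (pnorm(x + h) - pnorm(x - h)) / (2 * h)
slope
dnorm(x)

# Area under the density up to x approximates the CDF at x
area = integrate(dnorm, lower = -Inf, upper = x)$value
area
pnorm(x)
```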

Quantiles: finding the p-th quantile
    # Obtain the x value for a probability of 0.975
    qnorm(0.975)
    # [1] 1.959964
    # Note, this is taking the inverse of pnorm, c.f.
    qnorm( pnorm(1.959964) )
    # [1] 1.959964
    pnorm( qnorm(0.975) )
    # [1] 0.975
Here $F_X(x) = P(X \le x)$, $x = F_X^{-1}(p)$, and $F_X(F_X^{-1}(p)) = p$.

Random Values: generating lots of numbers
    # Set a seed
    set.seed(1)
    # Generate 1000 values from a standard normal
    rnorm(1000)
    # [1] -0.626453811
    # [2] 0.183643324
    # [3] ...
Related method: Box-Muller Transform

Generic Distribution Form: pattern in statistical distribution functions
    # * represents a supported distribution (e.g. norm, t, f, chisq)
    # Density
    d*(x, param1 =, param2 =, log = FALSE)
    # Probability
    p*(q, param1 =, param2 =, lower.tail = TRUE, log.p = FALSE)
    # Quantile (inverse probability)
    q*(p, param1 =, param2 =, lower.tail = TRUE, log.p = FALSE)
    # Random Number Generation
    r*(n, param1 =, param2 = )
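The same d/p/q/r pattern applies beyond the normal distribution. The sketch below uses the exponential distribution, whose single parameter is `rate`; the value 0.5 and the seed are arbitrary assumptions for illustration.

```r
rate = 0.5   # arbitrary rate parameter

dexp(2, rate = rate)                       # density at x = 2
p = pexp(2, rate = rate)                   # P(X <= 2)
qexp(p, rate = rate)                       # inverse CDF recovers x = 2
pexp(2, rate = rate, lower.tail = FALSE)   # P(X > 2), the same as 1 - p

set.seed(385)
rexp(3, rate = rate)                       # three random exponential values
```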

Your Turn Using the qnorm() function, determine what the quantiles are at probabilities 0.5, 0.95, and 0.975. Have you seen these values before?

Caching

Definition: Caching refers to storing data locally in order to speed up subsequent retrievals.

Cache in Action: large value calculation
```{r cache-large-computation, cache = TRUE}
x = rnorm(10000*10000)
system.time({ sorted_x = sort(x) })
```
#   user  system elapsed
# 10.564   1.071  11.713
* With cache = TRUE, the code chunk is run once, the resulting objects are then stored within a file in the cache directory, and the stored data is then displayed on subsequent runs.
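The idea behind a cache can be sketched outside of RMarkdown with a file on disk. This is only an illustration of the concept, not how knitr implements `cache = TRUE`; `cached` is a hypothetical helper and the temporary file path is an assumption.

```r
# Hypothetical helper: reuse a stored result if one exists, otherwise
# compute it and store it. This is NOT knitr's implementation.
cached = function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))   # cache hit: skip the computation
  }
  result = compute()        # cache miss: do the work once
  saveRDS(result, path)
  result
}

cache_file = tempfile(fileext = ".rds")

first  = cached(cache_file, function() sort(rnorm(1e5)))
second = cached(cache_file, function() sort(rnorm(1e5)))

# The second call never ran rnorm(); it returned the stored values
identical(first, second)
# [1] TRUE
```

Note that the second call returns the old random values rather than new ones, which is exactly the kind of stale result a cache can produce when its inputs involve randomness.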

"There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton

Warning: Do not try the next example in an RMarkdown document with set.seed().

Caution Required: randomness and storage
```{r bad-sampling}
x = rnorm(3)
x
```
# [1] 1.1754306 0.5250691 0.1464347
```{r cached-calc, cache = TRUE}
x
y = x + 2
y
```
# [1] 1.1754306 0.5250691 0.1464347
# [1] 3.175431 2.525069 2.146435

Redux Caution Required: re-running the aforementioned code
```{r bad-sampling}
x = rnorm(3)
x
```
# [1] -1.8372987 0.8720906 0.2886695
```{r cached-calc, cache = TRUE}
x
y = x + 2
y
```
# [1] 1.1754306 0.5250691 0.1464347
# [1] 3.175431 2.525069 2.146435
The uncached chunk drew new values for x, but the cached chunk still displays the x and y stored from the first run.

Depends On: specifying dependency
```{r sampling, cache = TRUE}
x = rnorm(3)
x
```
# [1] 1.2988116 0.9563079 -0.6841705
```{r cached-calc, cache = TRUE, dependson = "sampling"}
x
y = x + 2
y
```
# [1] 1.2988116 0.9563079 -0.6841705
# [1] 3.298812 2.956308 1.315830

Recap
* Random Variables: variables that take on an unknown value. LCM is one way to generate random numbers.
* Distributions: Probability Density/Mass Functions (PD/MF) is d*; Cumulative Distribution Function (CDF) is p*; Inverse CDF is q*; Random Variables from Distribution is r*.
* Sampling: sampling without replacement does not add the picked object back. Sampling with replacement does add the picked object back.
* Caches: speed up computationally intensive reports by storing results and re-using them.

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.