Lecture 09: Sep 19, 2018 Randomness Random Variables Sampling Probability Distributions Caching James Balamuta STAT 385 @ UIUC
Announcements
* hw03 is due on Friday, Sep 21st, 2018 @ 6:00 PM
* Quiz 04 covers Week 3 contents @ CBTF. Window: Sep 18th - 20th
* EC Opportunity: Attend the Reflections and Projections Conference. Write a 2-3 paragraph reflection about a talk and take a picture with the speaker for 1% extra credit. (Fine Print)
* Got caught using GitHub's web interface in hw01 or hw02? Let's chat.
Last Time
* SLR and MLR: Estimating a linear regression with 2 parameters vs. p parameters. Underlying design matrix construction.
* Factors: Provide a mapping of levels to indicator variables inside the design matrix.
Topic Outline Tools for Reproducibility Working with Data Simulations Visual Exploratory Data Analysis Unstructured Data Data Wrangling Functions Modeling Data Interactive Interfaces High Performance Computing
Topic Outline http://r4ds.had.co.nz/introduction.html
Lecture Objectives
* Describe what a random variable is.
* Create and use a pseudo-random number generator.
* Generate random variables from probability distributions in R.
* Explain the differences between sampling with and without replacement.
Random Variables
Definition: Sample Space contains all possible outcomes that can be obtained. A sample space can be either:
1. discrete / finite in size, e.g. heads / tails, or
2. continuous / infinite in possibilities, e.g. weight / height.
[Figure: discrete examples ("Heads" / "Tails" for 1 coin and 2 coins) and a continuous example (US height distribution)]
Definition: Random Variables are variables that take on an unknown value in a sample space.
P(X = Heads) = 0.5
P(X = Tails) = 0.5
Assumes a "fair" coin.
Kinds of Randomness
* Pseudo-Random Number Generators (PRNGs): deterministic and efficient, e.g. the Linear Congruential Method (LCM).
* True Random Number Generators (TRNGs): based on randomness from physical phenomena.
Deterministic or Nondeterministic
* Deterministic: Same Input, Same Output
* Nondeterministic: Same Input, Different Outputs
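The "same input, same output" behaviour of a PRNG can be observed directly in R by resetting the seed. The following is a minimal sketch; the seed value 385 is arbitrary.

```r
# Same seed (input) => same pseudo-random stream (output)
set.seed(385)
first_run = runif(3)

set.seed(385)
second_run = runif(3)

identical(first_run, second_run)
# [1] TRUE

# Without resetting the seed, the generator continues along its stream,
# so the next three draws differ from the first three
third_run = runif(3)
identical(first_run, third_run)
# [1] FALSE
```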
LCM PRNG: overview of the LCM algorithm
Generated values: $X_0, X_1, X_2, X_3, \ldots, X_n$
$$X_{i+1} = (a X_i + c) \bmod m$$
* Seed: $0 \le X_0 < m$
* Multiplier: $0 \le a < m$
* Increment: $0 \le c < m$
* Modulus: $0 < m$
Integer range: $X_i \in [0, m - 1]$
Numeric: $R_i = X_i / m \in [0, 1)$, which does not include 1.
LCM Computations: computation dependency
$$X_{i+1} = (a X_i + c) \bmod m$$
Each value depends on the prior generated value $X_i$.
$X_0 = 20$, $a = 13$, $c = 3$, $m = 64$
Simulated integers:
$X_1 = (13 \cdot 20 + 3) \bmod 64 = 263 \bmod 64 = 7$
$X_2 = (13 \cdot 7 + 3) \bmod 64 = 94 \bmod 64 = 30$
$X_3 = (13 \cdot 30 + 3) \bmod 64 = 393 \bmod 64 = 9$
$X_4 = (13 \cdot 9 + 3) \bmod 64 = 120 \bmod 64 = 56$
$X_5 = (13 \cdot 56 + 3) \bmod 64 = 731 \bmod 64 = 27$
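The hand computation above can be checked with a few lines of R. This is a minimal sketch; `lcm_generate` is a hypothetical helper name used only for illustration and is not part of base R.

```r
# Linear congruential generator: X[i+1] = (a * X[i] + c) mod m
# `lcm_generate` is a hypothetical helper, not a base R function.
lcm_generate = function(n, seed, a, c, m) {
  x = numeric(n)
  prior = seed
  for (i in seq_len(n)) {
    x[i] = (a * prior + c) %% m  # depends on the prior generated value
    prior = x[i]
  }
  x
}

x = lcm_generate(n = 5, seed = 20, a = 13, c = 3, m = 64)
x
# [1]  7 30  9 56 27

# Scale to [0, 1): since x is at most m - 1, dividing by m never yields 1
r = x / 64
```

Carrying `prior` through the loop makes the dependency explicit: no value can be computed before the one preceding it.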
Previously: Modulus Operator, a mod q
    12 %% 7 # a = n*q + r => 12 = 1*7 + 5
    # [1] 5
    outer(1:9, 2:9, `%%`) # Compute x mod y for every pair of x in 1:9 and y in 2:9
        mod 2  mod 3  mod 4  mod 5  mod 6  mod 7  mod 8  mod 9
x = 1     1      1      1      1      1      1      1      1
x = 2     0      2      2      2      2      2      2      2
x = 3     1      0      3      3      3      3      3      3
x = 4     0      1      0      4      4      4      4      4
x = 5     1      2      1      0      5      5      5      5
x = 6     0      0      2      1      0      6      6      6
x = 7     1      1      3      2      1      0      7      7
x = 8     0      2      0      3      2      1      0      8
x = 9     1      0      1      4      3      2      1      0
How could we generate Xi so that Ri includes 1?
Sample Implementation: LCM Algorithm
    # Setup values
    a = 5181215173
    c = 12581
    m = 2^32
    # Initial seed: current time in milliseconds
    seed = as.numeric(Sys.time()) * 1000
    # Pre-allocation of numeric vector
    x = numeric(length = 2)
    # Set initial value
    x[1] = seed
    # Note: a * x[1] is larger than 2^53 here, so the product is not exact in double precision
    x[2] = (a * x[1] + c) %% m
    # Real number (not including 1)
    r = x[2] / m
PRNG Periodicity: a = 13, c = 0, m = 2^6
Values of X_i by seed:
  i   X_0 = 1  X_0 = 2  X_0 = 3  X_0 = 4
  0      1        2        3        4
  1     13       26       39       52
  2     41       18       59       36
  3     21       42       63       20
  4     17       34       51        4
  5     29       58       23       52
  6     57       50       43       36
  7     37       10       47       20
  8     33        2       35        4
  9     45       26        7       52
 10      9       18       27       36
Quality of R.V.: uniqueness of numbers
* Independent: a = 1240814549, c = 12581, m = 2^32
* Dependent: a = 1229, c = 1, m = 2^11
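The period for each seed in the table can be counted programmatically. A minimal sketch, assuming the same parameters (a = 13, c = 0, m = 2^6); `lcm_period` is a hypothetical helper name.

```r
# Count the steps until an LCM sequence first returns to its seed.
# `lcm_period` is a hypothetical helper. It assumes the seed lies on a cycle,
# which holds here because 13 is invertible modulo 64.
lcm_period = function(seed, a, c, m) {
  x = (a * seed + c) %% m
  steps = 1
  while (x != seed) {
    x = (a * x + c) %% m
    steps = steps + 1
  }
  steps
}

periods = sapply(1:4, lcm_period, a = 13, c = 0, m = 2^6)
periods
# [1] 16  8 16  4
```

None of these seeds comes close to the full range of m = 64 values, and the period changes with the seed, which is why the choice of a, c, and m matters.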
Source XKCD 221: Random Number
Sampling
Definition: Sampling without Replacement involves picking values from a sample space and not adding them back for subsequent picks. In essence, a value may only be selected once.
[Figure: available values and picked values]
Video: https://www.youtube.com/watch?v=pmtr0ocx6og
Definition: Sampling with Replacement involves picking values from a sample space and adding each picked value back for subsequent picks. In essence, a value may be selected multiple times.
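The two definitions can be contrasted with `sample()` on a small sample space. This is a minimal sketch; the seed and the variable names are arbitrary choices for illustration.

```r
set.seed(385)  # arbitrary seed

# Without replacement: each value can appear at most once,
# so drawing all 5 values yields a permutation of 1:5
without_repl = sample(x = 5, size = 5, replace = FALSE)
anyDuplicated(without_repl) == 0
# [1] TRUE

# Without replacement, asking for more values than exist is an error
too_many = tryCatch(
  sample(x = 5, size = 6, replace = FALSE),
  error = function(e) "error: cannot take 6 values from 5"
)

# With replacement: values return to the pool, so size may exceed the
# sample space; 6 draws from 5 values must contain at least one repeat
with_repl = sample(x = 5, size = 6, replace = TRUE)
anyDuplicated(with_repl) > 0
# [1] TRUE
```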
Sampling: picking r.v. values
    sample(x = <sample-space>, size = <nobs>, replace = FALSE, prob = NULL)
* Sample Space (x): Either a vector or number to sample from.
* Sample Size (size): Number of values to pick from sample space.
* Replacement (replace): Sample without/with replacement (FALSE/TRUE).
* Probabilities (prob): Weighting for each observation. Optional.
Sampling without Replacement: picking r.v. values once
    # Set a seed to control reproducibility
    set.seed(90210)
    sample(x = 8, size = 4, replace = FALSE)
    # [1] 3 1 2 7
NB: When x is a single number n, sample() draws from the vector seq(1, n). For example, x = 10 samples from:
    seq(1, 10)
    # [1] 1 2 3 4 5 6 7 8 9 10
Sampling without Replacement: picking character r.v. values
    # Set a seed to control reproducibility
    set.seed(42)
    sample(x = c("heads", "tails"), size = 1, replace = FALSE)
    # [1] "tails"
Your Turn: Modify the sample code such that it can roll a six-sided die once.
    # Set a seed to control reproducibility
    set.seed(385)
    sample(x = <sample-space>, size = <nobs>, replace = FALSE, prob = NULL)
Sampling with Replacement: picking r.v. values
    # Set a seed to control reproducibility
    set.seed(5318008)
    sample(x = 10, size = 20, replace = TRUE)
    # [1] 7 2 9 8 9 7 3 6 1 10 7 5 8 10 1 6 6 7 2 6
Sampling with Replacement: biased coin sample
    # Set a seed to control reproducibility
    set.seed(376006)
    y = sample(x = c("heads", "tails"), size = 10, replace = TRUE, prob = c(0.4, 0.6))
    y
    # [1] "tails" "heads" "tails" "tails" "tails" "tails" "tails" "heads" "tails" "tails"
Frequencies and Proportions: counting sampled data
    # Previously obtained sample under probability of 0.4/0.6
    y
    # [1] "tails" "heads" "tails" "tails" "tails" "tails" "tails" "heads" "tails" "tails"
    # Count observations and/or obtain the proportion
    table(y)
    # heads tails
    #     2     8
    prop.table(table(y))
    # heads tails
    #   0.2   0.8
    # Why doesn't this match the requested probability of 0.4 / 0.6?
Your Turn: Create a call to sample such that it can roll a "loaded" six-sided die twice with probabilities:
1 := 0.1, 2 := 0.3, 3 := 0.1, 4 := 0.15, 5 := 0.15, 6 := 0.2
    sample(x = <sample-space>, size = <nobs>, replace = FALSE, prob = NULL)
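One way to explore the question above is to repeat the biased coin experiment with larger sample sizes and watch the observed proportion of heads. This is a sketch; the sample sizes are arbitrary and no particular output is claimed.

```r
set.seed(376006)

# Observed proportion of heads for increasing sample sizes
sizes = c(10, 1000, 100000)
prop_heads = sapply(sizes, function(n) {
  flips = sample(x = c("heads", "tails"), size = n,
                 replace = TRUE, prob = c(0.4, 0.6))
  mean(flips == "heads")
})

# Pair each sample size with its observed proportion of heads
data.frame(size = sizes, prop_heads = prop_heads)
```

With only 10 flips, a single extra head moves the proportion by 0.1, so small samples can sit far from 0.4. The largest sample should land much closer.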
Probability Distributions
Definition: Probability Distribution pairs each possible outcome of a statistical event with the probability that it occurs.
Supported Distributions: possible distributions
Distribution          Functions
Beta                  pbeta      qbeta      dbeta      rbeta
Binomial              pbinom     qbinom     dbinom     rbinom
Cauchy                pcauchy    qcauchy    dcauchy    rcauchy
Chi-Squared           pchisq     qchisq     dchisq     rchisq
Exponential           pexp       qexp       dexp       rexp
F                     pf         qf         df         rf
Gamma                 pgamma     qgamma     dgamma     rgamma
Geometric             pgeom      qgeom      dgeom      rgeom
Hypergeometric        phyper     qhyper     dhyper     rhyper
Logistic              plogis     qlogis     dlogis     rlogis
Log Normal            plnorm     qlnorm     dlnorm     rlnorm
Negative Binomial     pnbinom    qnbinom    dnbinom    rnbinom
Normal                pnorm      qnorm      dnorm      rnorm
Poisson               ppois      qpois      dpois      rpois
Student's t           pt         qt         dt         rt
Studentized Range     ptukey     qtukey     (base R provides no dtukey or rtukey)
Uniform               punif      qunif      dunif      runif
Weibull               pweibull   qweibull   dweibull   rweibull
Wilcoxon Rank Sum     pwilcox    qwilcox    dwilcox    rwilcox
Wilcoxon Signed Rank  psignrank  qsignrank  dsignrank  rsignrank
See the CRAN Task View for more distributions.
Normal Distribution: an example of a probability distribution
$$f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
* $\mu$ is the mean
* $\sigma$ is the standard deviation
* $\sigma^2$ is the variance
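The normal density formula can be written out by hand and compared against R's built-in `dnorm()`. A minimal sketch; `normal_pdf` is a hypothetical name used only for illustration.

```r
# Normal density written out from the formula.
# `normal_pdf` is a hypothetical name, not a base R function.
normal_pdf = function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x = c(-1, 0, 1)

# Standard normal (mu = 0, sigma = 1)
all.equal(normal_pdf(x), dnorm(x))
# [1] TRUE

# Arbitrary mean and standard deviation
all.equal(normal_pdf(x, mu = 2, sigma = 3), dnorm(x, mean = 2, sd = 3))
# [1] TRUE
```

Note that `dnorm()` is parameterized by the standard deviation (`sd`), not the variance.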
Distribution Functions
* d: density or the probability density function (PDF)
* p: probability or the cumulative distribution function (CDF)
* q: quantile or the inverse CDF
* r: random variable generation
    # Normal PDF
    dnorm(c(-1, 0, 1))
    # [1] 0.2420 0.3989 0.2420
    # Normal CDF
    pnorm(c(-1, 0, 1))
    # [1] 0.1587 0.5000 0.8413
    # Normal inverse CDF
    qnorm(c(0.15, 0.5, 0.84))
    # [1] -1.0364 0.0000 0.9945
    # Normal RNG
    set.seed(15)
    rnorm(2)
    # [1] 0.2588 1.8311
Cumulative and Density
$$F_X(x) = P(X \le x)$$
    # Compute the probability
    pnorm(1.96)
    # [1] 0.9750021
$$f_X(x) = F_X'(x)$$
    # Compute the density at a specific point (uh oh)
    dnorm(1.96)
    # [1] 0.05844094
    # Reading a density value as a probability only works for discrete distributions. Why?
    # Integration required to match output of pnorm.
    # integrate() expects a function, so pass dnorm itself rather than dnorm(x).
    integrate(dnorm, lower = -10, upper = 1.96)
    # 0.9750021 with absolute error < 2e-07
Quantiles: finding the p-th quantile
$$F_X(x) = P(X \le x) = p, \qquad x = F_X^{-1}(p), \qquad F_X\left( F_X^{-1}(p) \right) = p$$
    # Obtain the x value for a probability of 0.975
    qnorm(0.975)
    # [1] 1.959964
    # Note, this is taking the inverse of pnorm, c.f.
    qnorm(pnorm(1.959964))
    # [1] 1.959964
    pnorm(qnorm(0.975))
    # [1] 0.975
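The link between the density and the cumulative distribution can be checked numerically in both directions. A minimal sketch for the standard normal; the step size `h` and the tolerances are arbitrary choices.

```r
# The CDF is the integral of the density from -Inf up to x
area = integrate(dnorm, lower = -Inf, upper = 1.96)
all.equal(area$value, pnorm(1.96), tolerance = 1e-3)
# [1] TRUE

# The density is the derivative of the CDF; approximate the derivative
# with a central difference (h is an arbitrary small step)
h = 1e-5
slope = (pnorm(1.96 + h) - pnorm(1.96 - h)) / (2 * h)
all.equal(slope, dnorm(1.96), tolerance = 1e-6)
# [1] TRUE
```

For a continuous distribution, the value returned by `dnorm()` is therefore a rate of change of probability, not a probability itself.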
Random Values: generating lots of numbers
    # Set a seed
    set.seed(1)
    # Generate 1000 values from a standard normal
    rnorm(1000)
    # [1] -0.626453811
    # [2] 0.183643324
    # [3] ...
See also: Box-Muller Transform
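The Box-Muller transform named on this slide turns pairs of uniform draws into standard normal draws. The following is a minimal sketch of that technique; `box_muller` is a hypothetical helper name, and this is not what `rnorm()` does internally, since R's default normal generator uses inversion.

```r
# Box-Muller transform: two independent Uniform(0, 1) draws become
# two independent standard normal draws.
# `box_muller` is a hypothetical helper, not a base R function.
box_muller = function(n) {
  pairs = ceiling(n / 2)
  u1 = runif(pairs)
  u2 = runif(pairs)
  radius = sqrt(-2 * log(u1))
  angle = 2 * pi * u2
  z = c(radius * cos(angle), radius * sin(angle))
  z[seq_len(n)]  # drop the spare value when n is odd
}

set.seed(1)
z = box_muller(10000)
c(mean = mean(z), sd = sd(z))  # should be close to 0 and 1
```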
Generic Distribution Form: pattern in statistical distribution functions
    # * represents a supported distribution (e.g. norm, t, f, chisq)
    # Density
    d*(x, param1 =, param2 =, log = FALSE)
    # Probability
    p*(q, param1 =, param2 =, lower.tail = TRUE, log.p = FALSE)
    # Quantile (inverse probability)
    q*(p, param1 =, param2 =, lower.tail = TRUE, log.p = FALSE)
    # Random Number Generation
    r*(n, param1 =, param2 = )
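The generic pattern carries over to any supported distribution. As an illustration, here it is applied to the binomial distribution; the parameters `size = 10` and `prob = 0.5` and the seed are arbitrary example values.

```r
# The same four prefixes applied to the binomial distribution
# (size = 10 trials, prob = 0.5 are arbitrary example parameters)

# Density (mass): P(X = 3)
dbinom(3, size = 10, prob = 0.5)
# [1] 0.1171875

# Probability: P(X <= 3) is the sum of the mass at 0, 1, 2, 3
all.equal(pbinom(3, size = 10, prob = 0.5),
          sum(dbinom(0:3, size = 10, prob = 0.5)))
# [1] TRUE

# Upper tail via lower.tail = FALSE: P(X > 3) = 1 - P(X <= 3)
all.equal(pbinom(3, size = 10, prob = 0.5, lower.tail = FALSE),
          1 - pbinom(3, size = 10, prob = 0.5))
# [1] TRUE

# Quantile: smallest x with P(X <= x) >= 0.5
qbinom(0.5, size = 10, prob = 0.5)
# [1] 5

# Random generation: 5 draws
set.seed(385)
draws = rbinom(5, size = 10, prob = 0.5)
```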
Your Turn Using the qnorm() function, determine what the quantiles are at probabilities 0.5, 0.95, and 0.975. Have you seen these values before?
Caching
Definition: Caching refers to storing data locally in order to speed up subsequent retrievals.
Cache in Action: large value calculation
```{r cache-large-computation, cache = TRUE}
x = rnorm(10000*10000)
system.time({ sorted_x = sort(x) })
```
# user system elapsed
# 10.564 1.071 11.713
* The code chunk will be run once, the resulting objects are then stored within a file in the cache directory, and the stored data is then displayed on subsequent runs.
"There are only two hard things in Computer Science: cache invalidation and naming things." (Phil Karlton)
Warning: Do not try the next example in an RMarkdown document with set.seed().
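The idea of "run once, store in a file, reuse on later runs" can be sketched in plain R. This only illustrates the concept and is not how knitr implements `cache = TRUE`; `cached` and `cache_file` are hypothetical names.

```r
# A file-based cache in plain R: an illustration of the idea only,
# not knitr's implementation. `cached` is a hypothetical helper.
cached = function(cache_file, compute) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))   # cache hit: reuse the stored result
  }
  result = compute()              # cache miss: do the slow work once
  saveRDS(result, cache_file)
  result
}

cache_file = tempfile(fileext = ".rds")

first  = cached(cache_file, function() sort(rnorm(1e5)))
second = cached(cache_file, function() sort(rnorm(1e5)))

# The second call never ran rnorm(); it read the stored values back
identical(first, second)
# [1] TRUE

unlink(cache_file)  # deleting the file invalidates the cache
```

Because the second call skips the computation entirely, any randomness inside it is frozen at whatever the first run produced.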
Caution Required: randomness and storage
```{r bad-sampling}
x = rnorm(3)
x
```
# [1] 1.1754306 0.5250691 0.1464347
```{r cached-calc, cache = TRUE}
x
y = x + 2
y
```
# [1] 1.1754306 0.5250691 0.1464347
# [1] 3.175431 2.525069 2.146435
Redux: Caution Required, re-running the aforementioned code
```{r bad-sampling}
x = rnorm(3)
x
```
# [1] -1.8372987 0.8720906 0.2886695
```{r cached-calc, cache = TRUE}
x
y = x + 2
y
```
# [1] 1.1754306 0.5250691 0.1464347
# [1] 3.175431 2.525069 2.146435
Depends On: specifying dependency
```{r sampling, cache = TRUE}
x = rnorm(3)
x
```
# [1] 1.2988116 0.9563079 -0.6841705
```{r cached-calc, cache = TRUE, dependson = "sampling"}
x
y = x + 2
y
```
# [1] 1.2988116 0.9563079 -0.6841705
# [1] 3.298812 2.956308 1.315830
Recap
* Random Variables
  * Variables that take on an unknown value in a sample space.
  * LCM is one way to generate pseudo-random numbers.
* Distributions
  * Probability Density/Mass Function (PDF/PMF) is d*
  * Cumulative Distribution Function (CDF) is p*
  * Inverse CDF is q*
  * Random variable generation from a distribution is r*
* Sampling
  * Sampling without replacement does not add the picked object back.
  * Sampling with replacement does add the picked object back.
* Caches
  * Speed up computationally intensive reports by storing results and re-using them.
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License