Random variables. DS-GA 1002: Probability and Statistics for Data Science. http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17. Carlos Fernandez-Granda
Motivation: Random variables model numerical quantities that are uncertain. They allow us to structure the information we have about these quantities in a principled way.
Definition: Given a probability space (Ω, F, P), a random variable X is a function from the sample space Ω to the real numbers ℝ. We use uppercase letters to denote random variables: X, Y, ... Once the outcome ω ∈ Ω is revealed, X(ω) is the realization of X. We use lowercase letters to denote numerical values: x, y, ...
Characterization: Given a probability space (Ω, F, P), for any set S, P(X ∈ S) = P({ω | X(ω) ∈ S}). We will almost never construct probabilistic models like this!
Discrete random variables: Discrete random variables take values on a finite or countably infinite subset of ℝ, such as the integers. The probability mass function (pmf) of X is defined as p_X(x) := P({ω | X(ω) = x}). In words, p_X(x) is the probability that X equals x. The pmf completely specifies a random variable.
Probability mass function: If D is the range of X, then (D, 2^D, p_X) is a valid probability space. Any pmf satisfies: p_X(x) ≥ 0 for any x ∈ D; Σ_{x ∈ D} p_X(x) = 1; P(X ∈ S) = Σ_{x ∈ S} p_X(x) for any S ⊆ D.
Probability mass function: [Figure: example pmf p_X(x) over x = 1, ..., 5, with values between 0 and 0.4]
Example: P(X ∈ {1, 4}) = p_X(1) + p_X(4) = 0.5 and P(X > 3) = p_X(4) + p_X(5) = 0.6
Defining a discrete random variable: To define a discrete random variable X we just need a discrete range D and a nonnegative function p_X satisfying Σ_{x ∈ D} p_X(x) = 1
Bernoulli random variable: Experiment with two possible outcomes (coin flip with bias p): p_X(0) = 1 − p, p_X(1) = p. Special case: the indicator random variable of an event S, defined by 1_S(ω) = 1 if ω ∈ S and 0 otherwise, is Bernoulli with parameter P(S). This allows us to represent an event by a random variable.
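A Bernoulli random variable is easy to simulate, which makes a quick sanity check possible. The following is a minimal sketch (the parameter value 0.3, the sample size, and the use of Python's random module are illustrative choices, not part of the slides):

```python
import random

rng = random.Random(0)

def bernoulli(p):
    """One sample of a Bernoulli random variable with parameter p."""
    return 1 if rng.random() < p else 0

# The empirical frequency of 1s should approximate the parameter p
samples = [bernoulli(0.3) for _ in range(100_000)]
freq = sum(samples) / len(samples)  # close to 0.3
```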
Example: Coin flips. You flip a coin with bias p until you obtain heads (flips are independent). If you model the number of flips as a random variable X, what is p_X?
Example: Coin flips
p_X(k) = P(k flips)
= P(1st flip = tails, ..., (k−1)th flip = tails, kth flip = heads)
= P(1st flip = tails) ⋯ P((k−1)th flip = tails) P(kth flip = heads)
= (1 − p)^{k−1} p
Geometric random variable: The pmf of a geometric random variable with parameter p is p_X(k) = (1 − p)^{k−1} p, k = 1, 2, ...
Geometric random variable: [Figures: geometric pmf p_X(k) for p = 0.2, 0.5, 0.8, plotted for k = 1, ..., 10]
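The derivation above can be checked empirically: simulate coin flips until heads and compare the observed frequencies with (1 − p)^{k−1} p. A minimal sketch (the bias p = 0.5 and the sample size are assumed values for illustration):

```python
import random

rng = random.Random(0)
p = 0.5  # coin bias (assumed value for this sketch)

def flips_until_heads():
    """Flip a coin with bias p until heads comes up; return the flip count."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

n = 100_000
samples = [flips_until_heads() for _ in range(n)]
# Compare empirical frequencies with the geometric pmf for small k
empirical = {k: sum(s == k for s in samples) / n for k in range(1, 6)}
theoretical = {k: (1 - p) ** (k - 1) * p for k in range(1, 6)}
```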
Example: Coin flips. You flip a coin with bias p n times (flips are independent). If you model the number of heads as a random variable X, what is p_X?
Example: Coin flips. What is the probability of getting k heads and then n − k tails?
P(k heads, then n − k tails) = P(1st = heads, ..., kth = heads, (k+1)th = tails, ..., nth = tails)
= P(1st = heads) ⋯ P(kth = heads) P((k+1)th = tails) ⋯ P(nth = tails)
= p^k (1 − p)^{n−k}
Example: Coin flips. Any fixed order of k heads and n − k tails has the same probability. We are interested in the union of these events. Can we just add their probabilities? Yes: the events are disjoint. How many possible orders are there? (n choose k) := n! / (k! (n − k)!), so p_X(k) = (n choose k) p^k (1 − p)^{n−k}
Binomial random variable: The pmf of a binomial random variable with parameters n and p is p_X(k) = (n choose k) p^k (1 − p)^{n−k}, k = 0, 1, 2, ..., n
Binomial random variable: [Figures: binomial pmf p_X(k) for n = 20 and p = 0.2, 0.5, 0.8, plotted for k = 0, ..., 20]
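The binomial pmf can be computed directly from the formula above. A minimal sketch (the parameter values n = 20, p = 0.5 are the ones from the figures; `math.comb` implements the binomial coefficient):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X binomial with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
# The pmf sums to 1 and, for p = 0.5, is symmetric with its peak at k = n/2
```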
Example: Call center. Model the number of calls received per day. Assumptions: 1. Each call occurs independently of every other call. 2. A given call has the same probability of occurring at any given time of the day. 3. Calls occur at a rate of λ calls per day.
Example: Call center. Discretize the day into n slots. Probability of receiving m calls in one slot? (λ/n)^m. If n is large enough, λ/n ≫ (λ/n)^m for all m > 1, so assume that in each slot we either receive one call or none at all. What is the probability of k calls in a day? Binomial with parameters n and p := λ/n!
Example: Call center
P(k calls during the day) = lim_{n→∞} P(k calls in n small intervals)
= lim_{n→∞} (n choose k) p^k (1 − p)^{n−k}
= lim_{n→∞} (n choose k) (λ/n)^k (1 − λ/n)^{n−k}
= lim_{n→∞} (λ^k / k!) · n! / ((n − k)! (n − λ)^k) · (1 − λ/n)^n
= λ^k e^{−λ} / k!
Identity proved in the notes: lim_{n→∞} n! / ((n − k)! (n − λ)^k) · (1 − λ/n)^n = e^{−λ}
Poisson random variable: The pmf of a Poisson random variable with parameter λ is p_X(k) = λ^k e^{−λ} / k!, k = 0, 1, 2, ...
Poisson random variable: [Figures: Poisson pmf p_X(k) for λ = 10, 20, 30, plotted for k = 0, ..., 50]
Example: Call center. The pmf of a binomial with parameters n and p = λ/n converges to the pmf of a Poisson with parameter λ. This is an example of convergence in distribution.
Binomial random variable: [Figures: binomial pmf for (n, p) = (40, 20/40), (80, 20/80), (400, 20/400), compared with the Poisson pmf for λ = 20]
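The convergence shown in the figures can be quantified numerically, for instance with the total variation distance between the binomial(n, λ/n) and Poisson(λ) pmfs. A minimal sketch (λ = 20 and the n values match the figures; truncating the sum at k = 60 is an approximation, justified because both distributions have negligible mass beyond that point):

```python
from math import comb, exp, factorial

lam = 20  # rate, matching the figures

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return lam**k * exp(-lam) / factorial(k)

def tv_distance(n):
    """Total variation distance, truncated at k = 60 (negligible tail mass)."""
    p = lam / n
    return 0.5 * sum(abs(binomial_pmf(k, n, p) - poisson_pmf(k)) for k in range(61))

distances = [tv_distance(n) for n in (40, 80, 400)]  # should decrease with n
```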
Call-center data: The assumptions do not hold over the whole day (why?), but they do hold (approximately) for intervals of time. Example: data from a call center in Israel. We compare the histogram of the number of calls received in an interval of 4 hours over 2 months with the pmf of a Poisson random variable fitted to the data.
Call-center data: [Figure: histogram of the real call-count data overlaid with the fitted Poisson pmf, number of calls 0 to 40]
Continuous random variables: Useful to model continuous quantities without discretizing. Assigning nonzero probabilities to events of the form {X = x} for x ∈ ℝ doesn't work! Instead, we only consider events of the form {X ∈ S} where S is a union of intervals (formally, a Borel set). We cannot consider every possible subset of ℝ for technical reasons.
Cumulative distribution function: The cumulative distribution function (cdf) of X is defined as F_X(x) := P({ω ∈ Ω | X(ω) ≤ x}) = P(X ≤ x). In words, F_X(x) is the probability that X is at most x. The cdf can be defined for both continuous and discrete random variables.
Cumulative distribution function: The cdf completely specifies the distribution of the random variable. The probability of any interval (a, b] is given by P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a) = F_X(b) − F_X(a). To define a continuous random variable we just need a valid cdf! A valid underlying probability space exists, but we don't need to worry about it.
Properties of the cdf: lim_{x→−∞} F_X(x) = 0, lim_{x→∞} F_X(x) = 1, and F_X(b) ≥ F_X(a) if b > a, i.e. F_X is nondecreasing
Example: F_X(x) := 0 for x < 0, 0.5x for 0 ≤ x ≤ 1, 0.5 for 1 ≤ x ≤ 2, 0.5(1 + (x − 2)²) for 2 ≤ x ≤ 3, 1 for x > 3
Example: [Figure: plot of F_X(x), rising from 0 to 0.5 on [0, 1], flat at 0.5 on [1, 2], rising to 1 on [2, 3]]
Example: P(0.5 < X ≤ 2.5) = F_X(2.5) − F_X(0.5) = 0.625 − 0.25 = 0.375
Example: [Figure: the same cdf with P(X ∈ (0.5, 2.5]) marked as the vertical gap between F_X(0.5) and F_X(2.5)]
Probability density function: When the cdf is differentiable, its derivative can be interpreted as a density. Probability density function: f_X(x) := dF_X(x)/dx. The pdf is not a probability measure! (It can be greater than 1)
Probability density function: By the fundamental theorem of calculus, P(a < X ≤ b) = F_X(b) − F_X(a) = ∫_a^b f_X(x) dx. Intuitively, lim_{Δ→0} P(X ∈ (x, x + Δ)) / Δ = f_X(x)
Properties of the pdf: For any union of intervals (any Borel set) S, P(X ∈ S) = ∫_S f_X(x) dx. In particular, ∫_{−∞}^{∞} f_X(x) dx = 1. From the monotonicity of the cdf, f_X(x) ≥ 0
Example: F_X(x) := 0 for x < 0, 0.5x for 0 ≤ x ≤ 1, 0.5 for 1 ≤ x ≤ 2, 0.5(1 + (x − 2)²) for 2 ≤ x ≤ 3, 1 for x > 3
Differentiating: f_X(x) = 0 for x < 0, 0.5 for 0 ≤ x ≤ 1, 0 for 1 ≤ x ≤ 2, x − 2 for 2 ≤ x ≤ 3, 0 for x > 3
Example: [Figure: plots of the cdf F_X(x) and the pdf f_X(x) from the example]
Example: P(0.5 < X ≤ 2.5) = ∫_{0.5}^{2.5} f_X(x) dx = ∫_{0.5}^{1} 0.5 dx + ∫_{2}^{2.5} (x − 2) dx = 0.25 + 0.125 = 0.375
Example: [Figure: the pdf f_X(x) with P(X ∈ (0.5, 2.5]) shaded as the area under the curve]
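The closed-form answer 0.375 can be double-checked by integrating the example pdf numerically. A minimal sketch using a midpoint rule (the grid size is an arbitrary choice; this is a verification aid, not part of the slides):

```python
def f_X(x):
    """The pdf from the example: 0.5 on [0, 1], x - 2 on [2, 3], 0 elsewhere."""
    if 0 <= x <= 1:
        return 0.5
    if 2 <= x <= 3:
        return x - 2
    return 0.0

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

prob = integrate(f_X, 0.5, 2.5)    # matches the closed-form answer 0.375
total = integrate(f_X, -1.0, 4.0)  # the pdf integrates to 1
```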
Uniform random variable: Pdf of a uniform random variable with domain [a, b]: f_X(x) = 1/(b − a) if a ≤ x ≤ b, 0 otherwise
Uniform random variable in [a, b]: [Figure: pdf (height 1/(b − a) on [a, b]) and cdf of a uniform random variable on [a, b]]
Exponential random variable: Used to model waiting times (time until a certain event occurs). Examples: decay of a radioactive particle, arrival of a telephone call, mechanical failure of a device. Pdf of an exponential random variable with parameter λ: f_X(x) = λ e^{−λx} if x ≥ 0, 0 otherwise
Exponential random variables: [Figure: exponential pdfs for λ = 0.5, 1.0, 1.5]
Call-center data: Example: data from a call center in Israel. We compare the histogram of the inter-arrival times between calls occurring between 8 pm and midnight over two days with the pdf of an exponential random variable fitted to the data.
Call center: [Figure: histogram of the real inter-arrival times (in seconds) overlaid with the fitted exponential pdf]
Gaussian or normal random variable: Extremely popular in probabilistic models and statistics. Sums of independent random variables converge to Gaussian distributions under certain assumptions. Pdf of a Gaussian random variable with mean µ and standard deviation σ: f_X(x) = (1 / (√(2π) σ)) e^{−(x − µ)² / (2σ²)}
Gaussian random variables: [Figure: Gaussian pdfs for (µ, σ) = (2, 1), (0, 2), (0, 4)]
Height data Example: Data from a population of 25 000 people We compare the histogram of the heights and the pdf of a Gaussian random variable fitted to the data
Height data: [Figure: histogram of heights (in inches) overlaid with the fitted Gaussian pdf]
Problem: The Gaussian cdf does not have a closed-form expression. This complicates computing the probability that a Gaussian random variable belongs to a given set.
Standard Gaussian: If X is Gaussian with mean µ and standard deviation σ, then U := (X − µ)/σ is a standard Gaussian, with mean zero and unit standard deviation.
P(X ∈ [a, b]) = P((X − µ)/σ ∈ [(a − µ)/σ, (b − µ)/σ]) = Φ((b − µ)/σ) − Φ((a − µ)/σ)
Φ is the cdf of a standard Gaussian
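In practice Φ is evaluated numerically. A minimal sketch using the standard identity Φ(x) = (1 + erf(x/√2))/2 (the parameter values in the usage line are illustrative assumptions):

```python
from math import erf, sqrt

def Phi(x):
    """Cdf of a standard Gaussian, expressed via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def gaussian_prob(a, b, mu, sigma):
    """P(X in [a, b]) for X Gaussian with mean mu and standard deviation sigma."""
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

p_one_sigma = gaussian_prob(-1, 1, 0, 1)  # about 0.6827, the one-sigma rule
```

Standardization means any Gaussian probability reduces to the standard case: gaussian_prob(68, 72, 70, 2) gives exactly the same value as p_one_sigma.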
Beta random variable: Useful in Bayesian statistics. Unimodal continuous distribution on the unit interval. The pdf of a beta distribution with parameters a and b is defined as f_β(θ; a, b) := θ^{a−1} (1 − θ)^{b−1} / β(a, b) if 0 ≤ θ ≤ 1, 0 otherwise, where β(a, b) := ∫_0^1 u^{a−1} (1 − u)^{b−1} du
Beta random variables: [Figure: beta pdfs for (a, b) = (1, 1), (1, 2), (3, 3), (6, 2), (3, 15)]
Conditioning on an event: We usually define random variables through their pmf, cdf, or pdf. How can we incorporate the information that X ∈ S for some set S?
Conditional pmf: If X has pmf p_X, the conditional pmf of X given X ∈ S is p_{X | X ∈ S}(x) := P(X = x | X ∈ S) = p_X(x) / Σ_{s ∈ S} p_X(s) if x ∈ S, 0 otherwise. This is a valid pmf in the new probability space restricted to the event {X ∈ S}
Conditional cdf: If X has pdf f_X, the conditional cdf of X given X ∈ S is F_{X | X ∈ S}(x) := P(X ≤ x | X ∈ S) = P(X ≤ x, X ∈ S) / P(X ∈ S) = ∫_{u ≤ x, u ∈ S} f_X(u) du / ∫_{u ∈ S} f_X(u) du. This is a valid cdf in the new probability space restricted to the event {X ∈ S}
Example: Geometric random variables are memoryless. We flip a coin repeatedly until we obtain heads, but pause after k_0 flips (which were all tails). What is the probability of obtaining heads in k more flips?
Example: Geometric random variables are memoryless
P(k more flips) = p_{X | X > k_0}(k)
= p_X(k) / Σ_{m = k_0+1}^{∞} p_X(m)
= (1 − p)^{k−1} p / Σ_{m = k_0+1}^{∞} (1 − p)^{m−1} p
= (1 − p)^{k − k_0 − 1} p for k > k_0
Geometric series: Σ_{m = k_0+1}^{∞} α^{m−1} = α^{k_0} / (1 − α) for any |α| < 1
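The memoryless property can be verified numerically: the conditional pmf given X > k_0 equals the original pmf shifted by k_0. A minimal sketch (the values p = 0.3 and k_0 = 5 are assumptions for illustration; the infinite tail sum is truncated, which is harmless since the terms decay geometrically):

```python
p, k0 = 0.3, 5  # example parameter values

def geom_pmf(k):
    """Pmf of a geometric random variable with parameter p."""
    return (1 - p) ** (k - 1) * p

# Conditional pmf given X > k0, computed from the definition
# (tail sum truncated at m = 999; the remainder is negligible)
tail = sum(geom_pmf(m) for m in range(k0 + 1, 1000))
conditional = {k: geom_pmf(k) / tail for k in range(k0 + 1, k0 + 11)}
# Memorylessness: this matches a fresh geometric pmf shifted by k0
shifted = {k: geom_pmf(k - k0) for k in range(k0 + 1, k0 + 11)}
```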
Example: Exponential random variables are memoryless. Assume email inter-arrival times are exponential with parameter λ. You get an email, then no email for t_0 minutes. How is the waiting time until the next email distributed now?
Example: Exponential random variables are memoryless
F_{T | T > t_0}(t) = ∫_{t_0}^{t} f_T(u) du / ∫_{t_0}^{∞} f_T(u) du
= ∫_{t_0}^{t} λ e^{−λu} du / ∫_{t_0}^{∞} λ e^{−λu} du
= (e^{−λ t_0} − e^{−λ t}) / e^{−λ t_0}
= 1 − e^{−λ(t − t_0)} for t > t_0
Differentiating with respect to t: f_{T | T > t_0}(t) = λ e^{−λ(t − t_0)} for t > t_0
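A simulation makes the conclusion concrete: among waiting times that exceed t_0, the excess over t_0 behaves like a fresh exponential, so its average is close to the unconditional mean 1/λ. A minimal sketch (λ = 1.5, t_0 = 2.0, and the sample size are assumed values; the mean of an exponential being 1/λ is a standard fact used here, not derived on the slides):

```python
import random
from math import log

rng = random.Random(0)
lam, t0 = 1.5, 2.0  # example rate and elapsed time

def exponential():
    """One exponential(lam) sample via the inverse cdf."""
    return -log(1 - rng.random()) / lam

samples = [exponential() for _ in range(200_000)]
# Condition on T > t0: the excess T - t0 should again look exponential(lam)
excess = [t - t0 for t in samples if t > t0]
mean_excess = sum(excess) / len(excess)  # close to the unconditional mean 1/lam
```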
Functions of random variables: For any deterministic function g and random variable X, Y := g(X) is a random variable. Formally, X maps elements of Ω to ℝ, so Y does too, since Y(ω) = g(X(ω))
Discrete random variables: If X is discrete, p_Y(y) = P(Y = y) = P(g(X) = y) = Σ_{{x | g(x) = y}} p_X(x)
Continuous random variables: If X is continuous, F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = ∫_{{x | g(x) ≤ y}} f_X(x) dx. Then we can differentiate to obtain the pdf f_Y
Gaussian random variable: If X is a Gaussian random variable with mean µ and standard deviation σ, derive the distribution of U := (X − µ)/σ
Gaussian random variable
F_U(u) = P((X − µ)/σ ≤ u)
= ∫_{(x−µ)/σ ≤ u} (1 / (√(2π) σ)) e^{−(x − µ)² / (2σ²)} dx
= ∫_{−∞}^{u} (1 / √(2π)) e^{−w²/2} dw by the change of variables w = (x − µ)/σ
To obtain the pdf we differentiate with respect to u: f_U(u) = (1 / √(2π)) e^{−u²/2}
U is a standard Gaussian random variable
Generating random variables: Simulation is crucial to leverage probabilistic models effectively (life is not a homework problem!). It requires being able to sample from arbitrary distributions. General approach: 1. Generate samples uniformly from the unit interval [0, 1]. 2. Transform the samples so that they have the desired distribution.
Sampling from a discrete distribution: Aim: generate a discrete random variable X with pmf p_X using samples from a uniform random variable U. Possible values of X: x_1 < x_2 < ... How can we assign samples u_1, u_2, ... from U to x_1, x_2, ...?
Sampling from a discrete distribution: [Figure: samples u_1, ..., u_5 on the unit interval [0, 1], to be assigned to the values x_1, x_2, x_3]
Sampling from a discrete distribution: Idea: assign samples in an interval of length p_X(x_i) to x_i
Sampling from a discrete distribution:
X = x_1 if 0 ≤ U ≤ p_X(x_1)
X = x_2 if p_X(x_1) < U ≤ p_X(x_1) + p_X(x_2)
...
X = x_i if Σ_{j=1}^{i−1} p_X(x_j) < U ≤ Σ_{j=1}^{i} p_X(x_j)
...
Sampling from a discrete distribution:
X = x_1 if 0 ≤ U ≤ F_X(x_1)
X = x_2 if F_X(x_1) < U ≤ F_X(x_2)
...
X = x_i if F_X(x_{i−1}) < U ≤ F_X(x_i)
...
Sampling from a discrete distribution: [Figure: the unit interval partitioned at F_X(x_1) and F_X(x_2), with the samples u_1, ..., u_5 assigned to x_1, x_2, x_3 accordingly]
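The assignment rule above can be sketched directly in code: accumulate the pmf into a cdf and return the first value whose cumulative probability covers the uniform sample. A minimal sketch (the range {1, 2, 3} and the pmf values are hypothetical examples, not from the slides):

```python
import random
from itertools import accumulate

rng = random.Random(0)

values = [1, 2, 3]           # hypothetical range x_1 < x_2 < x_3
pmf = [0.2, 0.5, 0.3]        # hypothetical pmf values
cdf = list(accumulate(pmf))  # cumulative sums F_X(x_i): 0.2, 0.7, 1.0

def sample():
    """Return the value x_i whose interval (F_X(x_{i-1}), F_X(x_i)] contains u."""
    u = rng.random()
    for x, F in zip(values, cdf):
        if u <= F:
            return x
    return values[-1]  # guard against floating-point roundoff in cdf[-1]

n = 100_000
counts = {x: 0 for x in values}
for _ in range(n):
    counts[sample()] += 1
freqs = {x: counts[x] / n for x in values}  # close to the pmf
```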
Inverse-transform sampling: Aim: generate a continuous random variable X with cdf F_X using samples from a uniform random variable U. Algorithm: 1. Obtain a sample u of U. 2. Set x := F_X^{−1}(u).
Inverse-transform sampling: Set Y := F_X^{−1}(U); then
F_Y(y) = P(Y ≤ y)
= P(F_X^{−1}(U) ≤ y)
= P(U ≤ F_X(y))
= ∫_{u=0}^{F_X(y)} du
= F_X(y)
Generating an exponential random variable: Aim: generate an exponential random variable X with parameter λ. F_X(x) := 1 − e^{−λx}, so F_X^{−1}(u) = (1/λ) log(1 / (1 − u)). F_X^{−1}(U) is an exponential random variable with parameter λ
Generating an exponential random variable: [Figure: samples u_1, ..., u_5 on the unit interval mapped through F_X^{−1} to the values F_X^{−1}(u_1), ..., F_X^{−1}(u_5)]
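The exponential case makes a complete worked example of inverse-transform sampling. A minimal sketch (λ = 2.0 and the sample size are assumed values; the check uses the standard fact that the exponential mean is 1/λ):

```python
import random
from math import exp, log

rng = random.Random(0)
lam = 2.0  # example rate

def exp_inverse_cdf(u):
    """F_X^{-1}(u) = (1/lam) log(1/(1 - u)) for the exponential cdf."""
    return log(1 / (1 - u)) / lam

samples = [exp_inverse_cdf(rng.random()) for _ in range(100_000)]
mean = sum(samples) / len(samples)  # exponential mean is 1/lam = 0.5
# Empirical cdf at x = 1 vs the target F_X(1) = 1 - e^{-lam}
ecdf_at_1 = sum(s <= 1.0 for s in samples) / len(samples)
target = 1 - exp(-lam)
```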