Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda

Probability and statistics Probability: a framework for dealing with uncertainty. Statistics: a framework for extracting information from data by making probabilistic assumptions.

Probability
Probability basics: probability spaces, conditional probability, independence, conditional independence
Random variables: pmf, cdf, pdf, important distributions, functions of random variables
Multivariate random variables: joint pmf, joint cdf, joint pdf, marginal distributions, conditional distributions, independence, joint distribution of discrete/continuous random variables

Probability
Expectation: definition, mean, median, variance, Markov and Chebyshev inequalities, covariance, correlation coefficient, covariance matrix, conditional expectation
Random processes: definition, mean, autocovariance, important processes (iid, Gaussian, Poisson, random walk), Markov chains
Convergence: types of convergence, law of large numbers, central limit theorem, convergence of Markov chains
Simulation: motivation, inverse-transform sampling, rejection sampling, Markov-chain Monte Carlo

Statistics
Descriptive statistics: histogram, empirical mean/variance, order statistics, empirical covariance, principal component analysis
Statistical estimation: frequentist perspective, mean square error, consistency, confidence intervals
Learning models: method of moments, maximum likelihood, empirical cdf, kernel density estimation

Statistics
Hypothesis testing: definitions (null/alternative hypothesis, Type I/II errors), significance level, power, p-value, parametric testing, power function, likelihood-ratio test, permutation test, multiple testing, Bonferroni's method
Bayesian statistics: prior, likelihood, posterior, posterior mean/mode
Linear regression: linear models, least squares, geometric interpretation, probabilistic interpretation, overfitting

Random walk with a drift We define the random walk $X$ as the discrete-state discrete-time random process
$$X(0) := 0, \qquad X(i) := X(i-1) + S(i) + 1, \quad i = 1, 2, \ldots$$
where
$$S(i) = \begin{cases} +1 & \text{with probability } 1/2, \\ -1 & \text{with probability } 1/2, \end{cases}$$
is an iid sequence of steps.

Random walk with a drift What is the mean of this random process?
$$\operatorname{E}\left(X(i)\right) = \operatorname{E}\left(\sum_{j=1}^{i} \left(S(j) + 1\right)\right) = \sum_{j=1}^{i} \operatorname{E}\left(S(j)\right) + i = i$$

Random walk with a drift What is the autocovariance? Use the fact that the autocovariance of the random walk without drift $W$ that we studied in the lecture notes is $R_W(i, j) = \min\{i, j\}$.

Random walk with a drift We can write $X(i) = W(i) + i$, where $\operatorname{E}\left(W(i)\right) = 0$, so
$$\begin{aligned} R_X(i, j) &:= \operatorname{E}\left(X(i)\,X(j)\right) - \operatorname{E}\left(X(i)\right)\operatorname{E}\left(X(j)\right) \\ &= \operatorname{E}\left(\left(W(i)+i\right)\left(W(j)+j\right)\right) - \operatorname{E}\left(W(i)+i\right)\operatorname{E}\left(W(j)+j\right) \\ &= \operatorname{E}\left(W(i)\,W(j)\right) + i\operatorname{E}\left(W(j)\right) + j\operatorname{E}\left(W(i)\right) + ij - i\operatorname{E}\left(W(j)\right) - j\operatorname{E}\left(W(i)\right) - ij \\ &= \operatorname{E}\left(W(i)\,W(j)\right) = R_W(i, j) = \min\{i, j\} \end{aligned}$$
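
A quick way to sanity-check these formulas is to simulate the process. The sketch below is not part of the original slides; the number of simulated paths and the indices $i = 10$, $j = 20$ are arbitrary choices. It estimates the mean of $X(10)$ and the covariance between $X(10)$ and $X(20)$ by Monte Carlo.

```python
# Monte Carlo check of the mean and the autocovariance derived above.
# Simulation sketch; the number of paths and the indices i, j are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 100_000, 30
steps = rng.choice([-1, 1], size=(n_paths, n_steps))  # S(i) = +1 or -1 with probability 1/2
X = np.cumsum(steps + 1, axis=1)                      # X(i) = X(i-1) + S(i) + 1, X(0) = 0

i, j = 10, 20
print(X[:, i - 1].mean())                             # should be close to E[X(i)] = i = 10
print(np.cov(X[:, i - 1], X[:, j - 1])[0, 1])         # should be close to min(i, j) = 10
```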

Random walk with a drift Compute the first-order pmf of $X(i)$. Recall that the first-order pmf of the random walk $W$ equals
$$p_{W(i)}(x) = \begin{cases} \binom{i}{\frac{i+x}{2}} \frac{1}{2^i} & \text{if } i + x \text{ is even and } -i \le x \le i, \\ 0 & \text{otherwise.} \end{cases}$$

Random walk with a drift
$$p_{X(i)}(x) = \operatorname{P}\left(X(i) = x\right) = \operatorname{P}\left(W(i) = x - i\right) = p_{W(i)}(x - i) = \begin{cases} \binom{i}{\frac{x}{2}} \frac{1}{2^i} & \text{if } x \text{ is even and } 0 \le x \le 2i, \\ 0 & \text{otherwise.} \end{cases}$$
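
The same simulation idea can be used to check this first-order pmf. The following sketch (illustrative only; $i = 8$ and the sample size are arbitrary) compares the formula above with empirical frequencies.

```python
# Compare the first-order pmf of X(i) with empirical frequencies from simulation.
# Verification sketch; i = 8 and the sample size are arbitrary.
import numpy as np
from math import comb

rng = np.random.default_rng(1)
i = 8
steps = rng.choice([-1, 1], size=(200_000, i))
X_i = (steps + 1).sum(axis=1)                 # realizations of X(i)

for x in range(0, 2 * i + 1, 2):              # x even, 0 <= x <= 2i
    theoretical = comb(i, x // 2) / 2 ** i
    empirical = np.mean(X_i == x)
    print(x, round(theoretical, 4), round(empirical, 4))
```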

Random walk with a drift Does the process satisfy the Markov condition?
$$p_{X(i+1) \mid X(1), X(2), \ldots, X(i)}(x_{i+1} \mid x_1, x_2, \ldots, x_i) = p_{X(i+1) \mid X(i)}(x_{i+1} \mid x_i)$$

Random walk with a drift
$$p_{X(i+1) \mid X(1), X(2), \ldots, X(i)}(x_{i+1} \mid x_1, x_2, \ldots, x_i) = \operatorname{P}\left(x_i + S(i+1) + 1 = x_{i+1}\right) = p_{X(i+1) \mid X(i)}(x_{i+1} \mid x_i),$$
so the process satisfies the Markov condition.

Random walk with a drift We observe that X (10) = 16 and X (20) = 30. What is the best estimator for X (21) in terms of probability of error?

Random walk with a drift
$$p_{X(21) \mid X(10), X(20)}(x \mid 16, 30) = p_{X(21) \mid X(20)}(x \mid 30) = \begin{cases} \frac{1}{2} & \text{if } x = 32, \\ \frac{1}{2} & \text{if } x = 30, \\ 0 & \text{otherwise.} \end{cases}$$
Both 30 and 32 are equally likely, so either value minimizes the probability of error.

Markov chain Consider a Markov chain $X$ with transition matrix
$$T_X := \begin{bmatrix} a & 1 \\ 1 - a & 0 \end{bmatrix},$$
where $a$ is a constant between 0 and 1. We label the two states 0 and 1. The transition matrix $T_X$ has two eigenvectors
$$q_1 := \begin{bmatrix} \frac{1}{1-a} \\ 1 \end{bmatrix}, \qquad q_2 := \begin{bmatrix} 1 \\ -1 \end{bmatrix}.$$
The corresponding eigenvalues are $\lambda_1 := 1$ and $\lambda_2 := a - 1$.

Markov chain For what values of a is the Markov chain irreducible?

Markov chain For what values of a is the Markov chain periodic?

Markov chain Express the stationary distribution of $X$ in terms of $a$:
$$p_{\text{stat}} = \frac{1}{(q_1)_1 + (q_1)_2}\, q_1 = \frac{1}{2-a} \begin{bmatrix} 1 \\ 1 - a \end{bmatrix}$$
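
As a numerical check, one can extract the eigenvector of $T_X$ with eigenvalue 1 and normalize it so its entries sum to one. The sketch below does this for an arbitrary value $a = 0.3$ and compares the result with the closed-form expression.

```python
# Numerical check of the stationary distribution p_stat = [1, 1-a] / (2-a).
# Illustrative sketch; a = 0.3 is an arbitrary choice.
import numpy as np

a = 0.3
T_X = np.array([[a, 1.0],
                [1.0 - a, 0.0]])                     # columns sum to 1

eigvals, eigvecs = np.linalg.eig(T_X)
q1 = eigvecs[:, np.argmin(np.abs(eigvals - 1.0))]    # eigenvector with eigenvalue 1
p_stat = q1 / q1.sum()                               # normalize so the entries sum to 1
print(p_stat)                                        # [0.588..., 0.411...]
print(np.array([1.0, 1.0 - a]) / (2.0 - a))          # closed-form expression from the slides
```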

Markov chain Does the Markov chain always converge in probability for all values of a? Justify that this is the case or provide a counterexample.

Markov chain Express the conditional pmf of X (i) conditioned on X (1) = 0 as a function of a and i. (Hint: Computing q 1 + q 2 could be a helpful first step.) Evaluate the expression at a = 0 and a = 1. Does the result make sense?

Markov chain We have
$$q_1 + q_2 = \begin{bmatrix} \frac{1}{1-a} \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} \frac{2-a}{1-a} \\ 0 \end{bmatrix},$$
so
$$p_{X(0)} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \frac{1-a}{2-a} \left( q_1 + q_2 \right).$$

Markov chain
$$\begin{aligned} p_{X(i)} &= T_X^{\,i}\, p_{X(0)} = T_X^{\,i}\, \frac{1-a}{2-a} \left( q_1 + q_2 \right) = \frac{1-a}{2-a} \left( \lambda_1^i q_1 + \lambda_2^i q_2 \right) \\ &= \frac{1-a}{2-a} \left( \begin{bmatrix} \frac{1}{1-a} \\ 1 \end{bmatrix} + (a-1)^i \begin{bmatrix} 1 \\ -1 \end{bmatrix} \right) = \frac{1}{2-a} \begin{bmatrix} 1 - (a-1)^{i+1} \\ (1-a)\left(1 - (a-1)^i\right) \end{bmatrix} \end{aligned}$$

Markov chain For $a = 1$ we have $p_{X(i)} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$.

Markov chain For $a = 0$ we have
$$p_{X(i)} = \frac{1}{2} \begin{bmatrix} 1 - (-1)^{i+1} \\ 1 - (-1)^i \end{bmatrix} = \begin{cases} \begin{bmatrix} 0 \\ 1 \end{bmatrix} & \text{if } i \text{ is odd,} \\ \begin{bmatrix} 1 \\ 0 \end{bmatrix} & \text{if } i \text{ is even.} \end{cases}$$
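
The closed-form expression for $p_{X(i)}$ can also be checked against repeated multiplication by the transition matrix. The sketch below assumes the chain starts in state 0 and uses the arbitrary value $a = 0.3$; the two printed vectors should agree.

```python
# Compare T_X^i p_X(0) with the closed-form expression derived above.
# Sketch assuming the chain starts in state 0; a = 0.3 is arbitrary.
import numpy as np

a = 0.3
T_X = np.array([[a, 1.0],
                [1.0 - a, 0.0]])
p0 = np.array([1.0, 0.0])

for i in range(1, 6):
    iterated = np.linalg.matrix_power(T_X, i) @ p0
    closed_form = np.array([1.0 - (a - 1.0) ** (i + 1),
                            (1.0 - a) * (1.0 - (a - 1.0) ** i)]) / (2.0 - a)
    print(i, iterated, closed_form)           # the two vectors should agree
```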

Sampling from multivariate distributions We are interested in generating samples from the joint distribution of two random variables $X$ and $Y$. If we generate a sample $x$ according to the pdf $f_X$ and a sample $y$ according to the pdf $f_Y$, are these samples a realization of the joint distribution of $X$ and $Y$? Explain your answer with a simple example.

Sampling from multivariate distributions Now, assume that $X$ is discrete and $Y$ is continuous. Propose a method to generate a sample from the joint distribution using the pmf of $X$ and the conditional cdf of $Y$ given $X$, based on two independent samples from a distribution that is uniform between 0 and 1. Assume that the conditional cdf is invertible.

Sampling from multivariate distributions
1. Obtain two independent samples $u_1$ and $u_2$ from the uniform distribution.
2. Set $x$ to equal the smallest value $a$ such that $p_X(a) \neq 0$ and $u_1 \le F_X(a)$.
3. Define $F_x(\cdot) := F_{Y \mid X}(\cdot \mid x)$ and set $y := F_x^{-1}(u_2)$.
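
The procedure is easy to implement once $F_X$ and the conditional cdf are specified. The sketch below uses a hypothetical example that is not part of the original problem: $X$ is Bernoulli with $p_X(1) = 0.3$ and, given $X = x$, $Y$ is exponential with rate $1 + x$.

```python
# Sketch of the two-step inverse-transform procedure above for a hypothetical example:
# X is Bernoulli with p_X(1) = 0.3 and, given X = x, Y is exponential with rate 1 + x.
import numpy as np

def sample_joint(rng):
    u1, u2 = rng.uniform(size=2)
    # Step 2: smallest value a with p_X(a) != 0 and u1 <= F_X(a); here F_X(0) = 0.7.
    x = 0 if u1 <= 0.7 else 1
    # Step 3: invert the conditional cdf F_{Y|X}(y|x) = 1 - exp(-(1 + x) y).
    y = -np.log(1.0 - u2) / (1.0 + x)
    return x, y

rng = np.random.default_rng(2)
print([sample_joint(rng) for _ in range(5)])
```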

Sampling from multivariate distributions Explain how to generate samples from a random variable with pdf
$$f_W(w) = 0.1\, \lambda_1 \exp(-\lambda_1 w) + 0.9\, \lambda_2 \exp(-\lambda_2 w), \qquad w \ge 0,$$
where $\lambda_1$ and $\lambda_2$ are positive constants, using two iid uniform samples between 0 and 1.

Sampling from multivariate distributions Let us define a Bernoulli random variable $X$ with parameter 0.9, such that if $X = 0$ then $Y$ is exponential with parameter $\lambda_1$ and if $X = 1$ then $Y$ is exponential with parameter $\lambda_2$. The marginal distribution of $Y$ is
$$f_Y(w) = p_X(0)\, f_{Y \mid X}(w \mid 0) + p_X(1)\, f_{Y \mid X}(w \mid 1) = 0.1\, \lambda_1 \exp(-\lambda_1 w) + 0.9\, \lambda_2 \exp(-\lambda_2 w).$$

Sampling from multivariate distributions
1. We obtain two independent samples $u_1$ and $u_2$ from the uniform distribution.
2. If $u_1 \le 0.1$ we set
$$w := \frac{1}{\lambda_1} \log\left(\frac{1}{1 - u_2}\right),$$
otherwise we set
$$w := \frac{1}{\lambda_2} \log\left(\frac{1}{1 - u_2}\right).$$
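
A minimal sketch of this two-step procedure is given below, with arbitrary values for $\lambda_1$ and $\lambda_2$; the last line compares the empirical mean of the samples with the exact mean $0.1/\lambda_1 + 0.9/\lambda_2$.

```python
# Sketch of the mixture-sampling procedure above; lambda_1 and lambda_2 are arbitrary.
import numpy as np

def sample_w(rng, lambda_1=0.5, lambda_2=2.0):
    u1, u2 = rng.uniform(size=2)
    rate = lambda_1 if u1 <= 0.1 else lambda_2   # choose the mixture component with u1
    return np.log(1.0 / (1.0 - u2)) / rate       # inverse cdf of the exponential applied to u2

rng = np.random.default_rng(3)
w = np.array([sample_w(rng) for _ in range(100_000)])
print(w.mean(), 0.1 / 0.5 + 0.9 / 2.0)           # empirical mean vs. exact mean 0.65
```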

Convergence Let $U$ be a random variable uniformly distributed between 0 and 1. If we define the discrete random process $X$ by $X(i) := U$ for all $i$, does $X$ converge to $1 - U$ in probability?

Convergence Does $X$ converge to $1 - U$ in distribution?

Convergence You draw some iid samples $x_1, x_2, \ldots$ from a Cauchy random variable. Will the empirical mean $\frac{1}{n} \sum_{i=1}^{n} x_i$ converge in probability as $n$ grows large? Explain why briefly and, if the answer is yes, state what it converges to.

Convergence You draw $m$ iid samples $x_1, x_2, \ldots, x_m$ from a Cauchy random variable. Then you draw iid samples $y_1, y_2, \ldots$ uniformly from $\{x_1, x_2, \ldots, x_m\}$ (each $y_i$ is equal to each element of $\{x_1, x_2, \ldots, x_m\}$ with probability $1/m$). Will the empirical mean $\frac{1}{n} \sum_{i=1}^{n} y_i$ converge in probability as $n$ grows large? Explain why very briefly and, if the answer is yes, state what it converges to.

Earthquake We are interested in learning a model for the occurrence of earthquakes. We decide to model the time between earthquakes as an exponential random variable with parameter $\lambda$. Compute the maximum-likelihood estimate of $\lambda$ given $t_1, t_2, \ldots, t_n$, which are interarrival times for past earthquakes. Assume that the data are iid.

Earthquake
$$L(\lambda) := f_{T(1), \ldots, T(n)}(t_1, \ldots, t_n) = \prod_{i=1}^{n} \lambda \exp(-\lambda t_i) = \lambda^n \exp\left(-\lambda \sum_{i=1}^{n} t_i\right)$$
$$\log L(\lambda) = n \log \lambda - \lambda \sum_{i=1}^{n} t_i$$
$$\frac{d \log L(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} t_i, \qquad \frac{d^2 \log L(\lambda)}{d\lambda^2} = -\frac{n}{\lambda^2} < 0,$$
so the log-likelihood is concave and its maximizer is
$$\lambda_{\text{ML}} = \frac{1}{\frac{1}{n} \sum_{i=1}^{n} t_i} = \frac{n}{\sum_{i=1}^{n} t_i}.$$
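
As a sanity check, the sketch below draws synthetic exponential interarrival times with a known rate (2.0, an arbitrary choice) and verifies that $n / \sum_i t_i$ recovers it.

```python
# Check that the ML estimate n / sum(t_i) recovers a known rate on synthetic data.
# Sketch; the true rate 2.0 is arbitrary.
import numpy as np

rng = np.random.default_rng(4)
true_rate = 2.0
t = rng.exponential(scale=1.0 / true_rate, size=10_000)   # iid interarrival times
lambda_ml = len(t) / t.sum()                               # reciprocal of the empirical mean
print(lambda_ml)                                           # close to 2.0
```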

Earthquake Find an approximate 0.95 confidence interval for $\lambda$ based on the central limit theorem. Assume that you know a bound $b$ on the standard deviation (i.e., the variance of the exponential, $1/\lambda^2$, is bounded by $b^2$) and express your answer using the Q function. (Hint: Express the ML estimate in terms of the empirical mean.) (See solutions.)

Earthquake What is the posterior distribution of the parameter $\Lambda$ if we model it as a random variable with a uniform distribution between 0 and $u$? Express your answer in terms of the sum $\sum_{i=1}^{n} t_i$, $u$, and the marginal pdf of the data evaluated at $t_1, t_2, \ldots, t_n$, $c := f_{T(1), \ldots, T(n)}(t_1, \ldots, t_n)$.

Earthquake
$$f_{\Lambda \mid T(1), \ldots, T(n)}(\lambda \mid t_1, \ldots, t_n) = \frac{f_\Lambda(\lambda)\, \lambda^n \exp\left(-\lambda \sum_{i=1}^{n} t_i\right)}{f_{T(1), \ldots, T(n)}(t_1, \ldots, t_n)} = \frac{1}{u\, c}\, \lambda^n \exp\left(-\lambda \sum_{i=1}^{n} t_i\right)$$
for $0 \le \lambda \le u$ and zero otherwise.
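
The posterior can also be evaluated numerically: compute $\lambda^n \exp(-\lambda \sum_i t_i)$ on a grid over $[0, u]$ and normalize, which plays the role of the constant $u\,c$ above. The sketch below uses synthetic data and an arbitrary value of $u$; the posterior mode it prints should be close to $n / \sum_i t_i$.

```python
# Numerical version of the posterior: evaluate lambda^n exp(-lambda sum(t_i)) on [0, u]
# and normalize on a grid. Sketch with synthetic data; u = 10 is arbitrary.
import numpy as np

rng = np.random.default_rng(5)
t = rng.exponential(scale=0.5, size=20)          # synthetic interarrival times (rate 2)
u = 10.0                                         # upper limit of the uniform prior
grid = np.linspace(0.0, u, 2001)
unnormalized = grid ** len(t) * np.exp(-grid * t.sum())
posterior = unnormalized / (unnormalized.sum() * (grid[1] - grid[0]))  # numerical normalization
print(grid[np.argmax(posterior)])                # posterior mode, close to n / sum(t_i)
print(len(t) / t.sum())
```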

Earthquake [Figure: plot of the posterior density $f_{\Lambda \mid T(1), \ldots, T(n)}(\lambda \mid t_1, \ldots, t_n)$ as a function of $\lambda$.]

Earthquake Explain how you would use the answer in the previous question to construct a confidence interval for the parameter

Chad You hate a coworker and want to predict when he is in the office from the temperature. The observed temperatures are

Chad:    61 65 59 61 61 65 61 63 63 59
No Chad: 68 70 68 64 64

You model his presence using a random variable $C$ which is equal to 1 if he is there and 0 if he is not. Estimate $p_C$.

Chad The empirical pmf is $p_C(0) = \frac{5}{15} = \frac{1}{3}$, $p_C(1) = \frac{10}{15} = \frac{2}{3}$.

Chad You model the temperature using a random variable T. Sketch the kernel density estimator of the conditional distribution of T given C using a rectangular kernel with width equal to 2.

Chad [Figure: kernel density estimates $f_{T \mid C}(t \mid 0)$ and $f_{T \mid C}(t \mid 1)$ plotted for temperatures between 55 and 75; the vertical axis ranges from 0 to 0.20.]
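
For reference, a rectangular-kernel density estimate with width 2 can be computed directly from the data. The sketch below uses a half-open kernel support, an assumption made here so that the values at points exactly on a kernel boundary match those used in the ML and MAP computations below (0.2 and 0.1 at $T = 64$).

```python
# Rectangular-kernel density estimate with width 2 for the two conditional distributions of T.
# The half-open kernel support is an assumption made so that boundary values match the slides.
import numpy as np

chad = np.array([61, 65, 59, 61, 61, 65, 61, 63, 63, 59])   # temperatures with Chad present (C = 1)
no_chad = np.array([68, 70, 68, 64, 64])                     # temperatures with Chad absent (C = 0)

def kde_rect(t, data, width=2.0):
    # Average of rectangular kernels of height 1/width supported on [x_i - width/2, x_i + width/2).
    inside = (t >= data - width / 2) & (t < data + width / 2)
    return inside.mean() / width

for t in [57, 64, 68]:
    print(t, kde_rect(t, no_chad), kde_rect(t, chad))         # f_{T|C}(t|0), f_{T|C}(t|1)
```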

Chad If $T = 68$ what is the ML estimate of $C$? We have $f_{T \mid C}(68 \mid 0) = 0.2$ and $f_{T \mid C}(68 \mid 1) = 0$, so the ML estimate is $C = 0$ (Chad is not in the office).

Chad If $T = 64$ what is the MAP estimate of $C$?
$$p_{C \mid T}(0 \mid 64) = \frac{p_C(0)\, f_{T \mid C}(64 \mid 0)}{p_C(0)\, f_{T \mid C}(64 \mid 0) + p_C(1)\, f_{T \mid C}(64 \mid 1)} = \frac{\frac{1}{3} \cdot 0.2}{\frac{1}{3} \cdot 0.2 + \frac{2}{3} \cdot 0.1} = \frac{1}{2}$$
$$p_{C \mid T}(1 \mid 64) = 1 - p_{C \mid T}(0 \mid 64) = \frac{1}{2}$$
Both values are equally probable, so the MAP estimate is not unique.

Chad What happens if the temperature is 57? Explain how using parametric estimation may alleviate this problem.

3-point shooting The New York Knicks hire you as a data analyst. Your first task is to come up with a way to determine whether a 3-point shooter is any good. You will use the following graph of the function $g(\theta, n) = \theta^n$. [Figure: $g(\theta, n)$ plotted against $\theta$ for $0.5 \le \theta \le 1$, with one curve for each of $n$ = 4, 9, 14, 19, and 24; horizontal gridlines mark values from 0.005 to 0.95.]

3-point shooting
1. Interpret $g(\theta, n)$.
2. The coach tells you: I want to make sure that the guy has a shooting percentage over 80%. What is your null hypothesis?
3. What number of shots does a player need to make in a row for you to reject the null hypothesis at a significance level of 5%? 14
4. A player makes 9 shots in a row. What is the corresponding p-value? Do you declare him a good shooter? The p-value is about 0.14, which is above 0.05, so no.
5. What is the probability that you do not declare a player who has a shooting percentage of 90% as a good shooter? $1 - g(0.9, 14) \approx 0.76$
6. You apply the test on 10 players. You adapt the threshold applying Bonferroni's method. What is the new threshold? The significance level becomes $0.05 / 10 = 0.005$, which corresponds to $n = 24$ makes in a row.
7. With the correction, what is the probability that you do not declare a player who has a shooting percentage of 90% as a good shooter? $1 - g(0.9, 24) \approx 0.92$
8. What is the advantage of adapting the threshold? What is the disadvantage?
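
The numerical answers above follow from $g(\theta, n) = \theta^n$; they can be reproduced, up to the precision of reading the graph, with a few lines of code, sketched below.

```python
# Reproduce the numerical answers above; g(theta, n) = theta**n is the probability
# of making n shots in a row for a shooter with shooting percentage theta.
alpha = 0.05

n_reject = next(n for n in range(1, 100) if 0.8 ** n <= alpha)
print(n_reject)                          # 14 shots in a row needed at the 5% level

print(0.8 ** 9)                          # p-value for 9 makes in a row, roughly the 0.14 read off the graph

print(1 - 0.9 ** n_reject)               # type II error for a 90% shooter, close to the 0.76 from the graph

alpha_bonferroni = alpha / 10            # Bonferroni-corrected level for 10 players
n_bonferroni = next(n for n in range(1, 100) if 0.8 ** n <= alpha_bonferroni)
print(n_bonferroni)                      # 24
print(1 - 0.9 ** n_bonferroni)           # about 0.92
```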