MAS223 Statistical Modelling and Inference Examples

Chapter 1

Example 1: Sample spaces and random variables. Let $S$ be the sample space for the experiment of tossing two coins; i.e. $S = \{HH, HT, TH, TT\}$. Define the random variables $X$ to be the number of heads seen, and $Y$ to be equal to $5$ if we see both a head and a tail, and $0$ otherwise.

Element of $S$ | Value of $X$ | Value of $Y$
HH | 2 | 0
HT | 1 | 5
TH | 1 | 5
TT | 0 | 0

Example 2: Discrete random variables. The random variables $X$ and $Y$ from Example 1 are both discrete random variables.

> Calculate $P[X \ge 1]$.

If $X \ge 1$ then either $X = 1$ or $X = 2$. We have $P[X \ge 1] = P[X=1] + P[X=2] = \frac12 + \frac14 = \frac34$.

> Sketch the distribution function of $X$.

A sketch of its distribution function looks like: (Sketch omitted.)

Example 3: Continuous random variables

> Recall from MAS113 that an exponential random variable with parameter $\lambda > 0$ has probability density function $f_X(x) = \lambda e^{-\lambda x}$ for $x > 0$, and $f_X(x) = 0$ otherwise. Calculate $P[1 \le X \le 2]$, and find the distribution function $F_X(x)$.

We can calculate
$P[1 \le X \le 2] = \int_1^2 f_X(x)\,dx = \int_1^2 \lambda e^{-\lambda x}\,dx = \left[-e^{-\lambda x}\right]_{x=1}^{x=2} = e^{-\lambda} - e^{-2\lambda}.$
To find the distribution function, note that for $x \le 0$ we have $P[X \le x] = 0$, and for $x > 0$ we have
$P[X \le x] = \int_0^x f_X(u)\,du = \int_0^x \lambda e^{-\lambda u}\,du = 1 - e^{-\lambda x}.$
Therefore $F_X(x) = 1 - e^{-\lambda x}$ if $x > 0$, and $F_X(x) = 0$ otherwise. A sketch of the distribution function $F_X(x)$ looks like: (Sketch omitted.)

Example 4: Properties of distribution functions

> Let $F(x) = 1 - \frac1x$ if $x > 1$, and $F(x) = 0$ otherwise. Sketch $F$ and show that $F$ is a distribution function.

A sketch of $F$ looks like: (Sketch omitted.)

To show that $F$ is a distribution function, we'll check properties 1-3 of distribution functions from Chapter 1.

1. From the definition, $0 \le F(x) \le 1$ for all $x$. Since $F(x) = 0$ for all $x \le 1$ we have $\lim_{x\to-\infty} F(x) = 0$, and also $\lim_{x\to\infty} F(x) = \lim_{x\to\infty}\left(1-\frac1x\right) = 1$.

2. Since $F(x) = 0$ for all $x \le 1$, it's clear that $F(x)$ is non-decreasing while $x \le 1$. If $1 < x < y$ then $\frac1y < \frac1x$, so $1-\frac1x \le 1-\frac1y$. Hence $F$ is non-decreasing across all $x \in \mathbb{R}$.

3. From its definition, $F$ is continuous on $(-\infty,1)$ and on $(1,\infty)$. Since $F(1^+) = F(1) = 0$, we have that $F$ is continuous everywhere. Alternatively, in this course, we allow ourselves to prove continuity by drawing a sketch, as above.

Hence, $F$ is a distribution function, and as a result there exists a random variable $X$ with distribution function $F_X = F$.

Example 5: Calculating expectations and variances

> Let $X$ be an Exponential random variable, from Example 3, with p.d.f. $f_X(x) = \lambda e^{-\lambda x}$ for $x > 0$, and $0$ otherwise. Find the mean and variance of $X$.

We can calculate, integrating by parts,
$E[X] = \int_{-\infty}^{+\infty} x f_X(x)\,dx = \int_0^\infty x\lambda e^{-\lambda x}\,dx = \left[-xe^{-\lambda x}\right]_0^\infty + \int_0^\infty e^{-\lambda x}\,dx = \left[-\tfrac1\lambda e^{-\lambda x}\right]_0^\infty = \tfrac1\lambda.$
For the variance, it is easiest to calculate $E[X^2]$ and then use from MAS113 that $\mathrm{Var}(X) = E[X^2] - E[X]^2$. So,
$E[X^2] = \int_{-\infty}^{+\infty} x^2 f_X(x)\,dx = \int_0^\infty x^2\lambda e^{-\lambda x}\,dx = \left[-x^2 e^{-\lambda x}\right]_0^\infty + 2\int_0^\infty x e^{-\lambda x}\,dx = \tfrac2\lambda\int_0^\infty x\lambda e^{-\lambda x}\,dx = \tfrac{2}{\lambda^2},$
where we use that we already calculated $\int_0^\infty x\lambda e^{-\lambda x}\,dx = \tfrac1\lambda$. Hence,
$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \tfrac{2}{\lambda^2} - \tfrac{1}{\lambda^2} = \tfrac{1}{\lambda^2}.$

Chapter 2

Example 6: Calculating $E[e^Y]$ where $Y \sim N(0,1)$.

> Let $Y$ be a normal random variable, with mean $0$ and variance $1$, with p.d.f. $f_Y(y) = \frac{1}{\sqrt{2\pi}}e^{-y^2/2}$. Find $E[e^Y]$.

We need to calculate
$E[e^Y] = \int_{-\infty}^{\infty} e^y f_Y(y)\,dy = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^y e^{-y^2/2}\,dy. \quad (1)$

We can't evaluate this integral explicitly. However, we do know the value of a similar integral; that is, we know
$P[Y \in \mathbb{R}] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-y^2/2}\,dy = 1. \quad (2)$
Our aim is to rewrite $E[e^Y]$ into this form and hope we can deal with whatever else is left over. We can do so by completing the square:
$e^y e^{-y^2/2} = \exp\left(-\frac{y^2-2y}{2}\right) = \exp\left(-\frac{(y-1)^2-1}{2}\right) = e^{1/2}e^{-(y-1)^2/2}.$
Putting this into (1), we have
$E[e^Y] = e^{1/2}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-(y-1)^2/2}\,dy = e^{1/2}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/2}\,dz,$
where $z = y-1$. Then, using (2), we have $E[e^Y] = e^{1/2}$. See Q1.9 for a more general case of this method.

Example 7: Mean and variance of the Gamma distribution

> Let $X$ have the $\mathrm{Ga}(\alpha,\beta)$ distribution, where $\alpha,\beta > 0$. Find the mean and variance of $X$.

We can calculate
$E[X] = \int_0^\infty x f_X(x)\,dx = \frac{\beta^\alpha}{\Gamma(\alpha)}\int_0^\infty x^{\alpha}e^{-\beta x}\,dx = \frac{\beta^\alpha}{\Gamma(\alpha)}\cdot\frac{\Gamma(\alpha+1)}{\beta^{\alpha+1}} = \frac{\alpha\Gamma(\alpha)}{\beta\Gamma(\alpha)} = \frac{\alpha}{\beta},$
using the Gamma integral $\int_0^\infty x^{a-1}e^{-\beta x}\,dx = \Gamma(a)/\beta^a$ and the recursion $\Gamma(a+1) = a\Gamma(a)$ from Chapter 2. Similarly, for the variance,
$E[X^2] = \int_0^\infty x^2 f_X(x)\,dx = \frac{\beta^\alpha}{\Gamma(\alpha)}\int_0^\infty x^{\alpha+1}e^{-\beta x}\,dx = \frac{\beta^\alpha}{\Gamma(\alpha)}\cdot\frac{\Gamma(\alpha+2)}{\beta^{\alpha+2}} = \frac{\alpha(\alpha+1)}{\beta^2}.$
So
$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{\alpha(\alpha+1)}{\beta^2} - \frac{\alpha^2}{\beta^2} = \frac{\alpha}{\beta^2}.$
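The calculations in Examples 5-7 are easy to sanity-check by simulation. The R snippet below is an illustrative addition (it is not part of the original notes); the sample size and parameter values are arbitrary choices.

    # Monte Carlo checks of Examples 5-7
    set.seed(1)
    lambda <- 2
    x <- rexp(1e6, rate = lambda)
    c(mean(x), 1 / lambda)        # Example 5: mean should be close to 1/lambda
    c(var(x), 1 / lambda^2)       # Example 5: variance should be close to 1/lambda^2
    y <- rnorm(1e6)
    c(mean(exp(y)), exp(1/2))     # Example 6: E[e^Y] = e^{1/2}
    alpha <- 3; beta <- 2
    g <- rgamma(1e6, shape = alpha, rate = beta)
    c(mean(g), alpha / beta)      # Example 7: mean alpha/beta
    c(var(g), alpha / beta^2)     # Example 7: variance alpha/beta^2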

Example 8: Mean and variance of the Beta distribution

> Let $X$ have the $\mathrm{Be}(\alpha,\beta)$ distribution, where $\alpha,\beta > 0$. Find the mean and variance of $X$.

For the mean,
$E[X] = \frac{1}{B(\alpha,\beta)}\int_0^1 x^{\alpha}(1-x)^{\beta-1}\,dx = \frac{B(\alpha+1,\beta)}{B(\alpha,\beta)} = \frac{\Gamma(\alpha+1)\Gamma(\beta)}{\Gamma(\alpha+\beta+1)}\cdot\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} = \frac{\alpha\Gamma(\alpha)\Gamma(\alpha+\beta)}{(\alpha+\beta)\Gamma(\alpha+\beta)\Gamma(\alpha)} = \frac{\alpha}{\alpha+\beta},$
using the relation $B(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$ and the recursion $\Gamma(a+1) = a\Gamma(a)$. For the variance,
$E[X^2] = \frac{1}{B(\alpha,\beta)}\int_0^1 x^{\alpha+1}(1-x)^{\beta-1}\,dx = \frac{B(\alpha+2,\beta)}{B(\alpha,\beta)} = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}.$
So, using that $\mathrm{Var}(X) = E[X^2] - E[X]^2$, we have
$\mathrm{Var}(X) = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} - \frac{\alpha^2}{(\alpha+\beta)^2} = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$

Chapter 3

Example 9: Cube root of the Be(3,1) distribution.

> Let $X \sim \mathrm{Be}(3,1)$ and let $Y = \sqrt[3]{X}$. Find the probability density function of $Y$.

The p.d.f. of the $\mathrm{Be}(\alpha,\beta)$ distribution is $f_X(x) = \frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}$ if $x \in (0,1)$, and $0$ otherwise.

Note that, for any $\alpha > 0$,
$B(\alpha,1) = \frac{\Gamma(\alpha)\Gamma(1)}{\Gamma(\alpha+1)} = \frac{\Gamma(\alpha)}{\alpha\Gamma(\alpha)} = \frac{1}{\alpha},$
using $\Gamma(a+1) = a\Gamma(a)$. Putting this, along with $\alpha = 3$ and $\beta = 1$, into the Beta p.d.f., the p.d.f. of $X$ is $f_X(x) = 3x^2$ if $x \in (0,1)$, and $0$ otherwise.

For the transformation, we use the function $g(x) = \sqrt[3]{x}$, which is strictly increasing. The p.d.f. of $X$ is non-zero on $(0,1)$, and $g$ maps $R_X = (0,1)$ to $(0,1)$, so $g(R_X) = (0,1)$. We have $g^{-1}(y) = y^3$, so $\frac{dg^{-1}}{dy} = 3y^2$. Therefore, by the transformation lemma of Chapter 3, we have
$f_Y(y) = 3(y^3)^2\cdot 3y^2 = 9y^8$ if $y \in (0,1)$, and $0$ otherwise.
In fact, using the same calculations as above, it can be seen that this is the p.d.f. of a $\mathrm{Be}(9,1)$ distribution. See Q3.5 for a more general case.
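As an aside (not part of the original notes), this conclusion is easy to check by simulation in R; the sample size is arbitrary.

    # Example 9: X^(1/3) with X ~ Be(3,1) should look like a Be(9,1) sample
    set.seed(1)
    y <- rbeta(1e5, shape1 = 3, shape2 = 1)^(1/3)
    ks.test(y, pbeta, 9, 1)   # no evidence against the Be(9,1) distribution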

Example 10: Standardization of the normal distribution.

> Let $X \sim N(\mu,\sigma^2)$ and define $Y = \frac{X-\mu}{\sigma}$. Show that $Y \sim N(0,1)$.

The p.d.f. of the normal distribution with mean $\mu$ and variance $\sigma^2$ is
$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$
with range $R_X = \mathbb{R}$. The function $g(x) = \frac{x-\mu}{\sigma}$ is strictly increasing, and $g(\mathbb{R}) = \mathbb{R}$. If $y = \frac{x-\mu}{\sigma}$ then $x = \sigma y+\mu$, hence the inverse function is $g^{-1}(y) = \sigma y+\mu$, with derivative $\frac{dg^{-1}}{dy} = \sigma > 0$. Hence, by the transformation lemma,
$f_Y(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{y^2}{2}\right)\sigma = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right),$
which is the p.d.f. of a $N(0,1)$ random variable.

Example 11: The log-normal distribution.

> Find the probability density function of $Y = e^X$, where $X \sim N(\mu,\sigma^2)$. Recall that $Y$ is known as the log-normal distribution, which we introduced in Chapter 2.

The probability density function of $X$ is $f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, which is non-zero for all $x \in \mathbb{R}$. Our transformation is $g(x) = e^x$, which is strictly increasing for all $x \in \mathbb{R}$. The range of $X$ is $\mathbb{R}$, which is mapped by $g$ to $g(\mathbb{R}) = (0,\infty)$. We have $g^{-1}(y) = \log y$, and $\frac{dg^{-1}}{dy} = \frac1y$. Hence, by the transformation lemma, the p.d.f. of $Y$ is given by
$f_Y(y) = \frac{1}{y\sigma\sqrt{2\pi}}\exp\left(-\frac{(\log y-\mu)^2}{2\sigma^2}\right)$ if $y \in (0,\infty)$, and $0$ otherwise.

Example 12: Square of a standard normal; the chi-squared distribution.

> Let $X \sim N(0,1)$ and let $Y = X^2$. Find the p.d.f. of $Y$ and verify that $Y$ has the $\chi^2_1$ distribution.

We aim to find the p.d.f. of $Y$ and check that it matches the p.d.f. given for the $\chi^2_1$ distribution in Section 2.3.3. Note that $R_X = \mathbb{R}$, and we can't apply the transformation lemma because $g(x) = x^2$ is not strictly monotone on $\mathbb{R}$. If $y < 0$ then $P[Y \le y] = 0$, because $Y = X^2 \ge 0$. Moreover, because the normal distribution is a continuous distribution, $P[X = 0] = 0$, so also $P[Y = 0] = 0$.

This leaves $y > 0$, and in this case we have
$F_Y(y) = P[Y \le y] = P[-\sqrt{y} \le X \le \sqrt{y}] = P[X \le \sqrt{y}] - P[X \le -\sqrt{y}] = \Phi(\sqrt{y}) - \Phi(-\sqrt{y}).$
Here, $\Phi(x) = P[X \le x]$ is the distribution function of the standard normal distribution. Differentiating with respect to $y$, we have
$f_Y(y) = \frac{1}{2\sqrt{y}}\phi(\sqrt{y}) + \frac{1}{2\sqrt{y}}\phi(-\sqrt{y}) = \frac{1}{\sqrt{y}}\phi(\sqrt{y}) = \frac{1}{\sqrt{2\pi y}}\exp(-y/2).$
Here, $\phi$ is the probability density function of the standard normal distribution, and we use that $\phi(x) = \phi(-x)$. If we recall from Chapter 2 that $\Gamma(1/2) = \sqrt{\pi}$, we then have
$f_Y(y) = \frac{y^{-1/2}}{2^{1/2}\Gamma(1/2)}\exp\left(-\frac{y}{2}\right)$ if $y > 0$, and $0$ otherwise,
which exactly matches the p.d.f. given for the $\chi^2_1$ distribution in Section 2.3.3.
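Again as a numerical aside (not in the original notes), we can compare the empirical quantiles of squared standard normal samples against the $\chi^2_1$ quantiles in R:

    # Example 12: squared standard normals follow the chi-squared(1) distribution
    set.seed(1)
    y <- rnorm(1e5)^2
    qqplot(qchisq(ppoints(length(y)), df = 1), y,
           xlab = "chi-squared(1) quantiles", ylab = "sample quantiles")
    abline(0, 1)   # the points should lie close to this line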

Chapter 4

Example 13: Joint probability density functions

> Let $T$ be the triangle $\{(x,y) : x \in (0,1),\ y \in (0,x)\}$. Define $f(x,y) = k(x+y)$ if $(x,y) \in T$, and $0$ otherwise. Find the value of $k$ such that $f$ is a joint probability density function.

First, we sketch the region $T$ on which $f_{X,Y}(x,y)$ is non-zero. (Sketch omitted.) We need $f(x,y) \ge 0$ for all $(x,y)$, which means we must have $k \ge 0$. Also, we need that $\iint_T f(x,y)\,dx\,dy = 1$. Therefore,
$1 = \iint_T f(x,y)\,dy\,dx = k\int_0^1\int_0^x (x+y)\,dy\,dx = k\int_0^1\left[xy+\frac{y^2}{2}\right]_{y=0}^{y=x}dx = k\int_0^1\frac{3x^2}{2}\,dx = \frac{k}{2}.$
So $k = 2$. Here, to find the limits of integration, we describe the region $T$ as being covered by vertical lines, one for each fixed $x$. With $x$ fixed, the range of $y$ that makes up $T$ is $y \in (0,x)$. That is, we use that $T = \{(x,y) : x \in (0,1),\ y \in (0,x)\}$.

> If $X$ and $Y$ have joint p.d.f. $f_{X,Y}(x,y) = f(x,y)$, find $P[X+Y>1]$.

To find $P[X+Y>1]$, we need to integrate $f_{X,Y}(x,y)$ over the region of $(x,y)$ for which $(x,y) \in T$ and $x+y > 1$. Let's call this region $T'$, and sketch it. (Sketch omitted.)

We have $T' = \{(x,y) : x \in (\frac12,1),\ y \in (1-x,x)\}$. So,
$P[X+Y>1] = \int_{1/2}^1\int_{1-x}^x 2(x+y)\,dy\,dx = \int_{1/2}^1\left[2xy+y^2\right]_{y=1-x}^{y=x}dx = \int_{1/2}^1(4x^2-1)\,dx = \left[\frac43x^3 - x\right]_{1/2}^1 = \frac13 + \frac13 = \frac23.$

Example 14: Marginal distributions

> Let $X,Y$ be as in Example 13. Find the marginal p.d.f.s of $X$ and $Y$.

For $x \in (0,1)$,
$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy = \int_0^x 2(x+y)\,dy = \left[2xy+y^2\right]_{y=0}^{y=x} = 3x^2.$
Here, to find the limits of the integral, we keep $x$ fixed, and then look for the range of $y$ for which $f_{X,Y}(x,y)$ is non-zero. That is, we use $T = \{(x,y): x\in(0,1),\ y\in(0,x)\}$. For $x \notin (0,1)$ we have $f_{X,Y}(x,y) = 0$, so
$f_X(x) = 3x^2$ if $x \in (0,1)$, and $0$ otherwise,
is the marginal p.d.f. of $X$. For $y \in (0,1)$, we have
$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx = \int_y^1 2(x+y)\,dx = \left[x^2+2xy\right]_{x=y}^{x=1} = 1+2y-3y^2.$
Here, to find the limits of the integral, we keep $y$ fixed, and then look for the range of $x$ for which $f_{X,Y}(x,y)$ is non-zero. That is, we use $T = \{(x,y): y\in(0,1),\ x\in(y,1)\}$. For $y \notin (0,1)$ we have $f_{X,Y}(x,y) = 0$, so
$f_Y(y) = 1+2y-3y^2$ if $y \in (0,1)$, and $0$ otherwise,
is the marginal p.d.f. of $Y$.

Example 15: Conditional distributions

> Let $X,Y$ be as in Example 13. For $y \in (0,1)$, find the conditional p.d.f. of $X$ given $Y = y$.

We obtained $f_Y(y)$ in Example 14, and we know $f_{X,Y}(x,y)$ from Example 13. Note that, with $y \in (0,1)$ fixed, $f_{X,Y}(x,y)$ is non-zero only for $x \in (y,1)$. So,
$f_{X|Y=y}(x) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{2(x+y)}{1+2y-3y^2}$ if $x \in (y,1)$, and $0$ otherwise.

Example 16: Independence, factorizing $f_{X,Y}$.

> Are the random variables $X$ and $Y$ from Example 13 independent?

The random variables $X$ and $Y$ from Example 13 are not independent, as the p.d.f.
$f(x,y) = 2(x+y)$ if $(x,y) \in T$, and $0$ otherwise,
cannot be factorised as a function of $x$ times a function of $y$.

> Let $U$ and $V$ be two random variables with joint probability density function $f_{U,V}(u,v) = 12u\,e^{-(2u+3v)}$ if $u>0, v>0$, and $0$ otherwise. Are $U$ and $V$ independent?

$f_{U,V}(u,v)$ can be factorised into a function of $u$ and a function of $v$:
$f_{U,V}(u,v) = 4ue^{-2u}\cdot 3e^{-3v} = g(u)h(v)$ for $u>0, v>0$,
where $g(u) = 4ue^{-2u}$ if $u>0$ and $0$ otherwise, and $h(v) = 3e^{-3v}$ if $v>0$ and $0$ otherwise. Therefore, $U$ and $V$ are independent. In fact, in this case we can recognize that $g$ is the p.d.f. of a $\mathrm{Ga}(2,2)$ and $h$ is the p.d.f. of an $\mathrm{Exp}(3)$, so $U$ and $V$ are $\mathrm{Ga}(2,2)$ and $\mathrm{Exp}(3)$ respectively.

Example 17: Covariance and correlation

> Let $X,Y$ be as in Example 13. Find the covariance $\mathrm{Cov}(X,Y)$.

We want to calculate $\mathrm{Cov}(X,Y) = E[XY]-E[X]E[Y]$. We have
$E[XY] = \int_0^1\int_0^x xy\cdot 2(x+y)\,dy\,dx = \int_0^1\frac53x^4\,dx = \frac13.$
Using the marginal probability density functions for $X$ and $Y$ that we found in Example 14, we have
$E[X] = \int_0^1 x f_X(x)\,dx = \int_0^1 3x^3\,dx = \frac34, \qquad E[Y] = \int_0^1 y f_Y(y)\,dy = \int_0^1 y(1+2y-3y^2)\,dy = \frac{5}{12}$
(evaluating these two integrals is left to you). So,
$\mathrm{Cov}(X,Y) = \frac13 - \frac34\cdot\frac{5}{12} = \frac{1}{48}.$

> Find the correlation $\rho(X,Y)$.

We now need to find $\rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$. So, we also need to calculate the variances of $X$ and $Y$. We have
$E[X^2] = \int_0^1 x^2 f_X(x)\,dx = \int_0^1 3x^4\,dx = \frac35, \qquad E[Y^2] = \int_0^1 y^2 f_Y(y)\,dy = \int_0^1 y^2(1+2y-3y^2)\,dy = \frac{7}{30}$
(again, evaluating these two integrals is left to you). From this we obtain
$\mathrm{Var}(X) = E[X^2]-E[X]^2 = \frac35-\left(\frac34\right)^2 = \frac{3}{80}, \qquad \mathrm{Var}(Y) = E[Y^2]-E[Y]^2 = \frac{7}{30}-\left(\frac{5}{12}\right)^2 = \frac{43}{720},$
and we get
$\rho(X,Y) = \frac{1/48}{\sqrt{\frac{3}{80}\cdot\frac{43}{720}}} \approx 0.44.$
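The probabilities and moments found in Examples 13-17 can all be checked by simulation. The following R sketch is an addition, not part of the original notes; it samples from $f(x,y) = 2(x+y)$ on $T$ by rejection from the uniform distribution on the triangle.

    # Examples 13-17: Monte Carlo checks for the triangle density f(x,y) = 2(x+y)
    set.seed(1)
    n <- 4e5
    x <- sqrt(runif(n))                 # uniform points on T: X has density 2x ...
    y <- runif(n, min = 0, max = x)     # ... and Y given X is uniform on (0, X)
    keep <- runif(n) < (x + y) / 2      # rejection step: accept with prob f(x,y)/4
    x <- x[keep]; y <- y[keep]
    mean(x + y > 1)                     # approx 2/3        (Example 13)
    c(mean(x), mean(y))                 # approx 3/4, 5/12  (Example 17)
    c(cov(x, y), cor(x, y))             # approx 1/48, 0.44 (Example 17)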

Example 18: Calculating conditional expectation

> Let $X,Y$ be as in Example 13. Let $y \in (0,1)$. Find $E[X\mid Y=y]$ and $E[X\mid Y]$.

We have already found the conditional p.d.f. of $X$ in Example 15; it is
$f_{X|Y=y}(x) = \frac{2(x+y)}{1+2y-3y^2}$ if $x \in (y,1)$, and $0$ otherwise.
So,
$E[X\mid Y=y] = \int_y^1 x\,\frac{2(x+y)}{1+2y-3y^2}\,dx = \frac{1}{1+2y-3y^2}\left[\frac23x^3+yx^2\right]_{x=y}^{x=1} = \frac{2+3y-5y^3}{3(1+2y-3y^2)}.$
Hence,
$E[X\mid Y] = \frac{2+3Y-5Y^3}{3(1+2Y-3Y^2)}.$

> Show that $E[E[X\mid Y]] = E[X]$.

To find $E[E[X\mid Y]]$, we first note that $E[X\mid Y] = g(Y)$ where
$g(y) = \frac{2+3y-5y^3}{3(1+2y-3y^2)},$
and then use the usual method for finding the expectation of a function of $Y$. That is,
$E[E[X\mid Y]] = E[g(Y)] = \int_0^1 g(y)f_Y(y)\,dy = \int_0^1\frac{2+3y-5y^3}{3}\,dy = \frac13\left(2+\frac32-\frac54\right) = \frac34.$
We have already shown during Example 17 that $E[X] = \frac34$.

Example 19: Proof of $E[E[X\mid Y]] = E[X]$

It is no coincidence that $E[E[X\mid Y]] = E[X]$ in Example 18. In fact, this holds true for all pairs of random variables $X$ and $Y$. Here is a general proof. We have $E[X\mid Y] = g(Y)$, where
$g(y) = E[X\mid Y=y] = \int_{-\infty}^{\infty} x f_{X|Y=y}(x)\,dx.$
So,
$E[E[X\mid Y]] = E[g(Y)] = \int g(y)f_Y(y)\,dy = \iint x f_{X|Y=y}(x)f_Y(y)\,dx\,dy = \int x\int f_{X,Y}(x,y)\,dy\,dx = \int x f_X(x)\,dx = E[X],$
using first the definition of the conditional p.d.f. and then the definition of the marginal p.d.f.

Example 20: Calculation of expectation and variance by conditioning

Let $X \sim \mathrm{Ga}(2,2)$ and, conditional on $X = x$, let $Y \sim \mathrm{Po}(x)$. Then, using standard results about the mean and variance of Gamma/Poisson random variables, $E[X] = 1$, $\mathrm{Var}(X) = \frac12$, $E[Y\mid X] = X$ and $\mathrm{Var}(Y\mid X) = X$. So, using the conditioning formulae from Chapter 4,
$E[Y] = E[E[Y\mid X]] = E[X] = 1,$
$\mathrm{Var}(Y) = E[\mathrm{Var}(Y\mid X)] + \mathrm{Var}(E[Y\mid X]) = E[X] + \mathrm{Var}(X) = \frac32.$

Chapter 5

Example 21: Transforming bivariate random variables

> Let $X \sim \mathrm{Ga}(3,1)$ and $Y \sim \mathrm{Be}(2,2)$, and let $X$ and $Y$ be independent. Find the joint p.d.f. of the vector $(U,V)$, where $U = X+Y$ and $V = X-Y$.

The p.d.f.s of $X$ and $Y$ are
$f_X(x) = \tfrac12 x^2 e^{-x}$ if $x>0$ (and $0$ otherwise), $\qquad f_Y(y) = 6y(1-y)$ if $y \in (0,1)$ (and $0$ otherwise).
By independence, their joint p.d.f. is
$f_{X,Y}(x,y) = 3x^2y(1-y)e^{-x}$ if $x>0$ and $y\in(0,1)$, and $0$ otherwise.
The transformation we want is $u = x+y$ and $v = x-y$. So $u+v = 2x$, $u-v = 2y$, and the inverse transformation is $x = \frac{u+v}{2}$, $y = \frac{u-v}{2}$. Hence, the Jacobian is
$J = \det\begin{pmatrix}\partial x/\partial u & \partial x/\partial v\\ \partial y/\partial u & \partial y/\partial v\end{pmatrix} = \det\begin{pmatrix}1/2 & 1/2\\ 1/2 & -1/2\end{pmatrix} = -\tfrac12.$
Now, we need to transform the region $T = \{(x,y): x>0,\ y\in(0,1)\}$ into the $(u,v)$ plane. This region is bounded by the three lines $x=0$, $y=0$ and $y=1$, which map respectively to the lines $v=-u$, $v=u$ and $v=u-2$.

Our transformed region must also be bounded by these three lines; to check which section of the sketch it is, we simply find out where some $(x,y) \in T$ maps to, and we find that the shaded region is the image of $T$. Therefore,
$f_{U,V}(u,v) = f_{X,Y}\left(\tfrac{u+v}{2},\tfrac{u-v}{2}\right)|J| = \tfrac{3}{32}(u+v)^2(u-v)(2-u+v)\,e^{-(u+v)/2}$ if $u+v>0$ and $u-2 < v < u$, and $0$ otherwise.

Example 22: The Box-Muller transform, simulation of normal random variables

Let $S \sim \mathrm{Exp}(\tfrac12)$ and $\Theta \sim U[0,2\pi)$, and let $S$ and $\Theta$ be independent. Then $S$ and $\Theta$ have joint p.d.f. given by
$f_{S,\Theta}(s,\theta) = \tfrac{1}{4\pi}e^{-s/2}$ if $s \ge 0$ and $\theta \in [0,2\pi)$, and $0$ otherwise.
We can think of $S$ and $\Theta$ as giving the location of a point $(\sqrt{S},\Theta)$ in polar co-ordinates. We transform this point into Cartesian co-ordinates, meaning that we want to use the transformation $X = \sqrt{S}\cos\Theta$ and $Y = \sqrt{S}\sin\Theta$. Therefore, our transformation is $x = \sqrt{s}\cos\theta$, $y = \sqrt{s}\sin\theta$. This transformation maps the set of $(s,\theta)$ for which $f_{S,\Theta}(s,\theta) > 0$ onto all of $\mathbb{R}^2$; it is just polar co-ordinates $(r,\theta)$ with $r = \sqrt{s}$. To find the inverse transformation, note that $s = x^2+y^2$ and $y/x = \tan\theta$, so $\theta = \arctan(y/x)$. So the Jacobian is
$J = \det\begin{pmatrix}\partial s/\partial x & \partial s/\partial y\\ \partial\theta/\partial x & \partial\theta/\partial y\end{pmatrix} = \det\begin{pmatrix}2x & 2y\\ \dfrac{-y/x^2}{1+(y/x)^2} & \dfrac{1/x}{1+(y/x)^2}\end{pmatrix} = \frac{2+2(y/x)^2}{1+(y/x)^2} = 2.$
Hence,
$f_{X,Y}(x,y) = \frac{1}{4\pi}e^{-(x^2+y^2)/2}\cdot 2 = \frac{1}{2\pi}e^{-(x^2+y^2)/2}$ for all $(x,y) \in \mathbb{R}^2$.
Now, we can factorise this as
$f_{X,Y}(x,y) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\cdot\frac{1}{\sqrt{2\pi}}e^{-y^2/2},$
which implies that $X$ and $Y$ are independent standard normal random variables. Assuming we can simulate uniform random variables, then using the transformation in Q3.3 we can also simulate exponential random variables. Then, using the above transformation, we can simulate standard normals.
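Here is a short R implementation of this simulation scheme. It is an illustrative sketch added to these notes; the function name box_muller and the sample size are arbitrary.

    # Example 22: simulate standard normals from uniforms via Box-Muller
    box_muller <- function(n) {
      s     <- -2 * log(runif(n))   # S ~ Exp(1/2), simulated as in Q3.3
      theta <- 2 * pi * runif(n)    # Theta ~ U[0, 2*pi), independent of S
      cbind(x = sqrt(s) * cos(theta), y = sqrt(s) * sin(theta))
    }
    set.seed(1)
    z <- box_muller(1e5)
    colMeans(z); apply(z, 2, sd)    # both columns are approximately N(0,1)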

Example 23: Finding the distribution of a sum of Gamma random variables

> Suppose that two independent random variables $X$ and $Y$ follow the distributions $X \sim \mathrm{Ga}(4,2)$ and $Y \sim \mathrm{Ga}(2,2)$. Find the distribution of $Z = X+Y$.

Let $W = X$. So the transformation we want to apply is $z = x+y$, $w = x$. The inverse transformation is $x = w$ and $y = z-w$, so the Jacobian is
$J = \det\begin{pmatrix}\partial x/\partial z & \partial x/\partial w\\ \partial y/\partial z & \partial y/\partial w\end{pmatrix} = \det\begin{pmatrix}0 & 1\\ 1 & -1\end{pmatrix} = -1.$
By independence of $X$ and $Y$, their joint p.d.f. is
$f_{X,Y}(x,y) = \frac{2^4}{\Gamma(4)}x^3e^{-2x}\cdot\frac{2^2}{\Gamma(2)}ye^{-2y} = \frac{2^6}{6}x^3ye^{-2(x+y)}$ if $x,y>0$, and $0$ otherwise.
The region of $(x,y)$ on which $f_{X,Y}(x,y)$ is non-zero is $x>0$ and $y>0$. This is bounded by the lines $x=0$ and $y=0$, which are respectively mapped to $w=0$ and $z=w$; checking where a point of the region maps to shows that the shaded area between these lines is the region on which $f_{Z,W}(z,w)$ is non-zero. Hence, the joint p.d.f. of $Z$ and $W$ is
$f_{Z,W}(z,w) = \frac{2^6}{6}w^3(z-w)e^{-2z}$ if $z>0$ and $w \in (0,z)$, and $0$ otherwise.
Lastly, to obtain the marginal p.d.f. of $Z$, we integrate out $w$. For $z>0$,
$f_Z(z) = \frac{2^6}{6}e^{-2z}\int_0^z(w^3z-w^4)\,dw = \frac{2^6}{6}e^{-2z}\left(\frac{z^5}{4}-\frac{z^5}{5}\right) = \frac{2^6}{120}z^5e^{-2z} = \frac{2^6}{\Gamma(6)}z^5e^{-2z}.$
For $z \le 0$ we have $f_Z(z) = 0$. So, we can recognise $f_Z(z)$ as the p.d.f. of a $\mathrm{Ga}(6,2)$ random variable, and conclude that $Z \sim \mathrm{Ga}(6,2)$. More generally, this method can be used to show that if $X \sim \mathrm{Ga}(\alpha_1,\beta)$, $Y \sim \mathrm{Ga}(\alpha_2,\beta)$ and $X$ and $Y$ are independent, then $X+Y \sim \mathrm{Ga}(\alpha_1+\alpha_2,\beta)$, for any $\alpha_1,\alpha_2,\beta > 0$. See Q5.8.
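As with the earlier examples, this conclusion can be checked by simulation; the R snippet below is an added illustration rather than part of the notes.

    # Example 23: a Ga(4,2) plus an independent Ga(2,2) should be Ga(6,2)
    set.seed(1)
    z <- rgamma(1e5, shape = 4, rate = 2) + rgamma(1e5, shape = 2, rate = 2)
    ks.test(z, pgamma, shape = 6, rate = 2)   # no evidence against Ga(6,2)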

Chapter 6

Example 24: Mean vectors and covariance matrices

Recall the random variables $X,Y$ from Example 13. In Example 17 we calculated that $E[X] = \frac34$ and $E[Y] = \frac{5}{12}$. So the mean vector of $\mathbf{X} = (X,Y)^T$ is
$E[\mathbf{X}] = \begin{pmatrix}3/4\\ 5/12\end{pmatrix}.$
In Example 17 we also calculated that $\mathrm{Cov}(X,Y) = \frac{1}{48}$, $\mathrm{Var}(X) = \frac{3}{80}$ and $\mathrm{Var}(Y) = \frac{43}{720}$. Therefore, the covariance matrix of $\mathbf{X}$ is
$\mathrm{Cov}(\mathbf{X}) = \begin{pmatrix}3/80 & 1/48\\ 1/48 & 43/720\end{pmatrix}.$

Example 25: Affine transformation of a random vector

> Suppose that the random vector $\mathbf{X} = (X_1,X_2,X_3)^T$ has a given mean vector $E[\mathbf{X}]$ and covariance matrix $\mathrm{Cov}(\mathbf{X})$, and that two new random variables $U$ and $V$ are defined from $X_1,X_2,X_3$ by an affine transformation $\mathbf{U} = A\mathbf{X}+\mathbf{b}$, where $\mathbf{U} = (U,V)^T$. Find the mean vector and covariance matrix of $\mathbf{U}$.

We can express the relationship between $\mathbf{X}$ and $\mathbf{U}$ as an affine transformation, so we can use Lemma 6.3 to find the mean vector and covariance matrix of $\mathbf{U}$. Firstly,
$E[\mathbf{U}] = AE[\mathbf{X}]+\mathbf{b},$

and secondly,
$\mathrm{Cov}(\mathbf{U}) = A\,\mathrm{Cov}(\mathbf{X})A^T.$
Evaluating these two expressions with the given numbers produces the mean vector and the $2\times2$ covariance matrix of $\mathbf{U}$.

> Find the correlation coefficient $\rho(U,V)$.

We can read off $\mathrm{Var}(U)$, $\mathrm{Var}(V)$ and $\mathrm{Cov}(U,V)$ from the covariance matrix of $\mathbf{U}$. So the correlation coefficient of $U$ and $V$ is
$\rho(U,V) = \frac{\mathrm{Cov}(U,V)}{\sqrt{\mathrm{Var}(U)\,\mathrm{Var}(V)}}.$

Example 26: Variance of a sum

> Suppose that two random variables $X$ and $Y$ have variances $\sigma_X^2$ and $\sigma_Y^2$, and covariance $\mathrm{Cov}(X,Y)$. Find the variance of $X+Y$.

If we write $U = X+Y$, then
$U = \begin{pmatrix}1 & 1\end{pmatrix}\begin{pmatrix}X\\ Y\end{pmatrix},$
where $(U)$ denotes the $1\times1$ matrix with the single entry $U$; we usually won't bother to write brackets around $1\times1$ matrices/vectors. We can apply Lemma 6.3 to this case, with $A = (1\ \ 1)$ and $\mathbf{X} = (X,Y)^T$, to obtain that $\mathrm{Cov}(U) = A\,\mathrm{Cov}(\mathbf{X})A^T$. The covariance matrix of $\mathbf{X}$ is given by
$\mathrm{Cov}(\mathbf{X}) = \begin{pmatrix}\sigma_X^2 & \mathrm{Cov}(X,Y)\\ \mathrm{Cov}(X,Y) & \sigma_Y^2\end{pmatrix}.$
Since $U$ is $1\times1$, $\mathrm{Cov}(U) = \mathrm{Var}(U)$, so we have
$\mathrm{Var}(X+Y) = \begin{pmatrix}1 & 1\end{pmatrix}\begin{pmatrix}\sigma_X^2 & \mathrm{Cov}(X,Y)\\ \mathrm{Cov}(X,Y) & \sigma_Y^2\end{pmatrix}\begin{pmatrix}1\\ 1\end{pmatrix} = \sigma_X^2 + 2\,\mathrm{Cov}(X,Y) + \sigma_Y^2,$
which you should recognize.
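Computations of the form $AE[\mathbf{X}]+\mathbf{b}$ and $A\,\mathrm{Cov}(\mathbf{X})A^T$, as in Examples 25 and 26, are conveniently done in R. The snippet below is an added sketch; the mean vector, covariance matrix and transformation shown are made-up illustrative values, not those used in the lectures.

    # Affine transformation of a random vector: E[AX + b] and Cov(AX + b)
    mu    <- c(1, 2, 3)                          # illustrative values only
    Sigma <- matrix(c(4, 1, 0,
                      1, 9, 2,
                      0, 2, 4), nrow = 3, byrow = TRUE)
    A <- matrix(c(1, -1, 0,
                  1,  1, 1), nrow = 2, byrow = TRUE)
    b <- c(0, 2)
    A %*% mu + b            # mean vector of U = AX + b
    A %*% Sigma %*% t(A)    # covariance matrix of U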

Example 7: The bivariate normal with independent components > Find the p.d.f. of the bivariate normal X X, X T in the case where CovX, X. From Definition 6.4, the general bivariate normal distribution X, with mean vector µ and covariance matrix Σ has joint probability density function f X,X x, x π σ σ exp σ x µ σ x µ x µ + σ x µ σ σ σ σ If we assume CovX, X σ σ, then the p.d.f. simplifies to f X,X x, x exp x µ πσ σ σ x µ σ exp x µ exp x µ πσ πσ σ f X x f X x. 4 Here, in the final line we see factorize f X,X x, x, into the product of the p.d.f. of the Nµ, σ random variable X and the p.d.f. of the Nµ, σ random variables X. Therefore, in this case X and X are independent. Note that, setting µ µ and σ σ, we recover 6.. We have shown above that if CovX, X then X and X are independent. If X and X are independent then it is automatic that CovX, X. Hence: X and X are independent if and only if CovX, X. We will record this fact as Lemma 6.8. Example 8: Plotting the p.d.f. of the bivariate normal. The pdf of a bivariate normal is a bell curve : σ This example is the standard bivariate normal Nµ, Σ where µ, and Σ. It was generated in Mathematica with the code all one line 9

Plot3D[1/(2 Pi) E^(-(x^2 + y^2)/2), {x, -4, 4}, {y, -4, 4}, PlotRange -> All, ColorFunction -> (ColorData["Rainbow"][#3] &)]

Changing $\mu$ alters the position of the center of the bell, without changing the shape of the surface. Changing $\Sigma$ alters the shape of the bell; for example, increasing the variance of one co-ordinate stretches the bell in that direction. Changing both $\mu$ and $\Sigma$ together results in a bell curve that is both translated and reshaped. (The corresponding plots are omitted.)

Example 29: Marginal distributions of the bivariate normal, and their covariance.

> Let $\mathbf{X} = (X_1,X_2)^T$ have distribution $N(\mu,\Sigma)$, for a given mean vector $\mu = (\mu_1,\mu_2)^T$ and covariance matrix $\Sigma$. Write down the marginal distributions of $X_1$ and $X_2$.

From Lemma 6.7 we know that $X_1$ and $X_2$ are both univariate normals. We can read their means and variances off from the mean vector $\mu$ and covariance matrix $\Sigma$: we have $X_1 \sim N(\mu_1,\sigma_1^2)$ and $X_2 \sim N(\mu_2,\sigma_2^2)$, where $\sigma_1^2$ and $\sigma_2^2$ are the diagonal entries of $\Sigma$.

> Find $\mathrm{Cov}(X_1,X_2)$ and $\rho(X_1,X_2)$. Are $X_1$ and $X_2$ independent?

From the covariance matrix, $\mathrm{Cov}(X_1,X_2) = \sigma_{12}$, the off-diagonal entry of $\Sigma$. Hence,
$\rho(X_1,X_2) = \frac{\mathrm{Cov}(X_1,X_2)}{\sqrt{\mathrm{Var}(X_1)\,\mathrm{Var}(X_2)}} = \frac{\sigma_{12}}{\sigma_1\sigma_2}.$
Here the off-diagonal entry of $\Sigma$ is non-zero, so clearly $\mathrm{Cov}(X_1,X_2) \ne 0$ and $X_1$ and $X_2$ are not independent.

Example 30: Conditional distributions for bivariate normal

> Let $a \in \mathbb{R}$ and let $\mathbf{X} \sim N(\mu,\Sigma)$, for the given $\mu$ and $\Sigma$. Find the conditional distribution of $X_2$ given $X_1 = a$.

By Lemma 6.9, the conditional distribution of $X_2$ given $X_1 = a$ is a univariate normal with mean given by
$\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1)$
and variance $(1-\rho^2)\sigma_2^2$. Substituting in the values of $\mu_1,\mu_2,\rho,\sigma_1,\sigma_2$ read off from $\mu$ and $\Sigma$, together with $x_1 = a$, gives a univariate normal whose mean is a linear function of $a$ and whose variance does not depend on $a$.

Example 31: Transformations of bivariate normal

> Let $\mathbf{X} \sim N(\mu,\Sigma)$, where $\mu$ and $\Sigma$ are as in Example 30, and let $Y_1$ and $Y_2$ be defined from $X_1,X_2$ by an affine transformation $\mathbf{Y} = A\mathbf{X}+\mathbf{b}$, with $A$ non-singular. Find the distribution of $\mathbf{Y} = (Y_1,Y_2)^T$.

The matrix $A$ is a non-singular $2\times2$ matrix, so by the transformation lemma for the bivariate normal in Chapter 6, $\mathbf{Y}$ is a bivariate normal. Therefore, if we can find the mean vector and covariance matrix of $\mathbf{Y}$, we know the distribution of $\mathbf{Y}$:
$E[\mathbf{Y}] = AE[\mathbf{X}]+\mathbf{b}, \qquad \mathrm{Cov}(\mathbf{Y}) = A\,\mathrm{Cov}(\mathbf{X})A^T,$
so the distribution of $\mathbf{Y}$ is $N\left(AE[\mathbf{X}]+\mathbf{b},\ A\,\mathrm{Cov}(\mathbf{X})A^T\right)$.

Example 32: Affine transformation of a three dimensional normal distribution.

> Suppose $\mathbf{X} = (X_1,X_2,X_3)^T \sim N(\mu,\Sigma)$, for a given three dimensional mean vector $\mu$ and $3\times3$ covariance matrix $\Sigma$. Find the joint distribution of $\mathbf{Y} = (Y_1,Y_2)^T$, where $Y_1 = X_1 - X_2$ and $Y_2 = X_1 + X_2 + X_3$.

We can write
$\mathbf{Y} = A\mathbf{X} = \begin{pmatrix}1 & -1 & 0\\ 1 & 1 & 1\end{pmatrix}\begin{pmatrix}X_1\\ X_2\\ X_3\end{pmatrix},$
so $E[\mathbf{Y}] = AE[\mathbf{X}]$ and $\mathrm{Cov}(\mathbf{Y}) = A\,\mathrm{Cov}(\mathbf{X})A^T$. It is not hard to see that $A$ is an onto transformation, so $\mathbf{Y}$ has a bivariate normal distribution (here we use the multivariate equivalent of the transformation lemma of Chapter 6). Hence $\mathbf{Y} \sim N\left(AE[\mathbf{X}],\ A\,\mathrm{Cov}(\mathbf{X})A^T\right)$, where the two matrix products are evaluated from the given $\mu$ and $\Sigma$.

> Find $\rho(X_1,X_3)$. Are $X_1$ and $X_3$ independent?

From the covariance matrix of $\mathbf{X}$, we can read off
$\rho(X_1,X_3) = \frac{\mathrm{Cov}(X_1,X_3)}{\sqrt{\mathrm{Var}(X_1)\,\mathrm{Var}(X_3)}} = 0,$
since here $\mathrm{Cov}(X_1,X_3) = 0$. Since $X_1$ and $X_3$ are components of a multivariate normal distribution, and $\mathrm{Cov}(X_1,X_3) = 0$, by the three dimensional equivalent of Lemma 6.8, $X_1$ and $X_3$ are independent.

Chapter 7

Example 33: Maximising a function

> Find the value of $\theta$ which maximises $f(\theta) = \theta^5(1-\theta)$ on the range $\theta \in [0,1]$.

First, we look for turning points. We have
$f'(\theta) = 5\theta^4(1-\theta) - \theta^5 = \theta^4(5-6\theta).$
So the turning points are at $\theta = 0$ and $\theta = \frac56$. To see which ones are local maxima, we calculate the second derivative:
$f''(\theta) = 4\theta^3(5-6\theta) + \theta^4(-6) = \theta^3(20-30\theta).$
So, $f''(\frac56) = (\frac56)^3(20-25) = -5(\frac56)^3 < 0$, and $\theta = \frac56$ is a local maximum. Unfortunately, $f''(0) = 0$, so we don't know if $\theta = 0$ is a local maximum, minimum or inflection. However, we can check that $f(0) = 0$, so it doesn't matter which: we still have $f(0) < f(\frac56)$. Hence, $\theta = \frac56$ is the global maximiser.

Example 34: Likelihood functions and maximum likelihood estimators

> Let $X$ be a random variable with $\mathrm{Exp}(\theta)$ distribution, where the parameter $\theta$ is unknown. Find, and sketch, the likelihood function of $X$, given the data $x = 3$.

The likelihood function is
$L(\theta;3) = f_X(3;\theta) = \theta e^{-3\theta},$
defined for all $\theta \in \Theta = (0,\infty)$. We can plot this in R, for $\theta \in (0,5)$, with the command

curve(x*exp(-3*x), from=0, to=5, xlab=~theta, ylab="L("~theta~"; 3)")

Note that we use x as the $\theta$ variable here because R hard-codes its use of x as a graph variable. The result is: (Plot omitted.)
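In the same spirit (this snippet is an addition to the notes, not part of them), R can also locate the maximiser of this likelihood numerically:

    # Example 34: numerical maximisation of L(theta; 3) = theta * exp(-3*theta)
    L <- function(theta) theta * exp(-3 * theta)
    optimise(L, interval = c(0, 10), maximum = TRUE)   # maximum near theta = 1/3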

Note that we use x as the θ variable here because R hard-codes its use of x as a graph variable. The result is > Given this data, find the likelihood of θ,,,, 5. Amongst these values of θ, which has the highest likelihood? The likelihoods are L ; 3 e 3.7 L ; 3 e 3. L; 3 e 3.5 L; 3 e 6.5 L5; 3 5e 5.5 6 So, restricted to looking at these values, θ has the highest likelihood. > Find the maximum likelihood estimator of θ,, based on the single data point x 3. We need to find the value of θ Θ which maximises Lθ; 3. We differentiate, to look for turning points, obtaining dl dθ e 3θ 3θe 3θ e 3θ 3θ. 4

Hence, there is only one turning point, at $\theta = \frac13$. We differentiate again, obtaining
$\frac{d^2L}{d\theta^2} = -3e^{-3\theta}(1-3\theta) + e^{-3\theta}(-3) = e^{-3\theta}(-6+9\theta).$
At $\theta = \frac13$, we have $\frac{d^2L}{d\theta^2} = e^{-1}(-6+3) < 0$, so the turning point at $\theta = \frac13$ is a local maximum. Since it is the only turning point, it is also the global maximum. Hence, the maximum likelihood estimator of $\theta$ is $\hat\theta = \frac13$.

Example 35: Models, parameters and data (aerosols).

> The particle size distribution of an aerosol is the distribution of the diameter of aerosol particles within a typical region of air. The term is also used for particles within a powder, or suspended in a fluid. In many situations, the particle size distribution is modelled using the log-normal distribution. It is typically reasonable to assume that the diameters of particles are independent. Assuming this model, find the joint probability density function of the diameters observed in a sample of $n$ particles, and state the parameters of the model.

Recall that the p.d.f. of the log-normal distribution is
$f_Y(y) = \frac{1}{y\sigma\sqrt{2\pi}}\exp\left(-\frac{(\log y-\mu)^2}{2\sigma^2}\right)$ if $y \in (0,\infty)$, and $0$ otherwise.
The parameters of this distribution, and hence also the parameters of our model, are $\mu \in \mathbb{R}$ and $\sigma^2 \in (0,\infty)$. Since the diameters of particles are assumed to be independent, the joint probability density function of $\mathbf{Y} = (Y_1,Y_2,\ldots,Y_n)$, where $Y_i$ is the diameter of the $i$th particle, is
$f_{\mathbf{Y}}(y_1,\ldots,y_n) = \prod_{i=1}^n f_{Y_i}(y_i) = \frac{1}{(2\pi\sigma^2)^{n/2}\,y_1y_2\cdots y_n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(\log y_i-\mu)^2\right)$ if $y_i > 0$ for all $i$, and $0$ otherwise.
Note that, if one or more of the $y_i$ is less than or equal to zero then $f_{Y_i}(y_i) = 0$, which means that also $f_{\mathbf{Y}}(y_1,\ldots,y_n) = 0$.

Example 36: Maximum likelihood estimation with i.i.d. data.

> Let $X \sim \mathrm{Bern}(\theta)$, where $\theta$ is an unknown parameter. Suppose that we have 3 independent samples of $X$, which are $x = \{1,1,0\}$. Find the likelihood function of $\theta$, given this data.

The probability function of a single $\mathrm{Bern}(\theta)$ random variable is
$f_X(x;\theta) = \theta$ if $x = 1$, $\ 1-\theta$ if $x = 0$, and $0$ otherwise.
Since our three samples are independent, we model $x$ as a sample from the joint distribution $\mathbf{X} = (X_1,X_2,X_3)$, where
$f_{\mathbf{X}}(x;\theta) = \prod_{i=1}^3 f_{X_i}(x_i;\theta)$
and $f_{X_i}$ is the probability function of a single $\mathrm{Bern}(\theta)$ random variable. Since $f_{X_i}$ has several cases, it would be unhelpful to try and expand out this formula before we put in values for the $x_i$. Our likelihood function is therefore
$L(\theta;x) = f_{X_1}(1;\theta)f_{X_2}(1;\theta)f_{X_3}(0;\theta) = \theta\cdot\theta\cdot(1-\theta) = \theta^2 - \theta^3.$
The range of values that the parameter $\theta$ can take is $\Theta = [0,1]$.

> Find the maximum likelihood estimator of $\theta$, given the data $x$.

We seek to maximize $L(\theta;x)$ for $\theta \in [0,1]$. Differentiating once,
$\frac{dL}{d\theta} = 2\theta - 3\theta^2 = \theta(2-3\theta),$
so the turning points are at $\theta = 0$ and $\theta = \frac23$. Differentiating again,
$\frac{d^2L}{d\theta^2} = 2 - 6\theta,$
which gives $\frac{d^2L}{d\theta^2}\big|_{\theta=0} = 2$ and $\frac{d^2L}{d\theta^2}\big|_{\theta=2/3} = -2$. Hence, $\theta = 0$ is a local minimum and $\theta = \frac23$ is a local maximum, so $\theta = \frac23$ maximises $L(\theta;x)$ over $\theta \in [0,1]$. The maximum likelihood estimator of $\theta$ is therefore $\hat\theta = \frac23$. This is, hopefully, reassuring: the number of 1s in our sample of 3 was 2, so using independence $\theta = \frac23$ seems like a good guess. See the Chapter 7 exercises for a much more general case of this example.

Example 37: Maximum likelihood estimation (radioactive decay).

> Atoms of radioactive elements decay as time passes, meaning that any such atom will, at some point in time, suddenly break apart. This process is known as radioactive decay. The time taken for a single atom of, say, carbon-15 to decay is usually modelled as an exponential random variable, with unknown parameter $\lambda \in (0,\infty)$. The parameter $\lambda$ is known as the decay rate. The times at which atoms decay are known to be independent.

Using this model, find the likelihood function for the time to decay of a sample of $n$ carbon-15 atoms.

The decay time $X_i$ of the $i$th atom is exponential with parameter $\lambda \in (0,\infty)$, and therefore has p.d.f.
$f_{X_i}(x_i;\lambda) = \lambda e^{-\lambda x_i}$ if $x_i > 0$, and $0$ otherwise.
Since each atom decays independently, the joint distribution of $\mathbf{X} = (X_i)_{i=1}^n$ is
$f_{\mathbf{X}}(x;\lambda) = \prod_{i=1}^n f_{X_i}(x_i;\lambda) = \lambda^n\exp\left(-\lambda\sum_{i=1}^n x_i\right)$ if $x_i > 0$ for all $i$, and $0$ otherwise.
Therefore, the likelihood function is
$L(\lambda;x) = \lambda^n\exp\left(-\lambda\sum_{i=1}^n x_i\right)$ if $x_i > 0$ for all $i$, and $0$ otherwise.
The range of possible values of the parameter $\lambda$ is $\Theta = (0,\infty)$.

> Suppose that we have sampled the decay times of 15 carbon-15 atoms, in seconds, accurate to two decimal places, and that the observed times $x = (x_1,\ldots,x_{15})$ sum to $\sum_{i=1}^{15}x_i = 47.58$. Find the maximum likelihood estimator of $\lambda$, based on this data.

Given this data, our likelihood function is
$L(\lambda;x) = \lambda^{15}e^{-47.58\lambda}.$
Differentiating, we have
$\frac{dL}{d\lambda} = 15\lambda^{14}e^{-47.58\lambda} - 47.58\lambda^{15}e^{-47.58\lambda} = \lambda^{14}(15-47.58\lambda)e^{-47.58\lambda},$
which is zero only when $\lambda = 0$ or $\lambda = 15/47.58 \approx 0.32$. Since $\lambda = 0$ is outside of the range $\Theta = (0,\infty)$ of possible parameter values, the only turning point of interest is $\lambda = 15/47.58$. Differentiating again, with the details left to you, we end up with
$\frac{d^2L}{d\lambda^2} = \left(210\lambda^{13} - 1427.4\lambda^{14} + 2263.86\lambda^{15}\right)e^{-47.58\lambda} = \lambda^{13}\left(210 - 1427.4\lambda + 2263.86\lambda^2\right)e^{-47.58\lambda}.$

Evaluating at our turning point gives
$\frac{d^2L}{d\lambda^2}\Big|_{\lambda=15/47.58} = \left(\frac{15}{47.58}\right)^{13}(-14.9996)\,e^{-15} < 0.$
So, our turning point is a local maximum. Since there are no other turning points within the allowable range, our turning point is the global maximum. Hence, the maximum likelihood estimator of $\lambda$, given our data $x$, is
$\hat\lambda = \frac{15}{47.58} \approx 0.32.$
In reality, physicists are able to collect vastly more data than $n = 15$, but even with 15 data points we are not far away from the true value of $\lambda$, which is $\lambda \approx 0.283$. Of course, by "true value" here we mean the value that has been discovered experimentally, with the help of statistical inference. So-called carbon dating typically uses carbon-14, which has a much slower decay rate of approximately $1.2\times10^{-4}$ per year. Carbon-14 is present in many living organisms and, crucially, the proportion of carbon in living organisms that is carbon-14 is essentially the same for all living organisms. Once organisms die, the carbon-14 radioactively decays. The key idea behind carbon dating is that, by measuring the concentration of carbon-14 within a fossil, scientists can estimate how long ago that fossil lived. To do so, a highly accurate estimate of the decay rate of carbon-14 is needed.

Example 38: Maximum likelihood estimation via log-likelihood (mutations in DNA).

> When organisms reproduce, the DNA (or RNA) of the offspring is a combination of the DNA of its one or two parents. Additionally, the DNA of the offspring contains a small number of locations in which it differs from its parents. These locations are called mutations. (Actually, the biological details here are rather complicated, and we omit discussion of them.) The number of mutations per unit length of DNA is typically modelled using a Poisson distribution, with an unknown parameter $\theta \in (0,\infty)$. The numbers of mutations found in disjoint sections of DNA are independent. Using this model, find the likelihood function for the number of mutations present in a sample of $n$ disjoint strands of DNA, each of which has unit length.

Let $X_i$ be the number of mutations in the $i$th strand of DNA. So, under our model,
$f_{X_i}(x_i;\theta) = \frac{e^{-\theta}\theta^{x_i}}{x_i!}$ for $x_i \in \{0,1,2,\ldots\}$,
and $f_{X_i}(x_i) = 0$ if $x_i \notin \mathbb{N}\cup\{0\}$. Since we assume the $X_i$ are independent, the joint distribution of $\mathbf{X} = (X_1,X_2,\ldots,X_n)$ has probability function
$f_{\mathbf{X}}(x) = \prod_{i=1}^n\frac{e^{-\theta}\theta^{x_i}}{x_i!} = \frac{1}{x_1!x_2!\cdots x_n!}e^{-n\theta}\theta^{\sum x_i},$

provided all $x_i \in \mathbb{N}\cup\{0\}$, and zero otherwise. Therefore, our likelihood function is
$L(\theta;x) = \frac{1}{x_1!x_2!\cdots x_n!}e^{-n\theta}\theta^{\sum x_i}.$
The range of possible values for $\theta$ is $\Theta = (0,\infty)$.

> Let $x$ be a vector of data, where $x_i$ is the number of mutations observed in a distinct unit length segment of DNA. Suppose that at least one of the $x_i$ is non-zero. Find the corresponding log-likelihood function, and hence find the maximum likelihood estimator of $\theta$.

The log-likelihood function is $l(\theta;x) = \log L(\theta;x)$, so
$l(\theta;x) = \log\left(\frac{1}{x_1!x_2!\cdots x_n!}e^{-n\theta}\theta^{\sum x_i}\right) = -\sum_{i=1}^n\log(x_i!) - n\theta + \log\theta\sum_{i=1}^n x_i.$
We now look to maximise $l(\theta;x)$ over $\theta \in (0,\infty)$. Differentiating, we obtain
$\frac{dl}{d\theta} = -n + \frac1\theta\sum_{i=1}^n x_i.$
Note that this is much simpler than what we'd get if we differentiated $L(\theta;x)$. So, the only turning point of $l(\theta;x)$ is at $\theta = \frac1n\sum_{i=1}^n x_i$. Differentiating again, we have
$\frac{d^2l}{d\theta^2} = -\frac{1}{\theta^2}\sum_{i=1}^n x_i.$
Since our $x_i$ are counting the occurrences of mutations, $x_i \ge 0$, and since at least one is non-zero we have $\frac{d^2l}{d\theta^2} < 0$ for all $\theta$. Hence, our turning point is a maximum and, since it is the only maximum, is also the global maximum. Therefore, the maximum likelihood estimator of $\theta$ is
$\hat\theta = \frac1n\sum_{i=1}^n x_i.$

> Mutation rates were measured for 11 HIV patients, recorded as the number of mutations per $10^4$ possible locations (i.e. per unit length); the observed counts sum to $\sum_{i=1}^{11}x_i = 446$. This data comes from the article Cuevas et al. 2015 (http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002251). Assuming the model suggested above, calculate the maximum likelihood estimator of the mutation rate of HIV.

The data has $n = 11$ and $\sum_i x_i = 446$, so we conclude that the maximum likelihood estimator of the mutation rate $\theta$, given this data, is $\hat\theta = \frac{446}{11} \approx 40.5$.

Example 39: Maximum likelihood estimation via log-likelihood (spectrometry).

> Using a mass spectrometer, it is possible to measure the mass of individual molecules. (This is a simplification; in reality a mass spectrometer measures the mass-to-charge ratio of a molecule, but since the charges of molecules are already known, the mass can be inferred. Masses are measured in so-called atomic mass units.) For example, it is possible to measure the masses of individual amino acid molecules. A sample of 15 amino acid molecules, which are all known to be of the same type (and therefore, the same mass), had their masses reported by the spectrometer. It is known that these molecules are either Alanine, which has mass 71.0, or Leucine, which has mass 113.1. Given a molecule of mass $\theta$, the spectrometer is known to report its mass as $X \sim N(\theta,35^2)$, independently for each molecule. Using this model, and the reported masses, find the likelihoods of Alanine and Leucine. Specify which of these has the greatest likelihood.

Our model, for the reported mass $X_i$ of a single molecule with real mass $\theta$, is $X_i \sim N(\theta,35^2)$, and the p.d.f. of a single data point is
$f_{X_i}(x_i) = \frac{1}{\sqrt{2\pi\cdot35^2}}\exp\left(-\frac{(x_i-\theta)^2}{2\cdot35^2}\right).$
Therefore, the p.d.f. of the reported masses $\mathbf{X} = (X_1,\ldots,X_n)$ of $n$ molecules is
$f_{\mathbf{X}}(x) = \prod_{i=1}^n f_{X_i}(x_i) = \frac{1}{(2\pi)^{n/2}35^n}\exp\left(-\frac{1}{2450}\sum_{i=1}^n(x_i-\theta)^2\right).$
We know that, in reality, $\theta$ must be one of only two different values: 71.0 for Alanine and 113.1 for Leucine. Therefore, our likelihood function is
$L(\theta;x) = \frac{1}{(2\pi)^{n/2}35^n}\exp\left(-\frac{1}{2450}\sum_{i=1}^n(x_i-\theta)^2\right),$
and the possible range of values for $\theta$ is the two point set $\Theta = \{71.0, 113.1\}$. We need to find out which of these two values maximises the likelihood. Our data $x$ contains $n = 15$ data points. A short calculation (use e.g. R) evaluates $\frac{1}{2450}\sum_i(x_i-71.0)^2$ and $\frac{1}{2450}\sum_i(x_i-113.1)^2$, and from these we find that $L(71.0;x)$ is of order $10^{-34}$, whereas $L(113.1;x) \approx 9.9\times10^{-38}$. We conclude that $\theta = 71.0$ has much greater likelihood than $\theta = 113.1$, so we expect that the molecules sampled are Alanine.

Note that, if we were to differentiate as we did in other examples, we would find the maximiser $\theta$ of $L(\theta;x)$ across the whole range $\theta \in (0,\infty)$, which turns out to be the sample mean of the reported masses. This is not what we want here! The design of our experiment has meant that the range of possible values for $\theta$ is restricted to the two point set $\Theta = \{71.0, 113.1\}$. See Q7.5 for the unrestricted case.
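A computation of this kind can be scripted in R. The following sketch is an added illustration; reported_masses stands for whatever vector of reported masses was observed, and is not given here.

    # Example 39: compare the log-likelihoods of the two candidate masses
    loglik <- function(theta, x, s = 35) sum(dnorm(x, mean = theta, sd = s, log = TRUE))
    # loglik(71.0, reported_masses) - loglik(113.1, reported_masses) > 0 favours Alanine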

Example 40: Two parameter maximum likelihood estimation (rainfall).

> Find the maximum likelihood estimator of the parameter vector $\theta = (\mu,\sigma^2)$ when the data $x = (x_1,x_2,\ldots,x_n)$ are modelled as i.i.d. samples from a normal distribution $N(\mu,\sigma^2)$.

Our parameter vector is $\theta = (\mu,\sigma^2)$, so let us write $v = \sigma^2$ to avoid confusion. As a result, we are interested in the parameters $\theta = (\mu,v)$, and the range of possible values of $\theta$ is $\Theta = \mathbb{R}\times(0,\infty)$. The p.d.f. of the univariate normal distribution $N(\mu,v)$ is
$f_X(x) = \frac{1}{\sqrt{2\pi v}}e^{-(x-\mu)^2/2v}.$
Writing $\mathbf{X} = (X_1,\ldots,X_n)$, where the $X_i$ are i.i.d. univariate $N(\mu,v)$ random variables, the likelihood function of $\mathbf{X}$ is
$L(\theta;x) = f_{\mathbf{X}}(x) = \frac{1}{(2\pi v)^{n/2}}\exp\left(-\frac{1}{2v}\sum_{i=1}^n(x_i-\mu)^2\right).$
Therefore, the log likelihood is
$l(\theta;x) = -\frac{n}{2}\left(\log(2\pi)+\log v\right) - \frac{1}{2v}\sum_{i=1}^n(x_i-\mu)^2.$
We now look to maximise $l(\theta;x)$ over $\theta \in \Theta$. The partial derivatives are
$\frac{\partial l}{\partial\mu} = \frac1v\sum_{i=1}^n(x_i-\mu) = \frac1v\left(\sum_{i=1}^n x_i - n\mu\right), \qquad \frac{\partial l}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum_{i=1}^n(x_i-\mu)^2.$
Solving $\frac{\partial l}{\partial\mu} = 0$ gives $\mu = \frac1n\sum_{i=1}^n x_i = \bar{x}$. Solving $\frac{\partial l}{\partial v} = 0$ gives $v = \frac1n\sum_{i=1}^n(x_i-\mu)^2$. So both partial derivatives will be zero if and only if
$\mu = \bar{x}, \qquad v = \frac1n\sum_{i=1}^n(x_i-\bar{x})^2. \quad (5)$
This gives us the value of $\theta = (\mu,v)$ at the single turning point of $l$.

Next, we use the Hessian matrix to check if this point is a local maximum. We have
$\frac{\partial^2l}{\partial\mu^2} = -\frac{n}{v}, \qquad \frac{\partial^2l}{\partial\mu\,\partial v} = -\frac{1}{v^2}\left(\sum_{i=1}^n x_i - n\mu\right), \qquad \frac{\partial^2l}{\partial v^2} = \frac{n}{2v^2} - \frac{1}{v^3}\sum_{i=1}^n(x_i-\mu)^2.$
Evaluating these at our turning point, we get
$\frac{\partial^2l}{\partial\mu^2} = -\frac{n}{\hat v}, \qquad \frac{\partial^2l}{\partial\mu\,\partial v} = 0, \qquad \frac{\partial^2l}{\partial v^2} = \frac{n}{2\hat v^2} - \frac{n\hat v}{\hat v^3} = -\frac{n}{2\hat v^2},$
so
$H = \begin{pmatrix}-n/\hat v & 0\\ 0 & -n/2\hat v^2\end{pmatrix}.$
Since $-\frac{n}{\hat v} < 0$ and $\det H = \frac{n^2}{2\hat v^3} > 0$, our turning point (5) is a local maximum. Since it is the only turning point, it is also the global maximum. Hence, the MLE is
$\hat\mu = \bar{x}, \qquad \hat\sigma^2 = \hat v = \frac1n\sum_{i=1}^n(x_i-\bar{x})^2.$
Note $\hat\mu$ is the sample mean, and $\hat\sigma^2$ is the biased sample variance.

> For the years 1985-2015, the amount of rainfall (in millimetres) recorded as falling on Sheffield in December is available from the historical climate data stored by the Met Office (http://www.metoffice.gov.uk/public/weather/climate-historic/). Meteorologists often model the long run distribution of rainfall by a normal distribution (although in some cases the Gamma distribution is used). Assuming that we choose to model the amount of rainfall in Sheffield each December by a normal distribution, find the maximum likelihood estimators for $\mu$ and $\sigma^2$.

The data has $n = 31$, sample mean $\bar{x} = \frac{1}{31}\sum_{i=1}^{31}x_i = 93.9$, and biased sample variance $\hat v = \frac{1}{31}\sum_{i=1}^{31}(x_i-\bar{x})^2$.
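In practice these summaries would be computed in R directly from the data vector. The sketch below is an added illustration; december_rainfall stands for the vector of 31 December rainfall totals, which is not reproduced here.

    # Example 40: MLEs for a normal model; note the biased variance (divide by n)
    normal_mle <- function(x) c(mu = mean(x), sigma2 = mean((x - mean(x))^2))
    # normal_mle(december_rainfall)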

So we conclude that, according to our model, the maximum likelihood estimators are $\hat\mu = 93.9$ and $\hat\sigma^2 = \hat v$, the biased sample variance of the data; that is, our model says that Sheffield receives a $N(93.9,\hat\sigma^2)$ quantity of rainfall, in millimetres, each December.

Example 41: Maximum likelihood estimation for the uniform distribution

> Find the maximum likelihood estimator of the parameter $\theta$ when the data $x = (x_1,x_2,\ldots,x_n)$ are i.i.d. samples from a uniform distribution $U[0,\theta]$, with unknown parameter $\theta > 0$.

Here the p.d.f. of $X_i$ is $f(x) = \frac1\theta$ for $0 \le x \le \theta$, and zero otherwise. So the likelihood, for $\theta \in \Theta = \mathbb{R}^+$, is
$L(\theta;x) = \prod_{i=1}^n\frac1\theta = \theta^{-n}$ if $\theta \ge x_i$ for all $i$, and $L(\theta;x) = 0$ if $\theta < x_i$ for some $i$;
that is, $L(\theta;x) = \theta^{-n}$ if $\theta \ge \max_i x_i$, and $0$ if $\theta < \max_i x_i$. Differentiating the likelihood, we see that $L(\theta;x)$ is decreasing but positive for $\theta > \max_i x_i$. For $\theta < \max_i x_i$ we know $L(\theta;x) = 0$, so by looking at the graph, we can see that the maximum occurs at
$\hat\theta = \max_{i=1,\ldots,n} x_i.$
This is the MLE.

Example 42: Interval estimation based on likelihood

> Suppose that we have i.i.d. data $x = (x_1,x_2,\ldots,x_n)$, for which each data point is modelled as a random sample from $N(\mu,\sigma^2)$, where $\mu$ is unknown and $\sigma^2$ is known. Find the $k$-likelihood region $R_k$ for the parameter $\mu$.

First, we need to find the MLE $\hat\mu$ of $\mu$. The likelihood function for our model is
$L(\mu;x) = \prod_{i=1}^n\phi(x_i;\mu) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right),$
where the range of parameter values is all $\mu \in \mathbb{R}$. The log likelihood is
$l(\mu;x) = -\frac{n}{2}\left(\log(2\pi)+\log(\sigma^2)\right) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2.$

The usual process of maximisation which is left for you and is a simplified case of Example 4 shows that the maximum likelihood estimator is the sample mean, ˆµ n n x i. i Now we are ready to identify the k-likelihood region for µ. By definition, the k-likelihood region is So, µ R k if and only if R k {µ R : lµ; x lˆµ; x k}. σ n i x i µ σ We can simplify this inequality, by noting that n x i µ i n x i ˆµ i n x i ˆµ k. i n x i x i µ + µ x i + x iˆµ ˆµ i nµ nˆµ + ˆµ µ n i nµ nˆµ + ˆµ µnˆµ nµ + ˆµ µˆµ nˆµ µ. x i So, µ R k if and only if or in other words, n σ ˆµ µ k, [ ] k k R k ˆµ σ n, ˆµ + σ. n Example 43: Hypothesis tests based on likelihood > In Example 37, if we used a -likelihood test, would we accept the hypothesis that the radioactive decay of carbon-5 is equal to λ.7? We had found, given the data, that the likelihood function of θ was Lλ; x λ 5 e 47.58λ and the maximum likelihood estimator of λ was ˆλ.3. The -likelihood region for λ is the set so λ R if and only if R { } λ > : Lλ; x e Lˆλ; x, λ 5 e 47.58λ e L.3; x.4 5. 34

Note that, unlike the previous example, we can't simplify this inequality and find a nice form for the likelihood region. Our hypothesis is that, in fact, $\lambda = 0.27$. Our 2-likelihood test will pass if $\lambda = 0.27$ is within the 2-likelihood region, and fail if not. We can evaluate (use e.g. R)
$0.27^{15}e^{-47.58\times0.27} \approx 7.78\times10^{-15},$
and note that $7.78\times10^{-15} \ge 1.25\times10^{-15}$. Hence $\lambda = 0.27$ is within the 2-likelihood region and we accept the hypothesis.
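The numbers in this test are easy to reproduce in R; the following snippet is an added check rather than part of the original notes.

    # Example 43: 2-likelihood test for the carbon-15 decay rate
    n <- 15; sx <- 47.58
    L <- function(lambda) lambda^n * exp(-sx * lambda)
    lambda_hat <- n / sx
    c(L(0.27), exp(-2) * L(lambda_hat))   # approx 7.78e-15 and 1.25e-15
    L(0.27) >= exp(-2) * L(lambda_hat)    # TRUE: the hypothesis is accepted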