Foundations of Statistical Inference
Foundations of Statistical Inference (BS2a)
Jonathan Marchini
Department of Statistics, University of Oxford
MT 2013
Course arrangements
Lectures: M.2 and Th.10, SPR1, weeks 1-8.
Classes: weeks 3-8, Friday 12-1, 4-5, 5-6, SPR1/2. Hand in solutions by 12 noon on Wednesday, SPR1.
Class tutors: Jonathan Marchini, Charlotte Greenan.
Notes and problem sheets will be made available.
Books:
- Garthwaite, P. H., Jolliffe, I. T. and Jones, B. (2002) Statistical Inference, Oxford Science Publications.
- Leonard, T. and Hsu, J. S. (2005) Bayesian Methods, Cambridge University Press.
- Cox, D. R. (2006) Principles of Statistical Inference, Cambridge University Press.
This course builds on notes from Bob Griffiths and Geoff Nicholls.
Part A Statistics
The majority of the statistics that you have learned up to now falls under the philosophy of classical (or Frequentist) statistics. This theory makes the assumption that we can randomly take repeated samples of data from the same population. You learned about three types of statistical inference:
- Point estimation (maximum likelihood, bias, consistency, efficiency, information)
- Interval estimation (exact and approximate intervals using the CLT)
- Hypothesis testing (Neyman-Pearson lemma, uniformly most powerful tests, generalised likelihood ratio tests)
You also had an introduction to Bayesian statistics:
- Posterior inference (Posterior ∝ Likelihood × Prior)
- Interval estimation (credible intervals, HPD intervals)
- Priors (conjugate priors, improper priors, Jeffreys prior)
- Hypothesis testing (marginal likelihoods, Bayes factors)
Frequentist inference
In BS2a we develop the theory of point estimation further.
- Are there families of distributions about which we can make general statements? Exponential families.
- How can we summarise all the information in a dataset about a parameter θ? Sufficiency and the Factorization Theorem.
- What are the limits of how well we can estimate a parameter θ? The Cramér-Rao inequality (and bound).
- How can we find good estimators of a parameter θ? The Rao-Blackwell Theorem and the Lehmann-Scheffé Theorem.
Bayesian inference
Parameters are treated as random variables. Inference starts by specifying a prior distribution on θ based on prior beliefs. Having collected some data, we use Bayes' Theorem to update our beliefs and obtain a posterior distribution.

Quick example: Suppose I give you a coin and tell you that it is a bit biased. We might use a Beta(4,4) distribution to represent our beliefs about θ, the probability of heads. If we then observe 30 heads and 10 tails, we can use probability theory to infer a posterior distribution for θ of Beta(34, 14).

[Figure: prior Beta(4,4) and posterior Beta(34,14) densities]
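As a quick numerical check of this conjugate update, here is a short Python sketch (my own illustration, not part of the course materials; the function names are arbitrary):

```python
import math

def beta_binomial_update(a, b, heads, tails):
    """Conjugate update: a Beta(a, b) prior on θ combined with binomial
    coin-flip data gives a Beta(a + heads, b + tails) posterior."""
    return a + heads, b + tails

def beta_pdf(x, a, b):
    """Beta(a, b) density at x, computed via log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

# Prior Beta(4, 4); observe 30 heads and 10 tails.
a_post, b_post = beta_binomial_update(4, 4, 30, 10)
print((a_post, b_post))             # (34, 14), as on the slide
print(a_post / (a_post + b_post))   # posterior mean of θ, 34/48
```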
Computational techniques for Bayesian inference
It is not always possible to obtain an analytic solution when doing Bayesian inference, so we study approximate computational techniques in this course. These include:
- Markov chain Monte Carlo (MCMC) methods
- Approximations to marginal likelihoods [NEW]: variational approximations, Laplace approximations, the Bayesian Information Criterion (BIC)
- The EM algorithm [NEW], useful in Frequentist and Bayesian inference for missing-data problems
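To give a flavour of MCMC, the sketch below (my own illustration, not from the slides) runs a random-walk Metropolis sampler targeting the Beta(34, 14) posterior from the coin example; the step size and iteration counts are arbitrary choices:

```python
import math
import random

def log_post(theta):
    """Unnormalised log posterior of Beta(34, 14):
    33*log(theta) + 13*log(1 - theta) on (0, 1), -inf outside."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return 33 * math.log(theta) + 13 * math.log(1 - theta)

def metropolis(n_iter, step=0.05, start=0.5, seed=1):
    """Random-walk Metropolis: propose theta' ~ N(theta, step^2) and
    accept with probability min(1, post(theta') / post(theta))."""
    rng = random.Random(seed)
    theta = start
    samples = []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, step)
        if rng.random() < math.exp(min(0.0, log_post(prop) - log_post(theta))):
            theta = prop
        samples.append(theta)
    return samples

samples = metropolis(20000)[5000:]   # discard burn-in
print(sum(samples) / len(samples))   # close to the exact posterior mean 34/48
```

Only the unnormalised posterior is needed, which is exactly why MCMC is useful when the normalising constant is intractable.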
Decision theory
Quick example: You have been exposed to a deadly virus. About 1/3 of people who are exposed to the virus are infected by it, and all those infected by it die unless they receive a vaccine. By the time any symptoms of the virus show up, it is too late for the vaccine to work. You are offered a vaccine for 500. Do you take it or not?
The most likely scenario is that you don't have the virus, but basing a decision on this ignores the costs (or losses) associated with the decisions. We would put a very high loss on dying!
Decision theory allows us to formalise this problem. It is closely connected to Bayesian ideas, and applicable to point estimation and hypothesis testing. Inference is based on specifying a number of possible actions we might take and a loss associated with each action. The theory involves determining a rule for making decisions using an overall measure of loss.
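The expected-loss comparison behind this example can be sketched in a few lines of Python (my own illustration; the figure of 10**7 for the loss attached to dying is an arbitrary assumption made only to have something concrete in the same units as the vaccine cost):

```python
def expected_loss(action, p_infected=1/3, cost_vaccine=500, loss_death=10**7):
    """Expected loss of each action under the virus example.
    loss_death is an assumed (large) number quantifying how bad dying is."""
    if action == "vaccinate":
        return cost_vaccine              # pay 500 and survive either way
    return p_infected * loss_death       # 1/3 chance of dying untreated

for action in ("vaccinate", "decline"):
    print(action, expected_loss(action))
# Vaccinating minimises expected loss whenever cost_vaccine < p_infected * loss_death.
```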
Notation
X, Y, Z : capital letters for random variables.
x, y, z : lower case letters for realisations of random variables.
E_X(·) : expectation with respect to the random variable X.
φ = {φ_1, ..., φ_k} : sometimes we will use bold symbols to denote a vector of parameters.
Lecture 1 - Exponential families
Parametric families
f(x; θ), θ ∈ Θ, is the probability density of a random variable (rv), which could be discrete or continuous. θ can be 1-dimensional or of higher dimension. Equivalent notation: f_θ(x), f(x | θ), f(x, θ).
Likelihood: L(θ; x) = f(x; θ), and log-likelihood: l(θ; x) = log L.
Examples
1. Normal N(θ, 1): f(x; θ) = (1/√(2π)) exp{-½(x - θ)²}, x ∈ ℝ, θ ∈ ℝ.
2. Poisson: f(x; θ) = (θ^x / x!) e^{-θ}, x = 0, 1, 2, ..., θ > 0.
3. Regression: f(y; θ) = ∏_{i=1}^n (1/(√(2π)σ)) exp{-(1/(2σ²))(y_i - Σ_{j=1}^p x_{ij}β_j)²}, y ∈ ℝⁿ, σ > 0, β ∈ ℝ^p, θ = {β, σ}.
Exponential families of distributions
Definition 1 (GJJ 2.6, DRC 2.3). A rv X belongs to a k-parameter exponential family if its probability density function (pdf) can be written as
    f(x; θ) = exp{ Σ_{j=1}^k A_j(θ)B_j(x) + C(x) + D(θ) },
where x ∈ χ, θ ∈ Θ, A_1(θ), ..., A_k(θ), D(θ) are functions of θ alone, and B_1(x), B_2(x), ..., B_k(x), C(x) are well-behaved functions of x alone.
Exponential families are widely used in practice, for example in generalised linear models (see BS1a).
Example 1: Poisson
We want to put the Poisson distribution in the form (with k = 1)
    f(x; θ) = exp{ A(θ)B(x) + C(x) + D(θ) }.
We have
    e^{-θ} θ^x / x! = exp{ -θ + x log θ - log x! } = exp{ (log θ)x - log x! - θ },
so A(θ) = log θ, B(x) = x, C(x) = -log x!, D(θ) = -θ.
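This identity is easy to verify numerically. A quick Python check (my own, not from the slides), using log x! = lgamma(x + 1):

```python
import math

def poisson_pmf(x, theta):
    """Poisson pmf evaluated directly."""
    return math.exp(-theta) * theta ** x / math.factorial(x)

def expfam_pmf(x, theta):
    """exp{A(θ)B(x) + C(x) + D(θ)} with A(θ) = log θ, B(x) = x,
    C(x) = -log x!, D(θ) = -θ."""
    return math.exp(math.log(theta) * x - math.lgamma(x + 1) - theta)

for theta in (0.5, 2.5):
    for x in range(8):
        assert math.isclose(poisson_pmf(x, theta), expfam_pmf(x, theta))
print("exponential-family form matches the Poisson pmf")
```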
Examples of 1-parameter exponential families
Binomial, Poisson, Normal, Exponential.

Distn      f(x; θ)                        A(θ)           B(x)   C(x)             D(θ)
Bin(n, p)  (n choose x) p^x (1-p)^{n-x}   log(p/(1-p))   x      log(n choose x)  n log(1-p)
Pois(θ)    e^{-θ} θ^x / x!                log θ          x      -log(x!)         -θ
N(µ, 1)    (1/√(2π)) exp{-(x-µ)²/2}       µ              x      -x²/2            -½(µ² + log(2π))
Exp(θ)     θ e^{-θx}                      -θ             x      0                log θ

Others: negative binomial, Pareto (with known minimum), Weibull (with known shape), Laplace (with known mean), log-normal, inverse Gaussian, beta, Dirichlet, Wishart. Exercise: check these distributions.
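The remaining rows of the table can be checked the same way as the Poisson. A small Python sketch (my own; the helper `check` is not from the course) verifies that f(x; θ) = exp{A(θ)B(x) + C(x) + D(θ)} for the Binomial, Normal and Exponential rows:

```python
import math

def check(pdf, A, B, C, D, xs):
    """Verify pdf(x) == exp{A * B(x) + C(x) + D} on a grid of x values."""
    for x in xs:
        assert math.isclose(pdf(x), math.exp(A * B(x) + C(x) + D), rel_tol=1e-12)

n, p = 10, 0.3  # Bin(n, p) row
check(lambda x: math.comb(n, x) * p ** x * (1 - p) ** (n - x),
      math.log(p / (1 - p)), lambda x: x,
      lambda x: math.log(math.comb(n, x)), n * math.log(1 - p),
      range(n + 1))

mu = 1.7  # N(mu, 1) row
check(lambda x: math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi),
      mu, lambda x: x, lambda x: -x ** 2 / 2,
      -0.5 * (mu ** 2 + math.log(2 * math.pi)),
      [-2.0, 0.0, 0.5, 3.0])

lam = 2.0  # Exp(lam) row
check(lambda x: lam * math.exp(-lam * x),
      -lam, lambda x: x, lambda x: 0.0, math.log(lam),
      [0.0, 0.5, 1.0, 4.0])
print("all table rows verified")
```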
Example 2: a 2-parameter family (Gamma)
If X ~ Gamma(α, β), then let θ = (α, β), so
    f(x; θ) = β^α x^{α-1} e^{-βx} / Γ(α)
            = exp{ α log β + (α - 1) log x - βx - log Γ(α) }
            = exp{ (α - 1) log x - βx - log[Γ(α)β^{-α}] },
and we have A_1(θ) = α - 1, B_1(x) = log x, A_2(θ) = -β, B_2(x) = x.
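Again this can be verified numerically. A quick Python check (my own, not from the slides):

```python
import math

def gamma_pdf(x, a, b):
    """Gamma(a, b) density (rate parameterisation) evaluated directly."""
    return b ** a * x ** (a - 1) * math.exp(-b * x) / math.gamma(a)

def gamma_expfam(x, a, b):
    """exp{A1·B1(x) + A2·B2(x) + C(x) + D(θ)} with A1 = α - 1, B1 = log x,
    A2 = -β, B2 = x, C = 0 and D = α log β - log Γ(α)."""
    return math.exp((a - 1) * math.log(x) - b * x
                    + a * math.log(b) - math.lgamma(a))

for x in (0.3, 1.0, 2.7):
    assert math.isclose(gamma_pdf(x, 2.5, 1.5), gamma_expfam(x, 2.5, 1.5))
print("exponential-family form matches the Gamma pdf")
```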
Some other 2-parameter exponential families

Distribution  f(x; θ)                          A_j(θ)               B_j(x)          C(x)  D(θ)
N(µ, σ²)      (1/√(2πσ²)) exp{-(x-µ)²/(2σ²)}   A_1(θ) = -1/(2σ²)    B_1(x) = x²     0     -½(log(2πσ²) + µ²/σ²)
                                               A_2(θ) = µ/σ²        B_2(x) = x
Gamma(α, β)   β^α x^{α-1} e^{-βx} / Γ(α)       A_1(θ) = α - 1       B_1(x) = log x  0     -log[Γ(α)β^{-α}]
                                               A_2(θ) = -β          B_2(x) = x
Exponential family canonical form
Let φ_j = A_j(θ), j = 1, ..., k, so that
    f(x; φ) = exp{ Σ_{j=1}^k φ_j B_j(x) + C(x) + D(φ) }.
The φ_j, j = 1, ..., k, are the canonical parameters, and the B_j, j = 1, ..., k, are the canonical observations. These are sometimes called the natural parameters and observations.
Since ∫ f(x; φ) dx = 1, we have
    ∫ exp{ Σ_{j=1}^k φ_j B_j(x) + C(x) + D(φ) } dx = 1
    exp{D(φ)} ∫ exp{ Σ_{j=1}^k φ_j B_j(x) + C(x) } dx = 1
    ∫ exp{ Σ_{j=1}^k φ_j B_j(x) + C(x) } dx = exp{-D(φ)}.
Differentiating with respect to φ_i:
    ∫ B_i(x) exp{ Σ_{j=1}^k φ_j B_j(x) + C(x) } dx = -(∂D(φ)/∂φ_i) exp{-D(φ)}
    ∫ B_i(x) exp{ Σ_{j=1}^k φ_j B_j(x) + C(x) + D(φ) } dx = -∂D(φ)/∂φ_i
    E[B_i(X)] = -∂D(φ)/∂φ_i.
Exercise: Cov[B_i(X), B_j(X)] = -∂²D(φ)/∂φ_i∂φ_j.
Exercise: Var[B_i(X)] = -∂²D(φ)/∂φ_i².
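These identities can be sanity-checked numerically. The Python sketch below (my own illustration) uses the Poisson family in canonical form, where φ = log θ and D(φ) = -e^φ, and compares finite-difference derivatives of D with the known moments E[X] = Var[X] = θ:

```python
import math

def D(phi):
    """D(φ) for the Poisson family in canonical form: φ = log θ, D(φ) = -e^φ."""
    return -math.exp(phi)

def d1(f, x, h=1e-5):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    """Central finite-difference approximation to f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

theta = 2.5
phi = math.log(theta)
print(-d1(D, phi))   # E[B(X)] = E[X], close to θ = 2.5
print(-d2(D, phi))   # Var[B(X)] = Var[X], close to θ = 2.5
```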
Example 3: Gamma
We already know that if X ~ Gamma(α, β) then E(X) = α/β and Var(X) = α/β².
    β^α x^{α-1} e^{-βx} / Γ(α) = exp{ -βx + (α - 1) log x + α log β - log Γ(α) },
so φ_1 = -β, φ_2 = α - 1, B_1(x) = x, B_2(x) = log x, and
    D(φ) = α log β - log Γ(α) = (φ_2 + 1) log(-φ_1) - log Γ(φ_2 + 1).
Then
    E[X] = -∂D(φ)/∂φ_1 = -(φ_2 + 1)/φ_1 = α/β
    Var[X] = -∂²D(φ)/∂φ_1² = (φ_2 + 1)/φ_1² = α/β².
Exercise: show E[log X] = ψ_0(α) - log β, where ψ_0 is the digamma function and Γ'(α) = Γ(α)ψ_0(α).
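The exercise can be checked by simulation. A Python sketch (my own, not from the slides): ψ_0 is approximated by a finite difference of math.lgamma (scipy.special.digamma would give it directly), and E[log X] by a Monte Carlo average; note that random.gammavariate takes a scale parameter, so rate β becomes scale 1/β:

```python
import math
import random

def digamma(a, h=1e-5):
    """ψ0(a) = d/da log Γ(a), approximated by a central finite difference."""
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2 * h)

a, b = 3.0, 2.0
rng = random.Random(0)
n = 200000
# random.gammavariate(shape, scale), so scale = 1/b for rate b.
sample_mean = sum(math.log(rng.gammavariate(a, 1 / b)) for _ in range(n)) / n
print(sample_mean)            # Monte Carlo estimate of E[log X]
print(digamma(a) - math.log(b))  # theoretical value ψ0(α) - log β
```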
Cumulant Generating Function
In a scalar canonical exponential family (k = 1),
    E_X[e^{sB(X)}] = ∫ exp{ (φ + s)B(x) + C(x) + D(φ) } dx = exp{ D(φ) - D(φ + s) }.
If M_{B(X)}(s) is the moment generating function (mgf) for B(X), then
    log M_{B(X)}(s) = D(φ) - D(φ + s).
This is the cumulant generating function (defined as the log of the mgf) for the cumulants of B(X), i.e.
    log M_{B(X)}(s) = Σ_{r=1}^∞ κ_r s^r / r!,
where κ_1 = E(B(X)) and κ_2 = Var(B(X)). Exercise: prove this.
Example 4: Binomial
We already know that if X ~ Binom(n, p) then E(X) = np. Here
    φ = log(p/(1 - p)),  p = e^φ/(1 + e^φ),
and B_1(x) = x, D(φ) = n log(1 - p) = -n log(1 + e^φ). So
    log M_X(s) = D(φ) - D(φ + s) = -n log(1 + e^φ) + n log(1 + e^{φ+s}).
Therefore
    κ_1 = ∂/∂s log M_X(s) |_{s=0} = n e^φ/(1 + e^φ) = np.
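This calculation can be confirmed by differentiating the cumulant generating function numerically. A Python sketch (my own illustration), which also checks the second cumulant κ_2 = np(1 - p):

```python
import math

n, p = 12, 0.3
phi = math.log(p / (1 - p))

def K(s):
    """Cumulant generating function D(φ) - D(φ + s), with
    D(t) = -n log(1 + e^t) for the Binomial in canonical form."""
    D = lambda t: -n * math.log(1 + math.exp(t))
    return D(phi) - D(phi + s)

h = 1e-4
kappa1 = (K(h) - K(-h)) / (2 * h)           # first cumulant = mean
kappa2 = (K(h) - 2 * K(0) + K(-h)) / h ** 2  # second cumulant = variance
print(kappa1, n * p)            # both close to 3.6
print(kappa2, n * p * (1 - p))  # both close to 2.52
```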
Example 5: Skew-logistic distribution
Consider the real-valued random variable X with pdf
    f(x; θ) = θ e^x / (1 + e^x)^{θ+1}
            = exp{ -θ log(1 + e^x) + log(e^x/(1 + e^x)) + log θ },
so φ = -θ, B_1(x) = log(1 + e^x), C(x) = log(e^x/(1 + e^x)) and D(φ) = log θ = log(-φ). Then
    log M_{B(X)}(s) = D(φ) - D(φ + s) = log(-φ) - log(-(φ + s)),
so
    E[log(1 + e^X)] = -1/(φ + s) |_{s=0} = 1/θ
    Var[log(1 + e^X)] = 1/(φ + s)² |_{s=0} = 1/θ².
These results are harder to derive directly.
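A simulation check of these two moments (my own Python sketch, not from the slides), sampling X by inverting the cdf F(x) = 1 - (1 + e^x)^{-θ}:

```python
import math
import random

theta = 2.0
rng = random.Random(42)

def sample_skew_logistic(theta):
    """Inverse-cdf sampling: F(x) = 1 - (1 + e^x)^(-θ) inverts to
    x = log(u^(-1/θ) - 1), using u in place of 1 - u by symmetry."""
    u = rng.random()
    return math.log(u ** (-1.0 / theta) - 1.0)

b = [math.log(1.0 + math.exp(sample_skew_logistic(theta))) for _ in range(100000)]
mean = sum(b) / len(b)
var = sum((v - mean) ** 2 for v in b) / len(b)
print(mean, 1 / theta)          # sample mean of B(X) vs 1/θ
print(var, 1 / theta ** 2)      # sample variance of B(X) vs 1/θ²
```

In fact B(X) = log(1 + e^X) = -(1/θ) log U here, so B(X) ~ Exp(θ), consistent with the mean 1/θ and variance 1/θ² above.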
Family preserved under transformations
A smooth invertible transformation of a rv from an exponential family is also within the exponential family. If X → Y, Y = Y(X), then
    f_Y(y; θ) = f_X(x(y); θ) |∂x/∂y|
              = exp{ Σ_{j=1}^k A_j(θ)B_j(x(y)) + C(x(y)) + D(θ) } |∂x/∂y|.
The Jacobian depends only on y, and so the natural observation B(x(y)), the natural parameter A(θ), and D(θ) do not change, while C(x) → C(x(y)) + log|∂x/∂y|.
86 Exponential family conditions
1. The range of X, χ, does not depend on θ.
2. (A_1(θ), A_2(θ), ..., A_k(θ)) have continuous second derivatives for θ ∈ Θ.
3. If dim Θ = d, then the Jacobian

J(θ) = [ ∂A_i(θ)/∂θ_j ]

has full rank d for θ ∈ Θ.
[Part of the regularity conditions which we cite later.]
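Condition 3 is easy to verify symbolically for a concrete family. As an illustration (my example, not the lecturer's): for N(μ, σ²) the natural parameters are A_1 = μ/σ² and A_2 = -1/(2σ²).

```python
import sympy as sp

# Rank check of the Jacobian [dA_i/dtheta_j] for the N(mu, sigma^2) family,
# parametrised by theta = (mu, sigma).
mu, sigma = sp.symbols('mu sigma', positive=True)
A = sp.Matrix([mu / sigma**2, -1 / (2 * sigma**2)])
J = A.jacobian([mu, sigma])
print(J.rank())   # 2: full rank d = dim Theta, so condition 3 holds
```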
90 The family is minimal if no linear combination exists such that

λ_0 + λ_1 B_1(x) + ⋯ + λ_k B_k(x) = 0

for all x ∈ χ.

The dimension of the family is defined to be d, the rank of J(θ).

A minimal exponential family is said to be curved when d < k and linear when d = k. We refer to a (k, d) curved exponential family.

Example 6 : (X_1, X_2) independent, normal, unit variance, means (θ, c/θ), c known.

log f(x; θ) = x_1 θ + c x_2 / θ - θ^2 / 2 - c^2 θ^{-2} / 2 + const

is a (2, 1) curved exponential family.
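The (2, 1) claim in Example 6 can be read off from the Jacobian of the natural parameter map A(θ) = (θ, c/θ); a quick symbolic check (my own, not from the slides):

```python
import sympy as sp

# Example 6: k = 2 natural parameters A(theta) = (theta, c/theta), but a
# one-dimensional parameter theta, so J(theta) is a 2x1 column.
theta, c = sp.symbols('theta c', positive=True)
A = sp.Matrix([theta, c / theta])
J = A.jacobian([theta])   # column (1, -c/theta^2)
print(J.rank())           # 1 = d < k = 2, hence a (2, 1) curved family
```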
94 Linear relations among A's do not generate curvature
Take a k-parameter exponential family and impose A_1 = aA_2 + b for constants a and b. Then

Σ_{i=1}^{k} A_i B_i + C + D = Σ_{i=3}^{k} A_i B_i + A_2 (aB_1 + B_2) + (C + bB_1) + D.

This gives a (k-1)-parameter EF, not a (k, k-1)-CEF.
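The regrouping above is pure algebra; a symbolic confirmation for k = 2 (an illustration of mine, not part of the slides):

```python
import sympy as sp

# Impose A_1 = a*A_2 + b and regroup: the exponent collapses onto the single
# natural parameter A_2 with natural observation a*B_1 + B_2.
a, b, A2, B1, B2, C, D = sp.symbols('a b A2 B1 B2 C D')
exponent  = (a * A2 + b) * B1 + A2 * B2 + C + D
regrouped = A2 * (a * B1 + B2) + (C + b * B1) + D
print(sp.expand(exponent - regrouped))   # 0: the two forms agree identically
```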
95 Examples not in an exponential family
1 Uniform on [0, θ], θ > 0:

f(x; θ) = 1/θ, x ∈ [0, θ], θ > 0.

2 The Cauchy distribution:

f(x; θ) = [π(1 + (x - θ)^2)]^{-1}, x ∈ R.