ML Testing (Likelihood Ratio Testing) for non-Gaussian models
1 ML Testing (Likelihood Ratio Testing) for non-Gaussian models

Surya Tokdar
2 ML test in a slightly different form

Model $X \sim f(x \mid \theta)$, $\theta \in \Theta$. Hypothesis $H_0: \theta \in \Theta_0$.

Good set: $B_c(x) = \{\theta : l_x(\theta) \ge \max_{\theta \in \Theta} l_x(\theta) - c^2/2\}$

ML test = reject $H_0$ if $B_c(x) \cap \Theta_0 = \emptyset$.

Same as: reject $H_0$ if $\max_{\theta \in \Theta} l_x(\theta) - \max_{\theta \in \Theta_0} l_x(\theta) > c^2/2$.

Or equivalently: reject $H_0$ if
$$\Lambda(x) = \frac{\max_{\theta \in \Theta} L_x(\theta)}{\max_{\theta \in \Theta_0} L_x(\theta)} > k.$$
3 Size calculation

Need to calculate $\max_{\theta \in \Theta_0} P[\Lambda(X) > k \mid \theta]$.

Will work for models of the form $X_i \overset{iid}{\sim} g(x_i \mid \theta)$, $\theta \in \Theta$, and hypotheses $H_0: A^T\theta = a_0$ for a given $p \times r$ matrix $A$ and a given $r$-vector $a_0$ (typically $a_0 = 0$).

Under regularity conditions,
$$2 \log \Lambda(X) \overset{d}{\to} \chi^2(r)$$
when $X_i \overset{iid}{\sim} g(x_i \mid \theta_0)$ with $\theta_0 \in \Theta_0$. Follows from asymptotic normality of $\hat\theta_{MLE}(X)$.
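The $\chi^2(r)$ limit is easy to check by simulation. As an illustrative sketch (not from the slides), consider testing $H_0: \theta = 1$ in an iid Exponential($\theta$) model, where the MLE is $1/\bar{x}$ and $2\log\Lambda$ simplifies to $2n(\bar{x} - 1 - \log\bar{x})$:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, reps = 200, 2000
stats = np.empty(reps)
for b in range(reps):
    x = rng.exponential(scale=1.0, size=n)   # data drawn under H0 (rate theta_0 = 1)
    xbar = x.mean()
    # MLE of the rate is 1/xbar; 2 log Lambda reduces to 2n(xbar - 1 - log xbar) >= 0
    stats[b] = 2 * n * (xbar - 1.0 - np.log(xbar))

# Compare with chi^2(1): mean should be near 1, and the empirical 95th
# percentile should be near the chi^2(1) quantile 3.84
print(stats.mean(), np.quantile(stats, 0.95), chi2.ppf(0.95, 1))
```

With $r = 1$ here (a single scalar constraint), both summaries land close to their $\chi^2(1)$ counterparts.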
4 Asymptotic normality of MLE

Theorem. Assume the following (Cramér) conditions:
1. $\theta \mapsto \log g(x \mid \theta)$ is twice continuously differentiable.
2. $\theta \mapsto g(\cdot \mid \theta)$ is identifiable.
3. $\left|\frac{\partial^3}{\partial\theta^3} \log g(x \mid \theta)\right| < h(x)$ for $\theta \in \text{nbhd}(\theta_0)$, with $E[h(X_1) \mid \theta_0] < \infty$.
4. $\hat\theta_{MLE}(x)$ is the unique solution of $l_x'(\theta) = 0$ for every $x$.

If $X_i \overset{iid}{\sim} g(x_i \mid \theta_0)$ then
$$\sqrt{n}\,(\hat\theta_{MLE} - \theta_0) \overset{d}{\to} N\big(0, \{I_1^F(\theta_0)\}^{-1}\big),$$
where $I_1^F(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2} \log g(X_1 \mid \theta)\right]$ = Fisher information at $\theta$.
5 A sketch of a proof

First order Taylor approximation:
$$0 = \tfrac{1}{\sqrt n}\, l_X'(\hat\theta_{MLE}) \approx \tfrac{1}{\sqrt n}\, l_X'(\theta_0) + \tfrac{1}{n}\, l_X''(\theta_0)\, \sqrt n\,(\hat\theta_{MLE} - \theta_0).$$

Write $T_i = \frac{\partial}{\partial\theta} \log g(X_i \mid \theta_0)$. Then $ET_i = 0$, $\mathrm{Var}\,T_i = I_1^F(\theta_0)$, and
$$\tfrac{1}{\sqrt n}\, l_X'(\theta_0) = \sqrt n\,(\bar T - 0) \overset{d}{\to} N(0, I_1^F(\theta_0)) \quad \text{[by CLT]}.$$

Write $W_i = \frac{\partial^2}{\partial\theta^2} \log g(X_i \mid \theta)$ at $\theta = \theta_0$. Then $EW_i = -I_1^F(\theta_0)$ and
$$\tfrac{1}{n}\, l_X''(\theta_0) = \bar W \overset{p}{\to} -I_1^F(\theta_0).$$

Rest is Slutsky's theorem.
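A quick numerical illustration of the theorem (my own example, not from the slides): in the Poisson($\theta$) model the MLE is the sample mean and $I_1^F(\theta) = 1/\theta$, so $\sqrt{n}(\hat\theta_{MLE} - \theta_0)$ should be approximately centered normal with variance $\{I_1^F(\theta_0)\}^{-1} = \theta_0$:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, n, reps = 4.0, 100, 4000
# Poisson MLE is the sample mean; Fisher information per observation is 1/theta
z = np.sqrt(n) * (rng.poisson(theta0, size=(reps, n)).mean(axis=1) - theta0)
# Theory: z approx N(0, 1/I_1(theta0)) = N(0, theta0)
print(z.mean(), z.var())
```

The empirical variance lands near $\theta_0 = 4$, matching the inverse Fisher information.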
6 Three large-$n$ approximations to $\Lambda(x)$

Assume $\theta_0 \in \Theta_0$, i.e., $A^T\theta_0 = a_0$. Notation: $\hat\theta_{H_0} = \arg\max_{\theta \in \Theta_0} l_x(\theta)$ (must have $A^T\hat\theta_{H_0} = a_0$).

Quadratic, Wald, Rao approximations:
$$F(X) = (A^T\hat\theta_{MLE} - a_0)^T \{A^T I_X^{-1} A\}^{-1} (A^T\hat\theta_{MLE} - a_0) \approx 2\log\Lambda(X)$$
$$W(X) = n\,(A^T\hat\theta_{MLE} - a_0)^T \{A^T I_1^F(\hat\theta_{MLE})^{-1} A\}^{-1} (A^T\hat\theta_{MLE} - a_0)$$
$$S(X) = \tfrac{1}{n}\, l_X'(\hat\theta_{H_0})^T \{I_1^F(\hat\theta_{H_0})\}^{-1}\, l_X'(\hat\theta_{H_0})$$

$2\log\Lambda(X) - F(X) \overset{p}{\to} 0$, etc. Each of $2\log\Lambda(X)$, $F(X)$, $W(X)$, $S(X)$ $\overset{d}{\to} \chi^2(r)$.
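For a scalar parameter ($A = 1$, $r = 1$) all three statistics have closed forms. A sketch for the Bernoulli model with $H_0: p = p_0$ (an illustrative example of my own, not from the slides; here $I_1^F(p) = 1/\{p(1-p)\}$):

```python
import numpy as np

def bernoulli_stats(x, n, p0):
    """2 log Lambda, Wald W and Rao score S for H0: p = p0, Bernoulli(n, p) data."""
    phat = x / n
    loglik = lambda p: x * np.log(p) + (n - x) * np.log(1 - p)
    lrt = 2 * (loglik(phat) - loglik(p0))                     # 2 log Lambda
    wald = n * (phat - p0) ** 2 / (phat * (1 - phat))         # uses I_1(phat)
    score = (x / p0 - (n - x) / (1 - p0)) ** 2 * p0 * (1 - p0) / n  # uses I_1(p0)
    return lrt, wald, score

lrt, wald, score = bernoulli_stats(60, 100, 0.5)
print(lrt, wald, score)   # three numerically close values for moderate n
```

All three land near 4 for this data, illustrating that the approximations agree to first order.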
7 Sketches of proof if you cared - part I

$2\log\Lambda(X) \approx F(X)$ follows from the quadratic approximation
$$l_x(\theta) \approx l_x(\hat\theta_{MLE}) - \tfrac12\, (\theta - \hat\theta_{MLE})^T I_X\, (\theta - \hat\theta_{MLE})$$
and the corresponding profile log-likelihood of $\eta = A^T\theta$:
$$l_x^*(\eta) \approx l_x(\hat\theta_{MLE}) - \tfrac12\, (\eta - A^T\hat\theta_{MLE})^T \{A^T I_X^{-1} A\}^{-1} (\eta - A^T\hat\theta_{MLE}),$$
and by noting $\log\Lambda(x) = l_x^*(A^T\hat\theta_{MLE}) - l_x^*(A^T\hat\theta_{H_0})$.

$2\log\Lambda(X) \approx W(X)$ follows similarly; use $I_X \approx n I_1^F(\hat\theta_{MLE})$.
8 Sketches of proof if you cared - part II

$2\log\Lambda(X) \approx S(X)$ follows because
$$0 = l_x'(\hat\theta_{MLE}) \approx l_x'(\hat\theta_{H_0}) + l_x''(\hat\theta_{H_0})\,(\hat\theta_{MLE} - \hat\theta_{H_0}), \qquad l_x''(\hat\theta_{H_0}) \approx -n\, I_1^F(\hat\theta_{H_0}),$$
and so
$$\hat\theta_{MLE} - \hat\theta_{H_0} \approx \{n\, I_1^F(\hat\theta_{H_0})\}^{-1}\, l_x'(\hat\theta_{H_0}).$$
Plug this into the original quadratic approximation and use $I_X \approx n I_1^F(\theta_0) \approx n I_1^F(\hat\theta_{H_0})$, because $\hat\theta_{MLE} \approx \theta_0$.
9 Sketches of proof if you cared - part III

It is easiest to prove $W(X) \overset{d}{\to} \chi^2(r)$. By asymptotic normality of MLE and Slutsky's theorem,
$$W(X) \overset{d}{\to} Z^T \{A^T I_1^F(\theta_0)^{-1} A\}^{-1} Z, \quad \text{where } Z \sim N_r\big(0,\, A^T I_1^F(\theta_0)^{-1} A\big).$$
But $Z \sim N_r(0, \Sigma)$ means $Z^T \Sigma^{-1} Z \sim \chi^2(r)$.
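The final fact, that $Z \sim N_r(0, \Sigma)$ implies $Z^T \Sigma^{-1} Z \sim \chi^2(r)$, is easy to verify by simulation (an illustrative sketch with an arbitrary covariance of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(2)
r = 3
A = rng.standard_normal((r, r))
Sigma = A @ A.T + r * np.eye(r)              # an arbitrary positive-definite covariance
L = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((20000, r)) @ L.T    # rows are draws of Z ~ N_r(0, Sigma)
# quadratic form Z^T Sigma^{-1} Z for every draw
q = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(Sigma), Z)
print(q.mean(), q.var())   # chi^2(r) has mean r and variance 2r
```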
10 Back to size calculation

Recall: each of $2\log\Lambda(X)$, $F(X)$, $W(X)$, $S(X)$ $\overset{d}{\to} \chi^2(r)$. So each of the following test procedures has size $\approx \alpha$:
- ML/LRT: reject $H_0$ if $2\log\Lambda(x) > q_{\chi^2}(1-\alpha, r)$
- Quadratic: reject $H_0$ if $F(x) > q_{\chi^2}(1-\alpha, r)$
- Wald: reject $H_0$ if $W(x) > q_{\chi^2}(1-\alpha, r)$
- Rao: reject $H_0$ if $S(x) > q_{\chi^2}(1-\alpha, r)$

Here $q_{\chi^2}(u, r) = F_{\chi^2}^{-1}(u, r)$, where $F_{\chi^2}(x, r)$ = CDF of $\chi^2(r)$. Also, one can calculate p-values as $1 - F_{\chi^2}(2\log\Lambda(x), r)$, etc.
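In practice the cutoff $q_{\chi^2}(1-\alpha, r)$ and the p-value come straight from a $\chi^2$ CDF routine. With SciPy, for example (the observed statistic 9.2 below is a made-up number for illustration):

```python
from scipy.stats import chi2

alpha, r = 0.05, 3
cutoff = chi2.ppf(1 - alpha, r)     # q_{chi^2}(1 - alpha, r); about 7.815 for r = 3
print(cutoff)

# p-value for a hypothetical observed statistic 2 log Lambda(x) = 9.2
pval = chi2.sf(9.2, r)              # sf = 1 - cdf, numerically stabler than 1 - cdf
print(pval)
```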
11 Which one to use?

Depends on which is easier to compute:
- To get $W(x)$ you don't need $\hat\theta_{H_0}$, only $\hat\theta_{MLE}$.
- To get $S(x)$ you only need $\hat\theta_{H_0}$, not $\hat\theta_{MLE}$.
- In many cases $F(X) = W(X)$ because $I_X = n I_1^F(\hat\theta_{MLE})$. This happens in most generalized linear models,
$$Y_i \overset{ind}{\sim} g(y_i \mid \beta, z_i) = h(z_i, y_i)\, e^{\xi(\beta) y_i z_i^T \beta - A(\beta, z_i)},$$
where Wald tests are popular choices.
- For the multinomial model of category counts, $S(X)$ = Pearson's $\chi^2$.
12 Example: low birthweight

Natality data: $n = 500$ records on US births in June. $Y_i = 1$ if the $i$-th birth record has birthweight $< 2500$g, $Y_i = 0$ otherwise. $z_i = (\text{Cigarettes}_i, \text{Black}_i)$ records the daily number of cigarettes and the race of the mother.

Model:
$$\log \frac{P(Y_i = 1)}{1 - P(Y_i = 1)} = \beta_1 + \beta_2\,\text{Cigarettes}_i + \beta_3\,\text{Black}_i + \beta_4\,\text{Cigarettes}_i \times \text{Black}_i.$$

Can compute $\hat\beta_{MLE}$ and $I_X$ (numerically).
13 MLE and curvature

From the glm() function in R: $\hat\beta_{MLE} = \cdots$, $I_X^{-1} = \cdots$ (numerical output omitted).

How would you test the null hypothesis that, for a black mother, the probability of low birthweight depends on the number of cigarettes? For a non-black mother? How would you test that this dependence is different for black and non-black mothers?
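The natality data are not reproduced here, but the whole workflow can be sketched on synthetic data: fit the logistic model by Newton-Raphson (essentially what R's glm() does), then answer each question with a Wald statistic for a contrast $a^T\beta$, compared to $\chi^2(1)$. All covariate distributions and coefficient values below are made up for illustration; the three contrast vectors encode the three questions on the slide ($\beta_2 + \beta_4 = 0$, $\beta_2 = 0$, $\beta_4 = 0$):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
n = 500
cigs = rng.poisson(3, n).astype(float)           # hypothetical covariates
black = rng.integers(0, 2, n).astype(float)
Z = np.column_stack([np.ones(n), cigs, black, cigs * black])
beta_true = np.array([-2.0, 0.08, 0.5, 0.1])     # made-up coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-Z @ beta_true)))

beta = np.zeros(4)                                # Newton-Raphson for the logistic MLE
for _ in range(50):
    mu = 1 / (1 + np.exp(-Z @ beta))
    grad = Z.T @ (y - mu)                         # score l'_x(beta)
    I_X = Z.T @ (Z * (mu * (1 - mu))[:, None])    # curvature -l''_x(beta)
    step = np.linalg.solve(I_X, grad)
    beta += step
    if np.abs(step).max() < 1e-10:
        break

def wald(a):
    """Wald statistic for H0: a^T beta = 0, to be compared with chi^2(1)."""
    var = a @ np.linalg.solve(I_X, a)             # a^T I_X^{-1} a
    return (a @ beta) ** 2 / var

W_nonblack = wald(np.array([0., 1., 0., 0.]))     # non-black: beta_2 = 0
W_black = wald(np.array([0., 1., 0., 1.]))        # black: beta_2 + beta_4 = 0
W_diff = wald(np.array([0., 0., 0., 1.]))         # difference: beta_4 = 0
for W in (W_nonblack, W_black, W_diff):
    print(W, chi2.sf(W, 1))
```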
14 Categorical data & Multinomial models

Data: counts of categories formed by one or more attributes. Tables could be one-way, two-way, etc., depending on how many attributes are used to decide categories.

Hair color \ Eye color    Blue   Green   Brown   Black   Total
Blonde                      20      15      18      14      67
Red                         11       4      24       2      41
Brown                        9      11      36      18      74
Black                        8      17      20       4      49
Total                       48      47      98      38     231
15 Multinomial model

Data are $k$ category counts of $n$ objects: $X = (X_1, \dots, X_k)$, $X_j \ge 0$, $X_1 + X_2 + \dots + X_k = n$.

Model: $X \sim \text{Multinomial}(n, p)$ where $p = (p_1, \dots, p_k) \in \Delta_k$. Here $\Delta_k$ is the $k$-dim simplex, containing the $k$-dim probability vectors $b = (b_1, \dots, b_k)$, i.e., $b_j \ge 0$ for all $j$ and $\sum_j b_j = 1$.

$\text{Multinomial}(n, p)$ has pmf
$$f(x \mid p) = \binom{n}{x_1 \cdots x_k}\, p_1^{x_1} \cdots p_k^{x_k}$$
for $x = (x_1, \dots, x_k)$ with $x_i \ge 0$ and $\sum_i x_i = n$; $f(x \mid p) = 0$ otherwise.
16 Multinomial model (contd.)

This is an extension of the binomial distributions. In $X$ we record the counts from $n$ independent trials with $k$ outcomes with probabilities given by $p$. Consequently $X_j \sim \text{Bin}(n, p_j)$ for any single category $1 \le j \le k$. Similarly, for two categories $j_1 \ne j_2$, $X_{j_1} + X_{j_2} \sim \text{Bin}(n, p_{j_1} + p_{j_2})$, and so on.
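The binomial-marginal fact can be verified numerically. An illustrative check with a small three-category example of my own: summing the multinomial pmf over the other categories recovers the Binomial($n, p_j$) pmf exactly.

```python
import numpy as np
from scipy.stats import multinomial, binom

n, p = 6, np.array([0.2, 0.3, 0.5])
# marginal P(X_1 = x1), obtained by summing the multinomial pmf over x2
# (x3 = n - x1 - x2 is then determined)
for x1 in range(n + 1):
    marg = sum(multinomial.pmf([x1, x2, n - x1 - x2], n, p)
               for x2 in range(n - x1 + 1))
    assert np.isclose(marg, binom.pmf(x1, n, p[0]))
print("marginal of a multinomial count is Binomial(n, p_j)")
```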
17 Multinomial model (contd.)

For two-way tables, with $k_1$ rows and $k_2$ columns, we can write $X$ and $p$ either as $k_1 k_2$-dim vectors, or more commonly as $k_1 \times k_2$ matrices. Even when they're written as matrices, we'll write $X \sim \text{Multinomial}(n, p)$ to mean that the multinomial distribution is placed on the vector forms of $X$ and $p$. For multi-way tables we'll use arrays of appropriate dimensions to represent $X$ and $p$.
18 MLE

To obtain the MLE of $p$ based on data $X = x$, we can maximize the log-likelihood function in $p$ with an additional Lagrange component to account for the constraint $\sum_{j=1}^k p_j = 1$:
$$l_x(p, \lambda) = \text{const} + \sum_{j=1}^k x_j \log p_j + \lambda\Big(\sum_{j=1}^k p_j - 1\Big).$$
The solution in $p$ equals
$$\hat p_{MLE} = \Big(\frac{x_1}{n}, \dots, \frac{x_k}{n}\Big).$$
19 Hypothesis testing: Mendel's peas

Mendel, the founder of modern genetics, studied how physical characteristics are inherited in plants. His studies led him to propose the laws of segregation and independent assortment. We'll test this in a simple context.

Under Mendel's laws, when pure round-yellow and pure wrinkled-green pea plants are cross-bred, the next generation of plant seeds should exhibit a 9:3:3:1 ratio of round-yellow, round-green, wrinkled-yellow and wrinkled-green combinations of shape and color. In a sample of 556 plants from the next generation the observed counts for these combinations are (315, 108, 101, 32). Does the data support Mendel's laws?
20 Formalization

Data: $X = (X_1, X_2, X_3, X_4)$ giving the category counts of the four types of plants.
Model: $X \sim \text{Multinomial}(n = 556, p)$, $p \in \Delta_4$.
Test $H_0: p = \left(\frac{9}{16}, \frac{3}{16}, \frac{3}{16}, \frac{1}{16}\right)$.
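This point null can be tested with Pearson's statistic (slide 25), which here needs no estimation at all since $H_0$ fixes $p$ completely. A sketch in Python:

```python
import numpy as np
from scipy.stats import chi2

x = np.array([315, 108, 101, 32])            # observed counts
n = x.sum()                                  # 556
p0 = np.array([9, 3, 3, 1]) / 16             # point null from Mendel's laws
expected = n * p0
S = np.sum((x - expected) ** 2 / expected)   # sum of (Observed - Expected)^2 / Expected
pval = chi2.sf(S, df=len(x) - 1)             # k - 1 = 3 degrees of freedom (q = 0)
print(S, pval)                               # S ≈ 0.47, p ≈ 0.93: no evidence against H0
```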
21 Hardy-Weinberg equilibrium

The spotting on the wings of scarlet tiger moths is controlled by a gene that comes in two varieties (alleles) whose combinations (moths have pairs of chromosomes) produce three varieties of spotting pattern: white spotted, little spotted and intermediate. If the moth population is in Hardy-Weinberg equilibrium (no current selection drift), then these varieties should be in the ratio $a^2 : (1-a)^2 : 2a(1-a)$, where $a \in (0, 1)$ denotes the abundance of the dominant white-spotting allele.

In a sample of 1612 moths, the three varieties were counted to be 1469, 5 and 138. Is the moth population in HW equilibrium?
22 Formalization

Data: $X = (X_1, X_2, X_3)$, the category counts of the three spotting patterns.
Model: $X \sim \text{Multinomial}(n = 1612, p)$, $p \in \Delta_3$.
Test $H_0: p \in HW_3$, where $HW_3$ is the subset of $\Delta_3$ containing all vectors of the form $(a^2, (1-a)^2, 2a(1-a))$ for some $a \in (0, 1)$.
23 Hair and eye color

Are hair color and eye color two independent attributes? In this case, writing $p$ in the matrix form $p = ((p_{ij}))$, $1 \le i \le k_1$ and $1 \le j \le k_2$, we want to test if $p_{ij} = p_i^{(1)} p_j^{(2)}$ for some $p^{(1)} \in \Delta_{k_1}$ and $p^{(2)} \in \Delta_{k_2}$.

It's elementary that if $p_{ij}$ factors as above, then $p_i^{(1)} = \sum_{j=1}^{k_2} p_{ij}$ and $p_j^{(2)} = \sum_{i=1}^{k_1} p_{ij}$. Writing the row and column totals as $p_{i\cdot}$ and $p_{\cdot j}$, the test of independence is often represented by
$$H_0: p_{ij} = p_{i\cdot}\, p_{\cdot j} \quad \forall\, i, j.$$
We'll use $\mathcal{I}_{k_1, k_2}$ to denote this set.
24 ML testing

$\dim(\Delta_k) = k - 1$, as vectors $p \in \Delta_k$ satisfy $\sum_{i=1}^k p_i = 1$. Hypotheses $H_0: p \in \Theta_0$, where $\Theta_0$ is a sub-simplex of $\Delta_k$ of dimension $q < k$.
- Mendel's peas: $\Theta_0 = \{(\frac{9}{16}, \frac{3}{16}, \frac{3}{16}, \frac{1}{16})\}$, $q = 0$.
- HW equilibrium: $\Theta_0 = HW_3$, $q = 1$.
- Independence: $\Theta_0 = \mathcal{I}_{k_1, k_2}$, $q = k_1 + k_2 - 2$.

Could use any of the ML, quadratic, Wald or Rao tests.
25 Pearson's χ² tests

Rao's test statistic boils down to a very convenient form:
$$S(X) = \sum_{j=1}^k \frac{(X_j - n\hat p_{H_0,j})^2}{n\hat p_{H_0,j}} = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}.$$

For categorical data, this was earlier discovered by Pearson, who also ascertained its approximate $\chi^2$ distribution. Because this is the Rao statistic, we have $S(X) \overset{d}{\to} \chi^2(k - 1 - q)$.
26 Example 1: point null

For testing $H_0: p = p^0$ against $p \ne p^0$, where $p^0 = (p_{0,1}, \dots, p_{0,k}) \in \Delta_k$ is a fixed probability vector of interest (as in Mendel's peas example),
$$S(X) = \sum_{j=1}^k \frac{(X_j - n p_{0,j})^2}{n p_{0,j}},$$
which is asymptotically $\chi^2(k - 1 - 0) = \chi^2(k - 1)$. The size-$\alpha$ Pearson's test rejects $H_0$ if $S(X) > q_{\chi^2}(1-\alpha, k-1)$.
27 Example 2: parametric form

HW test: $H_0: p = (\eta^2,\ \eta(1-\eta),\ \eta(1-\eta),\ (1-\eta)^2)$ for some $0 < \eta < 1$. To compute $\hat p_{H_0}$ it is equivalent to write the likelihood function in $\eta$ and maximize:
$$L_X(\eta) = \text{const.}\, \{\eta^2\}^{x_1} \{\eta(1-\eta)\}^{x_2} \{\eta(1-\eta)\}^{x_3} \{(1-\eta)^2\}^{x_4} = \text{const.}\ \eta^{2x_1 + x_2 + x_3}\, (1-\eta)^{x_2 + x_3 + 2x_4},$$
and so $\hat\eta_{MLE} = \frac{2x_1 + x_2 + x_3}{2n}$.
28 Example 2: parametric form (contd.)

Once we have $\hat\eta_{MLE}$, we can construct
$$\hat p_{H_0} = \big(\hat\eta_{MLE}^2,\ \hat\eta_{MLE}(1 - \hat\eta_{MLE}),\ \hat\eta_{MLE}(1 - \hat\eta_{MLE}),\ (1 - \hat\eta_{MLE})^2\big)$$
and evaluate $S(X)$. Because $\Theta_0$ has dimension $q = 1$ (only a single number $\eta$ needs to be known), the asymptotic distribution of $S(X)$ is $\chi^2(k - 2)$.

The same calculations carry through for a more general parametric form $H_0: p = (g_1(\eta), \dots, g_k(\eta))$, where $\eta \in E$ is a $q$-dim vector and $g_1(\eta), \dots, g_k(\eta)$ are functions such that for every $\eta \in E$, $(g_1(\eta), \dots, g_k(\eta)) \in \Delta_k$.
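The same recipe applies to the moth data of slide 21, adapting it to that three-category parameterization $(a^2, (1-a)^2, 2a(1-a))$: the likelihood is proportional to $a^{2x_1 + x_3}(1-a)^{2x_2 + x_3}$, so $\hat a = (2x_1 + x_3)/(2n)$, and again $q = 1$. A sketch:

```python
import numpy as np
from scipy.stats import chi2

x = np.array([1469, 5, 138])        # white spotted, little spotted, intermediate
n = x.sum()                         # 1612
# MLE of the allele abundance under the HW parameterization of these 3 categories
a_hat = (2 * x[0] + x[2]) / (2 * n)
p_hat = np.array([a_hat**2, (1 - a_hat)**2, 2 * a_hat * (1 - a_hat)])
expected = n * p_hat
S = np.sum((x - expected) ** 2 / expected)
pval = chi2.sf(S, df=len(x) - 1 - 1)      # k - 1 - q = 3 - 1 - 1 = 1 degree of freedom
print(a_hat, S, pval)
```

The test fails to reject at any conventional level, consistent with HW equilibrium.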
29 Example 3: independence

Consider testing $H_0: p_{ij} = p_{i\cdot}\, p_{\cdot j}$ for all $i, j$. To get $\hat p_{H_0}$, we write the likelihood function in terms of $p_{i\cdot}$, $1 \le i \le k_1$, and $p_{\cdot j}$, $1 \le j \le k_2$:
$$L_x(p_{1\cdot}, \dots, p_{k_1\cdot}, p_{\cdot 1}, \dots, p_{\cdot k_2}) = \text{const.} \prod_{i=1}^{k_1} \prod_{j=1}^{k_2} (p_{i\cdot}\, p_{\cdot j})^{x_{ij}} = \text{const.} \Big\{\prod_{i=1}^{k_1} p_{i\cdot}^{x_{i\cdot}}\Big\} \Big\{\prod_{j=1}^{k_2} p_{\cdot j}^{x_{\cdot j}}\Big\},$$
where $x_{i\cdot} = \sum_{j=1}^{k_2} x_{ij}$ and $x_{\cdot j} = \sum_{i=1}^{k_1} x_{ij}$ are the margin counts of our two-way table.
30 Example 3: independence (contd.)

Because $(p_{1\cdot}, \dots, p_{k_1\cdot}) \in \Delta_{k_1}$ and $(p_{\cdot 1}, \dots, p_{\cdot k_2}) \in \Delta_{k_2}$, the maximizer is given by
$$\hat p_{i\cdot} = \frac{x_{i\cdot}}{n}, \quad \hat p_{\cdot j} = \frac{x_{\cdot j}}{n}, \quad 1 \le i \le k_1,\ 1 \le j \le k_2.$$
And so $\hat p_{H_0}$ has coordinates $\hat p_{H_0, ij} = \frac{x_{i\cdot}\, x_{\cdot j}}{n^2}$, giving
$$S(X) = \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \frac{\big(X_{ij} - \frac{X_{i\cdot} X_{\cdot j}}{n}\big)^2}{\frac{X_{i\cdot} X_{\cdot j}}{n}}.$$
Because $\dim(\Theta_0) = q = k_1 + k_2 - 2$,
$$S(X) \overset{d}{\to} \chi^2(k_1 k_2 - 1 - k_1 - k_2 + 2) = \chi^2\big((k_1 - 1)(k_2 - 1)\big).$$
31 Hair-eye color

Observed counts (expected counts under independence in parentheses):

Hair color \ Eye color    Blue        Green       Brown       Black      Total
Blonde                    20 (13.9)   15 (13.6)   18 (28.4)   14 (11.0)    67
Red                       11 (8.5)     4 (8.3)    24 (17.4)    2 (6.7)     41
Brown                      9 (15.4)   11 (15.1)   36 (31.4)   18 (12.2)    74
Black                      8 (10.2)   17 (10.0)   20 (20.8)    4 (8.1)     49
Total                     48          47          98          38          231

$S(x) = 30.9$, so the p-value is $1 - F_{\chi^2((4-1)(4-1))}(30.9) = 1 - F_{\chi^2(9)}(30.9)$.
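This calculation can be reproduced with `scipy.stats.chi2_contingency`, which computes the same Pearson statistic, expected counts and degrees of freedom from the observed table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# observed counts: hair (rows: Blonde, Red, Brown, Black) by
# eye color (cols: Blue, Green, Brown, Black)
table = np.array([[20, 15, 18, 14],
                  [11,  4, 24,  2],
                  [ 9, 11, 36, 18],
                  [ 8, 17, 20,  4]])
S, pval, dof, expected = chi2_contingency(table, correction=False)
print(round(S, 1), dof, pval)     # S ≈ 30.9 on (4-1)(4-1) = 9 df, tiny p-value
```

The tiny p-value gives strong evidence against independence of hair and eye color.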
Introduction to population genetics & evolution Course Organization Exam dates: Feb 19 March 1st Has everybody registered? Did you get the email with the exam schedule Summer seminar: Hot topics in Bioinformatics
More informationSTA 114: Statistics. Notes 21. Linear Regression
STA 114: Statistics Notes 1. Linear Regression Introduction A most widely used statistical analysis is regression, where one tries to explain a response variable Y by an explanatory variable X based on
More informationStatistics. Statistics
The main aims of statistics 1 1 Choosing a model 2 Estimating its parameter(s) 1 point estimates 2 interval estimates 3 Testing hypotheses Distributions used in statistics: χ 2 n-distribution 2 Let X 1,
More information8. Genetic Diversity
8. Genetic Diversity Many ways to measure the diversity of a population: For any measure of diversity, we expect an estimate to be: when only one kind of object is present; low when >1 kind of objects
More informationReview. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis
Review Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 22 Chapter 1: background Nominal, ordinal, interval data. Distributions: Poisson, binomial,
More informationGeneralized Linear Models Introduction
Generalized Linear Models Introduction Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Generalized Linear Models For many problems, standard linear regression approaches don t work. Sometimes,
More informationName Class Date. Pearson Education, Inc., publishing as Pearson Prentice Hall. 33
Chapter 11 Introduction to Genetics Chapter Vocabulary Review Matching On the lines provided, write the letter of the definition of each term. 1. genetics a. likelihood that something will happen 2. trait
More informationUnit 9: Inferences for Proportions and Count Data
Unit 9: Inferences for Proportions and Count Data Statistics 571: Statistical Methods Ramón V. León 1/15/008 Unit 9 - Stat 571 - Ramón V. León 1 Large Sample Confidence Interval for Proportion ( pˆ p)
More informationOPTIMALITY AND STABILITY OF SYMMETRIC EVOLUTIONARY GAMES WITH APPLICATIONS IN GENETIC SELECTION. (Communicated by Yang Kuang)
MATHEMATICAL BIOSCIENCES doi:10.3934/mbe.2015.12.503 AND ENGINEERING Volume 12, Number 3, June 2015 pp. 503 523 OPTIMALITY AND STABILITY OF SYMMETRIC EVOLUTIONARY GAMES WITH APPLICATIONS IN GENETIC SELECTION
More informationSome General Types of Tests
Some General Types of Tests We may not be able to find a UMP or UMPU test in a given situation. In that case, we may use test of some general class of tests that often have good asymptotic properties.
More informationReview of One-way Tables and SAS
Stat 504, Lecture 7 1 Review of One-way Tables and SAS In-class exercises: Ex1, Ex2, and Ex3 from http://v8doc.sas.com/sashtml/proc/z0146708.htm To calculate p-value for a X 2 or G 2 in SAS: http://v8doc.sas.com/sashtml/lgref/z0245929.htmz0845409
More informationLet us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided
Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or
More informationOne-Sample Numerical Data
One-Sample Numerical Data quantiles, boxplot, histogram, bootstrap confidence intervals, goodness-of-fit tests University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html
More informationParameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn
Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Temperature fluctuations Variance at multipole l (angle ~180o/l) C. Porciani Estimation
More informationf(y θ) = g(t (y) θ)h(y)
EXAM3, FINAL REVIEW (and a review for some of the QUAL problems): No notes will be allowed, but you may bring a calculator. Memorize the pmf or pdf f, E(Y ) and V(Y ) for the following RVs: 1) beta(δ,
More informationCh. 5 Hypothesis Testing
Ch. 5 Hypothesis Testing The current framework of hypothesis testing is largely due to the work of Neyman and Pearson in the late 1920s, early 30s, complementing Fisher s work on estimation. As in estimation,
More informationLecture 01: Introduction
Lecture 01: Introduction Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 01: Introduction
More informationRecall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n
Chapter 9 Hypothesis Testing 9.1 Wald, Rao, and Likelihood Ratio Tests Suppose we wish to test H 0 : θ = θ 0 against H 1 : θ θ 0. The likelihood-based results of Chapter 8 give rise to several possible
More informationAsymptotic Statistics-VI. Changliang Zou
Asymptotic Statistics-VI Changliang Zou Kolmogorov-Smirnov distance Example (Kolmogorov-Smirnov confidence intervals) We know given α (0, 1), there is a well-defined d = d α,n such that, for any continuous
More informationBTRY 4090: Spring 2009 Theory of Statistics
BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)
More information
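The χ²(r) limit of 2 log Λ(X) stated on the last slide can be checked by simulation. Below is a minimal sketch (not part of the original slides, assuming Python with NumPy): for X_i iid Exponential with rate θ₀ and the simple null H₀: θ = θ₀ (so r = 1), rejecting when 2 log Λ exceeds the χ²(1) 0.95 quantile should give size close to 0.05. The exponential example and all variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, n, reps = 2.0, 200, 20000

# reps datasets of size n, each generated under H0: theta = theta0
x = rng.exponential(scale=1.0 / theta0, size=(reps, n))
xbar = x.mean(axis=1)
theta_hat = 1.0 / xbar              # MLE of the exponential rate

# 2 log Lambda = 2 [ l_x(theta_hat) - l_x(theta0) ]
#              = 2 n [ log(theta_hat / theta0) + theta0 * xbar - 1 ]
lr = 2.0 * n * (np.log(theta_hat / theta0) + theta0 * xbar - 1.0)

size = (lr > 3.841).mean()          # 3.841 = chi^2(1) 0.95 quantile
print(round(size, 3))               # should be close to the nominal 0.05
```

The same recipe extends to the Wald and Rao statistics W(X) and S(X): under H₀ all three histograms should line up with the χ²(r) density for large n.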