arxiv: v2 [stat.me] 26 Jun 2012


The Two-Way Likelihood Ratio (G) Test and Comparison to Two-Way χ² Test

Jesse Hoey

June 7, 2012

1 One-Way Likelihood Ratio or χ² test

Suppose we have a set of data $x$ and two hypotheses $H_R$ and $H_S$. We wish to know which hypothesis explains the data better. To do this, we compute the likelihood ratio

$$\log\left(\frac{P(x \mid H_R)}{P(x \mid H_S)}\right).$$

Assuming the data are i.i.d. given each hypothesis, we have $P(x \mid H_J) = \prod_i P(x_i \mid H_J)$, where $J \in \{R, S\}$, and thus the likelihood ratio is

$$L = \sum_i \log\left(\frac{P(x_i \mid H_R)}{P(x_i \mid H_S)}\right) \tag{1}$$

The Bayesian formulation of the problem could be approached by parameterising $H_R$ and $H_S$ with some unknown parameters, $\theta_R$ and $\theta_S$, respectively. The likelihood of the data under each hypothesis is then given by integrating the likelihoods over all possible parameter values:

$$L = \log\left(\int P(\theta_R \mid H_R)\, P(x \mid \theta_R, H_R)\, d\theta_R\right) - \log\left(\int P(\theta_S \mid H_S)\, P(x \mid \theta_S, H_S)\, d\theta_S\right) \tag{2}$$

These integrations can sometimes be performed analytically, or using some numerical integration technique. However, we will focus instead on a simple heuristic method which is related to the $\chi^2$ statistics discussed below. Note that David MacKay [3] explicitly assumes the parameters have an intrinsic arity to them (multinomials with an intrinsic number of bins). This assumption may not always be correct, and may in fact lead to incorrect conclusions.

Now suppose that the hypotheses are multinomial probability distributions $H_R = \{r_1, \ldots, r_N\}$, with the constraint that $\sum_i r_i = 1$, where each $r_i$ corresponds to some range (bin) of the data $x$ (and similarly we have $s_i$ for $H_S$). Then the likelihood ratio can be written as a sum over the $N$ bins by grouping the terms in Equation 1 into the bins:

$$L = \sum_{i=1}^{N} F_i \log\left(\frac{r_i}{s_i}\right)$$

where $F_i$ is the number of data points that fall into bin $i$. The equivalent chi-squared test is to compute the $\chi^2$ statistic for each hypothesis,

$$\chi^2_R = \sum_i \frac{(F_i - r_i N)^2}{r_i N} \qquad \chi^2_S = \sum_i \frac{(F_i - s_i N)^2}{s_i N}$$

and compare them, choosing the one with the smaller $\chi^2$. In these two expressions the product $r_i N$ (or $s_i N$) is the expected count in bin $i$, so $N$ there stands for the total number of data points, $\sum_i F_i$. David MacKay argues effectively for the use of the likelihood ratio [3]. We will see in more detail the conditions in which the chi-squared test is not applicable in Section 4.
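The one-way comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bin counts and the two candidate multinomials are made-up values, the function names are hypothetical, and every bin probability is assumed to be strictly positive so that the logarithm and the division are defined.

```python
import math

def one_way_log_likelihood_ratio(counts, r, s):
    """L = sum_i F_i * log(r_i / s_i); positive values favour H_R over H_S."""
    return sum(f * math.log(ri / si) for f, ri, si in zip(counts, r, s) if f > 0)

def one_way_chi_squared(counts, probs):
    """chi^2 = sum_i (F_i - q_i * n)^2 / (q_i * n), with n the total count."""
    n = sum(counts)
    return sum((f - q * n) ** 2 / (q * n) for f, q in zip(counts, probs))

# Hypothetical data: bin counts F_i and two candidate multinomials.
counts = [18, 22, 60]
r = [0.2, 0.2, 0.6]            # hypothesis H_R (assumed values)
s = [1 / 3, 1 / 3, 1 / 3]      # hypothesis H_S (assumed values)

L = one_way_log_likelihood_ratio(counts, r, s)
chi2_r = one_way_chi_squared(counts, r)
chi2_s = one_way_chi_squared(counts, s)

print("log likelihood ratio L:", L)
print("chi-squared under H_R:", chi2_r)
print("chi-squared under H_S:", chi2_s)
```

For these counts both criteria point the same way: $L$ is positive and $\chi^2_R$ is the smaller of the two statistics.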

2 Two-Way Likelihood Ratio Test

If we wish to compare two sets of data, $x_R$ and $x_S$, and ask whether they are drawn from the same distribution or from two different distributions, then our first hypothesis is that there are two models $H_R$ and $H_S$ to explain the data, and the second hypothesis is that there is a single model $H_{R+S}$ that explains the data. Thus, the question can be formulated as the likelihood ratio

$$L = \log\left(\frac{P(x_R, x_S \mid H_R, H_S)}{P(x_R, x_S \mid H_{R+S})}\right) = \log\left(\frac{P(x_R \mid H_R)}{P(x_R \mid H_{R+S})}\right) + \log\left(\frac{P(x_S \mid H_S)}{P(x_S \mid H_{R+S})}\right) \tag{3}$$

where we have made the assumption that $x_R$ is independent of $H_S$ (and vice-versa) if the two distributions are different, and that $x_R$ is independent of $x_S$ given $H_{R+S}$ if the two distributions are the same, both of which are true given the i.i.d. assumption of data given hypotheses.

The Bayesian formulation of the problem is to parameterise $H_R$, $H_S$ and $H_{R+S}$ with some unknown parameters, $\theta_R$, $\theta_S$ and $\theta_{R+S}$, respectively. The likelihoods in (3) are then given by integrating over all possible parameter values:

$$L = \log\left(\frac{\int\!\!\int P(\theta_R, \theta_S \mid H_R, H_S)\, P(x_R, x_S \mid \theta_R, \theta_S, H_R, H_S)\, d\theta_R\, d\theta_S}{\int P(\theta_{R+S} \mid H_{R+S})\, P(x_R, x_S \mid \theta_{R+S}, H_{R+S})\, d\theta_{R+S}}\right) \tag{4}$$

These integrations can sometimes be performed analytically, or using some numerical integration technique. However, in this note, we will use the most likely estimate for the parameters, given the data. This simple method is related to the $\chi^2$ statistics discussed above, but we will see some limitations of it in Section 4.

We can estimate the parameters of $H_R$ directly from the data, as the most likely estimate using a multinomial with values $r_i = R_i/R$, with $R_i$ being the number of data points in $x_R$ that fall into bin $i$, and $R = \sum_i R_i$. Similarly, $H_S$ is a multinomial with $s_i = S_i/S$ and $S = \sum_i S_i$. Finally, we can estimate $H_{R+S}$ in the same way given both datasets, to give a multinomial with values $(R_i + S_i)/(R + S)$.
Using the same transformation (from data to bins) as above, the likelihood ratio becomes

$$L = \sum_{\text{bins}} R_i \log\left(\frac{R_i/R}{(R_i + S_i)/(R + S)}\right) + \sum_{\text{bins}} S_i \log\left(\frac{S_i/S}{(R_i + S_i)/(R + S)}\right) \tag{5}$$

which is simply the weighted sum of the Kullback-Leibler divergences of the two datasets from the average distribution,

$$L = R\, D_{KL}(r \,\|\, p) + S\, D_{KL}(s \,\|\, p)$$

where $p_i = \frac{R_i + S_i}{R + S}$ is the probability of a data point falling in bin $i$, estimated from both sets of data. It is also a symmetrised relative entropy measure comparing the data to its own distribution (e.g. $R_i$ to $R_i/R$) and to the average distribution of both sets of data ($(R_i + S_i)/(R + S)$). We can see this better by expanding out the logs of fractions as differences of logs and cancelling terms to obtain

$$L = \sum_i \left[ R_i \log\left(\frac{R_i}{R}\right) + S_i \log\left(\frac{S_i}{S}\right) - (R_i + S_i) \log\left(\frac{R_i + S_i}{R + S}\right) \right]$$

or

$$L = R \sum_i r_i \log(r_i) + S \sum_i s_i \log(s_i) - (R + S) \sum_i p_i \log(p_i).$$

The first term is the (negative) entropy of the distribution $r$ (scaled by the number of data points), the second is the negative entropy of $s$, and the third is the entropy of the joint distribution. Denoting by $\gamma_r$, $\gamma_s$, $\gamma_p$ the entropies of $r$, $s$ and $p$, respectively, we have

$$L = -\left[ R\gamma_r + S\gamma_s - (R + S)\gamma_p \right] \tag{6}$$

$$\phantom{L} = -(R + S)\left[ \frac{R}{R + S}\gamma_r + \frac{S}{R + S}\gamma_s - \gamma_p \right]. \tag{7}$$
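As a numerical check on the algebra above, the following sketch computes $L$ in three ways: as the sum over bins, as the weighted sum of Kullback-Leibler divergences, and in the entropy form of Equation 6. The counts are made-up values and the function names are hypothetical; natural logarithms are assumed throughout, and empty bins are skipped using the convention $0 \log 0 = 0$.

```python
import math

def entropy(q):
    """Entropy -sum_i q_i log(q_i), with 0 log 0 taken as 0."""
    return -sum(x * math.log(x) for x in q if x > 0)

def kl_divergence(q, p):
    """D_KL(q || p) = sum_i q_i log(q_i / p_i), skipping bins where q_i = 0."""
    return sum(x * math.log(x / y) for x, y in zip(q, p) if x > 0)

def two_way_likelihood_ratio(r_counts, s_counts):
    """Return L computed via the bin sum, the KL form, and the entropy form."""
    R, S = sum(r_counts), sum(s_counts)
    r = [c / R for c in r_counts]
    s = [c / S for c in s_counts]
    p = [(a + b) / (R + S) for a, b in zip(r_counts, s_counts)]

    by_bins = (
        sum(a * math.log(ri / pi) for a, ri, pi in zip(r_counts, r, p) if a > 0)
        + sum(b * math.log(si / pi) for b, si, pi in zip(s_counts, s, p) if b > 0)
    )
    by_kl = R * kl_divergence(r, p) + S * kl_divergence(s, p)
    by_entropy = -(R * entropy(r) + S * entropy(s) - (R + S) * entropy(p))
    return by_bins, by_kl, by_entropy

# Hypothetical bin counts for the two datasets.
r_counts = [10, 20, 30]
s_counts = [25, 15, 5]
by_bins, by_kl, by_entropy = two_way_likelihood_ratio(r_counts, s_counts)
print(by_bins, by_kl, by_entropy)
```

The three values agree up to floating-point rounding, and two datasets with identical counts give a value of zero.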

Here the entropy is $\gamma(x) = -\sum_i x_i \log(x_i)$. Equation 6 can be understood by noting that if the two distributions $H_R$ and $H_S$ are the same, then averaging them will make no difference to the entropy of the distributions. If, on the other hand, $H_R$ and $H_S$ are different, then the average of the two will have higher entropy. Thus, $\gamma_p$ will be larger if the distributions are different, making $L$ also larger (due to the negative sign), which is what we expect from the original definition of the likelihood ratio for the two-way problem as given in (3).

More precisely, the weighted average of the entropies of any two probability distributions is never greater than the entropy of their weighted average. To show this, note that the entropy $\gamma(x) = -\sum_i x_i \log(x_i)$ is a concave function, meaning every point on every chord lies on or below the function [1], so that

$$\alpha\gamma(r) + \beta\gamma(s) \le \gamma(\alpha r + \beta s)$$

where $\alpha + \beta = 1$, and equality is achieved when $r = s$. By induction, this is true even for a weighted sum:

$$-\alpha \sum_i r_i \log(r_i) - \beta \sum_i s_i \log(s_i) \le -\sum_i (\alpha r_i + \beta s_i) \log(\alpha r_i + \beta s_i) \tag{8}$$

If we use $\alpha = \frac{R}{R+S}$ and $\beta = \frac{S}{R+S}$, then $p = \alpha r + \beta s$, and Equation (8) says that the square bracket in Equation (7) is never positive, so that $L \ge 0$. The extreme cases are

1. $r$ and $s$ are identical, then $L = 0$.
2. $r_i = 0$ for all $i$ where $s_i > 0$, and $s_i = 0$ for all $i$ where $r_i > 0$. In this case, either $r_i$ or $s_i$ is zero in every bin, and

$$L = -(R + S)\left[ \alpha \log(\alpha) \sum_i r_i + \beta \log(\beta) \sum_i s_i \right] = -(R + S)\left[ \alpha \log(\alpha) + \beta \log(\beta) \right]$$

Since $\alpha + \beta = 1$, this function has a maximum of $(R + S)\log 2$ at $\alpha = 0.5$ (that is, $R + S$ when logarithms are taken to base 2), and a minimum of 0 at $\alpha = 1$ or $0$. Thus, we can see that $0 \le L \le (R + S)\log 2$, with the minimum achieved for identical distributions, and the maximum achieved for maximally different distributions drawn from datasets of equal size.

3 Two-Way χ² test

If instead we use the two-way $\chi^2$ test, we compute the expected counts, which are based on the average distribution of the two datasets.
Since $p_i = \frac{R_i + S_i}{R + S}$ is the average distribution given both sets of data, we have the expected counts in bin $i$ for the two datasets as

$$E_R(i) = R\,\frac{R_i + S_i}{R + S} \qquad E_S(i) = S\,\frac{R_i + S_i}{R + S} \tag{9}$$

In many treatments of this problem, particularly in the biological sciences, the bins $i \in \{1, \ldots, N\}$ are referred to as the rows and the datasets $\{R, S\}$ are referred to as the columns in a contingency table. Typically, the rows are a set of features of the data, and the columns are two different datasets, usually obtained in two different conditions. To answer the question of whether the two datasets are drawn from the same hypothesis or not, we formulate the null hypothesis, which states that they are, and then figure out the expected counts as above. The chi-squared statistic for the two sets of data is

$$\chi^2 = \sum_{J \in \{R, S\}} \sum_{i=1}^{N} \frac{(J_i - E_J(i))^2}{E_J(i)} = \sum_{i=1}^{N} \frac{(R_i - E_R(i))^2}{E_R(i)} + \sum_{i=1}^{N} \frac{(S_i - E_S(i))^2}{E_S(i)}$$

Putting in the definitions of the expected counts from (9) above, and doing some algebra, we get

$$\chi^2 = \sum_i \frac{\left(\sqrt{S/R}\,R_i - \sqrt{R/S}\,S_i\right)^2}{R_i + S_i}$$

exactly equation (…) in [4]. This value of $\chi^2$, if large, tells us that the null hypothesis can be rejected, and thus that the distributions are likely to be different. To know what "large" means, we can use a chi-squared probability test, which gives us the probability that the sum of the squares of $\nu$ random normal variables of unit variance and zero mean will be greater than $\chi^2$ [4]. Another way to say this is the probability that a particular value of $\chi^2$ would have occurred by chance if the null hypothesis was correct. The chi-squared probability test is therefore simply the integral of the probability density of the $\chi^2$ distribution:

$$P(\chi^2 \mid \nu) = Q\left(\frac{\nu}{2}, \frac{\chi^2}{2}\right) = \frac{\Gamma\left(\frac{\nu}{2}, \frac{\chi^2}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}$$

The number of degrees of freedom in the hypotheses is $\nu$. If the two datasets are drawn without regard for each other (no constraints on the number of data points drawn), then the number of degrees of freedom, $\nu$, is the number of bins in which one of the datasets has at least one count. Typically, if $P(\chi^2 \mid \nu) < 0.05$ (the p-value), the chi-squared test is deemed significant, and the null hypothesis can be safely rejected. A simple test that can be used is to reject the null hypothesis if $\chi^2 > \nu$ [4] (p. …).

4 One- and Two-Way G-test

Interestingly, the likelihood ratio can be more formally related to the $\chi^2$ test, by considering the G-test, defined as [5]

$$G = 2\sum_i O_i \log(O_i/E_i)$$

where $O_i$ is the observed count and $E_i$ is the expected count in bin $i$. Note that this is simply the Kullback-Leibler divergence between observed and expected counts, multiplied by a factor of two. When summed over all data points in our two-column example, this is

$$G = 2\sum_i R_i \log\left(\frac{R_i}{E_R(i)}\right) + 2\sum_i S_i \log\left(\frac{S_i}{E_S(i)}\right) \tag{10}$$

Putting in the expressions for the expected counts from (9) above, we obtain exactly $G = 2L$, with $L$ given by Equation (5) above. In general, with smaller amounts of data, the chi-squared test will sometimes give incorrect answers, whereas the G-test will not, and so the G-test is the recommended test [3, 5].
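The two-way statistics defined so far can be sketched as follows. This is an illustration, not a reference implementation: the counts are made-up values, the function names are hypothetical, and natural logarithms are assumed. The sketch computes the expected counts of Equation (9), the $\chi^2$ statistic both from its definition and from the simplified form, and the G statistic of Equation (10).

```python
import math

def expected_counts(r_counts, s_counts):
    """Expected counts E_R(i) and E_S(i) under the null hypothesis."""
    R, S = sum(r_counts), sum(s_counts)
    pooled = [(a + b) / (R + S) for a, b in zip(r_counts, s_counts)]
    return [R * p for p in pooled], [S * p for p in pooled]

def chi_squared(r_counts, s_counts):
    """Two-way chi-squared from observed and expected counts."""
    e_r, e_s = expected_counts(r_counts, s_counts)
    pairs = list(zip(r_counts, e_r)) + list(zip(s_counts, e_s))
    return sum((o - e) ** 2 / e for o, e in pairs if e > 0)

def chi_squared_simplified(r_counts, s_counts):
    """Same statistic in the form sum_i (sqrt(S/R) R_i - sqrt(R/S) S_i)^2 / (R_i + S_i)."""
    R, S = sum(r_counts), sum(s_counts)
    a, b = math.sqrt(S / R), math.sqrt(R / S)
    return sum((a * ri - b * si) ** 2 / (ri + si)
               for ri, si in zip(r_counts, s_counts) if ri + si > 0)

def g_statistic(r_counts, s_counts):
    """Two-way G = 2 * sum O log(O / E), skipping empty cells (0 log 0 = 0)."""
    e_r, e_s = expected_counts(r_counts, s_counts)
    pairs = list(zip(r_counts, e_r)) + list(zip(s_counts, e_s))
    return 2.0 * sum(o * math.log(o / e) for o, e in pairs if o > 0)

# Hypothetical bin counts for the two datasets.
r_counts = [10, 20, 30]
s_counts = [25, 15, 5]
print("chi-squared (definition):", chi_squared(r_counts, s_counts))
print("chi-squared (simplified):", chi_squared_simplified(r_counts, s_counts))
print("G statistic:", g_statistic(r_counts, s_counts))
```

Two equal-sized datasets with no bins in common, such as `[5, 0]` and `[0, 5]`, give $G = 2(R + S)\log 2$, while identical datasets give $G = 0$.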
To see in more detail why this is so, we can write $O_i = E_i + \delta_i$, with $\sum_i \delta_i = 0$ so that the total number of counts stays the same. The G-test is then

$$G = 2\sum_i (E_i + \delta_i) \log\left(1 + \frac{\delta_i}{E_i}\right).$$

If we Taylor expand this around $\delta_i/E_i = 0$ (the point at which $O_i$ and $E_i$ agree), using $\log(1 + x) \approx x - \frac{x^2}{2} + O(x^3)$, we get

$$G \approx 2\sum_i (E_i + \delta_i)\left(\frac{\delta_i}{E_i} - \frac{1}{2}\frac{\delta_i^2}{E_i^2} + O(\delta_i^3)\right) = 2\sum_i \delta_i + \sum_i \frac{\delta_i^2}{E_i} + O(\delta_i^3) \approx \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where the term $2\sum_i \delta_i$ vanishes because the $\delta_i$ sum to zero. And so, we see that $G \approx \chi^2$ when $O$ is close to $E$. However, the more $O$ and $E$ differ, the less well this approximation will work, and $\chi^2$ will tend to compute erroneous answers. The effects of a single outlier in a small sample set will be more pronounced, which explains why the $\chi^2$ test often fails in situations with little data. This is the same reason why a linear regression can fail with little data, due to the strong effects of outliers.

Since the $\chi^2$ value is just an approximation to the G-value, the G-value can also be used in the chi-squared probability test. This method is recommended by most texts on statistics for the biological sciences. However, it is unclear why one would want to do this, and what the validity is, since the chi-squared test is based on the pdf of $\chi^2$. The G-test directly gives twice the log of the likelihood ratio of one hypothesis vs. the other, and so a significance can be attributed directly.

However, recall that these tests are both based on models or hypotheses whose parameters are derived from the data itself. Instead of computing Equation (4) directly, as we should do, we are taking the most likely estimate of the parameters $\theta_R$, $\theta_S$ and $\theta_{R+S}$ (those derived directly from the data), and collapsing the integrals to these point estimates. One implication of this is that the G-values will depend on the complexity of our models (e.g. the number of bins in our multinomials/histograms). This is simply the model overfitting the data: the models derived from each data set $R$ and $S$ will, with enough complexity, perfectly fit the data. Therefore, to interpret the G-value from Equation (10), we must take the complexity of the model into account. To evaluate significance, the value of the likelihood ratio ($G/2$) should be compared to the number of degrees of freedom, $\nu$. If $G > 2\nu$, then the null hypothesis can be safely rejected.
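The quality of the approximation $G \approx \chi^2$ derived above can be illustrated numerically. The sketch below uses made-up expected and observed counts (the names `small_dev` and `large_dev` are hypothetical) chosen so that the deviations $\delta_i$ sum to zero in both cases; it compares the two statistics when the observations are close to the expectations and when they are far from them.

```python
import math

def g_stat(observed, expected):
    """G = 2 * sum O log(O / E), skipping empty cells (0 log 0 = 0)."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

def chi2_stat(observed, expected):
    """chi^2 = sum (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def relative_gap(observed, expected):
    """Relative difference |G - chi^2| / chi^2."""
    g, c = g_stat(observed, expected), chi2_stat(observed, expected)
    return abs(g - c) / c

expected = [25.0, 25.0, 25.0, 25.0]
small_dev = [26.0, 24.0, 25.5, 24.5]   # deltas: +1, -1, +0.5, -0.5
large_dev = [70.0, 2.0, 20.0, 8.0]     # deltas: +45, -23, -5, -17

gap_small = relative_gap(small_dev, expected)
gap_large = relative_gap(large_dev, expected)
print("relative gap, small deviations:", gap_small)
print("relative gap, large deviations:", gap_large)
```

With small deviations the two statistics are almost indistinguishable, while with large deviations the higher-order terms dropped in the expansion produce a visibly larger gap.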
This rejection rule corresponds roughly to a $p <$ ….

5 Likelihood ratio tests for dynamic models

In the previous sections, we assumed the data were i.i.d., and that the models (hypotheses) were simple multinomials. It is also possible that the data are sequentially dependent, such as when they come from a dynamic model. For example, if the data arise from a hidden Markov model, then the same considerations apply as above. For any type of model $H_J$, $J \in \{R, S, R+S\}$, trained on the data in $J$, we can compute each of $P(x_R \mid H_R)$, $P(x_S \mid H_S)$, $P(x_R \mid H_{R+S})$ and $P(x_S \mid H_{R+S})$, and then use Equation (3) to compute the likelihood ratio, and use a chi-squared probability test as usual. If the $H$ are hidden Markov models, then the likelihoods will be computed using the standard forward equations [2].

Acknowledgements

Thanks to Chris Williams for explaining the factor of 2 in $G$ and its relationship to $\chi^2$, to Stephen McKenna for pointing to the Bayesian solution for the problem of integrating over all parameters, which resolves the issue of why a significance test is necessary, and to Olivia Stevenson for pointing out the possibility for emotional creativity.

References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[2] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data using the EM algorithm. Journal of the Royal Statistical Society, 39(B):1-38.
[3] David J.C. MacKay. Bayes or chi-squared? or does it not matter?, 2005.
[4] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C. Cambridge University Press, 2 edition, 1992.
[5] Robert R. Sokal and F. James Rohlf. Biometry: The Principles and Practices of Statistics in Biological Research. W.H. Freeman, 3 edition.
