On mutual information estimation for mixed-pair random variables


November 3, 2018

Aleksandr Beknazaryan, Xin Dang and Hailin Sang [1]

Department of Mathematics, The University of Mississippi, University, MS 38677, USA. E-mail: abeknaza@olemiss.edu, xdang@olemiss.edu, sang@olemiss.edu

Abstract

We study the mutual information estimation for mixed-pair random variables, where one random variable is discrete and the other one is continuous. We develop a kernel method to estimate the mutual information between the two random variables. The estimates enjoy a central limit theorem under some regularity conditions on the distributions. The theoretical results are demonstrated by a simulation study.

Keywords: central limit theorem, entropy, kernel estimation, mixed-pair, mutual information. MSC 2010 subject classification: 62G05, 62G20

1 Introduction

The entropy of a discrete random variable $X$ with countable support $\{x_1, x_2, \ldots\}$ and $p_i = \mathbb{P}(X = x_i)$ is defined to be $H(X) = -\sum_i p_i \log p_i$, and the (differential) entropy of a continuous random variable $Y$ with probability density function $f(y)$ is defined as $H(Y) = -\int f(y) \log f(y)\,dy$. If $d \geq 2$, $H(X)$ or $H(Y)$ is also called the joint entropy of the components in $X$ or $Y$. Entropy is a measure of distribution uncertainty and naturally it has applications in the fields of information theory, statistical classification, pattern recognition and so on.

Let $P_X$, $P_Y$ be probability measures on some arbitrary measure spaces $\mathcal{X}$ and $\mathcal{Y}$ respectively. Let $P_{XY}$ be the joint probability measure on the space $\mathcal{X} \times \mathcal{Y}$. If $P_{XY}$ is absolutely continuous with respect to the product measure $P_X \otimes P_Y$, let $\frac{dP_{XY}}{d(P_X \otimes P_Y)}$ be the Radon-Nikodym derivative. Then the general definition of the mutual information (e.g., [3]) is given by

$$I(X, Y) = \int_{\mathcal{X} \times \mathcal{Y}} \log \frac{dP_{XY}}{d(P_X \otimes P_Y)}\, dP_{XY}. \qquad (1)$$

[1] Corresponding author.
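For illustration, a minimal numerical sketch of the two entropy definitions, assuming NumPy and SciPy (the paper's own numerical work uses Mathematica):

```python
# A minimal numerical illustration of the two entropy definitions above;
# the two-point pmf and the standard normal are arbitrary choices.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Discrete entropy: H(X) = -sum_i p_i log p_i
p = np.array([0.3, 0.7])
H_X = -np.sum(p * np.log(p))
print(H_X)  # 0.6109...

# Differential entropy of N(0,1): H(Y) = -int f(y) log f(y) dy, by quadrature
H_Y, _ = quad(lambda y: -norm.pdf(y) * norm.logpdf(y), -np.inf, np.inf)
print(H_Y, 0.5 * np.log(2 * np.pi * np.e))  # both equal 1.4189...
```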

If two random variables $X$ and $Y$ are either both discrete or both continuous, then the mutual information of $X$ and $Y$ can be expressed in terms of entropies as

$$I(X, Y) = H(X) + H(Y) - H(X, Y). \qquad (2)$$

However, in practice and application we often need to work with a mixture of continuous and discrete random variables. There are several ways to form the mixture: 1) one random variable $X$ is discrete and the other random variable $Y$ is continuous; 2) a random variable $Z$ has both discrete and continuous components, i.e., $Z = X$ with probability $p$ and $Z = Y$ with probability $1 - p$, where $0 < p < 1$, $X$ is a discrete random variable and $Y$ is a continuous random variable; 3) a random vector with each component being discrete, continuous or a mixture as in 2). In [11], the authors extend the definition of the joint entropy to the first type of mixture, i.e., to the pair of random variables where the first random variable is discrete and the second one is continuous. Our goal is to study the mutual information for that case and to provide an estimator of the mutual information from a given i.i.d. sample $\{X_i, Y_i\}_{i=1}^N$. In [3], the authors applied the k-nearest neighbor method to estimate the Radon-Nikodym derivative and, therefore, to estimate the mutual information for all three mixed cases.

In the literature, if the random variables $X$ and $Y$ are either both discrete or both continuous, the estimation of mutual information is usually performed by estimating the three entropies in (2). The estimation of a differential entropy has been well studied. An incomplete list of the related research includes the nearest-neighbor estimator [7], [12], [10]; the kernel estimator [1], [6], [4], [5]; and the orthogonal projection estimator [8], [9]. Basharin [2] studied the plug-in entropy estimator for the finite-value discrete case and obtained the mean, the variance and the central limit theorem of this estimator. Vu, Yu and Kass [13] studied the coverage-adjusted entropy estimator with unobserved values for the infinite-value discrete case.

2 Main results

Consider a random vector $Z = (X, Y)$. We call $Z$ a mixed-pair if $X \in \mathbb{R}$ is a discrete random variable with countable support $\mathcal{X} = \{x_1, x_2, \ldots\}$ while $Y \in \mathbb{R}^d$ is a continuous random variable. Observe that $Z = (X, Y)$ induces measures $\{\mu_1, \mu_2, \ldots\}$ that are absolutely continuous with respect to the Lebesgue measure, where $\mu_i(A) = \mathbb{P}(X = x_i, Y \in A)$ for every Borel set $A$ in $\mathbb{R}^d$. There exists a non-negative function $g(x, y)$ such that $h(x) := \int g(x, y)\,dy$ is the probability mass function on $\mathcal{X}$ and $f(y) := \sum_i g_i(y)$ is the marginal density function of $Y$. Here $g_i(y) = g(x_i, y)$, $i \in \mathbb{N}$. In particular, denote $p_i = h(x_i)$, $i \in \mathbb{N}$. We have that $f_i(y) = \frac{1}{p_i} g_i(y)$ is the probability density function of $Y$ conditioned on $X = x_i$. In [11], the authors gave the following regularity condition on a mixed-pair and then defined the joint entropy of a mixed-pair.

Definition 2.1 (Good mixed-pair). A mixed-pair of random variables $Z = (X, Y)$ is called good if the following condition is satisfied:

$$\int_{\mathcal{X} \times \mathbb{R}^d} |g(x, y) \log g(x, y)|\,dx\,dy = \sum_i \int_{\mathbb{R}^d} |g_i(y) \log g_i(y)|\,dy < \infty.$$

Essentially, we have a good mixed-pair of random variables when, restricted to any of the $X$ values, the conditional differential entropy of $Y$ is well-defined.
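The condition of Definition 2.1 can be checked numerically for a concrete mixed-pair. A minimal sketch, assuming SciPy, for the t-distribution mixture that appears in the first case of the simulation study in Section 3:

```python
# A numerical check of the good mixed-pair condition of Definition 2.1 for the
# mixture p_0 = 0.3 with f_0 = t(3,0,1) and p_1 = 0.7 with f_1 = t(12,0,1).
import numpy as np
from scipy.integrate import quad
from scipy.stats import t

p = [0.3, 0.7]
f = [t(df=3), t(df=12)]  # conditional densities f_i of Y given X = x_i

def integrand(y, i):
    g = p[i] * f[i].pdf(y)     # g_i(y) = p_i f_i(y)
    return abs(g * np.log(g))  # |g_i(y) log g_i(y)|

total = sum(quad(integrand, -np.inf, np.inf, args=(i,))[0] for i in range(2))
print(np.isfinite(total))  # True: the condition holds, so the pair is good
```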

Definition 2.2 (Entropy of a mixed-pair). The entropy of a good mixed-pair random variable is defined by

$$H(Z) = -\int_{\mathcal{X} \times \mathbb{R}^d} g(x, y) \log g(x, y)\,dx\,dy = -\sum_i \int_{\mathbb{R}^d} g_i(y) \log g_i(y)\,dy.$$

As $g_i(y) = p_i f_i(y)$, we have that

$$\begin{aligned}
H(Z) &= -\sum_i \int g_i(y) \log g_i(y)\,dy = -\sum_i \int p_i f_i(y) \log[p_i f_i(y)]\,dy \\
&= -\sum_i p_i \log p_i \int f_i(y)\,dy - \sum_i p_i \int f_i(y) \log f_i(y)\,dy \qquad (3) \\
&= -\sum_i p_i \log p_i - \sum_i p_i \int f_i(y) \log f_i(y)\,dy \\
&= H(X) + \sum_i p_i H(Y \mid X = x_i).
\end{aligned}$$

We take the conventions $0 \log 0 = 0$ and $0 \log (0/0) = 0$. From the general formula (1) of the mutual information, we get that

$$\begin{aligned}
I(X, Y) &= \int_{\mathcal{X} \times \mathbb{R}^d} g(x, y) \log \frac{g(x, y)}{h(x) f(y)}\,dx\,dy = \sum_i \int g_i(y) \log \frac{g_i(y)}{p_i f(y)}\,dy \\
&= \sum_i \int g_i(y) \log g_i(y)\,dy - \sum_i \int g_i(y) \log p_i\,dy - \sum_i \int g_i(y) \log f(y)\,dy \\
&= \sum_i p_i \int f_i(y) \log[p_i f_i(y)]\,dy - \sum_i p_i \log p_i \int f_i(y)\,dy - \int f(y) \log f(y)\,dy \qquad (4) \\
&= \sum_i p_i \log p_i + \sum_i p_i \int f_i(y) \log f_i(y)\,dy - \sum_i p_i \log p_i - \int f(y) \log f(y)\,dy \\
&= -H(Z) + H(X) + H(Y) = H(Y) - \sum_i p_i H(Y \mid X = x_i) := H(Y) - \sum_i I_i.
\end{aligned}$$
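The final identity $I(X, Y) = H(Y) - \sum_i p_i H(Y \mid X = x_i)$ is convenient for computing the true mutual information by quadrature. A minimal sketch, assuming SciPy (the paper's own calculations use Mathematica), for the mixture $p_0 = 0.3$, $f_0 = t(3,0,1)$ and $p_1 = 0.7$, $f_1 = t(12,0,1)$ of Section 3:

```python
# True I(X,Y) via (4): I = H(Y) - sum_i p_i H(Y | X = x_i), by quadrature.
import numpy as np
from scipy.integrate import quad
from scipy.stats import t

p = [0.3, 0.7]
f = [t(df=3), t(df=12)]
fY = lambda y: sum(pi * fi.pdf(y) for pi, fi in zip(p, f))  # marginal f(y)

# H(Y) and the conditional entropies H(Y | X = x_i)
H_Y, _ = quad(lambda y: -fY(y) * np.log(fY(y)), -np.inf, np.inf)
H_cond = [quad(lambda y: -fi.pdf(y) * fi.logpdf(y), -np.inf, np.inf)[0] for fi in f]

I = H_Y - sum(pi * Hi for pi, Hi in zip(p, H_cond))
print(I)  # the true mutual information for this mixed pair
```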

Let $(X, Y), (X_1, Y_1), \ldots, (X_N, Y_N)$ be a random sample drawn from a mixed distribution whose discrete component has support $\{0, 1, \ldots, m\}$, and let $p_i = \mathbb{P}(X = i)$, $0 \leq i \leq m$, with $0 < p_i < 1$, $\sum_{i=0}^m p_i = 1$. Also suppose that the continuous component has pdf $f(y)$. Denote $\hat{p}_i = \sum_{k=1}^N I(X_k = i)/N$, $0 \leq i \leq m$, and let

$$\bar{I}_i = -\hat{p}_i \big[N \hat{p}_i\big]^{-1} \sum_{k=1}^N I(X_k = i) \log f_i(Y_k) = -N^{-1} \sum_{k=1}^N I(X_k = i) \log f_i(Y_k) \qquad (5)$$

and

$$\bar{H}(Y) = -N^{-1} \sum_{k=1}^N \log f(Y_k) \qquad (6)$$

be the estimators of $I_i = p_i H(Y \mid X = i)$, $0 \leq i \leq m$, and $H(Y)$ respectively, where $f_i(y)$ is the probability density function of $Y$ conditioned on $X = i$, $0 \leq i \leq m$. Denote $a = (1, -1, \ldots, -1)'$. Let $\Sigma$ be the covariance matrix of $(\log f(Y), I(X = 0) \log f_0(Y), \ldots, I(X = m) \log f_m(Y))'$.

Theorem 2.1. $a'\Sigma a > 0$ if and only if $X$ and $Y$ are dependent. For the estimator

$$\bar{I}(X, Y) = \bar{H} - \sum_{i=0}^m \bar{I}_i \qquad (7)$$

of $I(X, Y)$ we have that

$$\sqrt{N}\big(\bar{I}(X, Y) - I(X, Y)\big) \Rightarrow N(0, a'\Sigma a) \qquad (8)$$

given that $X$ and $Y$ are dependent. Furthermore, the variance $a'\Sigma a$ can be calculated by

$$\begin{aligned}
a'\Sigma a = {} & \mathrm{var}\big(\log f(Y)\big) + \sum_{i=0}^m p_i E_i[\log f_i(Y)]^2 - \sum_{i=0}^m p_i^2 \big(E_i[\log f_i(Y)]\big)^2 \\
& - 2 \sum_{i=0}^m p_i \big[E_i\big(\log f_i(Y) \log f(Y)\big) - E_i\big(\log f_i(Y)\big) E\big(\log f(Y)\big)\big] \qquad (9) \\
& - 2 \sum_{0 \leq i < j \leq m} p_i p_j \big[E_i \log f_i(Y)\big]\big[E_j \log f_j(Y)\big],
\end{aligned}$$

where $E_i$ is the conditional expectation of $Y$ given $X = i$, $0 \leq i \leq m$.

Proof. First of all, $a'\Sigma a \geq 0$ since $\Sigma$ is the variance-covariance matrix. If $a'\Sigma a = 0$ then

$$\mathrm{var}\Big(\log f(Y) - \sum_{i=0}^m I(X = i) \log f_i(Y)\Big) = a'\Sigma a = 0$$

and $\log f(Y) - \sum_{i=0}^m I(X = i) \log f_i(Y) \equiv C$ for some constant $C$. But

$$\log f(Y) - \sum_{i=0}^m I(X = i) \log f_i(Y) = \sum_{i=0}^m I(X = i) \log \frac{f(Y)}{f_i(Y)}.$$

Hence $\log \frac{f(Y)}{f_i(Y)} \equiv C$. Then $f_i(y) = c f(y)$ for some constant $c > 0$ and for all $0 \leq i \leq m$. But $f(y) = \sum_{i=0}^m p_i f_i(y) = c f(y) \sum_{i=0}^m p_i = c f(y)$. Hence $c \equiv 1$ and $f_i(y) = f(y)$ for all $0 \leq i \leq m$. Then $X$ and $Y$ are independent. On the other hand, if $X$ and $Y$ are independent, then $f_i(y) = f(y)$ for all $0 \leq i \leq m$. Therefore $\log f(Y) - \sum_{i=0}^m I(X = i) \log f_i(Y) \equiv 0$ and $a'\Sigma a = 0$. Hence, $a'\Sigma a = 0$ if and only if $X$ and $Y$ are independent.

Notice that the vector $(\bar{H}(Y), \bar{I}_0, \ldots, \bar{I}_m)'$ is the sample mean of a sequence of i.i.d. random vectors $\{(-\log f(Y_k), -I(X_k = 0) \log f_0(Y_k), \ldots, -I(X_k = m) \log f_m(Y_k))'\}_{k=1}^N$ with mean $(H(Y), I_0, \ldots, I_m)'$. Then, by the central limit theorem, we have

$$\sqrt{N}\left(\begin{pmatrix} \bar{H} \\ \bar{I}_0 \\ \vdots \\ \bar{I}_m \end{pmatrix} - \begin{pmatrix} H \\ I_0 \\ \vdots \\ I_m \end{pmatrix}\right) \Rightarrow N(0, \Sigma),$$

and, given $a'\Sigma a > 0$, we have (8).

By the formula for variance decomposition, we have

$$\begin{aligned}
\mathrm{var}\big(I(X = i) \log f_i(Y)\big) &= E\big\{\mathrm{var}[I(X = i) \log f_i(Y) \mid X]\big\} + \mathrm{var}\big\{E[I(X = i) \log f_i(Y) \mid X]\big\} \\
&= E\big\{I(X = i)\,\mathrm{var}[\log f_i(Y) \mid X]\big\} + \mathrm{var}\big\{I(X = i)\,E[\log f_i(Y) \mid X]\big\} \\
&= E\Big\{I(X = i) \sum_{j=0}^m \mathrm{var}_j\big(\log f_j(Y)\big) I(X = j)\Big\} + \mathrm{var}\Big\{I(X = i) \sum_{j=0}^m E_j\big(\log f_j(Y)\big) I(X = j)\Big\} \\
&= \mathrm{var}_i[\log f_i(Y)]\,E\big\{I(X = i)\big\} + \big(E_i[\log f_i(Y)]\big)^2 \mathrm{var}\big\{I(X = i)\big\} \\
&= p_i\,\mathrm{var}_i[\log f_i(Y)] + (p_i - p_i^2)\big(E_i[\log f_i(Y)]\big)^2 \\
&= p_i E_i[\log f_i(Y)]^2 - p_i^2 \big(E_i[\log f_i(Y)]\big)^2, \qquad (10)
\end{aligned}$$

$0 \leq i \leq m$. Here $\mathrm{var}_i$ is the conditional variance of $Y$ given $X = i$, $0 \leq i \leq m$. By a similar calculation,

$$\mathrm{Cov}\big(I(X = i) \log f_i(Y),\, I(X = j) \log f_j(Y)\big) = -p_i p_j \big[E_i \log f_i(Y)\big]\big[E_j \log f_j(Y)\big] \qquad (11)$$

for all $0 \leq i < j \leq m$, and

$$\mathrm{Cov}\big(I(X = i) \log f_i(Y),\, \log f(Y)\big) = p_i \big[E_i\big(\log f_i(Y) \log f(Y)\big) - E_i\big(\log f_i(Y)\big) E\big(\log f(Y)\big)\big]. \qquad (12)$$

Thus the covariance matrix $\Sigma$ of $(\log f(Y), I(X = 0) \log f_0(Y), \ldots, I(X = m) \log f_m(Y))'$, and therefore $a'\Sigma a$, can be calculated by the above calculations (10)-(12). We then have (9).
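When the densities $f$ and $f_i$ are known, the oracle estimator (5)-(7) is straightforward to compute. A minimal Monte Carlo sketch, assuming NumPy/SciPy and the same t mixture as above:

```python
# The oracle estimator (5)-(7), which assumes the true densities f_i and f.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
p = np.array([0.3, 0.7])
f = [t(df=3), t(df=12)]
N = 5000

X = rng.choice(len(p), size=N, p=p)                    # discrete component
Y = np.array([f[i].rvs(random_state=rng) for i in X])  # Y | X = i ~ f_i

fY = sum(pi * fi.pdf(Y) for pi, fi in zip(p, f))       # marginal density at Y_k
H_bar = -np.mean(np.log(fY))                           # (6)
I_bar_i = [-np.mean((X == i) * f[i].logpdf(Y)) for i in range(len(p))]  # (5)
I_bar = H_bar - sum(I_bar_i)                           # (7)
print(I_bar)
```

Repeating this over many independent samples and comparing $\sqrt{N}(\bar{I}(X, Y) - I(X, Y))$ with $N(0, a'\Sigma a)$ gives an informal check of (8).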

We consider the case when the random variables $X$ and $Y$ are dependent. Note that in this case $a'\Sigma a > 0$ and we have (8). However, $\bar{I}(X, Y)$ is not a practical estimator since the density functions involved are not known. Now let $K(\cdot)$ be a kernel function in $\mathbb{R}^d$ and let $h$ be the bandwidth. Then

$$\hat{f}_{i,k}(y) = \big\{(N \hat{p}_i - 1) h^d\big\}^{-1} \sum_{j \neq k} I(X_j = i) K\{(y - Y_j)/h\}$$

are the leave-one-out estimators of the functions $f_i$, $0 \leq i \leq m$, and

$$\hat{I}_i = -N^{-1} \sum_{k=1}^N I(X_k = i) \log \hat{f}_{i,k}(Y_k) \qquad (13)$$

are estimators of $I_i = p_i H(Y \mid X = i)$, $0 \leq i \leq m$. Also

$$\hat{H} = -N^{-1} \sum_{k=1}^N \log \hat{f}_k(Y_k) \qquad (14)$$

is an estimator of $H(Y)$, where

$$\hat{f}_k(y) = \big\{(N - 1) h^d\big\}^{-1} \sum_{j \neq k} K\{(y - Y_j)/h\} = \big\{(N - 1) h^d\big\}^{-1} \sum_{j \neq k} \Big[\sum_{i=0}^m I(X_j = i)\Big] K\{(y - Y_j)/h\} = \sum_{i=0}^m \frac{N \hat{p}_i - 1}{N - 1}\, \hat{f}_{i,k}(y). \qquad (15)$$
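Equations (13)-(15) translate directly into code. A sketch for $d = 1$, assuming NumPy/SciPy, with the $t(3,0,1)$ density as the heavy-tailed kernel (the choice used in the simulation study of Section 3):

```python
# Leave-one-out kernel estimator built from (13)-(15), for d = 1.
import numpy as np
from scipy.stats import t

def mi_kernel_estimate(X, Y, m, h):
    """Estimate I(X,Y) for a mixed pair with X in {0,...,m} and scalar Y."""
    N = len(Y)
    K = t(df=3).pdf                       # heavy-tailed kernel K
    W = K((Y[:, None] - Y[None, :]) / h)  # W[k, j] = K((Y_k - Y_j)/h), O(N^2)
    np.fill_diagonal(W, 0.0)              # leave-one-out: drop the j = k term
    I_hat_sum = 0.0
    for i in range(m + 1):
        mask = X == i
        Ni = mask.sum()                   # N * \hat p_i
        f_ik = W[np.ix_(mask, mask)].sum(axis=1) / ((Ni - 1) * h)  # \hat f_{i,k}(Y_k)
        I_hat_sum += -np.sum(np.log(f_ik)) / N                     # \hat I_i in (13)
    f_k = W.sum(axis=1) / ((N - 1) * h)   # \hat f_k(Y_k) as in (15)
    H_hat = -np.mean(np.log(f_k))         # \hat H in (14)
    return H_hat - I_hat_sum              # the estimator of (16) below
```

For $d > 1$ the same computation applies with a kernel on $\mathbb{R}^d$ and the normalizing constants $(N\hat{p}_i - 1)h^d$ and $(N - 1)h^d$.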

Theorem 2.2. Assume that the tails of $f_0, \ldots, f_m$ are decreasing like $x^{-\alpha_0}, \ldots, x^{-\alpha_m}$, respectively, as $x \to \infty$. Also assume that the kernel function has appropriately heavy tails as in [4]. If $h = o(N^{-1/8})$ and $\alpha_0, \ldots, \alpha_m$ are all greater than $7/3$ in the case $d = 1$, greater than $6$ in the case $d = 2$ and greater than $15$ in the case $d = 3$, then for the estimator

$$\hat{I}(X, Y) = \hat{H} - \sum_{i=0}^m \hat{I}_i, \qquad (16)$$

we have

$$\sqrt{N}\big(\hat{I}(X, Y) - I(X, Y)\big) \Rightarrow N(0, a'\Sigma a). \qquad (17)$$

Proof. Under the conditions in the theorem, applying formula (3.1) or (3.2) from [5], we have $\hat{H} = \bar{H} + o(N^{-1/2})$, $\hat{I}_0 = \bar{I}_0 + o(N^{-1/2})$, ..., $\hat{I}_m = \bar{I}_m + o(N^{-1/2})$. Together with Theorem 2.1, we have (17).

We may take the probability density function of the Student t distribution with a proper degree of freedom, instead of the normal density function, as the kernel function. On the other hand, if $X$ and $Y$ are independent, then $I(X, Y) = \bar{I}(X, Y) = 0$ and we have that $\hat{I}(X, Y) = o(N^{-1/2})$.

3 Simulation study

In this section we conduct a simulation study with $m = 1$, i.e., the random variable $X$ takes two possible values $0$ and $1$, to confirm the main result stated in (17) for the kernel mutual information estimation of good mixed-pairs.

First we study some one-dimensional examples. Let $t(\nu, \mu, \sigma)$ be the Student t distribution with degree of freedom $\nu$, location parameter $\mu$ and scale parameter $\sigma$, and let $\mathrm{pareto}(x_m, \alpha)$ be the Pareto distribution with density function $f(x) = \alpha x_m^{\alpha} x^{-(\alpha+1)} I(x \geq x_m)$. We study the mixture in the following four cases: 1) $t(3, 0, 1)$ and $t(12, 0, 1)$; 2) $t(3, 0, 1)$ and $t(3, 2, 1)$; 3) $t(3, 0, 1)$ and $t(3, 0, 3)$; 4) $\mathrm{pareto}(1, 2)$ and $\mathrm{pareto}(1, 1)$. For each case, $p_0 = 0.3$ for the first distribution and $p_1 = 0.7$ for the second distribution. The second row of Table 1 lists the Mathematica calculation of the mutual information (MI) as stated in (4) for each case. The third row of Table 1 gives the average of M = 400 estimates based on formula (16). For each estimate, we use the probability density function of the Student t distribution with degree of freedom 3, i.e. $t(3, 0, 1)$, as the kernel function. We also ran the simulation study with kernel functions satisfying the conditions of the main results and obtained similar results. We take $h = N^{-1/5}$ as the bandwidth for the first three cases and $h = N^{-1/5}/24$ for the last case. The data size for each estimate is N = 5,000 in each case. The Pareto distributions $\mathrm{pareto}(1, 2)$ and $\mathrm{pareto}(1, 1)$ have a very dense area to the right of 1. This is the reason that we take a relatively small bandwidth for this case. To apply the kernel method in estimation, one should select an optimal bandwidth based on some criterion, for example, minimizing the mean squared error. It is interesting to investigate the bandwidth selection problem from both the theoretical and the application viewpoints. However, it seems that the study in this direction is very difficult. We leave it as an open question for future study.

It is clear that the average of the estimates matches the true value of the mutual information. We apply Mathematica to calculate the covariance matrix $\Sigma$ of $(\log f(Y), I(X = 0) \log f_0(Y), I(X = 1) \log f_1(Y))'$ and, therefore, the value of $a'\Sigma a$ for each case by formulae (10)-(12) or (9). The values of $a'\Sigma a$ are 0.2189236, 0.03092179, 0.015451 and 0.0274812, respectively, for the four cases. The fourth row of Table 1 lists the values of $(a'\Sigma a / N)^{1/2}$, which serves as the asymptotic approximation of the standard deviation of the estimator $\hat{I}(X, Y)$ in the central limit theorem (17). The last row gives the sample standard deviation of the M = 400 estimates. These two values also have a good match.

mixture              t(3,0,1) and    t(3,0,1) and    t(3,0,1) and    pareto(1,2) and
                     t(12,0,1)       t(3,2,1)        t(3,0,3)        pareto(1,1)
MI                   0.11819         0.20023         0.10263         0.201123
mean of estimates    0.1167391       0.1991132       0.1014199       0.201447
(a'Σa/N)^{1/2}       0.006617        0.0025          0.0018          0.0023
sample sd            0.006616724     0.002345997     0.001819982     0.002349275

Table 1: True value of the mutual information and the mean value of the estimates.

Figure 1: The histograms with kernel density fits of the M = 400 estimates. Top left: t(3,0,1) and t(12,0,1). Top right: t(3,0,1) and t(3,2,1). Bottom left: t(3,0,1) and t(3,0,3). Bottom right: pareto(1,2) and pareto(1,1).
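For reference, the first column of Table 1 can be approximated by the following sketch (NumPy/SciPy assumed); with M = 400 and N = 5,000 the $O(MN^2)$ kernel sums are slow, so smaller M and N are advisable for a quick sanity check:

```python
# One column of the experiment: mixture t(3,0,1) and t(12,0,1), p_0 = 0.3, p_1 = 0.7.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
p0, N, M = 0.3, 5000, 400
h = N ** (-1 / 5)
K = t(df=3).pdf  # t(3,0,1) density as the kernel

def I_hat(X, Y):
    W = K((Y[:, None] - Y[None, :]) / h)
    np.fill_diagonal(W, 0.0)
    H_hat = -np.mean(np.log(W.sum(axis=1) / ((N - 1) * h)))               # (14)-(15)
    s = 0.0
    for i in (0, 1):
        mask = X == i
        f_ik = W[np.ix_(mask, mask)].sum(axis=1) / ((mask.sum() - 1) * h)
        s += -np.sum(np.log(f_ik)) / N                                    # (13)
    return H_hat - s                                                      # (16)

est = []
for _ in range(M):
    X = (rng.random(N) >= p0).astype(int)  # P(X = 0) = 0.3, P(X = 1) = 0.7
    Y = np.where(X == 0, t(df=3).rvs(N, random_state=rng),
                 t(df=12).rvs(N, random_state=rng))
    est.append(I_hat(X, Y))
print(np.mean(est), np.std(est, ddof=1))   # compare with the MI row and sample sd
```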

Figure 2: The Q-Q plots of the M = 400 estimates against the standard normal quantiles. Top left: t(3,0,1) and t(12,0,1). Top right: t(3,0,1) and t(3,2,1). Bottom left: t(3,0,1) and t(3,0,3). Bottom right: pareto(1,2) and pareto(1,1).

Figures 1 and 2 show the histograms with kernel density fits and the normal Q-Q plots of the 400 estimates for each case. It is clear that the values of $\hat{I}(X, Y)$ follow a normal distribution.

We next study two examples in the two-dimensional case. Let $t_\nu(\mu, \Sigma_0)$ be the two-dimensional Student t distribution with degree of freedom $\nu$, mean $\mu$ and shape matrix $\Sigma_0$. We study the mixture in two cases: 1) $t_5(0, I)$ and $t_{25}(0, I)$; 2) $t_5(0, I)$ and $t_5(0, 3I)$. Here $I$ is the identity matrix. For each case, $p_0 = 0.3$ for the first distribution and $p_1 = 0.7$ for the second distribution. Table 2 summarizes M = 200 estimates of the mutual information with $h = N^{-1/5}$ and sample size N = 5,000 for each estimate. We take $t_3(0, I)$ as the kernel function. As in the one-dimensional case, we apply Mathematica to calculate the true value of MI and $(a'\Sigma a / N)^{1/2}$, which is given in formula (9). Figure 3 shows the histograms with kernel density fits and the normal Q-Q plots of the 200 estimates for each example. It is clear that the values of $\hat{I}(X, Y)$ also follow a normal distribution in the two-dimensional case.
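A sketch of a single estimate in the first two-dimensional case, assuming SciPy's multivariate_t (SciPy 1.6 or later) and a sample size reduced below that of the text to keep the $N \times N$ kernel matrix modest:

```python
# One estimate for the mixture t_5(0, I) and t_25(0, I), p_0 = 0.3, p_1 = 0.7, d = 2.
import numpy as np
from scipy.stats import multivariate_t

rng = np.random.default_rng(2)
N, d = 2000, 2
h = N ** (-1 / 5)
K = multivariate_t(loc=np.zeros(d), shape=np.eye(d), df=3).pdf  # t_3(0, I) kernel

X = (rng.random(N) >= 0.3).astype(int)
Y = np.where((X == 0)[:, None],
             multivariate_t(shape=np.eye(d), df=5).rvs(N, random_state=rng),
             multivariate_t(shape=np.eye(d), df=25).rvs(N, random_state=rng))

# Leave-one-out kernel values K((Y_k - Y_j)/h); note the h^d normalization.
W = K((Y[:, None, :] - Y[None, :, :]).reshape(-1, d) / h).reshape(N, N)
np.fill_diagonal(W, 0.0)
H_hat = -np.mean(np.log(W.sum(axis=1) / ((N - 1) * h**d)))
s = 0.0
for i in (0, 1):
    mask = X == i
    f_ik = W[np.ix_(mask, mask)].sum(axis=1) / ((mask.sum() - 1) * h**d)
    s += -np.sum(np.log(f_ik)) / N
print(H_hat - s)  # one realization of the estimator (16) in d = 2
```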

mixture              t_5(0, I) and    t_5(0, I) and
                     t_25(0, I)       t_5(0, 3I)
MI                   0.1158           0.202516
mean of estimates    0.112381         0.2022715
(a'Σa/N)^{1/2}       0.006577826      0.00231299
sample sd            0.008356947      0.002315134

Table 2: True value of the mutual information and the mean value of the estimates.

Figure 3: The histograms and Q-Q plots of the M = 200 estimates. Left: t_5(0, I) and t_25(0, I). Right: t_5(0, I) and t_5(0, 3I).

In summary, the simulation study confirms the central limit theorem as stated in (17).

Acknowledgement

The authors thank the editor and the referees for carefully reading the manuscript and for the suggestions that improved the presentation. This research is supported by the College of Liberal Arts Faculty Grants for Research and Creative Achievement at the University of Mississippi. The research of Hailin Sang is also supported by the Simons Foundation Grant 586789.

References

[1] Ahmad, I. A. and Lin, P. E. 1976. A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Trans. Information Theory. 22, 372-375.

[2] Basharin, G. P. 1959. On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability and Its Applications. 4, 333-336.

[3] Gao, W., Kannan, S., Oh, S. and Viswanath, P. 2017. Estimating mutual information for discrete-continuous mixtures. Advances in Neural Information Processing Systems. 5988-5999.

[4] Hall, P. 1987. On Kullback-Leibler loss and density estimation. Ann. Statist. 15, no. 4, 1491-1519.

[5] Hall, P. and Morton, S. 1993. On the estimation of entropy. Ann. Inst. Statist. Math. 45, 69-88.

[6] Joe, H. 1989. On the estimation of entropy and other functionals of a multivariate density. Ann. Inst. Statist. Math. 41, 683-697.

[7] Kozachenko, L. F. and Leonenko, N. N. 1987. Sample estimate of the entropy of a random vector. Problems of Information Transmission, 23, 95-101.

[8] Laurent, B. 1996. Efficient estimation of integral functionals of a density. Ann. Statist. 24, 659-681.

[9] Laurent, B. 1997. Estimation of integral functionals of a density and its derivatives. Bernoulli 3, 181-211.

[10] Leonenko, N., Pronzato, L. and Savani, V. 2008. A class of Rényi information estimators for multidimensional densities. Ann. Statist. 36, 2153-2182. Corrections, Ann. Statist. 38 (2010), 3837-3838.

[11] Nair, C., Prabhakar, B. and Shah, D. On entropy for mixtures of discrete and continuous variables. arXiv:cs/0607075.

[12] Tsybakov, A. B. and van der Meulen, E. C. 1996. Root-n consistent estimators of entropy for densities with unbounded support. Scand. J. Statist., 23, 75-83.

[13] Vu, V. Q., Yu, B. and Kass, R. E. 2007. Coverage-adjusted entropy estimation. Statist. Med., 26, 4039-4060.