Lecture 17: Minimax estimation of high-dimensional functionals
EE378A Statistical Signal Processing, Lecture 17 - 05/29/2017

Lecture 17: Minimax estimation of high-dimensional functionals

Lecturer: Jiantao Jiao. Scribe: Jonathan Lacotte.

1 Estimating the fundamental limit is easier than achieving it: other loss functions

We emphasize that it is a general phenomenon that estimating the fundamental limit is easier than achieving it. Recall the definitions of the two problems:

Achieving the fundamental limit: inf_{\hat{X}(X_1, ..., X_n)} sup_{P_X \in \mathcal{P}} ( E_P[\Lambda(X, \hat{X})] - U(P_X) )

Estimating the fundamental limit: inf_{\hat{U}(X_1, ..., X_n)} sup_{P_X \in \mathcal{P}} E_P |\hat{U} - U(P_X)|

Here we observe n i.i.d. samples X_1, X_2, ..., X_n \in \mathcal{X} with distribution P_X \in \mathcal{P}, where \mathcal{P} is a collection of probability distributions on \mathcal{X}. The quantity U(P_X) = min_{\hat{x}} E_P[\Lambda(X, \hat{x})] is the Bayes envelope. In the last lecture, we showed that in the case of the log loss \Lambda(X, \hat{x}) = \Lambda(X, \hat{P}) = log(1/\hat{P}(X)) and \mathcal{P} = \mathcal{M}(\mathcal{X}), the set of all distributions on \mathcal{X}, it takes n \gg |\mathcal{X}| samples to achieve the fundamental limit, but only n \gg |\mathcal{X}| / \ln |\mathcal{X}| samples to estimate it.

In this lecture we show that a similar phenomenon happens for another widely used loss function: the Hamming loss. Suppose X = (Y, Z), where Y \in \mathcal{Y} with |\mathcal{Y}| = k, and Z \in {0, 1}. One may interpret Y as the feature and Z as the label, casting this as a binary classification problem. Consider the Hamming loss \Lambda(x, t) = 1{t(Y) \neq Z}. The Bayes envelope in this case is given by

U(P_X) = E[min(\eta(Y), 1 - \eta(Y))],

where \eta(y) = P[Z = 1 | Y = y].

Theorem 1. Consider \mathcal{P} = {P_X : P_Z(0) = 1/2}. Then it requires n \gg k samples to achieve U(P_X), and only n \gg k / \log k samples to estimate U(P_X).

2 Boosting the Chow-Liu algorithm with improved mutual information estimates

Graphical models provide us with efficient computational tools to conduct inference on high-dimensional data with potential structure; cf. [1] and the references therein. Learning the structure and parameters of graphical models from empirical data is therefore the starting point for all these applications. It is known that exact learning of a general graphical model is NP-hard [2], but there exist tractable sub-classes, among which tree graphical models are the most famous.
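To make the Hamming-loss quantities concrete, here is a minimal sketch (our own, with hypothetical function names, not code from the lecture) that computes the Bayes envelope U(P_X) = E[min(\eta(Y), 1 - \eta(Y))] from a joint pmf of (Y, Z), together with the plug-in estimate obtained from samples:

```python
import numpy as np

def bayes_envelope_hamming(p_yz):
    """Bayes envelope U(P_X) = E[min(eta(Y), 1 - eta(Y))] under Hamming loss.

    p_yz: (k, 2) array, joint pmf of (Y, Z) with Y in {0..k-1}, Z in {0, 1}.
    """
    p_y = p_yz.sum(axis=1)                       # marginal of Y
    # eta(y) = P[Z = 1 | Y = y]; skip zero-probability symbols
    eta = np.divide(p_yz[:, 1], p_y, out=np.zeros_like(p_y), where=p_y > 0)
    return float(np.sum(p_y * np.minimum(eta, 1.0 - eta)))

def plug_in_envelope(ys, zs, k):
    """Plug-in estimate: the Bayes envelope of the empirical joint pmf."""
    counts = np.zeros((k, 2))
    for y, z in zip(ys, zs):
        counts[y, z] += 1
    return bayes_envelope_hamming(counts / counts.sum())
```

For instance, if Z is independent of Y with P(Z = 1) = 1/2, then eta(y) = 1/2 for every y and the envelope equals 1/2 (no feature helps); if Z is a deterministic function of Y, the envelope is 0.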
The seminal work of Chow and Liu [3] contributed an efficient algorithm to compute the Maximum Likelihood Estimator (MLE) of a tree-structured graphical model from empirical data, and constitutes one of the very few cases where the exact MLE can be computed efficiently. There are various approaches towards learning more complex structures, for which we refer the reader to [4] for a review.
Concretely, the Chow-Liu algorithm (CL) addresses the following question. Given n i.i.d. samples of a random vector X = (X_1, X_2, ..., X_d), where X_i \in \mathcal{X}, |\mathcal{X}| < \infty, we want to estimate the joint distribution of X. Chow and Liu [3] assumed that P_X can be factorized as

P_X = \prod_{i=1}^{d} P_{X_{m_i} | X_{m_{j(i)}}},   0 \le j(i) < i,    (1)

where (m_1, m_2, ..., m_d) is an unknown permutation of the integers 1, 2, ..., d, and P_{X_i | X_0} is by definition equal to P_{X_i}. Then, CL outputs the distribution P_X of this form that maximizes the likelihood of the observed data. Interestingly, this optimization problem can be solved efficiently after being transformed into a Maximum Weight Spanning Tree (MWST) problem, which can be solved using the Prim or Kruskal algorithm. In particular, they showed that the MLE of the tree structure boils down to the following expression:

E_{ML} = argmax_{E_Q : Q is a tree} \sum_{e \in E_Q} I(\hat{P}_e),    (2)

where I(\hat{P}_e) is the mutual information associated with the empirical distribution of the two nodes connected by edge e, and E_Q is the set of edges of a tree distribution Q (i.e., Q factors as a tree). In words, it suffices to first compute the empirical mutual information between any two nodes (in total (d choose 2) pairs); the maximum weight spanning tree is then the tree structure that maximizes the likelihood. To obtain estimates of the distributions on each edge, Chow and Liu [3] simply assigned the empirical distributions while picking an arbitrary node as the tree root.

We begin by asking the following natural question:

Question 2. Is the Chow-Liu algorithm optimal for learning tree graphical models?

Since the Chow-Liu algorithm exactly solves the MLE, and has been widely used in many applications, its optimality seems to be tacitly assumed in much of the literature. However, a closer inspection of the statistical theory [5, 6] reveals that the Chow-Liu algorithm is only known to perform essentially optimally when the number of samples grows to infinity while the alphabet size of the tree stays fixed.
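The pipeline in (2), namely empirical mutual information for all (d choose 2) pairs followed by a maximum-weight spanning tree via Prim's algorithm, can be sketched as follows (a minimal sketch with our own function names, not the original implementation):

```python
import numpy as np
from itertools import combinations

def empirical_mi(x1, x2, alphabet_size):
    """Mutual information (in nats) of the empirical joint pmf of two columns."""
    joint = np.zeros((alphabet_size, alphabet_size))
    for a, b in zip(x1, x2):
        joint[a, b] += 1
    joint /= joint.sum()
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0                                   # 0 log 0 = 0 convention
    return float(np.sum(joint[nz] * np.log(joint[nz] / np.outer(p1, p2)[nz])))

def chow_liu_edges(data, alphabet_size):
    """Chow-Liu structure: maximum-weight spanning tree over empirical MI weights.

    data: (n, d) integer array of n i.i.d. samples of X = (X_1, ..., X_d).
    Returns a set of d - 1 undirected edges (i, j) with i < j.
    """
    n, d = data.shape
    w = {(i, j): empirical_mi(data[:, i], data[:, j], alphabet_size)
         for i, j in combinations(range(d), 2)}
    # Prim's algorithm: repeatedly add the heaviest edge crossing the cut
    in_tree, edges = {0}, set()
    while len(in_tree) < d:
        best = max(((i, j) for (i, j) in w
                    if (i in in_tree) ^ (j in in_tree)), key=w.get)
        edges.add(best)
        in_tree.update(best)
    return edges
```

Swapping `empirical_mi` for a minimax rate-optimal mutual information estimator is exactly the modification discussed in the remainder of this section.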
Indeed, the modern theory of the maximum likelihood estimation paradigm [7] only justifies the asymptotic efficiency of the MLE, without general non-asymptotic guarantees when we have finitely many samples. In contrast, various modern data-analytic applications deal with datasets that do not have the luxury of many observations relative to the alphabet size.

To explain the insight underlying our improved algorithm, we revisit equation (2) and note that if we were to replace the empirical mutual information with the true mutual information, the output of the MWST would be the true edges of the tree. In this light, the CL algorithm can be viewed as a plug-in estimator that replaces the true mutual information with an estimate of it, namely the empirical mutual information. Naturally, then, it is to be expected that a better estimate of the mutual information leads to a smaller probability of error in identifying the tree. But how bad can the empirical mutual information be as an estimate of the true mutual information? The following theorem from [8] implies that it can be highly sub-optimal in high-dimensional regimes.

Theorem 3. Suppose we have two random variables X_1, X_2 \in \mathcal{X}, |\mathcal{X}| < \infty. The minimax sample complexity of estimating the mutual information I(X_1; X_2) under mean squared error is \Theta(|\mathcal{X}|^2 / \ln |\mathcal{X}|), while the sample complexity required for the empirical mutual information to be consistent is \Theta(|\mathcal{X}|^2).

In words, Theorem 3 implies that for the minimax rate-optimal estimator it suffices to take n \gg |\mathcal{X}|^2 / \ln |\mathcal{X}| samples to consistently estimate the mutual information I(X_1; X_2) for any underlying distribution, while unless n \gg |\mathcal{X}|^2, there exist distributions for which the error of the empirical mutual information is bounded away from zero. Furthermore, when |\mathcal{X}| is large, the approach of estimating the three entropy terms in

I(X_1; X_2) = H(X_1) + H(X_2) - H(X_1, X_2)    (3)
separately using the minimax rate-optimal entropy estimators is minimax rate-optimal for estimating the mutual information.

Now we apply the improved mutual information estimates to improve the Chow-Liu algorithm. In the following experiment, we fix d = 7 and |\mathcal{X}| = 300, construct a star tree (i.e., all random variables are conditionally independent given X_1), and generate a random joint distribution by assigning independent Beta(1/2, 1/2)-distributed random variables to each entry of the marginal distribution P_{X_1} and of the transition probabilities P_{X_k | X_1}, 2 \le k \le d (with normalization). Then, we increase the sample size n from 10^3 to beyond 4.8 x 10^4, and for each n we conduct 20 Monte Carlo simulations. Note that the true tree has d - 1 = 6 edges, and any estimated set of edges has at least one edge in common with these 6 edges because the true tree is a star graph. We define the wrong-edges-ratio in this case as the number of estimated edges that differ from the true set of edges, divided by d - 2 = 5. Thus, if the wrong-edges-ratio equals one, the estimated tree is maximally different from the true tree; in the other extreme, a ratio of zero corresponds to perfect reconstruction. We compute the expected wrong-edges-ratio over the 20 Monte Carlo simulations for each n, and the results are exhibited in Figure 1.

[Figure 1: The expected wrong-edges-ratio of the modified algorithm and the original CL algorithm, plotted against the number of samples (x-axis scale x 10^4).]

Figure 1 reveals intriguing phase transitions for both the modified and the original CL algorithm. When we have fewer than 6 x 10^3 samples, both algorithms yield a wrong-edges-ratio of 1, but soon after the sample size exceeds 6 x 10^3, the modified CL algorithm begins to reconstruct the network perfectly, while the original CL algorithm continues to fail maximally until the sample size exceeds 4.8 x 10^4, 8 times the sample size required by the new algorithm, which we temporarily call the Modified Chow-Liu algorithm.

Open Question 4. The exact sample complexity for learning tree graphical models is open.
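The data-generating process of the star-tree experiment above can be sketched as follows; this is a minimal sketch with our own function names, not the code used to produce Figure 1:

```python
import numpy as np

def random_star_tree(d, m, rng):
    """Random star-tree model: X_1 ~ p0, and X_k | X_1 ~ T[k-2][:, X_1], k >= 2.

    Entries are i.i.d. Beta(1/2, 1/2), then normalized, as in the experiment.
    Returns the root marginal p0 (shape (m,)) and transitions T (d-1, m, m).
    """
    p0 = rng.beta(0.5, 0.5, size=m)
    p0 /= p0.sum()
    T = rng.beta(0.5, 0.5, size=(d - 1, m, m))
    T /= T.sum(axis=1, keepdims=True)       # each column T[k, :, x1] sums to 1
    return p0, T

def sample_star(p0, T, n, rng):
    """Draw n i.i.d. samples of (X_1, ..., X_d) from the star model."""
    d_minus_1, m, _ = T.shape
    x1 = rng.choice(m, size=n, p=p0)
    cols = [x1]
    for k in range(d_minus_1):
        # sample X_{k+2} from the column of T[k] selected by X_1
        cols.append(np.array([rng.choice(m, p=T[k, :, v]) for v in x1]))
    return np.stack(cols, axis=1)

def wrong_edges_ratio(estimated, true_edges, d):
    """True edges missed by the estimate, normalized by d - 2 (star convention)."""
    return len(true_edges - set(estimated)) / (d - 2)
```

Running any structure learner on `sample_star(...)` output and scoring it with `wrong_edges_ratio` reproduces the kind of error curve shown in Figure 1.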
Although it is clear that using improved mutual information estimates improves the tree learning process, it is not clear that this is the optimal way to learn trees. One may start by asking the following easier question: for any jointly distributed random variables (X_1, X_2, X_3), what is the sample complexity of testing I(X_1; X_2) - I(X_1; X_3) \ge \epsilon_2 against I(X_1; X_2) - I(X_1; X_3) \le \epsilon_1? Here 0 < \epsilon_1 < \epsilon_2 are fixed constants.

3 Support size estimation

Given a discrete probability distribution P, we want to estimate its support size

S(P) = \sum_i 1{p_i > 0}

under the assumption that min_{i : p_i > 0} p_i \ge 1/k. Note that this assumption immediately implies that the support size of P can be at most k. The question is: how can we design the minimax rate-optimal estimator of the true support size? The material in this section comes from [9]. We demonstrate in this section that, using the approximation-based methodology of the last lecture, one can intuitively obtain the sample complexity in a systematic fashion.

It was shown in the last lecture that we can distinguish the cases p_i \lesssim (\log n)/n and p_i \gtrsim (\log n)/n with overwhelming probability. This motivates us to separate the problem into two regimes:

1. The case 1/k \gtrsim (\log n)/n: in this case, every nonzero p_i is at least of order (\log n)/n, so the functional we want to estimate is effectively a constant (1{p_i > 0} = 1 for each such symbol), and it suffices to use the plug-in approach. The resulting bias for each p_i is

E[1{\hat{p}_i = 0} 1{p_i > 0}] = (1 - p_i)^n \le e^{-n p_i} \le e^{-n/k}.    (4)

2. The case 1/k \lesssim (\log n)/n: in this case, there might be some p_i that fall in the interval [1/k, (\log n)/n], while others fall in the interval [(\log n)/n, 1]. Those p_i that fall into [(\log n)/n, 1] are handled with the MLE (plug-in), and it suffices to look at the p_i's that lie in [1/k, (\log n)/n].

Claim 5. Suppose q \in (0, 1). Then,

inf_{P_K \in poly_K, P_K(0) = 0} sup_{x \in [q, 1]} |P_K(x) - 1| \le ( (1 - \sqrt{q}) / (1 + \sqrt{q}) )^K \le e^{-K \sqrt{q}}.    (5)

Applying Claim 5 to the interval [1/k, (\log n)/n] for p_i, rescaled to [q, 1] with q = n/(k \log n) and K = c_1 \log n, we obtain that the bias is of order

e^{-K \sqrt{q}} = e^{-c_1 \log n \sqrt{n/(k \log n)}} = e^{-c_1 \sqrt{n \log n / k}},    (6)

which vanishes as long as n \gg k / \log k. Hence, we have (intuitively) shown that the bias of the minimax optimal approach should be of order

k e^{-\Theta(\sqrt{n \log n / k})}.    (7)
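The bound in Claim 5 can be checked numerically with the standard Chebyshev construction P_K(x) = 1 - T_K(u(x)) / T_K(u(0)), where u maps [q, 1] affinely onto [-1, 1]; this polynomial vanishes at 0 and achieves the bound in (5) up to a constant factor. A small sketch (our own; `approx_error` is a hypothetical name):

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def approx_error(q, K, grid=2000):
    """sup_{x in [q,1]} |P_K(x) - 1| for P_K(x) = 1 - T_K(u(x)) / T_K(u(0)).

    u(x) = (2x - 1 - q) / (1 - q) maps [q, 1] onto [-1, 1], so P_K(0) = 0 and
    the error on [q, 1] is |T_K(u(x))| / |T_K(u(0))| <= 1 / |T_K(u(0))|.
    """
    cK = [0.0] * K + [1.0]                  # Chebyshev coefficients of T_K
    u = lambda x: (2.0 * x - 1.0 - q) / (1.0 - q)
    xs = np.linspace(q, 1.0, grid)
    return float(np.max(np.abs(chebval(u(xs), cK) / chebval(u(0.0), cK))))
```

For instance, with q = 0.01 and K = 40 the achieved error is already well below the e^{-K sqrt(q)} bound of (5), and it decays geometrically in K, matching the e^{-c_1 sqrt(n log n / k)} rate in (6).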
References

[1] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1-2, pp. 1-305, 2008.
[2] D. Karger and N. Srebro, "Learning Markov networks: Maximum bounded tree-width graphs," in Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2001.
[3] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 462-467, 1968.
[4] Y. Zhou, "Structure learning of probabilistic graphical models: a comprehensive survey," arXiv preprint arXiv:1111.6925, 2011.
[5] C. Chow and T. Wagner, "Consistency of an estimate of tree-dependent probability distributions (corresp.)," IEEE Transactions on Information Theory, vol. 19, no. 3, 1973.
[6] V. Y. Tan, A. Anandkumar, L. Tong, and A. S. Willsky, "A large-deviation analysis of the maximum-likelihood learning of Markov tree structures," IEEE Transactions on Information Theory, vol. 57, no. 3, pp. 1714-1735, 2011.
[7] E. L. Lehmann and G. Casella, Theory of Point Estimation. Springer, 1998.
[8] J. Jiao, K. Venkat, Y. Han, and T. Weissman, "Minimax estimation of functionals of discrete distributions," IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2835-2885, 2015.
[9] Y. Wu and P. Yang, "Chebyshev polynomials, moment matching, and optimal estimation of the unseen," arXiv preprint arXiv:1504.01227, 2015.
More informationAda Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities
CS8B/Stat4B Sprig 008) Statistical Learig Theory Lecture: Ada Boost, Risk Bouds, Cocetratio Iequalities Lecturer: Peter Bartlett Scribe: Subhrasu Maji AdaBoost ad Estimates of Coditioal Probabilities We
More informationDiscrete Mathematics for CS Spring 2008 David Wagner Note 22
CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig
More informationProblem Set 2 Solutions
CS271 Radomess & Computatio, Sprig 2018 Problem Set 2 Solutios Poit totals are i the margi; the maximum total umber of poits was 52. 1. Probabilistic method for domiatig sets 6pts Pick a radom subset S
More informationA survey on penalized empirical risk minimization Sara A. van de Geer
A survey o pealized empirical risk miimizatio Sara A. va de Geer We address the questio how to choose the pealty i empirical risk miimizatio. Roughly speakig, this pealty should be a good boud for the
More information1 Review and Overview
DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,
More informationStatistics 511 Additional Materials
Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability
More informationLecture 2: Monte Carlo Simulation
STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?
More informationGoodness-Of-Fit For The Generalized Exponential Distribution. Abstract
Goodess-Of-Fit For The Geeralized Expoetial Distributio By Amal S. Hassa stitute of Statistical Studies & Research Cairo Uiversity Abstract Recetly a ew distributio called geeralized expoetial or expoetiated
More informationLinear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d
Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y
More informationThis exam contains 19 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam.
Probability ad Statistics FS 07 Secod Sessio Exam 09.0.08 Time Limit: 80 Miutes Name: Studet ID: This exam cotais 9 pages (icludig this cover page) ad 0 questios. A Formulae sheet is provided with the
More informationLecture 10: Universal coding and prediction
0-704: Iformatio Processig ad Learig Sprig 0 Lecture 0: Uiversal codig ad predictio Lecturer: Aarti Sigh Scribes: Georg M. Goerg Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved
More informationAgnostic Learning and Concentration Inequalities
ECE901 Sprig 2004 Statistical Regularizatio ad Learig Theory Lecture: 7 Agostic Learig ad Cocetratio Iequalities Lecturer: Rob Nowak Scribe: Aravid Kailas 1 Itroductio 1.1 Motivatio I the last lecture
More informationLecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting
Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 12
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig
More informationStat410 Probability and Statistics II (F16)
Some Basic Cocepts of Statistical Iferece (Sec 5.) Suppose we have a rv X that has a pdf/pmf deoted by f(x; θ) or p(x; θ), where θ is called the parameter. I previous lectures, we focus o probability problems
More informationAAEC/ECON 5126 FINAL EXAM: SOLUTIONS
AAEC/ECON 5126 FINAL EXAM: SOLUTIONS SPRING 2015 / INSTRUCTOR: KLAUS MOELTNER This exam is ope-book, ope-otes, but please work strictly o your ow. Please make sure your ame is o every sheet you re hadig
More informationMachine Learning Theory (CS 6783)
Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT
More informationLecture 9: Boosting. Akshay Krishnamurthy October 3, 2017
Lecture 9: Boostig Akshay Krishamurthy akshay@csumassedu October 3, 07 Recap Last week we discussed some algorithmic aspects of machie learig We saw oe very powerful family of learig algorithms, amely
More informationA statistical method to determine sample size to estimate characteristic value of soil parameters
A statistical method to determie sample size to estimate characteristic value of soil parameters Y. Hojo, B. Setiawa 2 ad M. Suzuki 3 Abstract Sample size is a importat factor to be cosidered i determiig
More information