Lecture 16: Achieving and Estimating the Fundamental Limit


EE378A Statistical Signal Processing, Lecture 16 - 05/25/2017

Lecturer: Jiantao Jiao    Scribe: William Clary

In this lecture, we formally define the two distinct problems of achieving and estimating the fundamental limit, and show that under the logarithmic loss, it is easier to estimate the fundamental limit than to achieve it.

1 The Bayes envelope

The Bayes envelope introduced in the previous lectures can be viewed as the fundamental limit of prediction. Indeed, for a specified loss function $\Lambda(x, \hat{x})$, the minimum average loss in predicting $X \sim P_X$ is given by the Bayes envelope:

$$U(P_X) \triangleq \min_{\hat{x}} \mathbb{E}_{X \sim P_X}[\Lambda(X, \hat{x})]. \tag{1}$$

Throughout this lecture, we observe $X_1, X_2, \ldots, X_n \overset{\text{i.i.d.}}{\sim} P_X$, where $X_i \in \mathcal{X} = \{1, 2, \ldots, S\}$. In other words, the alphabet size of $X$ is $|\mathcal{X}| = S$. We denote by $\mathcal{M}_S$ the space of probability measures on $\mathcal{X}$. We take $\Lambda(x, \hat{x})$ to be the logarithmic loss in the sequel, that is,

$$\Lambda(x, \hat{x}) = \Lambda(x, \hat{P}) = \log \frac{1}{\hat{P}(x)}, \tag{2}$$

for any $x \in \mathcal{X}$, $\hat{P} \in \mathcal{M}_S$.

For non-negative sequences $a_\gamma, b_\gamma$, we use the notation $a_\gamma \lesssim b_\gamma$ to denote that there exists a universal constant $C$ such that $\sup_\gamma \frac{a_\gamma}{b_\gamma} \le C$, and $a_\gamma \gtrsim b_\gamma$ is equivalent to $b_\gamma \lesssim a_\gamma$. Notation $a_\gamma \asymp b_\gamma$ is equivalent to $a_\gamma \lesssim b_\gamma$ and $b_\gamma \lesssim a_\gamma$. Notation $a_\gamma \gg b_\gamma$ means that $\liminf_\gamma \frac{a_\gamma}{b_\gamma} = \infty$, and $a_\gamma \ll b_\gamma$ is equivalent to $b_\gamma \gg a_\gamma$. We write $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. Moreover, $\mathsf{poly}_K$ denotes the set of all polynomials of degree no more than $K$.

2 Achieving the fundamental limit

Given $n$ i.i.d. observations $X_1, X_2, \ldots, X_n \sim P_X$, we would like to construct a predictor $\hat{P} = \hat{P}(X_1, X_2, \ldots, X_n)$ to predict a fresh independent random variable $X \sim P_X$, where $X$ is independent of the training data $\{X_i\}_{i=1}^n$. The average risk of predicting $X$ using the predictor $\hat{P}$ under the logarithmic loss is given by

$$\mathbb{E}\left[\log \frac{1}{\hat{P}(X)}\right], \tag{3}$$

where the expectation is over the randomness of $(X_1, X_2, \ldots, X_n, X) \sim P_X^{\otimes (n+1)}$.
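Conditioned on the training data, the risk in (3) is the expected logarithmic loss of one fixed distribution evaluated under $P_X$. The following minimal sketch evaluates that quantity for a few candidate predictors and checks numerically that predicting with $P_X$ itself gives the smallest value, namely the Shannon entropy. The distribution `p_true`, the candidate predictors, and all function names are hypothetical choices made only for illustration.

```python
import math


def expected_log_loss(p, q):
    """E_{X~p}[log 1/q(X)] in nats; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(1.0 / qx) for px, qx in zip(p, q) if px > 0)


def entropy(p):
    """Shannon entropy H(p) in nats: the log loss incurred by predicting with p itself."""
    return expected_log_loss(p, p)


# Hypothetical source distribution and candidate predictors (illustration only).
p_true = [0.5, 0.25, 0.25]
candidates = {
    "truth": p_true,
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "skewed": [0.8, 0.1, 0.1],
}

# Expected log loss of each candidate when the data really come from p_true.
losses = {name: expected_log_loss(p_true, q) for name, q in candidates.items()}
best = min(losses, key=losses.get)

for name, value in losses.items():
    print(f"{name:8s} expected log loss = {value:.4f} nats")
print("entropy of p_true =", round(entropy(p_true), 4), "| best candidate:", best)
```

The gap between any candidate's loss and the entropy is the quantity that the later regret analysis studies.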

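2.1 The inappropriate question of minimax risk

Since the distribution $P_X$ is unknown, we may take the minimax approach in decision theory and aim at solving the minimax risk. In other words, we aim at solving

$$\inf_{\hat{P}} \sup_{P_X \in \mathcal{M}_S} \mathbb{E}\left[\log \frac{1}{\hat{P}(X)}\right]. \tag{4}$$

We now show that this question leads to a degenerate answer that may not be what we want.

Theorem 1. The minimax risk is given by

$$\inf_{\hat{P}} \sup_{P_X \in \mathcal{M}_S} \mathbb{E}\left[\log \frac{1}{\hat{P}(X)}\right] = \log(S), \tag{5}$$

and the minimax risk achieving $\hat{P}$ can be taken to be $U_S = \left(\frac{1}{S}, \frac{1}{S}, \ldots, \frac{1}{S}\right)$, where $U_S$ is the uniform distribution on $\mathcal{X}$.

Proof. We first show that the minimax risk is at least $\log(S)$. Indeed, for any predictor $\hat{P} \in \mathcal{M}_S$, we have

$$\mathbb{E}\left[\log \frac{1}{\hat{P}(X)} \,\Big|\, X_1, X_2, \ldots, X_n\right] = \sum_{x \in \mathcal{X}} P_X(x) \log \frac{1}{\hat{P}(x)} \tag{6}$$

$$\ge \sum_{x \in \mathcal{X}} P_X(x) \log \frac{1}{P_X(x)} \tag{7}$$

$$= H(P_X), \tag{8}$$

where we used the non-negativity of the KL divergence, and $H(P_X)$ is the Shannon entropy. Taking $P_X = U_S$, we have $H(P_X) = \log S$. Taking expectations on both sides with respect to $X_1, X_2, \ldots, X_n$, we know

$$\mathbb{E}\left[\log \frac{1}{\hat{P}(X)}\right] \ge \log S \tag{9}$$

for any predictor $\hat{P}$ when $P_X = U_S$. On the other hand, taking $\hat{P} \equiv U_S$, we have

$$\mathbb{E}\left[\log \frac{1}{\hat{P}(X)}\right] = \log S \tag{10}$$

for every $P_X \in \mathcal{M}_S$, which proves that the minimax risk is at most $\log S$.

Theorem 1 shows that solving the minimax risk in prediction may lead to inappropriate answers. Indeed, the minimax optimal solution turns out to be a degenerate answer that ignores all the training data. What we show next is that focusing on the minimax regret solves this problem in a meaningful way.

2.2 Achieving the fundamental limit: minimax regret

As we argued in the proof of Theorem 1, for any predictor $\hat{P} = \hat{P}(X_1, X_2, \ldots, X_n)$, we have

$$\mathbb{E}\left[\log \frac{1}{\hat{P}(X)}\right] \ge H(P_X). \tag{11}$$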
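As a numerical sanity check on Theorem 1, the sketch below restricts attention to data-independent predictors (a simplifying assumption made here, not part of the theorem). For a fixed predictor the risk is linear in $P_X$, so its worst case is attained at a point mass on the symbol that the predictor considers least likely. The alphabet size and the example distributions are hypothetical.

```python
import math


def risk(p, q):
    """Expected log loss (nats) of the fixed predictor q when X ~ p."""
    return sum(px * math.log(1.0 / qx) for px, qx in zip(p, q) if px > 0)


def worst_case_risk(q):
    """Supremum of risk(p, q) over all p for a fixed q.

    The risk is linear in p, so the supremum is attained at a point mass
    on the symbol to which q assigns the smallest probability.
    """
    return max(math.log(1.0 / qx) for qx in q)


S = 4
uniform = [1.0 / S] * S
tilted = [0.4, 0.3, 0.2, 0.1]  # hypothetical non-uniform predictor

# The uniform predictor has risk log(S) under every source distribution.
sources = [[1.0, 0.0, 0.0, 0.0], [0.7, 0.1, 0.1, 0.1], uniform]
uniform_risks = [risk(p, uniform) for p in sources]

print("log(S)                      =", math.log(S))
print("risks of uniform predictor  =", uniform_risks)
print("worst case, tilted predictor =", worst_case_risk(tilted))
```

Any predictor that moves mass away from uniform lowers its loss on some sources but raises its worst-case loss above $\log S$, which is why the minimax risk criterion rewards ignoring the data.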

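The inequality (11) motivates us to define the minimax regret as follows:

$$\inf_{\hat{P}} \sup_{P_X \in \mathcal{M}_S} \left( \mathbb{E}\left[\log \frac{1}{\hat{P}(X)}\right] - H(P_X) \right). \tag{12}$$

We have the following algebraic manipulations for any predictor $\hat{P}$:

$$\mathbb{E}\left[\log \frac{1}{\hat{P}(X)}\right] - H(P_X) = \mathbb{E}_{X_1, X_2, \ldots, X_n}\left[ \sum_{x \in \mathcal{X}} P_X(x) \log \frac{1}{\hat{P}(x)} - \sum_{x \in \mathcal{X}} P_X(x) \log \frac{1}{P_X(x)} \right] \tag{13}$$

$$= \mathbb{E}\left[ \sum_{x \in \mathcal{X}} P_X(x) \log \frac{P_X(x)}{\hat{P}(x)} \right] \tag{14}$$

$$= \mathbb{E}\, D(P_X \| \hat{P}), \tag{15}$$

where $D(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$ is the KL divergence between $P$ and $Q$. In other words, solving the minimax regret of predicting a fresh independent random variable $X$ based on $n$ i.i.d. training samples $X_1, X_2, \ldots, X_n$ is equivalent to solving the problem of estimating the discrete distribution $P_X$ under the KL divergence loss. The minimax regret is characterized by the following theorem.

Theorem 2 [1][2].

$$\inf_{\hat{P}} \sup_{P_X \in \mathcal{M}_S} \left( \mathbb{E}\left[\log \frac{1}{\hat{P}(X)}\right] - H(P_X) \right) = \begin{cases} (1 + o(1)) \dfrac{S - 1}{2n} \log(e) & \text{if } n \gg S \\ (1 + o(1)) \log\left(\dfrac{S}{n}\right) & \text{if } n \ll S \end{cases} \tag{16}$$

Moreover, if $\limsup \frac{n}{S} = c \in (0, \infty)$, the minimax regret is bounded away from zero.

The predictor $\hat{P}$ that achieves the performance above in the regime of $n \gg S$ is:

$$\hat{P}(x) = \frac{n(x) + \beta(n(x))}{n + \sum_{j=1}^{S} \beta(n(x_j))} \quad \text{for any } x \in \mathcal{X}, \tag{17}$$

where

$$n(x) = \sum_{i=1}^{n} \mathbb{1}(X_i = x), \tag{18}$$

and $(X_1, X_2, \ldots, X_n)$ is the training data. Here

$$\beta(k) = \begin{cases} \frac{1}{2} & \text{if } k = 0 \\ 1 & \text{if } k = 1 \\ \frac{3}{4} & \text{o.w.} \end{cases} \tag{19}$$

The predictor $\hat{P}$ that achieves the performance above in the regime of $n \ll S$ is:

$$\hat{P}(x) = \frac{n(x) + \frac{n}{S} \log \frac{S}{n}}{n + n \log \frac{S}{n}} \quad \text{for any } x \in \mathcal{X}. \tag{20}$$

[1] Paninski, Liam. Variational Minimax Estimation of Discrete Distributions under KL Loss. In NIPS.
[2] Braess, Dietrich, and Thomas Sauer. Bernstein polynomials and learning theory. Journal of Approximation Theory 128, no. 2 (2004).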
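The predictor in (17) to (19) is simple enough to try out directly. The sketch below implements it and estimates its regret $\mathbb{E}\, D(P_X \| \hat{P})$ by Monte Carlo for one hypothetical source distribution in the $n \gg S$ regime, then compares the result with the leading term $\frac{S-1}{2n}$ (in nats) from Theorem 2. Symbols are labeled $0, \ldots, S-1$ rather than $1, \ldots, S$; the distribution, sample size, number of trials, and seed are illustrative assumptions, and a single fixed $P_X$ says nothing rigorous about the worst case.

```python
import math
import random
from collections import Counter


def beta(k):
    """Count-dependent additive constant from (19)."""
    if k == 0:
        return 0.5
    if k == 1:
        return 1.0
    return 0.75


def add_beta_predictor(samples, S):
    """Predictor (17): add beta(n(x)) to each count n(x), then normalize."""
    counts = Counter(samples)
    weights = [counts[x] + beta(counts[x]) for x in range(S)]
    total = sum(weights)  # equals n + sum_j beta(n(x_j))
    return [w / total for w in weights]


def kl(p, q):
    """KL divergence D(p || q) in nats; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


def average_regret(p, n, trials, seed=0):
    """Monte Carlo estimate of E D(p || P_hat) over i.i.d. training sets of size n."""
    rng = random.Random(seed)
    S = len(p)
    total = 0.0
    for _ in range(trials):
        samples = rng.choices(range(S), weights=p, k=n)
        total += kl(p, add_beta_predictor(samples, S))
    return total / trials


p_true = [0.5, 0.3, 0.2]  # hypothetical source distribution
n = 200
regret = average_regret(p_true, n, trials=2000)
leading_term = (len(p_true) - 1) / (2 * n)
print(f"estimated regret = {regret:.5f} nats, (S-1)/(2n) = {leading_term:.5f} nats")
```

Because every symbol receives a strictly positive weight, the predictor never assigns probability zero, so the KL divergence in the regret is always finite.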

Vanishing minimax regret implies that there exists a predictor $\hat{P}$ such that its average prediction error $\mathbb{E}\left[\log \frac{1}{\hat{P}(X)}\right]$ on the test set approaches the fundamental limit $H(P_X)$. Theorem 2 shows that it takes $n \gg S$ samples to achieve vanishing minimax regret. Intuitively, one needs to see every symbol at least once to be able to construct a predictor whose performance approaches the fundamental limit.

The minimax regret definition reflects the traditional way of understanding the difficulty of machine learning tasks. In machine learning practice, we iteratively improve our training algorithm, and use its prediction accuracy on the test set to measure the performance of our prediction algorithm. The best performance achieved by existing schemes on the test set is usually understood as the limit of prediction for a specific dataset. In this context, Theorem 2 can be interpreted as follows: with $n \lesssim S$ samples, there does not exist any prediction algorithm based on $n$ training samples whose performance on the test set can approach the Bayes envelope in the worst case. As we show in the next section, there exist algorithms that can estimate the fundamental limit with $n \ll S$ samples without explicitly constructing a prediction algorithm.

3 Estimating the fundamental limit

We define the problem of estimating the fundamental limit as solving the following minimax problem:

$$\inf_{\hat{H}} \sup_{P_X \in \mathcal{M}_S} \mathbb{E}\left| \hat{H} - H(P_X) \right|, \tag{21}$$

where the infimum is taken over all possible estimators $\hat{H} = \hat{H}(X_1, X_2, \ldots, X_n)$ that are functions of the empirical training data. The materials in this section are mainly taken from: Jiao, Jiantao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory 61, no. 5 (2015); and Jiao, Jiantao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Maximum likelihood estimation of functionals of discrete distributions. arXiv preprint (2014).

3.1 The minimax rates

We have the following theorem.

Theorem 3 [3][4][5][6]. Suppose $n \gtrsim \frac{S}{\ln S}$. Then,

$$\inf_{\hat{H}} \sup_{P_X \in \mathcal{M}_S} \mathbb{E}\left| \hat{H} - H(P_X) \right| \asymp \frac{S}{n \ln n} + \frac{\ln S}{\sqrt{n}}. \tag{22}$$

Theorem 3 shows that it suffices to take $n \gg \frac{S}{\ln S}$ samples to consistently estimate the fundamental limit $H(P_X)$. It is very surprising that the number of samples required is in fact sublinear in $S$: one can estimate the Shannon entropy uniformly over all $P_X \in \mathcal{M}_S$ even if one has not seen most of the symbols in the alphabet $\mathcal{X}$ in the empirical samples.

[3] Valiant, Gregory, and Paul Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the forty-third annual ACM symposium on Theory of computing. ACM.
[4] Valiant, Gregory, and Paul Valiant. The power of linear estimators. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on. IEEE.
[5] Wu, Yihong, and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62, no. 6 (2016).
[6] Jiao, Jiantao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory 61, no. 5 (2015).
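To make the sublinear sample complexity concrete, one can plug a sample size of the form $n = t \cdot \frac{S}{\ln S}$ into the rate (22). The parameter $t \ge 1$ is introduced here purely for illustration and is not part of the theorem. A short calculation gives

$$\frac{S}{n \ln n} = \frac{\ln S}{t \ln n}, \qquad \frac{\ln S}{\sqrt{n}} = \frac{(\ln S)^{3/2}}{\sqrt{t S}}, \qquad \ln n = \ln S + \ln t - \ln \ln S.$$

When $t$ grows slowly compared with $S$, we have $\ln n = (1 + o(1)) \ln S$, so the first term is approximately $\frac{1}{t}$ and the second term vanishes as $S \to \infty$. Under this assumption the minimax rate behaves like $\frac{1}{t}$ up to constants, and it can be driven to zero by letting $t \to \infty$, that is, by taking $n \gg \frac{S}{\ln S}$, even though such an $n$ may still be much smaller than $S$.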

3.2 Natural candidate: the empirical entropy

One of the most natural estimators for the Shannon entropy $H(P_X)$ given $n$ i.i.d. samples is the empirical entropy, which is defined as follows. Denote the empirical distribution by $\hat{P}_n = (\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_S)$, where $\hat{p}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbb{1}(X_j = i)$ is the empirical frequency of symbol $i$ in the training set. The empirical entropy is defined as $H(\hat{P}_n)$, which plugs the empirical distribution into the Shannon entropy functional. Intuitively, since the Shannon entropy is a continuous functional for finite alphabet distributions, and $\hat{P}_n$ converges to the true distribution $P_X$ as $n \to \infty$, the plug-in estimate $H(\hat{P}_n)$ should be a decent estimator for $H(P_X)$ if $S$ is fixed and $n \to \infty$. This is indeed true: it is only in high dimensions that the empirical entropy starts to behave poorly as an estimate of the Shannon entropy. We have the following theorem quantifying the performance of the empirical entropy in estimating $H(P_X)$.

Theorem 4 [7][8]. Suppose $n \gtrsim S$. Then,

$$\sup_{P_X \in \mathcal{M}_S} \mathbb{E}\left| H(\hat{P}_n) - H(P_X) \right| \asymp \frac{S}{n} + \frac{\ln S}{\sqrt{n}}. \tag{23}$$

Comparing Theorems 4 and 3, the main difference is that the term $\frac{S}{n}$ has been improved to $\frac{S}{n \ln n}$ in the minimax rate-optimal entropy estimator, while the second term is unchanged. We now investigate where the two terms come from, and how one may construct the minimax rate-optimal estimators based on the empirical entropy.

3.3 Analysis of the empirical entropy

For any estimator $\hat{H}$, its performance in estimating $H(P_X)$ can be characterized via its bias, defined as $\mathbb{E}\hat{H} - H(P_X)$, and the concentration of $\hat{H}$ around its expectation $\mathbb{E}\hat{H}$. The concentration property may be partially characterized by the variance of the estimator $\hat{H}$, namely $\mathsf{Var}(\hat{H}) = \mathbb{E}(\hat{H} - \mathbb{E}\hat{H})^2$. We now argue that in Theorem 4, the term $\frac{S}{n}$ comes from the bias, and the term $\frac{\ln S}{\sqrt{n}}$ comes from the variance.

Introduce the concave function $f(x) = x \ln\left(\frac{1}{x}\right)$ on $[0, 1]$. It is clear that

$$H(\hat{P}_n) = \sum_{i=1}^{S} f(\hat{p}_i). \tag{24}$$

We have the following claim.

Claim 5. If $p_i \ge \frac{1}{n}$, then

$$0 \le f(p_i) - \mathbb{E} f(\hat{p}_i) \lesssim \frac{1}{n}. \tag{25}$$

Moreover,

$$\mathsf{Var}(H(\hat{P}_n)) \le \frac{(\ln n)^2}{n} \wedge \frac{2(\ln S + 3)^2}{n}. \tag{26}$$

The results in Claim 5 are instructive. They show that the variance of the empirical entropy can be universally bounded regardless of the support size. Moreover, the biases contributed by the individual symbols add up linearly, contributing the term $\frac{S}{n}$. It is clear that in the regime of fixed $S$ and $n \to \infty$, the variance dominates, but in high dimensions the bias dominates. Hence, the key to improving the empirical entropy is to reduce the bias in high dimensions without incurring too much additional variance.

[7] Jiao, Jiantao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Maximum likelihood estimation of functionals of discrete distributions. arXiv preprint (2014).
[8] Wu, Yihong, and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62, no. 6 (2016).
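The linear accumulation of per-symbol bias can be checked exactly, without simulation. Under i.i.d. sampling each count $n \hat{p}_i$ is marginally $\mathsf{B}(n, p_i)$, so by (24) the expected empirical entropy is a sum of one-dimensional binomial expectations. The sketch below computes the exact bias for a uniform source and compares it with the first-order approximation $-\frac{S-1}{2n}$. The choice of a uniform source and the values of `n` and `S` are assumptions made only for illustration.

```python
import math


def f(x):
    """f(x) = x ln(1/x) with the convention f(0) = 0."""
    return -x * math.log(x) if x > 0 else 0.0


def expected_f_of_frequency(n, p):
    """Exact E[f(p_hat)] when n * p_hat ~ Binomial(n, p)."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) * f(k / n)
        for k in range(n + 1)
    )


def plugin_bias_uniform(n, S):
    """Exact bias E[H(P_hat_n)] - H(P_X) when P_X is uniform on S symbols.

    All S symbols contribute the same per-symbol bias, so the total is
    S times the bias of a single term in (24).
    """
    p = 1.0 / S
    return S * (expected_f_of_frequency(n, p) - f(p))


n = 200
for S in (5, 50):
    bias = plugin_bias_uniform(n, S)
    first_order = -(S - 1) / (2 * n)
    print(f"S = {S:3d}: exact bias = {bias:.5f}, -(S-1)/(2n) = {first_order:.5f}")
```

With `n` held fixed, the bias grows roughly in proportion to `S`, which is the behavior behind the $\frac{S}{n}$ term in Theorem 4.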

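3.4 How can we improve the empirical entropy?

It has been a long journey to find the minimax rate-optimal estimators. Harris in 1975 proposed expanding $\mathbb{E}_P H(\hat{P}_n)$ using a Taylor expansion and obtained

$$\mathbb{E}\, H(\hat{P}_n) = H(P_X) - \frac{S - 1}{2n} + \frac{1}{12 n^2}\left(1 - \sum_{i=1}^{S} \frac{1}{p_i}\right) + O\left(\frac{1}{n^3}\right). \tag{27}$$

The Taylor expansion result looks decent in the regime where the $p_i$'s are not too small. Indeed, for very small $p_i$ the remainder term $\frac{1}{12 n^2} \sum_{i=1}^{S} \frac{1}{p_i}$ may be much larger than the true entropy $H(P_X)$ itself. This intuition turns out to be correct: it suffices to do a first-order bias correction using Taylor series in the regime of not too small $p_i$. In general, for $n \hat{p}_n \sim \mathsf{B}(n, p)$, we may write

$$\mathbb{E}[f(\hat{p}_n)] = f(p) + \frac{1}{2} f''(p) \frac{p(1 - p)}{n} + O_p\left(\frac{1}{n^2}\right),$$

which motivates the bias correction:

$$\hat{f}_c = f(\hat{p}_n) - \frac{1}{2} f''(\hat{p}_n) \frac{\hat{p}_n (1 - \hat{p}_n)}{n}.$$

In the entropy estimation case, we follow the bias correction above and do the following [9]:

Construction 6. If the true $p_i \ge \frac{\ln n}{n}$, we use $f(\hat{p}_i) + \frac{1}{2n}$ instead of $f(\hat{p}_i)$ to estimate $f(p_i)$.

Now the focus is on the small $p_i$ regime. We need to understand precisely which term contributed the $\frac{S}{n}$ bias bound. Assume for now that all $p_i \le \frac{\ln n}{n}$. We have the following manipulations:

$$\mathbb{E}\, H(\hat{P}_n) - H(P_X) = \sum_{i=1}^{S} \left( \mathbb{E} f(\hat{p}_i) - f(p_i) \right) \tag{28}$$

$$= \sum_{i=1}^{S} \mathbb{E}\left( f(\hat{p}_i) - P_K(\hat{p}_i) \right) - \sum_{i=1}^{S} \left( f(p_i) - P_K(p_i) \right) + \sum_{i=1}^{S} \mathbb{E}\left( P_K(\hat{p}_i) - P_K(p_i) \right), \tag{29}$$

where $P_K(\cdot)$ is an arbitrary polynomial with order no more than $K$. The following two observations are crucial for improving the empirical entropy.

Claim 7. If $p_i \le \frac{\ln n}{n}$, we have $\hat{p}_i \lesssim \frac{\ln n}{n}$ with probability at least $1 - n^{-4}$.

Claim 8. Suppose $K \asymp \ln n$. Then for any constant $c > 0$,

$$\inf_{P_K \in \mathsf{poly}_K} \sup_{x \in [0, \frac{c \ln n}{n}]} \left| f(x) - P_K(x) \right| \asymp \frac{1}{n \ln n}. \tag{30}$$

Utilizing those two claims, and conditioning on the event that all $\hat{p}_i \le \frac{c \ln n}{n}$, $p_i \le \frac{c \ln n}{n}$, we immediately obtain that

$$\left| \sum_{i=1}^{S} \mathbb{E}\left( f(\hat{p}_i) - P_K(\hat{p}_i) \right) \right| \lesssim \frac{S}{n \ln n}, \tag{31}$$

$$\left| \sum_{i=1}^{S} \left( f(p_i) - P_K(p_i) \right) \right| \lesssim \frac{S}{n \ln n}, \tag{32}$$

[9] Note that this bias correction intuition does not easily generalize to higher order corrections. For a systematic approach to higher order bias correction with Taylor series, we refer the readers to Yanjun Han, Jiantao Jiao, Tsachy Weissman, Minimax Rate-Optimal Estimation of Divergences between Discrete Distributions, arXiv preprint (2016).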

which implies that the remaining term in (29) can be of order

$$\left| \mathbb{E}_P \left[ \sum_{i=1}^{S} \left( P_K(\hat{p}_i) - P_K(p_i) \right) \right] \right| \asymp \frac{S}{n}, \tag{33}$$

since we already know from Claim 5 and Theorem 4 that the bias $\left| \mathbb{E}\, H(\hat{P}_n) - H(P_X) \right|$ can be of order $\frac{S}{n}$, while the two approximation terms are only of order $\frac{S}{n \ln n}$. Thus, we have identified the reason for the poor bias of the empirical entropy: the plug-in approach to estimating the polynomial $P_K$ incurs too much bias. Realizing this turns out to be the crucial factor that leads to the minimax rate-optimal estimator: under the multinomial model there exist unbiased estimators for any polynomial $P_K$ whose order is no more than $n$. Indeed, when $X \sim \mathsf{B}(n, p)$, for any integer $r \in \{1, \ldots, n\}$:

$$\mathbb{E}\left[ \frac{X(X - 1) \cdots (X - r + 1)}{n(n - 1) \cdots (n - r + 1)} \right] = p^r.$$

We complete the construction of the minimax rate-optimal estimator by doing the following:

Construction 9. If the true $p_i \le \frac{\ln n}{n}$, we use the unbiased estimator of the polynomial $P_K(p_i)$ to estimate $f(p_i)$. Here $P_K(\cdot)$ is the best approximation polynomial of $f$ over the interval $\left[0, \frac{c \ln n}{n}\right]$ introduced in Claim 8.

Since the true $p_i$ is unknown, the last step is to decide between the two constructions from the data. For this we use the Chernoff bound to show the following results on confidence intervals in the binomial model:

Claim 10. There exist positive real numbers $c_1, c_2, c_3, c_4$ such that:
(i) if $\hat{p}_i \in \left[0, c_1 \frac{\ln n}{n}\right]$, then $p_i \in \left[0, c_2 \frac{\ln n}{n}\right]$ with probability at least $1 - n^{-4}$;
(ii) if $\hat{p}_i \in \left[c_3 \frac{\ln n}{n}, 1\right]$, then $p_i \in \left[c_4 \frac{\ln n}{n}, 1\right]$ with probability at least $1 - n^{-4}$.

There are other details needed to make the whole proof work: for example, one needs to argue that this approach does not increase the variance by too much, and also show minimax lower bounds. In practice one may also remove the constant term in $P_K(\cdot)$ to ensure that one assigns zero to symbols that have never appeared in the training data. Thus, we have constructed a minimax rate-optimal estimator that does not require the knowledge of the support size, but behaves nearly as well as the exact minimax estimator with the knowledge of the support size.
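The two building blocks of the construction, the first-order bias correction for large $p_i$ and the unbiased monomial estimator for small $p_i$, can both be checked exactly in the binomial model. The sketch below is a verification aid under stated assumptions, not the full estimator: it omits the best polynomial approximation of Claim 8 and the data-driven regime selection of Claim 10. The parameter values are hypothetical, and leaving $\hat{p} = 0$ uncorrected is a convention chosen here because $f''$ is undefined at zero.

```python
import math
from fractions import Fraction


def f(x):
    """f(x) = x ln(1/x) with the convention f(0) = 0."""
    return -x * math.log(x) if x > 0 else 0.0


def binomial_expectation(g, n, p):
    """Exact E[g(X)] for X ~ Binomial(n, p), summing over all n + 1 outcomes."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) * g(k) for k in range(n + 1)
    )


def corrected_f(k, n):
    """First-order corrected estimate f(p_hat) - f''(p_hat) p_hat (1 - p_hat) / (2n).

    With f''(x) = -1/x this simplifies to f(p_hat) + (1 - p_hat) / (2n).
    Assumption: no correction is applied when k = 0.
    """
    if k == 0:
        return 0.0
    p_hat = k / n
    return f(p_hat) + (1 - p_hat) / (2 * n)


def falling_factorial(x, r):
    """x (x - 1) ... (x - r + 1), which is zero whenever 0 <= x < r."""
    out = 1
    for j in range(r):
        out *= x - j
    return out


def unbiased_power(k, n, r):
    """Unbiased estimator of p^r from a Binomial(n, p) count k, as an exact fraction."""
    return Fraction(falling_factorial(k, r), falling_factorial(n, r))


# Large-p regime: compare the bias before and after the first-order correction.
n, p = 100, 0.2
plugin_bias = binomial_expectation(lambda k: f(k / n), n, p) - f(p)
corrected_bias = binomial_expectation(lambda k: corrected_f(k, n), n, p) - f(p)
print(f"plug-in bias = {plugin_bias:.2e}, corrected bias = {corrected_bias:.2e}")

# Monomials: exact rational arithmetic confirms unbiasedness for r = 1, ..., 4.
n_small, p_small = 10, Fraction(3, 10)
moments = {
    r: binomial_expectation(lambda k, r=r: unbiased_power(k, n_small, r), n_small, p_small)
    for r in range(1, 5)
}
plugin_square = binomial_expectation(
    lambda k: Fraction(k, n_small) ** 2, n_small, p_small
)
print("E[unbiased estimate of p^2] =", moments[2], "| E[(X/n)^2] =", plugin_square)
```

By linearity of expectation, an unbiased estimator of any polynomial $\sum_r a_r p^r$ with degree at most $n$ follows by combining `unbiased_power` terms with the coefficients $a_r$; the plug-in value $(X/n)^2$, by contrast, overshoots $p^2$ by $\frac{p(1-p)}{n}$.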
