Functional Bregman Divergence and Bayesian Estimation of Distributions

Size: px

Start display at page:

Download "Functional Bregman Divergence and Bayesian Estimation of Distributions"

Chad Baker
5 years ago
Views:

1 FUNCTIONAL BRGAN DIVRGNC Functional Breman Diverence and Bayesian stimation of Distributions B. A. Friyik, S. Srivastava, and. R. Gupta Abstract A class of distortions termed functional Breman diverences is defined, which includes squared error and relative entropy. A functional Breman diverence acts on functions or distributions, and eneralizes the standard Breman diverence for vectors and a previous pointwise Breman diverence that was defined for functions. A recently published result showed that the mean minimizes the expected Breman diverence. The new functional definition enables the extension of this result to the continuous case to show that the mean minimizes the expected functional Breman diverence over a set of functions or distributions. It is shown how this theorem applies to the Bayesian estimation of distributions. stimation of the uniform distribution from independent and identically drawn samples is used as a case study. Index Terms Breman diverence, Bayesian estimation, uniform distribution, learnin BRGAN diverences are a useful set of distortion functions that include squared error, relative entropy, loistic loss, ahalanobis distance, and the Itakura-Saito function. Breman diverences are popular in statistical estimation and information theory. Analysis usin the concept of Breman diverences has played a key role in recent advances in statistical learnin [] [0], clusterin [], [2], inverse problems [3], maximum entropy estimation [4], and the applicability of the data processin theorem [5]. Recently, it was discovered that the mean is the minimizer of the expected Breman diverence for a set of d-dimensional points [], [6]. In this paper we define a functional Breman diverence that applies to functions and distributions, and we show that this new definition is equivalent to Breman diverence applied to vectors. The functional definition eneralizes a pointwise Breman diverence that has been previously defined for measurable functions [7], [7], and thus extends the class of distortion functions that are Breman diverences; see Section I-A.2 for an example. ost importantly, the functional definition enables one to solve functional minimization problems usin standard methods from the calculus of variations; we extend the recent result on the expectation of vector Breman diverence [], [6] to show that the mean minimizes the expected Breman diverence for a set of functions or distributions. We show how this theorem links to Bayesian estimation of distributions. For distributions from the exponential family distributions, many popular diverences, such B. A. Friyik is with the Department of athematics, Purdue University, West Lafayette, IN ( bfriyik@math.purdue.edu). S. Srivastava is with the Department of Applied athematics, University of Washinton, Seattle, WA 9895 ( santosh@amath.washinton.edu).. R. Gupta is with the Department of lectrical nineerin, University of Washinton, Seattle, WA 9895 ( upta@ee.washinton.edu). This author s work was supported in part by the United States Office of Naval Research. as relative entropy, can be expressed as a (different) Breman diverence on the exponential distribution parameters. The functional Breman definition enables stroner results and a more eneral application. In Section we state a functional definition of the Breman diverence and ive examples for total squared difference, relative entropy, and squared bias. In later subsections, the relationship between the functional definition and previous Breman definitions is established, and properties are noted. Then in Section 2 we present the main theorem: that the expectation of a set of functions minimizes the expected Breman diverence. We discuss the application of this theorem to Bayesian estimation, and as a case study compare different estimates for the uniform distribution iven independent and identically drawn samples. Proofs are in the appendix. Readers who are not familiar with functional derivatives may find helpful our short introduction to functional derivatives [8] or the text by Gelfand and Fomin [9]. I. FUNCTIONAL BRGAN DIVRGNC Let ( R d, Ω, ν ) be a measure space, where ν is a Borel measure and d is a positive inteer. Let φ be a real functional over the normed space L p (ν) for p. Recall that the bounded linear functional δφ[f; ] is the Fréchet derivative of φ at f L p (ν) if φ[f + a] φ[f] φ[f; a] δφ[f; a] + ɛ[f, a] a L p (ν) () for all a L p (ν), with ɛ[f, a] 0 as a L p (ν) 0 [9]. Then iven an appropriate functional φ, a functional Breman diverence can be defined: Definition I. (Functional Definition of Breman Diverence). Let φ : L p (ν) R be a strictly convex, twice-continuously Fréchet-differentiable functional. The Breman diverence d φ : L p (ν) L p (ν) [0, ) is defined for all admissible f, L p (ν) as d φ [f, ] φ[f] φ[] δφ[; f ], (2) where δφ[; ] is the Fréchet derivative of φ at. Here, we have used the Fréchet derivative, but the definition (and results in this paper) can be easily extended usin other definitions of functional derivatives; a sample extension is iven in Section I-A.3.

2 Report Documentation Pae Form Approved OB No Public reportin burden for the collection of information is estimated to averae hour per response, includin the time for reviewin instructions, searchin existin data sources, atherin and maintainin the data needed, and completin and reviewin the collection of information. Send comments reardin this burden estimate or any other aspect of this collection of information, includin suestions for reducin this burden, to Washinton Headquarters Services, Directorate for Information Operations and Reports, 25 Jefferson Davis Hihway, Suite 204, Arlinton VA Respondents should be aware that notwithstandin any other provision of law, no person shall be subject to a penalty for failin to comply with a collection of information if it does not display a currently valid OB control number.. RPORT DAT RPORT TYP 3. DATS COVRD to TITL AND SUBTITL Functional Breman Diverence and Bayesian stimation of Distributions 5a. CONTRACT NUBR 5b. GRANT NUBR 5c. PROGRA LNT NUBR 6. AUTHOR(S) 5d. PROJCT NUBR 5e. TASK NUBR 5f. WORK UNIT NUBR 7. PRFORING ORGANIZATION NA(S) AND ADDRSS(S) University of Washinton,Department of lectrical nineerin,seattle,wa, PRFORING ORGANIZATION RPORT NUBR 9. SPONSORING/ONITORING AGNCY NA(S) AND ADDRSS(S) 0. SPONSOR/ONITOR S ACRONY(S) 2. DISTRIBUTION/AVAILABILITY STATNT Approved for public release; distribution unlimited 3. SUPPLNTARY NOTS. SPONSOR/ONITOR S RPORT NUBR(S) 4. ABSTRACT A class of distortions termed functional Breman diverences is defined, which includes squared error and relative entropy. A functional Breman diverence acts on functions or distributions, and eneralizes the standard Breman diverence for vectors and a previous pointwise Breman diverence that was defined for functions. A recently published result showed that the mean minimizes the expected Breman diverence. The new functional definition enables the extension of this result to the continuous case to show that the mean minimizes the expected functional Breman diverence over a set of functions or distributions. It is shown how this theorem applies to the Bayesian estimation of distributions. stimation of the uniform distribution from independent and identically drawn samples is used as a case study. 5. SUBJCT TRS 6. SCURITY CLASSIFICATION OF: 7. LIITATION OF ABSTRACT a. RPORT unclassified b. ABSTRACT unclassified c. THIS PAG unclassified Same as Report (SAR) 8. NUBR OF PAGS 0 9a. NA OF RSPONSIBL PRSON Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-8

3 2 FUNCTIONAL BRGAN DIVRGNC A. xamples Different choices of the functional φ lead to different Breman diverences. Illustrative examples are iven for squared error, squared bias, and relative entropy. Functionals for other Breman diverences can be derived based on these examples, from the example functions for the discrete case iven in Table of [6], and from the fact that φ is a strictly convex functional if it has the form φ() φ((t))dt where φ : R R, φ is strictly convex and is in some well-defined vector space of functions [20]. ) Total Squared Difference: Let φ[] 2 dν, where φ : L 2 (ν) R, and let, f, a L 2 (ν). Then φ[ + a] φ[] ( + a) 2 dν 2 dν 2 adν + a 2 dν. Because a 2 dν a L 2 (ν) as a 0 in L 2 (ν), a 2 L 2 (ν) a L a 2 (ν) 0 L 2 (ν) δφ[; a] 2 adν, which is a continuous linear functional in a. To show that φ is strictly convex we show that φ is stronly positive. When the second variation δ 2 φ and the third variation δ 3 φ exist, they are described by φ[f; a] δφ[f; a] + 2 δ2 φ[f; a, a] + ɛ[f, a] a 2 L p (ν) (3) δφ[f; a] + 2 δ2 φ[f; a, a] + 6 δ3 φ[f; a, a, a] + ɛ[f, a] a 3 L p (ν), where ɛ[f, a] 0 as a L p (ν) 0. The quadratic functional δ 2 φ[f; a, a] defined on normed linear space L p (ν) is stronly positive if there exists a constant k > 0 such that δ 2 φ[f; a, a] k a 2 L p (ν) for all a L2 (ν). By definition of the second Fréchet derivative, δ 2 φ[; b, a] δφ[ + b; a] δφ[; a] 2 ( + b)adν 2 adν 2 badν. Thus δ 2 φ[; b, a] is a quadratic form, where δ 2 φ is actually independent of and stronly positive since δ 2 φ[; a, a] 2 a 2 dν 2 a 2 L 2 (ν) for all a L 2 (ν), which implies that φ is strictly convex and d φ [f, ] f 2 dν 2 dν 2 (f )dν (f ) 2 dν f 2 L 2 (ν). 2) Squared Bias: Under definition (2), squared bias is a Breman diverence, this we have not previously seen noted in the literature despite the importance of minimizin bias in estimation [2]. Let φ[] ( dν ) 2, where φ : L (ν) R. In this case φ[ + a] φ[] ( 2 dν + dν 2 ( adν) ( adν + ) 2 dν adν) 2. (4) Note that 2 dν adν is a continuous linear functional on L (ν) and ( adν ) 2 a 2 L (ν), so that 0 ( adν ) 2 a L (ν) a 2 L (ν) a L a (ν). L (ν) Thus from (4) and the definition of the Fréchet derivative, δφ[; a] 2 dν adν. By the definition of the second Fréchet derivative, δ 2 φ[; b, a] δφ[ + b; a] δφ[; a] 2 ( + b)dν adν 2 2 bdν adν dν is another quadratic form, and δ 2 φ is independent of. Then δ 2 φ[; a, a] is stronly positive because, ( δ 2 φ[; a, a] 2 ) 2 adν 2 a 2 L (ν) 0 adν for a L (ν), and thus φ is in fact strictly convex. The Breman diverence is thus d φ [f, ] ( ( ( 2 ( fdν) 2 ( fdν) + ) 2 (f )dν f 2 L (ν). 2 dν) 2 2 dν) 2 dν (f )dν dν fdν

4 FRIGYIK, SRIVASTAVA, GUPTA 3 3) Relative ntropy of Simple Functions: Denote by S the collection of all interable simple functions on the measure space (R d, Ω, ν), that is, the set of functions which can be written as a finite linear combination of indicator functions such that if S, can be expressed, (x) t α i I Ti ; α 0 0, i0 where I Ti is the indicator function of the set T i, {T i } t i is a collection of mutually disjoint sets of finite measure and T 0 R d \ t i T i. We adopt the convention that T 0 is the set on which is zero and therefore α i 0 if i 0. Consider the normed vector space (S, L (ν)) and let W be the subset (not necessarily a vector subspace) of nonneative functions in this normed space: If W then R d ln dν W { S subject to 0}. t i T i α i ln α i dν t α i ln α i ν(t i ), (5) i since 0 ln 0 0. Define the functional φ on W, φ[] ln dν, W. (6) R d The functional φ is not Fréchet-differentiable at because in eneral it cannot be uaranteed that + h is non-neative on the set where 0 for all perturbin functions h in the underlyin normed vector space ( S, L (ν)) with norm smaller than any prescribed ɛ > 0. However, a eneralized Gâteaux derivative can be defined if we limit the perturbin function h to a vector subspace. To that end, let G be the subspace of ( S, L (ν)) defined by G {f S subject to f dν dν}. It is straihtforward to show that G is a vector space. We define the eneralized Gâteaux derivative of φ at W to be the linear operator δ G φ[; ] if φ[ + h] φ[] δ G φ[; h] lim 0. (7) h L (ν) 0 h L (ν) h G Note, that δ G φ[; ] is not linear in eneral, but it is on the vector space G. In eneral, if G is the entire underlyin vector space then (7) is the Fréchet derivative, and if G is the span of only one element from the underlyin vector space then (7) is the Gâteaux derivative. Here, we have eneralized the Gâteaux derivative for the present case that G is a subspace of the underlyin vector space. It remains to be shown that iven the functional (6), the derivative (7) exists and yields a Breman diverence correspondin to the usual notion of relative entropy. Consider the possible solution δ G φ[; h] ( + ln )hdν, (8) R d which coupled with (6) does yield relative entropy. It remains to be shown only that (8) satisfies (7). Note that φ[ + h] φ[] δ G φ[; h] (h + ) ln h + hdν R d (h + ) ln h + hdν, (9) where is the set on which is not zero. Because W, there are m, > 0 such that m on. Let h G be such that h L (ν) m, then + h 0. Our oal is to show that the expression φ[ + h] φ[] δ G φ[; h] h L (ν) (0) is non-neative and that it is bounded above by a bound that oes to 0 as h L (ν) 0. We start by boundin the interand from above usin the inequality ln x x : (h + ) ln h + Then since h 2 / ( h L (ν)) 2 /m, φ[ + h] φ[] δ G φ[; h] h L (ν) h (h + ) h h h2. h 2 h L (ν) dν ν() m h L (ν). Since is interable ν() < and the riht hand side oes to 0 as h L (ν) 0. Next, in order to show that (0) is non-neative we have to prove that the interal (9) is not neative. To do so, we normalize the measure and apply Jensen s inequality. Take the first term of the interand of (9), (h + ) ln h + dν ( h + ln h + ) dν, h + L (ν) ln h + ( ) h + L (ν) λ d ν, L (ν) where the normalized measure d ν dν is a probability measure and λ(x) x ln x is a convex function on (0, ). L (ν) By Jensen s inequality and then chanin the measure back to dν

5 4 FUNCTIONAL BRGAN DIVRGNC dν, ( ) h + λ d ν ( ) h + L (ν)λ d ν ( ) + h L L (ν)λ (ν) L (ν) ( ) + h L + h L (ν) ln (ν) L (ν) ( + h L (ν) ) L (ν) + h L (ν) + h L (ν) L (ν) hdν, L (ν) where we used the fact that ln x x for all x > 0. By combinin these two latest results we find that (h + ) ln h + dν hdν, so equivalently (9) is always non-neative. This fact also confirms that the resultin relative entropy d φ [f, ] is always non-neative, because (9) is d φ [f, ] if one sets h f. Lastly, one must show that the functional defined in (6) is strictly convex. Aain we will show this by showin that the second variation of φ[] is stronly positive. Let f G and f L (ν) m. Usin the Taylor expansion of ln one can express, ( δ G φ[ + f; h] δ G φ[; h] h ln + f ) dν h f dν + ɛ[f, ] f L (ν), where ɛ[f, ] oes to 0 as f L (ν) 0 because Therefore and ɛ[f, ] L (ν) ν() h L (ν) 2 2 f L (ν). δgφ[; 2 h, f] δgφ[; 2 h, h] h f dν, h 2 dν h L (ν). Thus δg 2 φ[; h, h] is stronly positive. B. Relationship to Other Breman Diverence Definitions Two propositions establish the relationship of the functional Breman diverence to other Breman diverence definitions. Proposition I.2 (Functional Breman Diverence Generalizes Vector Breman Diverence). The functional definition (2) is a eneralization of the standard vector Breman diverence d φ(x, y) φ(x) φ(y) φ(y) T (x y), () where x, y R n, and φ : R n R is strictly convex and twice differentiable. Jones and Byrne describe a eneral class of diverences between functions usin a pointwise formulation [7]. Csiszár specialized the pointwise formulation to a class of diverences he termed Breman distances B s,ν [7], where iven a σ- finite measure space (, Ω, ν), and non-neative measurable functions f(x) and (x), B s,ν (f, ) equals s(f(x)) s((x)) s ((x))(f(x) (x))dν(x). (2) The function s : (0, ) R is constrained to be differentiable and strictly convex, and the limit lim x 0 s(x) and lim x 0 s (x) must exist, but not necessarily be finite. The function s plays a role similar to the function φ in the functional Breman diverence; however, s acts on the rane of the functions f,, whereas φ acts on the functions f,. Proposition I.3 (Functional Definition Generalizes Pointwise Definition). Given a pointwise Breman diverence as per (2), an equivalent functional Breman diverence can be defined as per (2) if the measure ν is finite. However, iven a functional Breman diverence d φ [f, ], there is not necessarily an equivalent pointwise Breman diverence. C. Properties of the Functional Breman Diverence The Breman diverence for random variables has some well-known properties, as reviewed in [, Appendix A]. Here, we note that the same properties hold for the functional Breman diverence (2). We ive complete proofs in [8].. Non-neativity: The functional Breman diverence is non-neative: d φ [f, ] 0 for all admissible inputs. 2. Convexity: The Breman diverence d φ [f, ] is always convex with respect to f. 3. Linearity: The functional Breman diverence is linear such that, d (c φ +c 2 φ 2 )[f, ] c d φ [f, ] + c 2 d φ2 [f, ]. 4. quivalence Classes: Partition the set of strictly convex, differentiable functionals {φ} into classes such that φ and φ 2 belon to the same class if d φ [f, ] d φ2 [f, ] for all f, A. For brevity we will denote d φ [f, ] simply by d φ. Let φ φ 2 denote that φ and φ 2 belon to the same class, then is an equivalence relation in that is satisfies the properties of reflexivity (d φ d φ ), symmetry (if d φ d φ2, then d φ2 d φ ), and transitivity (if d φ d φ2 and d φ2 d φ3, then d φ d φ3 ). Further, if φ φ 2, then they differ only by an affine transformation. 5. Linear Separation: The locus of admissible functions f L p (ν) that are equidistant from two fixed functions, 2 L p (ν) in terms of functional Breman diverence form a hyperplane.

6 FRIGYIK, SRIVASTAVA, GUPTA 5 6. Dual Diverence: Given a pair (, φ) where L p (ν) and φ is a strictly convex twice-continuously Fréchet-differentiable functional, then the function-functional pair (G, ψ) is the Leendre transform of (, φ) [9], if φ[] ψ[g] + (x)g(x)dν(x), (3) δφ[; a] G(x)a(x)dν(x), (4) where ψ is a strictly convex twice-continuously Fréchetdifferentiable functional, and G L q (ν), where p + q. Given Leendre transformation pairs f, L p (ν) and F, G L q (ν), d φ [f, ] d ψ [G, F ]. 7. Generalized Pythaorean Inequality: For any admissible f,, h L p (ν), d φ [f, h] d φ [f, ] + d φ [, h] + δφ[; f ] δφ[h; f ]. II. INIU PCTD BRGAN DIVRGNC Consider two sets of functions (or distributions), and A. Let F be a random function with realization f. Suppose there exists a probability distribution P F over the set, such that P F (f) is the probability of f. For example, consider the set of Gaussian distributions, and iven samples drawn independently and identically from a randomly selected Gaussian distribution N, the data imply a posterior probability P N (N ) for each possible eneratin realization of a Gaussian distribution N. The oal is to find the function A that minimizes the expected Breman diverence between the random function F and any function A. The followin theorem shows that if the set of possible minimizers A includes PF [F ], then PF [F ] minimizes the expectation of any Breman diverence. Note the theorem requires slihtly stroner conditions on φ than the definition of the Breman diverence (2) requires. Theorem II. (inimizer of the xpected Breman Diverence). Let δ 2 φ[f; a, a] be stronly positive and let φ C 3 (L (ν); R) be a three-times continuously Fréchetdifferentiable functional on L (ν). Let be a set of functions that lie on a manifold, and have associated measure d such that interation is well-defined. Suppose there is a probability distribution P F defined over the set. Let A be a set of functions that includes PF [F ] if it exists. Suppose the function minimizes the expected Breman diverence between the random function F and any function A such that ar inf A P F [d φ (F, )]. Then, if exists, it is iven by PF [F ]. (5) A. Bayesian stimation Theorem II. can be applied to a set of distributions to find the Bayesian estimate of a distribution iven a posterior or likelihood. For parametric distributions parameterized by θ R n, a probability measure Λ(θ), and some risk function R(θ, ψ), ψ R n, the Bayes estimator is defined [22] as ˆθ ar inf ψ R n R(θ, ψ)dλ(θ). (6) That is, the Bayes estimator minimizes some expected risk in terms of the parameters. It follows from recent results [6] that ˆθ [Θ] if the risk R is a Breman diverence, where Θ is the random variable whose realization is θ; this property has been previously noted [8], [0]. The principle of Bayesian estimation can be applied to the distributions themselves rather than to the parameters: ĝ ar inf R(f, )P F (f)d, (7) A where P F (f) is a probability measure on the distributions f, d is a measure for the manifold, and A is either the space of all distributions or a subset of the space of all distributions, such as the set. When the set A includes the distribution PF [F ] and the risk function R in (7) is a functional Breman diverence, then Theorem II. establishes that ĝ PF [F ]. For example, in recent work, two of the authors derived the mean class posterior distribution for each class for a Bayesian quadratic discriminant analysis classifier, and showed that the classification results were superior to parameter-based Bayesian quadratic discriminant analysis [23]. Of particular interest for estimation problems are the Breman diverence examples iven in Section I-A: total squared difference (mean squared error) is a popular risk function in reression [2]; minimizin relative entropy leads to useful theorems for lare deviations and other statistical subfields [24]; and analyzin bias is a common approach to characterizin and understandin statistical learnin alorithms [2]. B. Case Study: stimatin a Scaled Uniform Distribution As an illustration of the theorem, we present and compare different estimates of a scaled uniform distribution iven independent and identically drawn samples. Let the set of uniform distributions over [0, θ] for θ R + be denoted by U. Given independent and identically distributed samples, 2,..., n drawn from an unknown uniform distribution f U, the eneratin distribution is to be estimated. The risk function R is taken to be squared error or total squared error dependin on context. ) Bayesian Parameter stimate: Dependin on the choice of the probability measure Λ(θ), the interal (6) may not be finite; for example, usin the likelihood of θ with Lebesue measure the interal is not finite. A standard solution is to use a amma prior on θ and Lebesue measure. Let Θ be a random parameter with realization θ, let the amma distribution have parameters t and t 2, and denote the maximum of the data as

7 6 FUNCTIONAL BRGAN DIVRGNC max max{, 2,..., n }. Then a Bayesian estimate is formulated [22, p. 240, 285]: [Θ {, 2,..., n }, t, t 2 ] θ max e θ n+t + θt 2 dθ. (8) e θ n+t + θt 2 dθ max The interals can be expressed in terms of the chi-squared random variable I 2 v with v derees of freedom: [Θ {, 2,..., n }, t, t 2 ] t 2 (n + t a) P (χ 2 2(n+t ) < 2 t 2 max ) P (χ 2 2(n+t ) < 2 t 2 max ). (9) Note that (6) presupposes that the best solution is also a uniform distribution. 2) Bayesian Uniform Distribution stimate: If one restricts the minimizer of (7) to be a uniform distribution, then (7) is solved with A U. Because the set of uniform distributions does not enerally include its mean, Theorem II. does not apply, and thus different Breman diverences may ive different minimizers for (7). Let P F be the likelihood of the data (no prior is assumed over the set U), and use the Fisher information metric ( [25] [27]) for d. Then the solution to (7) is the uniform distribution on [0, 2 /n max ]. Usin Lebesue measure instead ives a similar result: [0, 2 /(n+/2) max ]. We were unable to find these estimates in the literature, and so their derivations are presented in the appendix. 3) Unrestricted Bayesian Distribution stimate: When the only restriction placed on the minimizer in (7) is that be a distribution, then one can apply Theorem II. and solve directly for the expected distribution PF [F ]. Let P F be the likelihood of the data (no prior is assumed over the set U), and use the Fisher information metric for d. Solvin (5), notin that the uniform probability of x is f(x) /a if x a and zero otherwise, and the likelihood of the n drawn points is (/ max ) n if a max and zero otherwise, ( ) ( ) ( da ) max(x, max) a a n a (x) max da a n a n ( max ) n. (20) (n + )[max(x, max )] n+ III. FURTHR DISCUSSION AND OPN QUSTIONS We have defined a eneral Breman diverence for functions and distributions that can provide a foundation for results in statistics, information theory and sinal processin. Theorem II. is important for these fields because it ties Breman diverences to expectation. As shown in Section II-A, Theorem II. can be directly applied to distributions to show that Bayesian distribution estimation simplifies to expectation when the risk function is a Breman diverence and the minimizin distribution is unrestricted. It is common in Bayesian estimation to interpret the prior as representin some actual prior knowlede, but in fact prior knowlede often is not available or is difficult to quantify. Another approach is to use a prior to capture coarse information from the data that may be used to stabilize the estimation [9], [23]. In practice, priors are sometimes chosen in Bayesian estimation to tame the tail of likelihood distributions so that expectations will exist when they miht otherwise be infinite [22]. This mathematically convenient use of priors adds estimation bias that may be unwarranted by prior knowlede. An alternative to mathematically convenient priors is to formulate the estimation problem as a minimization of an expected Breman diverence between the unknown distribution and the estimated distribution, and restrict the set of distributions that can be the minimizer to be a set for which the Bayesian interal exist. Open questions are how such restrictions tradeoff bias for reduced variance, and how to find or define an optimal restricted set of distributions for this estimation approach. Finally, there are some results for the standard vector Breman diverence that have not been extended here. It has been shown that a standard vector Breman diverence must be the risk function in order for the mean to be the minimizer of an expected risk [6, Theorems 3 and 4]. The proof of that result relies heavily on the discrete nature of the underlyin vectors, and it remains an open question as to whether a similar result holds for the functional Breman diverence. Another result that has been shown for the vector case but remains an open question in the functional case is converence in probability [6, Theorem 2]. ACKNOWLDGNTS This work was funded in part by the Office of Naval Research, Code 32, Grant # N The authors thank Inderjit Dhillon, Castedo llerman, and Galen Shorack for helpful discussions. A. Proof of Proposition I.2 APPNDI: PROOFS We ive a constructive proof that there is a correspondin functional Breman diverence d φ [f, ] for a specific choice of φ : L (ν) R, where ν n i δ c i and f, L (ν). Here, δ x denotes the Dirac measure such that all mass is concentrated at x, and {c, c 2,..., c n } is a collection of n distinct points in R d. For any x R n, define φ[f] φ(x, x 2,..., x n ), where f(c ) x, f(c 2 ) x 2,..., f(c n ) x n. Then the difference is φ[f; a] φ[f + a] φ[f] φ ((f + a)(c ),..., (f + a)(c n )) φ (x,..., x n ) φ (x + a(c ),..., x n + a(c n )) φ (x,..., x n ). Let a i be short hand for a(c i ), and use the Taylor expansion for functions of several variables to yield φ[f; a] φ(x,..., x n ) T (a,..., a n ) + ɛ[f, a] a L. Therefore, δφ[f; a] φ(x,..., x n ) T (a,..., a n ) φ(x) T a,

8 FRIGYIK, SRIVASTAVA, GUPTA 7 where x (x, x 2,..., x n ) and a (a,..., a n ). Thus, from (3), the functional Breman diverence definition (2) for φ is equivalent to the standard vector Breman diverence: d φ[f, ] φ[f] φ[] δφ[; f ] B. Proof of Proposition I.3 φ(x) φ(y) φ(y) T (x y). (2) First, we ive a constructive proof of the first part of the proposition by showin that iven a B s,ν, there is an equivalent functional diverence d φ. Then, the second part of the proposition is shown by example: we prove that the squared bias functional Breman diverence iven in Section I-A.2 is a functional Breman diverence that cannot be defined as a pointwise Breman diverence. Note that the interal to calculate B s,ν is not always finite. To ensure finite B s,ν, we explicitly constrain lim x 0 s (x) and lim x 0 s(x) to be finite. From the assumption that s is strictly convex, s must be continuous on (0, ). Recall from the assumptions that the measure ν is finite, and that the function s is differentiable on (0, ). Given a B s,ν, define the continuously differentiable function { s(x) x 0 s(x) s( x) + 2s(0) x < 0. Specify φ : L (ν) R as φ[f] Note that if f 0, φ[f] s(f(x))dν. s(f(x))dν. Because s is continuous on R, s(f) L (ν) whenever f L (ν), so the above interals always make sense. It remains to be shown that δφ[f; ] completes the equivalence when f 0. For h L (ν), φ[f + h] φ[f] s(f(x) + h(x))dν s(f(x))dν s(f(x) + h(x)) s(f(x))dν s (f(x))h(x) + ɛ (f(x), h(x)) h(x)dν s (f(x))h(x) + ɛ (f(x), h(x)) h(x)dν, where we used the fact that s(f(x) + h(x)) s(f(x)) + ( s (f(x)) + ɛ(f(x), h(x))) h(x) s(f(x)) + (s (f(x)) + ɛ(f(x), h(x))) h(x), because f 0. On the other hand, if h(x) 0 then ɛ(f(x), h(x)) 0, and if h(x) 0 then ɛ(f(x), h(x)) s(f(x) + h(x)) s(f(x)) h(x) + s (f(x)). Suppose {h n } L (ν) such that h n 0. Then there is a measurable set such that its complement is of measure 0 and h n 0 uniformly on. There is some N > 0 such that for any n > N, h n (x) ɛ for all x. Without loss of enerality, assume that there is some > 0 such that for all x, f(x). Since s is continuously differentiable, there is a K > 0 such that max{ s (t) subject to t [ ɛ, + ɛ]} K, and by the mean value theorem for almost all x. Then s(f(x) + h(x)) s(f(x)) h(x) ɛ(f(x), h(x)) 2K, K, except on a set of measure 0. The fact that h(x) 0 almost everywhere implies that ɛ(f(x), h(x)) 0 almost everywhere, and by Lebesue s dominated converence theorem, the correspondin interal oes to 0. As a result, the Fréchet derivative of φ is δφ[f; h] s (f(x))h(x)dν. (22) Thus the functional Breman diverence is equivalent to the iven pointwise B s,ν. We additionally note that the assumptions that f L (ν) and that the measure ν is finite are necessary for this proof. Counterexamples can be constructed if f L p or ν() such that the Fréchet derivative of φ does not obey (22). This concludes the first part of the proof. To show that the squared bias functional Breman diverence iven in Section I-A.2 is an example of a functional Breman diverence that cannot be defined as a pointwise Breman diverence we prove that the converse statement leads to a contradiction. Suppose (, Σ, ν) and (, Σ, µ) are measure spaces where ν is a non-zero σ-finite measure and that there is a differentiable function f : (0, ) R such that ( 2 ξdν) f(ξ)dµ, (23) where ξ L (ν). Let f(0) lim x 0 f(x), which can be finite or infinite, and let α be any real number. Then ( ) 2 ( ) 2 f(αξ)dµ αξdν α 2 ξdν α 2 f(ξ)dµ. Because ν is σ-finite, there is a measurable set such that 0 < ν() <. Let \ denote the complement of in. Then ( ) 2 α 2 ν 2 () α 2 I dν α 2 f(i )dµ α 2 f(0)dµ + α 2 f()dµ \ α 2 f(0)µ(\) + α 2 f()µ().

9 8 FUNCTIONAL BRGAN DIVRGNC Also, However, ( 2 αi dν) so one can conclude that ( 2 α 2 ν 2 () αi dν). f(αi )dµ f(αi )dµ + \ f(0)µ(\) + f(α)µ(); α 2 f(0)µ(\) + α 2 f()µ() f(αi )dµ f(0)µ(\) + f(α)µ(). (24) Apply equation (23) for ξ 0 to yield ( 2 0 0dν) f(0)dµ f(0)µ(). Since ν() > 0, µ() 0, so it must be that f(0) 0, and (24) becomes α 2 ν 2 () α 2 f()µ() f(α)µ() α R. The first equation implies that µ() 0. The second equation determines the function f completely: Then (23) becomes ( f(α) f()α 2. 2 ξdν) f()ξ 2 dµ. Consider any two disjoint measurable sets, and 2, with finite nonzero measure. Define ξ I and ξ 2 I 2. Then ξ ξ + ξ 2 and ξ ξ 2 I I 2 0. quation (23) becomes ξ dν ξ 2 dν f() ξ ξ 2 dµ. (25) This implies the followin contradiction: ξ dν ξ 2 dν ν( )ν( 2 ) 0, (26) but f() C. Proof of Theorem II. ξ ξ 2 dµ 0. (27) Recall that for a functional J to have an extremum (minimum) at f ˆf, it is necessary that δj[f; a] 0 and δ 2 J[f; a, a] 0, for f ˆf and for all admissible functions a A. A sufficient condition for a functional J[f] to have a minimum for f ˆf is that the first variation δj[f; a] must vanish for f ˆf, and its second variation δ 2 J[f; a, a] must be stronly positive for f ˆf. Let J[] PF [d φ (F, )] d φ [f, ]P F (f)d (φ[f] φ[] δφ[; f ])P F (f)d,(28) where (28) follows by substitutin the definition of Breman diverence (2). Consider the increment J[; a] J[ + a] J[] (29) (φ[ + a] φ[]) P F (f)d (δφ[ + a; f a] δφ[; f ]) P F (f)d, (30) where (30) follows from substitutin (28) into (29). Usin the definition of the differential of a functional iven in (), the first interand in (30) can be written as φ[ + a] φ[] δφ[; a] + ɛ[, a] a L (ν). (3) Take the second interand of (30), and subtract and add δφ[; f a], δφ[ + a; f a] δφ[; f ] δφ[ + a; f a] δφ[; f a] + δφ[; f a] δφ[; f ] (a) δ 2 φ[; f a, a] + ɛ[, a] a L (ν) + δφ[; f ] δφ[; a] δφ[; f ] (b) δ 2 φ[; f, a] δ 2 φ[; a, a] + ɛ[, a] a L (ν) δφ[; a] (32) where (a) follows from the linearity of the third term, and (b) follows from the linearity of the first term. Substitute (3) and (32) into (30), ( J[; a] δ 2 φ[; f, a] δ 2 φ[; a, a] ) + ɛ[, a] a L (ν) P F (f)d. Note that the term δ 2 φ[; a, a] is of order a 2 L (ν), that is, δ 2 φ[; a, a] m L (ν) a 2 L (ν) for some constant m. Therefore, where, J[ + a] J[] δj[; a] L lim (ν) 0, a L (ν) 0 a L (ν) δj[; a] δ 2 φ[; f, a]p F (f)d. (33) For fixed a, δ 2 φ[;, a] is a bounded linear functional in the second arument, so the interation and the functional can be interchaned in (33), which becomes [ ] δj[; a] δ 2 φ ; (f ) P F (f)d, a.

10 FRIGYIK, SRIVASTAVA, GUPTA 9 Usin the functional optimality conditions, J[] has an extremum for ĝ if [ ] δ 2 φ ĝ; (f ĝ) P F (f)d, a 0. (34) Set a (f ĝ) P (f)d in (34) and use the assumption that the quadratic functional δ 2 φ[; a, a] is stronly positive, which implies that the above functional can be zero if and only if a 0, that is, 0 (f ĝ)p F (f)d, (35) ĝ PF [F ], (36) where the last line holds if the expectation exists (i.e. if the measure is well-defined and the expectation is finite). Because a Breman diverence is not necessarily convex in its second arument, it is not yet established that the above unique extremum is a minimum. To see that (36) is in fact a minimum of J[], from the functional optimality conditions it is enouh to show that δ 2 J[ĝ; a, a] is stronly positive. To show this, for b L (ν), consider δj[ + b; a] δj[; a] (c) ( δ 2 φ[ + b; f b, a] δ 2 φ[; f, a] ) P F (f)d (d) ( δ 2 φ[ + b; f b, a] δ 2 φ[; f b, a] + δ 2 φ[; f b, a] δ 2 φ[; f, a] ) P F (f)d (e) ( δ 3 φ[; f b, a, b] + ɛ[, a, b] b L (ν) + δ 2 φ[; f, a] δ 2 φ[; b, a] δ 2 φ[; f, a] ) P F (f)d (f) ( δ 3 φ[; f, a, b] δ 3 φ[; b, a, b] + ɛ[, a, b] b L (ν) δ2 φ[; b, a] ) P F (f)d, (37) where (c) follows from usin interal (33); (d) from subtractin and addin δ 2 φ[; f b, a]; (e) from the fact that the variation of the second variation of φ is the third variation of φ [28]; and (f) from the linearity of the first term and cancellation of the third and fifth terms. Note that in (37) for fixed a, the term δ 3 φ[; b, a, b] is of order b 2 L (ν), while the first and the last terms are of order b L (ν). Therefore, δj[ + b; a] δj[; a] δ 2 J[; a, b] L lim (ν) b L (ν) 0 b L (ν) 0, where δ 2 J[; a, b] δ 3 φ[; f, a, b]p F (f)d + δ 2 φ[; a, b]p F (f)d. (38) Substitute b a, ĝ and interchane interation and the continuous functional δ 3 φ in the first interal of (38), then [ ] δ 2 J[ĝ; a, a] δ 3 φ ĝ; (f ĝ)p F (f)d, a, a + δ 2 φ[ĝ; a, a]p F (f)d δ 2 φ[ĝ; a, a]p F (f)d (39) k a 2 L (ν) P F (f)d k a 2 L (ν) > 0, (40) where (39) follows from (35), and (40) follows from the stron positivity of δ 2 φ[ĝ; a, a]. Therefore, from (40) and the functional optimality conditions, ĝ is the minimum. D. Derivation of the Bayesian Distribution-based Uniform stimate Restricted to a Uniform inimizer Let f(x) /a for all 0 x a and (x) /b for all 0 x b. Assume at first that b > a; then the total squared difference between f and is x (f(x) (x)) 2 dx a ( a ) 2 + (b a) b b a ab b a, ab ( ) 2 b where the last line does not require the assumption that b > a. In this case, the interal (7) is over the one-dimensional manifold of uniform distributions U; a Riemannian metric can be formed by usin the differential arc element to convert Lebesue measure on the set U to a measure on the set of parameters a such that (7) is re-formulated in terms of the parameters for ease of calculation: b ar min b R + b a a max ab a n df da da, (4) 2 where /a n is the likelihood of the n data points bein drawn from a uniform distribution [0, a], and the estimated distribution is uniform on [0, b ]. The differential arc element can be calculated by expandin df/da in terms of df da 2 the Haar orthonormal basis { a, φ jk (x)}, which forms a complete orthonormal basis for the interval 0 x a, and then the required norm is equivalent to the norm of the basis coefficients of the orthonormal expansion: df da. (42) 2 a3/2 For estimation problems, the measure determined by the Fisher information metric may be more appropriate than Lebesue measure [25] [27]. Then d I(a) 2 da, (43) where I is the Fisher information matrix. For the onedimensional manifold formed by the set of scaled uniform

11 0 FUNCTIONAL BRGAN DIVRGNC distributions U, the Fisher information matrix is ( d I(a) [ 2 lo )] a da 2 a 0 a 2 a dx a 2, so that the differential element is d da a. We solve (7) usin the Lebesue measure (42); the solution with the Fisher differential element follows the same loic. Then (4) is equivalent to ar min b J(b) b b a a max ab a n+3/2 da b a da a max ab a + a b da n+3/2 b ab a n+3/2 2 (n + /2)(n + 3/2)b n+3/2 b(n + 2 )n+ 2 max + (n + 3/2) n+3/2 max The minimum is found by settin the first derivative to zero: J 2 (n + 3/2) (ˆb) (n + /2)(n + 3/2) + 0 ˆb2 (n + /2)max n+/2 ˆb 2 n+/2 max. To establish that ˆb is in fact a minimum, note that J (ˆb) > 0. n+/2 ˆb max 2 3 n+7/2 n+/6 max Thus, the restricted Bayesian estimate is the uniform distribution over [0, 2 n+/2 max ]. RFRNCS [] B. Taskar, S. Lacoste-Julien, and. I. Jordan, Structured prediction, dual extraradient and Breman projections, Journal of achine Learnin Research, vol. 7, pp , [2] N. urata, T. Takenouchi, T. Kanamori, and S. uchi, Information eometry of U-Boost and Breman diverence, Neural Computation, vol. 6, pp , [3]. Collins, R.. Schapire, and Y. Siner, Loistic reression, AdaBoost and Breman distances, achine Learnin, vol. 48, pp , [4] J. Kivinen and. Warmuth, Relative loss bounds for multidimensional reression problems, achine Learnin, vol. 45, no. 3, pp , 200. [5] J. Lafferty, Additive models, boostin, and inference for eneralized diverences, Proc. of Conf. on Learnin Theory (COLT), 999. [6] S. Della Pietra, V. Della Pietra, and J. Lafferty, Duality and auxiliary functions for breman distances, Carneie ellon University Technical Report CU-CS-0-09R, 200. [7] L. K. Jones and C. L. Byrne, General entropy criteria for inverse problems, with applications to data compression, pattern classification, and cluster analysis, I Trans. on Information Theory, vol. 36, pp , 990. [8]. R. Gupta, S. Srivastava, and L. Cazzanti, Optimal estimation for nearest neihbor classifiers, Univ. of Washinton Dept. of lectrical nineerin Technical Report , 2006, available at idl.ee.washinton.edu. [9] S. Srivastava,. R. Gupta, and B. A. Friyik, Bayesian quadratic discriminant analysis, Journal of achine Learnin Research, vol. 8, pp , [0] A. Banerjee, An analysis of loistic models: xponential family connections and online performance, SIA Intl. Conf. on Data inin, [] A. Banerjee, S. eruu, I. S. Dhillon, and J. Ghosh, Clusterin with Breman diverences, Journal of achine Learnin Research, vol. 6, pp , [2] R. Nock and F. Nielsen, On weihtin clusterin, I Trans. on Pattern Analysis and achine Intellience, vol. 28, no. 8, pp , [3] G. LeBesenerais, J. Bercher, and G. Demoment, A new look at entropy for solvin linear inverse problems, I Trans. on Information Theory, vol. 45, pp , 999. [4] Y. Altun and A. Smola, Unifyin diverence minimization and statistical inference via convex duality, Proc. of Conf. on Learnin Theory (COLT), [5]. C. Pardo and I. Vajda, About distances of discrete distributions satisfyin the data processin theorem of information theory, I Trans. on Information Theory, vol. 43, no. 4, pp , 997. [6] A. Banerjee,. Guo, and H. Wan, On the optimality of conditional expectation as a Breman predictor, I Trans. on Information Theory, vol. 5, no. 7, pp , [7] I. Csiszár, Generalized projections for non-neative functions, Acta athematica Hunarica, vol. 68, pp. 6 85, 995. [8] B. A. Friyik, S. Srivastava, and. R. Gupta, An introduction to functional derivatives, Univ. of Washinton Dept. of lectrical ninerin Technical Report , Available at idl.ee.washinton.edu/publications.php. [9] I.. Gelfand and S. V. Fomin, Calculus of Variations. USA: Dover, [20] T. Rockafellar, Interals which are convex functionals, Pacific Journal of athematics, vol. 24, no. 3, pp , 968. [2] T. Hastie, R. Tibshirani, and J. Friedman, The lements of Statistical Learnin. New York: Spriner, 200. [22]. L. Lehmann and G. Casella, Theory of Point stimation. New York: Spriner, 998. [23] S. Srivastava and. R. Gupta, Distribution-based Bayesian minimum expected risk for discriminant analysis, Proc. of the I Intl. Symposium on Information Theory, [24] T. Cover and J. Thomas, lements of Information Theory. United States of America: John Wiley and Sons, 99. [25] R.. Kass, The eometry of asymptotic inference, Statistical Science, vol. 4, no. 3, pp , 989. [26] S. Amari and H. Naaoka, ethods of Information Geometry. New York: Oxford University Press, [27] G. Lebanon, Axiomatic eometry of conditional models, I Trans. on Information Theory, vol. 5, no. 4, pp , [28] C. H. dwards, Advanced Calculus of Several Variables. New York: Dover, 995.

Functional Bregman Divergence and Bayesian Estimation of Distributions Béla A. Frigyik, Santosh Srivastava, and Maya R. Gupta

5130 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 54, NO. 11, NOVEMBER 2008 Functional Bregman Divergence and Bayesian Estimation of Distributions Béla A. Frigyik, Santosh Srivastava, and Maya R. Gupta