STATISTICAL LIKELIHOOD REPRESENTATIONS OF PRIOR KNOWLEDGE IN MACHINE LEARNING

Mark A. Kon
Department of Mathematics and Statistics, Boston University, Boston, MA

Andrzej Przybyszewski
McGill University, Montreal, Canada, and University of Massachusetts, Worcester, MA

Leszek Plaskota
Department of Mathematics, Warsaw University, Warsaw, Poland

ABSTRACT

We show that maximum a posteriori (MAP) statistical methods can be used in nonparametric machine learning problems in the same way as in their current applications to parametric statistical problems, and give some examples of applications. This MAPN (MAP for nonparametric machine learning) paradigm can also reproduce, much more transparently, the same results as regularization methods in machine learning, spline algorithms in continuous complexity theory, and Bayesian minimum risk methods.

KEY WORDS

machine learning, Bayesian statistics

1 Introduction

Machine learning and artificial neural network theory often deal with the problem of learning an input-output (i-o) function from examples, i.e., from partial information ([1], [2], [3], [4], [5], [6], [7]). Given an unknown i-o function $f(x)$, along with examples $Nf = (f(x_1), \dots, f(x_n))$, the goal is to learn $f$. In this paper we describe an extension of standard maximum a posteriori (MAP) methods in parametric statistics to this (nonparametric) machine learning problem.

Consider the problem ([6], [7]) of recovering $f$ from a hypothesis space $F$ of possible functions using information $Nf$, or more general information $Nf = (L_1 f, \dots, L_n f)$, with $L_i$ general functionals on $f$ (e.g., Fourier coefficients). This problem occurs in machine learning, statistical learning theory ([6], [7]), information-based complexity, nonparametric Bayesian statistics ([8], [9]), optimal recovery [10], and data mining [11]. Since we will assume little about the reader's knowledge of this area, we will include some basic examples and definitions.

To give an example of this type of nonparametric machine learning, we might be seeking a function $f(x)$ [12] which represents a relationship between inputs and outputs in a chemical mixture. Suppose we are building a control system in which homeostatic parameters, including temperature, humidity, and amounts of chemical components of an industrial mixture, can be controlled as input variables. Suppose we want to control the output variable $y$, which is the ratio of strength to brittleness of the plastic produced from the mixture. We want to build a machine which has the above input variables $x = (x_1, x_2, \dots, x_n)$ and whose output predicts the correct ratio $y$. The machine will use experimental data points $y = f(x)$ to learn from previous runs of the equipment. We may already have a prior model for $f$ based on simple assumptions on the relationships of the variables. We then want to combine this prior information with that from the several runs we have made of our experiment.

Learning an unknown i-o function $f$ from a high dimensional hypothesis space $F$ is a nonparametric statistical problem: inference from the data $Nf$ is made from a very large set of possibilities. We will show that standard parametric MAP estimation algorithms can be extended directly to a MAP for nonparametric machine learning (MAPN). The method presented here is simple and intuitively appealing in spite of the high dimensionality of the problems, and its estimates under standard hypotheses coincide with those obtained by other methods, e.g., optimal recovery [10], information-based complexity [3], and statistical learning theory [7], as will be demonstrated below.
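
To make the setup concrete, here is a minimal sketch (in Python, with invented names; none of this is from the paper) of "standard information": the learner sees only the vector $Nf$ of point evaluations, never $f$ itself.

```python
import numpy as np

# Standard information: N maps the unknown i-o function f to its values
# at the sample points x_1, ..., x_n; the learner sees only Nf = y.
def N(f, sample_points):
    return np.array([f(x) for x in sample_points])

f_true = lambda x: np.sin(2 * np.pi * x)   # the "unknown" target, for simulation only
x_pts = np.linspace(0.0, 1.0, 9)           # design points x_1, ..., x_n
y = N(f_true, x_pts)                       # the data available to the learner
```
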
MAPN is a Bayesian learning algorithm which assumes prior knowledge given by a probability distribution $\mu$ for the unknown function $f \in F$, representing information about $f$ (we assume here that $F$ is a normed linear space). An example of Bayesian prior knowledge of the type mentioned above would be the stipulation that the probability distribution on $F$ is a Gaussian measure centered at a given function $f_0$. Examples of common a priori measures on a hypothesis space $F$ include Gaussian and elliptically contoured measures ([3], [13], [8]). The most important new element in this extension of MAP to hypothesis spaces of functions is a proof [14] that it is possible to define density functions $\rho(f)$ corresponding to measures $\mu$ in a way analogous to how this is done in finite-dimensional estimation (see below).

The MAPN algorithm, as does MAP, will then use the density function $\rho(f)$ corresponding to this measure [14], and maximize it over the set $N^{-1}(z)$ of functions $f \in F$ consistent with the data $z$, yielding the MAPN estimate (or do the same with an assumption of some measurement error; see below).

The key issue in the development of the MAPN algorithm is that, as in MAP, we require existence of a density function for $\mu$. As is well known, such a density can exist for measures on finite-dimensional $F$, i.e., when the number of parameters to be estimated is finite. In many data mining applications, however, as in the above example, an entire function $f$ (i.e., an infinite number of parameters) must be extrapolated from data, and the corresponding infinite-dimensional parameter space $F$ presents an obstacle to extending MAP, since measures in infinite dimension do not admit density functions in the standard sense. This is because the density function $\rho(f)$ is the derivative (with respect to Lebesgue measure) of the a priori probability measure $\mu$, and Lebesgue measure fails to exist in infinite dimension. Thus it has up to now been natural to assume that a probability density for $f$ cannot be defined or maximized if $F$ is a function space.

The method for defining a density function for a prior measure $\mu$ on an infinite-dimensional $F$ in fact can be accomplished in a way which is interestingly analogous to the finite-dimensional case. In the latter situation, $\rho(f) = d\mu(f)/df$, with $df$ Lebesgue measure (which exists only in finite dimension). We show that it is possible to construct densities for such measures also in infinite-dimensional spaces $F$. More generally, we can show such densities can even exist for finitely additive measures, e.g., the isonormal Gaussian measure on $F$ with covariance operator $C = \frac{1}{2} I$.

Under finite-dimensional Bayesian inference with prior density $\rho(f)$, the MAP estimate $\hat f$ is the maximizer of $\rho(f)$, subject to the data $z$:

$$\hat f = \arg\max_{f \in N^{-1}(z)} \rho(f).$$

Likelihood functions have some significant advantages, such as ease of use, ease of maximization, and ease of conditioning when further information (e.g., data $Nf = y$) becomes available.

As mentioned above, the lack of a density $\rho(f)$ in infinite-dimensional hypothesis spaces $F$ stems from the lack of a Lebesgue measure. However, Lebesgue measure is required only to define sets of the same size at different locations (whose probabilities are then compared). This can be accomplished in other ways in infinite-dimensional hypothesis spaces, as we show here.

We remark that the infinite-dimensional nature of MAPN should be viewed as summarizing an inductive limit of finite-dimensional approximation methods. The extent to which the algorithm is valid is determined by the validity of the same method for finite-dimensional approximations. Here such approximations would entail approximating the space $F$ of allowed i-o functions $f$ by a finite-dimensional space, say consisting of grid approximations of $f$. The validity of the MAPN procedure in infinite dimension states that there is a valid inductive limit of MAP algorithms for finite-dimensional approximations of the desired $f$.
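
As a concrete (and entirely illustrative) instance of such a finite-dimensional approximation, the sketch below carries out MAP for a Gaussian prior on the grid values of $f$, maximized subject to exact data $Nf = z$; the closed form $\hat f = C N^T (N C N^T)^{-1} z$ is the standard finite-dimensional result, while the covariance and observation pattern are invented for the example.

```python
import numpy as np

# MAP on a grid approximation of F: prior f ~ N(0, C) on the grid values,
# maximized subject to the exact-data constraint f[obs_idx] = z.
grid = np.linspace(0.0, 1.0, 50)
C = np.exp(-(grid[:, None] - grid[None, :]) ** 2 / 0.02)  # illustrative smooth covariance

obs_idx = np.array([5, 20, 35, 48])          # grid points where f is observed
z = np.array([0.1, 0.9, -0.3, 0.4])          # the data

Nmat = np.zeros((len(obs_idx), len(grid)))   # N as a selection matrix
Nmat[np.arange(len(obs_idx)), obs_idx] = 1.0

# Maximizing the Gaussian density over N^{-1}(z) gives the closed form
# f_hat = C N^T (N C N^T)^{-1} z (the minimum-||C^{-1/2} f|| interpolant).
f_hat = C @ Nmat.T @ np.linalg.solve(Nmat @ C @ Nmat.T, z)
```
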
2 Invariant measures and density functions

Let $\mu$ be a probability measure on a normed linear space $F$. For estimation of an unknown $f \in F$ we will first consider how transformations of $\mu$ affect estimators. Consider the invariance properties of the measure $\mu$ under a transformation $T: F \to F$. If $F$ is finite-dimensional, we can define $\mu$ to be invariant with respect to $T$ in two ways:

1. It can be measure-invariant with respect to $T$, so that $\mu(T^{-1}A) = \mu(A)$ for all measurable $A$.

2. It can be density-invariant with respect to $T$, so that (at least if $F$ is finite-dimensional) $\rho(Tf) = \rho(f)$ for all $f$, where $\rho(f) = d\mu(f)/df$ is the density of $\mu$, with $df$ Lebesgue measure.

In a Bayesian setting, let $\mu$ denote the a priori distribution for the unknown $f$, and assume we have data $Nf = y$ regarding $f$. Then the expected-error-minimizing estimator of $f$ is $f_B = E(f \mid Nf = y)$ [3]. If $\mu$ is measure-invariant with respect to $T$, then $T(f_B) = (Tf)_B$, i.e., the transform of the estimator is the estimator of the transform. Thus average-case estimation ([3], [15]) is invariant under $T$. However, this is not true for the MAP estimate. If $\mu$ is density-invariant under $T$, then the MAP estimate is preserved under $T$, i.e.,

$$\arg\max_f \rho(T^{-1} f) = T\left(\arg\max_f \rho(f)\right).$$

Note that in finite dimension, if $\mu$ is density-invariant with respect to every $T$ which preserves a norm $\|Ax\|$, then $\mu$ is elliptically contoured with respect to $A$; essentially this means $\mu$ is a superposition of Gaussians whose covariance operators are scalar multiples of a single covariance (Traub et al., 1987).

3 Definitions and basic results

Let $F$ be finite-dimensional, let $\mu$ represent prior knowledge about $f \in F$, and define the density $\rho(f) = d\mu/df$ (with $df$ Lebesgue measure). Then we can also define $\rho$ (up to a multiplicative constant) by

$$\frac{\rho(f)}{\rho(g)} = \lim_{\epsilon \to 0} \frac{\mu(B_\epsilon(f))}{\mu(B_\epsilon(g))}, \qquad (1)$$

where $B_\epsilon(f)$ denotes the $\epsilon$-ball centered at $f$. The first (derivative) definition does not extend to infinite dimension, but the second one, on the right of (1), does. More generally, for cylinder (finitely additive) probability measures $\mu$ on $F$, we can define a density by

$$\frac{\rho(f)}{\rho(g)} = \lim_{R(N) \to \infty,\ \epsilon \to 0} \frac{\mu(B_{\epsilon,N}(f))}{\mu(B_{\epsilon,N}(g))},$$

where $B_{\epsilon,N}(f)$ denotes the $\epsilon$-cylinder at $f$ of codimension $n$:

$$B_{\epsilon,N}(f) = \{f' : \|N(f - f')\|^2 \le \epsilon\},$$

and $N$ has finite rank, with $n = R(N) = \mathrm{rank}(N)$ (with some technical conditions on the sequence $N$ of finite-rank operators).

Theorem (Kon, 2004): If $\mu_1$ and $\mu_2$ are two outer regular measures on $F$, and if the derivative $\frac{d\mu_2}{d\mu_1}(f)$ exists and is finite $\mu_1$-a.e., then

$$r(f) = \lim_{\epsilon \to 0} \frac{\mu_2(B_\epsilon(f))}{\mu_1(B_\epsilon(f))}$$

exists, and $r(f) = \frac{d\mu_2}{d\mu_1}(f)$ almost everywhere.

Corollary: If $\mu$ has a density $\rho(f)$ and the derivative $\frac{d\mu(x - g)}{d\mu(x)}$ exists for all $g$, then

$$\frac{\rho(f_1)}{\rho(f_2)} = \left.\frac{d\mu(f - f_2 + f_1)}{d\mu(f)}\right|_{f = f_2}.$$

For example, if $\mu$ is a Gaussian measure with positive covariance $C$, then $\rho(f) = e^{-\frac{1}{2}\|Af\|^2}$, where $C = A^{-2}$. By the above, the case $C = I$ (a cylinder measure only) is included as well.

4 Density viewpoint: density functions as a priori information

As mentioned earlier, it is sometimes attractive to define a measure representing prior knowledge about $f$ which is only finitely additive, e.g., a cylinder measure, to properly reflect a priori information. As an example, consider a Gaussian $\mu$ with covariance operator $C = I$. It is known that in infinite dimension this is not a probability measure (i.e., it is not normalizable as a countably additive measure). Nevertheless, it has a density as a cylinder measure, as indicated above. This density

$$\rho(f) = e^{-\frac{1}{2}\|f\|^2} \qquad (2)$$

can reflect a priori knowledge about $f$ in a precise way. Specifically, $\rho(f)$ reflects relative preferences for different choices of $f$. For example, the density $\rho(f) \equiv 1$ represents an a priori Lebesgue measure for $f$, i.e., indicating no preference, on a space of any size. However, densities $\rho(f)$ as a priori information can incorporate other types of information, e.g., partial information about a probability distribution, or even non-countably-additive probabilities, e.g., cylinder measures (see, e.g., [16]).

The use of a density which does not correspond to a measure can be illustrated with a simple example of inference with a priori information, which is in fact analogous to the situation in an infinite-dimensional function space. Consider an unknown integer $n$. Suppose we only have the a priori knowledge that

$$P(n \text{ is even}) = 2\, P(n \text{ is odd}).$$

Such information cannot be formulated in a single probability distribution on the integers, since such a distribution would have to be uniform on the evens and on the odds, and no such countably additive distribution exists. A likelihood function is needed to incorporate this information, i.e.,

$$\rho(n) = \begin{cases} 2 & \text{if } n \text{ even} \\ 1 & \text{if } n \text{ odd.} \end{cases}$$

Likelihood functions play a similar role in an infinite-dimensional space $F$, reflecting partial knowledge about $f$ in a noncompact space where there is no uniform probability distribution. As an example in infinite dimension, suppose we know only that, given two choices $f_1$ and $f_2$ of $f$, the ratio of their probabilities is

$$\frac{e^{-\frac{1}{2}\|f_1\|^2}}{e^{-\frac{1}{2}\|f_2\|^2}};$$

it then makes sense to define a likelihood function $e^{-\frac{1}{2}\|f\|^2}$, whether or not this corresponds to an actual measure. Thus likelihood functions are a more natural way than measures of incorporating a priori information in infinite-dimensional Bayesian settings.
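
A small numerical sketch of this ratio viewpoint (with an invented grid and invented test functions, and $A = I$): even though $e^{-\frac{1}{2}\|f\|^2}$ is not the density of a countably additive measure on $F$, its ratio between two discretized candidates is perfectly well defined.

```python
import numpy as np

# Relative plausibility of two candidates under rho(f) = exp(-||f||^2 / 2),
# with the L^2 norm approximated by a Riemann sum on a grid.
x = np.linspace(-0.5, 0.5, 101)
dx = x[1] - x[0]

def log_rho(f_vals):
    return -0.5 * np.sum(f_vals ** 2) * dx

f1 = np.sin(np.pi * x)        # a moderate candidate
f2 = 3.0 * np.sin(np.pi * x)  # a larger candidate
ratio = np.exp(log_rho(f1) - log_rho(f2))  # rho(f1)/rho(f2) > 1: f1 preferred a priori
```
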
This can also be seen in finite- but high-dimensional situations, in which a naive regularization approach to incorporating a priori information regarding an unknown $f$ might run as follows. Common methods for inference in statistical learning theory ([6], [7], [17]) involve regularization. Thus in addition to data $y = Nf$, a priori information might be: $\|Af\|$ should be small for a fixed linear $A$, where, e.g., $A$ is a derivative operator, in which case $\|Af\|$ is a Sobolev norm. For simplicity, assume $A = I$ is the identity and the norm is Euclidean. A naive approach might be to assume a prior distribution on the random variable $R = \|f\|$, say

$$\rho_R(R) = \frac{2}{\sqrt{\pi}} e^{-R^2} \quad (R > 0). \qquad (3)$$

Note this marginal for $R$ corresponds to an $n$-dimensional distribution of the form

$$\rho_f(f) = C_1 \frac{1}{\|f\|^{n-1}} e^{-\|f\|^2}, \qquad (4)$$

if we assume $\rho_f(f)$ to be radially symmetric in $f$. This likelihood function is singular at the origin and clearly vanishes as $n \to \infty$ (so that it has no infinite-dimensional limit), seemingly ruling out likelihood methods in infinite dimension. Compare this to the present likelihood function approach: we start with the likelihood function $\rho_f(f) = C_2 e^{-\|Af\|^2}$ (above, $A = I$), which directly expresses the intuition that a function $f_1$ is preferable to a function $f_2$ by a likelihood factor $e^{-\|Af_1\|^2}/e^{-\|Af_2\|^2}$. An added feature is that this likelihood function is the density of a measure on $F$ if and only if $A^{-2}$ is trace class [3]. The point here is that working with standard probability measures can be misleading, while expressing a priori information in likelihood functions clarifies a priori assumptions in practical situations. In the infinite-dimensional case, if we assume an a priori likelihood function, e.g., $\rho(f) = e^{-\|Af\|^2}$, even if $A = I$ (so that $\rho$ is the density of an isonormal cylinder Gaussian measure), we understand the connection of $\rho(f)$ (and hence the underlying probabilities) with a priori knowledge about likelihoods (see below).

5 Applications

MAPN in Bayesian estimation: As indicated above, in infinite-dimensional Bayesian inference the MAP estimator can be as useful as in parametric statistics. An unknown $f \in F$ with a Bayesian prior distribution $\mu$ on $F$ now has a likelihood function, as in parametric statistics.

Gaussian prior: For example, consider an infinite-dimensional Gaussian prior measure $\mu$ on a Hilbert space $F$ with covariance $C$. Assume without loss of generality that $C$ has dense range. Defining $A = \sqrt{C^{-1}}$, $\mu$ can be shown to have density $\rho(f) = e^{-\frac{1}{2}\|Af\|^2}$ [14]. Suppose we are given standard information $y = Nf = (f(x_1), \dots, f(x_n))$. By the above, the conditional density $\rho(f \mid y)$ is the restriction of $\rho(f)$, so the MAPN estimate of $f$ is

$$\hat f = \arg\max_{Nf = y} e^{-\frac{1}{2}\|Af\|^2} = \arg\min_{Nf = y} \|Af\|.$$

Note that this corresponds to the spline estimate of $f$ [3], as well as the regularization theory estimate of $f$ for exact (i.e., error-free) information [17], and Bayesian minimum average error estimates based on Gaussian priors [15].

Gaussian prior with noisy information: For inexact information, assume a random independent error $\epsilon$ with density $\rho_\epsilon$. The information model is

$$y = Nf + \epsilon.$$

Then the MAPN estimate is

$$\hat f = \arg\max_f \rho(f \mid y), \quad \text{where} \quad \rho(f \mid y) = \frac{\rho_y(y \mid f)\, \rho(f)}{\rho_y(y)}.$$

If we further assume that the density in $\mathbb{R}^n$ of $\epsilon$ is Gaussian, i.e., $\rho_\epsilon(\epsilon) = C_3 e^{-\|B\epsilon\|^2}$, with $B$ linear and $C_3$ a constant, we conclude

$$\rho(f \mid y) = \frac{C_4\, e^{-\|B(Nf - y)\|^2}\, e^{-\|Af\|^2}}{\rho_y(y)} = C_5\, e^{-\|B(Nf - y)\|^2}\, e^{-\|Af\|^2},$$

where $C_5$ can depend on $y$. MAPN yields $\hat f$ as a maximizer of $e^{-\|B(Nf - y)\|^2 - \|Af\|^2}$, so

$$\hat f = \arg\min_f \|Af\|^2 + \|B(Nf - y)\|^2.$$

This negative log-likelihood can be minimized, e.g., using Lagrange multipliers. Note this is exactly the spline solution of the same problem, and also the minimum of the regularization functional $\|Af\|^2 + \|B(Nf - y)\|^2$ appearing in regularization methods for solving $Nf = y$ ([7], [17]). Note also that the same procedure can be used for estimates based on elliptically contoured measures, which is more difficult to do using minimum square error methods. If likelihood functions are used, operator methods for maximizing them which work in finite dimension also typically work in infinite dimension, with matrices replaced by operators (e.g., the covariance of a Gaussian becomes an operator).
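
The noisy-information estimate is easy to sketch on a grid; the following illustration (all operators and data invented) takes $A$ to be a first-difference smoothness operator and $B$ a scalar noise precision, and solves the normal equations $(A^T A + N^T B^T B N) f = N^T B^T B\, y$ of the minimization above.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
grid = np.linspace(0.0, 1.0, m)
h = grid[1] - grid[0]

A = (np.eye(m, k=1) - np.eye(m))[:-1] / h        # first-difference "derivative" operator
obs_idx = rng.choice(m, size=30, replace=False)  # observed grid points
Nmat = np.zeros((len(obs_idx), m))
Nmat[np.arange(len(obs_idx)), obs_idx] = 1.0
y = np.sin(2 * np.pi * grid[obs_idx]) + 0.1 * rng.normal(size=len(obs_idx))
B = np.eye(len(obs_idx)) / 0.1                   # noise precision (sigma = 0.1)

# f_hat = argmin ||A f||^2 + ||B (N f - y)||^2, via the normal equations
lhs = A.T @ A + Nmat.T @ B.T @ B @ Nmat
f_hat = np.linalg.solve(lhs, Nmat.T @ B.T @ B @ y)
```
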
Example: In the example of the chemical mixture mentioned earlier, suppose that we have an input vector $x = (x_1, x_2, \dots, x_{20})$ representing an input mixture, with the 20 parameters representing temperature and mixing strength, along with 18 numbers representing proportions of chemical components (in moles/kg). The measured output $y$ represents the measured ratio of strength to brittleness for the plastic product of the mixture. In this example we assume $n = 400$ runs of the experiment (a relatively small number given the size of the space $F$ of possible i-o functions). For simplicity we assume that all variables $x_i$ have been re-scaled so that their experimental range is the interval $[-1/2, 1/2]$.

Thus the a priori space $F$ of possible i-o functions will be all (square integrable) functions on the input space $X = [-1/2, 1/2]^{20}$. As a first approximation we assume that the target $f \in F$ is smooth, and thus we take the prior probability distribution of $f$ to be a Gaussian, favoring functions which are smooth and (for regularity purposes) not large. In addition, it may be felt (from previous experience) that the input variables $x_1, \dots, x_5$ are associated with sharper variations in $y$ than the other 15 variables, and we expect the variation of $f$ in these directions to be more sensitive to data than to our a priori assumptions of smoothness. Finally, we will want the a priori smoothness and size assumptions to have more weight in regions where there are fewer data, in a precise parametric way. To this end our a priori desideratum will initially require "smallness" of the regularization term

$$\|f(x)\|_B^2 \equiv \|\rho(x)\,[1 + (a \cdot D)]\, f(x)\|^2,$$

where $a = (0.1, 0.1, \dots, 0.1, 0.2, 0.2, \dots, 0.2) \in \mathbb{R}^{20}$ has its first 5 components equal to 0.1 (reflecting a greater desired dependence of $f$ on the variations in the first five variables) and its last 15 components equal to 0.2 (reflecting a smaller dependence on the last 15 variables); these parameters can be adjusted. We have also defined

$$D = \left(\frac{\partial^{15}}{\partial x_1^{15}}, \frac{\partial^{15}}{\partial x_2^{15}}, \dots, \frac{\partial^{15}}{\partial x_{20}^{15}}\right);$$

the high order of this operator is necessary since its domain must consist of continuous functions on $\mathbb{R}^{20}$, i.e., functions which have more than 10 derivatives. Note that the norm $\|\cdot\|$ above is given by $\|f\|^2 = \int_{\mathbb{R}^{20}} f(x)^2\, dx$, and the associated inner product is $\langle f, g \rangle = \int_{\mathbb{R}^{20}} f(x) g(x)\, dx$.

The function $\rho(x)$ above is chosen to reflect the density $\delta(x)$ of the sample points $\{x_k\}_{k=1}^n$ at the location $x$. Note that if $\delta(x)$ is large, then we wish our estimate to depend more on data and less on a priori assumptions, so we want $\rho(x)$ to be small there; the opposite is desired when $\delta(x)$ is small. As an initial approximation we choose $\rho(x) = (1 + \delta(x))^{-1}$, with the local linear density $\delta(x)$ defined by

$$\delta(x)^m = \sum_{k=1}^n e^{-(x - x_k)^2},$$

where here the dimension $m = 20$.

We will need to define a domain with proper boundary conditions for the square of the above regularization operator in order to obtain a function class $F$ with a unique Gaussian distribution. The simplest boundary conditions are Dirichlet B.C., i.e., the requirement that $f$ vanish on the boundary. Since this is not a natural requirement for our function class, we extend the domain space $F$ to be square integrable functions on $[-1, 1]^{20}$, in order to impose such conditions sufficiently far away from the "region of interest" (the valid domain of variation of the inputs, $[-1/2, 1/2]^{20} \subset [-1, 1]^{20}$). Thus $F$ consists of square integrable functions in 20 variables, each in the interval $[-1, 1]$. We define the operator

$$a \cdot D = \left(a_1 \frac{\partial^{15}}{\partial x_1^{15}}, \dots, a_{20} \frac{\partial^{15}}{\partial x_{20}^{15}}\right).$$

The Dirichlet operator $B$ defined by $\langle f, Bf \rangle = \|\rho(x)[1 + (a \cdot D)] f(x)\|^2$ is given by

$$Bf = (\rho(x)[1 + (a \cdot D)])^* (\rho(x)[1 + (a \cdot D)]) f = \rho(x)[1 + (a \cdot D)][1 - (a \cdot D)]\rho(x) f(x) = \rho(x)(1 - (a \cdot D)^2)\rho(x) f(x),$$

with the domain of $B$ restricted to $f$ satisfying $f(x) = 0$ on the boundary, i.e., when any $|x_i| = 1$. This operator has a trace class inverse $B^{-1} = C$, which is the covariance operator of a Gaussian measure $P$ on $F$. We define $P$ to be the associated Gaussian distribution on the space $F$ of functions $f: [-1, 1]^{20} \to \mathbb{R}$. Note that our data are restricted to the subset $[-1/2, 1/2]^{20} \subset [-1, 1]^{20}$. The measure $P$ is our a priori Gaussian measure on $F$.
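
A sketch of the data-density weight just described (the kernel and exponent follow the formulas above; the sample points are simulated here, so the numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20                                      # input dimension
X = rng.uniform(-0.5, 0.5, size=(400, m))   # the n = 400 experimental runs x_k

def delta(x):
    # local "linear" density: delta(x)^m = sum_k exp(-||x - x_k||^2)
    return np.sum(np.exp(-np.sum((X - x) ** 2, axis=1))) ** (1.0 / m)

def rho(x):
    # small where data are dense (trust the data), near 1 where data are
    # sparse (trust the prior)
    return 1.0 / (1.0 + delta(x))

w = rho(np.zeros(m))   # weight at the center of the input cube
```
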
By the above analysis, given an unknown $f \in F$ and information $y = Nf = (f(x_1), \dots, f(x_n))$, the MAPN estimator of $f$ is

$$\hat f = \arg\inf_{Nf = y} \|\rho(x)[1 + (a \cdot D)] f(x)\|^2 = \arg\inf_{Nf = y} \langle f, Bf \rangle = \arg\inf_{Nf = y} \langle f, \rho(x)(1 - (a \cdot D)^2)\rho(x) f \rangle.$$

Note that ([18], [7], [12]) the above minimizer is

$$\hat f(x) = \sum_{k=1}^n c_k G(x, x_k), \qquad (5)$$

where $G(x, x')$ is the Green function for the differential operator $B$ defined above. Thus $G$ is the kernel of the covariance operator $B^{-1} = C$, which can be computed separately. Note that

$$B^{-1} f = \rho(x)^{-1} (1 - (a \cdot D)^2)^{-1} \rho(x)^{-1} f.$$

Thus

$$G(x, x') = \rho(x)^{-1}\, G_0(x, x')\, \rho(x')^{-1},$$

where $G_0(x, x')$ is the kernel of

$$(1 - (a \cdot D)^2)^{-1} = \left(1 - \sum_{i=1}^{20} a_i^2 \frac{\partial^{30}}{\partial x_i^{30}}\right)^{-1} \qquad (6)$$

on $F$, with vanishing boundary conditions. Since the boundaries of the domain $[-1, 1]^{20}$ of $F$ are relatively far from the region of interest $[-0.5, 0.5]^{20}$ in which the data lie, we approximate the Green kernel $G_0(x, x')$, whose boundary conditions vanish on the boundary of $[-1, 1]^{20}$, by the kernel $\tilde G(x, x')$ of $(1 - (a \cdot D)^2)^{-1}$ on $\mathbb{R}^{20}$, with boundary conditions at $\infty$. This has the form

$$\tilde G(x, x') = \tilde G(x - x') = \mathcal{F}^{-1}\left[\left(1 - \sum_{i=1}^{20} a_i^2 (i \xi_i)^{30}\right)^{-1}\right](x - x'), \qquad (7)$$

where $\mathcal{F}$ denotes the Fourier transform and $\xi_i$ is the variable dual to $x_i$. Thus our approximation of $\hat f$ is given by (5) (see, e.g., [12]), with $c = (c_i)_{i=1}^n = G^{-1} (Nf)^T$, where $G$ is the matrix with $G_{ij} = \tilde G(x_i - x_j)$ and $T$ denotes transpose. Explicitly,

$$\hat f(x) = \sum_{k=1}^n c_k G(x - x_k) \approx \sum_{k=1}^n \left[G^{-1}(Nf)^T\right]_k \tilde G(x - x_k),$$

where $\tilde G(x - x_k)$ is computable from (7).
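
The closed form (5) is straightforward to sketch numerically; in the following illustration a Gaussian radial kernel stands in for the true Green kernel $\tilde G$ of (7) (which would require the inverse Fourier transform there), and the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(50, 20))   # data sites x_1, ..., x_n
y = np.sin(X.sum(axis=1))                   # the information Nf

def G_kernel(u, v):
    # placeholder translation-invariant kernel standing in for (7)
    return np.exp(-np.sum((u - v) ** 2))

Gram = np.array([[G_kernel(xi, xj) for xj in X] for xi in X])  # G_ij = G(x_i - x_j)
c = np.linalg.solve(Gram + 1e-10 * np.eye(len(X)), y)          # c = G^{-1} (Nf)^T

def f_hat(x):
    # f_hat(x) = sum_k c_k G(x - x_k), as in (5)
    return sum(ck * G_kernel(x, xk) for ck, xk in zip(c, X))
```
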

6 Conclusions

We have shown that it is possible to extend MAP estimates to high and infinite dimensional approximations. In applications we have constructed an approximation to an unknown function $f$ from pointwise information $Nf = (f(x_1), \dots, f(x_n))$ by incorporating prior information regarding smoothness of $f$, and adapting the solution in a way which depends on the local density of the data points $x_i$.

References

[1] T. Mitchell, Machine Learning (New York: McGraw-Hill, 1997).

[2] J. Traub and H. Wozniakowski, A General Theory of Optimal Algorithms (New York: Academic Press, 1980).

[3] J. Traub, G. Wasilkowski, and H. Wozniakowski, Information-Based Complexity (Boston: Academic Press, 1988).

[4] T. Poggio and S. Smale, The mathematics of learning: dealing with data, Notices of the AMS, 50, 2003.

[5] T. Poggio and C. Shelton, Machine learning, machine vision, and the brain, The AI Magazine, 20, 1999. URL: citeseer.nj.nec.com/poggio99machine.htm

[6] V. Vapnik, The Nature of Statistical Learning Theory (New York: Springer, 1995).

[7] V. Vapnik, Statistical Learning Theory (New York: Wiley, 1998).

[8] G. Wahba, Generalization and regularization in nonlinear learning systems, in M. Arbib (Ed.), Handbook of Brain Theory and Neural Networks (Cambridge: MIT Press, 1995).

[9] G. Wahba, Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV, in B. Schoelkopf, C. Burges & A. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning (Cambridge: MIT Press, 1999).

[10] C. Micchelli and T. Rivlin, Lectures on optimal recovery, Lecture Notes in Mathematics 1129 (Berlin: Springer-Verlag, 1985).

[11] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction (Berlin: Springer-Verlag, 2001).

[12] M. Kon and L. Plaskota, Information complexity of neural networks, Neural Networks, 13, 2000.

[13] J. Traub and H. Wozniakowski, Breaking intractability, Scientific American, 270, 1994.

[14] M. Kon, Density functions for machine learning and optimal recovery, preprint, 2004.

[15] G. Wahba, Bayesian confidence intervals for the cross-validated smoothing spline, Journal of the Royal Statistical Society B, 45, 1983.

[16] N. Friedman and J. Halpern, Plausibility measures and default reasoning, in Thirteenth National Conf. on Artificial Intelligence (New York: AAAI, 1996).

[17] T. Evgeniou, M. Pontil, and T. Poggio, Regularization networks and support vector machines, Advances in Computational Mathematics, 13, 2000.

[18] C. A. Micchelli and M. Buhmann, On radial basis approximation on periodic grids, Math. Proc. Camb. Phil. Soc., 112, 1992.

[19] C. A. Micchelli, Interpolation of scattered data: distance matrices and conditionally positive definite functions, Constructive Approximation, 2, 1986.
