On the Value of Partial Information for Learning from Examples


JOURNAL OF COMPLEXITY 13 (1998), ARTICLE NO. CM

On the Value of Partial Information for Learning from Examples

Joel Ratsaby,* Department of Electrical Engineering, Technion, Haifa, Israel, and Vitaly Maiorov, Department of Mathematics, Technion, Haifa, Israel

Received August 15, 1996

The PAC model of learning and its extension to real-valued function classes provides a well-accepted theoretical framework for representing the problem of learning a target function g(x) using a random sample {(x_i, g(x_i))}_{i=1}^m. Based on the uniform strong law of large numbers, the PAC model establishes the sample complexity, i.e., the sample size m which is sufficient for accurately estimating the target function to within high confidence. Often, in addition to a random sample, some form of prior knowledge is available about the target. It is intuitive that increasing the amount of information should have the same effect on the error as increasing the sample size. But quantitatively, how does the rate of error with respect to increasing information compare to the rate of error with increasing sample size? To answer this we consider a new approach based on a combination of the information-based complexity of Traub et al. and Vapnik–Chervonenkis (VC) theory. In contrast to VC-theory, where function classes of finite pseudo-dimension are used only for statistical-based estimation, we let such classes play a dual role of functional estimation as well as approximation. This is captured in a newly introduced quantity, ρ_d(F), which represents a nonlinear width of a function class F. We then extend the notion of the nth minimal radius of information and define a quantity I_{n,d}(F) which measures the minimal approximation error of the worst-case target g ∈ F by the family of function classes having pseudo-dimension d, given partial information on g consisting of values taken by n linear operators. The error rates are calculated, which leads to a quantitative notion of the value of partial information for the paradigm of learning from examples. © 1998 Academic Press

* jer@ee.technion.ac.il. All correspondence should be mailed to this author. maiorov@tx.technion.ac.il

Copyright 1998 by Academic Press. All rights of reproduction in any form reserved.

1. INTRODUCTION

The problem of machine learning using randomly drawn examples has received in recent years a significant amount of attention while serving as the basis of research in what is known as the field of computational learning theory. Valiant [35] introduced a learning model based on which many interesting theoretical results pertaining to a variety of learning paradigms have been established. The theory is based on the pioneering work of Vapnik and Chervonenkis [36–38] on finite sample convergence rates of the uniform strong law of large numbers (SLN) over classes of functions. In its basic form it sets a framework known as the probably approximately correct (PAC) learning model. In this model an abstract teacher provides the learner with a finite number m of i.i.d. examples {(x_i, g(x_i))}_{i=1}^m randomly drawn according to an unknown underlying distribution P over X, where g is the target function to be learnt to some prespecified arbitrary accuracy ε > 0 (with respect to the L_1(P)-norm) and confidence 1 − δ, where δ > 0. The learner has at his discretion a functional class referred to as the hypothesis class from which he is to determine a function ĥ, sample-dependent, which estimates the unknown target g to within the prespecified accuracy and confidence levels. There have been numerous studies and applications of this learning framework to different learning problems (Kearns and Vazirani [18], Hanson et al. [15]). The two main variables of interest in this framework are the sample complexity, which is the sample size sufficient for guaranteeing the prespecified performance, and the computational complexity of the method used to produce the estimator hypothesis ĥ.

The bulk of the work in computational learning theory and, similarly, in the classical field of pattern recognition, treats the scenario in which the learner has access only to randomly drawn samples. It is often the case, however, that some additional knowledge about the target is available through some form of a priori constraints on the target function g. In many areas where machine learning may be applied there is a source of information, sometimes referred to as an oracle or an expert, which supplies random examples and even more complex forms of partial information about the target. A few instances of such learning problems include: (1) Pattern classification. Credit card fraud detection, where a tree classifier (Devroye et al. [12]) is built from a training sample consisting of patterns of credit card usage in order to learn to detect transactions that are potentially fraudulent. Partial information may be represented by an existing tree which is based on human-expert knowledge. (2) Prediction and financial analysis. Financial forecasting and portfolio management, where an artificial neural network learns from time-series data and is given rule-based partial knowledge translated into constraints on the weights of the neuron elements. (3) Control and optimization. Learning a control process for industrial manufacturing, where partial information represents quantitative physical constraints on the various machines and their operation.

For some specific learning problems the theory predicts that partial knowledge is very significant. For instance, in statistical pattern classification or in density estimation, having some knowledge about the underlying probability distributions may crucially influence the complexity of the learning problem (cf. Devroye [11]). If the distributions are known to be of a certain parametric form, an exponentially large savings in sample size may be obtained (cf. Ratsaby [28], Ratsaby and Venkatesh [30, 31]). In general, partial information may appear as knowledge about certain properties of the target function. In parametric-based estimation or prediction problems, e.g., maximum likelihood estimation, knowledge concerning the unknown target may appear in terms of a geometric constraint on the Euclidean subset that contains the true unknown parameter. In problems of pattern recognition and statistical regression estimation, often some form of a criterion functional over the hypothesis space is defined. For instance, in artificial neural networks, the widely used back-propagation algorithm (cf. Ripley [32]) implements a least-square-error criterion defined over a finite-dimensional manifold spanned by ridge functions of the form σ(a^T x + b), where σ(y) = 1/(1 + e^{−y}). Here prior knowledge can take the form of a constraint added on to the minimization of the criterion. In Section 3 we provide further examples where partial information is used in practice.

It is intuitive that general forms of prior partial knowledge about the target and random sample data are both useful. PAC provides the complexity of learning in terms of the sample sizes that are sufficient to obtain accurate estimation of g. Our motive in this paper is to study the complexity of learning from examples while being given prior partial information about the target. We seek the value of partial information in the PAC learning paradigm. The approach taken here is based on combining frameworks of two fields in computer science, the first being information-based complexity (cf. Traub et al. [34]), which provides a representation of partial information, while the second, computational learning theory, furnishes the framework for learning from random samples.

The remainder of this paper is organized as follows: In Section 2 we briefly review the PAC learning model and Vapnik–Chervonenkis theory. In Section 3 we provide motivation for the work. In Section 4 we introduce a new approximation width which measures the degree of nonlinear approximation of a functional class. It joins elementary concepts from Vapnik–Chervonenkis theory and classical approximation theory. In Section 5 we briefly review some of the definitions of information-based complexity and then introduce the minimal information-error I_{n,d}(·). In Section 6 we combine the PAC learning error with the minimal partial information error to obtain a unified upper bound on the error. In Section 7 we compute this upper bound for the case of learning a Sobolev target class. This yields a quantitative trade-off between partial information and sample size. We then compute a lower bound on the minimal partial information error for the Sobolev class which yields an almost optimal information operator. The Appendix includes the proofs of all theorems in the paper.

2. OVERVIEW OF THE PROBABLY APPROXIMATELY CORRECT LEARNING MODEL

Valiant [35] introduced a new complexity-based model of learning from examples and illustrated this model for problems of learning indicator functions over the boolean cube {0, 1}^n. The model is based on a probabilistic framework which has become known as the probably approximately correct, or PAC, model of learning. Blumer et al. [6] extended this basic PAC model to learning indicator functions of sets in Euclidean R^n. Their methods are based on the pioneering work of Vapnik and Chervonenkis [36] on finite sample convergence rates of empirical probability estimates, independent of the underlying probability distribution. Haussler [16] has further extended the PAC model to real- and vector-valued functions, which is applicable to general statistical regression, density estimation and classification learning problems. We start with a description of the basic PAC model and some of the relevant results concerning the complexity of learning.

A target class F is a class of Borel measurable functions over a domain X containing a target function g which is to be learnt from a sample z^m = {(x_i, g(x_i))}_{i=1}^m of m examples that are randomly drawn i.i.d. according to any fixed probability distribution P on X. Define by S the sample space for F, which is the set of all samples of size m over all functions f ∈ F, for all m ≥ 1. Fix a hypothesis class H of functions on X which need not be equal to nor contained in F. A learning algorithm φ: S → H is a function that, given a large enough randomly drawn sample of any target in F, returns a Borel measurable function h ∈ H (a hypothesis) which is with high probability a good approximation of the target function g. Associated with each hypothesis h is a nonnegative error value L(h), which measures its disagreement with the target function g on a randomly drawn example, and an empirical error L_m(h), which measures the disagreement of h with g averaged over the observed m examples. Note that the notation L(h) and L_m(h) leaves the dependence on g and P implicit. For the special case of F and H being classes of indicator functions over sets of X = R^n, the error of a hypothesis h is defined to be the probability (according to P) of its symmetric difference with the target g; i.e.,

L(h) = P({x ∈ R^n : g(x) ≠ h(x)}).   (1)

Correspondingly, the empirical error of h is defined as

L_m(h) = (1/m) Σ_{i=1}^m 1_{g(x_i) ≠ h(x_i)},   (2)

where 1_{x ∈ A} stands for the indicator function of the set A. For real-valued function classes F and H, the error of a hypothesis h is taken as the expectation E l(h, g) (with respect to P) of some positive real-valued loss function l(h, g), e.g., the quadratic loss l(h, g) = (h(x) − g(x))^2 in regression estimation, or the log likelihood loss l(h, g) = ln(g(x)/h(x)) for density estimation. Similarly, the empirical error now becomes the average loss over the sample, i.e., L_m(h) = (1/m) Σ_{i=1}^m l(h(x_i), g(x_i)). We now state a formal definition of a learning algorithm which is an extension of a definition in Blumer et al. [6].

DEFINITION 1 (PAC-learning algorithm). Fix a target class F, a hypothesis class H, a loss function l(·, ·), and any probability distribution P on X. Denote by P^m the m-fold joint probability distribution on X^m. A function φ is a learning algorithm for F with respect to P with sample size m ≥ m(ε, δ) if for all ε > 0, 0 < δ < 1, for any fixed target g ∈ F, with probability 1 − δ, based on a randomly drawn sample z^m, the hypothesis ĥ = φ(z^m) has an error L(ĥ) ≤ L(h*) + ε, where h* is an optimal hypothesis, i.e., L(h*) = inf_{h∈H} L(h). Formally, this is stated as: P^m({z^m ∈ X^m : L(ĥ) > L(h*) + ε}) ≤ δ. The smallest sample size m(ε, δ) such that there exists a learning algorithm φ for F with respect to all probability distributions is called the sample complexity of φ, or simply the sample complexity for learning F by H. If such a φ exists then F is said to be uniformly learnable by H.

We note that in the case of real-valued function classes the sample complexity depends on the error function through the particular loss function used. Algorithms φ which output a hypothesis ĥ that minimizes L_m(h) over all h ∈ H are called empirical risk minimization (ERM) algorithms (cf. Vapnik [38]). The theory of uniform learnability for ERM algorithms forms the basis for the majority of the works in the field of computational learning theory, primarily because the sample complexity is directly related to a capacity quantity called the Vapnik–Chervonenkis dimension of H in the case of an indicator function class, or to the pseudo-dimension in the case of a real-valued function class. These two quantities are defined and discussed below. Essentially the theory says that if the capacity of H is finite then F is uniformly learnable. We note that there are some pedagogic instances of functional classes, even of infinite pseudo-dimension, for which any target function can be exactly learnt from a single example of the form (x, g(x)) (cf. Bartlett et al., p. 299). For such target classes the sample complexity of learning by ERM is significantly greater than one, so ERM is not an efficient form of learning. Henceforth all the results are limited to ERM learning algorithms.
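A minimal Python sketch of the ERM rule in Definition 1 follows (this is an illustration, not code from the paper): the finite threshold hypothesis class, the target g, the absolute loss, and the uniform sampling distribution are all illustrative assumptions.

```python
import random

def empirical_error(h, sample, loss):
    """L_m(h): average loss of hypothesis h over the sample {(x_i, g(x_i))}."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

def erm(hypotheses, sample, loss):
    """Empirical risk minimization: return the hypothesis with smallest L_m(h)."""
    return min(hypotheses, key=lambda h: empirical_error(h, sample, loss))

# Illustrative setup (not from the paper): target g, a finite class of
# threshold indicator hypotheses, absolute loss, and P = uniform on [0, 1].
g = lambda x: 1.0 if x > 0.37 else 0.0
hypotheses = [lambda x, t=t: 1.0 if x > t else 0.0 for t in [i / 20 for i in range(21)]]
absolute_loss = lambda yhat, y: abs(yhat - y)

m = 200
sample = [(x, g(x)) for x in (random.random() for _ in range(m))]
h_hat = erm(hypotheses, sample, absolute_loss)
print("empirical error of ERM hypothesis:", empirical_error(h_hat, sample, absolute_loss))
```

Since the hypotheses here are indicator-valued, the absolute loss coincides with the 0–1 disagreement of (2), so the same routine covers the indicator and the real-valued settings.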

We start with the following definition.

DEFINITION 2 (Vapnik–Chervonenkis dimension). Given a class H of indicator functions of sets in X, the Vapnik–Chervonenkis dimension of H, denoted as VC(H), is defined as the largest integer m such that there exists a sample x^m = {x_1, ..., x_m} of points in X such that the cardinality of the set of boolean vectors S_{x^m}(H) = {[h(x_1), ..., h(x_m)] : h ∈ H} satisfies |S_{x^m}(H)| = 2^m. If m is arbitrarily large then the VC-dimension of H is infinite.

Remark. The quantity max_{x^m} |S_{x^m}(H)|, where the maximum is taken over all possible m-samples, is called the growth function of H.

EXAMPLE. Let H be the class of indicator functions of interval sets on X = R. With a single point x_1 ∈ X we have |{[h(x_1)] : h ∈ H}| = 2. For two points x_1, x_2 ∈ X we have |{[h(x_1), h(x_2)] : h ∈ H}| = 4. When m = 3, for any points x_1, x_2, x_3 ∈ X we have |{[h(x_1), h(x_2), h(x_3)] : h ∈ H}| < 2^3; thus VC(H) = 2.

The main interest in the VC-dimension quantity is due to the following result on a uniform strong law of large numbers, which is a variant of Theorem 6.7 in Vapnik [38].

LEMMA 1 (Uniform SLN for the indicator function class). Let g be any fixed target indicator function and let H be a class of indicator functions of sets in X with VC(H) = d < ∞. Let z^m = {(x_i, g(x_i))}_{i=1}^m be a sample of size m > d consisting of randomly drawn examples according to any fixed probability distribution P on X. Let L_m(h) denote the empirical error for h based on z^m and g as defined in (2). Then for an arbitrary confidence parameter 0 < δ < 1, the deviation between the empirical error and the true error uniformly over H is bounded as

sup_{h∈H} |L(h) − L_m(h)| ≤ 4 √( (d(ln(2m/d) + 1) + ln(9/δ)) / m )

with probability 1 − δ.

Remark. The result actually holds more generally for a boolean random variable y ∈ Y = {0, 1} replacing the deterministic target function g(x). In such a case the sample consists of random pairs {(x_i, y_i)}_{i=1}^m distributed according to any fixed joint probability distribution P over X × Y.

Thus a function class of finite VC-dimension possesses a certain statistical smoothness property which permits simultaneous error estimation over all hypotheses in H using the empirical error estimate. We note in passing that there is an interesting generalization (cf. Buescher and Kumar [7], Devroye et al. [12]) of the empirical error estimate to other smooth estimators based on the idea of empirical coverings, which removes the condition of needing a finite VC-dimension.
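The interval example above can be checked directly by enumeration. The following small Python sketch (an illustration, not code from the paper; the midpoint endpoint grid is an implementation detail) counts the distinct label vectors that interval indicators induce on a point set, confirming that 2 points are shattered while 3 are not.

```python
def interval_labelings(points):
    """Distinct 0/1 label vectors induced on `points` by indicators of intervals [a, b]."""
    pts = sorted(points)
    # Midpoints between consecutive points plus outer values suffice as endpoints.
    grid = [pts[0] - 1.0] + [(pts[i] + pts[i + 1]) / 2 for i in range(len(pts) - 1)] + [pts[-1] + 1.0]
    labelings = set()
    for a in grid:
        for b in grid:
            if a <= b:
                labelings.add(tuple(1 if a <= x <= b else 0 for x in points))
    return labelings

for m in (1, 2, 3):
    points = list(range(m))            # any distinct real points give the same counts
    count = len(interval_labelings(points))
    print(f"m={m}: {count} labelings out of {2**m}")   # prints 2, 4, 7 -> VC dimension 2
```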

As a direct consequence of Lemma 1 we obtain the necessary and sufficient conditions for a target class of indicator functions to be uniformly learnable by a hypothesis class. This is stated next and is a slight variation of Theorem 2.1 in Blumer et al. [6].

LEMMA 2 (Uniform learnability of indicator function class). Let F and H be a target class and a hypothesis class, respectively, of indicator functions of sets in X. Then F is uniformly learnable by H if and only if VC(H) < ∞. Moreover, if VC(H) = d, where d < ∞, then for any 0 < ε, δ < 1, the sample complexity of an algorithm φ is bounded from above by c((d/ε) log(1/δ)), for some absolute constant c > 0.

We proceed now to the case of real-valued functions. The next definition, which generalizes the VC-dimension, is taken from Haussler [16] and is based on the work of Pollard [27]. Let sgn(y) be defined as 1 for y > 0 and −1 for y ≤ 0. For a Euclidean vector v ∈ R^m denote by sgn(v) = [sgn(v_1), ..., sgn(v_m)].

DEFINITION 3 (Pseudo-dimension). Given a class H of real-valued functions defined on X, the pseudo-dimension of H, denoted as dim_p(H), is defined as the largest integer m such that there exist {x_1, ..., x_m} ⊆ X and a vector v ∈ R^m such that the cardinality of the set of boolean vectors satisfies |{sgn[h(x_1) + v_1, ..., h(x_m) + v_m] : h ∈ H}| = 2^m. If m is arbitrarily large then dim_p(H) = ∞.

The next lemma appears as Theorem 4 in Haussler [16] and states that for the case of finite-dimensional vector spaces of functions the pseudo-dimension equals the dimension.

LEMMA 3. Let H be a d-dimensional vector space of functions from a set X into R. Then dim_p(H) = d.

For several useful invariance properties of the pseudo-dimension cf. Pollard [27] and Haussler [16, Theorem 5]. The main interest in the pseudo-dimension arises from having the SLN hold uniformly over a real-valued function class if it has a finite pseudo-dimension. In order to apply this to the PAC framework we need a uniform SLN result not for the hypothesis class H itself but for the induced loss class {l(h(x), y) : h ∈ H, x ∈ X, y ∈ R} for some fixed loss function l, since an ERM-based algorithm minimizes the empirical error, i.e., L_m(h), over H. While the theory presented in this paper applies to general loss functions, we restrict here to the absolute loss l(h(x), g(x)) = |h(x) − g(x)|. The next lemma is a variant of Theorem 7.3 of Vapnik [38].

THEOREM 1. Let P be any probability distribution on X and let g be a fixed target function. Let H be a class of functions from X to R which has pseudo-dimension d ≥ 1, and for any h ∈ H denote by L(h) = E|h(x) − g(x)| and assume L(h) ≤ M for some absolute constant M > 0. Let {(x_i, g(x_i))}_{i=1}^m, x_i ∈ X, be an i.i.d. sample of size m > 16(d + 1) log_2(4(d + 1)) drawn according to P. Then for arbitrary 0 < δ < 1, simultaneously for every function h ∈ H, the inequality

|L(h) − L_m(h)| ≤ 4M √( (16(d + 1) log_2(4(d + 1))(ln(2m) + 1) + ln(9/δ)) / m )   (3)

holds with probability 1 − δ. The theorem is proved in Section A.1.

Remark. For uniform SLN results based on other loss functions see Theorem 8 of Haussler [16].

We may take twice the right-hand side of (3) to be bounded from above by the simpler expression

ε(m, d, δ) ≡ c_1 √( (d log_2 d · ln m + ln(1/δ)) / m )   (4)

for some absolute constant c_1 > 0. Since an ERM algorithm picks a hypothesis ĥ whose empirical error satisfies L_m(ĥ) = inf_{h∈H} L_m(h), and by Definition 1, L(h*) = inf_{h∈H} L(h), it follows that

L(ĥ) ≤ L_m(ĥ) + ε(m, d, δ)/2 ≤ L_m(h*) + ε(m, d, δ)/2 ≤ L(h*) + ε(m, d, δ).   (5)

By (5) and according to Definition 1 it is immediate that ERM may be considered as a PAC learning algorithm for F. Thus we have the following lemma concerning the sufficient condition for uniform learnability of a real-valued function class.

LEMMA 4 (Uniform learnability of real-valued function class). Let F and H be the target and hypothesis classes of real-valued functions, respectively, and let P be any fixed probability distribution on X. Let the loss function be l(g(x), h(x)) = |g(x) − h(x)| and assume L(h) ≤ M for all h ∈ H and g ∈ F, for some absolute constant M > 0. If dim_p(H) < ∞ then F is uniformly learnable by H. Moreover, if dim_p(H) = d < ∞ then for any ε > 0, 0 < δ < 1, the sample complexity of learning F by H is bounded from above by (cM^2 ln^2(d)/ε^2)(ln(M/ε) + ln(1/δ)), for some absolute constant c > 0.

Remarks. As in the last remark above, this result can be extended to other loss functions l. In addition, Alon et al. [4] recently showed that a quantity called the scale-sensitive dimension, which is a generalization of the pseudo-dimension, determines the necessary and sufficient condition for uniform learnability.
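As a quick numerical illustration of (4) and (5), the sketch below (illustrative only; the constant c_1 and the chosen values of m, d, δ are arbitrary assumptions, since the text leaves c_1 unspecified) evaluates the deviation term ε(m, d, δ) and shows how it shrinks with the sample size.

```python
import math

def epsilon(m, d, delta, c1=1.0):
    """Deviation term of (4): c1 * sqrt((d*log2(d)*ln(m) + ln(1/delta)) / m).
    The constant c1 is unspecified in the text; c1 = 1.0 is an arbitrary choice."""
    return c1 * math.sqrt((d * math.log2(d) * math.log(m) + math.log(1.0 / delta)) / m)

d, delta = 8, 0.05
for m in (100, 1_000, 10_000, 100_000):
    eps = epsilon(m, d, delta)
    # By (5), the ERM hypothesis then satisfies L(h_hat) <= L(h*) + eps.
    print(f"m={m:>6}: eps(m, d, delta) = {eps:.3f}")
```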

It is also worth noting that there have been several works related to the pseudo-dimension which are used for mathematical analysis other than learning theory. As far as we are aware, Warren [39] was the earliest who considered a quantity called the number of connected components of a nonlinear manifold of real-valued functions, which closely resembles the growth function of Vapnik and Chervonenkis for set-indicator functions; see Definition 2. Using this he determined lower bounds on the degree of approximation by certain nonlinear manifolds. Maiorov [20] calculated this quantity and determined the degree of approximation for the nonlinear manifold of ridge functions, which includes the manifold of functions represented by artificial neural networks with one hidden layer. Maiorov, Meir, and Ratsaby [21] extended his result to the degree of approximation measured by a probabilistic (n, δ)-width with respect to a uniform measure over the target class and determined finite sample complexity bounds for model selection using neural networks [29]. For more works concerning probabilistic widths of classes see Traub et al. [34], Maiorov and Wasilkowski [22].

Throughout the remainder of the paper we will deal with learning real-valued functions while denoting explicitly a hypothesis class as H_d, one which has dim_p(H_d) = d. For any probability distribution P and target function g, the error and empirical error of a hypothesis h are defined by the L_1(P)-metric as

L(h) = E|h(x) − g(x)|,   L_m(h) = (1/m) Σ_{i=1}^m |h(x_i) − g(x_i)|,   (6)

respectively. We discuss next some practical motivation for our work.

3. MOTIVATION FOR A THEORY OF LEARNING WITH PARTIAL INFORMATION

It was mentioned in Section 1 that the notion of having partial knowledge about a solution to a problem, or more specifically about a target function, is often encountered in practice. Starting from the most elementary instances of learning in humans, it is almost always the case that a learner begins with some partial information about the problem. For instance, in learning cancer diagnosis, a teacher not only provides examples of pictures of healthy cells and benign cells but also descriptive partial information such as "a benign cell has color black and elongated shape," or "benign cells usually appear in clusters." Similarly, for machine learning it is intuitive that partial information must be useful.

While much of the classical theory of pattern recognition (Duda and Hart [13], Fukunaga [14]) and the more recent theory of computational learning (Kearns and Vazirani [18]) and neural networks (Ripley [32]) focus on learning from randomly drawn data, there has been an emergence of interest in nonclassical forms of learning, some of which indicates that partial information, in various forms which depend on the specific application, is useful in practice. This is related to the substream known as active learning, where the learner participates actively by various forms of querying to obtain information from the teacher. For instance, the notion of selective sampling (cf. Cohn et al. [8]) permits the learner to query for samples from domain regions having high classification uncertainty. Cohn [9] uses methods based on the theory of optimal experiment design to select data in an on-line fashion with the aim of decreasing the variance of an estimate. Abu-Mostafa [1–3] refers to partial information as hints and considers them for financial prediction problems. He shows that certain types of hints which reflect invariance properties of the target function g, for instance saying that g(x) = g(x′) at some points x, x′ in the domain, may be incorporated into a learning error criterion.

In this paper we adopt the framework of information-based complexity (cf. Traub et al. [34]) to represent partial information. In this framework, whose basic definitions are reviewed in Section 5, we limit to linear information comprised of n linear functionals L_i(g), 1 ≤ i ≤ n, operating on the target function g. In order to motivate the interest in partial information as being given by such n-dimensional linear operators we give the following example of learning pattern classification using a classical nonparametric discriminant analysis method (cf. Fukunaga [14]).

The field of pattern recognition treats a wide range of practical problems where an accurate decision is to be made concerning a stochastic pattern which is in the form of a multidimensional vector of features of an underlying stochastic information source; for instance, deciding which of a finite number of types of stars corresponds to given image data taken by an exploratory spacecraft, or deciding which of the words in a finite dictionary corresponds to given speech data which consist of spectral analysis information on a sound signal. Such problems have been classically modeled according to a statistical framework where the input data are stochastic and are represented as random variables with a probability distribution over the data space. The most widely used criterion for learning pattern recognition (or classification) is the misclassification probability on randomly chosen data which have not been seen during the training stage of learning. In order to ensure an accurate decision it is necessary to minimize this criterion. The optimal decision rule is one which achieves the minimum possible misclassification probability and has been classically referred to as Bayes' decision rule. We now consider an example of learning pattern recognition using randomly drawn examples, where partial information takes the form of feature extraction.

EXAMPLE (Learning pattern classification). The setting consists of M pattern classes represented by unknown nonparametric class conditional probability density functions f(x|j) over X = R^l with known corresponding a priori class probabilities p_j, 1 ≤ j ≤ M. It is well known that the optimal Bayes classifier, which has the minimal misclassification probability, is defined as follows: g(x) = argmax_{1≤j≤M} {p_j f(x|j)}, where argmax_{j∈A} B_j denotes any element j in A such that B_j ≥ B_i, j ≠ i. Its misclassification probability is called the Bayes error. For instance, suppose that M = 2 and f(x|j), j = 1, 2, are both l-dimensional Gaussian probability density functions. Here the two pattern classes clearly overlap, as their corresponding functions f(x|1) and f(x|2) have an overlapping probability-1 support; thus the optimal Bayes misclassification probability must be greater than zero. The Bayes classifier in this case is an indicator function over a set A = {x ∈ R^l : q(x) > 0}, where q(x) is a second degree polynomial over R^l. We henceforth let the target function, denoted by g(x), be the Bayes classifier and note that it may not be unique. The target class F is defined as a rich class of classifiers, each of which maps X to {1, ..., M}. The training sample consists of m i.i.d. pairs {(x_i, y_i)}_{i=1}^m, where y_i ∈ {1, 2, ..., M} takes the value j with probability p_j, and x_i is drawn according to the probability distribution corresponding to f(x|y_i), 1 ≤ i ≤ m. The learner has a hypothesis class H of classifier functions mapping X to {1, ..., M} which has a finite pseudo-dimension. Formally, the learning problem is to approximate g by a hypothesis h in H. The error of h is defined as L(h) = ‖h − g‖_{L_1(P)}, where P is some fixed probability distribution over X. Stated in the PAC framework, a target class F is to be uniformly learned by H; i.e., for any fixed target g ∈ F and any probability distribution P on X, find an ĥ ∈ H which depends on g and whose error satisfies L(ĥ) ≤ L(h*) + ε with probability 1 − δ, where L(h*) = inf_{h∈H} ‖g − h‖_{L_1(P)}.

As partial information, consider the ubiquitous method of feature extraction, which is described next. In the pattern classification paradigm it is often the case that, based on a given sample {(x_i, y_i)}_{i=1}^m which consists of feature vectors x_i ∈ R^l, 1 ≤ i ≤ m, one obtains a hypothesis classifier ĥ which incurs a large misclassification probability. A natural remedy in such situations is to try to improve the set of features by generating a new feature vector y ∈ Y = R^k, k ≤ l, which depends on x, with the aim of finding a better representation for a pattern which leads to larger separation between the different pattern classes. This in turn leads to a simpler classifier g which can now be better approximated by a hypothesis h in the same class H of pseudo-dimension d, the latter having not been rich enough before for approximating the original target g. Consequently, with the same sample complexity one obtains via ERM a hypothesis ĥ which estimates g better and therefore has a misclassification probability closer to the optimal Bayes misclassification probability.

Restricting to linear mappings A: X → Y, classical discriminant analysis methods (cf. Fukunaga [14, Section 9.2]; Duda and Hart [13, Chap. 4]) calculate the optimal new feature vector y by determining the best linear map A* which, according to one of the widely used criteria, maximizes the pattern class separability. Such criteria are defined by the known class probabilities p_j, the class conditional means μ_j = E(X|j), and the class conditional covariance matrices C_j = E((X − μ_j)(X − μ_j)^T | j), 1 ≤ j ≤ M, where the expectation E(·|j) is taken with respect to the jth class conditional probability distribution corresponding to f(x|j). In reality the empirical average over the sample is used instead of taking the expectation, since the underlying probability distributions corresponding to f(x|j), 1 ≤ j ≤ M, are unknown. Theoretically, the quantities μ_j, C_j may be viewed as partial indirect information about the target Bayes classifier g. Such information can be represented by an n-dimensional vector of linear functionals acting on f(x|j), 1 ≤ j ≤ M, i.e.,

N([f(·|1), ..., f(·|M)]) = [{μ_{j,s}}_{j=1,...,M; s=1,...,l}, {σ^j_{s,r}}_{j=1,...,M; 1≤s≤r≤l}],

where μ_{j,s} = ∫_X x_s f(x|j) dx and σ^j_{s,r} = ∫_X x_s x_r f(x|j) dx, and where x_r, x_s, 1 ≤ r, s ≤ l, are components of x. The dimensionality of the information vector is n = (Ml/2)(l + 3).
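To make the feature-extraction example concrete, here is a small Python sketch (illustrative, not from the paper; the Gaussian data and all numeric values are arbitrary assumptions) that computes the empirical counterparts of the information vector above, i.e., the per-class sample means μ̂_{j,s} and second moments σ̂^j_{s,r} for s ≤ r, from a labeled sample; its length matches n = (Ml/2)(l + 3).

```python
import numpy as np

def empirical_information_vector(X, y, num_classes):
    """Empirical version of N([f(.|1), ..., f(.|M)]): per-class means and
    second moments x_s * x_r (s <= r), stacked into one vector of length
    M*l*(l+3)/2."""
    l = X.shape[1]
    pieces = []
    for j in range(num_classes):
        Xj = X[y == j]
        mean_j = Xj.mean(axis=0)                                    # mu_{j,s}, s = 1..l
        second_j = (Xj[:, :, None] * Xj[:, None, :]).mean(axis=0)   # empirical E[x_s x_r | j]
        upper = second_j[np.triu_indices(l)]                        # keep s <= r only
        pieces.append(np.concatenate([mean_j, upper]))
    return np.concatenate(pieces)

# Illustrative data: M = 2 Gaussian classes in R^3 (all numbers arbitrary).
rng = np.random.default_rng(0)
M, l, m = 2, 3, 500
y = rng.integers(0, M, size=m)
X = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(m, l))

info = empirical_information_vector(X, y, M)
print(len(info), M * l * (l + 3) // 2)   # both equal 18 for M = 2, l = 3
```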

We have so far presented the theory for learning from examples and introduced the importance of partial information from a practical perspective. Before we proceed with a theoretical treatment of learning with partial information, we digress momentarily to introduce a new quantity which is defined in the context of the mathematical field of approximation theory and which plays an important part in our learning framework.

4. A NEW NONLINEAR APPROXIMATION WIDTH

The large mathematical field of approximation theory is primarily involved in problems of existence, uniqueness, and characterization of the best approximation to elements of a normed linear space B by various types of finite-dimensional subspaces B_n of B (cf. Pinkus [25]). Approximation of an element f ∈ B is measured by the distance of the finite-dimensional subspace B_n to f, where distance is usually defined as inf_{g∈B_n} ‖f − g‖, and where throughout this discussion ‖·‖ is any well-defined norm over B. The degree of approximation of a subset (possibly a nonlinear manifold) F ⊆ B by B_n is defined by the distance between F and B_n, which is usually taken as sup_{f∈F} inf_{g∈B_n} ‖f − g‖. The Kolmogorov n-width is the classical distance definition when one allows the approximating set B_n to vary over all possible linear subspaces of B. It is defined as K_n(F; B) = inf_{B_n} sup_{f∈F} inf_{g∈B_n} ‖f − g‖. This definition leads to the notion of the best approximating subspace B_n, i.e., the one whose distance from F equals K_n(F; B).

While linear approximation, e.g., using finite-dimensional subspaces of polynomials, is important and useful, there are many known spaces which can be approximated better by nonlinear subspaces, for instance, by the span of a neural-network basis H = {h(x) = Σ_{i=1}^n c_i σ(w_i^T x − b_i) : w_i ∈ R^l, c_i, b_i ∈ R, 1 ≤ i ≤ n}, where σ(y) = 1/(1 + e^{−y}). In this brief overview we will follow the notation and definitions of DeVore [10]. Let M_n be a mapping from R^n into the Banach space B which associates with each a ∈ R^n the element M_n(a) ∈ B. Functions f ∈ B are approximated by functions in the manifold H_n = {M_n(a) : a ∈ R^n}. The measure of approximation of f by H_n is naturally defined as the distance inf_{a∈R^n} ‖f − M_n(a)‖. As above, the degree of approximation of a subset F of B by H_n is defined as sup_{f∈F} inf_{a∈R^n} ‖f − M_n(a)‖. In analogy to the Kolmogorov n-width, it would be tempting to define the optimal approximation error of F by manifolds of finite dimension n as inf_{M_n} sup_{f∈F} inf_{a∈R^n} ‖f − M_n(a)‖. However, as pointed out in [10], this width is zero for all subsets F in every separable class B. To see this, consider the following example, which describes a space-filling manifold: let {f_k}_{k=−∞}^{∞} be dense in B and define M_1(a) = (a − k) f_{k+1} + (k + 1 − a) f_k for k ≤ a ≤ k + 1. The mapping M_1 : R → B is continuous, with a corresponding one-dimensional manifold H_1 satisfying sup_{f∈F} inf_{a∈R} ‖f − M_1(a)‖ = 0. Thus this measure of width of F is not natural. One possible alternative used in approximation theory is to impose a smoothness constraint on the nonlinear manifolds H_n that are allowed in the outermost infimum. However, this excludes some interesting manifolds, such as splines with free knots. A more useful constraint is to limit the selection operator r, which takes an element f ∈ F to R^n, to be continuous. Given such an operator r, the approximation of f by a manifold H_n is M_n(r(f)). The distance between the set F and the manifold H_n is then defined as sup_{f∈F} ‖f − M_n(r(f))‖. The continuous nonlinear n-width of F is then defined as D_n(F; B) = inf_{r: cont., M_n} sup_{f∈F} ‖f − M_n(r(f))‖, where the infimum is taken over all continuous selection operators r and all manifolds H_n. This width was considered by Alexandrov [33] and DeVore [10] and is determined for various F and B in [10].

The Alexandrov nonlinear width does not in general reflect the degree of approximation of the more natural selection operator r which chooses the best approximation for an f ∈ F as its closest element in H_n, i.e., that whose distance from f equals inf_{g∈H_n} ‖f − g‖, the reason being that such an r is not necessarily continuous. In this paper we consider an interesting alternate definition for a nonlinear width of a function class which does not have this deficiency. Based on the pseudo-dimension (Definition 3 in Section 2) we define the nonlinear width

ρ_d(F) ≡ inf_{H_d} sup_{f∈F} inf_{h∈H_d} ‖f − h‖,   (7)

where H_d runs over all classes (not necessarily in F) having pseudo-dimension d. Now the natural selection operator is used, namely, the one which approximates f by an element h(f) ∈ H_d, where ‖f − h(f)‖ = inf_{h∈H_d} ‖f − h‖. The constraint of using finite pseudo-dimensional approximation manifolds allows dropping the smoothness constraint on the manifold and the continuity constraint on the selection operator. The width ρ_d expresses the ability of manifolds to approximate according to their pseudo-dimension, as opposed to their dimensionality as in some of the classical widths. The reason that ρ_d is interesting from a learning theoretic aspect is that the constraint on the approximation manifold involves the pseudo-dimension dim_p(H_d), which was shown in Section 2 to have a direct effect on uniform learnability, namely, a finite pseudo-dimension guarantees consistent estimation. Thus ρ_d involves two independent mathematical notions, namely, the approximation ability and the statistical estimation ability of H_d. As will be shown in the next sections, joining both notions in one quantity enables us to quantify the trade-off between information and sample complexity as applied to the learning paradigm. We halt the discussion about ρ_d and refer the interested reader to [23], where we estimate it for a standard Sobolev class W_p^{r,l}, 1 ≤ p, q ≤ ∞.
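For reference, the three widths discussed in this section can be written side by side (a restatement in LaTeX for readability, using the notation reconstructed above):

```latex
\begin{align*}
\text{Kolmogorov width:} \quad
  & K_n(F;B) = \inf_{B_n}\; \sup_{f \in F}\; \inf_{g \in B_n} \|f - g\|,
  && B_n \text{ over linear subspaces of dimension at most } n,\\
\text{Alexandrov width:} \quad
  & D_n(F;B) = \inf_{r \text{ cont.},\, M_n}\; \sup_{f \in F} \|f - M_n(r(f))\|,
  && M_n:\mathbb{R}^n \to B,\ r:F \to \mathbb{R}^n \text{ continuous},\\
\text{Nonlinear width (7):} \quad
  & \rho_d(F) = \inf_{\mathcal{H}_d}\; \sup_{f \in F}\; \inf_{h \in \mathcal{H}_d} \|f - h\|,
  && \mathcal{H}_d \text{ over classes with } \dim_p(\mathcal{H}_d) = d.
\end{align*}
```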

5. THE MINIMAL PARTIAL INFORMATION ERROR

In this section we review some basic concepts in the field of information-based complexity and then extend these to define a new quantity called the minimal partial information error, which is later used in the learning framework. Throughout this section, ‖·‖ denotes any function norm, and the distance between two function classes F_1 and F_2 is denoted as dist(F_1, F_2, L_q) = sup_{f∈F_1} inf_{g∈F_2} ‖f − g‖_{L_q}, q ≥ 1. The following formulation of partial information is taken from Traub et al. [34]. While we limit here to the case of approximating functions f, we note that the theory is suitable for problems of approximating general functionals S(f). Let N_n : F → N_n(F) ⊆ R^n denote a general information operator. The information N_n(g) consists of n measurements taken on the target function g or, in general, any function f ∈ F; i.e., N_n(f) = [L_1(f), ..., L_n(f)], where L_i, 1 ≤ i ≤ n, denote any functionals. We call n the cardinality of information, and we sometimes omit n and write N(f). The variable y denotes an element in N_n(F). The subset N_n^{-1}(y) denotes all functions f ∈ F which share the same information vector y, i.e., N_n^{-1}(y) = {f ∈ F : N_n(f) = y}. We denote by N_n^{-1}(N_n(g)) the solution set, which may also be written as {f ∈ F : N_n(f) = N_n(g)} and which consists of all indistinguishable functions f having the same information vector as the target g. Given y ∈ R^n, it is assumed that a single element, denoted as g_y ∈ N_n^{-1}(y), can be constructed. In this model, information effectively partitions the target class F into infinitely many subsets N_n^{-1}(y), y ∈ R^n, each having a single representative g_y which forms the approximation for any f ∈ N^{-1}(y). Denote the radius of N^{-1}(y) by

r(N, y) = inf_{f̃} sup_{f ∈ N^{-1}(y)} ‖f̃ − f‖   (8)

and call it the local radius of information N at y. The global radius of information N is defined as the local radius for a worst y, i.e.,

r(N) = sup_{y ∈ N(F)} r(N, y).

This quantity measures the intrinsic uncertainty or error which is associated with a fixed information operator N. Note that in both of these definitions the dependence on F is implicit. Let Λ be a family of functionals and consider the family Λ_n which consists of all information N = [L_1, ..., L_k] of cardinality k ≤ n with L_i ∈ Λ, 1 ≤ i ≤ n. Then

r(n, Λ) = inf_{N ∈ Λ_n} r(N)

is called the nth minimal radius of information in the family Λ, and N_n* = [L_1*, ..., L_n*] is called the nth optimal information in the class Λ iff L_i* ∈ Λ and r(N_n*) = r(n, Λ). When Λ is the family of all linear functionals, then r(n, Λ) becomes a slight generalization of the well-known Gelfand width of the class F, whose classical definition is d^n(F) = inf_{A^n} sup_{f ∈ F ∩ A^n} ‖f‖, where A^n is any linear subspace of codimension n. In this paper we restrict to the family Λ of linear functionals, and for notational simplicity we will henceforth take the information space N_n(F) = R^n.

As already mentioned in the definition of r(N, y), there is a single element g_y, not necessarily in N^{-1}(y), which is selected as an approximator for all functions f ∈ N^{-1}(y). Such a definition is useful for the problem of information-based complexity since all that one is concerned with is to produce an ε-approximation based on information alone. In the PAC framework, however, a major significance is placed on providing an approximator to a target g which is an element not necessarily of the target class F but of some hypothesis class H_d of finite pseudo-dimension d by which F is uniformly learnable. We therefore replace the single representative of the subset N^{-1}(y) by a whole approximation class of functions H_d^y of pseudo-dimension d.
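The local and global radii of (8) can be computed exactly in a toy setting. The sketch below is purely illustrative and not from the paper: it assumes a finite class of functions sampled on a grid, coordinate-evaluation functionals as the information operator, and the sup norm, under which the optimal center of a solution set is its coordinate-wise midpoint.

```python
import numpy as np

def local_radius(cls, info_idx, y):
    """r(N, y) of (8) for a finite class of grid-sampled functions under the
    sup norm: the Chebyshev radius of the solution set N^{-1}(y)."""
    members = [f for f in cls if np.array_equal(f[info_idx], y)]
    stack = np.stack(members)
    # For the sup norm the best center is the coordinate-wise midpoint, so the
    # radius is half the largest coordinate-wise spread over the solution set.
    return 0.5 * (stack.max(axis=0) - stack.min(axis=0)).max()

def global_radius(cls, info_idx):
    """r(N) = sup_y r(N, y) over all observed information vectors."""
    ys = {tuple(f[info_idx]) for f in cls}
    return max(local_radius(cls, info_idx, np.array(y)) for y in ys)

# Toy class: 20 functions on an 8-point grid sharing one of 3 prefixes, so that
# several functions are indistinguishable under the n = 3 measured values.
rng = np.random.default_rng(1)
prefixes = [rng.uniform(-1, 1, size=3).round(1) for _ in range(3)]
cls = [np.concatenate([prefixes[k % 3], rng.uniform(-1, 1, size=5)]) for k in range(20)]
info_idx = np.arange(3)
print("global radius r(N):", global_radius(cls, info_idx))
```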

Note that now information alone does not point to a single ε-approximation element, but rather to a manifold H_d^y, possibly nonlinear, which for any f ∈ N^{-1}(y), in particular the target g, contains an element h*, dependent on g, such that the distance ‖g − h*‖ ≤ ε. Having a pseudo-dimension d implies that with a finite random sample {(x_i, g(x_i))}_{i=1}^m, an ERM learning algorithm (after being shown partial information and hence pointed to the class H_d^y) can determine a function ĥ ∈ H_d^y whose distance from g is no farther than ε from the distance between h* and g, with confidence 1 − δ. Thus, based on n units of information about g and m labeled examples {(x_i, g(x_i))}_{i=1}^m, an element ĥ can be found such that ‖g − ĥ‖ ≤ 2ε with probability 1 − δ. The sample complexity m does not depend on the type of hypothesis class but only on its pseudo-dimension. Thus the above construction is true for any hypothesis class (or manifold) of pseudo-dimension d. Hence we may permit any hypothesis class H_d of pseudo-dimension d to play the role of the approximation manifold H_d^y of the subset N^{-1}(y). This amounts to replacing the infimum in the definition (8) of r(N, y) by an infimum over H_d and replacing ‖f̃ − f‖ by dist(f, H_d) = inf_{h∈H_d} ‖f − h‖, yielding the quantity ρ_d(N^{-1}(y)) as a new definition for a local radius, and a new quantity I_{n,d}(F) (to be defined later) which replaces r(n, Λ). We next formalize these ideas through a sequence of definitions. We use ρ_d(K, L_q) to explicitly denote the norm L_q used in the definition (7). We now define three optimal quantities, N_n*, H_d^{y*}, and h*, all of which implicitly depend on the unknown distribution P, while h* depends also on the unknown target g.

DEFINITION 4. Let the optimal linear information operator N_n* of cardinality n be one which minimizes the approximation error of the solution set N_n^{-1}(y) (in the worst case over y ∈ R^n) over all linear operators N_n of cardinality n and manifolds of pseudo-dimension d. Formally, it is defined as one which satisfies

sup_{y∈R^n} ρ_d((N_n*)^{-1}(y), L_1(P)) = inf_{N_n} sup_{y∈R^n} ρ_d(N_n^{-1}(y), L_1(P)).

DEFINITION 5. For a fixed optimal linear information operator N_n* of cardinality n, define the optimal hypothesis class H_d^{y*} of pseudo-dimension d (which depends implicitly on N_n* through y) as one which minimizes the approximation error of the solution set (N_n*)^{-1}(y) over all manifolds of pseudo-dimension d. Formally, it is defined as one which satisfies

dist((N_n*)^{-1}(y), H_d^{y*}, L_1(P)) = ρ_d((N_n*)^{-1}(y), L_1(P)).

DEFINITION 6. For a fixed target g, optimal linear information operator N_n*, and the optimal hypothesis class H_d^{y*} with y = N_n*(g), define the optimal hypothesis h* to be any function which minimizes the error over this class, namely,

L(h*) = inf_{h ∈ H_d^{y*}, y = N_n*(g)} L(h).   (9)

As mentioned earlier, the main motive of the paper is to compute the value of partial information for learning in the PAC sense. We will assume that the teacher has access to unlimited (linear) information, which is represented by him knowing the optimal linear information operator N_n* and the optimal hypothesis class H_d^{y*} for every y ∈ R^n. Thus in this ideal setting, providing partial information amounts to pointing to the optimal hypothesis class H_d^{y*}, y = N_n*(g), which contains an optimal hypothesis h*. We again note that information alone does not point to h*; it is the role of learning from examples to complete the process by estimating h* using a hypothesis ĥ. The error of h* is important in its own right. It represents the minimal error for learning a particular target g given optimal information of cardinality n. In line with the notion of uniform learnability (see Section 2) we define a variant of this optimal quantity which is independent of the target g and the probability distribution P; i.e., instead of a specific target g, we consider the worst target in F, and we use the L_∞ norm for approximation. This yields the following definition.

DEFINITION 7 (Minimal partial information error). For any target class F and any integers n, d ≥ 1, let

I_{n,d}(F) ≡ inf_{N_n} sup_{y∈R^n} ρ_d(N_n^{-1}(y), L_∞),

where N_n runs over all linear information operators. I_{n,d}(F) represents the minimal error for learning the worst-case target in F in the PAC sense (i.e., assuming an unknown underlying probability distribution) while given optimal information of cardinality n and using an optimal hypothesis class of pseudo-dimension d. We proceed next to unify the theory of Section 2 with the concepts introduced in the current section.

6. LEARNING FROM EXAMPLES WITH OPTIMAL PARTIAL INFORMATION

In Section 2 we reviewed the notion of uniform learnability of a target class F by a hypothesis class H_d of pseudo-dimension d < ∞. By minimizing an empirical error based on the random sample, a learner obtains a hypothesis ĥ which provides a close approximation of the optimal hypothesis h*, to within ε accuracy, with confidence 1 − δ. Suppose that prior to learning the learner obtains optimal information N_n*(g) about g. This effectively points the learner to a class H_d^{y*}, y = N_n*(g), which contains a hypothesis h* as defined in (9).
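The chain of optimal quantities just defined can be summarized in one display (a restatement of Definitions 4–7 in LaTeX, using the notation reconstructed above):

```latex
\begin{align*}
N_n^{*} &\in \arg\min_{N_n}\; \sup_{y \in \mathbb{R}^n} \rho_d\!\left(N_n^{-1}(y), L_1(P)\right)
  && \text{(optimal linear information, Def.\ 4)}\\
\mathcal{H}_d^{y*} &\in \arg\min_{\mathcal{H}_d}\; \operatorname{dist}\!\left(\left(N_n^{*}\right)^{-1}(y), \mathcal{H}_d, L_1(P)\right)
  && \text{(optimal hypothesis class, Def.\ 5)}\\
h^{*} &\in \arg\min_{h \in \mathcal{H}_d^{y*},\; y = N_n^{*}(g)} L(h)
  && \text{(optimal hypothesis, Def.\ 6)}\\
I_{n,d}(F) &\equiv \inf_{N_n}\; \sup_{y \in \mathbb{R}^n} \rho_d\!\left(N_n^{-1}(y), L_\infty\right)
  && \text{(minimal partial information error, Def.\ 7)}
\end{align*}
```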

The error of h* is bounded from above as

L(h*) = inf_{h ∈ H_d^{y*}, y = N_n*(g)} L(h)   (10)
      = inf_{h ∈ H_d^{y*}, y = N_n*(g)} ‖g − h‖_{L_1(P)}   (11)
      ≤ sup_{f : N_n*(f) = N_n*(g)} inf_{h ∈ H_d^{y*}, y = N_n*(g)} ‖f − h‖_{L_1(P)}   (12)
      = dist((N_n*)^{-1}(N_n*(g)), H_d^{y*}, L_1(P)),  y = N_n*(g).   (13)

By Definition 5 this equals ρ_d((N_n*)^{-1}(N_n*(g)), L_1(P)) and is bounded from above by

sup_{y∈R^n} ρ_d((N_n*)^{-1}(y), L_1(P)).

The latter equals

inf_{N_n} sup_{y∈R^n} ρ_d(N_n^{-1}(y), L_1(P))

by Definition 4. This is bounded from above by inf_{N_n} sup_{y∈R^n} ρ_d(N_n^{-1}(y), L_∞), which from Definition 7 is I_{n,d}(F).

Subsequently, the teacher provides m i.i.d. examples {(x_i, g(x_i))}_{i=1}^m randomly drawn according to any probability distribution P on X. Armed with prior knowledge and a random sample, the learner then minimizes the empirical error L_m(h) over all h ∈ H_d^{y*}, y = N_n*(g), yielding an estimate ĥ of h*. We may break up the error L(ĥ) into a learning error component and a minimal partial information error component,

L(ĥ) = (L(ĥ) − L(h*)) + L(h*) ≤ ε(m, d, δ) + I_{n,d}(F),   (14)

where the first term, the learning error defined in (4), measures the extra error incurred by using ĥ as opposed to the optimal hypothesis h*, and the second term is the minimal partial information error. The important difference from the PAC model can be seen in comparing the upper bound of (14) with that of (5). The former depends not only on the sample size m and the pseudo-dimension d but also on the amount n of partial information. To see how m, n, and d influence the performance, i.e., the error of ĥ, we will next particularize to a specific target class.

7. SOBOLEV TARGET CLASS

The preceding theory is now applied to the problem of learning a target in a Sobolev class F = W^{r,l}(M), for r, l ∈ Z_+, M > 0, which is defined as the set of all functions over X = [0, 1]^l having all partial derivatives up to order r bounded in the L_∞ norm by M. Formally, let k = [k_1, ..., k_l] ∈ Z_+^l, |k| = Σ_{i=1}^l k_i, and denote by D^k f = ∂^{k_1+···+k_l} f / (∂x_1^{k_1} ··· ∂x_l^{k_l}); then

W^{r,l}(M) = {f : sup_{x∈[0,1]^l} |D^k f(x)| ≤ M, |k| ≤ r},

which henceforth is referred to as W^{r,l} or F. We now state the main results and their implications.

THEOREM 2. Let F = W^{r,l}, let n ≥ 1, d ≥ 1 be given integers, and let c_2 > 0 be a constant independent of n and d. Then

I_{n,d}(F) ≤ c_2 (n + d)^{−r/l}.

The proof of the theorem is in Section A.2.

THEOREM 3. Let the target class F = W^{r,l} and let g ∈ F be the unknown target function. Given an i.i.d. random sample {(x_i, g(x_i))}_{i=1}^m of size m drawn according to any unknown distribution P on X, and given an optimal partial information vector N_n*(g) consisting of n linear operations on g, for any d ≥ 1 let H_d^{y*}, y = N_n*(g), be the optimal hypothesis class of pseudo-dimension d. Let ĥ be the output hypothesis obtained from running empirical error minimization over H_d^{y*}. Then for an arbitrary 0 < δ < 1, the error of ĥ is bounded as

L(ĥ) ≤ c_1 √( (d log_2 d · ln m + ln(1/δ)) / m ) + c_2 / (n + d)^{r/l},   (15)

where c_1, c_2 > 0 are constants independent of m, n, and d.

The proof of Theorem 3 is based on Theorem 1 and Theorem 2, both of which are proved in the Appendix. We now discuss several dependences and trade-offs between the three complexity variables m, n, and d. First, for a fixed sample size m and fixed information cardinality n there is an optimal class complexity

d* ≅ c_3 ( (rm / (l ln m))^{2l/(l+2r)} − n ),   (16)

which minimizes the upper bound on the error, where c_3 > 0 is an absolute constant. The complexity d is a free parameter in our learning setting and is proportional to the degree to which the estimator ĥ fits the data while estimating the optimal hypothesis h*.
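The optimal complexity d* of (16) can be illustrated numerically. The sketch below is purely illustrative: the constants c_1, c_2 and the values of m, n, r, l are arbitrary assumptions (the theorem does not specify them), and the bound evaluated is the reconstruction of (15) above.

```python
import math

def error_bound(m, n, d, r, l, delta=0.05, c1=1.0, c2=1.0):
    """Upper bound (15): learning error plus minimal partial information error.
    The constants c1, c2 are unspecified in the text and set to 1 here."""
    learning = c1 * math.sqrt((d * math.log2(d) * math.log(m) + math.log(1 / delta)) / m)
    partial_info = c2 / (n + d) ** (r / l)
    return learning + partial_info

# Arbitrary illustrative setting.
m, n, r, l = 100_000, 5, 1, 3
d_star = min(range(2, 1_000), key=lambda d: error_bound(m, n, d, r, l))
print("optimal complexity d* =", d_star,
      "with bound", round(error_bound(m, n, d_star, r, l), 4))
```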

The result suggests that for a given sample size m and partial information cardinality n, there is an optimal estimator (or model) complexity d* which minimizes the error rate. Thus if a structure of hypothesis classes {H_d}_{d≥1} is available in the learning problem, then based on fixed m and n the best choice of a hypothesis class over which the learner should run empirical error minimization is H_{d*}, with d* as in (16). The notion of having an optimal complexity d* is closely related to statistical model selection (cf. Linhart and Zucchini [19], Devroye et al. [12], Ratsaby et al. [29]). For instance, in Vapnik's structural risk minimization (SRM) criterion [38] the trade-off is between m and d. For a fixed m, it is possible to calculate the optimal complexity d* of a hypothesis class in a nested class structure H_1 ⊆ H_2 ⊆ ··· by minimizing an upper bound on the error, L(ĥ) ≤ L_m(ĥ) + ε(m, d, δ), over all d ≥ 1. The second term ε(m, d, δ) is commonly referred to as the penalty for data-overfitting, which one wants to balance against the empirical error. Similarly, in our result, the upper bound on the learning error reflects the cost or penalty of overfitting the data: the larger d, the higher the degree of data fit and the larger the penalty. However, here, as opposed to SRM, the bound is independent of the random sample and there is an extra parameter n that affects how m and d trade off. As seen from (16), for a fixed sample size m it follows that the larger n, the smaller d*. This is intuitive since the more partial information, the smaller the solution set N_n^{-1}(N_n(g)) and the lower the complexity of a hypothesis class needed to approximate it. Consequently, the optimal estimator ĥ belongs to a simpler hypothesis class and does not overfit the data as much.

We next compute the trade-off between n and m. Assuming d is fixed (not necessarily at the optimal value d*) and fixing the total available information and sample size, m + n, at some constant value while minimizing the upper bound on L(ĥ) over m and n, we obtain

m ≅ c_5 n^{(l+2r)/(2l)} ln n

for a constant c_5 > 0 which depends polynomially only on l and r. We conclude that when the dimensionality l of X is smaller than twice the smoothness parameter r, the sample size m grows polynomially in n at a rate no larger than n^{(1+r)/l}; i.e., partial information about the target g is worth approximately a polynomial number of examples. For l > 2r, n grows polynomially in m at a rate no larger than m^2/ln m; i.e., information obtained from examples is worth a polynomial amount of partial information.

We have focused so far on dealing with the ideal learning scenario in which the teacher has access to the optimal information operator N_n* and the optimal hypothesis class H_d^{y*}. The use of such optimally efficient information was required from an information theoretic point of view in order to calculate the trade-off between the sample complexity m and information cardinality n. But we have not specified the form of such optimal information and hypothesis class. In the next result we state a lower bound on the minimal partial information error I_{n,d}(F) and subsequently show that there exists an operator and a hypothesis class which almost achieve this lower bound.
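The m versus n trade-off can also be seen numerically. The sketch below is illustrative only: the constants and parameter values are arbitrary assumptions, and the bound is again the reconstruction of (15) above; it fixes a total budget m + n and searches for the split that minimizes the upper bound at a fixed d.

```python
import math

def error_bound(m, n, d, r, l, delta=0.05, c1=1.0, c2=1.0):
    """Upper bound (15), as in the previous sketch, with arbitrary c1 = c2 = 1."""
    learning = c1 * math.sqrt((d * math.log2(d) * math.log(m) + math.log(1 / delta)) / m)
    return learning + c2 / (n + d) ** (r / l)

d, r, l = 10, 1, 3
for budget in (1_000, 10_000, 100_000):
    # Split the fixed budget m + n between examples and information units.
    best_n = min(range(1, budget - 1), key=lambda n: error_bound(budget - n, n, d, r, l))
    print(f"budget m+n={budget:>6}: best n={best_n:>5}, m={budget - best_n}")
```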


More information

Some Examples. Uniform motion. Poisson processes on the real line

Some Examples. Uniform motion. Poisson processes on the real line Some Examples Our immeiate goal is to see some examples of Lévy processes, an/or infinitely-ivisible laws on. Uniform motion Choose an fix a nonranom an efine X := for all (1) Then, {X } is a [nonranom]

More information

Linear First-Order Equations

Linear First-Order Equations 5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)

More information

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics This moule is part of the Memobust Hanbook on Methoology of Moern Business Statistics 26 March 2014 Metho: Balance Sampling for Multi-Way Stratification Contents General section... 3 1. Summary... 3 2.

More information

Calculus and optimization

Calculus and optimization Calculus an optimization These notes essentially correspon to mathematical appenix 2 in the text. 1 Functions of a single variable Now that we have e ne functions we turn our attention to calculus. A function

More information

Lower bounds on Locality Sensitive Hashing

Lower bounds on Locality Sensitive Hashing Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,

More information

Convergence of Random Walks

Convergence of Random Walks Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of

More information

Implicit Differentiation

Implicit Differentiation Implicit Differentiation Thus far, the functions we have been concerne with have been efine explicitly. A function is efine explicitly if the output is given irectly in terms of the input. For instance,

More information

Equilibrium in Queues Under Unknown Service Times and Service Value

Equilibrium in Queues Under Unknown Service Times and Service Value University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University

More information

Logarithmic spurious regressions

Logarithmic spurious regressions Logarithmic spurious regressions Robert M. e Jong Michigan State University February 5, 22 Abstract Spurious regressions, i.e. regressions in which an integrate process is regresse on another integrate

More information

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments 2 Conference on Information Sciences an Systems, The Johns Hopkins University, March 2, 2 Time-of-Arrival Estimation in Non-Line-Of-Sight Environments Sinan Gezici, Hisashi Kobayashi an H. Vincent Poor

More information

Ramsey numbers of some bipartite graphs versus complete graphs

Ramsey numbers of some bipartite graphs versus complete graphs Ramsey numbers of some bipartite graphs versus complete graphs Tao Jiang, Michael Salerno Miami University, Oxfor, OH 45056, USA Abstract. The Ramsey number r(h, K n ) is the smallest positive integer

More information

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions Working Paper 2013:5 Department of Statistics Computing Exact Confience Coefficients of Simultaneous Confience Intervals for Multinomial Proportions an their Functions Shaobo Jin Working Paper 2013:5

More information

PDE Notes, Lecture #11

PDE Notes, Lecture #11 PDE Notes, Lecture # from Professor Jalal Shatah s Lectures Febuary 9th, 2009 Sobolev Spaces Recall that for u L loc we can efine the weak erivative Du by Du, φ := udφ φ C0 If v L loc such that Du, φ =

More information

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k A Proof of Lemma 2 B Proof of Lemma 3 Proof: Since the support of LL istributions is R, two such istributions are equivalent absolutely continuous with respect to each other an the ivergence is well-efine

More information

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process

More information

Polynomial Inclusion Functions

Polynomial Inclusion Functions Polynomial Inclusion Functions E. e Weert, E. van Kampen, Q. P. Chu, an J. A. Muler Delft University of Technology, Faculty of Aerospace Engineering, Control an Simulation Division E.eWeert@TUDelft.nl

More information

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation Tutorial on Maximum Likelyhoo Estimation: Parametric Density Estimation Suhir B Kylasa 03/13/2014 1 Motivation Suppose one wishes to etermine just how biase an unfair coin is. Call the probability of tossing

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,

More information

Schrödinger s equation.

Schrödinger s equation. Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of

More information

Discrete Mathematics

Discrete Mathematics Discrete Mathematics 309 (009) 86 869 Contents lists available at ScienceDirect Discrete Mathematics journal homepage: wwwelseviercom/locate/isc Profile vectors in the lattice of subspaces Dániel Gerbner

More information

1. Aufgabenblatt zur Vorlesung Probability Theory

1. Aufgabenblatt zur Vorlesung Probability Theory 24.10.17 1. Aufgabenblatt zur Vorlesung By (Ω, A, P ) we always enote the unerlying probability space, unless state otherwise. 1. Let r > 0, an efine f(x) = 1 [0, [ (x) exp( r x), x R. a) Show that p f

More information

arxiv: v2 [cs.ds] 11 May 2016

arxiv: v2 [cs.ds] 11 May 2016 Optimizing Star-Convex Functions Jasper C.H. Lee Paul Valiant arxiv:5.04466v2 [cs.ds] May 206 Department of Computer Science Brown University {jasperchlee,paul_valiant}@brown.eu May 3, 206 Abstract We

More information

On conditional moments of high-dimensional random vectors given lower-dimensional projections

On conditional moments of high-dimensional random vectors given lower-dimensional projections Submitte to the Bernoulli arxiv:1405.2183v2 [math.st] 6 Sep 2016 On conitional moments of high-imensional ranom vectors given lower-imensional projections LUKAS STEINBERGER an HANNES LEEB Department of

More information

Necessary and Sufficient Conditions for Sketched Subspace Clustering

Necessary and Sufficient Conditions for Sketched Subspace Clustering Necessary an Sufficient Conitions for Sketche Subspace Clustering Daniel Pimentel-Alarcón, Laura Balzano 2, Robert Nowak University of Wisconsin-Maison, 2 University of Michigan-Ann Arbor Abstract This

More information

05 The Continuum Limit and the Wave Equation

05 The Continuum Limit and the Wave Equation Utah State University DigitalCommons@USU Founations of Wave Phenomena Physics, Department of 1-1-2004 05 The Continuum Limit an the Wave Equation Charles G. Torre Department of Physics, Utah State University,

More information

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite

More information

Introduction to the Vlasov-Poisson system

Introduction to the Vlasov-Poisson system Introuction to the Vlasov-Poisson system Simone Calogero 1 The Vlasov equation Consier a particle with mass m > 0. Let x(t) R 3 enote the position of the particle at time t R an v(t) = ẋ(t) = x(t)/t its

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

arxiv: v2 [physics.data-an] 5 Jul 2012

arxiv: v2 [physics.data-an] 5 Jul 2012 Submitte to the Annals of Statistics OPTIMAL TAGET PLAE RECOVERY ROM OISY MAIOLD SAMPLES arxiv:.460v physics.ata-an] 5 Jul 0 By Daniel. Kaslovsky an rançois G. Meyer University of Colorao, Bouler Constructing

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

Bayesian Estimation of the Entropy of the Multivariate Gaussian

Bayesian Estimation of the Entropy of the Multivariate Gaussian Bayesian Estimation of the Entropy of the Multivariate Gaussian Santosh Srivastava Fre Hutchinson Cancer Research Center Seattle, WA 989, USA Email: ssrivast@fhcrc.org Maya R. Gupta Department of Electrical

More information

Iterated Point-Line Configurations Grow Doubly-Exponentially

Iterated Point-Line Configurations Grow Doubly-Exponentially Iterate Point-Line Configurations Grow Doubly-Exponentially Joshua Cooper an Mark Walters July 9, 008 Abstract Begin with a set of four points in the real plane in general position. A to this collection

More information

Slide10 Haykin Chapter 14: Neurodynamics (3rd Ed. Chapter 13)

Slide10 Haykin Chapter 14: Neurodynamics (3rd Ed. Chapter 13) Slie10 Haykin Chapter 14: Neuroynamics (3r E. Chapter 13) CPSC 636-600 Instructor: Yoonsuck Choe Spring 2012 Neural Networks with Temporal Behavior Inclusion of feeback gives temporal characteristics to

More information

IPA Derivatives for Make-to-Stock Production-Inventory Systems With Backorders Under the (R,r) Policy

IPA Derivatives for Make-to-Stock Production-Inventory Systems With Backorders Under the (R,r) Policy IPA Derivatives for Make-to-Stock Prouction-Inventory Systems With Backorers Uner the (Rr) Policy Yihong Fan a Benamin Melame b Yao Zhao c Yorai Wari Abstract This paper aresses Infinitesimal Perturbation

More information

Expected Value of Partial Perfect Information

Expected Value of Partial Perfect Information Expecte Value of Partial Perfect Information Mike Giles 1, Takashi Goa 2, Howar Thom 3 Wei Fang 1, Zhenru Wang 1 1 Mathematical Institute, University of Oxfor 2 School of Engineering, University of Tokyo

More information

Linear Regression with Limited Observation

Linear Regression with Limited Observation Ela Hazan Tomer Koren Technion Israel Institute of Technology, Technion City 32000, Haifa, Israel ehazan@ie.technion.ac.il tomerk@cs.technion.ac.il Abstract We consier the most common variants of linear

More information

Closed and Open Loop Optimal Control of Buffer and Energy of a Wireless Device

Closed and Open Loop Optimal Control of Buffer and Energy of a Wireless Device Close an Open Loop Optimal Control of Buffer an Energy of a Wireless Device V. S. Borkar School of Technology an Computer Science TIFR, umbai, Inia. borkar@tifr.res.in A. A. Kherani B. J. Prabhu INRIA

More information

Final Exam Study Guide and Practice Problems Solutions

Final Exam Study Guide and Practice Problems Solutions Final Exam Stuy Guie an Practice Problems Solutions Note: These problems are just some of the types of problems that might appear on the exam. However, to fully prepare for the exam, in aition to making

More information

Resistant Polynomials and Stronger Lower Bounds for Depth-Three Arithmetical Formulas

Resistant Polynomials and Stronger Lower Bounds for Depth-Three Arithmetical Formulas Resistant Polynomials an Stronger Lower Bouns for Depth-Three Arithmetical Formulas Maurice J. Jansen University at Buffalo Kenneth W.Regan University at Buffalo Abstract We erive quaratic lower bouns

More information

Separation of Variables

Separation of Variables Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical

More information

An extension of Alexandrov s theorem on second derivatives of convex functions

An extension of Alexandrov s theorem on second derivatives of convex functions Avances in Mathematics 228 (211 2258 2267 www.elsevier.com/locate/aim An extension of Alexanrov s theorem on secon erivatives of convex functions Joseph H.G. Fu 1 Department of Mathematics, University

More information

Robustness and Perturbations of Minimal Bases

Robustness and Perturbations of Minimal Bases Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important

More information

Introduction to Machine Learning

Introduction to Machine Learning How o you estimate p(y x)? Outline Contents Introuction to Machine Learning Logistic Regression Varun Chanola April 9, 207 Generative vs. Discriminative Classifiers 2 Logistic Regression 2 3 Logistic Regression

More information

Agmon Kolmogorov Inequalities on l 2 (Z d )

Agmon Kolmogorov Inequalities on l 2 (Z d ) Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,

More information

How to Minimize Maximum Regret in Repeated Decision-Making

How to Minimize Maximum Regret in Repeated Decision-Making How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:

More information

Characterizing Real-Valued Multivariate Complex Polynomials and Their Symmetric Tensor Representations

Characterizing Real-Valued Multivariate Complex Polynomials and Their Symmetric Tensor Representations Characterizing Real-Value Multivariate Complex Polynomials an Their Symmetric Tensor Representations Bo JIANG Zhening LI Shuzhong ZHANG December 31, 2014 Abstract In this paper we stuy multivariate polynomial

More information

PLAL: Cluster-based Active Learning

PLAL: Cluster-based Active Learning JMLR: Workshop an Conference Proceeings vol 3 (13) 1 22 PLAL: Cluster-base Active Learning Ruth Urner rurner@cs.uwaterloo.ca School of Computer Science, University of Waterloo, Canaa, ON, N2L 3G1 Sharon

More information

Jointly continuous distributions and the multivariate Normal

Jointly continuous distributions and the multivariate Normal Jointly continuous istributions an the multivariate Normal Márton alázs an álint Tóth October 3, 04 This little write-up is part of important founations of probability that were left out of the unit Probability

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

II. First variation of functionals

II. First variation of functionals II. First variation of functionals The erivative of a function being zero is a necessary conition for the etremum of that function in orinary calculus. Let us now tackle the question of the equivalent

More information

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems Construction of the Electronic Raial Wave Functions an Probability Distributions of Hyrogen-like Systems Thomas S. Kuntzleman, Department of Chemistry Spring Arbor University, Spring Arbor MI 498 tkuntzle@arbor.eu

More information

Computing Derivatives

Computing Derivatives Chapter 2 Computing Derivatives 2.1 Elementary erivative rules Motivating Questions In this section, we strive to unerstan the ieas generate by the following important questions: What are alternate notations

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21 Large amping in a structural material may be either esirable or unesirable, epening on the engineering application at han. For example, amping is a esirable property to the esigner concerne with limiting

More information

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,

More information

WUCHEN LI AND STANLEY OSHER

WUCHEN LI AND STANLEY OSHER CONSTRAINED DYNAMICAL OPTIMAL TRANSPORT AND ITS LAGRANGIAN FORMULATION WUCHEN LI AND STANLEY OSHER Abstract. We propose ynamical optimal transport (OT) problems constraine in a parameterize probability

More information

Node Density and Delay in Large-Scale Wireless Networks with Unreliable Links

Node Density and Delay in Large-Scale Wireless Networks with Unreliable Links Noe Density an Delay in Large-Scale Wireless Networks with Unreliable Links Shizhen Zhao, Xinbing Wang Department of Electronic Engineering Shanghai Jiao Tong University, China Email: {shizhenzhao,xwang}@sjtu.eu.cn

More information

arxiv: v4 [math.pr] 27 Jul 2016

arxiv: v4 [math.pr] 27 Jul 2016 The Asymptotic Distribution of the Determinant of a Ranom Correlation Matrix arxiv:309768v4 mathpr] 7 Jul 06 AM Hanea a, & GF Nane b a Centre of xcellence for Biosecurity Risk Analysis, University of Melbourne,

More information

On combinatorial approaches to compressed sensing

On combinatorial approaches to compressed sensing On combinatorial approaches to compresse sensing Abolreza Abolhosseini Moghaam an Hayer Raha Department of Electrical an Computer Engineering, Michigan State University, East Lansing, MI, U.S. Emails:{abolhos,raha}@msu.eu

More information

Generalization of the persistent random walk to dimensions greater than 1

Generalization of the persistent random walk to dimensions greater than 1 PHYSICAL REVIEW E VOLUME 58, NUMBER 6 DECEMBER 1998 Generalization of the persistent ranom walk to imensions greater than 1 Marián Boguñá, Josep M. Porrà, an Jaume Masoliver Departament e Física Fonamental,

More information

Math 342 Partial Differential Equations «Viktor Grigoryan

Math 342 Partial Differential Equations «Viktor Grigoryan Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite

More information

On colour-blind distinguishing colour pallets in regular graphs

On colour-blind distinguishing colour pallets in regular graphs J Comb Optim (2014 28:348 357 DOI 10.1007/s10878-012-9556-x On colour-blin istinguishing colour pallets in regular graphs Jakub Przybyło Publishe online: 25 October 2012 The Author(s 2012. This article

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

A New Minimum Description Length

A New Minimum Description Length A New Minimum Description Length Soosan Beheshti, Munther A. Dahleh Laboratory for Information an Decision Systems Massachusetts Institute of Technology soosan@mit.eu,ahleh@lis.mit.eu Abstract The minimum

More information

Algorithms and matching lower bounds for approximately-convex optimization

Algorithms and matching lower bounds for approximately-convex optimization Algorithms an matching lower bouns for approximately-convex optimization Yuanzhi Li Department of Computer Science Princeton University Princeton, NJ, 08450 yuanzhil@cs.princeton.eu Anrej Risteski Department

More information

Gaussian processes with monotonicity information

Gaussian processes with monotonicity information Gaussian processes with monotonicity information Anonymous Author Anonymous Author Unknown Institution Unknown Institution Abstract A metho for using monotonicity information in multivariate Gaussian process

More information

Modelling and simulation of dependence structures in nonlife insurance with Bernstein copulas

Modelling and simulation of dependence structures in nonlife insurance with Bernstein copulas Moelling an simulation of epenence structures in nonlife insurance with Bernstein copulas Prof. Dr. Dietmar Pfeifer Dept. of Mathematics, University of Olenburg an AON Benfiel, Hamburg Dr. Doreen Straßburger

More information

A Review of Multiple Try MCMC algorithms for Signal Processing

A Review of Multiple Try MCMC algorithms for Signal Processing A Review of Multiple Try MCMC algorithms for Signal Processing Luca Martino Image Processing Lab., Universitat e València (Spain) Universia Carlos III e Mari, Leganes (Spain) Abstract Many applications

More information

CS 6375: Machine Learning Computational Learning Theory

CS 6375: Machine Learning Computational Learning Theory CS 6375: Machine Learning Computational Learning Theory Vibhav Gogate The University of Texas at Dallas Many slides borrowed from Ray Mooney 1 Learning Theory Theoretical characterizations of Difficulty

More information

arxiv: v1 [hep-lat] 19 Nov 2013

arxiv: v1 [hep-lat] 19 Nov 2013 HU-EP-13/69 SFB/CPP-13-98 DESY 13-225 Applicability of Quasi-Monte Carlo for lattice systems arxiv:1311.4726v1 [hep-lat] 19 ov 2013, a,b Tobias Hartung, c Karl Jansen, b Hernan Leovey, Anreas Griewank

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information