Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space


Journal of Machine Learning Research 3 (2003). Submitted 5/02; Published 3/03.

Simon Perkins (S.PERKINS@LANL.GOV)
Space and Remote Sensing Sciences, Los Alamos National Laboratory, Los Alamos, NM 87545, USA

Kevin Lacker (LACKER@EECS.BERKELEY.EDU)
Department of Computer Science, University of California, Berkeley CA 94720, USA

James Theiler (JT@LANL.GOV)
Space and Remote Sensing Sciences, Los Alamos National Laboratory, Los Alamos, NM 87545, USA

Editors: Isabelle Guyon and André Elisseeff

Abstract

We present a novel and flexible approach to the problem of feature selection, called grafting. Rather than considering feature selection as separate from learning, grafting treats the selection of suitable features as an integral part of learning a predictor in a regularized learning framework. To make this regularized learning process sufficiently fast for large scale problems, grafting operates in an incremental iterative fashion, gradually building up a feature set while training a predictor model using gradient descent. At each iteration, a fast gradient-based heuristic is used to quickly assess which feature is most likely to improve the existing model, that feature is then added to the model, and the model is incrementally optimized using gradient descent. The algorithm scales linearly with the number of data points and at most quadratically with the number of features. Grafting can be used with a variety of predictor model classes, both linear and non-linear, and can be used for both classification and regression. Experiments are reported here on a variant of grafting for classification, using both linear and non-linear models, and using a logistic regression-inspired loss function. Results on a variety of synthetic and real world data sets are presented. Finally the relationship between grafting, stagewise additive modelling, and boosting is explored.
Keywords: Feature selection, functional gradient descent, loss functions, margin space, boosting.

© 2003 Simon Perkins, Kevin Lacker and James Theiler.

1. Introduction

Systems for performing automated feature selection have long occupied a strange position, acting as a bridge between the harsh reality of the real world, and the cozy idealistic environments inhabited by most machine learning algorithms. No wonder then, that feature selection is often seen as something rather separate from learning, and altogether much more ad hoc and mysterious. As a result, most feature selection methods are rather independent of the learning systems they work with. Filter methods, for example the RELIEF algorithm (Kira and Rendell, 1992), use a quickly computed heuristic to estimate the value of each feature, individually or in combination, and use this

to select a set of features before the underlying learning engine ever sees the data. Wrapper methods (Kohavi and John, 1997) do at least interact with an underlying learning engine, but they typically only communicate with it through brief summaries of performance, e.g. cross-validation scores. All the other information that the learning system might have gleaned from the data is usually ignored when choosing features.

More recently however, great efforts have been made to expand the applicability and robustness of general learning engines, and as a result, the distinction between feature selection and learning is beginning to look a little artificial. After all, are they not just two sides of the same overall task of learning a good model, given a set of training data described using a large number of features, and without using any special domain knowledge? This observation motivates the approach presented in this paper, where we view feature selection, whether for performance or pragmatic purposes, as part of an integrated criterion optimization scheme.

Section 2 of this paper presents the regularized risk minimization framework that forms the core of this scheme, and Section 3 introduces a fast, incremental method for finding optimal (or sometimes approximately optimal) solutions, which we call grafting.[1] Section 4 provides an empirical comparison of grafting with a number of other learning and feature selection techniques, and finally, Section 5 draws conclusions and contrasts grafting with related work in stagewise additive modelling and boosting.

2. Learning to Select Features

In this paper, we view feature selection as just one aspect of the single problem of learning a mapping based on training data described by a large number of features. At the core of this view is a common modern approach to machine learning, which can be described as regularized risk minimization.
In the rest of this section, we review this approach, consider how it can be adapted to include feature selection, and explain why this might be a good idea.

2.1 Learning as Regularized Risk Minimization

First, a few definitions. We assume that we are trying to find a predictor function $f(\cdot)$ that maps feature[2] vectors $x$ of fixed length $n$ onto a scalar output value. If we have a binary classification problem then we derive an output label $y \in \{-1, +1\}$ with $y = \mathrm{sign}(f(x))$. If we have a regression problem then we produce an output predicted value $y = f(x)$. Since this is a machine learning paper, $f(\cdot)$ is derived, via a learning procedure, from a training set consisting of randomly sampled $(x, y)$ pairs drawn from the distribution we are attempting to model. We assume that $f(\cdot)$ is a member of a family of predictor functions that are parameterized by a set of parameters $\theta$. We can specify a particular member of this family explicitly as $f_\theta(\cdot)$, but we will often omit the subscript for brevity.

[1] The name grafting is derived from gradient feature testing, for reasons that will become clear.
[2] The term feature rather than variable is used throughout this paper to cover the general case where the arguments of $f(\cdot)$ are themselves functions of the raw variables describing the problem.

We want to come up with a $\theta$ that minimizes the expected risk:

$$R(\theta) = \iint L(f_\theta(x), y)\, p(x, y)\, dx\, dy \tag{1}$$

where $L(f(x), y)$ is a loss function that specifies how much we are penalized for returning $f(x)$ when the true target value is $y$; and $p(x, y)$ is the joint probability density function for $x$ and $y$. For classification problems the most common loss function is the misclassification rate: $L = \frac{1}{2}\,|y - \mathrm{sign}(f(x))|$.

In general we do not know $p(x, y)$ in (1), so it is usually not possible to directly optimize that criterion with respect to $\theta$. Instead, we usually work with the empirical risk $R_{\mathrm{emp}}$ calculated from the training data, with the integral in (1) replaced by a sum over all data points. As is well known, directly optimizing the empirical risk can lead to overfitting, so it is common in modern machine learning to attempt to minimize a combination of the empirical risk plus a regularization term to penalize over-complex solutions. That is, we attempt to minimize a criterion of the form:

$$C(\theta) = \frac{1}{m}\sum_{i=1}^{m} L(f_\theta(x_i), y_i) + \Omega(\theta) \tag{2}$$

where $m$ is the number of training data points and $\Omega(\theta)$ is a regularization function that has a high value for complex predictors $f$.

LOSS FUNCTIONS FOR CLASSIFICATION

In theory, we could simply use the error rate as the loss function in (2), but there are two main problems with this. First, this loss function almost inevitably leads to an optimization problem that is hard to solve exactly. Second, experience and theory have shown that we obtain robust and better generalizing classifiers if we prefer classifiers that separate the data by as wide a margin as possible. Discussion of this phenomenon, and many examples, can be found in Smola et al. (2000). We can usually improve generalization performance and make the criterion easier to optimize by choosing a loss function that encourages such large-margin solutions.

We define the margin for a classifier $f$ on a single data point $x$ with true label $y \in \{-1, +1\}$ to be $\rho = y f(x)$. The margin is positive if the point is correctly classified (the sign of $f(x)$ agrees with the sign of $y$), and negative otherwise.
Many commonly used loss functions can be conveniently defined in terms of this margin. Figure 1 illustrates a few of these. In our classification work, we use the Binomial Negative Log Likelihood loss function (Hastie et al., 2001, p. 308):

$$L_{\mathrm{bnll}} = \ln(1 + e^{-\rho})$$

This loss function is derived from a model that treats $f(x)$ as the log of the ratio of the probability that $y = +1$ to the probability that $y = -1$. The main value of this assumption is that it allows us to calculate $p(x) \equiv p(y = +1 \mid x)$ from $f(x)$ using the following relation:

$$p(x) = \frac{e^{f(x)}}{1 + e^{f(x)}}$$

This loss function can also be readily generalized to a multi-class classification problem using the ideas of multi-class logistic regression (Hastie et al., 2001). That reference also contains a derivation of the loss function.

$L_{\mathrm{bnll}}$ defines the BNLL loss for a single data point. It is also useful to define the total loss over the training set, also known as the empirical risk, which is the first term in (2):

$$L_{\mathrm{BNLL}} = \frac{1}{m}\sum_{i=1}^{m} L_{\mathrm{bnll}}(\rho_i)$$

where $\rho_i = y_i f(x_i)$ is the margin of the $i$th training point.
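The BNLL loss and the probability relation above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function names `bnll_loss`, `prob_positive` and `total_bnll` are hypothetical, and the branching on the sign of the argument is an added numerical-stability choice rather than something the text prescribes.

```python
import math

def bnll_loss(rho):
    """Binomial negative log-likelihood ln(1 + exp(-rho)) for one margin rho."""
    # Two algebraically equal forms; each avoids overflow in exp() on its branch.
    if rho >= 0:
        return math.log1p(math.exp(-rho))
    return -rho + math.log1p(math.exp(rho))

def prob_positive(f_x):
    """p(y = +1 | x) = e^f / (1 + e^f), evaluated without overflow."""
    if f_x >= 0:
        return 1.0 / (1.0 + math.exp(-f_x))
    e = math.exp(f_x)
    return e / (1.0 + e)

def total_bnll(outputs, labels):
    """Empirical risk: mean BNLL over (f(x_i), y_i) pairs, labels in {-1, +1}."""
    margins = [y * f for f, y in zip(outputs, labels)]
    return sum(bnll_loss(r) for r in margins) / len(margins)
```

For a point with label +1 the margin equals f(x), so `bnll_loss(f)` coincides with `-log(prob_positive(f))`, which is the negative log-likelihood reading of the loss.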

[Figure 1 plot: "Example Loss Functions", showing loss against the margin $\rho$ for the error rate, SVM, perceptron and binomial negative log likelihood losses.]

Figure 1: Commonly used loss functions, plotted as a function of the margin $\rho = y f(x)$. Shown are the SVM loss function: $L_{\mathrm{svm}} = \max(0, 1 - \rho)$; the perceptron criterion: $L_{\mathrm{per}} = \max(0, -\rho)$; and the binomial negative log-likelihood: $L_{\mathrm{bnll}} = \ln(1 + e^{-\rho})$.

REGULARIZATION

Possibly the most straightforward approach to machine learning involves defining a family of models and then selecting the model that minimizes the empirical risk. While this is certainly an often-used technique, it has two serious problems.

The first problem is that for certain combinations of model family, empirical risk function, and training data, the optimization problem can be unbounded with respect to the model parameters $\theta$. For example, consider the linear model defined by:

$$f(x) = \sum_{i=1}^{n} w_i x_i + b \tag{3}$$

where $w = (w_1, \ldots, w_n)^T$ is a vector of weights, $x_i$ is the $i$th feature of the feature vector $x$, and $b$ is a constant offset. If the training data is linearly separable, then we can increase the magnitude of $w$ indefinitely to reduce $L_{\mathrm{BNLL}}$.

The second problem is the well-known overfitting problem. Given a sufficiently flexible family of classifiers, it is often possible to find one that has a very low empirical risk, but that generalizes very badly to previously unseen data.

Both these problems can be tackled by adding a regularization term to the empirical risk. The regularization term (or regularizer) is a function of the model parameters that returns a high value for unlikely or complex models that are liable to generalize badly. By optimizing a sum of the regularizer and the empirical risk, we achieve a trade-off between model simplicity and empirical risk. If the balance is chosen appropriately then we can often improve generalization performance significantly compared with simple empirical risk minimization.
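The unboundedness problem is easy to see numerically. The sketch below uses a hypothetical one-feature, linearly separable toy set and a hypothetical helper `mean_bnll`; the penalty weight `lam` is an arbitrary illustrative value. Scaling the weight up keeps lowering the unregularized loss, while a squared-weight penalty turns the criterion back upward.

```python
import math

def mean_bnll(w, b, X, y):
    """Mean BNLL loss of the linear model f(x) = w . x + b on (X, y)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        rho = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b)
        total += math.log1p(math.exp(-rho))  # margins are positive in this demo
    return total / len(X)

# Hypothetical separable data: negative examples left of zero, positive right.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [-1, -1, 1, 1]
lam = 0.1  # illustrative regularization strength

for scale in (1.0, 10.0, 100.0):
    w = [scale]
    loss = mean_bnll(w, 0.0, X, y)
    penalized = loss + lam * sum(wj * wj for wj in w)
    print(scale, loss, penalized)
```

The printed loss shrinks toward zero as the scale grows, so without a regularizer there is no finite minimizer on this data.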
The form of the regularizer depends to some extent on the form of the model, but here we restrict ourselves to a class of models where the model parameters $\theta$ take the form of a vector of real-valued numbers of length $p$, which we will refer to as a weight vector $w$. This class of models includes linear models and many types of multi-layer perceptron (MLP).

Given this parameter vector, we can define a commonly employed family of regularizers parameterized by a non-negative integer $q$, and a vector of positive real numbers $\alpha$:

$$\Omega_q(w) = \lambda \sum_{i=1}^{p} \alpha_i |w_i|^q \tag{4}$$

Members of this regularizer family correspond to different kinds of weighted Minkowski norm of the parameter vector, and so $\Omega_q$ is often referred to as an $\ell_q$ regularizer. Usually, we choose $\alpha_i \in \{0, 1\}$ so as to simply include or exclude certain elements of $w$ from the regularization.[3] The essence of these regularizers is that they penalize large values of $w_i$ when $\alpha_i > 0$.

It is easy to show that the solutions found by unconstrained minimization of (2) using an $\Omega_q$ regularizer are equivalent to those found by the following constrained optimization problem:

$$\text{minimize} \quad \frac{1}{m}\sum_{i=1}^{m} L(f(x_i), y_i) \qquad \text{s.t.} \quad \sum_{i=1}^{p} \alpha_i |w_i|^q \le \gamma \tag{5}$$

There is a one-to-one (but not necessarily simple) correspondence between the parameters $\lambda$ and $\gamma$. This alternative formulation is useful when we consider the $\Omega_0$ regularizer.

The most interesting members of the family involve $q \in \{0, 1, 2\}$. The following is a summary of the properties and peculiarities of these three regularizers.

$\Omega_2$: This regularizer is seen in ridge regression (Hoerl and Kennard, 1970), the support vector machine (Boser et al., 1992; Schölkopf and Smola, 2002) and regularization networks (Girosi et al., 1995). Those references give various justifications for this form of regularization. One reason for preferring it over other $\Omega_q$ regularizers is that it is the only one which produces the same solution under an arbitrary rotation of the feature space axes. The $\ell_2$ norm also makes the solution to (2) bounded. As $\lambda$ is increased, the magnitudes of the elements of $w$ will tend to decrease, but in general none will go to zero. The $\ell_2$ norm is a convex function of weights, and so if the loss function being optimized is also a convex function of the weights, then the regularized loss has a single local (and global) optimum.
$\Omega_1$: The $\ell_1$-based regularizer is also known as the lasso. Tibshirani (1994) describes it in great detail and notes that one of its main advantages is that it often leads to solutions where some elements of $w$ are exactly zero. As $\lambda$ is increased, the number of zero weights steadily increases. Unlike the $\Omega_2$ regularizer, using the $\ell_1$ norm means that an arbitrary rotation of the feature axes in general produces a different solution, so the feature axes have special status with this choice of regularizer. Like the $\Omega_2$ regularizer, this regularizer leads to bounded solutions. Similarly, the $\ell_1$ norm is a convex function of weights, and so if the loss function being optimized is also a convex function of the weights, then the regularized loss has a single local (and global) optimum.

$\Omega_0$: If we define $0^0 \equiv 0$, then this contributes a fixed penalty $\alpha_i$ for each weight $w_i \neq 0$. If all $\alpha_i$ are identical, then this is equivalent to setting a limit on the maximum number of non-zero weights. The $\ell_0$ norm is, however, a non-convex function of weights, and this tends to make exact optimization of (2) computationally expensive.

[3] For instance in a linear model we usually want to exclude the constant offset term.
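As a concrete reading of the regularizer family in (4), the following sketch evaluates the penalty for q = 0, 1, 2 on a small hypothetical weight vector. The function name `omega_q` and the example values are illustrative assumptions; the q = 0 branch encodes the convention that 0^0 is treated as 0.

```python
def omega_q(w, q, lam=1.0, alpha=None):
    """Weighted l_q penalty: lam * sum_i alpha_i * |w_i|^q, with 0^0 taken as 0."""
    if alpha is None:
        alpha = [1.0] * len(w)
    total = 0.0
    for a_i, w_i in zip(alpha, w):
        if q == 0:
            term = 0.0 if w_i == 0 else 1.0  # fixed cost per non-zero weight
        else:
            term = abs(w_i) ** q
        total += a_i * term
    return lam * total

# Hypothetical example: the first entry plays the role of an unpenalized
# offset (alpha = 0), the remaining three are penalized feature weights.
w = [3.0, -2.0, 0.5, 0.0]
alpha = [0, 1, 1, 1]
penalties = {q: omega_q(w, q, alpha=alpha) for q in (0, 1, 2)}
```

Here the q = 0 penalty simply counts the two non-zero penalized weights, while q = 1 and q = 2 grow with their magnitudes.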

2.2 Feature Selection as Regularization

If we have a mathematical expression for a model in which feature vector elements only ever appear with an associated multiplicative weight, then the process of feature selection amounts to producing a model in which only a subset of weights associated with features are non-zero. Of the regularizers described above, $\Omega_0$ and $\Omega_1$ lead to solutions with some weights set to exactly zero. But can they be justified in terms of the standard reasons for feature selection? There are many motivations for feature selection, but we will consider two broad classes which generally encompass the reasons most commonly given.

PRAGMATIC MOTIVATIONS FOR FEATURE SELECTION

Often, the motivations for feature selection are pragmatic. We wish to reduce training time; reduce the time it takes to apply a learned model to new data; reduce the storage requirements for the model; or improve the intelligibility of the model. We can interpret all of these as either constituting a fixed penalty for including a feature in the model, or as a constraint on the maximum number of features in the model. For the simplest linear model case, where each feature appears in the model associated with just a single weight, it is easy to see that both of these interpretations correspond to an $\Omega_0$ regularizer with $\alpha_i > 0$ only where $w_i$ is the multiplicative weight on a feature. For more complex models, where each feature might be associated with several weights, we can handle this with a slightly modified version of the $\Omega_0$ regularizer:

$$\Omega_0(w) = \sum_{i=1}^{n} \alpha_i \delta_i \tag{6}$$

where $\delta_i = \max_{j \in s_i} \left(|w_j|^0\right)$, in which $s_i$ is the set of weight indices associated with the $i$th feature.
If different features carry different costs, for instance if some features are very expensive to compute, then we can adjust the $\alpha_i$ associated with those features accordingly.

PERFORMANCE MOTIVATIONS FOR FEATURE SELECTION

The other common motivation for feature selection is to improve the generalization performance of our learned models. In general, the more feature dimensions a model includes, the greater its capacity, and hence the greater the tendency for it to overfit the training data (the so-called curse of dimensionality). But regularization techniques are intended to prevent overfitting, and the question arises: if we use regularization, do we need to do any additional feature selection? Or alternatively, can we achieve improved generalization performance by using a regularizer that encourages zero-weighted features in our model, such as the $\Omega_1$ and $\Omega_0$ regularizers?

In order to explore this issue, we performed a simple experiment to compare the generalization performance of the same simple linear classifier using $\Omega_0$, $\Omega_1$ and $\Omega_2$ regularizers, in the presence of varying numbers of irrelevant features. We created a sequence of simple $n$-feature two-class problems as follows. For the first class, the $n$ features for each training example are drawn independently from a normal distribution with mean equal to $-1$ and standard deviation $\sigma$. The $n$ features of the second class are generated in the same way except a normal distribution with a mean of $+1$ is used. We then randomly permute the feature values between all the training examples for all elements of the feature vector except the first two features. This produces a training set where each feature has exactly the same distribution, but only the first two are correlated with the class label and the other $n - 2$

features are irrelevant. Training sets with between 0 and 12 irrelevant features were generated, each containing 10 samples drawn from each class. We compared problems with $\sigma = 0.5$ and $\sigma = 1.0$. The former has a Bayes misclassification rate of [...], the latter has a Bayes misclassification rate of [...]. Test sets were generated in the same way, but with 1000 samples in each class.

We used a linear model as in (3) which was trained by optimizing a regularized risk criterion (2), using the logistic regression loss function $L_{\mathrm{bnll}}$ and one of the three $\Omega_q$ regularizers, or using no regularization. For $\Omega_1$ and $\Omega_2$ we used a gradient descent algorithm (Nelder and Mead, 1965), and for $\Omega_0$ we used backward elimination (Kohavi and John, 1997), choosing to eliminate the feature that increased the loss function by the least at each step. This is a greedy procedure that may not produce the optimal answer, but exhaustive subset comparison was impractically slow. The regularization parameters ($\lambda$ for the $\Omega_1$ and $\Omega_2$ regularizers, $\gamma$ for the $\Omega_0$ regularizer) were found by generating multiple instances of each problem and searching for the values that minimized the average misclassification rate.

[Figure 2 plots: two panels, $\sigma = 0.5$ and $\sigma = 1.0$, each showing mean misclassification rate against the number of irrelevant features for $\Omega_0$, $\Omega_1$, $\Omega_2$, unregularized, and the Bayes error.]

Figure 2: Comparison of different regularization schemes on a problem with varying numbers of irrelevant features and different optimal Bayes error.

Figure 2 compares the performance of the various regularizers as the number of irrelevant features is altered. For each problem type, 200 training and test sets were randomly generated. The plots show the mean test score for each type of regularizer, and for the two different values of $\sigma$. For reasons of clarity, error bars indicating the standard error of the mean misclassification rate are not shown here, but they are small compared to the separation between the curves.
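The Bayes misclassification rate of this synthetic problem has a closed form. The sketch below is a calculation under stated assumptions, not a quotation of the paper's figures: it assumes equal class priors, exactly two informative features with class means of -1 and +1, and a shared standard deviation. Under those assumptions the optimal rule projects onto the line joining the class means, giving an error of Phi(-sqrt(2)/sigma). The function name `bayes_error` is hypothetical.

```python
import math

def bayes_error(sigma, n_informative=2, mean=1.0):
    """Bayes error for two equal-prior isotropic Gaussians with means -mean and
    +mean on each informative feature (irrelevant features do not affect it)."""
    # Half the Euclidean distance between the two class means.
    half_dist = mean * math.sqrt(n_informative)
    z = half_dist / sigma
    # Standard normal lower tail: Phi(-z) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2.0))

err_low_noise = bayes_error(0.5)   # roughly 0.0023
err_high_noise = bayes_error(1.0)  # roughly 0.079
```

These values give a floor against which the curves in Figure 2 can be read.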
Both values of $\sigma$ produce qualitatively similar results. The unregularized and subset selection ($\Omega_0$) experiments perform worst, although subset selection does relatively better when the informative features are well-separated in the low $\sigma$ case. Of the other two experiments, when most features are relevant, $\Omega_2$ regularization slightly outperforms $\Omega_1$ regularization. But as the number of irrelevant features increases, $\Omega_1$ regularization takes the lead. Interestingly, when more than two-thirds of the features are irrelevant in this case, the test performance using $\Omega_1$ regularization seems to level off, while the performance using $\Omega_2$ regularization continues to degrade.

In conclusion, if performance alone is the key concern, then either $\Omega_1$ or $\Omega_2$, rather than $\Omega_0$, seem to be the preferred regularizers. If we expect a large fraction of irrelevant features, then we might prefer the $\Omega_1$ regularizer. These conclusions are somewhat similar to those reached by Tibshirani (1994).
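The data-generation recipe used in this experiment can be sketched as follows. This is an illustrative reconstruction from the description in the text, with a hypothetical function name `make_problem` and Python's standard `random` module standing in for whatever generator the authors used.

```python
import random

def make_problem(n_features, n_per_class, sigma, rng):
    """Two-class data: every feature ~ N(-1, sigma) for class -1 and
    N(+1, sigma) for class +1; columns 2.. are then shuffled across examples."""
    X, y = [], []
    for label, mean in ((-1, -1.0), (+1, +1.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(mean, sigma) for _ in range(n_features)])
            y.append(label)
    # Permuting a column keeps its marginal distribution but destroys
    # its correlation with the class label.
    for j in range(2, n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)
        for row, value in zip(X, column):
            row[j] = value
    return X, y

# Example: 2 informative + 4 irrelevant features, 10 samples per class.
X, y = make_problem(n_features=6, n_per_class=10, sigma=0.5, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps repeated runs reproducible, which helps when averaging over many generated training and test sets.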

Criterion | $\Omega_0$ | $\Omega_1$ | $\Omega_2$
Models pragmatic motivations for feature selection? | Yes | No | No
Models performance motivations for feature selection? | No | Yes | Yes
Leads to sparse solutions? | Yes | Yes | No
Performance when most features are relevant? | OK | Good | Excellent
Performance when most features are irrelevant? | Poor | Good | OK
Convex regularizer? (doesn't add extra local optima) | No | Yes | Yes
Numerical friendliness | Poor | Good | Excellent

Table 1: Comparison of three different $\Omega_q$ regularizers.

2.3 A Unified Optimization Criterion

We have argued that for a significant class of models, described by real-valued parameter vectors, many motivations for feature selection can be incorporated into a regularized risk minimization framework, using a suitable combination of $\Omega_q$ regularizers. Table 1 summarizes the different qualities of the three $\Omega_q$ regularizers we have considered. In general, we might want to use all three, which leads to the following optimization criterion:

$$C(w) = \frac{1}{m}\sum_{i=1}^{m} L(f(x_i), y_i) + \lambda_2 \sum_{i=1}^{p} \alpha_{2,i} |w_i|^2 + \lambda_1 \sum_{i=1}^{p} \alpha_{1,i} |w_i| + \lambda_0 \sum_{i=1}^{n} \alpha_{0,i} \delta_i \tag{7}$$

3. Grafting

We wish to find a minimum of (7) with respect to our model parameters. We now consider how this might be done, and introduce the grafting algorithm as a fast way of getting to an optimal or approximately optimal solution, if $\lambda_0$ or $\lambda_1$ is non-zero.

3.1 Direct Gradient Descent Optimization

Probably the most direct method of solution of (7) is to perform gradient descent with respect to the model parameters $w$, until a minimum is found. If we can compute the gradient of the loss function and the regularization term(s) with respect to these parameters, then we can use conjugate gradient or quasi-Newton methods. If not, then we can use a minimization method that doesn't require gradient information, such as Powell's direction set method. See Press et al. (1992, chap. 10) for overviews of these methods.

Unfortunately there are a number of problems with this approach. Firstly, the gradient descent can be quite slow.
Algorithms such as conjugate gradient descent typically scale quadratically with the number of dimensions,[4] and we are particularly interested in the domain where we have many features and so $p$ is large. This quadratic dependence on number of model weights seems particularly wasteful if we are using the $\Omega_0$ and/or $\Omega_1$ regularizer, and we know that only some subset of those weights will be non-zero.

The second problem is that the $\Omega_0$ and $\Omega_1$ regularizers do not have continuous first derivatives with respect to the model weights. This can cause numerical problems for general purpose gradient descent algorithms that expect these derivatives to be continuous.

[4] Conjugate gradient descent requires only $O(p)$ line minimizations, but the gradient calculation required to determine the direction for each minimization is also typically $O(p)$, giving total complexity that is closer to $O(p^2)$ for large $n$.

Finally, we have the problem of local minima. The $\Omega_2$ and $\Omega_1$ regularizers are convex functions of the weights and so if the loss function being used is also a convex function of weights, then we have a single optimum. The $\Omega_0$ regularizer on the other hand is not convex and introduces many local minima into the optimization problem.

3.2 Stagewise Gradient Descent Optimization

If we are using the $\Omega_1$ or $\Omega_0$ regularizer, and we suspect that the number $N$ of non-zero weights in the final model is going to be much less than the total number of weights $n$, then a more efficient stagewise optimization procedure suggests itself. We call this algorithm grafting.

The basic plan is to begin with a model in which almost all weights are at zero. At each iteration of the grafting procedure, we use a fast gradient-based heuristic to decide which zero weight should be adjusted away from zero in order to decrease the optimization criterion by the maximum amount. We then perform gradient descent using that weight and any other non-zero weights in the model, and continue until no further progress can be made.

THE BASIC GRAFTING ALGORITHM

For ease of presentation, we first consider the case where $\lambda_0 = 0$ in (7), but $\lambda_1$ and $\lambda_2$ are non-zero. We will return to the case where $\lambda_0 \neq 0$ later. At this stage, our discussion applies to a broad class of models and loss functions.

As described above, we assume that the model we are using is parameterized by a weight vector $w$. At any stage in the grafting process, the model weights are divided into two disjoint sets. Those weights $w_i \in F$ are free to be altered as desired. The remaining weights $w_i \in Z$ (the complement of $F$) are fixed at zero. We also assume that the output of the model for a given training example is differentiable with respect to the model weights, i.e. we can calculate $\partial f(x_i) / \partial w_j$ for an arbitrary feature vector $x_i$ and an arbitrary weight $w_j$.
As explained below, after each grafting step (and before the first step) we minimize (7) with respect to the free weights, so before the $k$th grafting step, we have:

$$\forall i \in F: \quad \frac{\partial C}{\partial w_i} = 0$$

During the $k$th grafting step, we wish to move one weight from $Z$ to $F$. It seems sensible to select the weight which is going to have the greatest effect on reducing the optimization criterion $C$. The gradient of the criterion with respect to an arbitrary model weight $w_i$ is (with $j$ indexing the training points):

$$\frac{\partial C}{\partial w_i} = \frac{1}{m}\sum_{j=1}^{m} \frac{\partial L}{\partial f(x_j)} \frac{\partial f(x_j)}{\partial w_i} + 2\lambda_2 \alpha_{2,i} w_i + \lambda_1 \alpha_{1,i}\, \mathrm{sign}(w_i) = \frac{1}{m}\sum_{j=1}^{m} \frac{\partial L}{\partial f(x_j)} \frac{\partial f(x_j)}{\partial w_i} \pm \lambda_1 \alpha_{1,i} \tag{8}$$

The contribution from the $\Omega_2$ term disappears because $w_i = 0$ for all $w_i \in Z$. Slightly more subtle is the replacement of $\mathrm{sign}(w_i)$ with $\pm 1$, which invites the question of what sign should be used, and whether in fact $\mathrm{sign}(0)$ has a well-defined value at all. Recall, however, that we are interested in

determining which weight, when adjusted in the appropriate direction, will decrease $C$ at the fastest rate. Consider $\partial L_{\mathrm{TOT}} / \partial w_i$, the derivative of the total loss with respect to $w_i$ (this is just the first term in the above expression). Suppose that $\partial L_{\mathrm{TOT}} / \partial w_i > \lambda_1 \alpha_{1,i}$. This means that $\partial C / \partial w_i > 0$, regardless of the sign of $w_i$. In this case, in order to decrease $C$, we will want to decrease $w_i$. Since $w_i$ starts at zero, the very first infinitesimal adjustment to $w_i$ will take it negative. Therefore for our purposes we can let $\mathrm{sign}(w_i) = -1$. Similarly, if $\partial L_{\mathrm{TOT}} / \partial w_i < -\lambda_1 \alpha_{1,i}$, then we can effectively let $\mathrm{sign}(w_i) = +1$. Essentially, the effect of the $\Omega_1$ derivative is simply to reduce the magnitude of $\partial C / \partial w_i$ by an amount $\lambda_1 \alpha_{1,i}$.

The same argument shows that if $|\partial L_{\mathrm{TOT}} / \partial w_i| < \lambda_1 \alpha_{1,i}$ then it is not possible to produce any local decrease in $C$ by adjusting $w_i$ away from zero. This is the essence of why the $\Omega_1$ regularizer leads to solutions with zero-valued weights, and also provides the basis for one of the two stopping conditions discussed below.

At each grafting step, we calculate the magnitude of $\partial C / \partial w_i$ for each $w_i \in Z$, and determine the maximum magnitude. We then add the corresponding weight to the set of free weights $F$, and call a general purpose gradient descent routine to optimize (7) with respect to all the free weights. Since we know how to calculate the gradient $\partial C / \partial w_i$, we can use an efficient quasi-Newton method. We use the Broyden-Fletcher-Goldfarb-Shanno (BFGS) variant (see Press et al. 1992, chap. 10 for a description). We start the optimization at the weight values found in the $(k-1)$th step, so in general only the most recently added weight changes significantly.

Note that choosing the next weight based on the magnitude of $\partial C / \partial w_i$ does not guarantee that it is the best weight to add at that stage. However, it is much faster than the alternative of trying an optimization with each member of $Z$ in turn and picking the best.
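The gradient test described above can be sketched as a small selection routine. This is an illustrative helper, not the authors' code: the name `pick_weight_to_graft` and its arguments are hypothetical, and it assumes the loss derivatives for the zero-set weights have already been computed. It returns the weight whose criterion gradient has the largest magnitude after the l1 term is subtracted, together with the direction in which that weight should first move.

```python
def pick_weight_to_graft(loss_grads, lam1_alpha, zero_set):
    """Return (index, direction) for the most promising weight in Z, or None
    if no weight can locally decrease the criterion.

    loss_grads[i]  : dL_TOT/dw_i evaluated at the current model
    lam1_alpha[i]  : lambda_1 * alpha_{1,i} for weight i
    zero_set       : indices currently fixed at zero (the set Z)
    """
    best, best_gain = None, 0.0
    for i in zero_set:
        # The l1 term shrinks the usable gradient magnitude by lam1_alpha[i].
        gain = abs(loss_grads[i]) - lam1_alpha[i]
        if gain > best_gain:
            best, best_gain = i, gain
    if best is None:
        return None
    # A positive loss gradient means the weight should move negative, and vice versa.
    direction = -1 if loss_grads[best] > 0 else +1
    return best, direction
```

A return value of `None` corresponds to the situation where every weight in Z has a loss gradient no larger in magnitude than its l1 penalty.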
We shall see below that this procedure will take us to a solution that is at least locally optimal.

INCORPORATING THE $\Omega_0$ REGULARIZER

Use of the $\Omega_0$ regularizer means that transferring a weight $w_i$ from $Z$ to $F$ incurs a penalty of $\lambda_0 \alpha_{0,i} \delta_i$. This fixed penalty makes it substantially harder to determine which weight is the most promising one to transfer to $F$.[5] The heuristic we use in this case is based upon the empirical observation that in a sequence of grafting steps, the magnitude of the most recently added weight in $F$ typically decreases monotonically. This allows us to estimate an upper limit on the magnitude of the weight we are about to add, which in turn means we can roughly estimate a bound on the change in $C$ which will result from adding a weight $w_i$ in the grafting step after a weight $w_j$ was added:

$$\Delta C(w_i) \approx \lambda_0 \alpha_{0,i} \delta_i - |w_j| \left( \left| \frac{\partial L_{\mathrm{TOT}}}{\partial w_i} \right| - \lambda_1 \alpha_{1,i} - \lambda_2 \alpha_{2,i} |w_j| \right) \tag{9}$$

Picking the best weight to add then amounts to choosing the $w_i \in Z$ that minimizes (9). Note that if $\lambda_0 = 0$ and $\forall \{w_i, w_j\} \in Z: \alpha_{2,i} = \alpha_{2,j}$, then this heuristic is equivalent to the simple heuristic previously discussed.

STOPPING CONDITIONS

If only the $\Omega_2$ regularizer is being used, then in general the model will contain no zero-valued weights, and the grafting procedure will not terminate until $Z$ is empty. In this case there is no advantage in using grafting over full gradient descent optimization.

[5] One exception is when all weights in $Z$ incur the same $\Omega_0$ penalty, as is the case with a simple linear model where we simply penalize the number of included features. In this case it is reasonable to use the standard heuristic.

If we are using the $\Omega_1$ regularizer, then we can reach a point where:

$$\forall w_i \in Z: \quad \left| \frac{\partial L_{\mathrm{TOT}}}{\partial w_i} \right| \le \lambda_1 \alpha_{1,i}$$

At this point it is not possible to make any further decrease in $C$ by either moving a weight from $Z$ to $F$, or by adjusting any weights in $F$, and so we are at a local (and perhaps global) minimum, and can terminate the grafting procedure.

If we are using the $\Omega_0$ regularizer, then we may reach a point where $C$ increases after adding $w_i$ to $F$. We must then set $w_i$ to zero, remove it from $F$, and undo the last optimization step (it is convenient to keep a copy of the previous model around in order to avoid an extra optimization step). We then have a choice. It is possible that a different choice of $w_i$ might lead to a decrease in $C$, so we could try the optimization step again with the $w_i$ associated with the next lowest value of $\Delta C(w_i)$. This cycle could be repeated until all remaining weights in $Z$ have been eliminated, and the algorithm then terminates. Alternatively we can just terminate the algorithm the first time this happens, recognizing that with the $\Omega_0$ regularizer, our solution will be a greedy approximation to the optimal solution at best. The latter approach is the one we recommend in most cases.

OPTIMALITY

If we have a convex loss function (as a function of weights) and are using just the $\Omega_2$ and/or $\Omega_1$ regularizers (which are themselves convex functions of the weights), then there is only one minimum of (7). Examination of the stopping conditions above reveals that the grafting algorithm is guaranteed to stop at a local optimum, and so grafting is guaranteed to find the global optimum in these cases.

As we have seen by now, use of the $\Omega_0$ regularizer makes it much harder to find an optimal solution. The grafting procedure with non-zero $\lambda_0$ amounts to a greedy heuristic forward subset selection method, which sacrifices optimality in return for fast learning. Whether this is good enough for the problem at hand depends on the situation.
One should note, however, that as $\lambda_0$ is made smaller and smaller relative to $\lambda_1$, the chances of ending up in a sub-optimal situation decrease. Hence we are inclined to make $\lambda_0$ fairly small in most cases.

COMPUTATIONAL COMPLEXITY

We have claimed that grafting is substantially faster than full gradient descent. We will now examine this claim more carefully.

If there are $p$ weights in our weight vector, then full gradient descent requires some multiple of $p$ line minimizations to optimize our criterion, let's say $cp$ minimizations.⁶ Determining the direction requires $p$ derivatives $\partial C/\partial w_i$ to be computed. The computation of each derivative is dominated by the computation of $\partial L_{TOT}/\partial w_i$, which is simply a weighted sum of simple derivatives $\partial f(x_j)/\partial w_i$, one per training point. The line minimizations themselves require a few $O(m)$ function evaluations, but if $p$ is large, then this is a minor contribution. If we denote the time taken to calculate one simple derivative as $\tau$, then the total time taken for full gradient descent is $cmp^2\tau$.

6. Here, $c = 1$ if our criterion is a perfect quadratic form, and $c > 1$ otherwise.

Under grafting we will select some number $s$ of weights before the algorithm terminates. Since we select one weight at each grafting step, we take $s$ steps. The $k$th step consists of two phases. First we evaluate $\partial C/\partial w_i$ for each of the $(p - k)$ weights in $Z$. As noted above, the derivative calculation takes $m\tau$, and so the time devoted to gradient testing over $s$ steps is approximately $msp\tau$. The second phase involves optimizing with respect to the $k$ free weights. At most this might take $ck$ line minimizations, but it should take less than this since most of the free weights will be close to their optimal values. To compensate for using a constant $c$ that is probably too high here, we again ignore the time taken for the line minimizations themselves (some small multiple of $m\tau$). Therefore the time taken for optimization at the $k$th step is $cmk^2\tau$. Over $s$ steps, this is approximately $\frac{1}{3}cms^3\tau$. Putting this together, the total grafting run time is $(msp + \frac{1}{3}cms^3)\tau$. If we assume that $s \ll p$ then it is clear that the grafting algorithm should be substantially quicker than the $cmp^2\tau$ required for full gradient descent.

Also note that the full gradient descent algorithm has to deal with discontinuities in the gradient, which can slow it down significantly. By keeping zero-valued weights out of the optimization steps, grafting avoids this difficulty.

NORMALIZATION

In order to make the gradient magnitude heuristic a fair comparison, it is usually important to normalize all features so that they have approximately the same scale. Before we begin, we linearly scale all feature vectors so that each feature has a mean value of zero and a standard deviation of one. It is of course important to scale testing data using the same scaling parameters derived from the training data.

3.3 Grafting Examples

It is helpful to illustrate the grafting algorithm in more detail for some particular models and loss functions. Here we will concern ourselves only with binary classification problems, and so a suitable loss function to use is the binomial negative log likelihood $L_{bnll}$.
For simplicity, we will also assume that only $\Omega_1$ and $\Omega_0$ regularization are used, and that all $\alpha_{1,i} \in \{0,1\}$.

LINEAR MODEL

We first consider linear models with $n + 1$ weights, of the form:

$$f(x) = \sum_{i=1}^{n} w_i x_i + w_0$$

If we define the margin for a given training pair $(x_i, y_i)$ as $\rho_i = y_i f(x_i)$, then the following is the regularized optimization criterion:

$$C(w) = \frac{1}{m} \sum_{i=1}^{m} \ln\left(1 + e^{-\rho_i}\right) + \lambda_1 \sum_{i=1}^{n} |w_i| + \lambda_0 s \qquad (10)$$

where $s$ is the number of selected features. Note that the constant offset term $w_0$ does not appear in the regularizer since we do not want to penalize a mere translation of the linear discriminant surface. The derivatives we need (ignoring the $\Omega_0$ term for now) are:

$$\frac{\partial C}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L_{bnll}}{\partial \rho_i} \frac{\partial \rho_i}{\partial w_j} \pm \lambda_1 = -\frac{1}{m} \sum_{i=1}^{m} \frac{1}{1 + e^{\rho_i}} \, y_i x_{i,j} \pm \lambda_1$$

where $x_{i,j}$ is the $j$th component of $x_i$.
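The criterion (10) and the gradient test for the linear model can be written out directly. The following pure-Python sketch is illustrative only: the function names are hypothetical, $s$ is counted as the number of non-zero weights, all $\alpha_{1,i}$ are taken to be 1, and no care is taken over numerical overflow for very large margins.

```python
import math


def margins(w, w0, X, y):
    """rho_i = y_i * f(x_i) for the linear model f(x) = w.x + w0."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + w0)
            for xi, yi in zip(X, y)]


def criterion(w, w0, X, y, lam1, lam0):
    """Equation (10): mean bnll loss + lam1 * sum|w_i| + lam0 * s."""
    rho = margins(w, w0, X, y)
    loss = sum(math.log1p(math.exp(-r)) for r in rho) / len(X)
    s = sum(1 for wj in w if wj != 0.0)  # assumed: s = number of non-zero weights
    return loss + lam1 * sum(abs(wj) for wj in w) + lam0 * s


def loss_gradient(w, w0, X, y):
    """dL/dw_j = -(1/m) * sum_i y_i x_ij / (1 + exp(rho_i)), regularizer excluded."""
    rho = margins(w, w0, X, y)
    m = len(X)
    coef = [-yi / (1.0 + math.exp(r)) for r, yi in zip(rho, y)]
    return [sum(c * xi[j] for c, xi in zip(coef, X)) / m for j in range(len(w))]


def best_candidate(w, w0, X, y, lam1):
    """Index of the zero weight with the largest loss gradient magnitude,
    or None if no zero weight beats the Omega_1 penalty lam1."""
    g = loss_gradient(w, w0, X, y)
    zero = [j for j, wj in enumerate(w) if wj == 0.0]
    if not zero:
        return None
    j = max(zero, key=lambda k: abs(g[k]))
    return j if abs(g[j]) > lam1 else None


# Tiny made-up data set: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 0.1], [-1.0, 0.2], [2.0, -0.1], [-2.0, -0.2]]
y = [1, -1, 1, -1]
g = loss_gradient([0.0, 0.0], 0.0, X, y)
choice = best_candidate([0.0, 0.0], 0.0, X, y, lam1=0.1)
```

Because the $\Omega_1$ term can only reduce the magnitude of the derivative for a weight sitting at zero, comparing the loss gradient magnitude against $\lambda_1$ is enough to decide whether any candidate is worth adding.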

It is instructive to interpret this derivative in a geometric way. First, we imagine an $m$-dimensional margin space, which has one dimension for each training point. If we think of each of the coordinate axes as representing the margin $\rho_i$ for the current model on the point $x_i$, then the total loss function can be thought of as a function over this space, and we can calculate the full gradient of that function:

$$\nabla_\rho L = \left( \frac{\partial L}{\partial \rho_1}, \frac{\partial L}{\partial \rho_2}, \ldots, \frac{\partial L}{\partial \rho_m} \right)^T$$

In the same margin space we can also imagine the feature margin vector $r_j$:

$$r_j = (y_1 x_{1,j}, y_2 x_{2,j}, \ldots, y_m x_{m,j})^T$$

Given this we can write:

$$\frac{\partial C}{\partial w_j} = \nabla_\rho L \cdot r_j \pm \lambda_1$$

Since the grafting heuristic for selecting the next weight to add to the model is only interested in the magnitude of this derivative, and since the regularizer component always acts to reduce this magnitude, we can see that picking the next weight amounts to choosing the feature margin vector that is most well-aligned with the direction of steepest descent of the loss function in margin space.

For the linear model, we initialize $F$ to contain just $w_0$ and perform an initial optimization. We then proceed in the usual grafting fashion, picking one new weight to add to $F$ at each grafting step until an $\Omega_0$ or $\Omega_1$ stopping condition is reached. In this case, each weight corresponds to one feature.

The simplicity of the linear model means that it sometimes cannot fit the training data very well. But this problem is somewhat reduced when we have many features, since in general the extra features will tend to make the problem more linearly separable.

MULTI LAYER PERCEPTRON MODEL

For a more powerful model, we might use an MLP with $h$ hidden units having sigmoid transfer functions and a linear output unit with unit output weights. We can write this MLP function as:

$$f(x) = \sum_{j=1}^{h} g\left( \sum_{i=1}^{n} w_i^{(j)} x_i + w_0^{(j)} \right) + w_0^{(0)}$$

In this model $w_i^{(j)}$ is the weight from the $i$th feature to the $j$th hidden unit.
The sigmoid transfer function $g(\cdot)$ is defined as:

$$g(x) = \frac{2}{1 + e^{-x}} - 1$$

which is the usual neural net sigmoid function, scaled vertically so that $g(0) = 0$, $g(+\infty) = +1$ and $g(-\infty) = -1$, which makes the net slightly better behaved under grafting.

The optimization criterion is almost identical to (10) with the different $f(\cdot)$ substituted. All the weights in the MLP model are included equally in the regularization term, except for the bias weights on each node, which are excluded, and the constant weights on the output node.

Rather than use a fixed number of hidden units, the grafting procedure we describe here allows hidden units to be added as the grafting process continues. In the MLP model, each weight added corresponds to a new connection in the network. We begin with $F$ containing just the output bias weight $w_0^{(0)}$ and a network with no hidden units, and perform an initial optimization with respect to just that bias weight. At each grafting step we decide how to grow the network in one of two possible ways:

Adding a new hidden unit: If there are $k - 1$ hidden units already, this involves adding a hidden node, along with a new connection to the output node, and a new connection from the hidden node to a feature input, with weight $w_i^{(k)}$. The question becomes: which feature should be connected to the new hidden unit? Since the new unit's incoming weights and bias are all zero, the derivative we need is:

$$\frac{\partial C}{\partial w_i^{(k)}} = -\frac{1}{m} \sum_{j=1}^{m} \frac{1}{1 + e^{\rho_j}} \, y_j x_{j,i} \, g'(0) \pm \lambda_1$$

Adding an input connection to an existing hidden unit: Each of the $h$ hidden units may be connected to any of the $n$ input features, and any of these connections are candidates for adding to the model (if they are not present already). If we are considering adding a connection from the $i$th input feature to the $k$th hidden unit, then the relevant derivative is:

$$\frac{\partial C}{\partial w_i^{(k)}} = -\frac{1}{m} \sum_{j=1}^{m} \frac{1}{1 + e^{\rho_j}} \, y_j x_{j,i} \, g'\left( \sum_{l=1}^{n} w_l^{(k)} x_{j,l} + w_0^{(k)} \right) \pm \lambda_1$$

with:

$$g'(x) = \frac{2 e^{-x}}{(1 + e^{-x})^2}$$

After $t$ grafting steps, there are $n$ possibilities to consider for adding a new hidden unit, and $(hn - t)$ possibilities to consider for adding a connection to an existing hidden unit. Following the usual grafting procedure, we calculate the derivatives for all these candidates and pick the one with the largest magnitude. The corresponding weight is added to $F$ and we reoptimize. The cycle is repeated until one of the stopping conditions is met. If a new hidden unit is added, then we also need to include the associated bias weight, initialized at zero, in $F$.

3.4 Variants

A number of variants to the basic grafting procedure are possible. One interesting alternative is not to attempt a full optimization of the full set $F$ after each grafting step. Instead only the most recently added weight and perhaps the bias weights are adjusted.
This makes each grafting step faster, at the expense of a loss in accuracy and the strong possibility of ending up at a solution that is not even a local optimum. In practice, if only a small fraction of the possible weights are non-zero when grafting finishes, then the time spent checking gradients to determine the next weight to add often dominates the run time, and so saving a little effort on the optimization makes little difference.

Grafting can also be readily extended to regression problems through the use of a suitable loss function, such as the squared error loss function. This has not yet been implemented.

4. Grafting Experiments

In this section, we compare the performance of grafting and a number of other approaches to feature selection on a set of synthetic and real world test problems. For simplicity, we concentrate entirely on binary classification problems in this paper.
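Before the experiments, it may help to see the basic linear grafting loop of Section 3 and the cheaper variant of Section 3.4 side by side. The Python sketch below is an illustration under stated assumptions rather than the Matlab implementation used in the paper: only $\Omega_1$ regularization is applied ($\lambda_0 = 0$), a fixed-step proximal gradient (soft-thresholding) update stands in for the line-minimization-based optimizer the text refers to, and all function names, step counts, and the learning rate are hypothetical.

```python
import math


def grad_loss(w, w0, X, y):
    """Gradient of the mean bnll loss with respect to (w, w0)."""
    m, n = len(X), len(w)
    gw, g0 = [0.0] * n, 0.0
    for xi, yi in zip(X, y):
        rho = yi * (sum(a * b for a, b in zip(w, xi)) + w0)
        c = -yi / (1.0 + math.exp(rho)) / m
        g0 += c
        for j in range(n):
            gw[j] += c * xi[j]
    return gw, g0


def soft(v, t):
    """Soft-thresholding: the proximal operator of t * |v|."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def fit(w, w0, X, y, active, lam1, steps, lr):
    """Adjust the bias and the weights listed in `active`; others stay fixed."""
    w = list(w)
    for _ in range(steps):
        gw, g0 = grad_loss(w, w0, X, y)
        w0 -= lr * g0
        for j in active:
            w[j] = soft(w[j] - lr * gw[j], lr * lam1)
    return w, w0


def graft_linear(X, y, lam1, reoptimize_all=True, steps=300, lr=0.5):
    n = len(X[0])
    w, w0, F = [0.0] * n, 0.0, []
    w, w0 = fit(w, w0, X, y, [], lam1, steps, lr)  # initial fit of the bias only
    while len(F) < n:
        gw, _ = grad_loss(w, w0, X, y)
        Z = [j for j in range(n) if j not in F]
        j = max(Z, key=lambda k: abs(gw[k]))       # gradient test over Z
        if abs(gw[j]) <= lam1:                     # Omega_1 stopping condition
            break
        F.append(j)
        # Basic procedure: reoptimize all of F. Variant: only the newest weight.
        active = F if reoptimize_all else [j]
        w, w0 = fit(w, w0, X, y, active, lam1, steps, lr)
    return w, w0, F


# Made-up data: feature 0 determines the label, feature 1 is uninformative.
X = [[1.0, 0.5], [2.0, -0.5], [1.5, 0.5], [0.5, -0.5],
     [-1.0, 0.5], [-2.0, -0.5], [-1.5, 0.5], [-0.5, -0.5]]
y = [1, 1, 1, 1, -1, -1, -1, -1]
w_full, b_full, F_full = graft_linear(X, y, lam1=0.1)
w_fast, b_fast, F_fast = graft_linear(X, y, lam1=0.1, reoptimize_all=False)
```

Setting `reoptimize_all=False` corresponds to the variant in which only the most recently added weight and the bias are adjusted after each grafting step.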

4.1 The Datasets

Five datasets were used in these experiments, labeled A through E. Each dataset consists of a training set and a test set. Datasets A, B and C are synthetic problems, and are all instances of the same basic task described below. Datasets D and E are real world problems, taken from the online UCI Machine Learning Repository (Blake and Merz, 1998).

The three synthetic problems are variations of a task we call the threshold max (TM) problem. In the most basic version of this problem, the feature space contains $n_r$ informative features, each of which is uniformly distributed between $-1$ and $+1$. The output label for a given point is defined as:

$$y = \begin{cases} +1 & \text{if } \max_i(x_i) > 2^{(1 - 1/n_r)} - 1, \quad i = 1 \ldots n_r \\ -1 & \text{otherwise} \end{cases}$$

The $y = -1$ points occupy a large hypercube wedged into one corner of the larger hypercube containing all the points. The $y = +1$ points fill the remaining space. The constant in the above expression is chosen so that half the feature space belongs to each class. Variations of this basic problem are derived by adding $n_i$ irrelevant features uniformly distributed between $-1$ and $+1$, and $n_c$ redundant features which are just copies of the informative features.

The TM problem is designed so that each of the informative features only provides a little information, but by using all of them together, the problem is completely separable. In addition, the optimal discriminating surface for the problem is very non-linear, but the problem is asymmetric, so a linear discriminant should be able to do at least better than random.

Ten instantiations of the training and testing sets of each of the three synthetic problems were generated to obtain some statistics on relative algorithm performance.

In more detail, the datasets are:

Dataset A The TM problem, with $n_r = 10$, $n_c = 0$ and $n_i = 90$. Both the training set and the test set contain 1000 points each. This dataset explores the effect of irrelevant features in the TM problem.
Dataset B The TM problem, with $n_r = 10$, $n_c = 90$ and $n_i = 0$. Both the training set and the test set contain 1000 points each. This dataset explores the effect of redundant features in the TM problem.

Dataset C The TM problem, with $n_r = 10$, $n_c = 490$ and $n_i = 500$. The training set contains only 100 training points, despite the problem dimensionality of 1000. The test set contains 1000 points. This dataset explores an extreme situation in which there are many more features than training points.

Dataset D The Multiple Features database from the UCI repository. This is actually a handwritten digit recognition task, where digitized images of digits have been represented using 649 features of various types. The task tackled here is to distinguish the digit 0 from all other digits. The training and test sets both consist of 1000 points. The features were all scaled to have zero mean and unit variance before being used here.

Dataset E The Arrhythmia database from the UCI repository. The task here is to distinguish normal from abnormal heartbeat behavior from ECG data described by 279 numeric and binary attributes. The data was slightly modified from the original to make it easier to use. Feature number 14 ("J") was missing in most of the records, so it was removed from all the records. Of the 452 instances in the database, 32 had other missing attribute values and so those instances were also removed, leaving 420 instances described by 278 attributes. These were divided into a training set of 280 points, and a test set of 140 points.

All the datasets used in these experiments can be found online at: sies/data/jlr03/

4.2 The Algorithms

Eight different algorithms, which we denote by the letters (a) through (h), were compared on the five datasets described above. Except where described below, all implementations relied on Matlab (including the Matlab Optimization Toolbox).

(a) The linear grafting algorithm described in Section 3.3, using both $\Omega_0$ and $\Omega_1$ regularization.

(b) The MLP grafting algorithm described in Section 3.3, using both $\Omega_0$ and $\Omega_1$ regularization.

(c) Simple gradient descent to fit an $\Omega_1$ regularized linear combination of all the input features. After gradient descent is complete, any weights with a magnitude less than $10^{-4}$ of the maximum magnitude in the weight vector are pruned in order to obtain the subset selection driven by the $\Omega_1$ regularization.

(d) Simple gradient descent to fit an $\Omega_1$ regularized, fully connected MLP of a similar form to that learned by the MLP grafting algorithm, with the exception that the number of hidden nodes is fixed at 10 nodes (and of course the connectivity is much higher). After gradient descent is complete, any weights with a magnitude less than $10^{-4}$ of the maximum magnitude in the weight vector are pruned in order to obtain the subset selection driven by the $\Omega_1$ regularization.

(e) Linear SVM, which effectively uses $\Omega_2$ regularization and a slightly different loss function from that used by the grafting implementations. The SVM implementation we used is libsvm (Chang and Lin, 2001).

(f) Gaussian RBF kernel SVM, using the default libsvm kernel parameters.

(g) Gaussian RBF kernel SVM as above, but in conjunction with wrapper feature subset selection.
At each feature selection step, all possible features are considered for addition to the current feature set, and 3-fold cross-validation is used to select the best one. This process is repeated until we have selected 10 features, and a final RBF SVM is trained using just those features. This is essentially greedy forward subset selection.

(h) Gaussian RBF kernel SVM, but in conjunction with a filter feature subset selection. For each dataset, we simply take the 10 features that are most highly correlated with the label and train our SVM using those features.

Note that an efficient implementation of grafting must directly exploit the sparsity of the models being trained. Our Matlab implementations take care to do this.
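The threshold max problem of Section 4.1 and the correlation filter used by algorithm (h) are both simple enough to sketch in a few lines. The Python code below is illustrative only. The column ordering (informative, then irrelevant, then redundant), the way redundant copies cycle through the informative features, the use of absolute Pearson correlation as the ranking score, and all function names are assumptions, not details given in the text.

```python
import math
import random


def make_tm(n_points, n_r, n_i=0, n_c=0, seed=0):
    """Generate one instance of the threshold max (TM) problem."""
    rng = random.Random(seed)
    threshold = 2.0 ** (1.0 - 1.0 / n_r) - 1.0  # splits the space into equal halves
    X, y = [], []
    for _ in range(n_points):
        informative = [rng.uniform(-1.0, 1.0) for _ in range(n_r)]
        irrelevant = [rng.uniform(-1.0, 1.0) for _ in range(n_i)]
        redundant = [informative[k % n_r] for k in range(n_c)]  # assumed copy scheme
        X.append(informative + irrelevant + redundant)
        y.append(1 if max(informative) > threshold else -1)
    return X, y


def top_correlated(X, y, k=10):
    """Indices of the k features with the largest |correlation| with the label."""
    m = len(X)
    y_mean = sum(y) / m
    y_dev = [v - y_mean for v in y]
    y_norm = math.sqrt(sum(d * d for d in y_dev))
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mu = sum(col) / m
        dev = [v - mu for v in col]
        norm = math.sqrt(sum(d * d for d in dev))
        if norm > 0.0 and y_norm > 0.0:
            corr = sum(a * b for a, b in zip(dev, y_dev)) / (norm * y_norm)
        else:
            corr = 0.0  # constant column or constant label
        scores.append((abs(corr), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


# A small TM instance: 3 informative features followed by 7 irrelevant ones.
X, y = make_tm(2000, n_r=3, n_i=7, seed=1)
selected = top_correlated(X, y, k=3)
```

A filter of this kind scores each feature in isolation, so on a dataset with redundant copies it has no way of preferring one copy over another.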

Dataset   Linear Graft/GD ($\lambda_1/\lambda_0$)   MLP Graft/GD ($\lambda_1/\lambda_0$)   Linear SVM ($C$)   RBF SVM ($C$)
A         /                                          /
B         $10^{-4}$/                                 /
C         0.2/                                       /
D         0.005/                                     /
E         0.05/                                      /

Table 2: Regularization parameters used in experiments for each dataset, chosen using five-fold cross-validation. GD stands for Gradient Descent.

4.3 Experimental Details

Each algorithm was applied to each of the datasets. The exception to this was the fully connected MLP (algorithm (d)), which failed to converge for several of the larger datasets. We suspect that this was caused by the large number of weights close to zero in the model, and the discontinuous derivative of the regularizer at this point. For the three synthetic problems the training runs were repeated for ten different random instantiations of the training and test sets, to assess sensitivity to small changes in the dataset.

The regularization parameters used in these experiments were chosen using five-fold cross-validation on each of the training sets. Algorithms (a) and (c) share the same parameters, as do algorithms (b) and (d), and algorithms (f), (g) and (h). Note that the grafting algorithms, (a) and (b), use both $\Omega_1$ and $\Omega_0$ regularization, requiring parameters $\lambda_1$ and $\lambda_0$, while the corresponding simple gradient descent algorithms, (c) and (d), use only $\Omega_1$ regularization. This is due to the difficulty of incorporating $\Omega_0$ regularization into a standard gradient descent algorithm. The parameters are listed in Table 2.

For the Matlab implementations (algorithms (a), (b), (c) and (d)) the training time on each dataset was recorded to see if grafting gave a speedup over simple gradient descent. A direct speed comparison between the SVM algorithms (written in C) and the Matlab implementations was not attempted at this time, due to the inherent efficiency differences between Matlab and C code.
We also recorded the number of features selected (not the same as the number of weights in general), for those algorithms that select a subset of the features (all algorithms except (e) and (f)).

In addition, for the three synthetic problems, we can directly calculate a measure of how useful the selected feature subsets are. Recall that in each of these problems there are $n_r$ informative features, not including redundant duplicates. We can define a feature set saliency measure:

$$s = \frac{n_g}{n_r} + \frac{n_g}{n_f} - 1$$

where $n_g$ is the number of good features selected (i.e. informative features, but not including any duplicate redundant features), and $n_f$ is the total number of features selected. The saliency evaluates to 1 if all $n_r$ good features are selected and no others are selected. It evaluates to $-1$ if no good features are selected, and it evaluates to approximately zero if all the good features are selected, but only along with many irrelevant or redundant features.

Note that the number of selected features for the linear and MLP models trained by simple gradient descent (algorithms (c) and (d)) relies on a somewhat subjective pruning of low valued weights, as described above. Therefore the measurements of numbers of selected features, and


More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

UNIVERSITY OF TRENTO ON THE USE OF SVM FOR ELECTROMAGNETIC SUBSURFACE SENSING. A. Boni, M. Conci, A. Massa, and S. Piffer.

UNIVERSITY OF TRENTO ON THE USE OF SVM FOR ELECTROMAGNETIC SUBSURFACE SENSING. A. Boni, M. Conci, A. Massa, and S. Piffer. UIVRSITY OF TRTO DIPARTITO DI IGGRIA SCIZA DLL IFORAZIO 3823 Povo Trento (Italy) Via Soarive 4 http://www.disi.unitn.it O TH US OF SV FOR LCTROAGTIC SUBSURFAC SSIG A. Boni. Conci A. assa and S. Piffer

More information

Lecture 12: Ensemble Methods. Introduction. Weighted Majority. Mixture of Experts/Committee. Σ k α k =1. Isabelle Guyon

Lecture 12: Ensemble Methods. Introduction. Weighted Majority. Mixture of Experts/Committee. Σ k α k =1. Isabelle Guyon Lecture 2: Enseble Methods Isabelle Guyon guyoni@inf.ethz.ch Introduction Book Chapter 7 Weighted Majority Mixture of Experts/Coittee Assue K experts f, f 2, f K (base learners) x f (x) Each expert akes

More information

Physically Based Modeling CS Notes Spring 1997 Particle Collision and Contact

Physically Based Modeling CS Notes Spring 1997 Particle Collision and Contact Physically Based Modeling CS 15-863 Notes Spring 1997 Particle Collision and Contact 1 Collisions with Springs Suppose we wanted to ipleent a particle siulator with a floor : a solid horizontal plane which

More information

Now multiply the left-hand-side by ω and the right-hand side by dδ/dt (recall ω= dδ/dt) to get:

Now multiply the left-hand-side by ω and the right-hand side by dδ/dt (recall ω= dδ/dt) to get: Equal Area Criterion.0 Developent of equal area criterion As in previous notes, all powers are in per-unit. I want to show you the equal area criterion a little differently than the book does it. Let s

More information

Understanding Machine Learning Solution Manual

Understanding Machine Learning Solution Manual Understanding Machine Learning Solution Manual Written by Alon Gonen Edited by Dana Rubinstein Noveber 17, 2014 2 Gentle Start 1. Given S = ((x i, y i )), define the ultivariate polynoial p S (x) = i []:y

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018: Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee7c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee7c@berkeley.edu October 15,

More information

The Transactional Nature of Quantum Information

The Transactional Nature of Quantum Information The Transactional Nature of Quantu Inforation Subhash Kak Departent of Coputer Science Oklahoa State University Stillwater, OK 7478 ABSTRACT Inforation, in its counications sense, is a transactional property.

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

Estimating Parameters for a Gaussian pdf

Estimating Parameters for a Gaussian pdf Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition Upper bound on false alar rate for landine detection and classification using syntactic pattern recognition Ahed O. Nasif, Brian L. Mark, Kenneth J. Hintz, and Nathalia Peixoto Dept. of Electrical and

More information

arxiv: v1 [cs.lg] 8 Jan 2019

arxiv: v1 [cs.lg] 8 Jan 2019 Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

When Short Runs Beat Long Runs

When Short Runs Beat Long Runs When Short Runs Beat Long Runs Sean Luke George Mason University http://www.cs.gu.edu/ sean/ Abstract What will yield the best results: doing one run n generations long or doing runs n/ generations long

More information

The Methods of Solution for Constrained Nonlinear Programming

The Methods of Solution for Constrained Nonlinear Programming Research Inventy: International Journal Of Engineering And Science Vol.4, Issue 3(March 2014), PP 01-06 Issn (e): 2278-4721, Issn (p):2319-6483, www.researchinventy.co The Methods of Solution for Constrained

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Figure 1: Equivalent electric (RC) circuit of a neurons membrane

Figure 1: Equivalent electric (RC) circuit of a neurons membrane Exercise: Leaky integrate and fire odel of neural spike generation This exercise investigates a siplified odel of how neurons spike in response to current inputs, one of the ost fundaental properties of

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Analyzing Simulation Results

Analyzing Simulation Results Analyzing Siulation Results Dr. John Mellor-Cruey Departent of Coputer Science Rice University johnc@cs.rice.edu COMP 528 Lecture 20 31 March 2005 Topics for Today Model verification Model validation Transient

More information

ma x = -bv x + F rod.

ma x = -bv x + F rod. Notes on Dynaical Systes Dynaics is the study of change. The priary ingredients of a dynaical syste are its state and its rule of change (also soeties called the dynaic). Dynaical systes can be continuous

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x)

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x) 7Applying Nelder Mead s Optiization Algorith APPLYING NELDER MEAD S OPTIMIZATION ALGORITHM FOR MULTIPLE GLOBAL MINIMA Abstract Ştefan ŞTEFĂNESCU * The iterative deterinistic optiization ethod could not

More information

A method to determine relative stroke detection efficiencies from multiplicity distributions

A method to determine relative stroke detection efficiencies from multiplicity distributions A ethod to deterine relative stroke detection eiciencies ro ultiplicity distributions Schulz W. and Cuins K. 2. Austrian Lightning Detection and Inoration Syste (ALDIS), Kahlenberger Str.2A, 90 Vienna,

More information

Kinematics and dynamics, a computational approach

Kinematics and dynamics, a computational approach Kineatics and dynaics, a coputational approach We begin the discussion of nuerical approaches to echanics with the definition for the velocity r r ( t t) r ( t) v( t) li li or r( t t) r( t) v( t) t for

More information

26 Impulse and Momentum

26 Impulse and Momentum 6 Ipulse and Moentu First, a Few More Words on Work and Energy, for Coparison Purposes Iagine a gigantic air hockey table with a whole bunch of pucks of various asses, none of which experiences any friction

More information

The Wilson Model of Cortical Neurons Richard B. Wells

The Wilson Model of Cortical Neurons Richard B. Wells The Wilson Model of Cortical Neurons Richard B. Wells I. Refineents on the odgkin-uxley Model The years since odgkin s and uxley s pioneering work have produced a nuber of derivative odgkin-uxley-like

More information

VI. Backpropagation Neural Networks (BPNN)

VI. Backpropagation Neural Networks (BPNN) VI. Backpropagation Neural Networks (BPNN) Review of Adaline Newton s ethod Backpropagation algorith definition derivative coputation weight/bias coputation function approxiation exaple network generalization

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

Introduction to Machine Learning. Recitation 11

Introduction to Machine Learning. Recitation 11 Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,

More information

MSEC MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL SOLUTION FOR MAINTENANCE AND PERFORMANCE

MSEC MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL SOLUTION FOR MAINTENANCE AND PERFORMANCE Proceeding of the ASME 9 International Manufacturing Science and Engineering Conference MSEC9 October 4-7, 9, West Lafayette, Indiana, USA MSEC9-8466 MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL

More information

An improved self-adaptive harmony search algorithm for joint replenishment problems

An improved self-adaptive harmony search algorithm for joint replenishment problems An iproved self-adaptive harony search algorith for joint replenishent probles Lin Wang School of Manageent, Huazhong University of Science & Technology zhoulearner@gail.co Xiaojian Zhou School of Manageent,

More information

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION A eshsize boosting algorith in kernel density estiation A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION C.C. Ishiekwene, S.M. Ogbonwan and J.E. Osewenkhae Departent of Matheatics, University

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

lecture 36: Linear Multistep Mehods: Zero Stability

lecture 36: Linear Multistep Mehods: Zero Stability 95 lecture 36: Linear Multistep Mehods: Zero Stability 5.6 Linear ultistep ethods: zero stability Does consistency iply convergence for linear ultistep ethods? This is always the case for one-step ethods,

More information

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words) 1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu

More information

Least Squares Fitting of Data

Least Squares Fitting of Data Least Squares Fitting of Data David Eberly, Geoetric Tools, Redond WA 98052 https://www.geoetrictools.co/ This work is licensed under the Creative Coons Attribution 4.0 International License. To view a

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents

More information

Birthday Paradox Calculations and Approximation

Birthday Paradox Calculations and Approximation Birthday Paradox Calculations and Approxiation Joshua E. Hill InfoGard Laboratories -March- v. Birthday Proble In the birthday proble, we have a group of n randoly selected people. If we assue that birthdays

More information

Homework 3 Solutions CSE 101 Summer 2017

Homework 3 Solutions CSE 101 Summer 2017 Hoework 3 Solutions CSE 0 Suer 207. Scheduling algoriths The following n = 2 jobs with given processing ties have to be scheduled on = 3 parallel and identical processors with the objective of iniizing

More information

Topic 5a Introduction to Curve Fitting & Linear Regression

Topic 5a Introduction to Curve Fitting & Linear Regression /7/08 Course Instructor Dr. Rayond C. Rup Oice: A 337 Phone: (95) 747 6958 E ail: rcrup@utep.edu opic 5a Introduction to Curve Fitting & Linear Regression EE 4386/530 Coputational ethods in EE Outline

More information

Interactive Markov Models of Evolutionary Algorithms

Interactive Markov Models of Evolutionary Algorithms Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary

More information

PAC-Bayesian Learning of Linear Classifiers

PAC-Bayesian Learning of Linear Classifiers Pascal Gerain Pascal.Gerain.@ulaval.ca Alexandre Lacasse Alexandre.Lacasse@ift.ulaval.ca François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Départeent d inforatique

More information

Machine Learning: Fisher s Linear Discriminant. Lecture 05

Machine Learning: Fisher s Linear Discriminant. Lecture 05 Machine Learning: Fisher s Linear Discriinant Lecture 05 Razvan C. Bunescu chool of Electrical Engineering and Coputer cience bunescu@ohio.edu Lecture 05 upervised Learning ask learn an (unkon) function

More information

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse

More information

INTELLECTUAL DATA ANALYSIS IN AIRCRAFT DESIGN

INTELLECTUAL DATA ANALYSIS IN AIRCRAFT DESIGN INTELLECTUAL DATA ANALYSIS IN AIRCRAFT DESIGN V.A. Koarov 1, S.A. Piyavskiy 2 1 Saara National Research University, Saara, Russia 2 Saara State Architectural University, Saara, Russia Abstract. This article

More information

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes Radeacher Coplexity Margin Bounds for Learning with a Large Nuber of Classes Vitaly Kuznetsov Courant Institute of Matheatical Sciences, 25 Mercer street, New York, NY, 002 Mehryar Mohri Courant Institute

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information