Online Regression Competitive with Changing Predictors


Steven Busuttil and Yuri Kalnishkan

Computer Learning Research Centre and Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, United Kingdom

Abstract. This paper deals with the problem of making predictions in the online mode of learning where the dependence of the outcome $y_t$ on the signal $x_t$ can change with time. The Aggregating Algorithm (AA) is a technique that optimally merges experts from a pool, so that the resulting strategy suffers a cumulative loss that is almost as good as that of the best expert in the pool. We apply the AA to the case where the experts are all the linear predictors that can change with time. KAARCh is the kernel version of the resulting algorithm. In the kernel case, the experts are all the decision rules in some reproducing kernel Hilbert space that can change over time. We show that KAARCh suffers a cumulative square loss that is almost as good as that of any expert that does not change very rapidly.

1 Introduction

We consider the online protocol where on each trial $t = 1, 2, \ldots$ the learner observes a signal $x_t$ and attempts to predict the outcome $y_t$, which is shown to the learner later. The performance of the learner is measured by means of the cumulative square loss.

The Aggregating Algorithm (AA), introduced by Vovk in [1] and [2], allows us to merge experts from large pools to obtain optimal strategies. Such an optimal strategy performs nearly as well as the best expert from the class in terms of the cumulative loss.

In [3] the AA is applied to merge all constant linear regressors, i.e., experts $\theta$ predicting $\theta'x_t$ (it is assumed that $x_t$ and $\theta$ are drawn from $\mathbb{R}^n$). The resulting Aggregating Algorithm for Regression (AAR), also known as the Vovk-Azoury-Warmuth forecaster (see [4, Sect. 11.8]), performs almost as well as the best regressor $\theta$. In [5] the kernel version of AAR, known as the Kernel AAR (KAAR), is introduced and a bound on its performance is derived (see also [6, Sect. 8]). From a computational point of view the algorithm is similar to Ridge Regression. We summarise the results concerning AAR and KAAR in Sect. 2.3.

In this paper, the AA is applied to merge a wider class of predictors. We let $\theta$ vary between trials. Consider a sequence $\theta_1, \theta_2, \ldots$; let it make the prediction $(\theta_1 + \theta_2 + \cdots + \theta_t)'x_t$ on trial $t$. We merge all predictors of this type and obtain an algorithm which is again computationally similar to Ridge Regression.

We call the new algorithm the Aggregating Algorithm for Regression with Changing dependencies (AARCh) and its kernelised version KAARCh. Clearly, our class of experts is very large and we cannot compete in a reasonable sense with every expert from this class. However, in Sects. 4 and 5 we show that KAARCh can perform almost as well as any regressor if the latter is not changing very rapidly, i.e., if each $\theta_t$ is small or only a few are nonzero.

A similar problem is considered in [7], [8], and [9] for classification and regression. In these publications, this problem is referred to as the non-stationary or shifting target problem and the corresponding bounds are called shifting bounds. The work by Herbster and Warmuth in [7] is closest to ours. However, their methods are based on Gradient Descent and therefore their bounds are of a different type. For instance, since our approach is based on the Aggregating Algorithm we get a coefficient for the term representing the cumulative loss of the experts equal to 1 (see Theorems 3 and 4), whereas those in the bounds of [7, Theorems 4–6] are greater than 1.

In practice, KAARCh can be used to predict parameters that change slowly with time. KAARCh is more computationally expensive than the techniques described in [7], with time and space complexities that grow with time. This is not desirable in an algorithm designed for online learning; however, a practical implementation is described in [10]. Essentially, KAARCh is made to forget older examples that do not affect the prediction too much. In [10] empirical experiments are carried out on an artificial dataset and on the real world problem of predicting the implied volatility of options (the name KAARCh was inspired by the popular GARCH model for predicting volatility in finance).

2 Background

In this section we introduce some preliminaries and related material required for our main results. As usual, all vectors are identified with one-column matrices and $B'$ stands for the transpose of the matrix $B$. We will not be specifying the size of simple matrices like the identity matrix $I$ when this is clear from the context.

2.1 Protocol and Loss

We can define online regression by the following protocol. At every moment in time $t = 1, 2, \ldots$, the value of a signal $x_t \in X$ arrives. Statistician (or Learner) S observes $x_t$ and then outputs a prediction $\gamma_t \in \mathbb{R}$. Finally, the outcome $y_t \in \mathbb{R}$ arrives. This can be summarised by the following scheme:

for $t = 1, 2, \ldots$ do
    S observes $x_t \in X$
    S outputs $\gamma_t \in \mathbb{R}$
    S observes $y_t \in \mathbb{R}$
end for

The set $X$ is a signal space which is assumed to be known to Statistician in advance. We will be referring to a signal-outcome pair as an example. The performance of S is measured by the sum of squared discrepancies between the predictions and the outcomes, known as square loss. Therefore on trial $t$ Statistician S suffers loss $(y_t - \gamma_t)^2$. Thus after $T$ trials, the total loss of S is
$$L_T(S) = \sum_{t=1}^T (y_t - \gamma_t)^2 .$$
Clearly, a smaller value of $L_T(S)$ means a better predictive performance.

2.2 Linear and Kernel Predictors

If $X \subseteq \mathbb{R}^n$ we can consider simple linear regressors of the form $\theta \in \mathbb{R}^n$. Given a signal $x \in X$, such a regressor makes a prediction $\theta'x$. Linear methods are easy to manipulate mathematically but their use in the real world is limited since they can only model simple dependencies. One solution to this could be to map the data to some high dimensional feature space and then find a simple solution there. This, however, can lead to what is known as the curse of dimensionality, where both the computational and generalisation performance degrade as the number of features grows [11].

The kernel trick (first used in this context in [12]) is now a widely used technique which can make a linear algorithm operate in feature space without the inherent complexities. Informally, a kernel is a dot product in feature space. Typically, to transform a linear method into a nonlinear one, the linear algorithm is first formulated in such a way that all signals appear only in dot products (this is known as the dual form). Then these dot products are replaced by kernels.

For a function $k : X \times X \to \mathbb{R}$ to be a kernel it has to be symmetric, and for all $l$ and all $x_1, \ldots, x_l \in X$ the kernel matrix $K = (k(x_i, x_j))_{i,j}$, $i, j = 1, \ldots, l$, must be positive semi-definite (have nonnegative eigenvalues). For every kernel there exists a unique reproducing kernel Hilbert space (RKHS) $\mathcal{F}$ such that $k$ is the reproducing kernel of $\mathcal{F}$. In fact, there is a mapping $\phi : X \to \mathcal{F}$ such that kernels can be defined as $k(x, z) = \langle \phi(x), \phi(z) \rangle$. A RKHS on a set $X$ is a separable and complete Hilbert space of real valued functions on $X$ comprised of linear combinations of $k$ of the form
$$f(x) = \sum_{i=1}^{l} c_i k(v_i, x),$$
where $l$ is a positive integer, $c_i \in \mathbb{R}$ and $v_i, x \in X$, and their limits. We will be referring to any function in the RKHS $\mathcal{F}$ as $D$. Intuitively $D(x)$ is a decision rule in $\mathcal{F}$ that produces a prediction for the object $x$. We will be measuring the complexity of $D$ by its norm $\|D\|$ in $\mathcal{F}$. For more information on kernels and RKHS see, for example, [13] and [14].
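As a small illustration of this definition (a sketch of our own in Python/NumPy, not code from the paper; the Gaussian kernel and the random sample are arbitrary choices), the following snippet builds a kernel matrix for a handful of signals and checks the two defining properties, symmetry and positive semi-definiteness.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

# Build the kernel matrix K = (k(x_i, x_j)) for a small random sample and check
# the two defining properties: symmetry and nonnegative eigenvalues.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                      # five signals in R^3
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

assert np.allclose(K, K.T)                       # symmetric
assert np.linalg.eigvalsh(K).min() >= -1e-10     # positive semi-definite (up to rounding)
```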

2.3 The Aggregating Algorithm (AA)

We now give an overview of the Aggregating Algorithm (AA), mostly following [3, Sects. 1 and 2]. Let $\Omega$ be an outcome space, $\Gamma$ be a prediction space and $\Theta$ be a (possibly infinite) pool of experts. We consider the following game between Statistician (or Learner) S, Nature, and $\Theta$:

for $t = 1, 2, \ldots$ do
    Every expert $\theta \in \Theta$ makes a prediction $\gamma_t^{\theta} \in \Gamma$
    Statistician S observes all $\gamma_t^{\theta}$
    Statistician S outputs a prediction $\gamma_t \in \Gamma$
    Nature outputs $\omega_t \in \Omega$
end for

Given a fixed loss function $\lambda : \Omega \times \Gamma \to [0, \infty]$, Statistician aims to suffer a cumulative loss $L_T(S) = \sum_{t=1}^T \lambda(\omega_t, \gamma_t)$ that is not much larger than the loss $L_T(\theta) = \sum_{t=1}^T \lambda(\omega_t, \gamma_t^{\theta})$ of the best expert $\theta \in \Theta$.

The AA takes two parameters, a prior probability distribution $P_0$ in the pool of experts $\Theta$ and a learning rate $\eta > 0$. Let $\beta = e^{-\eta}$. We will first describe the Aggregating Pseudo Algorithm (APA), which does not output actual predictions but generalised predictions. A generalised prediction $g : \Omega \to \mathbb{R}$ is a mapping giving a value of loss for each possible outcome.

At every step $t$, the APA updates the experts' weights so that those that suffered large loss during the previous step have their weights reduced:
$$P_t(d\theta) = \beta^{\lambda(\omega_t, \gamma_t^{\theta})} P_{t-1}(d\theta), \quad \theta \in \Theta .$$
At time $t$, the APA chooses a generalised prediction by
$$g_t(\omega) = \log_\beta \int_{\Theta} \beta^{\lambda(\omega, \gamma_t^{\theta})} P^*_{t-1}(d\theta),$$
where $P^*_{t-1}(d\theta)$ are the normalised weights $P^*_{t-1}(d\theta) = P_{t-1}(d\theta)/P_{t-1}(\Theta)$. This guarantees that for any learning rate $\eta > 0$, prior $P_0$, and $T = 1, 2, \ldots$ (see [3, Lemma 1])
$$L_T(\mathrm{APA}) = \log_\beta \int_{\Theta} \beta^{L_T(\theta)} P_0(d\theta). \qquad (1)$$

To get a prediction from the generalised prediction $g_t(\omega)$ (note that we use $\omega$ since we do not yet know the real outcome of step $t$, $\omega_t$) the AA uses a substitution function $\Sigma$ mapping generalised predictions into $\Gamma$. A substitution function may introduce extra loss; however, in many cases perfect substitution is possible.
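As a toy illustration of the APA machinery just described (our own sketch, not code from the paper; the expert predictions, the learning rate and the use of the square loss are placeholder choices), the following snippet implements the weight update and the generalised prediction for a finite pool of experts.

```python
import numpy as np

def apa_update(weights, losses, eta):
    """One APA step: P_t(theta) = beta^{loss_t(theta)} * P_{t-1}(theta), with beta = e^{-eta}."""
    beta = np.exp(-eta)
    return weights * beta ** losses

def generalised_prediction(weights, expert_preds, omega, eta):
    """g_t(omega): the loss the mixture assigns to a hypothetical outcome omega (square loss here)."""
    beta = np.exp(-eta)
    p = weights / weights.sum()                       # normalised weights P*_{t-1}
    mix = np.sum(p * beta ** ((omega - expert_preds) ** 2))
    return np.log(mix) / np.log(beta)                 # log_beta of the mixture

# A tiny run with m = 3 experts and equal prior weights.
w = np.ones(3)
preds = np.array([0.2, 0.5, 0.9])                     # the experts' predictions on this trial
eta = 0.5                                             # an arbitrary learning rate for the example
print(generalised_prediction(w, preds, 0.4, eta))     # loss the APA would assign to outcome 0.4
w = apa_update(w, (0.4 - preds) ** 2, eta)            # observe outcome 0.4, shrink the weights
```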

We say that the loss function $\lambda$ is $\eta$-mixable if there is a substitution function $\Sigma$ such that
$$\lambda(\omega_t, \Sigma(g_t)) \le g_t(\omega_t) \qquad (2)$$
on every step $t$, all experts' predictions and all outcomes. The loss function $\lambda$ is mixable if it is $\eta$-mixable for some $\eta > 0$.

Suppose that our loss function is $\eta$-mixable. Using (2) and (1) we can obtain the following upper bound on the cumulative loss of the AA:
$$L_T(\mathrm{AA}) \le \log_\beta \int_{\Theta} \beta^{L_T(\theta)} P_0(d\theta).$$
In particular, when the pool of experts is finite and all experts are assigned equal prior weights, we get, for any $\theta \in \Theta$,
$$L_T(\mathrm{AA}) \le L_T(\theta) + \frac{\ln m}{\eta},$$
where $m$ is the size of the pool of experts. This bound can be shown to be optimal in a very strong sense for all algorithms attempting to merge experts' predictions (see [2]).

The Square Loss Game. In this paper we are concerned with the bounded square loss game (see [3, Sect. 2.4]), where $\Omega = [-Y, Y]$, $Y \in \mathbb{R}$, $\Gamma = \mathbb{R}$, and $\lambda(\omega, \gamma) = (\omega - \gamma)^2$. The square loss game is $\eta$-mixable if and only if $\eta \le 1/(2Y^2)$. A perfect substitution function for this game is
$$\gamma = \frac{g(-Y) - g(Y)}{4Y}. \qquad (3)$$

The Aggregating Algorithm for Regression (AAR). The AA was applied to the problem of linear regression resulting in the Aggregating Algorithm for Regression (AAR). AAR merges all the linear predictors that map signals to outcomes [3, Sect. 3]; a Gaussian prior is assumed on the pool of experts. AAR makes a prediction at time $T$ by
$$\gamma_{\mathrm{AAR}} = \tilde{y}'X(X'X + aI)^{-1}x_T,$$
where $X = (x_1, x_2, \ldots, x_T)'$ and $\tilde{y} = (y_1, y_2, \ldots, y_{T-1}, 0)'$.

The main property of AAR is that it is optimal in the sense that the total loss it suffers is only a little worse than that of any linear predictor. By the latter we mean a strategy that predicts $\theta'x_t$ on every trial $t$, where $\theta \in \mathbb{R}^n$ is some fixed vector. The set of all linear predictors may be identified with $\mathbb{R}^n$.

Theorem 1 ([3, Theorem 1]). For any $a > 0$ and any point in time $T$,
$$L_T(\mathrm{AAR}) \le \inf_{\theta}\left(L_T(\theta) + a\|\theta\|^2\right) + Y^2 \ln\det\left(\frac{1}{a}\sum_{t=1}^T x_tx_t' + I\right).$$
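To make the AAR prediction formula above concrete, here is a minimal NumPy sketch of a single prediction (our own illustration with synthetic data; the function and variable names are not from the paper). It follows $\gamma_{\mathrm{AAR}} = \tilde{y}'X(X'X + aI)^{-1}x_T$ with the current signal included as the last row of $X$ and a zero appended to the outcomes.

```python
import numpy as np

def aar_predict(X_past, y_past, x_new, a=1.0):
    """One AAR prediction: X_past holds the past signals (rows), y_past their outcomes,
    x_new the current signal x_T.  Returns y~' X (X'X + aI)^{-1} x_T."""
    X = np.vstack([X_past, x_new])                 # X = (x_1, ..., x_T)', includes the current signal
    y_tilde = np.append(y_past, 0.0)               # y~ = (y_1, ..., y_{T-1}, 0)'
    n = X.shape[1]
    w = np.linalg.solve(X.T @ X + a * np.eye(n), x_new)
    return y_tilde @ X @ w

# Toy usage on synthetic data (for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
theta = np.array([0.5, -1.0, 2.0])
y = X @ theta + 0.1 * rng.normal(size=20)
print(aar_predict(X[:-1], y[:-1], X[-1], a=1.0), y[-1])
```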

The Kernel Aggregating Algorithm for Regression (KAAR). KAAR, the kernel version of AAR introduced in [5], makes a prediction for the signal $x_T$ by
$$\gamma_{\mathrm{KAAR}} = \tilde{y}'(K + aI)^{-1}\boldsymbol{k},$$
where
$$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_T)\\ \vdots & \ddots & \vdots\\ k(x_T, x_1) & \cdots & k(x_T, x_T) \end{pmatrix}, \quad\text{and}\quad \boldsymbol{k} = \begin{pmatrix} k(x_1, x_T)\\ \vdots\\ k(x_T, x_T) \end{pmatrix}.$$

Like AAR, KAAR has an optimality property: KAAR performs little worse than any decision rule $D$ in the RKHS induced by a kernel function $k$.

Theorem 2 ([5, Theorem 1] and [6, Sect. 8]). Let $k$ be a kernel on a space $X$ and $D$ be any decision rule in the RKHS induced by $k$. Then for every $a > 0$ and any point in time $T$,
$$L_T(\mathrm{KAAR}) \le L_T(D) + a\|D\|^2 + Y^2\ln\det\left(\frac{1}{a}K + I\right).$$

Corollary 1 ([6, Sect. 8]). Under the same conditions of Theorem 2, let $c = \sup_{x\in X}\sqrt{k(x,x)}$. Then for every $a > 0$, every $d > 0$, every decision rule $D$ such that $\|D\| \le d$ and any point in time $T$, we get
$$L_T(\mathrm{KAAR}) \le L_T(D) + ad^2 + \frac{Y^2c^2T}{a}.$$
If, moreover, $T$ is known in advance, one can choose $a = Yc\sqrt{T}/d$ and get
$$L_T(\mathrm{KAAR}) \le L_T(D) + 2Ycd\sqrt{T}.$$

3 Algorithm

For our new method, we apply the Aggregating Algorithm (AA) to the regression problem where the experts can change with time. We call this method the Aggregating Algorithm for Regression with Changing dependencies (AARCh). Subsequently, we will kernelise this method to get Kernel AARCh (KAARCh). Throughout this section we will be using the lemmas given in the appendix.

3.1 AARCh: Primal Form

The main idea behind AARCh is to apply the Aggregating Algorithm to the case where the pool of experts is made up of all linear predictors that can change independently with time. We assume that outcomes are bounded by $Y$; therefore, for any $t$, $y_t \in [-Y, Y]$ (we do not require our algorithm to know $Y$). We are interested in the square loss, therefore we will be using the optimal $\eta = 1/(2Y^2)$ and substitution function (3).

An expert is a sequence $\theta_1, \theta_2, \ldots$ that at time $T$ predicts $x_T'(\theta_1 + \theta_2 + \cdots + \theta_T)$, where for any $t$, $\theta_t \in \mathbb{R}^n$ and $x_T \in \mathbb{R}^n$. To apply the AA to this problem we need to define a lower triangular block matrix $L$, and $\theta$ which is a concatenation of all the $\theta_t$ for $t = 1, \ldots, T$, such that
$$L\theta = \begin{pmatrix} I & 0 & \cdots & 0 & 0\\ I & I & \cdots & 0 & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ I & I & \cdots & I & 0\\ I & I & \cdots & I & I \end{pmatrix} \begin{pmatrix} \theta_1\\ \theta_2\\ \vdots\\ \theta_{T-1}\\ \theta_T \end{pmatrix} = \begin{pmatrix} \theta_1\\ \theta_1 + \theta_2\\ \vdots\\ \theta_1 + \cdots + \theta_{T-1}\\ \theta_1 + \cdots + \theta_{T-1} + \theta_T \end{pmatrix}.$$
The matrices $I$ and $0$ in $L$ are the $n \times n$ identity and all-zero matrices respectively. We also need to define $z_t$, which is $x_t$ padded with zeros in the following way:
$$z_t = (\underbrace{0 \cdots 0}_{n(t-1)}\; x_t'\; \underbrace{0 \cdots 0}_{n(T-t)})',$$
so that $z_t'L\theta = x_t'(\theta_1 + \theta_2 + \cdots + \theta_t)$.

Let $a_t > 0$, $t = 1, \ldots, T$, be arbitrary constants. Consider the prior distribution $P_0$ in the set $\mathbb{R}^{nT}$ of possible weights $\theta$ with the Gaussian density
$$P_0(d\theta) = \left(\frac{\eta}{\pi}\right)^{nT/2}\left(\prod_{t=1}^T a_t\right)^{n/2} e^{-\eta\sum_{t=1}^T a_t\|\theta_t\|^2}\,d\theta_1\cdots d\theta_T = \left(\frac{\eta}{\pi}\right)^{nT/2}\left(\prod_{t=1}^T a_t\right)^{n/2} e^{-\eta\theta'A\theta}\,d\theta,$$
where, letting $I$ and $0$ be as above, we have
$$A = \begin{pmatrix} a_1I & 0 & \cdots & 0\\ 0 & a_2I & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & a_TI \end{pmatrix}.$$

The loss of $\theta$ over the first $T$ trials is
$$L_T(\theta) = \sum_{t=1}^T (y_t - z_t'L\theta)^2 = \theta'L'\left(\sum_{t=1}^T z_tz_t'\right)L\theta - 2\sum_{t=1}^T y_tz_t'L\theta + \sum_{t=1}^T y_t^2 .$$
(The sum $\theta_1 + \cdots + \theta_t$ corresponds to the predictor $u_t$ in [7].)
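A small numerical sketch of the block matrices just introduced may be helpful (our own illustration in NumPy, not code from the paper; the sizes and the random data are arbitrary). It builds $L$, $A$ and the padded signals $z_t$ for $n = 2$ and $T = 3$ and verifies that $z_t'L\theta = x_t'(\theta_1 + \cdots + \theta_t)$.

```python
import numpy as np

n, T = 2, 3
rng = np.random.default_rng(0)
xs = rng.normal(size=(T, n))                      # signals x_1, ..., x_T
thetas = rng.normal(size=(T, n))                  # increments theta_1, ..., theta_T
a = np.array([1.0, 2.0, 4.0])                     # constants a_1, ..., a_T

L = np.kron(np.tril(np.ones((T, T))), np.eye(n))  # nT x nT lower triangular block matrix
A = np.kron(np.diag(a), np.eye(n))                # block diagonal matrix with blocks a_t * I
theta = thetas.reshape(-1)                        # concatenation of the theta_t

for t in range(T):
    z_t = np.zeros(T * n)
    z_t[t * n:(t + 1) * n] = xs[t]                # x_t padded with zeros
    lhs = z_t @ L @ theta
    rhs = xs[t] @ thetas[:t + 1].sum(axis=0)      # x_t'(theta_1 + ... + theta_t)
    assert np.isclose(lhs, rhs)
```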

The loss of the APA is therefore (recall that $\beta = e^{-\eta}$)
$$L_T(\mathrm{APA}) = \log_\beta\int_{\mathbb{R}^{nT}}\beta^{L_T(\theta)}P_0(d\theta)
= \log_\beta\left(\frac{\eta}{\pi}\right)^{nT/2}\left(\prod_{t=1}^T a_t\right)^{n/2}\int_{\mathbb{R}^{nT}} e^{-\eta\left(\theta'L'\sum_{t=1}^T z_tz_t'L\theta - 2\sum_{t=1}^T y_tz_t'L\theta + \sum_{t=1}^T y_t^2 + \theta'A\theta\right)}d\theta$$
$$= \log_\beta\left(\frac{\eta}{\pi}\right)^{nT/2}\left(\prod_{t=1}^T a_t\right)^{n/2}\int_{\mathbb{R}^{nT}} e^{-\eta\theta'\left(L'\sum_{t=1}^T z_tz_t'L + A\right)\theta + 2\eta\sum_{t=1}^T y_tz_t'L\theta - \eta\sum_{t=1}^T y_t^2}d\theta .$$

Given the generalised prediction $g_T(\omega)$, which is the APA's loss with a variable $\omega \in \mathbb{R}$ replacing $y_T$, and using substitution function (3), the AA's prediction is
$$\gamma_T = \frac{g_T(-Y) - g_T(Y)}{4Y}
= \frac{1}{4Y}\log_\beta\frac{\int_{\mathbb{R}^{nT}} e^{-\eta\theta'\left(L'\sum_{t=1}^T z_tz_t'L + A\right)\theta + 2\eta\left(\sum_{t=1}^{T-1} y_tz_t'L - Yz_T'L\right)\theta}d\theta}{\int_{\mathbb{R}^{nT}} e^{-\eta\theta'\left(L'\sum_{t=1}^T z_tz_t'L + A\right)\theta + 2\eta\left(\sum_{t=1}^{T-1} y_tz_t'L + Yz_T'L\right)\theta}d\theta}.$$

Let
$$Q_1(\theta) = \theta'\left(L'\sum_{t=1}^T z_tz_t'L + A\right)\theta - 2\left(\sum_{t=1}^{T-1} y_tz_t'L - Yz_T'L\right)\theta, \quad\text{and}$$
$$Q_2(\theta) = \theta'\left(L'\sum_{t=1}^T z_tz_t'L + A\right)\theta - 2\left(\sum_{t=1}^{T-1} y_tz_t'L + Yz_T'L\right)\theta.$$
By Lemma 1,
$$\gamma_T = \frac{1}{4Y}\log_\beta\frac{e^{-\eta\min_{\theta\in\mathbb{R}^{nT}}Q_1(\theta)}}{e^{-\eta\min_{\theta\in\mathbb{R}^{nT}}Q_2(\theta)}} = \frac{\min_{\theta\in\mathbb{R}^{nT}}Q_1(\theta) - \min_{\theta\in\mathbb{R}^{nT}}Q_2(\theta)}{4Y}.$$
Finally, by using Lemma 2 we get
$$\gamma_T = \frac{1}{4Y}F\left(L'\sum_{t=1}^T z_tz_t'L + A,\ -2L'\sum_{t=1}^{T-1} y_tz_t,\ 2YL'z_T\right) = \sum_{t=1}^{T-1} y_tz_t'L\left(L'\sum_{t=1}^T z_tz_t'L + A\right)^{-1}L'z_T. \qquad (4)$$

3.2 AARCh: Dual Form

Let us define
$$Z = \begin{pmatrix} z_1'\\ z_2'\\ \vdots\\ z_T' \end{pmatrix}, \quad A = \begin{pmatrix} a_1I & 0 & \cdots & 0\\ 0 & a_2I & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & a_TI \end{pmatrix}, \quad\text{and}\quad \tilde{y} = \begin{pmatrix} y_1\\ \vdots\\ y_{T-1}\\ 0 \end{pmatrix}.$$
Since $\sum_{t=1}^T z_tz_t' = Z'Z$, we can rewrite (4) in matrix notation to get
$$\gamma_T = \tilde{y}'ZL(L'Z'ZL + A)^{-1}L'z_T = \tilde{y}'ZLA^{-1/2}\left(A^{-1/2}L'Z'ZLA^{-1/2} + I\right)^{-1}A^{-1/2}L'z_T .$$
We can now get a dual formulation of this by using Lemma 3:
$$\gamma_T = \tilde{y}'\left(ZLA^{-1}L'Z' + I\right)^{-1}ZLA^{-1}L'z_T . \qquad (5)$$

3.3 KAARCh

Since in (5) signals appear only in dot products, we can use the kernel trick to introduce nonlinearity. In this case we get Kernel AARCh (KAARCh) that at time $T$ makes a prediction
$$\gamma_T = \tilde{y}'(K + I)^{-1}\boldsymbol{k},$$
where $K = \left(\left(\sum_{t=1}^{\min(i,j)}\frac{1}{a_t}\right)k(x_i, x_j)\right)_{i,j}$, for $i, j = 1, \ldots, T$, i.e.,
$$K = \begin{pmatrix}
\frac{1}{a_1}k(x_1,x_1) & \frac{1}{a_1}k(x_1,x_2) & \cdots & \frac{1}{a_1}k(x_1,x_T)\\
\frac{1}{a_1}k(x_2,x_1) & \left(\frac{1}{a_1}+\frac{1}{a_2}\right)k(x_2,x_2) & \cdots & \left(\frac{1}{a_1}+\frac{1}{a_2}\right)k(x_2,x_T)\\
\vdots & \vdots & \ddots & \vdots\\
\frac{1}{a_1}k(x_T,x_1) & \left(\frac{1}{a_1}+\frac{1}{a_2}\right)k(x_T,x_2) & \cdots & \left(\frac{1}{a_1}+\cdots+\frac{1}{a_T}\right)k(x_T,x_T)
\end{pmatrix},$$
and $\boldsymbol{k}_i = \left(\sum_{t=1}^{i}\frac{1}{a_t}\right)k(x_i, x_T)$, for $i = 1, \ldots, T$, i.e.,
$$\boldsymbol{k} = \begin{pmatrix}
\frac{1}{a_1}k(x_1,x_T)\\
\left(\frac{1}{a_1}+\frac{1}{a_2}\right)k(x_2,x_T)\\
\vdots\\
\left(\frac{1}{a_1}+\cdots+\frac{1}{a_T}\right)k(x_T,x_T)
\end{pmatrix}.$$
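The KAARCh prediction above can be sketched in a few lines of NumPy (our own illustration, not code from the paper; it assumes the constants $a_1, \ldots, a_T$ are supplied as an array and the kernel is passed in as a function):

```python
import numpy as np

def kaarch_predict(X_past, y_past, x_new, a, kernel):
    """One KAARCh prediction gamma_T = y~'(K + I)^{-1} k, where
    K_{ij} = (sum_{t <= min(i,j)} 1/a_t) k(x_i, x_j) and k_i = (sum_{t <= i} 1/a_t) k(x_i, x_T)."""
    X = list(X_past) + [x_new]                               # x_1, ..., x_T (x_T is the current signal)
    T = len(X)
    y_tilde = np.append(np.asarray(y_past, dtype=float), 0.0)
    cum = np.cumsum(1.0 / np.asarray(a, dtype=float))        # cum[i] = 1/a_1 + ... + 1/a_{i+1}
    K = np.array([[cum[min(i, j)] * kernel(X[i], X[j]) for j in range(T)] for i in range(T)])
    k_vec = np.array([cum[i] * kernel(X[i], X[-1]) for i in range(T)])
    return y_tilde @ np.linalg.solve(K + np.eye(T), k_vec)

# Toy usage with a linear kernel (illustrative only).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(10, 2)), rng.normal(size=10)
a = np.ones(10)                                              # a_1 = ... = a_T = 1
print(kaarch_predict(X[:-1], y[:-1], X[-1], a, kernel=lambda u, v: float(u @ v)))
```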

4 Upper Bounds

In this section we use the Aggregating Algorithm's properties to derive upper bounds on the cumulative square loss suffered by AARCh and KAARCh, compared to that of any expert in the pool.

4.1 AARCh Loss Upper Bound

Theorem 3. For any point in time $T$ and any $a_t > 0$, $t = 1, \ldots, T$,
$$L_T(\mathrm{AARCh}) \le \inf_{\theta}\left(L_T(\theta) + \sum_{t=1}^T a_t\|\theta_t\|^2\right) + Y^2\ln\det\left(A^{-1/2}L'\sum_{t=1}^T z_tz_t'LA^{-1/2} + I\right). \qquad (6)$$

Proof. Given the Aggregating Algorithm's properties, we know that
$$L_T(\mathrm{AARCh}) \le \log_\beta\int_{\mathbb{R}^{nT}}\beta^{L_T(\theta)}P_0(d\theta)
= \log_\beta\left(\frac{\eta}{\pi}\right)^{nT/2}\left(\prod_{t=1}^T a_t\right)^{n/2}\int_{\mathbb{R}^{nT}} e^{-\eta\left(\theta'\left(L'\sum_{t=1}^T z_tz_t'L + A\right)\theta - 2\sum_{t=1}^T y_tz_t'L\theta + \sum_{t=1}^T y_t^2\right)}d\theta .$$
By Lemma 1 this is equal to
$$\inf_{\theta}\left(L_T(\theta) + \theta'A\theta\right) + \log_\beta\left(\frac{\eta}{\pi}\right)^{nT/2}\left(\prod_{t=1}^T a_t\right)^{n/2}\frac{\pi^{nT/2}}{\sqrt{\det\left(\eta L'\sum_{t=1}^T z_tz_t'L + \eta A\right)}}$$
$$= \inf_{\theta}\left(L_T(\theta) + \sum_{t=1}^T a_t\|\theta_t\|^2\right) + \log_\beta\sqrt{\frac{\eta^{nT}\left(\prod_{t=1}^T a_t\right)^n}{\det\left(\eta L'\sum_{t=1}^T z_tz_t'L + \eta A\right)}}$$
$$= \inf_{\theta}\left(L_T(\theta) + \sum_{t=1}^T a_t\|\theta_t\|^2\right) + \frac{1}{2}\log_\beta\frac{\prod_{t=1}^T a_t^n}{\det\left(L'\sum_{t=1}^T z_tz_t'L + A\right)}$$
$$= \inf_{\theta}\left(L_T(\theta) + \sum_{t=1}^T a_t\|\theta_t\|^2\right) - \frac{1}{2}\log_\beta\det\left(A^{-1/2}L'\sum_{t=1}^T z_tz_t'LA^{-1/2} + I\right)$$
$$= \inf_{\theta}\left(L_T(\theta) + \sum_{t=1}^T a_t\|\theta_t\|^2\right) + Y^2\ln\det\left(A^{-1/2}L'\sum_{t=1}^T z_tz_t'LA^{-1/2} + I\right).$$

4.2 KAARCh Loss Upper Bound

The following generalises Theorem 3. Note that we cannot repeat the proof for the linear case directly, since it involves the evaluation of an integral over the space $\mathbb{R}^{nT}$.

Theorem 4. Let $k$ be a kernel on a space $X$, let $D_t$, $t = 1, \ldots, T$, be any decision rules in the RKHS $\mathcal{F}$ induced by $k$ and let $D = (D_1, D_2, \ldots, D_T)$. Then, for any point in time $T$ and every $a_t > 0$, $t = 1, \ldots, T$,
$$L_T(\mathrm{KAARCh}) \le L_T(D) + \sum_{t=1}^T a_t\|D_t\|^2 + Y^2\ln\det(K + I). \qquad (7)$$

Proof. It will be sufficient to prove this for $D_t$ of the form
$$f_t(x) = \sum_{i=1}^{l_t} c_i^{(t)}k(v_i^{(t)}, x),$$
where $l_t$ are positive integers, $c_i^{(t)} \in \mathbb{R}$, and $v_i^{(t)}, x \in X$ (we use the superscript $(t)$ to show that these parameters can be different for each $f_t$). This is because such finite sums are dense in the RKHS $\mathcal{F}$. If we take $f = (f_1, f_2, \ldots, f_T)$, (7) becomes
$$L_T(\mathrm{KAARCh}) \le L_T(f) + \sum_{t=1}^T a_t\sum_{i,j=1}^{l_t} c_i^{(t)}c_j^{(t)}k(v_i^{(t)}, v_j^{(t)}) + Y^2\ln\det(K + I), \qquad (8)$$
where $L_T(f)$ is the cumulative square loss of the changing expert $f$, i.e., $L_T(f) = \sum_{t=1}^T\left(y_t - \sum_{s=1}^t f_s(x_t)\right)^2$.

In the special case when $X = \mathbb{R}^n$ and $k(v_i, v_j) = v_i'v_j$ for every $v_i, v_j \in X$, (8) follows directly from (6). Indeed, a kernel predictor $f_t$ reduces to the linear predictor $\theta_t = \sum_{i=1}^{l_t} c_i^{(t)}v_i^{(t)}$, and the term $\sum_{i,j=1}^{l_t} c_i^{(t)}c_j^{(t)}k(v_i^{(t)}, v_j^{(t)})$ equals the squared quadratic norm of $\theta_t$. Finally, by Sylvester's determinant identity (see also Lemma 4 for an independent proof of this) we know that
$$\det(K + I) = \det\left(ZLA^{-1}L'Z' + I\right) = \det\left(A^{-1/2}L'Z'ZLA^{-1/2} + I\right).$$

The general case can be obtained by using finite dimensional approximations. Recall that inherent in every kernel is a function $\phi$ that maps objects to the RKHS $\mathcal{F}$, which is isomorphic to $\ell_2 = \{\alpha = (\alpha_1, \alpha_2, \ldots) \mid \sum_{i=1}^{\infty}\alpha_i^2 \text{ converges}\}$. Let us consider the sequence of subspaces $\mathbb{R}^1 \subseteq \mathbb{R}^2 \subseteq \cdots \subseteq \mathcal{F}$, where the set $\mathbb{R}^s = \{(\alpha_1, \alpha_2, \ldots, \alpha_s, 0, 0, \ldots)\}$ may be identified with $\mathbb{R}^s$.

Let $p_s : \mathcal{F} \to \mathbb{R}^s$ be the projection operator $p_s\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_s, 0, 0, \ldots)$, let $\phi_s : X \to \mathbb{R}^s$ be $\phi_s = p_s\phi$, and let $k_s$ be given by $k_s(v_1, v_2) = \langle\phi_s(v_1), \phi_s(v_2)\rangle$, where $v_1, v_2 \in X$. Inequality (8) holds for $k_s$ since $\mathbb{R}^s$ has a finite dimension. If (8) is violated, then its counterpart with some large $s$ is violated too, and this observation completes the proof.

5 Discussion

In this section we shall analyse upper bound (7) in order to obtain an equivalent of Corollary 1. Our goal is to show that KAARCh's cumulative loss is less than or equal to that of a wide class of experts plus a term of the order $o(T)$.

Estimating the determinant of a positive definite matrix by the product of its diagonal elements (see [15, Sect. 2.10, Theorem 7]) and using the inequality $\ln(1 + x) \le x$ (in our case $x$ is small, and therefore the resulting bound is tight), we get
$$Y^2\ln\det(K + I) \le Y^2\sum_{t=1}^T\ln\left(1 + c^2\sum_{i=1}^t\frac{1}{a_i}\right) \le Y^2c^2\sum_{t=1}^T\sum_{i=1}^t\frac{1}{a_i} = Y^2c^2\sum_{t=1}^T\frac{T - t + 1}{a_t},$$
where $c = \sup_{x\in X}\sqrt{k(x,x)}$.

It is natural to single out the first decision rule $D_1$ and the corresponding coefficient $a_1$ from the rest. We may think of it as corresponding to the choice of the principal dependency; let the rest of the $D_t$ ($t = 2, \ldots, T$) be small correction terms. Let us take equal $a_2 = \cdots = a_T = a$. We get
$$L_T(\mathrm{KAARCh}) \le L_T(D) + \left(a_1\|D_1\|^2 + \frac{Y^2c^2T}{a_1}\right) + \left(a\sum_{t=2}^T\|D_t\|^2 + \frac{Y^2c^2T(T-1)}{2a}\right). \qquad (9)$$
If we bound the norm of $D_1$ by $d$ and assume that $T$ is known in advance, $a_1$ may be chosen as in Corollary 1. The second term on the right-hand side of (9) can thus be bounded by $O(\sqrt{T})$. If we assume that $\sum_{t=2}^T\|D_t\|^2 \le s(T)$, then the estimate is minimised by $a = Yc\sqrt{T(T-1)/(2s(T))}$ and the third term on the right-hand side of (9) can be bounded by $O(T\sqrt{s(T)})$. We therefore get the following corollary:

Corollary 2. Under the conditions of Theorem 4, let $T$ be known in advance and $c = \sup_{x\in X}\sqrt{k(x,x)}$. For every $d > 0$ and every function $s(T)$, if $\|D_1\| \le d$ and $\sum_{t=2}^T\|D_t\|^2 \le s(T)$, then $a_t$, for $t = 1, \ldots, T$, can be chosen so that
$$L_T(\mathrm{KAARCh}) \le L_T(D) + 2Ycd\sqrt{T} + 2Yc\sqrt{\frac{s(T)T(T-1)}{2}}.$$
If $s(T) = o(1)$, then $L_T(\mathrm{KAARCh}) \le L_T(D) + o(T)$.

The estimate $s(T) = o(1)$ can be achieved in two natural ways. First, one can assume that each $D_t$, for $t = 2, \ldots, T$, is small.

Corollary 3. Under the conditions of Theorem 4, let $T$ be known in advance. For every positive $d$, $d'$, and $\varepsilon$, if $\|D_1\| \le d$ and, for $t = 2, \ldots, T$,
$$\|D_t\| \le \frac{d'}{T^{0.5 + \varepsilon}},$$
then
$$L_T(\mathrm{KAARCh}) \le L_T(D) + O\left(T^{\max(0.5,\ 1 - \varepsilon)}\right) = L_T(D) + o(T).$$

Secondly, one may assume that there are only a few nonzero $D_t$, for $t = 2, \ldots, T$. In this case, the nonzero $D_t$ can have greater flexibility.

Acknowledgements. We thank Volodya Vovk and Alex Gammerman for valuable discussions. We are grateful to Michael Vyugin for suggesting the problem of predicting implied volatility, which inspired this work.

References

1. Vovk, V.: Aggregating strategies. In Fulk, M., Case, J., eds.: Proceedings of the 3rd Annual Workshop on Computational Learning Theory, Morgan Kaufmann (1990)
2. Vovk, V.: A game of prediction with expert advice. Journal of Computer and System Sciences 56 (1998)
3. Vovk, V.: Competitive on-line statistics. International Statistical Review 69 (2001)
4. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press (2006)
5. Gammerman, A., Kalnishkan, Y., Vovk, V.: On-line prediction with kernels and the complexity approximation principle. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, AUAI Press (2004)
6. Vovk, V.: On-line regression competitive with reproducing kernel Hilbert spaces. Technical Report arXiv:cs.LG/0511058 (version 2), arXiv.org
7. Herbster, M., Warmuth, M.K.: Tracking the best linear predictor. Journal of Machine Learning Research 1 (2001)

8. Kivinen, J., Smola, A.J., Williamson, R.C.: Online learning with kernels. IEEE Transactions on Signal Processing 52 (2004)
9. Cavallanti, G., Cesa-Bianchi, N., Gentile, C.: Tracking the best hyperplane with a simple budget perceptron. Machine Learning (to appear)
10. Busuttil, S., Kalnishkan, Y.: Weighted kernel regression for predicting changing dependencies. In: Proceedings of the 18th European Conference on Machine Learning (ECML 2007) (to appear)
11. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, UK (2000)
12. Aizerman, M., Braverman, E., Rozonoer, L.: Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25 (1964)
13. Aronszajn, N.: Theory of reproducing kernels. Transactions of the American Mathematical Society 68 (1950)
14. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. The MIT Press, USA (2002)
15. Beckenbach, E.F., Bellman, R.: Inequalities. Springer (1961)

Appendix

Lemma 1. Let $Q(\theta) = \theta'A\theta + b'\theta + c$, where $\theta, b \in \mathbb{R}^n$, $c$ is a scalar and $A$ is a symmetric positive definite $n \times n$ matrix. Then
$$\int_{\mathbb{R}^n} e^{-Q(\theta)}d\theta = e^{-Q_0}\frac{\pi^{n/2}}{\sqrt{\det A}},$$
where $Q_0 = \min_{\theta\in\mathbb{R}^n}Q(\theta)$.

Proof. Let $\theta_0 \in \arg\min Q$. Take $\xi = \theta - \theta_0$ and $\tilde{Q}(\xi) = Q(\xi + \theta_0)$. It is easy to see that the quadratic part of $\tilde{Q}$ is $\xi'A\xi$. Since $0 \in \arg\min\tilde{Q}$, the form has no linear term. Indeed, in the vicinity of $0$ the linear term dominates over the quadratic term; if $\tilde{Q}$ has a non-zero linear term, it cannot have a minimum at $0$. Since $Q_0 = \min_{\xi\in\mathbb{R}^n}\tilde{Q}(\xi)$, we can conclude that the constant term in $\tilde{Q}$ is $Q_0$. Thus $\tilde{Q}(\xi) = \xi'A\xi + Q_0$. It remains to show that $\int_{\mathbb{R}^n} e^{-\xi'A\xi}d\xi = \pi^{n/2}/\sqrt{\det A}$. This can be proved by considering a basis where $A$ diagonalises, or see [15, Sect. 2.7, Theorem 3].

Lemma 2. Let
$$F(A, b, x) = \min_{\theta\in\mathbb{R}^n}\left(\theta'A\theta + b'\theta + x'\theta\right) - \min_{\theta\in\mathbb{R}^n}\left(\theta'A\theta + b'\theta - x'\theta\right),$$
where $b, x \in \mathbb{R}^n$ and $A$ is a symmetric positive definite $n \times n$ matrix. Then $F(A, b, x) = -b'A^{-1}x$.

Proof. It can be shown by differentiation that the first minimum is achieved at $\theta_1 = -\frac{1}{2}A^{-1}(b + x)$ and the second minimum at $\theta_2 = -\frac{1}{2}A^{-1}(b - x)$. The substitution proves the lemma.

Lemma 3. Given a matrix $A$, a scalar $a$ and identity matrices $I$ of the appropriate size,
$$(AA' + aI)^{-1}A = A(A'A + aI)^{-1}.$$

Proof.
$$(AA' + aI)^{-1}A = (AA' + aI)^{-1}A(A'A + aI)(A'A + aI)^{-1} = (AA' + aI)^{-1}(AA'A + aA)(A'A + aI)^{-1}$$
$$= (AA' + aI)^{-1}(AA' + aI)A(A'A + aI)^{-1} = A(A'A + aI)^{-1}.$$

Lemma 4. For every matrix $M$ the equality $\det(I + M'M) = \det(I + MM')$ holds, where $I$ are identity matrices of the correct size.

Proof. Suppose that $M$ is an $n \times m$ matrix. Thus $I + MM'$ and $I + M'M$ are $n \times n$ and $m \times m$ matrices respectively. Without loss of generality, we may assume that $n \ge m$ (otherwise we swap $M$ and $M'$). Let the columns of $M$ be the $m$ vectors $x_1, \ldots, x_m \in \mathbb{R}^n$. We have $MM' = \sum_{i=1}^m x_ix_i'$.

Let us see how the operator $MM'$ acts on a vector $x \in \mathbb{R}^n$. By associativity, $x_i(x_i'x) = (x_i'x)x_i$, where $x_i'x$ is a scalar. Therefore, if $U$ is the span of $x_1, x_2, \ldots, x_m$, then $MM'(\mathbb{R}^n) \subseteq U$. In a similar way, it follows that $(I + MM')(U) \subseteq U$. On the other hand, if $x$ is orthogonal to all the $x_i$, then $x_ix_i'x = (x_i'x)x_i = 0$. Hence $MM'|_{U^\perp} = 0$, where $U^\perp$ is the orthogonal complement to $U$ with respect to $\mathbb{R}^n$. Consequently, $(I + MM')|_{U^\perp} = I$ (by $B|_V$ we denote the restriction of an operator $B$ to a subspace $V$). Therefore $(I + MM')(U^\perp) \subseteq U^\perp$.

One can see that both $U$ and $U^\perp$ are invariant subspaces of $I + MM'$. If we choose bases in $U$ and in $U^\perp$ and then concatenate them, we get a basis of $\mathbb{R}^n$. In this basis the matrix of $I + MM'$ has the form
$$\begin{pmatrix} A & 0\\ 0 & I \end{pmatrix},$$
where $A$ is the matrix of $(I + MM')|_U$. It remains to evaluate $\det A$.

First let us consider the case of linearly independent $x_1, x_2, \ldots, x_m$. They form a basis of $U$ and we may use it to calculate the determinant of the operator $(I + MM')|_U$. However, $(I + MM')x_i = x_i + \sum_{j=1}^m (x_j'x_i)x_j$ and thus the matrix of the operator $(I + MM')|_U$ in the basis $x_1, x_2, \ldots, x_m$ is $I + M'M$. The case of linearly dependent $x_1, x_2, \ldots, x_m$ follows by continuity. Indeed, $m$ vectors in an $n$-dimensional space with $n \ge m$ may be approximated by $m$ linearly independent vectors to any degree of precision, and the determinant is a continuous function of the elements of a matrix.
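As a quick numerical sanity check of the two matrix identities above (a sketch of our own, using an arbitrary random matrix), the following snippet verifies Lemma 3 and Lemma 4 on a small example.

```python
import numpy as np

rng = np.random.default_rng(42)
M = rng.normal(size=(5, 3))
a = 0.7

# Lemma 3: (MM' + aI)^{-1} M = M (M'M + aI)^{-1}
lhs3 = np.linalg.solve(M @ M.T + a * np.eye(5), M)
rhs3 = M @ np.linalg.inv(M.T @ M + a * np.eye(3))
assert np.allclose(lhs3, rhs3)

# Lemma 4: det(I + M'M) = det(I + MM')
assert np.isclose(np.linalg.det(np.eye(3) + M.T @ M),
                  np.linalg.det(np.eye(5) + M @ M.T))
```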
