Online Regression Competitive with Changing Predictors

Steven Busuttil and Yuri Kalnishkan

Computer Learning Research Centre and Department of Computer Science,
Royal Holloway, University of London, Egham, Surrey, TW20 0EX, United Kingdom
{steven,yura}@cs.rhul.ac.uk

Abstract. This paper deals with the problem of making predictions in the online mode of learning where the dependence of the outcome y_t on the signal x_t can change with time. The Aggregating Algorithm (AA) is a technique that optimally merges experts from a pool, so that the resulting strategy suffers a cumulative loss that is almost as good as that of the best expert in the pool. We apply the AA to the case where the experts are all the linear predictors that can change with time. KAARCh is the kernel version of the resulting algorithm. In the kernel case, the experts are all the decision rules in some reproducing kernel Hilbert space that can change over time. We show that KAARCh suffers a cumulative square loss that is almost as good as that of any expert that does not change very rapidly.

1 Introduction

We consider the online protocol where on each trial t = 1, 2, ... the learner observes a signal x_t and attempts to predict the outcome y_t, which is shown to the learner later. The performance of the learner is measured by means of the cumulative square loss. The Aggregating Algorithm (AA), introduced by Vovk in [1] and [2], allows us to merge experts from large pools to obtain optimal strategies. Such an optimal strategy performs nearly as well as the best expert from the class in terms of the cumulative loss.

In [3] the AA is applied to merge all constant linear regressors, i.e., experts θ predicting θ'x_t (it is assumed that x_t and θ are drawn from R^n). The resulting Aggregating Algorithm for Regression (AAR), also known as the Vovk-Azoury-Warmuth forecaster (see [4, Sect. 11.8]), performs almost as well as the best regressor θ. In [5] the kernel version of AAR, known as the Kernel AAR (KAAR), is introduced and a bound on its performance is derived (see also [6, Sect. 8]). From a computational point of view the algorithm is similar to Ridge Regression. We summarise the results concerning AAR and KAAR in Sect. 2.3.

In this paper, the AA is applied to merge a wider class of predictors. We let θ vary between trials. Consider a sequence θ_1, θ_2, ...; let it make the prediction (θ_1 + θ_2 + ... + θ_t)'x_t on trial t. We merge all predictors of this type and obtain an algorithm which is again computationally similar to Ridge Regression. We call the new algorithm Aggregating Algorithm for Regression with Changing dependencies (AARCh) and its kernelised version KAARCh. Clearly, our class of experts is very large and we cannot compete in a reasonable sense with every expert from this class. However, in Sects. 4 and 5 we show that KAARCh can perform almost as well as any regressor if the latter is not changing very rapidly, i.e., if each θ_t is small or only a few are nonzero.

A similar problem is considered in [7], [8], and [9] for classification and regression. In these publications, this problem is referred to as the non-stationary or shifting target problem and the corresponding bounds are called shifting bounds. The work by Herbster and Warmuth in [7] is closest to ours. However, their methods are based on Gradient Descent and therefore their bounds are of a different type. For instance, since our approach is based on the Aggregating Algorithm, we get a coefficient of 1 for the term representing the cumulative loss of the experts (see Theorems 3 and 4), whereas those in the bounds of [7, Theorems 4-6] are greater than 1.

In practice, KAARCh can be used to predict parameters that change slowly with time. KAARCh is more computationally expensive than the techniques described in [7], with time and space complexities that grow with time. This is not desirable in an algorithm designed for online learning; however, a practical implementation is described in [10]. Essentially, KAARCh is made to forget older examples that do not affect the prediction too much. In [10] empirical experiments are carried out on an artificial dataset and on the real world problem of predicting the implied volatility of options (the name KAARCh was inspired by the popular GARCH model for predicting volatility in finance).

2 Background

In this section we introduce some preliminaries and related material required for our main results. As usual, all vectors are identified with one-column matrices and B' stands for the transpose of the matrix B. We will not be specifying the size of simple matrices like the identity matrix I when this is clear from the context.

2.1 Protocol and Loss

We can define online regression by the following protocol. At every moment in time t = 1, 2, ..., the value of a signal x_t ∈ X arrives. Statistician (or Learner) S observes x_t and then outputs a prediction γ_t ∈ R. Finally, the outcome y_t ∈ R arrives. This can be summarised by the following scheme:

for t = 1, 2, ... do
  S observes x_t ∈ X
  S outputs γ_t ∈ R
  S observes y_t ∈ R
end for
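To make this protocol concrete, here is a minimal Python sketch of the loop above together with the cumulative square loss defined below; the function and class names (run_protocol, LastOutcomeLearner) and the toy learner itself are our own illustrations, not part of the paper.

def run_protocol(signals, outcomes, learner):
    """Run the online regression protocol; return the cumulative square loss L_T(S)."""
    total_loss = 0.0
    for x_t, y_t in zip(signals, outcomes):
        gamma_t = learner.predict(x_t)       # S outputs a prediction
        total_loss += (y_t - gamma_t) ** 2   # square loss suffered on trial t
        learner.update(x_t, y_t)             # S observes the outcome y_t
    return total_loss

class LastOutcomeLearner:
    """Toy Statistician that predicts the previous outcome (0.0 on the first trial)."""
    def __init__(self):
        self.last_y = 0.0
    def predict(self, x_t):
        return self.last_y
    def update(self, x_t, y_t):
        self.last_y = y_t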

The set X is a signal space which is assumed to be known to Statistician in advance. We will be referring to a signal-outcome pair as an example.

The performance of S is measured by the sum of squared discrepancies between the predictions and the outcomes, known as square loss. Therefore on trial t Statistician S suffers loss (y_t - γ_t)^2. Thus after T trials, the total loss of S is
\[ L_T(S) = \sum_{t=1}^T (y_t - \gamma_t)^2. \]
Clearly, a smaller value of L_T(S) means a better predictive performance.

2.2 Linear and Kernel Predictors

If X ⊆ R^n we can consider simple linear regressors of the form θ ∈ R^n. Given a signal x ∈ X, such a regressor makes a prediction θ'x. Linear methods are easy to manipulate mathematically but their use in the real world is limited since they can only model simple dependencies. One solution to this could be to map the data to some high-dimensional feature space and then find a simple solution there. This, however, can lead to what is known as the curse of dimensionality, where both the computational and generalisation performance degrade as the number of features grows [11].

The kernel trick (first used in this context in [12]) is now a widely used technique which can make a linear algorithm operate in feature space without the inherent complexities. Informally, a kernel is a dot product in feature space. Typically, to transform a linear method into a nonlinear one, the linear algorithm is first formulated in such a way that all signals appear only in dot products (known as the dual form). Then these dot products are replaced by kernels.

For a function k : X × X → R to be a kernel it has to be symmetric and, for all l and all x_1, ..., x_l ∈ X, the kernel matrix K = (k(x_i, x_j)), i, j = 1, ..., l, must be positive semi-definite (have nonnegative eigenvalues). For every kernel there exists a unique reproducing kernel Hilbert space (RKHS) F such that k is the reproducing kernel of F. In fact, there is a mapping φ : X → F such that kernels can be defined as k(x, z) = ⟨φ(x), φ(z)⟩. An RKHS on a set X is a separable and complete Hilbert space of real-valued functions on X comprised of linear combinations of k of the form
\[ f(x) = \sum_{i=1}^{l} c_i k(v_i, x), \]
where l is a positive integer, c_i ∈ R and v_i, x ∈ X, and their limits. We will be referring to any function in the RKHS F as D. Intuitively, D(x) is a decision rule in F that produces a prediction for the object x. We will be measuring the complexity of D by its norm ‖D‖ in F. For more information on kernels and RKHS see, for example, [13] and [14].
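As a small illustration of these requirements, the following Python sketch (our own example, assuming a Gaussian RBF kernel and a handful of arbitrary points) builds a kernel matrix and checks that it is symmetric and positive semi-definite.

import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-diff.dot(diff) / (2.0 * sigma ** 2)))

points = [np.array([0.0, 1.0]), np.array([1.0, 0.5]), np.array([-1.0, 2.0])]
K = np.array([[rbf_kernel(xi, xj) for xj in points] for xi in points])

assert np.allclose(K, K.T)                       # symmetry
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)   # nonnegative eigenvalues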

2.3 The Aggregating Algorithm (AA)

We now give an overview of the Aggregating Algorithm (AA), mostly following [3, Sects. 1 and 2]. Let Ω be an outcome space, Γ be a prediction space and Θ be a (possibly infinite) pool of experts. We consider the following game between Statistician (or Learner) S, Nature, and Θ:

for t = 1, 2, ... do
  Every expert θ ∈ Θ makes a prediction γ_t^θ ∈ Γ
  Statistician S observes all γ_t^θ
  Statistician S outputs a prediction γ_t ∈ Γ
  Nature outputs ω_t ∈ Ω
end for

Given a fixed loss function λ : Ω × Γ → [0, ∞], Statistician aims to suffer a cumulative loss L_T(S) = Σ_{t=1}^T λ(ω_t, γ_t) that is not much larger than the loss L_T(θ) = Σ_{t=1}^T λ(ω_t, γ_t^θ) of the best expert θ ∈ Θ.

The AA takes two parameters, a prior probability distribution P_0 in the pool of experts Θ and a learning rate η > 0. Let β = e^{-η}. We will first describe the Aggregating Pseudo Algorithm (APA), which does not output actual predictions but generalised predictions. A generalised prediction g : Ω → R is a mapping giving a value of loss for each possible outcome.

At every step t, the APA updates the experts' weights so that those that suffered large loss during the previous step have their weights reduced:
\[ P_t(d\theta) = \beta^{\lambda(\omega_t, \gamma_t^\theta)} P_{t-1}(d\theta), \qquad \theta \in \Theta. \]
At time t, the APA chooses a generalised prediction by
\[ g_t(\omega) = \log_\beta \int_\Theta \beta^{\lambda(\omega, \gamma_t^\theta)} P^*_{t-1}(d\theta), \]
where P*_{t-1}(dθ) are the normalised weights P*_{t-1}(dθ) = P_{t-1}(dθ)/P_{t-1}(Θ). This guarantees that for any learning rate η > 0, prior P_0, and T = 1, 2, ... (see [3, Lemma 1])
\[ L_T(\mathrm{APA}) = \log_\beta \int_\Theta \beta^{L_T(\theta)} P_0(d\theta). \tag{1} \]

To get a prediction from the generalised prediction g_t(ω) (note that we use ω since we do not yet know the real outcome of step t, ω_t), the AA uses a substitution function Σ mapping generalised predictions into Γ. A substitution function may introduce extra loss; however, in many cases perfect substitution is possible.
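For intuition, here is a minimal Python sketch of the AA for a finite pool of experts under the square loss; it assumes the optimal learning rate and the substitution function (3) given in the next subsection, and the function name and interface are our own.

import numpy as np

def aggregating_algorithm(expert_preds, outcomes, Y):
    """AA for the bounded square loss game with a finite pool of m experts.

    expert_preds: array (T, m) of the experts' predictions;
    outcomes: array (T,) of outcomes in [-Y, Y].
    Returns the AA's predictions gamma_1, ..., gamma_T.
    """
    expert_preds = np.asarray(expert_preds, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    T, m = expert_preds.shape
    eta = 1.0 / (2.0 * Y ** 2)          # optimal learning rate for this game
    log_w = np.zeros(m)                 # uniform prior P_0, kept in log space
    predictions = np.zeros(T)
    for t in range(T):
        gam = expert_preds[t]
        # generalised prediction g_t(omega) = log_beta sum_i w_i beta^{(omega - gam_i)^2};
        # the weights' normalising constant cancels in the difference g_t(-Y) - g_t(Y)
        def g_t(omega):
            z = (log_w - log_w.max()) - eta * (omega - gam) ** 2
            return -np.log(np.sum(np.exp(z))) / eta
        predictions[t] = (g_t(-Y) - g_t(Y)) / (4.0 * Y)    # substitution function (3)
        log_w = log_w - eta * (outcomes[t] - gam) ** 2     # APA weight update
    return predictions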

We say that the loss function λ is η-mixable if there is a substitution function Σ such that
\[ \lambda(\omega_t, \Sigma(g_t)) \le g_t(\omega_t) \tag{2} \]
on every step t, all experts' predictions and all outcomes. The loss function λ is mixable if it is η-mixable for some η > 0.

Suppose that our loss function is η-mixable. Using (2) and (1) we can obtain the following upper bound on the cumulative loss of the AA:
\[ L_T(\mathrm{AA}) \le \log_\beta \int_\Theta \beta^{L_T(\theta)} P_0(d\theta). \]
In particular, when the pool of experts is finite and all experts are assigned equal prior weights, we get, for any θ ∈ Θ,
\[ L_T(\mathrm{AA}) \le L_T(\theta) + \frac{\ln m}{\eta}, \]
where m is the size of the pool of experts. This bound can be shown to be optimal in a very strong sense for all algorithms attempting to merge experts' predictions (see [2]).

The Square Loss Game. In this paper we are concerned with the bounded square loss game (see [3, Sect. 2.4]), where Ω = [-Y, Y], Y ∈ R, Γ = R, and λ(ω, γ) = (ω - γ)^2. The square loss game is η-mixable if and only if η ≤ 1/(2Y^2). A perfect substitution function for this game is
\[ \gamma = \frac{g(-Y) - g(Y)}{4Y}. \tag{3} \]

The Aggregating Algorithm for Regression (AAR). The AA was applied to the problem of linear regression, resulting in the Aggregating Algorithm for Regression (AAR). AAR merges all the linear predictors that map signals to outcomes [3, Sect. 3] (a Gaussian prior is assumed on the pool of experts). AAR makes a prediction at time T by
\[ \gamma_{\mathrm{AAR}} = \tilde{y}' \tilde{X} \left( \tilde{X}'\tilde{X} + aI \right)^{-1} x_T, \]
where \tilde{X} = (x_1, x_2, ..., x_T)' and \tilde{y} = (y_1, y_2, ..., y_{T-1}, 0)'.

The main property of AAR is that it is optimal in the sense that the total loss it suffers is only a little worse than that of any linear predictor. By the latter we mean a strategy that predicts θ'x_t on every trial t, where θ ∈ R^n is some fixed vector. The set of all linear predictors may be identified with R^n.

Theorem 1 ([3, Theorem 1]). For any a > 0 and any point in time T,
\[ L_T(\mathrm{AAR}) \le \inf_\theta \left( L_T(\theta) + a\|\theta\|^2 \right) + Y^2 \ln\det\left( \frac{1}{a}\sum_{t=1}^T x_t x_t' + I \right). \]
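A minimal numpy sketch of the AAR prediction above, assuming the past signals are stacked as the rows of X̃ and the unknown outcome y_T is replaced by 0 in ỹ; the function name and the random data in the usage lines are illustrative only.

import numpy as np

def aar_predict(X_past, y_past, x_new, a=1.0):
    """AAR (Vovk-Azoury-Warmuth) prediction for the new signal x_new.

    X_past: (T-1, n) past signals; y_past: (T-1,) past outcomes; a > 0.
    """
    X_tilde = np.vstack([X_past, x_new])     # rows x_1', ..., x_T'
    y_tilde = np.append(y_past, 0.0)         # (y_1, ..., y_{T-1}, 0)'
    n = X_tilde.shape[1]
    # gamma_AAR = y~' X~ (X~'X~ + a I)^{-1} x_T
    return y_tilde @ X_tilde @ np.linalg.solve(X_tilde.T @ X_tilde + a * np.eye(n), x_new)

# toy usage with random data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(10, 3)), rng.normal(size=10)
print(aar_predict(X, y, rng.normal(size=3)))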

The Kernel Aggregating Algorithm for Regression (KAAR). KAAR, the kernel version of AAR introduced in [5], makes a prediction for the signal x_T by
\[ \gamma_{\mathrm{KAAR}} = \tilde{y}' \left( \tilde{K} + aI \right)^{-1} \tilde{k}, \]
where
\[ \tilde{K} = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_T) \\ \vdots & \ddots & \vdots \\ k(x_T, x_1) & \cdots & k(x_T, x_T) \end{pmatrix}, \qquad \tilde{k} = \begin{pmatrix} k(x_1, x_T) \\ \vdots \\ k(x_T, x_T) \end{pmatrix}. \]

Like AAR, KAAR has an optimality property: KAAR performs little worse than any decision rule D in the RKHS induced by a kernel function k.

Theorem 2 ([5, Theorem 1] and [6, Sect. 8]). Let k be a kernel on a space X and D be any decision rule in the RKHS induced by k. Then for every a > 0 and any point in time T,
\[ L_T(\mathrm{KAAR}) \le L_T(D) + a\|D\|^2 + Y^2 \ln\det\left( \frac{1}{a}\tilde{K} + I \right). \]

Corollary 1 ([6, Sect. 8]). Under the same conditions as Theorem 2, let c = sup_{x∈X} √(k(x, x)). Then for every a > 0, every d > 0, every decision rule D such that ‖D‖ ≤ d, and any point in time T, we get
\[ L_T(\mathrm{KAAR}) \le L_T(D) + ad^2 + \frac{Y^2 c^2 T}{a}. \]
If, moreover, T is known in advance, one can choose a = Yc√T/d and get
\[ L_T(\mathrm{KAAR}) \le L_T(D) + 2Ycd\sqrt{T}. \]

3 Algorithm

For our new method, we apply the Aggregating Algorithm (AA) to the regression problem where the experts can change with time. We call this method the Aggregating Algorithm for Regression with Changing dependencies (AARCh). Subsequently, we will kernelise this method to get Kernel AARCh (KAARCh). Throughout this section we will be using the lemmas given in the appendix.

3.1 AARCh: Primal Form

The main idea behind AARCh is to apply the Aggregating Algorithm to the case where the pool of experts is made up of all linear predictors that can change independently with time. We assume that outcomes are bounded by Y, therefore, for any t, y_t ∈ [-Y, Y] (we do not require our algorithm to know Y). We are interested in the square loss, therefore we will be using the optimal η = 1/(2Y^2) and substitution function (3).

An expert is a sequence θ_1, θ_2, ... that at time T predicts x_T'(θ_1 + θ_2 + ... + θ_T), where for any t, θ_t ∈ R^n and x_T ∈ R^n.¹ To apply the AA to this problem we need to define a lower triangular block matrix L, and θ which is a concatenation of all the θ_t for t = 1, ..., T, such that
\[ L\theta = \begin{pmatrix} I & 0 & \cdots & \cdots & 0 \\ I & I & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ I & I & \cdots & I & 0 \\ I & I & \cdots & I & I \end{pmatrix} \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_{T-1} \\ \theta_T \end{pmatrix} = \begin{pmatrix} \theta_1 \\ \theta_1 + \theta_2 \\ \vdots \\ \theta_1 + \theta_2 + \cdots + \theta_{T-1} \\ \theta_1 + \theta_2 + \cdots + \theta_{T-1} + \theta_T \end{pmatrix}. \]
The matrices I and 0 in L are the n × n identity and all-zero matrices respectively. We also need to define z_t, which is x_t padded with zeros in the following way:
\[ z_t = ( \underbrace{0, \ldots, 0}_{n(t-1)}, \; x_t', \; \underbrace{0, \ldots, 0}_{n(T-t)} )', \]
so that z_t'Lθ = x_t'(θ_1 + θ_2 + ... + θ_t).

Let a_t > 0, t = 1, ..., T, be arbitrary constants. Consider the prior distribution P_0 in the set R^{nT} of possible weights θ with the Gaussian density
\[ P_0(d\theta) = \left(\frac{\eta}{\pi}\right)^{nT/2} \left(\prod_{t=1}^T a_t^{n/2}\right) e^{-\eta \sum_{t=1}^T a_t \|\theta_t\|^2} \, d\theta_1 \cdots d\theta_T = \left(\frac{\eta}{\pi}\right)^{nT/2} \left(\prod_{t=1}^T a_t^{n/2}\right) e^{-\eta \theta' A \theta} \, d\theta, \]
where, letting I and 0 be as above, we have
\[ A = \begin{pmatrix} a_1 I & 0 & \cdots & 0 \\ 0 & a_2 I & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & a_T I \end{pmatrix}. \]
The loss of θ over the first T trials is
\[ L_T(\theta) = \sum_{t=1}^T (y_t - z_t' L \theta)^2 = \theta' L' \left(\sum_{t=1}^T z_t z_t'\right) L \theta - 2 \sum_{t=1}^T y_t z_t' L \theta + \sum_{t=1}^T y_t^2. \]

¹ The sum θ_1 + ... + θ_t corresponds to the predictor u_t in [7].

Therefore, the loss of the APA is (recall that β = e^{-η})
\[ L_T(\mathrm{APA}) = \log_\beta \int_{R^{nT}} \beta^{L_T(\theta)} P_0(d\theta) \]
\[ = \log_\beta \left[ \left(\frac{\eta}{\pi}\right)^{nT/2} \prod_{t=1}^T a_t^{n/2} \int_{R^{nT}} e^{-\eta \left( \theta' L' \left(\sum_{t=1}^T z_t z_t'\right) L \theta - 2 \sum_{t=1}^T y_t z_t' L \theta + \sum_{t=1}^T y_t^2 + \theta' A \theta \right)} \, d\theta \right] \]
\[ = \log_\beta \left[ \left(\frac{\eta}{\pi}\right)^{nT/2} \prod_{t=1}^T a_t^{n/2} \int_{R^{nT}} e^{-\eta \theta' \left( L' \left(\sum_{t=1}^T z_t z_t'\right) L + A \right) \theta + 2\eta \sum_{t=1}^T y_t z_t' L \theta - \eta \sum_{t=1}^T y_t^2} \, d\theta \right]. \]

Given the generalised prediction g_T(ω), which is the APA's loss with a variable ω ∈ R replacing y_T, and using substitution function (3), the AA's prediction is
\[ \gamma_T = \frac{g_T(-Y) - g_T(Y)}{4Y} = \frac{1}{4Y} \log_\beta \frac{\int_{R^{nT}} e^{-\eta \theta' \left( L' \left(\sum_{t=1}^T z_t z_t'\right) L + A \right) \theta + 2\eta \left( \sum_{t=1}^{T-1} y_t z_t' L - Y z_T' L \right) \theta} \, d\theta}{\int_{R^{nT}} e^{-\eta \theta' \left( L' \left(\sum_{t=1}^T z_t z_t'\right) L + A \right) \theta + 2\eta \left( \sum_{t=1}^{T-1} y_t z_t' L + Y z_T' L \right) \theta} \, d\theta}. \]
Let
\[ Q_1(\theta) = \theta' \left( L' \left(\sum_{t=1}^T z_t z_t'\right) L + A \right) \theta - 2 \left( \sum_{t=1}^{T-1} y_t z_t' L - Y z_T' L \right) \theta, \quad \text{and} \]
\[ Q_2(\theta) = \theta' \left( L' \left(\sum_{t=1}^T z_t z_t'\right) L + A \right) \theta - 2 \left( \sum_{t=1}^{T-1} y_t z_t' L + Y z_T' L \right) \theta. \]
By Lemma 1,
\[ \gamma_T = \frac{1}{4Y} \log_\beta \frac{e^{-\eta \min_{\theta \in R^{nT}} Q_1(\theta)}}{e^{-\eta \min_{\theta \in R^{nT}} Q_2(\theta)}} = \frac{1}{4Y} \left( \min_{\theta \in R^{nT}} Q_1(\theta) - \min_{\theta \in R^{nT}} Q_2(\theta) \right). \]
Finally, by using Lemma 2 we get
\[ \gamma_T = \frac{1}{4Y} F\left( L' \left(\sum_{t=1}^T z_t z_t'\right) L + A, \; -2 \sum_{t=1}^{T-1} y_t z_t' L, \; 2Y z_T' L \right) = \sum_{t=1}^{T-1} y_t z_t' L \left( L' \left(\sum_{t=1}^T z_t z_t'\right) L + A \right)^{-1} L' z_T. \tag{4} \]
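To make the block-matrix bookkeeping in (4) concrete, the following numpy sketch (our own illustration) constructs L, the padded signals z_t stacked as the rows of Z, and A explicitly; it is far less efficient than the dual form derived next.

import numpy as np

def aarch_predict_primal(X, y_past, a):
    """AARCh prediction (4) at time T.

    X: (T, n) array whose rows are x_1', ..., x_T'; y_past: (T-1,) past outcomes;
    a: sequence of T positive constants a_1, ..., a_T.
    """
    T, n = X.shape
    L = np.kron(np.tril(np.ones((T, T))), np.eye(n))              # lower triangular block matrix
    A = np.kron(np.diag(np.asarray(a, dtype=float)), np.eye(n))   # block diagonal a_t I
    Z = np.zeros((T, n * T))
    for t in range(T):
        Z[t, t * n:(t + 1) * n] = X[t]                            # z_t' is row t of Z
    y_tilde = np.append(y_past, 0.0)
    M = L.T @ Z.T @ Z @ L + A                                     # L' (sum_t z_t z_t') L + A
    return y_tilde @ Z @ L @ np.linalg.solve(M, L.T @ Z[-1])      # formula (4)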

3.2 AARCh: Dual Form

Let us define
\[ Z = \begin{pmatrix} z_1' \\ z_2' \\ \vdots \\ z_T' \end{pmatrix}, \qquad A = \begin{pmatrix} a_1 I & 0 & \cdots & 0 \\ 0 & a_2 I & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & a_T I \end{pmatrix}, \qquad \text{and} \qquad \tilde{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_{T-1} \\ 0 \end{pmatrix}. \]
We can rewrite (4) in matrix notation to get
\[ \gamma_T = \tilde{y}' Z L \left( L' Z' Z L + A \right)^{-1} L' z_T = \tilde{y}' Z L A^{-1/2} \left( A^{-1/2} L' Z' Z L A^{-1/2} + I \right)^{-1} A^{-1/2} L' z_T. \]
We can now get a dual formulation of this by using Lemma 3:
\[ \gamma_T = \tilde{y}' \left( Z L A^{-1} L' Z' + I \right)^{-1} Z L A^{-1} L' z_T. \tag{5} \]

3.3 KAARCh

Since in (5) signals appear only in dot products, we can use the kernel trick to introduce nonlinearity. In this case we get Kernel AARCh (KAARCh) that at time T makes a prediction
\[ \gamma_T = \tilde{y}' (\tilde{K} + I)^{-1} \tilde{k}, \]
where
\[ \tilde{K} = \left( \sum_{t=1}^{\min(i,j)} \frac{1}{a_t} \, k(x_i, x_j) \right)_{i,j}, \qquad i, j = 1, \ldots, T, \]
i.e.
\[ \tilde{K} = \begin{pmatrix} \frac{1}{a_1} k(x_1, x_1) & \frac{1}{a_1} k(x_1, x_2) & \cdots & \frac{1}{a_1} k(x_1, x_T) \\ \frac{1}{a_1} k(x_2, x_1) & \left(\frac{1}{a_1} + \frac{1}{a_2}\right) k(x_2, x_2) & \cdots & \left(\frac{1}{a_1} + \frac{1}{a_2}\right) k(x_2, x_T) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{a_1} k(x_T, x_1) & \left(\frac{1}{a_1} + \frac{1}{a_2}\right) k(x_T, x_2) & \cdots & \left(\frac{1}{a_1} + \cdots + \frac{1}{a_T}\right) k(x_T, x_T) \end{pmatrix}, \]
and
\[ \tilde{k}_i = \sum_{t=1}^{i} \frac{1}{a_t} \, k(x_i, x_T), \qquad i = 1, \ldots, T, \]
i.e.
\[ \tilde{k} = \begin{pmatrix} \frac{1}{a_1} k(x_1, x_T) \\ \left(\frac{1}{a_1} + \frac{1}{a_2}\right) k(x_2, x_T) \\ \vdots \\ \left(\frac{1}{a_1} + \cdots + \frac{1}{a_T}\right) k(x_T, x_T) \end{pmatrix}. \]
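A minimal numpy sketch of the KAARCh prediction above; kernel can be any kernel function (for instance, the RBF example sketched in Sect. 2.2), and the function name and argument layout are our own.

import numpy as np

def kaarch_predict(kernel, X, y_past, a):
    """KAARCh prediction gamma_T = y~' (K~ + I)^{-1} k~ at time T.

    kernel: a kernel function k(x, z); X: list of the T signals x_1, ..., x_T
    (the last one being x_T); y_past: the T-1 past outcomes; a: T positive constants.
    """
    T = len(X)
    coeff = np.cumsum(1.0 / np.asarray(a, dtype=float))      # coeff[s] = sum_{t <= s+1} 1/a_t
    K = np.array([[coeff[min(i, j)] * kernel(X[i], X[j])
                   for j in range(T)] for i in range(T)])     # the matrix K~ above
    k = np.array([coeff[i] * kernel(X[i], X[-1]) for i in range(T)])  # the vector k~
    y_tilde = np.append(y_past, 0.0)
    return float(y_tilde @ np.linalg.solve(K + np.eye(T), k))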

4 Upper Bounds

In this section we use the Aggregating Algorithm's properties to derive upper bounds on the cumulative square loss suffered by AARCh and KAARCh, compared to that of any expert in the pool.

4.1 AARCh Loss Upper Bound

Theorem 3. For any point in time T and any a_t > 0, t = 1, ..., T,
\[ L_T(\mathrm{AARCh}) \le \inf_\theta \left( L_T(\theta) + \sum_{t=1}^T a_t \|\theta_t\|^2 \right) + Y^2 \ln\det\left( A^{-1/2} L' \left(\sum_{t=1}^T z_t z_t'\right) L A^{-1/2} + I \right). \tag{6} \]

Proof. Given the Aggregating Algorithm's properties, we know that
\[ L_T(\mathrm{AARCh}) \le \log_\beta \int_{R^{nT}} \beta^{L_T(\theta)} P_0(d\theta) = \log_\beta \left[ \left(\frac{\eta}{\pi}\right)^{nT/2} \prod_{t=1}^T a_t^{n/2} \int_{R^{nT}} e^{-\eta \left( \theta' \left( L' \left(\sum_{t=1}^T z_t z_t'\right) L + A \right) \theta - 2 \sum_{t=1}^T y_t z_t' L \theta + \sum_{t=1}^T y_t^2 \right)} \, d\theta \right]. \]
By Lemma 1 this is equal to
\[ \inf_\theta \left( L_T(\theta) + \theta' A \theta \right) + \log_\beta \left[ \left(\frac{\eta}{\pi}\right)^{nT/2} \prod_{t=1}^T a_t^{n/2} \frac{\pi^{nT/2}}{\sqrt{\det\left( \eta L' \left(\sum_{t=1}^T z_t z_t'\right) L + \eta A \right)}} \right] \]
\[ = \inf_\theta \left( L_T(\theta) + \sum_{t=1}^T a_t \|\theta_t\|^2 \right) + \log_\beta \frac{\eta^{nT/2} \prod_{t=1}^T a_t^{n/2}}{\sqrt{\det\left( \eta L' \left(\sum_{t=1}^T z_t z_t'\right) L + \eta A \right)}} \]
\[ = \inf_\theta \left( L_T(\theta) + \sum_{t=1}^T a_t \|\theta_t\|^2 \right) + \frac{1}{2} \log_\beta \frac{\prod_{t=1}^T a_t^{n}}{\det\left( L' \left(\sum_{t=1}^T z_t z_t'\right) L + A \right)} \]
\[ = \inf_\theta \left( L_T(\theta) + \sum_{t=1}^T a_t \|\theta_t\|^2 \right) - \frac{1}{2} \log_\beta \det\left( A^{-1/2} L' \left(\sum_{t=1}^T z_t z_t'\right) L A^{-1/2} + I \right) \]
\[ = \inf_\theta \left( L_T(\theta) + \sum_{t=1}^T a_t \|\theta_t\|^2 \right) + Y^2 \ln\det\left( A^{-1/2} L' \left(\sum_{t=1}^T z_t z_t'\right) L A^{-1/2} + I \right). \quad \Box \]

4.2 KAARCh Loss Upper Bound

The following generalises Theorem 3. Note that we cannot repeat the proof for the linear case directly since it involves the evaluation of an integral over the space R^{nT}.

Theorem 4. Let k be a kernel on a space X, let D_t, t = 1, ..., T, be any decision rules in the RKHS F induced by k, and let D = (D_1, D_2, ..., D_T). Then, for any point in time T and every a_t > 0, t = 1, ..., T,
\[ L_T(\mathrm{KAARCh}) \le L_T(D) + \sum_{t=1}^T a_t \|D_t\|^2 + Y^2 \ln\det(\tilde{K} + I). \tag{7} \]

Proof. It will be sufficient to prove this for D_t of the form
\[ f_t(x) = \sum_{i=1}^{l_t} c_i^{(t)} k\left(v_i^{(t)}, x\right), \]
where l_t are positive integers, c_i^{(t)} ∈ R, and v_i^{(t)}, x ∈ X (we use (t) to show that these parameters can be different for each f_t). This is because such finite sums are dense in the RKHS F. If we take f = (f_1, f_2, ..., f_T), then (7) becomes
\[ L_T(\mathrm{KAARCh}) \le L_T(f) + \sum_{t=1}^T a_t \sum_{i,j=1}^{l_t} c_i^{(t)} c_j^{(t)} k\left(v_i^{(t)}, v_j^{(t)}\right) + Y^2 \ln\det(\tilde{K} + I), \tag{8} \]
where
\[ L_T(f) = \sum_{t=1}^T \left( y_t - \sum_{s=1}^{t} f_s(x_t) \right)^2. \]

In the special case when X = R^n and k(v_i, v_j) = v_i'v_j for every v_i, v_j ∈ X, (8) follows directly from (6). Indeed, a kernel predictor f_t reduces to the linear predictor θ_t = Σ_{i=1}^{l_t} c_i^{(t)} v_i^{(t)}, and the term Σ_{i,j=1}^{l_t} c_i^{(t)} c_j^{(t)} k(v_i^{(t)}, v_j^{(t)}) equals the squared quadratic norm of θ_t. Finally, by Sylvester's determinant identity (see also Lemma 4 for an independent proof of this) we know that
\[ \det(\tilde{K} + I) = \det\left( Z L A^{-1} L' Z' + I \right) = \det\left( A^{-1/2} L' Z' Z L A^{-1/2} + I \right). \]

The general case can be obtained by using finite-dimensional approximations. Recall that inherent in every kernel is a function φ that maps objects to the RKHS F, which is isomorphic to l_2 = {α = (α_1, α_2, ...) : Σ_{i=1}^∞ α_i^2 converges}. Let us consider the sequence of subspaces R^1 ⊆ R^2 ⊆ ... ⊆ F. The set R^s = {(α_1, α_2, ..., α_s, 0, 0, ...)} may be identified with R^s. Let p_s : F → R^s be the projection operator p_s(α) = (α_1, α_2, ..., α_s, 0, 0, ...), φ_s : X → R^s be φ_s = p_s φ, and k_s be given by k_s(v_1, v_2) = ⟨φ_s(v_1), φ_s(v_2)⟩, where v_1, v_2 ∈ X. Inequality (8) holds for k_s since R^s has a finite dimension. If (8) is violated, then its counterpart with some large s is violated too, and this observation completes the proof. □

5 Discussion

In this section we shall analyse upper bound (7) in order to obtain an equivalent of Corollary 1. Our goal is to show that KAARCh's cumulative loss is less than or equal to that of a wide class of experts plus a term of the order o(T).

Estimating the determinant of a positive definite matrix by the product of its diagonal elements (see [15, Sect. 2.10, Theorem 7]) and using the inequality ln(1 + x) ≤ x (in our case x is small, and therefore the resulting bound is tight), we get
\[ Y^2 \ln\det(\tilde{K} + I) \le Y^2 \sum_{t=1}^T \ln\left( 1 + c^2 \sum_{i=1}^t \frac{1}{a_i} \right) \le Y^2 c^2 \sum_{t=1}^T \sum_{i=1}^t \frac{1}{a_i} = Y^2 c^2 \sum_{t=1}^T \frac{T - t + 1}{a_t}, \]
where c = sup_{x∈X} √(k(x, x)).

It is natural to single out the first decision rule D_1 and the corresponding coefficient a_1 from the rest. We may think of it as corresponding to the choice of the principal dependency; let the rest of the D_t (t = 2, ..., T) be small correction terms. Let us take equal a_2 = ... = a_T = a. We get
\[ L_T(\mathrm{KAARCh}) \le L_T(D) + \left( a_1 \|D_1\|^2 + \frac{Y^2 c^2 T}{a_1} \right) + \left( a \sum_{t=2}^T \|D_t\|^2 + \frac{Y^2 c^2 T (T-1)}{2a} \right). \tag{9} \]
If we bound the norm of D_1 by d and assume that T is known in advance, a_1 may be chosen as in Corollary 1. The second term on the right-hand side of (9) can thus be bounded by O(√T). If we assume that Σ_{t=2}^T ‖D_t‖^2 ≤ s(T), then the estimate is minimised by a = Yc√(T(T-1)/(2s(T))) and the third term on the right-hand side of (9) can be bounded by O(T√(s(T))). We therefore get the following corollary:

Corollary 2. Under the conditions of Theorem 4, let T be known in advance and c = sup_{x∈X} √(k(x, x)). For every d > 0 and every function s(T), if ‖D_1‖ ≤ d and Σ_{t=2}^T ‖D_t‖^2 ≤ s(T), then a_t, for t = 1, ..., T, can be chosen so that
\[ L_T(\mathrm{KAARCh}) \le L_T(D) + 2Ycd\sqrt{T} + 2Yc\sqrt{\frac{s(T)\, T(T-1)}{2}}. \]
If s(T) = o(1), then
\[ L_T(\mathrm{KAARCh}) \le L_T(D) + o(T). \]

The estimate s(T) = o(1) can be achieved in two natural ways. First, one can assume that each ‖D_t‖, for t = 2, ..., T, is small.

Corollary 3. Under the conditions of Theorem 4, let T be known in advance. For every positive d, d̃, and ε, if ‖D_1‖ ≤ d and, for t = 2, ..., T,
\[ \|D_t\| \le \frac{\tilde{d}}{T^{0.5+\varepsilon}}, \]
then
\[ L_T(\mathrm{KAARCh}) \le L_T(D) + O\left( T^{\max(0.5,\, 1-\varepsilon)} \right) = L_T(D) + o(T). \]

Secondly, one may assume that there are only a few nonzero D_t, for t = 2, ..., T. In this case, the nonzero D_t can have greater flexibility.

Acknowledgements. We thank Volodya Vovk and Alex Gammerman for valuable discussions. We are grateful to Michael Vyugin for suggesting the problem of predicting implied volatility which inspired this work.

References

1. Vovk, V.: Aggregating strategies. In Fulk, M., Case, J., eds.: Proceedings of the 3rd Annual Workshop on Computational Learning Theory, Morgan Kaufmann (1990) 371-383
2. Vovk, V.: A game of prediction with expert advice. Journal of Computer and System Sciences 56 (1998) 153-173
3. Vovk, V.: Competitive on-line statistics. International Statistical Review 69(2) (2001) 213-248
4. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press (2006)
5. Gammerman, A., Kalnishkan, Y., Vovk, V.: On-line prediction with kernels and the complexity approximation principle. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, AUAI Press (2004) 170-176
6. Vovk, V.: On-line regression competitive with reproducing kernel Hilbert spaces. Technical Report arXiv:cs.LG/0511058 (version 2), arXiv.org (2006)
7. Herbster, M., Warmuth, M.K.: Tracking the best linear predictor. Journal of Machine Learning Research 1 (2001) 281-309

8. Kivinen, J., Smola, A.J., Williamson, R.C.: Online learning with kernels. IEEE Transactions on Signal Processing 52(8) (2004) 2165-2176
9. Cavallanti, G., Cesa-Bianchi, N., Gentile, C.: Tracking the best hyperplane with a simple budget perceptron. Machine Learning (to appear)
10. Busuttil, S., Kalnishkan, Y.: Weighted kernel regression for predicting changing dependencies. In: Proceedings of the 18th European Conference on Machine Learning (ECML 2007) (to appear)
11. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, UK (2000)
12. Aizerman, M., Braverman, E., Rozonoer, L.: Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25 (1964) 821-837
13. Aronszajn, N.: Theory of reproducing kernels. Transactions of the American Mathematical Society 68 (1950) 337-404
14. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. The MIT Press, USA (2002)
15. Beckenbach, E.F., Bellman, R.: Inequalities. Springer (1961)

Appendix

Lemma 1. Let Q(θ) = θ'Aθ + b'θ + c, where θ, b ∈ R^n, c is a scalar and A is a symmetric positive definite n × n matrix. Then
\[ \int_{R^n} e^{-Q(\theta)} \, d\theta = e^{-Q_0} \frac{\pi^{n/2}}{\sqrt{\det A}}, \]
where Q_0 = min_{θ∈R^n} Q(θ).

Proof. Let θ_0 ∈ arg min Q. Take ξ = θ - θ_0 and Q̃(ξ) = Q(ξ + θ_0). It is easy to see that the quadratic part of Q̃ is ξ'Aξ. Since 0 ∈ arg min Q̃, the form has no linear term. Indeed, in the vicinity of 0 the linear term dominates over the quadratic term; if Q̃ has a non-zero linear term, it cannot have a minimum at 0. Since Q_0 = min_{ξ∈R^n} Q̃(ξ), we can conclude that the constant term in Q̃ is Q_0. Thus Q̃(ξ) = ξ'Aξ + Q_0. It remains to show that ∫_{R^n} e^{-ξ'Aξ} dξ = π^{n/2}/√(det A). This can be proved by considering a basis where A diagonalises, or see [15, Sect. 2.7, Theorem 3]. □

Lemma 2. Let
\[ F(A, b, x) = \min_{\theta \in R^n} \left( \theta'A\theta + b'\theta + x'\theta \right) - \min_{\theta \in R^n} \left( \theta'A\theta + b'\theta - x'\theta \right), \]
where b, x ∈ R^n and A is a symmetric positive definite n × n matrix. Then F(A, b, x) = -b'A^{-1}x.

Proof. It can be shown by differentiation that the first minimum is achieved at θ_1 = -(1/2)A^{-1}(b + x) and the second minimum at θ_2 = -(1/2)A^{-1}(b - x). The substitution proves the lemma. □
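A quick numerical sanity check of Lemma 2 on a random instance (our own illustration), using the closed-form minimum min_θ (θ'Aθ + c'θ) = -c'A^{-1}c/4 of a quadratic with positive definite A:

import numpy as np

rng = np.random.default_rng(1)
n = 4
B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)               # symmetric positive definite
b, x = rng.normal(size=n), rng.normal(size=n)

def quad_min(c):
    """min_theta theta' A theta + c' theta, attained at theta = -A^{-1} c / 2."""
    return -0.25 * c @ np.linalg.solve(A, c)

lhs = quad_min(b + x) - quad_min(b - x)   # F(A, b, x)
rhs = -b @ np.linalg.solve(A, x)
assert np.isclose(lhs, rhs)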

Lemma 3. Given a matrix A, a scalar a and identity matrices I of the appropriate size,
\[ (A'A + aI)^{-1} A' = A' (AA' + aI)^{-1}. \]

Proof.
\[ (A'A + aI)^{-1} A' = (A'A + aI)^{-1} A' (AA' + aI)(AA' + aI)^{-1} = (A'A + aI)^{-1} (A'AA' + aA')(AA' + aI)^{-1} \]
\[ = (A'A + aI)^{-1} (A'A + aI) A' (AA' + aI)^{-1} = A' (AA' + aI)^{-1}. \quad \Box \]

Lemma 4. For every matrix M the equality det(I + M'M) = det(I + MM') holds, where I are identity matrices of the correct size.

Proof. Suppose that M is an n × m matrix. Thus I + MM' and I + M'M are n × n and m × m matrices respectively. Without loss of generality, we may assume that n ≥ m (otherwise we swap M and M'). Let the columns of M be the m vectors x_1, ..., x_m ∈ R^n. We have MM' = Σ_{i=1}^m x_i x_i'.

Let us see how the operator MM' acts on a vector x ∈ R^n. By associativity, (x_i x_i')x = (x_i'x)x_i, where x_i'x is a scalar. Therefore, if U is the span of x_1, x_2, ..., x_m, then MM'R^n ⊆ U. In a similar way, it follows that (I + MM')U ⊆ U. On the other hand, if x is orthogonal to all the x_i, then (x_i x_i')x = (x_i'x)x_i = 0. Hence MM'|_{U^⊥} = 0, where U^⊥ is the orthogonal complement to U with respect to R^n. Consequently, (I + MM')|_{U^⊥} = I (by B|_V we denote the restriction of an operator B to a subspace V). Therefore (I + MM')U^⊥ ⊆ U^⊥.

One can see that both U and U^⊥ are invariant subspaces of I + MM'. If we choose bases in U and in U^⊥ and then concatenate them, we get a basis of R^n. In this basis the matrix of I + MM' has the form
\[ \begin{pmatrix} A & 0 \\ 0 & I \end{pmatrix}, \]
where A is the matrix of (I + MM')|_U. It remains to evaluate det A.

First let us consider the case of linearly independent x_1, x_2, ..., x_m. They form a basis of U and we may use it to calculate the determinant of the operator (I + MM')|_U. However, (I + MM')x_i = x_i + Σ_{j=1}^m (x_j'x_i)x_j, and thus the matrix of the operator (I + MM')|_U in the basis x_1, x_2, ..., x_m is I + M'M. The case of linearly dependent x_1, x_2, ..., x_m follows by continuity. Indeed, m vectors in an n-dimensional space with n ≥ m may be approximated by m independent vectors to any degree of precision, and the determinant is a continuous function of the elements of a matrix. □
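Finally, a quick numerical check of Lemmas 3 and 4 on a random rectangular matrix (our own illustration):

import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(5, 3))                  # n = 5, m = 3
a = 0.7

# Lemma 3: (M'M + aI)^{-1} M' = M' (MM' + aI)^{-1}
lhs = np.linalg.solve(M.T @ M + a * np.eye(3), M.T)
rhs = M.T @ np.linalg.inv(M @ M.T + a * np.eye(5))
assert np.allclose(lhs, rhs)

# Lemma 4: det(I + M'M) = det(I + MM')
assert np.isclose(np.linalg.det(np.eye(3) + M.T @ M),
                  np.linalg.det(np.eye(5) + M @ M.T))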