Finite-Memory Universal Prediction of Individual Continuous Sequences


THE IBY AND ALADAR FLEISCHMAN FACULTY OF ENGINEERING
THE ZANDMAN-SLANER SCHOOL OF GRADUATE STUDIES
THE SCHOOL OF ELECTRICAL ENGINEERING

Finite-Memory Universal Prediction of Individual Continuous Sequences

Thesis submitted toward the degree of Master of Science in Electrical and Electronic Engineering

by Ronen Dar

June, 2011

THE IBY AND ALADAR FLEISCHMAN FACULTY OF ENGINEERING
THE ZANDMAN-SLANER SCHOOL OF GRADUATE STUDIES
THE SCHOOL OF ELECTRICAL ENGINEERING

Finite-Memory Universal Prediction of Individual Continuous Sequences

Thesis submitted toward the degree of Master of Science in Electrical and Electronic Engineering

by Ronen Dar

This research was carried out at the School of Electrical Engineering, Tel-Aviv University.

Advisor: Prof. Meir Feder

June, 2011

Acknowledgements

I would like to offer my sincerest gratitude to my advisor, Prof. Meir Feder, for his great support. His enthusiasm and inspiration made this thesis possible. The work with Meir was a pure delight, making me a better student and researcher. I would also like to thank the Yitzhak and Chaya Weinstein Research Institute for Signal Processing for its generous support.

Abstract

In this work we consider the problem of universal prediction of individual continuous sequences with square-error loss, using a deterministic finite-state machine (FSM). The goal is to attain universally the performance of the best constant predictor tuned to the sequence, which predicts the empirical mean and incurs the empirical variance as the loss. This work analyzes the tradeoff between the number of states of the universal FSM and the excess loss (regret). We first present the Exponential Decaying Memory (EDM) machine, used in the past for predicting binary sequences, and prove bounds on its performance. Then we look explicitly for the optimal machine with a small number of states. We consider a class of machines denoted the Degenerated Tracking Memory (DTM) machines, find the optimal DTM machine and show that it outperforms any other machine for a small number of states. In addition, we prove a lower bound indicating that even with a large number of states the regret of the DTM machines does not vanish. Finally, we study the problem of universal machines with a large number of states. We prove a lower bound on the achievable regret of any FSM, a lower bound that defines the best rate at which the regret can vanish with the number of states. Then we propose a new machine, the Enhanced Exponential Decaying Memory, which attains the bound and outperforms the EDM machine for any number of states.

Contents

1 Introduction
2 Previous Results
  2.1 Prediction in the Continuous Case
  2.2 Prediction in the Binary Case
3 Problem Formulation
  3.1 Definitions
  3.2 Guidelines
4 Exponential Decaying Memory Machine
  4.1 EDM Predictor for Individual Continuous Sequences
  4.2 Bounds on the Regret of the EDM Machine
5 Designing an Optimal Machine with Small Number of States
  5.1 Single State
  5.2 Two States
  5.3 Three States
  5.4 Degenerated Tracking Memory Machine
    5.4.1 Introduction to the Class of DTM Machines
    5.4.2 Building the Optimal DTM Machine
    5.4.3 Lower Bound - The Limitation of the DTM Machines
    5.4.4 Numerical Results
6 On the Asymptotic Behavior of Finite-Memory Tracking Machines
  6.1 Lower Bound on the Achievable Regret of Any k-States Machine
  6.2 Enhanced Exponential Decaying Memory Machine
    6.2.1 Designing the E-EDM Machine
    6.2.2 The Regret of the E-EDM Machine
    6.2.3 Number of States in the E-EDM Machine
  6.3 Numerical Results
7 Summary and Conclusions
A Lower Bound on the Maximal Regret of the EDM Machine
B Proof of Lemma

List of Figures

4.1 Down sub-steps allocation
5.1 Two states machine
5.2 Optimal two states machine
5.3 Three states machine
5.4 Optimal three states machine
5.5 Example of a DTM machine
5.6 DTM machine minimal circles
5.7 Regret vs. number of states - DTM and EDM machines
5.8 Low number of states - DTM machine performance
6.1 E-EDM machine - spacing gap
6.2 Mixed - straight sequences
6.3 Worst case spacing loss - an example
6.4 Regret vs. number of states - E-EDM, EDM machines and the lower bound
B.1 Spacing gap in the E-EDM machine
B.2 Minimal circle between segments - splitting a down-step

Chapter 1
Introduction

The subject of this thesis is universal prediction of individual continuous sequences using a deterministic finite-state machine, w.r.t. the square-error loss, compared to the class of constant predictors. The goal is to attain universally the performance of the best constant predictor tuned to the sequence, which predicts the empirical mean and incurs the empirical variance as the loss.

Consider a continuous-valued individual sequence $x_1, x_2, \ldots, x_t, \ldots$, where each sample is assumed to be bounded in the interval $[a, b]$ but otherwise arbitrary, with no underlying statistics. At each time $t$, after observing $x_1^t$, a predictor guesses the next outcome $\hat{x}_{t+1}$ and incurs a square-error prediction loss $(x_{t+1} - \hat{x}_{t+1})^2$. Suppose one can tune a (non-universal) predictor to the sequence, from a given class of predictors. For example, the best constant predictor for a given sequence, i.e. a predictor that uses a constant prediction for all the sequence outcomes, is the empirical mean $\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t$. The square-error loss incurred by this predictor is the sequence's empirical variance $\frac{1}{n}\sum_{t=1}^{n}(x_t - \bar{x})^2$. Thus, for a given sequence $x_1^n$, the excess loss of a universal predictor over the best constant predictor, termed the regret of the sequence, is
$$\mathrm{regret} = \frac{1}{n}\sum_{t=1}^{n}(x_t - \hat{x}_t)^2 - \frac{1}{n}\sum_{t=1}^{n}(x_t - \bar{x})^2.$$
In the individual setting, we analyze the performance of a universal predictor by the maximal excess loss, that is, the regret incurred on the worst sequence.

It was shown that the Recursive Least Squares (RLS) algorithm generates a universal predictor that attains the performance of the best (non-universal) L-order linear predictor tuned to the sequence. When specialized to the zero-order case, i.e. the case where the non-universal predictor is the constant empirical-mean predictor, the resulting universal predictor is the Cumulative Moving Average (CMA):
$$\hat{x}_{t+1} = \left(1 - \frac{1}{t+1}\right)\hat{x}_t + \frac{1}{t+1}x_t,$$
where $\hat{x}_t$ is the prediction at time $t$. The regret of this predictor tends to zero with the sequence length $n$.

Note that while the reference non-universal constant predictor needs a single state, the universal CMA predictor requires an ever-growing amount of memory. What happens when the universal predictor is constrained to be a finite k-state machine? This basic problem was left unexplored so far. Our work provides a solution to the problem, presenting universal predictors that attain a vanishing regret when a large memory is allowed, while maintaining an optimal tradeoff between the regret and the number of states used by the universal predictor.

In this thesis we present the following main results:

- We propose the Exponential Decaying Memory (EDM) machine, presented in the past for predicting individual binary sequences, as a finite-memory universal predictor for continuous sequences. We prove bounds on its worst-case regret in the continuous case.
- A new class of machines named the Degenerated Tracking Memory (DTM) machines is defined, and a schematic algorithm for constructing the optimal DTM machine is presented. We prove a lower bound on the achievable regret of the optimal DTM machine and show that it outperforms any other known deterministic machine for a small number of states.
- A lower bound on the achievable regret of any deterministic k-state predictor is presented.
- A new finite-state machine, termed the Enhanced Exponential Decaying Memory (E-EDM) machine, is presented for any desired vanishing regret. We prove that this machine outperforms any other known machine and approaches the lower bound.
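To make the setting concrete, the following short Python sketch (our own illustration, not part of the thesis) implements the CMA predictor given above and computes the regret of a sequence against the best constant (empirical-mean) predictor; the function names and the initial prediction are assumptions of ours.

```python
import numpy as np

def cma_predictions(x, x_hat0=0.5):
    """Cumulative Moving Average predictor:
    x_hat_{t+1} = (1 - 1/(t+1)) * x_hat_t + (1/(t+1)) * x_t."""
    x = np.asarray(x, dtype=float)
    preds = np.empty_like(x)
    x_hat = x_hat0                            # arbitrary initial prediction in [0, 1]
    for t, sample in enumerate(x):
        preds[t] = x_hat                      # prediction issued before seeing x_t
        x_hat = (1 - 1.0 / (t + 1)) * x_hat + sample / (t + 1)
    return preds

def regret(x, preds):
    """Excess square-error loss over the best constant predictor (the empirical mean)."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - preds) ** 2) - np.mean((x - x.mean()) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=10_000)    # an arbitrary bounded individual sequence
    print(regret(x, cma_predictions(x)))      # tends to 0 as n grows
```

Note that the CMA recursion needs the running time index $t$, which is exactly the unbounded memory that a finite-state machine cannot afford.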

Chapter 2
Previous Results

Universal prediction has been studied extensively throughout the years. A comprehensive overview of universal prediction in general is given in [10]. In this chapter we present the results most relevant to the problem of universal prediction for individual sequences.

2.1 Prediction in the Continuous Case

Some aspects of the universal prediction problem for individual continuous sequences with square-error loss were already explored by Merhav and Feder in [9]. That work actually considered a more general case and showed that the Recursive Least Squares (RLS) algorithm [2], [3] generates a universal predictor that attains the performance of the best (non-universal) L-order linear predictor [8] tuned to the sequence. When specialized to the zero-order case, i.e. the case where the non-universal predictor is the constant empirical-mean predictor, the resulting universal predictor is the Cumulative Moving Average (CMA):
$$\hat{x}_{t+1} = \left(1 - \frac{1}{t+1}\right)\hat{x}_t + \frac{1}{t+1}x_t, \qquad (2.1)$$
where $\hat{x}_t$ is the prediction at time $t$. The regret of this predictor tends to zero with the sequence length $n$. Note that while the reference non-universal constant predictor needs a single state, the universal predictor (2.1) requires an ever-growing amount of memory. What happens when the universal predictor is constrained to be a finite k-state machine?

Universal estimation and prediction problems where the estimator/predictor is a k-state machine have been explored extensively in past years. Cover [1] studied the hypothesis testing problem where the tester has a finite memory. Hellman [4] studied the problem of estimating the mean of a Gaussian (or, more generally, stochastic) sequence using a finite-state machine. This problem is closely related to our problem and may be considered a stochastic version of it: if one assumes that the data is Gaussian, then predicting it with minimal mean square error essentially boils down to estimating its mean. More recently, the finite-memory universal portfolio selection problem (which deals with continuous-valued sequences but considers a very particular loss function) was also explored [16].

Yet, the basic problem of finite-memory universal prediction of continuous-valued individual sequences with square-error loss was left unexplored so far.

2.2 Prediction in the Binary Case

The case of universal prediction of binary individual sequences using finite-state machines was explored in the last decade. The first deterministic machine for the individual binary sequences case w.r.t. the log-loss function was presented by Rajwan and Feder [15]. The presented machine counts the number of ones and resets itself after a predefined number of input bits. It was shown that this counter-and-reset machine achieves redundancy of $O\!\left(\frac{\log k}{k}\right)$ compared to the constant predictors class, where $k$ is the number of states. In [13] Meron and Feder lower bounded the redundancy of any k-state deterministic FSM by $O\!\left(\frac{1}{k}\right)$. Later [12], a tighter lower bound of $O(k^{-4/5})$ was presented.

Another important milestone in finding the optimal deterministic FSM for the binary case was achieved by Meron and Feder in [11]: the Exponential Decaying Memory (EDM) machine was presented. The EDM machine consists of $k$ states uniformly distributed over the probability axis [0, 1]. The machine predicts at each time the probability assigned to the current state, while the transition between states is as follows:
$$\hat{x}_{t+1} = Q\left((1 - k^{-2/3})\hat{x}_t + k^{-2/3}x_t\right), \qquad (2.2)$$
where $\hat{x}_t$ is the current state, $x_t$ is the input bit and $Q(\cdot)$ is the quantization function to the nearest state. It was shown that the EDM machine achieves asymptotic redundancy of $O(k^{-2/3})$.

In [6], Ingber and Feder confront the question "what is the optimal deterministic FSM for the non-asymptotic case?". The semi-counter machine was defined: a symmetric array of $k$ states $\{S_1, \ldots, S_k\}$. At each time the machine predicts the value assigned to the current state. For the lower half of the states, i.e. for $i = 1, \ldots, M = k/2$, the transition function is
$$\varphi(i, 0) = i - 1 \;\; (1 < i \le M), \qquad \varphi(1, 0) = 1, \qquad \varphi(i, 1) = i + \delta_i \;\; (1 \le i \le M), \qquad (2.3)$$
where in a q semi-counter $0 < \delta_i \le q$. That is, if at time $t$ the machine is at state $i$ and the input sample is $x$, the next state is $\varphi(i, x)$. The transition function for the upper half of the states is defined symmetrically. A method was given for finding the optimal transition length and the optimal probability assignment for each state. It was also proved that the optimal semi-counter machine is the best deterministic FSM when using a low number of states. Furthermore, a lower bound of approximately 0.02 bit on the redundancy achievable by any semi-counter machine was given.

In [7] the Piecewise Linear Predictor (PLP) was presented. First, a profile $P(R)$ is defined for any redundancy $R$: a set of rational numbers $\{r_1, \ldots, r_M\}$ constructed in the following manner:

- Start with $P(R) = \{0, \tfrac{1}{2}\}$.
- Find the pair of consecutive rational numbers $(r_i, r_{i+1})$ in the profile $P(R)$ that maximizes the term
$$R_{gap}(r_1, r_2) = D\big(r_1 \,\|\, MP(r_1, r_2)\big), \qquad (2.4)$$
$$MP(r_1, r_2) = \frac{1}{1 + 2^{\frac{h(r_2) - h(r_1)}{r_2 - r_1}}}, \qquad (2.5)$$
where $h(\cdot)$ denotes the binary entropy function and $D(\cdot\,\|\,\cdot)$ denotes the binary divergence function.
- Add a new rational number $r$ to the profile, where $r$ is selected to be the rational number with the minimal denominator that satisfies $r_i < r < r_{i+1}$.
- Repeat this process until
$$\max_i R_{gap}(r_i, r_{i+1}) < R. \qquad (2.6)$$

Given any desired redundancy $R$, the PLP is constructed by finding the corresponding profile $P(R)$ and dividing the [0, 1] axis into segments, where each segment matches one of the rational numbers in the profile. States are linearly spaced in each segment with a different slope according to the corresponding rational number. It was shown that the PLP outperforms the EDM machine and achieves asymptotic performance which is very close to the lower bound of $O(k^{-4/5})$. The PLP is the best asymptotic binary deterministic predictor known so far.
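As an illustration of the profile construction described above, here is a small Python sketch (ours, not from the thesis). It implements the refinement loop and the minimal-denominator search via a Stern-Brocot descent; the gap and merge-point functions follow Equations (2.4)-(2.5) as reconstructed here, so treat the exact expressions as an assumption.

```python
from fractions import Fraction
from math import log2

def h(p):  # binary entropy
    p = float(p)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def D(p, q):  # binary divergence D(p || q)
    def term(a, b):
        return 0.0 if a == 0 else a * log2(a / b)
    p, q = float(p), float(q)
    return term(p, q) + term(1 - p, 1 - q)

def mp(r1, r2):
    """Merge point of Eq. (2.5), as reconstructed here."""
    slope = (h(r2) - h(r1)) / float(r2 - r1)
    return 1.0 / (1.0 + 2.0 ** slope)

def r_gap(r1, r2):  # Eq. (2.4)
    return D(r1, mp(r1, r2))

def min_denominator_between(lo, hi):
    """Rational with the smallest denominator strictly inside (lo, hi) (Stern-Brocot search)."""
    a, b, c, d = 0, 1, 1, 0                      # left endpoint 0/1, right endpoint 1/0
    while True:
        med = Fraction(a + c, b + d)             # mediant of the current bracket
        if med <= lo:
            a, b = med.numerator, med.denominator
        elif med >= hi:
            c, d = med.numerator, med.denominator
        else:
            return med

def build_profile(R):
    profile = [Fraction(0), Fraction(1, 2)]
    while True:
        gaps = [(r_gap(profile[i], profile[i + 1]), i) for i in range(len(profile) - 1)]
        worst, i = max(gaps)
        if worst < R:
            return profile
        profile.insert(i + 1, min_denominator_between(profile[i], profile[i + 1]))

print(build_profile(0.05))
```

The Stern-Brocot descent is a convenient way to obtain the minimal-denominator rational required by the construction: it repeatedly takes mediants until one falls strictly inside the open interval.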

Chapter 3
Problem Formulation

In this chapter we give basic definitions and guidelines for the problem discussed in this thesis: finite-memory universal prediction of individual (bounded) continuous sequences with square-error loss, compared to the class of constant predictors.

3.1 Definitions

In this section we give definitions that will be used throughout this work.

Definition 3.1. The loss of any predictor $b$ that predicts the next outcome of an individual continuous sequence of length $n$, $\{x_t\}_{t=1}^{n}$, is defined as the averaged square error between the prediction and the actual outcome at each time $t$:
$$L_b(x_1^n) = \frac{1}{n}\sum_{t=1}^{n}(x_t - \hat{x}_t)^2, \qquad (3.1)$$
where $\{\hat{x}_t\}_{t=1}^{n}$ are the predictions.

Definition 3.2. The regret of any predictor $b$ is defined as the excess loss of the predictor over the best predictor from a reference class $B$ in the worst-case scenario:
$$R_n(B, b) = \max_{x_1^n}\left\{L_b(x_1^n) - \min_{\tilde{b} \in B} L_{\tilde{b}}(x_1^n)\right\}. \qquad (3.2)$$
Taking $n$ to infinity defines the actual regret:
$$R(B, b) = \lim_{n \to \infty} R_n(B, b). \qquad (3.3)$$

A finite-state machine (FSM) is a commonly used model for sequential machines with a limited amount of storage. A sequential decision strategy based on a k-state FSM is a triple $E = (S, f, g)$, where $S$ is a finite set of states with $k$ elements, $f: S \to B$ is the output function, and $g: X \times S \to S$ is the next-state function.

When an input sequence $\{x_t\}_{t=1}^{n}$ is fed into $E$, starting from an initial state $S_0 \in S$, the FSM produces a sequence of output strategies $\{\hat{x}_t\}_{t=1}^{n}$ while going through a sequence of states $\{S_t\}_{t=1}^{n}$ according to the recursive rules
$$S_t = g(x_{t-1}, S_{t-1}), \qquad \hat{x}_t = f(S_t), \qquad (3.4)$$
where $S_t$ is the state of $E$ at time $t$. For the case considered in this work, predicting continuous sequences, we formulate the following definitions:

Definition 3.3. A deterministic finite-state machine that predicts the next outcome of a continuous sequence is defined as follows:
- An array of $k$ states, where $\{S_1, \ldots, S_k\}$ denote the values assigned to the states. The prediction of the machine at time $t$, $\hat{x}_t$, is the value assigned to the current state.
- A threshold set is defined for each state $i$:
$$T_i = \{T_{i,-m_{d,i}-1}, T_{i,-m_{d,i}}, \ldots, T_{i,m_{u,i}}\}, \qquad (3.5)$$
where $m_{u,i}$ and $m_{d,i}$ are the maximum up- and down-steps from state $i$, correspondingly. The transition thresholds are non-decreasing and non-intersecting, satisfying
$$\bigcup_j \left[T_{i,j-1},\, T_{i,j}\right) = [a, b], \qquad (3.6)$$
where the samples of any sequence are assumed to be bounded in the interval $[a, b]$.
- The transition function for state $i$, $1 \le i \le k$, is
$$\varphi(i, x) = \begin{cases} i - m_{d,i} & \text{if } T_{i,-m_{d,i}-1} \le x < T_{i,-m_{d,i}} \\ i - m_{d,i} + 1 & \text{if } T_{i,-m_{d,i}} \le x < T_{i,-m_{d,i}+1} \\ \;\;\vdots & \\ i + m_{u,i} - 1 & \text{if } T_{i,m_{u,i}-2} \le x < T_{i,m_{u,i}-1} \\ i + m_{u,i} & \text{if } T_{i,m_{u,i}-1} \le x < T_{i,m_{u,i}}. \end{cases}$$
For example, the machine jumps $j$ states if at time $t$ the machine is at state $i$ and the input sample satisfies
$$T_{i,j-1} \le x_t < T_{i,j}. \qquad (3.7)$$

3.2 Guidelines

In this section we present a few basic theorems that will be used as fundamental guidelines throughout this thesis.

Theorem 3.1. The regret of any predictor $b$ w.r.t. the square-error loss, compared to the constant predictor class, satisfies
$$R_n(B, b) = \max_{x_1^n}\left\{\frac{1}{n}\sum_{t=1}^{n}\left[(x_t - \hat{x}_t)^2 - (x_t - \bar{x})^2\right]\right\}, \qquad (3.8)$$
where $\bar{x}$ is the average of the sequence.

Proof. We need to prove that the best constant predictor for all sequences is the one that predicts the average of the sequence, meaning
$$\min_{\tilde{b} \in B} L_{\tilde{b}}(x_1^n) = \min_{\tilde{b} \in B} \frac{1}{n}\sum_{t=1}^{n}\left(x_t - \hat{x}(\tilde{b})\right)^2 = \frac{1}{n}\sum_{t=1}^{n}(x_t - \bar{x})^2, \qquad (3.9)$$
where $B$ denotes the constant predictors class and $\hat{x}(\tilde{b})$ denotes the constant prediction of predictor $\tilde{b}$. The average $\bar{x}$ is a local minimum of $L_{\tilde{b}}(x_1^n)$ for any $\{x_t\}_{t=1}^{n}$, and since $L_{\tilde{b}}(x_1^n)$ is a convex function of the constant prediction, this local minimum is the unique global minimum.

Throughout this work we discuss predictors designed for input samples that are bounded in the interval [0, 1]. The results in this work can easily be extended to predictors for input samples that are bounded in any interval $[a, b]$ using the following theorem.

Theorem 3.2. Any FSM designed to achieve regret $R$ for sequences bounded in the interval [0, 1] can be transformed into an FSM that achieves regret $(b-a)^2 R$ for sequences bounded in the interval $[a, b]$, where $a, b \in \mathbb{R}$, by applying the following transformation:
- Assume the states of the original FSM are $\{S_1, \ldots, S_k\}$. Each state $S_i$ is transformed into $a + (b-a)S_i$ in the new FSM.
- Assume the threshold sets of the original FSM are $\{T_1, \ldots, T_k\}$. Each threshold set $T_i$ is transformed into $a + (b-a)T_i$ in the new FSM.

Proof. Let $\{x_t\}_{t=1}^{n}$ be a sequence bounded in $[a, b]$ and let $\{\hat{x}_t\}_{t=1}^{n}$ be the predictions of the new FSM suggested above. Let us examine the regret:
$$\mathrm{regret}_{new} = \frac{1}{n}\sum_{t=1}^{n}\left[(x_t - \hat{x}_t)^2 - (x_t - \bar{x})^2\right]. \qquad (3.10)$$
Let us apply the following transformation:
$$x_t = a + (b-a)\tilde{x}_t, \qquad (3.11)$$
$$\hat{x}_t = a + (b-a)\hat{\tilde{x}}_t. \qquad (3.12)$$

Note that the sequence $\{\tilde{x}_t\}_{t=1}^{n}$ is bounded in [0, 1] and $\{\hat{\tilde{x}}_t\}_{t=1}^{n}$ are the predictions of the original FSM over the sequence $\{\tilde{x}_t\}_{t=1}^{n}$ (both sequences induce the same sequence of jumps). Therefore, the regret of the new FSM satisfies
$$\mathrm{regret}_{new} = (b-a)^2\,\frac{1}{n}\sum_{t=1}^{n}\left[(\tilde{x}_t - \hat{\tilde{x}}_t)^2 - (\tilde{x}_t - \bar{\tilde{x}})^2\right] = (b-a)^2\,\mathrm{regret}_{original}. \qquad (3.13)$$

Since the transformation is bijective, a corollary of Theorem 3.2 is that the optimal deterministic FSM for sequences bounded in the interval [0, 1] can be transformed into an optimal deterministic FSM for sequences bounded in the interval $[a, b]$.

Definition 3.4. A circle is a cyclic closed set of states/predictions $\{\hat{x}_t\}_{t=1}^{L}$ and input samples $\{x_t\}_{t=1}^{L}$ that rotate the machine between these states. A minimal circle is a circle that does not contain the same state more than once.

Theorem 3.3. The worst sequence for a given FSM takes the machine to a minimal circle and rotates in it endlessly.

Proof. The proof is given in detail in [14], where it is shown that the worst binary sequence for a given FSM w.r.t. the log-loss function endlessly rotates the machine in a minimal circle. The proof is based on partitioning any long enough prediction sequence $\hat{x}_1, \ldots, \hat{x}_n$ of a given FSM for an input sequence $x_1, \ldots, x_n$ into minimal circles and a negligible residual monotonic sequence. By the convexity of the log-loss function, the regret over the entire sequence is upper bounded by the weighted average of the regrets of these minimal circles. Therefore a sequence that endlessly rotates the machine in the minimal circle with the highest regret will have a regret at least as large, and we conclude that the worst sequence is the one that endlessly rotates the machine in the minimal circle with the highest regret. The proof for our case, the worst continuous sequence w.r.t. the square-error loss, is identical.

Note that Theorem 3.3 implies that to analyze the regret of a given FSM, one should examine only the regrets of sequences that rotate the machine in a minimal circle. In the sequel we will use Theorem 3.3 to design finite-state machines by verifying that all minimal circles have a regret smaller than the desired regret. Also note that there is an infinite number of sequences that can rotate a machine in a given minimal circle (any input sample $x$ that satisfies $T_{i,j-1} \le x < T_{i,j}$ induces a $j$-state jump from state $i$). In this work we will refer to the regret of a minimal circle as the regret of the sequence that endlessly rotates the machine in that minimal circle and achieves the highest regret.
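To make Definition 3.3 and the threshold rule (3.7) concrete, here is a short Python sketch (our own illustration; the class name, method names and data layout are assumptions, not part of the thesis) of a generic threshold-based finite-state predictor over [0, 1]:

```python
import bisect

class ThresholdFSMPredictor:
    """Deterministic FSM predictor in the spirit of Definition 3.3.

    states[i]     -- value S_i predicted while in state i
    thresholds[i] -- increasing transition thresholds of state i
    targets[i][j] -- state reached from state i when the sample falls in the
                     j-th threshold interval of state i
    """
    def __init__(self, states, thresholds, targets, start=0):
        self.states = states
        self.thresholds = thresholds
        self.targets = targets
        self.state = start

    def predict(self):
        return self.states[self.state]

    def update(self, x):
        # locate the threshold interval containing x and jump accordingly
        j = bisect.bisect_right(self.thresholds[self.state], x)
        self.state = self.targets[self.state][j]

def run(machine, xs):
    """Feed a sequence and return the prediction made before each sample."""
    preds = []
    for x in xs:
        preds.append(machine.predict())
        machine.update(x)
    return preds
```

Any of the machines designed in the following chapters can be expressed in this form by listing its state values, its per-state thresholds and the corresponding target states.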

Chapter 4
Exponential Decaying Memory Machine

Meron and Feder presented in [11] the Exponential Decaying Memory (EDM) machine as a deterministic finite-state machine for predicting the next outcome of an individual binary sequence. They further showed that the EDM machine with $k$ states achieves asymptotic regret of $O(k^{-2/3})$ compared to the constant predictors class with respect to the log-loss (code length) and square-error functions. Here we prove that also for the problem discussed in this work, predicting the next outcome of an individual continuous sequence, the k-state EDM machine achieves asymptotic regret of $O(k^{-2/3})$ compared to the constant predictors class with square-error loss. Moreover, we present lower and upper bounds on the maximal regret incurred by the worst sequence.

4.1 EDM Predictor for Individual Continuous Sequences

Definition 4.1. The Exponential Decaying Memory machine is defined by $k$ states $\{S_1, \ldots, S_k\}$ distributed uniformly over the interval $[k^{-1/3}, 1 - k^{-1/3}]$. Thus, the state gap satisfies
$$\Delta = \frac{1 - 2k^{-1/3}}{k - 1} \le \frac{1}{k}. \qquad (4.1)$$
At each time $t$, the machine predicts the value assigned to the current state, denoted $\hat{x}_t$. The transition function between states is defined by
$$\hat{x}_{t+1} = Q\left(\hat{x}_t(1 - k^{-2/3}) + x_t k^{-2/3}\right), \qquad (4.2)$$
where $Q$ is the quantization function to the nearest state:
$$Q(y) = \hat{x}_{t+1}, \qquad (4.3)$$
where $y$ satisfies
$$\hat{x}_{t+1} - \frac{\Delta}{2} \le y < \hat{x}_{t+1} + \frac{\Delta}{2}. \qquad (4.4)$$
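A small Python sketch of the machine just defined (our own illustration; the state grid and the update follow Definition 4.1, while the class name, the initial state and the quantization by nearest-neighbor search are assumed implementation details):

```python
import numpy as np

class EDMPredictor:
    """k-state Exponential Decaying Memory machine over [0, 1] (Definition 4.1)."""
    def __init__(self, k, start_state=None):
        lo, hi = k ** (-1 / 3), 1 - k ** (-1 / 3)
        self.states = np.linspace(lo, hi, k)     # uniformly spaced state values
        self.eps = k ** (-2 / 3)                 # step size k^(-2/3) of Eq. (4.2)
        self.i = k // 2 if start_state is None else start_state

    def predict(self):
        return self.states[self.i]

    def update(self, x):
        # Eq. (4.2): move toward x with step k^(-2/3), then quantize to the nearest state
        y = (1 - self.eps) * self.states[self.i] + self.eps * x
        self.i = int(np.argmin(np.abs(self.states - y)))

# example: feed a bounded sequence and accumulate the square-error loss
edm = EDMPredictor(k=100)
rng = np.random.default_rng(1)
loss = 0.0
for x in rng.uniform(0, 1, 5000):
    loss += (x - edm.predict()) ** 2
    edm.update(x)
```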

Using Equation (4.2), we note that the machine jumps $j$ states from state $\hat{x}_t = S_i$ if
$$Q\left(S_i + (x_t - S_i)k^{-2/3}\right) = S_i + j\Delta, \qquad (4.5)$$
or equivalently
$$S_i + \left(j - \tfrac{1}{2}\right)\Delta k^{2/3} \le x_t < S_i + \left(j + \tfrac{1}{2}\right)\Delta k^{2/3}. \qquad (4.6)$$
Following Equation (4.6) we can define the transition thresholds of the EDM machine:

Definition 4.2. For any state $i$, the transition thresholds $\{T_{i,-m_{d,i}}, \ldots, T_{i,m_{u,i}}\}$ satisfy
$$T_{i,j} = S_i + \left(j + \tfrac{1}{2}\right)\Delta k^{2/3}, \qquad -m_{d,i} \le j \le m_{u,i}, \qquad (4.7)$$
where $m_{d,i}$ and $m_{u,i}$ are the first integers satisfying
$$S_i - \left(m_{d,i} + \tfrac{1}{2}\right)\Delta k^{2/3} \le 0, \qquad S_i + \left(m_{u,i} + \tfrac{1}{2}\right)\Delta k^{2/3} \ge 1.$$

Proposition 4.1. The farthest possible jump (either up or down) from any state of the EDM machine is upper bounded by $2k^{1/3}$ states.

Proof. From Equation (4.2) we note that at time $t$ the machine jumps
$$\frac{(x_t - \hat{x}_t)k^{-2/3}}{\Delta} + \delta_t \qquad (4.8)$$
states, where $\delta_t$ is the quantization addition to the nearest state, satisfying
$$-\tfrac{1}{2} \le \delta_t < \tfrac{1}{2}. \qquad (4.9)$$
Therefore, the farthest possible jump of the EDM machine at any time $t$ can be upper bounded:
$$\left|\frac{(x_t - \hat{x}_t)k^{-2/3}}{\Delta} + \delta_t\right| \le \frac{|x_t - \hat{x}_t|\,k^{-2/3}}{\Delta} + |\delta_t| \le \frac{(1 - k^{-1/3})k^{-2/3}}{\Delta} + \tfrac{1}{2} \le 2k^{1/3}, \qquad (4.10)$$
where in the last inequality we applied Equation (4.1).

4.2 Bounds on the Regret of the EDM Machine

In this section we present lower and upper bounds on the maximal regret incurred by the worst continuous sequence. Thus, in a k-state EDM machine, the regret of any continuous sequence is smaller than the upper bound, and the maximal regret is at least the lower bound.

Theorem 4.1. The maximal regret of the k-state EDM machine (denoted $U_{EDM_k}$), attained by the worst continuous sequence, is asymptotically bounded by
$$\tfrac{1}{2}k^{-2/3} + O(k^{-1}) \;\le\; \max_{x_1^n} R(U_{EDM_k}, x_1^n) \;\le\; \tfrac{17}{4}k^{-2/3}.$$

Proof. Theorem 3.3 implies that the sequence that maximizes the regret of any FSM is the one that takes it to the minimal circle with the highest regret and rotates in it endlessly. Here we prove that any sequence that rotates the EDM machine in a minimal circle incurs a regret smaller than $\tfrac{17}{4}k^{-2/3}$; this upper bound therefore holds also for the worst sequence.

We examine the regret of an L-length sequence $\{x_t\}_{t=1}^{L}$ that rotates the EDM machine in a minimal circle of $L$ states $\{\hat{x}_t\}_{t=1}^{L}$. The input sample at each time $t$ can be written as
$$x_t = \hat{x}_t + (P_t + \delta_t)\Delta k^{2/3}, \qquad (4.11)$$
where $P_t \in \mathbb{Z}$ denotes the number of states crossed by the machine at time $t$ (negative for down-steps, positive for up-steps) and $\delta_t$ is a quantization addition satisfying
$$-\tfrac{1}{2} \le \delta_t < \tfrac{1}{2}, \qquad (4.12)$$
which has no impact on the jump at time $t$, i.e. no impact on the prediction at time $t+1$. Since the input samples rotate the machine in a minimal circle (the last sample returns the machine to the starting state), the number of states crossed on the way up and on the way down is equal, hence
$$\sum_{t=1}^{L} P_t = 0. \qquad (4.13)$$
Now denote the average of the sequence by $\bar{x} = \frac{1}{L}\sum_{t=1}^{L}x_t$. The regret of any sequence that endlessly rotates the EDM machine in a minimal circle satisfies
$$R(U_{EDM_k}, x_1^L) = \frac{1}{L}\sum_{t=1}^{L}\left[(x_t - \hat{x}_t)^2 - (x_t - \bar{x})^2\right] = \frac{1}{L}\sum_{t=1}^{L}(P_t + \delta_t)^2\Delta^2 k^{4/3} - \frac{1}{L}\sum_{t=1}^{L}\left(\hat{x}_t + (P_t + \delta_t)\Delta k^{2/3} - \bar{x}\right)^2$$
$$= \frac{1}{L}\sum_{t=1}^{L}\left(\delta_t^2\Delta^2 k^{4/3} - 2P_t\Delta k^{2/3}\hat{x}_t\right) + \left(\frac{1}{L}\sum_{t=1}^{L}\left(\hat{x}_t + \delta_t\Delta k^{2/3}\right)\right)^2 - \frac{1}{L}\sum_{t=1}^{L}\left(\hat{x}_t + \delta_t\Delta k^{2/3}\right)^2,$$
where in the last equality we applied Equation (4.13).

Since the square-error function is convex, applying Jensen's inequality yields
$$R(U_{EDM_k}, x_1^L) \le \frac{1}{L}\sum_{t=1}^{L}\delta_t^2\Delta^2 k^{4/3} - \frac{2\Delta k^{2/3}}{L}\sum_{t=1}^{L}P_t\hat{x}_t. \qquad (4.14)$$
The first term on the right-hand side of Equation (4.14) depends only on the quantization of the input samples, $\delta_t$, and thus we term it the quantization loss. The second term depends on the spacing gap between states, $\Delta$, and thus we term it the spacing loss. Hence, the regret of the sequence is upper bounded by a loss incurred by the quantization of the input samples plus a loss incurred by the quantization of the state values, i.e. the prediction values. By applying Equation (4.1) we bound the quantization loss:
$$\text{quantization loss} = \frac{1}{L}\sum_{t=1}^{L}\delta_t^2\Delta^2 k^{4/3} \le \frac{1}{4}k^{-2/3}. \qquad (4.15)$$
We finalize the proof of the upper bound by proving that the spacing loss can be upper bounded by
$$\text{spacing loss} = -\frac{2\Delta k^{2/3}}{L}\sum_{t=1}^{L}P_t\hat{x}_t \le 4k^{-2/3}. \qquad (4.16)$$
First we note that $P_t$, the number of states crossed at time $t$, is positive for up-steps and negative for down-steps. As a corollary of Proposition 4.1, $P_t$ can be bounded by the farthest possible jump:
$$|P_t| \le \frac{k^{-2/3}}{\Delta} \le 2k^{1/3} \qquad (4.17)$$
states. Define a sub-step as a one-state step that is associated with a full step, i.e. a step of $P > 0$ states originating from state $\hat{x}$ consists of $P$ sub-steps (each crossing a different state), all associated with the origin state $\hat{x}$. Now assume an up-step of $P_u > 0$ states from state $\hat{x}_u$. Since we examine a minimal circle, there are always corresponding down-steps that cross the same $P_u$ states on the way down. Therefore, it is possible to assign to each up-step of $P_u$ states, $P_u$ down sub-steps that cross the same states. Let $D(\hat{x}_u, P_u)$ be the set of down sub-steps assigned to an up-step of $P_u$ states originating from state $\hat{x}_u$. We can use Proposition 4.1 to conclude that all down sub-steps in $D(\hat{x}_u, P_u)$ originate from a state that is not higher than
$$\hat{x}_u + (P_u + 2k^{1/3})\Delta. \qquad (4.18)$$
We further note that by definition
$$|D(\hat{x}_u, P_u)| = P_u. \qquad (4.19)$$

Figure 4.1: Minimal circle of two up-steps and two down-steps (solid lines). $SS_j$ are the down sub-steps, where $D(S_i, 3) = \{SS_3, SS_4, SS_5\}$ and $D(S_{i+3}, 2) = \{SS_1, SS_2\}$. Note that sub-steps $SS_1, SS_2, SS_3$ are associated with origin state $S_{i+5}$, while sub-steps $SS_4, SS_5$ are associated with origin state $S_{i+2}$.

Thus we can write
$$-\frac{1}{L}\sum_{t=1}^{L}P_t\hat{x}_t = -\frac{1}{L}\sum_{t\in\{\text{up-steps}\}}P_t\hat{x}_t - \frac{1}{L}\sum_{t\in\{\text{down-steps}\}}P_t\hat{x}_t = \frac{1}{L}\sum_{t\in\{\text{up-steps}\}}\Big(-P_t\hat{x}_t + \sum_{j\in D(\hat{x}_t, P_t)}\hat{x}_j\Big), \qquad (4.20)$$
where $\hat{x}_j$ is the origin state of sub-step $j$ (meaning, the down-step originating from state $\hat{x}_j$ contains sub-step $j$). By applying Equation (4.18) we get
$$-\frac{1}{L}\sum_{t=1}^{L}P_t\hat{x}_t \le \frac{1}{L}\sum_{t\in\{\text{up-steps}\}}\Big(-P_t\hat{x}_t + \sum_{j\in D(\hat{x}_t, P_t)}\big(\hat{x}_t + (P_t + 2k^{1/3})\Delta\big)\Big). \qquad (4.21)$$
Now we can apply Equation (4.19):
$$-\frac{1}{L}\sum_{t=1}^{L}P_t\hat{x}_t \le \frac{1}{L}\sum_{t\in\{\text{up-steps}\}}\Big(-P_t\hat{x}_t + P_t\big(\hat{x}_t + (P_t + 2k^{1/3})\Delta\big)\Big) = \frac{1}{L}\sum_{t\in\{\text{up-steps}\}}P_t(P_t + 2k^{1/3})\Delta. \qquad (4.22)$$
Applying $\Delta \le \frac{1}{k}$ and bounding $P_t$ using Equation (4.17) results in
$$-\frac{1}{L}\sum_{t=1}^{L}P_t\hat{x}_t \le 2k^{-1/3}. \qquad (4.23)$$

Thus, the spacing loss satisfies
$$\text{spacing loss} = 2\Delta k^{2/3}\left(-\frac{1}{L}\sum_{t=1}^{L}P_t\hat{x}_t\right) \le 4k^{-2/3}. \qquad (4.24)$$
The proof of the lower bound is given in Appendix A, where we show that there is a sequence that endlessly rotates the k-state EDM machine in a minimal circle, incurring a regret of $\tfrac{1}{2}k^{-2/3} + O(k^{-1})$. Thus, the regret of the worst sequence is at least $\tfrac{1}{2}k^{-2/3} + O(k^{-1})$.
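As a rough numerical companion to Theorem 4.1 (ours, not part of the thesis), the sketch below searches over periodic two-sample cycles, which form one family of minimal circles, and reports the largest regret found for the EDM machine. By the theorem this value cannot exceed $\tfrac{17}{4}k^{-2/3}$; it is only a lower estimate of the true maximal regret, since other minimal circles are not explored. It reuses the EDMPredictor sketch from Section 4.1.

```python
import numpy as np
# assumes the EDMPredictor class from the sketch in Section 4.1 is available

def cycle_regret(machine, x_pair, reps=400, burn_in=100):
    """Approximate steady-state regret of the periodic sequence (x_lo, x_hi, ...)."""
    xs = np.tile(np.array(x_pair, dtype=float), reps)
    preds = []
    for x in xs:
        preds.append(machine.predict())
        machine.update(x)
    xs, preds = xs[burn_in:], np.array(preds[burn_in:])   # discard the transient
    return np.mean((xs - preds) ** 2) - np.mean((xs - xs.mean()) ** 2)

k = 200
grid = np.linspace(0.0, 1.0, 41)
worst = max(cycle_regret(EDMPredictor(k), (lo, hi))
            for lo in grid for hi in grid if lo < hi)
print(worst, 0.5 * k ** (-2 / 3), 17 / 4 * k ** (-2 / 3))
```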

Chapter 5
Designing an Optimal Machine with a Small Number of States

In this chapter we want to design a finite-state universal predictor with a small number of states. We start by designing the optimal machine for a single state, two states and three states, trying to understand the properties of the problem. We continue by defining a new class of machines, named the Degenerated Tracking Memory (DTM) machines, presenting an algorithm for constructing the optimal DTM machine and proving a lower bound on its achievable regret. Finally, we present numerical results showing that the optimal DTM machine outperforms the EDM machine.

5.1 Single State

We start with designing the optimal single-state prediction machine. The machine outputs a constant prediction, denoted $S$.

Theorem 5.1. The optimal single-state universal predictor predicts the constant $S = \tfrac{1}{2}$ and achieves regret smaller than $R = \tfrac{1}{4}$ for any sequence.

Proof. Let us examine the regret of the predictor:
$$\text{regret} = \frac{1}{n}\sum_{t=1}^{n}(x_t - S)^2 - \frac{1}{n}\sum_{t=1}^{n}(x_t - \bar{x})^2 \le \frac{1}{n}\sum_{t=1}^{n}(x_t - S)^2. \qquad (5.1)$$
A constant sequence, i.e. $x_t = x$ for all $t$, attains Equation (5.1) with equality. Therefore, we want to find the solution of the following minmax problem:
$$\min_S \max\{\text{regret}\} = \min_S \max_{0 \le x \le 1}(x - S)^2. \qquad (5.2)$$

If we choose $S < \tfrac{1}{2}$, the maximal regret is
$$\max\{\text{regret}\} = \max_{0 \le x \le 1}(x - S)^2 = (1 - S)^2 > \tfrac{1}{4}. \qquad (5.3)$$
If we choose $S > \tfrac{1}{2}$, the maximal regret is
$$\max\{\text{regret}\} = \max_{0 \le x \le 1}(x - S)^2 = (0 - S)^2 > \tfrac{1}{4}. \qquad (5.4)$$
Therefore, the optimal solution is $S = \tfrac{1}{2}$, attaining regret $R = \tfrac{1}{4}$.

5.2 Two States

We continue with designing the optimal two-state universal prediction machine. Denote the regret of the machine by $R$ and the two states by $S_1$ and $S_2$.

Figure 5.1: Two states machine described geometrically over the [0, 1] axis.

There are two possible minimal circles:
1. Staying at the same state (zero-step circle).
2. Jumping from one state to the other (two-step circle).
The first minimal circle can be viewed as a quantization error, while the second minimal circle can be viewed as a prediction error.

Theorem 5.2. The optimal two-state machine satisfies
$$S_1 = \sqrt{R}, \qquad S_2 = 1 - \sqrt{R},$$
while the transition function satisfies
$$\varphi(1, x) = \begin{cases} 1 & \text{if } x < 2\sqrt{R} \\ 2 & \text{otherwise}\end{cases} \qquad\quad \varphi(2, x) = \begin{cases} 1 & \text{if } x < 1 - 2\sqrt{R} \\ 2 & \text{otherwise.}\end{cases}$$

Proof. Since the two states are both saturated (a jump is allowed only in one direction), a higher $S_1$ and a lower $S_2$ reduce the regret of the second minimal circle (the two-step circle). The zero-step minimal circle (staying at the same state) satisfies
$$\text{regret} = (x_t - S_i)^2 \le R \;\;\Longrightarrow\;\; x_t \in [S_i - \sqrt{R},\, S_i + \sqrt{R}], \qquad (5.5)$$
hence the states have to satisfy
$$S_1 - \sqrt{R} \le 0, \qquad S_2 + \sqrt{R} \ge 1. \qquad (5.6)$$
Thus, the optimal $S_1$ and $S_2$ are the highest and lowest possible values, correspondingly, satisfying
$$S_1 = \sqrt{R}, \qquad S_2 = 1 - \sqrt{R}. \qquad (5.7)$$
Now, the transition function consists of only one threshold. For $S_1$, the threshold is $S_1 + \sqrt{R}$. A higher threshold results in regret bigger than $R$; a lower threshold can only increase the regret of the two-step minimal circle. Therefore, this is the optimal transition threshold.

Theorem 5.3. The maximal regret of the optimal two-state machine, incurred by the worst sequence, is $R = \tfrac{9}{64} \approx 0.1406$.

Proof. Theorem 5.2 implies that the zero-step minimal circle achieves regret $R$. We now want to find the sequence that maximizes the regret of the two-step minimal circle. Denote the samples that rotate the machine between the two states by $x_1$ for the up-step sample and $x_2$ for the down-step sample. The regret of the two-step minimal circle is
$$\text{regret} = \tfrac{1}{2}\left((x_1 - S_1)^2 + (x_2 - S_2)^2 - (x_1 - \bar{x})^2 - (x_2 - \bar{x})^2\right), \qquad (5.8)$$
where $\bar{x}$ is the average of the sequence. It can be shown that Equation (5.8) can also be written as
$$\text{regret} = \left(\tfrac{1}{2}(x_1 + x_2) - S_1\right)^2 - (S_2 - S_1)x_2 + \tfrac{1}{2}(S_2^2 - S_1^2). \qquad (5.9)$$
Finding the $x_1$ and $x_2$ that maximize the above is a maximization problem over a convex function of two variables, with two constraints resulting from the transition function proved to be optimal in Theorem 5.2:
$$x_1 \ge 2\sqrt{R}, \qquad x_2 \le 1 - 2\sqrt{R}. \qquad (5.10)$$

We can write
$$\max_{x_1, x_2}\{\text{regret}\} = \tfrac{1}{2}(S_2^2 - S_1^2) + \max_{x_2}\Big\{\max_{x_1}\big(\tfrac{1}{2}(x_1 + x_2) - S_1\big)^2 - (S_2 - S_1)x_2\Big\} = \tfrac{1}{2}(S_2^2 - S_1^2) + \max_{x_2}\Big\{\big(\tfrac{1}{2}(1 + x_2) - S_1\big)^2 - (S_2 - S_1)x_2\Big\}, \qquad (5.11)$$
where in Equation (5.11) we applied $x_1 = 1$ to maximize the regret, since $\tfrac{1}{2}(x_1 + x_2) \ge S_1$ for any feasible $x_1, x_2$. We note that now the maximization problem is over a convex function with a single minimum point at $x_2 = 1 - 2\sqrt{R}$. Thus, $x_2 = 0$ maximizes the regret, and we conclude that the maximal regret is attained with $x_1 = 1$ and $x_2 = 0$. Since the machine achieves regret $R$, the minimal circle has to satisfy
$$R \ge \tfrac{1}{2}\left((x_1 - S_1)^2 + (x_2 - S_2)^2 - (x_1 - \bar{x})^2 - (x_2 - \bar{x})^2\right)\Big|_{x_1=1,\,x_2=0} = (1 - \sqrt{R})^2 - \left(\tfrac{1}{2}\right)^2. \qquad (5.12)$$
Solving the above results in
$$R \ge \left(\tfrac{3}{8}\right)^2 = 0.140625. \qquad (5.13)$$

We can now summarize the optimal two-state machine:
- State values: $S_1 = \tfrac{3}{8}$, $S_2 = \tfrac{5}{8}$.
- The state transition function satisfies
$$\varphi(1, x) = \begin{cases} 1 & \text{if } x < \tfrac{3}{4} \\ 2 & \text{otherwise}\end{cases} \qquad\quad \varphi(2, x) = \begin{cases} 1 & \text{if } x < \tfrac{1}{4} \\ 2 & \text{otherwise.}\end{cases}$$

Comments:
1. For any desired regret satisfying $0.1406 \le R < 0.25$, the optimal two-state machine is the optimal solution in the sense of minimum number of states.
2. The optimal two-state machine for predicting continuous sequences is identical to the one for the binary case (with square-error loss), in the sense of state values and maximal regret.
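For concreteness, here is a tiny Python sketch (ours, illustrative only) of the optimal two-state machine summarized above; feeding it the worst two-step circle of the proof reproduces the regret of Theorem 5.3.

```python
import numpy as np

S = (3 / 8, 5 / 8)             # optimal two-state machine: state values
UP_TH, DOWN_TH = 3 / 4, 1 / 4  # transition thresholds of states 1 and 2

def run_regret(xs, start=0):
    """Run the machine on xs and return its regret over the whole sequence."""
    i, preds = start, []
    for x in xs:
        preds.append(S[i])
        if i == 0 and x >= UP_TH:      # up-jump from the lower state
            i = 1
        elif i == 1 and x < DOWN_TH:   # down-jump from the upper state
            i = 0
    xs, preds = np.asarray(xs), np.asarray(preds)
    return np.mean((xs - preds) ** 2) - np.mean((xs - np.mean(xs)) ** 2)

# worst two-step minimal circle: alternate a maximal and a minimal sample forever
print(run_regret([1.0, 0.0] * 500))    # -> 0.140625 = 9/64, matching Theorem 5.3
```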

Figure 5.2: Optimal two-state machine described geometrically over the [0, 1] axis, along with the transition thresholds for each state.

5.3 Three States

We continue with designing the optimal three-state universal prediction machine. Denote the regret of the machine by $R$ and the three states by $\{S_1, S_2, S_3\}$. Note that now also a two-state jump from $S_1$ to $S_3$ is possible.

Figure 5.3: Three states machine described geometrically over the [0, 1] axis.

Theorem 5.4. The optimal three-state machine allows only single-state jumps.

Proof. By symmetry, if a two-state up-jump is applied from $S_1$, a two-state down-jump must also be applied from $S_3$. This means that the machine can be rotated between $S_1$ and $S_3$ with $x_1 = 1$ and $x_2 = 0$, as in the two-state machine.

This minimal circle results in a regret identical to that of the two-state machine, $\approx 0.1406$, since it is attained with the samples 0 and 1. Hence, if a two-state jump is applied, no improvement can be gained (in the regret sense) over the two-state machine by adding another state.

Since no two-state jump is allowed, there are three possible minimal circles:
1. Zero-step circle (staying at the same state).
2. Rotating the machine between $S_1$ and $S_2$.
3. Rotating the machine between $S_2$ and $S_3$.

Theorem 5.5. The optimal three-state machine satisfies
$$S_1 = \sqrt{R}, \qquad S_2 = \tfrac{1}{2}, \qquad S_3 = 1 - \sqrt{R},$$
while the transition function satisfies
$$\varphi(1, x) = \begin{cases} 1 & \text{if } x < 2\sqrt{R} \\ 2 & \text{otherwise}\end{cases} \qquad \varphi(2, x) = \begin{cases} 1 & \text{if } x < \tfrac{1}{2} - \sqrt{R} \\ 2 & \text{if } \tfrac{1}{2} - \sqrt{R} \le x < \tfrac{1}{2} + \sqrt{R} \\ 3 & \text{otherwise}\end{cases} \qquad \varphi(3, x) = \begin{cases} 2 & \text{if } x < 1 - 2\sqrt{R} \\ 3 & \text{otherwise.}\end{cases}$$

Proof. By symmetry we can conclude that $S_2 = \tfrac{1}{2}$. In the same manner as in the proof of Theorem 5.2, it can be shown that by choosing the regret of the zero-step minimal circle of the saturated states to be $R$ we minimize the regret of the two-step minimal circles. Thus, $S_1 = \sqrt{R}$ and $S_3 = 1 - \sqrt{R}$ are the optimal saturated states. Also as in the proof of Theorem 5.2, the optimal transition thresholds for state $i$ are $\sqrt{R}$ away from $S_i$; otherwise the regret of the two-step minimal circles can be increased.

Theorem 5.6. The maximal regret of the optimal three-state machine, incurred by the worst sequence, is $R \approx 0.108$.

Proof. Theorem 5.5 implies that the zero-step minimal circle achieves regret $R$. By symmetry, the regrets of the two two-step minimal circles are equal. Let us analyze the minimal circle between $S_1$ and $S_2$. Denote the up-step sample by $x_1$ and the down-step sample by $x_2$. We now want to find the sequence that maximizes the regret of the minimal circle between $S_1$ and $S_2$:
$$\text{regret} = \tfrac{1}{2}\left((x_1 - S_1)^2 + (x_2 - S_2)^2 - (x_1 - \bar{x})^2 - (x_2 - \bar{x})^2\right), \qquad (5.14)$$
where $\bar{x}$ is the average of the sequence.

It can be shown that Equation (5.14) can also be written as
$$\text{regret} = \left(\tfrac{1}{2}(x_1 + x_2) - S_1\right)^2 - (S_2 - S_1)x_2 + \tfrac{1}{2}(S_2^2 - S_1^2). \qquad (5.15)$$
Finding the $x_1$ and $x_2$ that maximize the above is a maximization problem over a convex function of two variables, with two constraints resulting from the transition function proved to be optimal in Theorem 5.5:
$$x_1 \ge 2\sqrt{R}, \qquad x_2 \le \tfrac{1}{2} - \sqrt{R}. \qquad (5.16)$$
We can write
$$\max_{x_1, x_2}\{\text{regret}\} = \tfrac{1}{2}(S_2^2 - S_1^2) + \max_{x_2}\Big\{\max_{x_1}\big(\tfrac{1}{2}(x_1 + x_2) - S_1\big)^2 - (S_2 - S_1)x_2\Big\} = \tfrac{1}{2}(S_2^2 - S_1^2) + \max_{x_2}\Big\{\big(\tfrac{1}{2}(1 + x_2) - S_1\big)^2 - (S_2 - S_1)x_2\Big\}, \qquad (5.17)$$
where in Equation (5.17) we applied $x_1 = 1$ to maximize the regret, since $\tfrac{1}{2}(x_1 + x_2) \ge S_1$ for any feasible $x_1, x_2$. We note that now the maximization problem is over a convex function with a single minimum point at $x_2 = 0$. Therefore, $x_2 = \tfrac{1}{2} - \sqrt{R}$ maximizes the regret, and the maximal regret is attained with $x_1 = 1$ and $x_2 = \tfrac{1}{2} - \sqrt{R}$. Since the machine achieves regret $R$, the regret of the minimal circle has to satisfy
$$R \ge \tfrac{1}{2}\left((x_1 - S_1)^2 + (x_2 - S_2)^2 - (x_1 - \bar{x})^2 - (x_2 - \bar{x})^2\right)\Big|_{x_1=1,\,x_2=\frac{1}{2}-\sqrt{R}} = \tfrac{1}{2}\left((1 - \sqrt{R})^2 + R\right) - \left(\tfrac{1}{4} + \tfrac{\sqrt{R}}{2}\right)^2. \qquad (5.18)$$
Solving the above results in
$$R \ge \left(2\sqrt{2} - \tfrac{5}{2}\right)^2 \approx 0.108. \qquad (5.19)$$

We can now summarize the optimal three-state machine:
- State values: $S_1 = \sqrt{R} \approx 0.328$, $S_2 = \tfrac{1}{2}$, $S_3 = 1 - \sqrt{R} \approx 0.672$.

Figure 5.4: Optimal three-state machine described geometrically over the [0, 1] axis, along with the transition thresholds for each state.

- The state transition function (with $R \approx 0.108$) satisfies
$$\varphi(1, x) = \begin{cases} 1 & \text{if } x < 2\sqrt{R} \\ 2 & \text{otherwise}\end{cases} \qquad \varphi(2, x) = \begin{cases} 1 & \text{if } x < \tfrac{1}{2} - \sqrt{R} \\ 2 & \text{if } \tfrac{1}{2} - \sqrt{R} \le x < \tfrac{1}{2} + \sqrt{R} \\ 3 & \text{otherwise}\end{cases} \qquad \varphi(3, x) = \begin{cases} 2 & \text{if } x < 1 - 2\sqrt{R} \\ 3 & \text{otherwise.}\end{cases}$$

Comments:
1. For any desired regret satisfying $0.108 \le R < 0.1406$, the optimal three-state machine is the optimal solution in the sense of minimum number of states.
2. The optimal three-state machine in the continuous-sequences case is different from the one in the binary case. The need to track continuous samples, and not only binary samples, degrades the maximal regret that is achievable with three states compared to the binary case.

3. Figure 5.4 depicts the states and the transition thresholds over the [0, 1] axis. One can notice the hysteresis characteristic of the machine, providing a memory or inertia to the finite-state predictor: an extreme input sample is needed for the machine to jump from the current state, that is, to change the prediction value.

5.4 Degenerated Tracking Memory Machine

After designing the optimal finite-state machine for a single, two and three states, we want to find the optimal deterministic finite-state machine for $k$ states, where in this chapter we assume $k$ is relatively small. In this section we introduce the class of Degenerated Tracking Memory (DTM) machines and present an algorithm for constructing the optimal DTM machine. We further present a lower bound on the achievable regret of this machine.

5.4.1 Introduction to the Class of DTM Machines

Definition 5.1. The class of all Degenerated Tracking Memory machines with $k$ states is of the form:
- An array of $k$ states. We denote the states in the lower half by $\{S_{k_l}, \ldots, S_1\}$ (in descending index order, where $S_1$ is the nearest state to $\tfrac{1}{2}$ and $S_i \le \tfrac{1}{2}$ for all $1 \le i \le k_l$). We denote the states in the upper half by $\{\tilde{S}_1, \ldots, \tilde{S}_{k_u}\}$ (in ascending order, where $\tilde{S}_1$ is the nearest state to $\tfrac{1}{2}$ and $\tilde{S}_i > \tfrac{1}{2}$ for all $1 \le i \le k_u$), with $k_l + k_u = k$.
- The maximum down-step in the lower half, i.e. from states $\{S_{k_l}, \ldots, S_1\}$, is no more than a single-state jump.
- The maximum up-step in the upper half, i.e. from states $\{\tilde{S}_1, \ldots, \tilde{S}_{k_u}\}$, is no more than a single-state jump.
- A transition between the lower and upper halves is allowed only from and to the nearest states to $\tfrac{1}{2}$, $S_1$ and $\tilde{S}_1$ (implying that the maximum up-jump from $S_1$ and the maximum down-jump from $\tilde{S}_1$ are single-state jumps).

An example of a DTM machine is depicted in Figure 5.5.

Figure 5.5: An example of a DTM machine. Note that a transition between the lower and upper halves is allowed only from (and to) $S_1$ and $\tilde{S}_1$. Solid lines represent the maximum up or down jumps from each state.

In a DTM machine, only single-state down-jumps (up-jumps) from the states in the lower (upper) half are allowed. In addition, the transition between the lower and upper halves is allowed only from and to the nearest states to $\tfrac{1}{2}$, $S_1$ and $\tilde{S}_1$, implying that the maximum up-jump (down-jump) from $\{S_{k_l}, \ldots, S_2\}$ ($\{\tilde{S}_2, \ldots, \tilde{S}_{k_u}\}$) is up to $S_1$ ($\tilde{S}_1$). These constraints facilitate the algorithm for constructing the optimal DTM machine.
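The structural constraints of Definition 5.1 are easy to express in code. The following Python sketch (ours, illustrative only) checks whether a proposed transition structure qualifies as a DTM machine; the representation, with globally ascending state indices and a per-state set of reachable next states, is an assumption of ours and differs from the thesis's half-by-half indexing.

```python
def is_dtm(states, reachable):
    """states: ascending list of state values in [0, 1] (at least one on each side of 1/2).
    reachable[i]: set of state indices the machine can move to from state i.
    Checks the constraints of Definition 5.1: single-step moves toward the edges,
    and crossing 1/2 only between the two states nearest to 1/2."""
    lower = [i for i, s in enumerate(states) if s <= 0.5]
    upper = [i for i, s in enumerate(states) if s > 0.5]
    s1, s1_tilde = max(lower), min(upper)          # nearest states to 1/2
    for i, nxt in enumerate(reachable):
        for j in nxt:
            if i in lower and j in lower and j < i - 1:
                return False                       # >1-state down-step in the lower half
            if i in upper and j in upper and j > i + 1:
                return False                       # >1-state up-step in the upper half
            if i in lower and j in upper and (i != s1 or j != s1_tilde):
                return False                       # crossing up allowed only S_1 -> S~_1
            if i in upper and j in lower and (i != s1_tilde or j != s1):
                return False                       # crossing down allowed only S~_1 -> S_1
    return True
```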

5.4.2 Building the Optimal DTM Machine

We present here a schematic algorithm for constructing the optimal DTM machine. Given a desired regret $R_d$, the task of finding the optimal DTM machine can be viewed as a covering problem, that is, assigning the smallest number of states in the interval [0, 1] that achieves regret smaller than $R_d$ for all sequences.

We note that in an optimal k-state machine, the upper half of the states is the mirror image of the lower half. This symmetry property arises from the fact that any sequence $x_1, \ldots, x_n$ can be transformed into the symmetric sequence $1 - x_1, \ldots, 1 - x_n$; both sequences achieve the same regret if full symmetry between the lower and upper halves is applied. Thus, assuming that the lower half is optimal in the sense of achieving the desired regret with the smallest number of states, the upper half must be the reflection of the lower half to achieve optimality. Note that this property allows us to design the optimal DTM machine only for the lower half.

The algorithm we present here recursively finds the optimal state allocation and the transition thresholds. Suppose the states $\{S_{i-1}, \ldots, S_1\}$ in the lower half (in descending order, where $S_1$ is the nearest state to $\tfrac{1}{2}$) and their transition threshold sets $\{T_{i-1}, \ldots, T_1\}$ are given and achieve regret smaller than $R_d$ for all minimal circles between them. Our algorithm generates the optimal $S_i$, i.e. the optimal allocation for state $i$, and a threshold set $T_i$ achieving regret smaller than $R_d$ for all minimal circles starting at that state. We start by finding $S_1$, the nearest state to $\tfrac{1}{2}$ in the lower half of the optimal DTM machine.

Lemma 5.1. In the optimal k-state DTM machine for a given desired regret $R_d$, $S_1 = \tfrac{1}{2}$ if $k$ is odd, and
$$S_1 = \max\left\{\,1 - \sqrt{R_d + \tfrac{1}{4}},\;\; 2 + \sqrt{R_d} - \sqrt{4R_d + 4\sqrt{R_d} + 2}\,\right\}$$
if $k$ is even.

Proof. By symmetry, $S_1 = \tfrac{1}{2}$ in the optimal DTM machine with an odd number of states; otherwise there are more states in one of the halves and the symmetry property presented above does not hold. In the optimal DTM machine with an even number of states $k$, the nearest state to $\tfrac{1}{2}$ in the upper half, $\tilde{S}_1$, is the mirror image of $S_1$, hence $\tilde{S}_1 = 1 - S_1$. By definition, only a single-state up-jump is allowed from $S_1$ and only a single-state down-jump is allowed from $\tilde{S}_1$. Thus, the machine can be rotated between these two states, forming a two-step minimal circle. Denote by $x_1$ and $x_2$ the samples that induce the up and down jumps, correspondingly. These samples must satisfy the transition thresholds, i.e.
$$x_1 \ge S_1 + \sqrt{R_d}, \qquad x_2 \le \tilde{S}_1 - \sqrt{R_d} = 1 - S_1 - \sqrt{R_d}. \qquad (5.20)$$
Since the regret is a convex function of the samples, the samples $x_1$ and $x_2$ that bring the regret of the minimal circle to its maximum are at the edges of the constraint regions; therefore in a two-step minimal circle there are four combinations that need to be analyzed:

1. $x_1 = 1$, $x_2 = 0$:
$$\text{regret} = \tfrac{1}{2}\left[(1 - S_1)^2 + (0 - \tilde{S}_1)^2 - (1 - \tfrac{1}{2})^2 - (0 - \tfrac{1}{2})^2\right] = (1 - S_1)^2 - \left(\tfrac{1}{2}\right)^2. \qquad (5.21)$$
Thus, to achieve regret smaller than $R_d$ we require
$$1 - \sqrt{R_d + \tfrac{1}{4}} \le S_1. \qquad \text{(i)}$$
2. $x_1 = 1$, $x_2 = 1 - S_1 - \sqrt{R_d}$:
$$\text{regret} = \tfrac{1}{2}\left[(1 - S_1)^2 + R_d - (1 - \bar{x})^2 - (1 - S_1 - \sqrt{R_d} - \bar{x})^2\right] = \tfrac{1}{2}\left[(1 - S_1)^2 + R_d - \tfrac{1}{2}(S_1 + \sqrt{R_d})^2\right] = \tfrac{1}{2}\left[\tfrac{1}{2}S_1^2 - S_1(2 + \sqrt{R_d}) + 1 + \tfrac{1}{2}R_d\right]. \qquad (5.22)$$
Thus, to achieve regret smaller than $R_d$ we require
$$2 + \sqrt{R_d} - \sqrt{4R_d + 4\sqrt{R_d} + 2} \;\le\; S_1 \;\le\; 2 + \sqrt{R_d} + \sqrt{4R_d + 4\sqrt{R_d} + 2}. \qquad \text{(ii)}$$
3. $x_1 = S_1 + \sqrt{R_d}$, $x_2 = 0$: same result as in case 2.
4. $x_1 = S_1 + \sqrt{R_d}$, $x_2 = 1 - S_1 - \sqrt{R_d}$:
$$\text{regret} = \tfrac{1}{2}\left[R_d + R_d - (x_1 - \bar{x})^2 - (x_2 - \bar{x})^2\right] = R_d - \tfrac{1}{4}\left(2S_1 + 2\sqrt{R_d} - 1\right)^2. \qquad (5.23)$$
Thus, to achieve regret smaller than $R_d$ we require
$$S_1^2 - S_1\left(1 - 2\sqrt{R_d}\right) + \tfrac{1}{4} - \sqrt{R_d} + R_d \ge 0, \qquad (5.24)$$
which is satisfied for any $S_1$.

Therefore $S_1$ must satisfy the two constraints (i) and (ii). We also want to choose the lowest $S_1$ that satisfies both constraints, hence we take the maximum of the lower bounds in (i) and (ii). Note that $S_1$ must also satisfy $S_1 \le \tfrac{1}{2}$, which does not hold for low enough $R_d$, implying a lower bound on the achievable regret of the optimal DTM machine (see the next section).
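A small numerical companion to the proof above (ours, not from the thesis): for a given desired regret $R_d$ it evaluates the regret of the two-step circle at the four corner sample combinations and searches for the smallest $S_1$ (with $\tilde{S}_1 = 1 - S_1$) that keeps all of them below $R_d$. It also reproduces the observation that no feasible $S_1 \le \tfrac{1}{2}$ exists once $R_d$ is too small; the grid resolution and function names are assumptions.

```python
import numpy as np

def corner_regrets(s1, rd):
    """Regrets of the two-step circle between S_1 and 1 - S_1 at the four
    corner sample combinations used in the proof of Lemma 5.1."""
    s1t = 1.0 - s1                          # mirror state above 1/2
    x1s = (1.0, s1 + np.sqrt(rd))           # up-jump sample: maximal, or at the threshold
    x2s = (0.0, s1t - np.sqrt(rd))          # down-jump sample: minimal, or at the threshold
    out = []
    for x1 in x1s:
        for x2 in x2s:
            xbar = 0.5 * (x1 + x2)
            out.append(0.5 * ((x1 - s1) ** 2 + (x2 - s1t) ** 2
                              - (x1 - xbar) ** 2 - (x2 - xbar) ** 2))
    return out

def smallest_feasible_s1(rd, grid=20001):
    for s1 in np.linspace(0.0, 0.5, grid):
        if max(corner_regrets(s1, rd)) <= rd:
            return s1
    return None                             # no feasible S_1 <= 1/2: R_d not achievable by a DTM

for rd in (0.10, 0.05, 0.03, 0.02):
    print(rd, smallest_feasible_s1(rd))
```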

Now, after presenting the starting state of the algorithm, we present the complete algorithm for constructing the optimal DTM machine:

1. Set $i = 1$ and the corresponding starting state $S_1$ for an odd or even number of states (see Lemma 5.1). Set the maximum up-step from the starting state $m_{u,1} = 1$.
2. Set the next state index $i = i + 1$.
3. For all $1 \le m \le i - 1$ (where $m$ denotes the maximum up-step from state $i$), find the minimal $S_{i,m}$ with a valid threshold set $T_{i,m}$ (in the sequel we present the algorithm for finding the threshold set).
4. Choose the minimal $S_{i,m}$ among all possible maximum up-steps, that is,
$$m_{u,i} = \arg\min_{1 \le m \le i-1} S_{i,m}, \qquad S_i = S_{i,m_{u,i}}, \qquad T_i = T_{i,m_{u,i}}.$$
Thus we have set the parameters of state $i$: assigned value $S_i$, maximum up-jump of $m_{u,i}$ states and transition thresholds $T_i$.
5. If $S_i > \sqrt{R_d}$, go to step (2).
6. Set the upper half of the states to be the mirror image of the lower half.

Explanations and Comments:

For a given desired regret $R_d$, one should run the algorithm presented above twice: once for an odd and once for an even number of states, with the corresponding starting state $S_1$. The optimal DTM machine is the one with the fewer states of the two (they differ by a single state).

Note that the transition thresholds for state 1 need to be given: a single-state up-jump if the input sample satisfies $x \ge S_1 + \sqrt{R_d}$ and a single-state down-jump if the input sample satisfies $x \le S_1 - \sqrt{R_d}$. These are the optimal transition thresholds, since widening the transition interval only increases the number of possible worst sequences in other minimal circles. Note also that these transition thresholds achieve the maximal regret $R_d$ for the zero-step minimal circle (staying at $S_1$).

A valid threshold set for state $i$ is a set of transition thresholds that achieves regret smaller than $R_d$ for all minimal circles starting at state $i$. We now want to find optimal transition thresholds for state $i$. Note that in DTM machines the thresholds for up-steps from state $i$ have no impact on the regrets of minimal circles other than those starting at state $i$; thus, the optimality of these thresholds is only in the sense of achieving regret smaller than $R_d$ for minimal circles starting at state $i$. The optimal transition thresholds for the down-step are $\{0, S_i - \sqrt{R_d}\}$, since we want to allow the smallest possible interval of input samples that induce a down-step (the smallest impact on minimal circles that start at a different state).

Now, consider that the states $\{S_{i-1}, \ldots, S_1\}$ in the lower half and their transition threshold sets $\{T_{i-1}, \ldots, T_1\}$ are given and achieve regret smaller than $R_d$ for all minimal circles between them. Suppose also that $S_i$ and $m$ are given, where $m$ denotes the maximum up-step from state $i$. Note that there are $m + 1$ minimal circles starting at state $i$ (depicted in Figure 5.6):
- the zero-step minimal circle (staying at state $i$);
- for any $2 \le j \le m + 1$, a minimal circle of $j$ steps: one up-step (of $j - 1$ states) and $j - 1$ down-steps (of a single state each).
Also note that these $m + 1$ minimal circles are within the lower half, that is, within the states $\{S_{i-1}, \ldots, S_1\}$ (since by definition a transition from the lower half to the upper half is allowed only from $S_1$). Let $x_1^j$ be the samples that endlessly rotate the machine in a $j$-step minimal circle, where $x_1$ induces the up-step from state $i$ and $x_2, \ldots, x_j$ induce the down-steps. Since the regret is convex in the input samples, the samples $x_2, \ldots, x_j$ that bring the regret to its maximum are at the edges of the transition regions, that is, they satisfy
$$x_t = \hat{x}_t - \sqrt{R_d} \quad \text{or} \quad x_t = 0, \qquad 2 \le t \le j. \qquad (5.25)$$

Figure 5.6: The $m + 1$ possible minimal circles starting at $S_i$, where $m$ is the maximum up-step from state $i$.

Suppose that for a given $x_2^j$, the regret of the sequence is smaller than $R_d$ if $x_1$ satisfies
$$C_l(x_2^j) \le x_1 \le C_h(x_2^j). \qquad (5.26)$$
Thus, by Equation (5.25), $x_1$ must satisfy the constraint given in (5.26) for the $2^{j-1}$ known combinations of $x_2^j$ in order to achieve regret smaller than $R_d$ for any sequence that rotates the machine in this minimal circle. Hence,
$$C_l = \max_{x_2^j \in A} C_l(x_2^j) \;\le\; x_1 \;\le\; \min_{x_2^j \in A} C_h(x_2^j) = C_h \qquad (5.27)$$
satisfies all the constraints, where $A$ is the set of $2^{j-1}$ combinations of $x_2^j$ according to Equation (5.25).

Since $x_1$ must also satisfy the transition thresholds of state $i$, i.e.
$$T_{i,j-2} \le x_1 \le T_{i,j-1}, \qquad (5.28)$$
we can conclude that the transition thresholds must satisfy
$$C_l \le T_{i,j-2}, \qquad T_{i,j-1} \le C_h. \qquad (5.29)$$
Going over all minimal circles, $2 \le j \le m + 1$, results in bounds on all the transition thresholds (an upper and a lower bound on each threshold). Thus, if a threshold set can be found that satisfies all the bounds and covers the interval $[S_i + \sqrt{R_d}, 1]$, we are done; otherwise, no valid threshold set can be found for the given $S_i$ and $m$.

Lemma 5.2. Consider a sequence $x_1^j$ that rotates a DTM machine in a $j$-step minimal circle starting at state $i$. Given the states $S_i, \ldots, S_{i-j+1}$, the regret is smaller than $R_d$ if $x_1$ satisfies $C_l(x_2^j) \le x_1 \le C_h(x_2^j)$, where $C_l(x_2^j)$ and $C_h(x_2^j)$ satisfy
$$C_l(x_2^j) = S_i + \sum_{t=2}^{j}(S_i - x_t) - j\sqrt{R_d - \frac{1}{j}\sum_{t=2}^{j}(S_{i-j+t-1} - S_i)(S_{i-j+t-1} + S_i - 2x_t)},$$
$$C_h(x_2^j) = S_i + \sum_{t=2}^{j}(S_i - x_t) + j\sqrt{R_d - \frac{1}{j}\sum_{t=2}^{j}(S_{i-j+t-1} - S_i)(S_{i-j+t-1} + S_i - 2x_t)}. \qquad (5.30)$$

Proof. Assume the samples $x_1^j$ rotate the machine in a $j$-step minimal circle starting at state $i$. Let us analyze the regret of this sequence:
$$\text{regret}_j = \frac{1}{j}\sum_{t=1}^{j}\left[(x_t - \hat{x}_t)^2 - (x_t - \bar{x})^2\right] = \bar{x}^2 + \frac{1}{j}\sum_{t=1}^{j}\left[\hat{x}_t^2 - 2x_t\hat{x}_t\right] = \bar{x}^2 + \frac{1}{j}\left[S_i^2 - 2x_1 S_i\right] + \frac{1}{j}\sum_{t=2}^{j}\left[S_{i-j+t-1}^2 - 2x_t S_{i-j+t-1}\right].$$
We require $\text{regret}_j \le R_d$, meaning $x_1$ must satisfy the following inequality:
$$x_1^2 + 2x_1\left(\sum_{t=2}^{j}x_t - jS_i\right) + \left(\sum_{t=2}^{j}x_t\right)^2 + j\left[S_i^2 + \sum_{t=2}^{j}\left(S_{i-j+t-1}^2 - 2x_t S_{i-j+t-1}\right)\right] - j^2 R_d \le 0. \qquad (5.31)$$


More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

LOW-density parity-check (LDPC) codes were invented

LOW-density parity-check (LDPC) codes were invented IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 54, NO 1, JANUARY 2008 51 Extremal Problems of Information Combining Yibo Jiang, Alexei Ashikhmin, Member, IEEE, Ralf Koetter, Senior Member, IEEE, and Andrew

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Subset Universal Lossy Compression

Subset Universal Lossy Compression Subset Universal Lossy Compression Or Ordentlich Tel Aviv University ordent@eng.tau.ac.il Ofer Shayevitz Tel Aviv University ofersha@eng.tau.ac.il Abstract A lossy source code C with rate R for a discrete

More information

Homework 4, 5, 6 Solutions. > 0, and so a n 0 = n + 1 n = ( n+1 n)( n+1+ n) 1 if n is odd 1/n if n is even diverges.

Homework 4, 5, 6 Solutions. > 0, and so a n 0 = n + 1 n = ( n+1 n)( n+1+ n) 1 if n is odd 1/n if n is even diverges. 2..2(a) lim a n = 0. Homework 4, 5, 6 Solutions Proof. Let ɛ > 0. Then for n n = 2+ 2ɛ we have 2n 3 4+ ɛ 3 > ɛ > 0, so 0 < 2n 3 < ɛ, and thus a n 0 = 2n 3 < ɛ. 2..2(g) lim ( n + n) = 0. Proof. Let ɛ >

More information

Optimal matching in wireless sensor networks

Optimal matching in wireless sensor networks Optimal matching in wireless sensor networks A. Roumy, D. Gesbert INRIA-IRISA, Rennes, France. Institute Eurecom, Sophia Antipolis, France. Abstract We investigate the design of a wireless sensor network

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago December 2015 Abstract Rothschild and Stiglitz (1970) represent random

More information

Maximum Likelihood (ML), Expectation Maximization (EM) Pieter Abbeel UC Berkeley EECS

Maximum Likelihood (ML), Expectation Maximization (EM) Pieter Abbeel UC Berkeley EECS Maximum Likelihood (ML), Expectation Maximization (EM) Pieter Abbeel UC Berkeley EECS Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics Outline Maximum likelihood (ML) Priors, and

More information

Iowa State University. Instructor: Alex Roitershtein Summer Exam #1. Solutions. x u = 2 x v

Iowa State University. Instructor: Alex Roitershtein Summer Exam #1. Solutions. x u = 2 x v Math 501 Iowa State University Introduction to Real Analysis Department of Mathematics Instructor: Alex Roitershtein Summer 015 Exam #1 Solutions This is a take-home examination. The exam includes 8 questions.

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago January 2016 Consider a situation where one person, call him Sender,

More information

Power series solutions for 2nd order linear ODE s (not necessarily with constant coefficients) a n z n. n=0

Power series solutions for 2nd order linear ODE s (not necessarily with constant coefficients) a n z n. n=0 Lecture 22 Power series solutions for 2nd order linear ODE s (not necessarily with constant coefficients) Recall a few facts about power series: a n z n This series in z is centered at z 0. Here z can

More information

Bounded Expected Delay in Arithmetic Coding

Bounded Expected Delay in Arithmetic Coding Bounded Expected Delay in Arithmetic Coding Ofer Shayevitz, Ram Zamir, and Meir Feder Tel Aviv University, Dept. of EE-Systems Tel Aviv 69978, Israel Email: {ofersha, zamir, meir }@eng.tau.ac.il arxiv:cs/0604106v1

More information

Computational and Statistical Learning theory

Computational and Statistical Learning theory Computational and Statistical Learning theory Problem set 2 Due: January 31st Email solutions to : karthik at ttic dot edu Notation : Input space : X Label space : Y = {±1} Sample : (x 1, y 1,..., (x n,

More information

Online Learning with Experts & Multiplicative Weights Algorithms

Online Learning with Experts & Multiplicative Weights Algorithms Online Learning with Experts & Multiplicative Weights Algorithms CS 159 lecture #2 Stephan Zheng April 1, 2016 Caltech Table of contents 1. Online Learning with Experts With a perfect expert Without perfect

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Algebra II Vocabulary Alphabetical Listing. Absolute Maximum: The highest point over the entire domain of a function or relation.

Algebra II Vocabulary Alphabetical Listing. Absolute Maximum: The highest point over the entire domain of a function or relation. Algebra II Vocabulary Alphabetical Listing Absolute Maximum: The highest point over the entire domain of a function or relation. Absolute Minimum: The lowest point over the entire domain of a function

More information

Doubly Indexed Infinite Series

Doubly Indexed Infinite Series The Islamic University of Gaza Deanery of Higher studies Faculty of Science Department of Mathematics Doubly Indexed Infinite Series Presented By Ahed Khaleel Abu ALees Supervisor Professor Eissa D. Habil

More information

Piecewise Constant Prediction

Piecewise Constant Prediction Piecewise Constant Prediction Erik Ordentlich, Marcelo J. Weinberger, Yihong Wu HP Laboratories HPL-01-114 Keyword(s): prediction; data compression; i strategies; universal schemes Abstract: Mini prediction

More information

Exercises for Unit VI (Infinite constructions in set theory)

Exercises for Unit VI (Infinite constructions in set theory) Exercises for Unit VI (Infinite constructions in set theory) VI.1 : Indexed families and set theoretic operations (Halmos, 4, 8 9; Lipschutz, 5.3 5.4) Lipschutz : 5.3 5.6, 5.29 5.32, 9.14 1. Generalize

More information

ALGEBRA I CCR MATH STANDARDS

ALGEBRA I CCR MATH STANDARDS RELATIONSHIPS BETWEEN QUANTITIES AND REASONING WITH EQUATIONS M.A1HS.1 M.A1HS.2 M.A1HS.3 M.A1HS.4 M.A1HS.5 M.A1HS.6 M.A1HS.7 M.A1HS.8 M.A1HS.9 M.A1HS.10 Reason quantitatively and use units to solve problems.

More information

1 Markov decision processes

1 Markov decision processes 2.997 Decision-Making in Large-Scale Systems February 4 MI, Spring 2004 Handout #1 Lecture Note 1 1 Markov decision processes In this class we will study discrete-time stochastic systems. We can describe

More information

Computational Game Theory Spring Semester, 2005/6. Lecturer: Yishay Mansour Scribe: Ilan Cohen, Natan Rubin, Ophir Bleiberg*

Computational Game Theory Spring Semester, 2005/6. Lecturer: Yishay Mansour Scribe: Ilan Cohen, Natan Rubin, Ophir Bleiberg* Computational Game Theory Spring Semester, 2005/6 Lecture 5: 2-Player Zero Sum Games Lecturer: Yishay Mansour Scribe: Ilan Cohen, Natan Rubin, Ophir Bleiberg* 1 5.1 2-Player Zero Sum Games In this lecture

More information

= 10 such triples. If it is 5, there is = 1 such triple. Therefore, there are a total of = 46 such triples.

= 10 such triples. If it is 5, there is = 1 such triple. Therefore, there are a total of = 46 such triples. . Two externally tangent unit circles are constructed inside square ABCD, one tangent to AB and AD, the other to BC and CD. Compute the length of AB. Answer: + Solution: Observe that the diagonal of the

More information

Module 1. Probability

Module 1. Probability Module 1 Probability 1. Introduction In our daily life we come across many processes whose nature cannot be predicted in advance. Such processes are referred to as random processes. The only way to derive

More information

Kolmogorov complexity

Kolmogorov complexity Kolmogorov complexity In this section we study how we can define the amount of information in a bitstring. Consider the following strings: 00000000000000000000000000000000000 0000000000000000000000000000000000000000

More information

Lecture 9. d N(0, 1). Now we fix n and think of a SRW on [0,1]. We take the k th step at time k n. and our increments are ± 1

Lecture 9. d N(0, 1). Now we fix n and think of a SRW on [0,1]. We take the k th step at time k n. and our increments are ± 1 Random Walks and Brownian Motion Tel Aviv University Spring 011 Lecture date: May 0, 011 Lecture 9 Instructor: Ron Peled Scribe: Jonathan Hermon In today s lecture we present the Brownian motion (BM).

More information

Worst-Case Optimal Redistribution of VCG Payments in Multi-Unit Auctions

Worst-Case Optimal Redistribution of VCG Payments in Multi-Unit Auctions Worst-Case Optimal Redistribution of VCG Payments in Multi-Unit Auctions Mingyu Guo Duke University Dept. of Computer Science Durham, NC, USA mingyu@cs.duke.edu Vincent Conitzer Duke University Dept. of

More information

On Fast Bitonic Sorting Networks

On Fast Bitonic Sorting Networks On Fast Bitonic Sorting Networks Tamir Levi Ami Litman January 1, 2009 Abstract This paper studies fast Bitonic sorters of arbitrary width. It constructs such a sorter of width n and depth log(n) + 3,

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Lecture 4: Proof of Shannon s theorem and an explicit code

Lecture 4: Proof of Shannon s theorem and an explicit code CSE 533: Error-Correcting Codes (Autumn 006 Lecture 4: Proof of Shannon s theorem and an explicit code October 11, 006 Lecturer: Venkatesan Guruswami Scribe: Atri Rudra 1 Overview Last lecture we stated

More information

We have been going places in the car of calculus for years, but this analysis course is about how the car actually works.

We have been going places in the car of calculus for years, but this analysis course is about how the car actually works. Analysis I We have been going places in the car of calculus for years, but this analysis course is about how the car actually works. Copier s Message These notes may contain errors. In fact, they almost

More information

On the errors introduced by the naive Bayes independence assumption

On the errors introduced by the naive Bayes independence assumption On the errors introduced by the naive Bayes independence assumption Author Matthijs de Wachter 3671100 Utrecht University Master Thesis Artificial Intelligence Supervisor Dr. Silja Renooij Department of

More information

On the smoothness of the conjugacy between circle maps with a break

On the smoothness of the conjugacy between circle maps with a break On the smoothness of the conjugacy between circle maps with a break Konstantin Khanin and Saša Kocić 2 Department of Mathematics, University of Toronto, Toronto, ON, Canada M5S 2E4 2 Department of Mathematics,

More information

The Capacity of Finite Abelian Group Codes Over Symmetric Memoryless Channels Giacomo Como and Fabio Fagnani

The Capacity of Finite Abelian Group Codes Over Symmetric Memoryless Channels Giacomo Como and Fabio Fagnani IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 5, MAY 2009 2037 The Capacity of Finite Abelian Group Codes Over Symmetric Memoryless Channels Giacomo Como and Fabio Fagnani Abstract The capacity

More information

2. Two binary operations (addition, denoted + and multiplication, denoted

2. Two binary operations (addition, denoted + and multiplication, denoted Chapter 2 The Structure of R The purpose of this chapter is to explain to the reader why the set of real numbers is so special. By the end of this chapter, the reader should understand the difference between

More information

POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS

POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS 1.1. The Rutherford-Chadwick-Ellis Experiment. About 90 years ago Ernest Rutherford and his collaborators at the Cavendish Laboratory in Cambridge conducted

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Discrete Geometry. Problem 1. Austin Mohr. April 26, 2012

Discrete Geometry. Problem 1. Austin Mohr. April 26, 2012 Discrete Geometry Austin Mohr April 26, 2012 Problem 1 Theorem 1 (Linear Programming Duality). Suppose x, y, b, c R n and A R n n, Ax b, x 0, A T y c, and y 0. If x maximizes c T x and y minimizes b T

More information

PETERS TOWNSHIP HIGH SCHOOL

PETERS TOWNSHIP HIGH SCHOOL PETERS TOWNSHIP HIGH SCHOOL COURSE SYLLABUS: ALG EBRA 2 HONORS Course Overview and Essential Skills This course is an in-depth study of the language, concepts, and techniques of Algebra that will prepare

More information

1 Sequences and Summation

1 Sequences and Summation 1 Sequences and Summation A sequence is a function whose domain is either all the integers between two given integers or all the integers greater than or equal to a given integer. For example, a m, a m+1,...,

More information

Dublin City Schools Mathematics Graded Course of Study Algebra I Philosophy

Dublin City Schools Mathematics Graded Course of Study Algebra I Philosophy Philosophy The Dublin City Schools Mathematics Program is designed to set clear and consistent expectations in order to help support children with the development of mathematical understanding. We believe

More information

Advanced topic: Space complexity

Advanced topic: Space complexity Advanced topic: Space complexity CSCI 3130 Formal Languages and Automata Theory Siu On CHAN Chinese University of Hong Kong Fall 2016 1/28 Review: time complexity We have looked at how long it takes to

More information

CSC 5170: Theory of Computational Complexity Lecture 4 The Chinese University of Hong Kong 1 February 2010

CSC 5170: Theory of Computational Complexity Lecture 4 The Chinese University of Hong Kong 1 February 2010 CSC 5170: Theory of Computational Complexity Lecture 4 The Chinese University of Hong Kong 1 February 2010 Computational complexity studies the amount of resources necessary to perform given computations.

More information

arxiv: v1 [math.co] 17 Dec 2007

arxiv: v1 [math.co] 17 Dec 2007 arxiv:07.79v [math.co] 7 Dec 007 The copies of any permutation pattern are asymptotically normal Milós Bóna Department of Mathematics University of Florida Gainesville FL 36-805 bona@math.ufl.edu Abstract

More information

The Mandelbrot Set and the Farey Tree

The Mandelbrot Set and the Farey Tree The Mandelbrot Set and the Farey Tree Emily Allaway May 206 Contents Introduction 2 2 The Mandelbrot Set 2 2. Definitions of the Mandelbrot Set................ 2 2.2 The First Folk Theorem.....................

More information

Counting independent sets up to the tree threshold

Counting independent sets up to the tree threshold Counting independent sets up to the tree threshold Dror Weitz March 16, 2006 Abstract We consider the problem of approximately counting/sampling weighted independent sets of a graph G with activity λ,

More information

COUNTING NUMERICAL SEMIGROUPS BY GENUS AND SOME CASES OF A QUESTION OF WILF

COUNTING NUMERICAL SEMIGROUPS BY GENUS AND SOME CASES OF A QUESTION OF WILF COUNTING NUMERICAL SEMIGROUPS BY GENUS AND SOME CASES OF A QUESTION OF WILF NATHAN KAPLAN Abstract. The genus of a numerical semigroup is the size of its complement. In this paper we will prove some results

More information

Digital Image Processing Lectures 25 & 26

Digital Image Processing Lectures 25 & 26 Lectures 25 & 26, Professor Department of Electrical and Computer Engineering Colorado State University Spring 2015 Area 4: Image Encoding and Compression Goal: To exploit the redundancies in the image

More information

Common Core State Standards with California Additions 1 Standards Map. Algebra I

Common Core State Standards with California Additions 1 Standards Map. Algebra I Common Core State s with California Additions 1 s Map Algebra I *Indicates a modeling standard linking mathematics to everyday life, work, and decision-making N-RN 1. N-RN 2. Publisher Language 2 Primary

More information

Asymptotically optimal induced universal graphs

Asymptotically optimal induced universal graphs Asymptotically optimal induced universal graphs Noga Alon Abstract We prove that the minimum number of vertices of a graph that contains every graph on vertices as an induced subgraph is (1+o(1))2 ( 1)/2.

More information

Algebra 2 Khan Academy Video Correlations By SpringBoard Activity

Algebra 2 Khan Academy Video Correlations By SpringBoard Activity SB Activity Activity 1 Creating Equations 1-1 Learning Targets: Create an equation in one variable from a real-world context. Solve an equation in one variable. 1-2 Learning Targets: Create equations in

More information

A function is actually a simple concept; if it were not, history would have replaced it with a simpler one by now! Here is the definition:

A function is actually a simple concept; if it were not, history would have replaced it with a simpler one by now! Here is the definition: 1.2 Functions and Their Properties A function is actually a simple concept; if it were not, history would have replaced it with a simpler one by now! Here is the definition: Definition: Function, Domain,

More information

Algebra 2 Khan Academy Video Correlations By SpringBoard Activity

Algebra 2 Khan Academy Video Correlations By SpringBoard Activity SB Activity Activity 1 Creating Equations 1-1 Learning Targets: Create an equation in one variable from a real-world context. Solve an equation in one variable. 1-2 Learning Targets: Create equations in

More information

Regularity and optimization

Regularity and optimization Regularity and optimization Bruno Gaujal INRIA SDA2, 2007 B. Gaujal (INRIA ) Regularity and Optimization SDA2, 2007 1 / 39 Majorization x = (x 1,, x n ) R n, x is a permutation of x with decreasing coordinates.

More information

Worst-Case Violation of Sampled Convex Programs for Optimization with Uncertainty

Worst-Case Violation of Sampled Convex Programs for Optimization with Uncertainty Worst-Case Violation of Sampled Convex Programs for Optimization with Uncertainty Takafumi Kanamori and Akiko Takeda Abstract. Uncertain programs have been developed to deal with optimization problems

More information

No EFFICIENT LINE SEARCHING FOR CONVEX FUNCTIONS. By E. den Boef, D. den Hertog. May 2004 ISSN

No EFFICIENT LINE SEARCHING FOR CONVEX FUNCTIONS. By E. den Boef, D. den Hertog. May 2004 ISSN No. 4 5 EFFICIENT LINE SEARCHING FOR CONVEX FUNCTIONS y E. den oef, D. den Hertog May 4 ISSN 94-785 Efficient Line Searching for Convex Functions Edgar den oef Dick den Hertog 3 Philips Research Laboratories,

More information

Codes for Partially Stuck-at Memory Cells

Codes for Partially Stuck-at Memory Cells 1 Codes for Partially Stuck-at Memory Cells Antonia Wachter-Zeh and Eitan Yaakobi Department of Computer Science Technion Israel Institute of Technology, Haifa, Israel Email: {antonia, yaakobi@cs.technion.ac.il

More information

Theorems. Theorem 1.11: Greatest-Lower-Bound Property. Theorem 1.20: The Archimedean property of. Theorem 1.21: -th Root of Real Numbers

Theorems. Theorem 1.11: Greatest-Lower-Bound Property. Theorem 1.20: The Archimedean property of. Theorem 1.21: -th Root of Real Numbers Page 1 Theorems Wednesday, May 9, 2018 12:53 AM Theorem 1.11: Greatest-Lower-Bound Property Suppose is an ordered set with the least-upper-bound property Suppose, and is bounded below be the set of lower

More information

MA008/MIIZ01 Design and Analysis of Algorithms Lecture Notes 3

MA008/MIIZ01 Design and Analysis of Algorithms Lecture Notes 3 MA008 p.1/37 MA008/MIIZ01 Design and Analysis of Algorithms Lecture Notes 3 Dr. Markus Hagenbuchner markus@uow.edu.au. MA008 p.2/37 Exercise 1 (from LN 2) Asymptotic Notation When constants appear in exponents

More information

CSE 206A: Lattice Algorithms and Applications Spring Basic Algorithms. Instructor: Daniele Micciancio

CSE 206A: Lattice Algorithms and Applications Spring Basic Algorithms. Instructor: Daniele Micciancio CSE 206A: Lattice Algorithms and Applications Spring 2014 Basic Algorithms Instructor: Daniele Micciancio UCSD CSE We have already seen an algorithm to compute the Gram-Schmidt orthogonalization of a lattice

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

Today s class. Constrained optimization Linear programming. Prof. Jinbo Bi CSE, UConn. Numerical Methods, Fall 2011 Lecture 12

Today s class. Constrained optimization Linear programming. Prof. Jinbo Bi CSE, UConn. Numerical Methods, Fall 2011 Lecture 12 Today s class Constrained optimization Linear programming 1 Midterm Exam 1 Count: 26 Average: 73.2 Median: 72.5 Maximum: 100.0 Minimum: 45.0 Standard Deviation: 17.13 Numerical Methods Fall 2011 2 Optimization

More information

5682 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER /$ IEEE

5682 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER /$ IEEE 5682 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER 2009 Hyperplane-Based Vector Quantization for Distributed Estimation in Wireless Sensor Networks Jun Fang, Member, IEEE, and Hongbin

More information

Limiting behaviour of large Frobenius numbers. V.I. Arnold

Limiting behaviour of large Frobenius numbers. V.I. Arnold Limiting behaviour of large Frobenius numbers by J. Bourgain and Ya. G. Sinai 2 Dedicated to V.I. Arnold on the occasion of his 70 th birthday School of Mathematics, Institute for Advanced Study, Princeton,

More information

INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS. Michael A. Lexa and Don H. Johnson

INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS. Michael A. Lexa and Don H. Johnson INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS Michael A. Lexa and Don H. Johnson Rice University Department of Electrical and Computer Engineering Houston, TX 775-892 amlexa@rice.edu,

More information

The Multi-Arm Bandit Framework

The Multi-Arm Bandit Framework The Multi-Arm Bandit Framework A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course In This Lecture A. LAZARIC Reinforcement Learning Algorithms Oct 29th, 2013-2/94

More information

Computing Minimal Polynomial of Matrices over Algebraic Extension Fields

Computing Minimal Polynomial of Matrices over Algebraic Extension Fields Bull. Math. Soc. Sci. Math. Roumanie Tome 56(104) No. 2, 2013, 217 228 Computing Minimal Polynomial of Matrices over Algebraic Extension Fields by Amir Hashemi and Benyamin M.-Alizadeh Abstract In this

More information

Standard forms for writing numbers

Standard forms for writing numbers Standard forms for writing numbers In order to relate the abstract mathematical descriptions of familiar number systems to the everyday descriptions of numbers by decimal expansions and similar means,

More information

EGYPTIAN FRACTIONS WITH EACH DENOMINATOR HAVING THREE DISTINCT PRIME DIVISORS

EGYPTIAN FRACTIONS WITH EACH DENOMINATOR HAVING THREE DISTINCT PRIME DIVISORS #A5 INTEGERS 5 (205) EGYPTIAN FRACTIONS WITH EACH DENOMINATOR HAVING THREE DISTINCT PRIME DIVISORS Steve Butler Department of Mathematics, Iowa State University, Ames, Iowa butler@iastate.edu Paul Erdős

More information

Space Complexity. The space complexity of a program is how much memory it uses.

Space Complexity. The space complexity of a program is how much memory it uses. Space Complexity The space complexity of a program is how much memory it uses. Measuring Space When we compute the space used by a TM, we do not count the input (think of input as readonly). We say that

More information

STIT Tessellations via Martingales

STIT Tessellations via Martingales STIT Tessellations via Martingales Christoph Thäle Department of Mathematics, Osnabrück University (Germany) Stochastic Geometry Days, Lille, 30.3.-1.4.2011 Christoph Thäle (Uni Osnabrück) STIT Tessellations

More information

Mathematical Methods for Physics and Engineering

Mathematical Methods for Physics and Engineering Mathematical Methods for Physics and Engineering Lecture notes for PDEs Sergei V. Shabanov Department of Mathematics, University of Florida, Gainesville, FL 32611 USA CHAPTER 1 The integration theory

More information

Algorithms, Design and Analysis. Order of growth. Table 2.1. Big-oh. Asymptotic growth rate. Types of formulas for basic operation count

Algorithms, Design and Analysis. Order of growth. Table 2.1. Big-oh. Asymptotic growth rate. Types of formulas for basic operation count Types of formulas for basic operation count Exact formula e.g., C(n) = n(n-1)/2 Algorithms, Design and Analysis Big-Oh analysis, Brute Force, Divide and conquer intro Formula indicating order of growth

More information