EECS 598-6: Prediction, Learning and Games    Fall 2013
Lecture 2: Weighted Majority Algorithm
Lecturer: Jacob Abernethy    Scribe: Petter Nilsson

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.

2.1 Course announcements

A GSI will (likely) be recruited to the course. For now, office hours are Tuesday afternoons. Homework is about to be posted, due 9/3. Group work is encouraged for the homework.

2.2 Predictive sorting revisited

Last lecture ended by studying the problem of predicting the outcomes of games played among k teams, where a game between two of these k teams is played in every round. Assume the outcomes y_s, s = 1, ..., t-1, have been observed in the first t-1 rounds for the pairs of teams (i_1, j_1), ..., (i_{t-1}, j_{t-1}). Further assume that there exists a perfect ranking π ∈ S_k determining the outcome y_s of each game. Without loss of generality (WLOG), assume π(i_s) < π(j_s) for s = 1, ..., t-1. In round t, the pair (i_t, j_t) is observed. The Halving algorithm makes a prediction ŷ_t of the outcome y_t by comparing the sizes of the following sets:

    C_t^< := {π ∈ S_k : π(i_s) < π(j_s) ∀ s = 1, ..., t-1, and π(i_t) < π(j_t)},    (2.1)
    C_t^> := {π ∈ S_k : π(i_s) < π(j_s) ∀ s = 1, ..., t-1, and π(i_t) > π(j_t)}.    (2.2)

The prediction follows the majority vote, i.e.

    ŷ_t = "<"  if |C_t^<| ≥ |C_t^>|,
          ">"  otherwise.    (2.3)

The naive way to determine ŷ_t is to enumerate S_k and eliminate the permutations that are inconsistent with past outcomes. This is however not efficient, since |S_k| = k!. The problem of computing the quantities |C_t^<| and |C_t^>| is in fact #P-hard: the outcomes can be recorded in a directed acyclic graph (DAG), and the problem is equivalent to counting the topological orderings of that graph, which is known to be #P-complete [1].

(Challenge Problem) Continuation from last lecture: Is the problem of determining the larger of these two sets also #P-complete, or is there an efficient algorithm?

2.3 Review of basics in Convex Analysis

Definition 2.1
A function f : R^n → R is convex if and only if any of the following conditions holds:

a) For all x, y ∈ R^n,

    (f(x) + f(y))/2 ≥ f((x + y)/2).    (2.4)
b) For all x, y ∈ R^n and α ∈ [0, 1],

    (1 - α) f(x) + α f(y) ≥ f((1 - α) x + α y).    (2.5)

c) For all random variables X, Jensen's inequality is satisfied:

    E f(X) ≥ f(E X).    (2.6)

Remark 2.2 As seen in (2.6), Jensen's inequality can actually be used to define convexity.

(Exercise) Prove that (2.4)-(2.6) are equivalent.

Proposition 2.3 A sufficient condition for f to be convex is that

    ∇²f ⪰ 0,    (2.7)

where ⪰ 0 stands for positive semi-definiteness, i.e. ∇²f has non-negative eigenvalues.

Proposition 2.4 If f is convex and differentiable, then for all x, δ ∈ R^n,

    f(x + δ) ≥ f(x) + δᵀ ∇f(x).    (2.8)

Remark 2.5 Equation (2.8) could also be used to define convexity. However, in the case of non-differentiability the concept of a subgradient is required.

2.3.1 Approximations

The following inequalities will be used in the course. The functions are plotted in Figure 2.1 for a visual interpretation of these inequalities.

0) A lower bound on the exponential function, obtained by linearizing e^x at x = 0:

    1 + x ≤ e^x.    (2.9)

1) An upper bound on the logarithm function, derived by applying the log operator to (2.9):

    log(1 + x) ≤ x.    (2.10)

2) This inequality follows from the convexity of f(z) = e^{-αz} and is obtained by inserting 0 and 1 as evaluation points in (2.5):

    e^{-αx} ≤ 1 - (1 - e^{-α}) x    for x ∈ [0, 1].    (2.11)

3) Finally, a harder inequality:

    -log(1 - x) ≤ x + x²    for x ∈ [0, 1/2].    (2.12)

(Challenge Problem) Logarithm inequality: Prove the inequality (2.12).
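The four bounds (2.9)-(2.12) are easy to spot-check numerically; the sketch below does so on arbitrary grids (the helper `check` and the choice α = 3 are my own, for illustration only):

```python
import math

def check(xs, lhs, rhs):
    """Return True if lhs(x) <= rhs(x) at every grid point (small tolerance)."""
    return all(lhs(x) <= rhs(x) + 1e-12 for x in xs)

grid = [i / 1000.0 for i in range(-500, 1001)]   # x in [-0.5, 1.0]
unit = [i / 1000.0 for i in range(0, 1001)]      # x in [0, 1]
half = [i / 1000.0 for i in range(0, 501)]       # x in [0, 1/2]
alpha = 3.0                                      # arbitrary choice for (2.11)

ok_29 = check(grid, lambda x: 1 + x, lambda x: math.exp(x))            # 1+x <= e^x
ok_210 = check(grid, lambda x: math.log(1 + x), lambda x: x)           # log(1+x) <= x
ok_211 = check(unit, lambda x: math.exp(-alpha * x),
               lambda x: 1 - (1 - math.exp(-alpha)) * x)               # (2.11)
ok_212 = check(half, lambda x: -math.log(1 - x), lambda x: x + x * x)  # (2.12)
print(ok_29, ok_210, ok_211, ok_212)  # True True True True
```

Of course a grid check is not a proof; (2.12) in particular actually holds slightly beyond x = 1/2, which is relevant for the range of ε in the proof of Theorem 2.6.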
Figure 2.1: Plots of the functions in the inequalities (2.9)-(2.12).
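Returning briefly to the predictive-sorting problem above: the naive Halving prediction (2.3) can be written down directly by enumerating S_k. The sketch below (function name and example pairs are my own) is only feasible for very small k, in line with the |S_k| = k! discussion:

```python
from itertools import permutations

def halving_predict(k, history, pair):
    """Majority vote over rankings consistent with past outcomes.
    history: list of (i, j) with the winner listed first, i.e. pi(i) < pi(j).
    pair: the current pair (i_t, j_t). Returns '<' or '>'."""
    n_less = n_greater = 0
    for perm in permutations(range(k)):  # perm[team] = rank of that team
        if all(perm[i] < perm[j] for i, j in history):
            i_t, j_t = pair
            if perm[i_t] < perm[j_t]:
                n_less += 1              # contributes to |C_t^<|
            else:
                n_greater += 1           # contributes to |C_t^>|
    return '<' if n_less >= n_greater else '>'

# Example with k = 4: after observing 0 beat 1 and 1 beat 2, every
# consistent ranking places 0 ahead of 2, so the vote is unanimous.
print(halving_predict(4, [(0, 1), (1, 2)], (0, 2)))  # prints '<'
```

Counting the consistent rankings exactly is precisely the #P-complete problem of counting topological orderings of the outcome DAG [1], which is why no efficient version of this enumeration is known.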
2.4 Back to Online Learning

The drawback of the Halving algorithm presented last time is the strong assumption that a perfect expert exists. How can this assumption be removed? The answer is to weight the experts based on their past performance. This leads to the Weighted Majority Algorithm (WMA) outlined below. The algorithm was first proposed in 1994 by Littlestone and Warmuth [2].

Algorithm 1: Weighted Majority Algorithm
  input: N experts predicting the outcomes, a parameter ε > 0.
  w_i^1 ← 1 for all i = 1, ..., N    (weights initialized to 1 for all experts)
  for rounds t = 1, 2, ... do
      f_i^t ∈ {0, 1}    (obtain the prediction of expert i, i = 1, ..., N)
      ŷ_t ← round( Σ_i w_i^t f_i^t / Σ_i w_i^t )    (compute the prediction of WMA)
      Outcome y_t is revealed
      w_i^{t+1} ← w_i^t (1 - ε)^{1[f_i^t ≠ y_t]}    (expert weights are updated according to their predictions)
  end

The function round : [0, 1] → {0, 1} above is defined as

    round(x) = 0  if x ∈ [0, 1/2],
               1  if x ∈ (1/2, 1].    (2.13)

Theorem 2.6 For all experts i, it holds that

    M_T(WMA) ≤ (2 log N)/ε + 2(1 + ε) M_T(expert i),    (2.14)

where M_T(WMA) (resp. M_T(expert i)) is the number of mistakes of the algorithm (resp. of expert i) up to round T.

Proof Idea: The proof is similar to that of the bound for the Halving algorithm, but instead of the number of remaining experts, the total weight is used.

Proof: Define the total weight at time t as

    Φ_t = Σ_{i=1}^N w_i^t.    (2.15)

Evidently, we have that

    Φ_1 = N.    (2.16)

Furthermore, it holds that

    Φ_T ≥ w_i^T = (1 - ε)^{M_T(expert i)},    (2.17)

since the weight of expert i will have been decreased M_T(expert i) times. Suppose that the algorithm makes a false prediction at time t. Then more than half of the weight Φ_t must have been on experts who made false predictions, so their weight decreases by a factor (1 - ε). This leads to the following result:

Lemma 2.7 If the algorithm makes a mistake at time t, the following inequality holds:

    Φ_{t+1} ≤ Φ_t (1 - ε/2).    (2.18)
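As a sanity check on the algorithm and the bound (2.14), the pseudocode above can be realized as a short simulation. The harness below is a sketch: the function name, the synthetic experts (one mostly-correct expert among random guessers), and all parameter choices are my own, not part of the original notes.

```python
import math
import random

def run_wma(predictions, outcomes, eps):
    """predictions[t][i] in {0,1}; returns (mistakes of WMA, mistakes per expert)."""
    n = len(predictions[0])
    w = [1.0] * n                                # w_i^1 = 1 for all experts
    m_alg = 0
    m_exp = [0] * n
    for preds, y in zip(predictions, outcomes):
        avg = sum(wi * f for wi, f in zip(w, preds)) / sum(w)
        y_hat = 0 if avg <= 0.5 else 1           # the round() rule (2.13)
        if y_hat != y:
            m_alg += 1
        for i, f in enumerate(preds):
            if f != y:
                w[i] *= 1 - eps                  # multiplicative weight penalty
                m_exp[i] += 1
    return m_alg, m_exp

random.seed(0)
T, n, eps = 500, 10, 0.3
outcomes = [random.randint(0, 1) for _ in range(T)]
# expert 0 is correct about 90% of the time; the rest guess at random
preds = [[y if (i == 0 and random.random() < 0.9) else random.randint(0, 1)
          for i in range(n)] for y in outcomes]
m_alg, m_exp = run_wma(preds, outcomes, eps)
best = min(m_exp)
bound = 2 * math.log(n) / eps + 2 * (1 + eps) * best
print(m_alg <= bound)  # (2.14) with i taken as the best expert; prints True
```

Since Theorem 2.6 is a worst-case guarantee holding for every outcome sequence, the final comparison must succeed for any seed.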
(Exercise) Prove this lemma rigorously.

By repeatedly applying the lemma,

    Φ_T ≤ Φ_1 (1 - ε/2)^{M_T(WMA)}.    (2.19)

Using (2.16), (2.17) and (2.19) one then obtains

    (1 - ε)^{M_T(expert i)} ≤ N (1 - ε/2)^{M_T(WMA)}.    (2.20)

Apply the negative logarithm to get

    M_T(expert i) (-log(1 - ε)) ≥ -log N + (-log(1 - ε/2)) M_T(WMA),    (2.21)

where -log(1 - ε) ≤ ε + ε² and -log(1 - ε/2) ≥ ε/2, by using (2.12)¹ and (2.10), respectively. By inserting these bounds and rearranging the terms, the inequality stated in the theorem is derived.

2.5 Next lecture

The Weighted Majority Algorithm will be discussed further in the next lecture; a variant of the algorithm will be presented which gets rid of the factor 2 in Theorem 2.6.

References

[1] Brightwell, G., and Winkler, P. Counting linear extensions is #P-complete. In Proceedings of the Twenty-Third Annual ACM Symposium on Theory of Computing (New York, NY, USA, 1991), STOC '91, ACM, pp. 175-181.

[2] Littlestone, N., and Warmuth, M. The weighted majority algorithm. Information and Computation 108, 2 (1994), 212-261.

¹ To use this inequality, ε must be less than 0.5 (or actually ≈ 0.68).