COS 511: Theoretical Machine Learning — Lecturer: Rob Schapire — Lecture #15 — Scribe: Jieming Mao — April 1, 2013

1 Brief review

1.1 Learning with expert advice

Last time, we started to talk about learning with expert advice. This model is very different from the models we saw earlier in the semester. In this model, the learner sees only one example at a time, and has to make predictions based on the advice of experts and the history. Formally, the model can be written as:

    N = # of experts
    for t = 1, 2, ..., T:
        each expert i predicts ξ_i ∈ {0, 1}
        learner predicts ŷ ∈ {0, 1}
        learner observes y ∈ {0, 1}   (mistake if y ≠ ŷ)

In the previous lecture, we gave the Halving algorithm, which works when there exists a perfect expert. In this case, the number of mistakes that the Halving algorithm makes is at most lg(N).

1.2 Connection with PAC learning

Now we consider an online analogue of the PAC learning model:

    hypothesis space H = {h_1, ..., h_N}
    target concept c ∈ H
    on each round: get x; predict ŷ; observe c(x)

This is a very natural online learning problem, and we can relate it to learning with expert advice by treating each hypothesis h_i as an expert. Since the target concept is chosen inside the hypothesis space, there is always a perfect expert. Therefore, we can apply the Halving algorithm to this problem, and it will make at most lg(N) = lg(|H|) mistakes.
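As a concrete illustration, here is a minimal Python sketch of the Halving algorithm (the function names and setup are my own, not from the lecture): on each round it predicts by majority vote over the hypotheses still consistent with the history, so every mistake eliminates at least half of the version space, giving at most lg(N) mistakes when a perfect hypothesis exists.

```python
def halving(hypotheses, examples, labels):
    """Online prediction with the Halving algorithm.

    Maintains the version space of hypotheses consistent with the history;
    predicts by majority vote over it. Each mistake removes at least half
    of the version space, so total mistakes <= lg(len(hypotheses)).
    """
    version_space = list(hypotheses)
    mistakes = 0
    for x, y in zip(examples, labels):
        votes = sum(h(x) for h in version_space)
        y_hat = 1 if 2 * votes > len(version_space) else 0
        if y_hat != y:
            mistakes += 1
        # keep only hypotheses consistent with the observed label
        version_space = [h for h in version_space if h(x) == y]
    return mistakes
```

For example, with N = 8 threshold hypotheses and a target inside the class, the mistake count stays at or below lg(8) = 3.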
2 Bounding the number of mistakes

It is natural to ask whether the Halving algorithm is the best possible. To answer this question, we first have to define what "best" means. Let A be any deterministic algorithm, and define

    M_A(H) = max_{c,x} (# mistakes made by A),

the number of mistakes A makes on hypothesis space H in the worst case. An algorithm is considered good if M_A(H) is small. Now define

    opt(H) = min_A M_A(H).

We are going to lower bound opt(H) by VCdim(H), as in the following theorem:

Theorem 1. VCdim(H) ≤ opt(H).

Proof: Let A be any deterministic algorithm, let d = VCdim(H), and let x_1, ..., x_d be shattered by H. We construct an adversarial strategy that forces A to make d mistakes as follows:

    for t = 1, 2, ..., d:
        present x_t to A
        ŷ_t = A's prediction
        choose y_t ≠ ŷ_t

Since x_1, ..., x_d is shattered by H, there exists a concept c ∈ H such that c(x_t) = y_t for all t. In addition, since A is deterministic, the adversary can simulate it ahead of time, so this construction is possible. Thus there exists an adversarial strategy that forces A to make d mistakes.

To sum up this section, we have the following result:

    VCdim(H) ≤ opt(H) ≤ M_Halving(H) ≤ lg(|H|).

3 Weighted majority algorithm

Now we return to learning with expert advice, and we want to devise algorithms for the case when there is no perfect expert. In this setting, we will compare the number of mistakes of our algorithm to the number of mistakes of the best expert. We modify the Halving algorithm: instead of discarding experts, we keep a weight for each expert and lower it whenever that expert makes a mistake. We call this the weighted majority algorithm (WMA):

    w_i = weight on expert i
    parameter 0 ≤ β < 1
    N = # of experts
    initially, w_i = 1
    for t = 1, 2, ..., T:
        each expert i predicts ξ_i ∈ {0, 1}
        q_0 = Σ_{i : ξ_i = 0} w_i,   q_1 = Σ_{i : ξ_i = 1} w_i
        learner predicts ŷ ∈ {0, 1}:
            ŷ = 1 if q_1 > q_0, else 0   (weighted majority vote)
        learner observes y ∈ {0, 1}   (mistake if y ≠ ŷ)
        for each i: if ξ_i ≠ y, then w_i ← w_i β

Now we analyze the weighted majority algorithm (WMA) by proving the following theorem:

Theorem 2. (# of mistakes of WMA) ≤ a_β · (# of mistakes of the best expert) + c_β · lg(N), where

    a_β = lg(1/β) / lg(2/(1+β))   and   c_β = 1 / lg(2/(1+β)).

Before proving this theorem, let us first try to understand what it implies. The following table gives a good sense of the parameters:

    β      a_β     c_β
    1/2    ≈2.4    ≈2.4
    → 0    → +∞    → 1
    → 1    → 2     → +∞

If we divide both sides of the inequality by T, we get

    (# of mistakes of WMA)/T ≤ a_β · (# of mistakes of the best expert)/T + c_β · lg(N)/T.

As T → +∞, c_β lg(N)/T → 0. So the theorem says that the rate at which WMA makes mistakes is bounded by a constant times the rate at which the best expert makes mistakes.

Proof: Define W as the sum of the weights of all the experts: W = Σ_{i=1}^N w_i. Initially, we have W = N. Now we consider how W changes on some round. Without loss of generality, assume that y = 0. Then

    W_new = Σ_{i=1}^N w_i^new
          = Σ_{i : ξ_i = 1} w_i β + Σ_{i : ξ_i = 0} w_i
          = q_1 β + q_0
          = q_1 β + (W − q_1)
          = W − (1 − β) q_1.
Now suppose WMA makes a mistake. Then ŷ ≠ y, so (still assuming y = 0) ŷ = 1, which means q_1 ≥ q_0 and hence q_1 ≥ W/2. Therefore

    W_new = W − (1 − β) q_1 ≤ W − (1 − β) W/2 = W · (1 + β)/2.

So whenever WMA makes a mistake, the sum of weights shrinks by at least a factor of (1 + β)/2. Therefore, after m mistakes, we get an upper bound on the sum of weights:

    W ≤ N ((1 + β)/2)^m.

Define L_i to be the number of mistakes that expert i makes. Then for every i we have

    β^{L_i} = w_i ≤ W ≤ N ((1 + β)/2)^m.

Solving this inequality, and noting that it holds for all experts, we get

    m ≤ [ (min_i L_i) · lg(1/β) + lg(N) ] / lg(2/(1+β)).

4 Randomized weighted majority algorithm

The previous proof has shown that, at best, the number of mistakes of WMA is roughly twice the number of mistakes of the best expert (a_β ≥ 2 for all β). This is a weak result if the best expert actually makes a substantial number of mistakes. In order to get a better algorithm, we have to introduce randomness. The next algorithm is the randomized weighted majority algorithm (RWMA). It is very similar to WMA; the only difference is that rather than predicting by a weighted majority vote, on each round the algorithm picks an expert with probability proportional to its weight, and uses that randomly chosen expert's prediction. The algorithm is as follows:

    w_i = weight on expert i
    parameter 0 ≤ β < 1
    N = # of experts
    initially, w_i = 1
    for t = 1, 2, ..., T:
        each expert i predicts ξ_i ∈ {0, 1}
        q_0 = Σ_{i : ξ_i = 0} w_i,   q_1 = Σ_{i : ξ_i = 1} w_i
        learner predicts ŷ ∈ {0, 1}:
            ŷ = 1 with probability q_1/W, and ŷ = 0 with probability q_0/W
        learner observes y ∈ {0, 1}   (mistake if y ≠ ŷ)
        for each i: if ξ_i ≠ y, then w_i ← w_i β

Now let us analyze the randomized weighted majority algorithm (RWMA) by proving the following theorem:

Theorem 3. E[# of mistakes of RWMA] ≤ a_β · (# of mistakes of the best expert) + c_β · ln(N), where

    a_β = ln(1/β)/(1 − β)   and   c_β = 1/(1 − β).

Notice that here we consider the expected number of mistakes, because RWMA is a randomized algorithm. As β → 1, a_β → 1. This means RWMA can do really close to the optimal.

Proof: As in the proof of the previous theorem, we track the sum of weights W. On some round, define ℓ as the probability that RWMA makes a mistake:

    ℓ = Pr[ŷ ≠ y] = ( Σ_{i : ξ_i ≠ y} w_i ) / W.

Then the sum of weights changes in this round as

    W_new = Σ_{i : ξ_i ≠ y} w_i β + Σ_{i : ξ_i = y} w_i
          = ℓWβ + (1 − ℓ)W
          = W (1 − ℓ(1 − β)).

Then we can write the sum of weights at the end as

    W_final = N (1 − ℓ_1(1 − β)) ··· (1 − ℓ_T(1 − β)) ≤ N exp(−(1 − β) Σ_t ℓ_t),

using 1 + x ≤ e^x. For each expert i, we have

    β^{L_i} = w_i ≤ W_final ≤ N exp(−(1 − β) Σ_t ℓ_t).

Define L_A = Σ_t ℓ_t, the expected number of mistakes RWMA makes. Taking logarithms and rearranging, we get

    L_A = Σ_t ℓ_t ≤ [ln(1/β) / (1 − β)] · (# of mistakes of the best expert) + ln(N)/(1 − β).
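The two algorithms can be sketched together in Python; this is my own illustrative implementation (the names and data layout are assumptions, not from the notes), with a flag switching between the deterministic weighted vote (WMA) and the randomized prediction (RWMA).

```python
import random

def weighted_majority(expert_preds, labels, beta=0.5, randomized=False, seed=0):
    """Sketch of the (randomized) weighted majority algorithm.

    expert_preds[t][i] is expert i's prediction on round t; labels[t] is
    the true outcome y. Returns the number of mistakes the learner makes.
    """
    rng = random.Random(seed)
    n = len(expert_preds[0])
    w = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        total = sum(w)                       # W
        q1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        if randomized:
            # RWMA: predict 1 with probability q_1 / W
            y_hat = 1 if rng.random() < q1 / total else 0
        else:
            # WMA: deterministic weighted majority vote
            y_hat = 1 if q1 > total - q1 else 0
        if y_hat != y:
            mistakes += 1
        # penalize every expert that erred, whether or not the learner did
        w = [wi * beta if p != y else wi for wi, p in zip(w, preds)]
    return mistakes
```

With a perfect expert among N = 4 and β = 1/2, Theorem 2 bounds WMA's mistakes by c_β · lg(4) ≈ 4.8, which the sketch respects.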
5 Variant of RWMA

Let us first discuss how to choose the parameter β for RWMA. Suppose we have an upper bound K on the number of mistakes that the best expert makes, min_i L_i ≤ K. Then we can choose

    β = 1 / (1 + √(2 ln(N)/K)),

which gives

    L_A ≤ min_i L_i + √(2 K ln(N)) + ln(N).

It is natural to ask whether we can do better than RWMA. The answer is yes. The high-level idea is to work in between WMA and RWMA. The figure (not reproduced here) shows how this algorithm works: the y-axis is the probability of predicting ŷ = 1, and the x-axis is the weighted fraction of experts predicting 1. The red line is WMA, the blue line is RWMA, and the green line is the new algorithm.

The new algorithm can reach

    a_β = lg(1/β) / (2 lg(2/(1+β)))   and   c_β = 1 / (2 lg(2/(1+β))),

exactly half of the corresponding constants for WMA. If we again make the assumption that min_i L_i ≤ K, the new algorithm satisfies

    L_A ≤ min_i L_i + √(K ln(N)) + lg(N)/2.

If we set K = 0, which means there exists a perfect expert, we get L_A ≤ lg(N)/2; thus this algorithm is better than the Halving algorithm. We can usually assume that K = T/2 by adding two experts, one that always predicts 1 and one that always predicts 0, since one of them makes at most T/2 mistakes. Then we have

    L_A ≤ min_i L_i + √(T ln(N)/2) + lg(N)/2.
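To make the tuning of β concrete, here is a small numeric sketch (the function names are mine): it computes the tuned β above and evaluates Theorem 3's bound with it, which should come out no worse than the closed form K + √(2K ln N) + ln N.

```python
import math

def tuned_beta(K, N):
    """The tuning β = 1/(1 + sqrt(2 ln(N)/K)) from the section above (K > 0)."""
    return 1.0 / (1.0 + math.sqrt(2.0 * math.log(N) / K))

def theorem3_bound(best_L, beta, N):
    """Theorem 3's bound a_β · best_L + c_β · ln(N), with
    a_β = ln(1/β)/(1 − β) and c_β = 1/(1 − β)."""
    a = math.log(1.0 / beta) / (1.0 - beta)
    c = 1.0 / (1.0 - beta)
    return a * best_L + c * math.log(N)
```

For example, with K = 500 and N = 10, the tuned bound evaluates to slightly below the closed form, as the derivation promises.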
If we divide both sides by T, we get

    L_A/T ≤ (min_i L_i)/T + √(ln(N)/(2T)) + lg(N)/(2T).

As T gets large, the right-hand side of the inequality is dominated by the term √(ln(N)/(2T)).

6 Lower bound

Finally, we are going to give a lower bound for learning with expert advice. This lower bound matches the upper bound we got in the previous section, even in the constant factors. So this lower bound is tight, and we cannot improve the algorithm of the previous section. To prove the lower bound, we consider the following scenario. On each round, the adversary chooses the prediction of each expert uniformly at random,

    ξ_i = 1 w.p. 1/2, 0 w.p. 1/2,

and on each round, the adversary also chooses the feedback uniformly at random,

    y = 1 w.p. 1/2, 0 w.p. 1/2.

For any learning algorithm A,

    E_{y,ξ}[L_A] = E_ξ[ E_y[L_A | ξ] ] = T/2.

For any expert i, we likewise have E[L_i] = T/2. However, one can prove that, for large N and T,

    E[min_i L_i] ≈ T/2 − √(T ln(N)/2).

Thus,

    E[L_A] − E[min_i L_i] ≈ √(T ln(N)/2).

The term √(T ln(N)/2) meets the dominating term from the previous section. Also, in the proof of the lower bound, the adversary only uses entirely random expert predictions and outcomes. So we would not gain anything by assuming that the data is random rather than adversarial: the bounds we proved in an adversarial setting are just as good as could possibly be obtained even if we assumed that the data are entirely random.
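The lower-bound scenario is easy to simulate. The sketch below (my own setup, not from the notes) estimates E[min_i L_i] when every expert prediction and every outcome is an independent fair coin flip, so each L_i is Binomial(T, 1/2); the estimate should fall below T/2, illustrating the gap the proof exploits.

```python
import random

def expected_best_expert_loss(T, N, trials=200, seed=0):
    """Monte Carlo estimate of E[min_i L_i] in the lower-bound scenario:
    each expert errs independently with probability 1/2 on each of T rounds,
    so L_i ~ Binomial(T, 1/2); we average min_i L_i over many trials."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        losses = [sum(rng.randint(0, 1) for _ in range(T)) for _ in range(N)]
        total += min(losses)
    return total / trials
```

With T = 100 and N = 20, the estimate lands noticeably below T/2 = 50; note the asymptotic gap √(T ln(N)/2) ≈ 12 somewhat overstates the deficit at sizes this small.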