COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture # 15 Scribe: Jieming Mao April 1, 2013


1 Brief review

1.1 Learning with expert advice

Last time, we started to talk about learning with expert advice. This model is very different from the models we saw earlier in the semester: the learner sees only one example at a time, and has to make predictions based on the advice of experts and on the history. Formally, the model can be written as:

N = # of experts
for t = 1, 2, ...:
- each expert i predicts ξ_i ∈ {0, 1}
- learner predicts ŷ ∈ {0, 1}
- learner observes y ∈ {0, 1} (mistake if y ≠ ŷ)

In the previous lecture, we gave the Halving algorithm, which works when there exists a perfect expert. In that case, the number of mistakes the Halving algorithm makes is at most lg(N).

1.2 Connection with PAC learning

Now we consider an analogue of the PAC learning model:

hypothesis space H = {h_1, ..., h_N}
target concept c ∈ H
on each round:
- get x
- predict ŷ
- observe c(x)

This is a very natural online learning problem, and we can relate it to learning with expert advice by viewing each hypothesis h ∈ H as an expert. Since the target concept is chosen from the hypothesis space, there is always a perfect expert. Therefore we can apply the Halving algorithm, which will make at most lg(N) = lg(|H|) mistakes.
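To make the Halving algorithm concrete, here is a minimal sketch in Python (my own illustrative code; the function names and the tiny dataset are invented, not from the notes). It keeps the set of experts consistent with all outcomes so far, predicts by majority vote over that set, and discards every expert that errs:

```python
# Minimal sketch of the Halving algorithm (illustrative code, not from the notes).
import math

def halving_predict(version_space, advice):
    """Majority vote over the experts still consistent with all outcomes."""
    ones = sum(advice[i] for i in version_space)
    return 1 if 2 * ones > len(version_space) else 0

def halving_update(version_space, advice, outcome):
    """Discard every expert whose advice disagreed with the outcome."""
    return {i for i in version_space if advice[i] == outcome}

# Tiny run with N = 4 experts; expert 3 happens to be perfect on this data.
N = 4
version_space = set(range(N))
rounds = [([0, 1, 0, 1], 1), ([1, 1, 0, 0], 0), ([0, 0, 1, 1], 1)]
mistakes = 0
for advice, y in rounds:
    yhat = halving_predict(version_space, advice)
    mistakes += int(yhat != y)
    version_space = halving_update(version_space, advice, y)

assert 3 in version_space           # a perfect expert is never discarded
assert mistakes <= math.log2(N)     # mistake bound: at most lg(4) = 2
```

On every mistake, the side the algorithm voted with (at least half of the remaining experts, by weight of the majority vote) is wrong and gets discarded, so the version space at least halves; that is exactly where the lg(N) bound comes from.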

2 Bounding the number of mistakes

Given the previous section, it is natural to ask whether the Halving algorithm is the best possible. To answer this, we first have to define what "best" means. Let A be any deterministic algorithm, and define

M_A(H) = max_{c,x} (# mistakes made by A),

the number of mistakes A makes on hypothesis space H in the worst case. An algorithm is considered good if its M_A(H) is small. Now define

opt(H) = min_A M_A(H).

We are going to lower bound opt(H) by VCdim(H), as in the following theorem:

Theorem 1. VCdim(H) ≤ opt(H).

Proof: Let A be any deterministic algorithm, let d = VCdim(H), and let x_1, ..., x_d be shattered by H. We construct an adversarial strategy that forces A to make d mistakes, as follows:

for t = 1, ..., d:
- present x_t to A
- ŷ_t = A's prediction
- choose y_t ≠ ŷ_t

Since x_1, ..., x_d is shattered by H, there exists a concept c ∈ H such that c(x_t) = y_t for all t. Moreover, since A is deterministic, the adversary can simulate it ahead of time, so this construction is possible. Thus there exists an adversarial strategy that forces A to make d mistakes.

To sum up this section, we have the following result:

VCdim(H) ≤ opt(H) ≤ M_Halving(H) ≤ lg(|H|).

3 Weighted majority algorithm

We now return to learning with expert advice and want to devise algorithms for the case where there is no perfect expert. In this setting, we compare the number of mistakes of our algorithm with the number of mistakes of the best expert. We modify the Halving algorithm: instead of discarding experts, we keep a weight for each expert and lower it whenever that expert makes a mistake. We call this the Weighted Majority Algorithm (WMA):

w_i = weight on expert i
parameter 0 ≤ β < 1
N = # of experts

Initially, w_i = 1
for t = 1, 2, ...:
- each expert i predicts ξ_i ∈ {0, 1}
- q_0 = Σ_{i: ξ_i = 0} w_i,  q_1 = Σ_{i: ξ_i = 1} w_i
- learner predicts ŷ = 1 if q_1 > q_0, and ŷ = 0 otherwise (weighted majority vote)
- learner observes y ∈ {0, 1} (mistake if y ≠ ŷ)
- for each i, if ξ_i ≠ y, then w_i ← w_i β

Now we analyze the Weighted Majority Algorithm (WMA) by proving the following theorem:

Theorem 2. (# of mistakes of WMA) ≤ a_β (# of mistakes of the best expert) + c_β lg(N), where a_β = lg(1/β) / lg(2/(1+β)) and c_β = 1 / lg(2/(1+β)).

Before proving this theorem, let's first try to understand what it implies. The following table gives a good sense of the parameters:

β      a_β      c_β
1/2    ≈2.4     ≈2.4
→0     →+∞      →1
→1     →2       →+∞

If we divide both sides of the inequality by T, the number of rounds, we get

(# of mistakes of WMA)/T ≤ a_β (# of mistakes of the best expert)/T + c_β lg(N)/T.

As T → +∞, c_β lg(N)/T → 0. So the theorem means that the rate at which WMA makes mistakes is bounded by a constant times the rate at which the best expert makes mistakes.

Proof: Define W as the sum of the weights of all the experts: W = Σ_{i=1}^N w_i. Initially, we have W = N. Now consider how W changes in some round. On that round, without loss of generality, assume y = 0. Then

W_new = Σ_{i=1}^N w_i^new = Σ_{i: ξ_i = 1} w_i β + Σ_{i: ξ_i = 0} w_i = q_1 β + q_0 = q_1 β + (W − q_1) = W − (1 − β) q_1.

Now suppose WMA makes a mistake. Then ŷ ≠ y, so ŷ = 1 (still assuming y = 0 without loss of generality), which means q_1 ≥ q_0 and hence q_1 ≥ W/2. Then

W_new ≤ W − (1 − β) W/2 = W (1 + β)/2.

So whenever WMA makes a mistake, the sum of weights shrinks by a factor of at least (1 + β)/2. Therefore, after m mistakes, we get an upper bound on the sum of weights:

W ≤ N ((1 + β)/2)^m.

Define L_i to be the number of mistakes that expert i makes. Then for every i we have

w_i = β^{L_i} ≤ W ≤ N ((1 + β)/2)^m.

Solving this inequality, and noting that it holds for all experts, we get

m ≤ (min_i L_i) · lg(1/β)/lg(2/(1+β)) + lg(N)/lg(2/(1+β)).

4 Randomized weighted majority algorithm

The previous analysis shows that, at best, the number of mistakes of WMA is about twice the number of mistakes of the best expert (since a_β ≥ 2 for every β). This is a weak result if the best expert actually makes a substantial number of mistakes. In order to get a better algorithm, we have to introduce randomness. The next algorithm is the Randomized Weighted Majority Algorithm (RWMA). It is very similar to WMA; the only difference is that rather than predicting by weighted majority vote, on each round the algorithm picks an expert with probability proportional to its weight and uses that randomly chosen expert's prediction. The algorithm is as follows:

w_i = weight on expert i
parameter 0 ≤ β < 1
N = # of experts
Initially, w_i = 1
for t = 1, 2, ...:
- each expert i predicts ξ_i ∈ {0, 1}
- q_0 = Σ_{i: ξ_i = 0} w_i,  q_1 = Σ_{i: ξ_i = 1} w_i
- learner predicts ŷ ∈ {0, 1}:

- ŷ = 1 with probability q_1/W, and ŷ = 0 with probability q_0/W, where W = Σ_i w_i
- learner observes y ∈ {0, 1} (mistake if y ≠ ŷ)
- for each i, if ξ_i ≠ y, then w_i ← w_i β

Now let's analyze the Randomized Weighted Majority Algorithm (RWMA) by proving the following theorem:

Theorem 3. E[# of mistakes of RWMA] ≤ a_β (# of mistakes of the best expert) + c_β ln(N), where a_β = ln(1/β)/(1 − β) and c_β = 1/(1 − β).

Notice that here we consider the expected number of mistakes, because RWMA is a randomized algorithm. As β → 1, a_β → 1. This means RWMA can get really close to the optimal.

Proof: As in the proof of the previous theorem, we track the sum of the weights. On round t, define ℓ_t as the probability that RWMA makes a mistake on that round:

ℓ_t = Pr[ŷ ≠ y] = (Σ_{i: ξ_i ≠ y} w_i) / W.

Then the sum of weights changes in this round as

W_new = Σ_{i: ξ_i ≠ y} w_i β + Σ_{i: ξ_i = y} w_i = ℓ_t W β + (1 − ℓ_t) W = W (1 − ℓ_t (1 − β)).

So we can write the sum of weights at the end of T rounds as

W_final = N (1 − ℓ_1 (1 − β)) ⋯ (1 − ℓ_T (1 − β)) ≤ N exp(−Σ_t ℓ_t (1 − β)) = N exp(−(1 − β) Σ_t ℓ_t),

using 1 + x ≤ e^x. Then for every expert i we have

β^{L_i} = w_i ≤ W_final ≤ N exp(−(1 − β) Σ_t ℓ_t).

Define L_A = Σ_t ℓ_t; by linearity of expectation, this is the expected number of mistakes RWMA makes. Taking logarithms and solving, we have

L_A = Σ_t ℓ_t ≤ (ln(1/β)/(1 − β)) (# of mistakes of the best expert) + ln(N)/(1 − β).
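Both algorithms of Sections 3 and 4 fit in a few lines of Python. The following is my own illustrative code, not from the notes: one routine that runs WMA when randomize=False and RWMA when randomize=True, followed by a numeric sanity check of the bounds of Theorems 2 and 3 on random data (the RWMA check averages many runs, since Theorem 3 bounds an expectation):

```python
# Sketch of WMA / RWMA (illustrative code; names are my own, not from the notes).
import math, random

def weighted_majority(advice_rounds, outcomes, beta=0.5, randomize=False, rng=random):
    """advice_rounds[t][i] in {0,1}; returns (# algorithm mistakes, per-expert mistakes)."""
    N = len(advice_rounds[0])
    w = [1.0] * N
    expert_mistakes = [0] * N
    alg_mistakes = 0
    for advice, y in zip(advice_rounds, outcomes):
        W = sum(w)
        q1 = sum(w[i] for i in range(N) if advice[i] == 1)
        if randomize:
            yhat = 1 if rng.random() < q1 / W else 0   # RWMA: predict 1 w.p. q1/W
        else:
            yhat = 1 if q1 > W - q1 else 0             # WMA: vote, q1 > q0
        alg_mistakes += int(yhat != y)
        for i in range(N):                              # penalize wrong experts
            if advice[i] != y:
                w[i] *= beta
                expert_mistakes[i] += 1
    return alg_mistakes, expert_mistakes

# Sanity check of Theorem 2's (deterministic) bound on random data:
random.seed(0)
N, T, beta = 10, 300, 0.5
advice = [[random.randint(0, 1) for _ in range(N)] for _ in range(T)]
outcomes = [random.randint(0, 1) for _ in range(T)]
m_wma, L = weighted_majority(advice, outcomes, beta, randomize=False)
a_b = math.log2(1 / beta) / math.log2(2 / (1 + beta))
c_b = 1 / math.log2(2 / (1 + beta))
assert m_wma <= a_b * min(L) + c_b * math.log2(N)

# Check Theorem 3's bound on the expectation by averaging RWMA runs
# (expert mistake counts are deterministic given the data, so min(L) is reusable):
runs = [weighted_majority(advice, outcomes, beta, randomize=True)[0] for _ in range(300)]
avg = sum(runs) / len(runs)
bound3 = (math.log(1 / beta) / (1 - beta)) * min(L) + math.log(N) / (1 - beta)
assert avg <= bound3 + 5   # small slack for sampling noise in the average
```

The assertions only spot-check the theorems on one random instance; they are not a proof, but any violation would indicate a bug in the sketch rather than in the bounds.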

5 Variant of RWMA

Let's first discuss how to choose the parameter β for RWMA. Suppose we have an upper bound K on the number of mistakes of the best expert, min_i L_i ≤ K. Then we can choose

β = 1 / (1 + √(2 ln(N) / K)),

which gives

L_A ≤ min_i L_i + √(2 K ln(N)) + ln(N).

It is natural to ask whether we can do better than RWMA. The answer is yes. The high-level idea is to work in between WMA and RWMA. The following figure shows how this algorithm works: the y-axis is the probability of predicting ŷ = 1, and the x-axis is the weighted fraction of experts predicting 1. The red line is WMA, the blue line is RWMA, and the green line is the new algorithm.

[Figure not reproduced in this transcription.]

The new algorithm achieves a_β = ln(1/β)/(2 ln(2/(1+β))) and c_β = 1/(2 ln(2/(1+β))); these are exactly half of those of WMA. If we again assume that min_i L_i ≤ K, the new algorithm has

L_A ≤ min_i L_i + √(K ln(N)) + lg(N)/2.

If we set K = 0, which means there exists a perfect expert, we get L_A ≤ lg(N)/2; thus this algorithm is better than the Halving algorithm. We can usually assume that K = T/2 by adding two experts, one that always predicts 1 and one that always predicts 0 (one of the two makes at most T/2 mistakes). Then we have

L_A ≤ min_i L_i + √(T ln(N)/2) + lg(N)/2.
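As a numeric sanity check of the tuning above (my own sketch; the helper names are invented), the following compares Theorem 3's guarantee under the stated choice of β with the simplified bound K + √(2K ln N) + ln(N), for one concrete pair (K, N):

```python
# Numeric check of the tuned RWMA bound (illustrative code, not from the notes).
import math

def tuned_beta(K, N):
    """The choice of beta suggested above when min_i L_i <= K."""
    return 1 / (1 + math.sqrt(2 * math.log(N) / K))

def rwma_guarantee(K, N):
    """Theorem 3's bound a_beta * K + c_beta * ln(N) at the tuned beta."""
    b = tuned_beta(K, N)
    a_b = math.log(1 / b) / (1 - b)
    c_b = 1 / (1 - b)
    return a_b * K + c_b * math.log(N)

K, N = 100, 16
simplified = K + math.sqrt(2 * K * math.log(N)) + math.log(N)
# For this instance the exact tuned guarantee is indeed at most the
# simplified form quoted in the text.
assert rwma_guarantee(K, N) <= simplified + 1e-9
```

This only verifies one instance numerically; the general statement follows from the standard analysis of the tuned β.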

If we divide both sides by T, we get

L_A/T ≤ (min_i L_i)/T + √(ln(N)/(2T)) + lg(N)/(2T).

As T gets large, the right-hand side of the inequality is dominated by the term √(ln(N)/(2T)).

6 Lower bound

Finally, we give a lower bound for learning with expert advice. This lower bound matches the upper bound from the previous section, even in the constant factors; so the bound is tight, and we cannot improve the algorithm of the previous section.

To prove the lower bound, we consider the following scenario. On each round, the adversary chooses the prediction of each expert uniformly at random:

ξ_i = 1 with probability 1/2, 0 with probability 1/2.

On each round, the adversary also chooses the feedback uniformly at random:

y = 1 with probability 1/2, 0 with probability 1/2.

For any learning algorithm A,

E_{y,ξ}[L_A] = E_ξ[E_y[L_A | ξ]] = T/2,

since each prediction is wrong with probability exactly 1/2 no matter what A does. For any expert i, we likewise have E[L_i] = T/2. However, one can prove that, for large T and N,

E[min_i L_i] ≈ T/2 − √(T ln(N)/2).

Thus,

E[L_A] − E[min_i L_i] ≈ √(T ln(N)/2).

The term √(T ln(N)/2) meets the dominating term from the previous section. Also, in the proof of the lower bound, the adversary only uses entirely random expert predictions and outcomes, so we would not gain anything by assuming that the data is random rather than adversarial: the bounds we proved in the adversarial setting are just as good as could possibly be obtained even if we assumed that the data are entirely random.
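The lower-bound scenario is easy to simulate (my own illustrative code; the values of N, T, and the trial count are arbitrary). With purely random outcomes, each expert's mistake count is Binomial(T, 1/2) independently of the learner, and the minimum over N experts falls below T/2 by an amount on the order of √(T ln(N)/2):

```python
# Simulation of the lower-bound scenario (illustrative code, not from the notes).
import math, random

random.seed(0)
N, T, trials = 50, 1000, 30
gaps = []
for _ in range(trials):
    # Each expert's mistake count is Binomial(T, 1/2), independent of the learner.
    L = [sum(random.randint(0, 1) for _ in range(T)) for _ in range(N)]
    gaps.append(T / 2 - min(L))
avg_gap = sum(gaps) / trials

predicted = math.sqrt(T * math.log(N) / 2)   # ≈ 44.2 for these values
# The empirical gap should be on the order of the predicted one
# (extreme-value effects make it somewhat smaller at finite N).
assert 0.5 * predicted < avg_gap < 1.5 * predicted
```

The simulation illustrates why no learner can beat the √(T ln(N)/2) additive term: the learner is stuck at T/2 expected mistakes, while the best of N random experts is lucky by about that much.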