COS 511: Theoretical Machine Learning
Lecturer: Rob Schapire    Lecture #16    Scribe: Yannan Wang    April 3, 2014


1 Introduction

The goal of the online learning scenario from last class was to compare with the best single expert and do nearly as well as it does. An alternative scenario is one in which there is no single good expert, but a committee of experts, suitably combined, might do much better. We formalize this as follows. We have N experts, and for t = 1, ..., T:

- we observe x_t ∈ {-1, +1}^N, the vector of the experts' predictions, whose i-th component x_{t,i} is the prediction of expert i;
- the learner predicts ŷ_t ∈ {-1, +1};
- we observe the outcome y_t ∈ {-1, +1}.

The protocol is the same as before; what changes is the assumption on the data. We assume that there is a perfect committee, i.e., a weighted combination of the experts that is always right. Formally, there exists u ∈ R^N such that for all t,

    y_t = sign( Σ_{i=1}^N u_i x_{t,i} ) = sign(u · x_t),   equivalently   y_t (u · x_t) > 0.

Geometrically, the perfect committee means that there is a linear threshold, defined by the appropriate weighting u of the experts, that separates the +1 points from the -1 points.

2 How to do updates

We maintain w_t, our current estimate of u; it is a guess of the correct weighting of the experts, and we update this weighting on each round. Today we look at two algorithms. To specify each algorithm we only need to give (1) the initialization and (2) the update rule.
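
To make the round structure concrete, here is a minimal Python sketch of the protocol above; the function and variable names (run_online, init, update, stream) are placeholders and not part of the lecture, and each algorithm below simply supplies its own init and update.

    import numpy as np

    def run_online(stream, init, update):
        """Generic online protocol: predict sign(w_t . x_t), observe y_t, then
        let the algorithm-specific rule revise the weights.
        stream yields (x_t, y_t) with x_t in {-1,+1}^N and y_t in {-1,+1}."""
        w = init()
        mistakes = 0
        for x_t, y_t in stream:
            y_hat = 1 if np.dot(w, x_t) > 0 else -1   # the learner's prediction
            if y_hat != y_t:
                mistakes += 1
            w = update(w, x_t, y_t)                   # (2) the update rule
        return w, mistakes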

Figure 1: Perceptron geometric intuition: tipping the hyperplane.

2.1 Perceptron

The first way of updating the weights gives an algorithm called Perceptron. The rules are as follows:

Initialize: w_1 = 0.
Update: if there is a mistake (ŷ_t ≠ y_t, i.e., y_t (w_t · x_t) ≤ 0), then w_{t+1} = w_t + y_t x_t; otherwise w_{t+1} = w_t.

Not adjusting the weights when there is no mistake makes the algorithm conservative: it ignores correctly classified examples. The intuition is that after a wrong answer we shift the weights on all the experts in the direction of the correct answer. Figure 1 gives a geometric picture of the Perceptron update. There y_t = +1; when (x_t, y_t) is classified incorrectly, we add y_t x_t to w_t, which tips the hyperplane defined by w_t in a direction that makes it more likely to classify (x_t, y_t) correctly the next time.
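
As a concrete illustration of the two rules above, here is a minimal Python sketch of the Perceptron initialization and update (a sketch only; the helper names are not from the lecture). It plugs directly into the round structure sketched in Section 2.

    import numpy as np

    def perceptron_init(N):
        return np.zeros(N)                      # w_1 = 0

    def perceptron_update(w, x_t, y_t):
        # Conservative, mistake-driven update: change w only when
        # y_t (w . x_t) <= 0, and then shift w toward the correct label.
        if y_t * np.dot(w, x_t) <= 0:
            return w + y_t * x_t                # w_{t+1} = w_t + y_t x_t
        return w                                # w_{t+1} = w_t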

Now let us state a theorem that formally analyzes the performance of the Perceptron algorithm. First we make a few assumptions:

- Mistakes happen in every round. This is because the algorithm does not change on the other rounds, so T = # of rounds = # of mistakes.
- We normalize the prediction vectors x_t so that ||x_t||_2 ≤ 1.
- We normalize the weight vector of the perfect committee so that ||u||_2 = 1. (This is fine because the value of the sign function is not affected by this normalization.)
- We assume the points are linearly separable with margin at least δ: there exist δ > 0 and u ∈ R^N such that for all t, y_t (u · x_t) ≥ δ > 0. Note that, unlike the others, this assumption is with loss of generality.

Theorem 2.1 Under the assumptions above, with T = # of mistakes, we have T ≤ 1/δ².

Proof: We find a quantity that depends on the state of the algorithm at time t, upper and lower bound it, and derive a bound from there. The quantity here is Φ_t, the cosine of the angle Θ between w_t and u. More formally (using ||u||_2 = 1),

    Φ_t = (w_t · u) / ||w_t||_2 = cos Θ ≤ 1.

For the lower bound, we will prove that Φ_{T+1} ≥ √T · δ. We do this in two parts, by lower bounding the numerator of Φ_{T+1} and upper bounding its denominator.

Step 1: w_{T+1} · u ≥ T δ. We have

    w_{t+1} · u = (w_t + y_t x_t) · u = w_t · u + y_t (u · x_t) ≥ w_t · u + δ,

where the inequality is by the fourth (margin) assumption above. Since initially w_1 · u = 0, this implies w_{T+1} · u ≥ T δ.

Step 2: ||w_{T+1}||_2 ≤ √T. We have

    ||w_{t+1}||_2² = w_{t+1} · w_{t+1} = (w_t + y_t x_t) · (w_t + y_t x_t) = ||w_t||_2² + 2 y_t (x_t · w_t) + ||x_t||_2².

Since we assumed a mistake occurs in every round, y_t (x_t · w_t) ≤ 0, and by the normalization assumption ||x_t||_2² ≤ 1, so ||w_{t+1}||_2² ≤ ||w_t||_2² + 1. Since initially w_1 = 0, this gives ||w_{T+1}||_2² ≤ T, i.e., ||w_{T+1}||_2 ≤ √T.

Putting Steps 1 and 2 together,

    1 ≥ Φ_{T+1} ≥ T δ / √T = √T · δ,   i.e.,   T ≤ 1/δ².

Let H be the hypothesis space and M_perceptron(H) the number of mistakes made by the Perceptron algorithm on it. As a simple consequence of the above, since the VC-dimension of the hypothesis space is upper bounded by the optimal mistake bound, which in turn is at most the number of mistakes made by any particular algorithm, the VC-dimension of linear threshold functions with margin at least δ is at most 1/δ²:

    VC-dim(H) ≤ opt(H) ≤ M_perceptron(H) ≤ 1/δ².
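
As a quick numeric sanity check of Theorem 2.1 (not part of the notes), the sketch below constructs a synthetic stream satisfying the assumptions (||x_t||_2 ≤ 1 and margin δ with respect to a unit vector u), runs the Perceptron update, and compares the mistake count with 1/δ²; the construction and all names here are my own.

    import numpy as np

    rng = np.random.default_rng(0)
    N, T, delta = 20, 5000, 0.05
    u = rng.normal(size=N)
    u /= np.linalg.norm(u)                      # perfect committee, ||u||_2 = 1

    # Rejection-sample unit vectors whose margin with respect to u is at least delta.
    X, Y = [], []
    while len(X) < T:
        x = rng.normal(size=N)
        x /= np.linalg.norm(x)                  # ||x_t||_2 = 1
        m = float(np.dot(u, x))
        if abs(m) >= delta:
            X.append(x)
            Y.append(np.sign(m))                # y_t = sign(u . x_t)

    w, mistakes = np.zeros(N), 0
    for x_t, y_t in zip(X, Y):
        if y_t * np.dot(w, x_t) <= 0:           # mistake: apply the Perceptron update
            mistakes += 1
            w = w + y_t * x_t

    print(mistakes, "<=", 1 / delta**2)         # bound from Theorem 2.1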

Now consider a scenario where the target u consists of 0s and 1s, and the number of 1s in the vector is k:

    u = (1/√k) (0, 1, 0, 0, 1, ...).

Note that the factor 1/√k is for normalization (so that ||u||_2 = 1). Think of k as being small compared to N, the number of experts, i.e., u could be a very sparse vector. This is also one of the problems we examined earlier: the k experts form the perfect committee. We have

    x_t = (1/√N) (+1, -1, -1, +1, ...),   y_t = sign(u · x_t),   y_t (u · x_t) ≥ 1/√(kN),

where the factor 1/√N is again for normalization. So, using 1/√(kN) as δ, Theorem 2.1 says the Perceptron algorithm makes at most kN mistakes. This is not good: think of the experts as features, of which there may be millions of irrelevant ones, while the committee consists of the few (maybe a dozen) important ones. We get a linear dependence on N, which is usually large. Motivated by this example, we present another update rule, called the Winnow algorithm, which achieves a better bound.

2.2 Winnow Algorithm

Initialize: for all i, w_{1,i} = 1/N; we start with a uniform distribution over all the experts.
Update: if we make a mistake, then for all i,

    w_{t+1,i} = w_{t,i} e^{η y_t x_{t,i}} / Z_t,

where η is a parameter we will choose later and Z_t is a normalization factor. Otherwise, w_{t+1} = w_t.

This update rule is an exponential punishment of the experts that are wrong. Ignoring the normalization factor, the update is equivalent to w_{t+1,i} = w_{t,i} e^{η} if expert i predicts correctly and w_{t+1,i} = w_{t,i} e^{-η} otherwise; or again, up to normalization, to w_{t+1,i} = w_{t,i} if expert i predicts correctly and w_{t+1,i} = w_{t,i} e^{-2η} otherwise. This is the same as the weighted majority vote.
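
For concreteness, here is a minimal Python sketch of the Winnow initialization and multiplicative update just described (the helper names are mine, not from the lecture); η is left as a parameter and will be chosen in Theorem 2.2 below.

    import numpy as np

    def winnow_init(N):
        return np.ones(N) / N                   # w_{1,i} = 1/N: uniform over experts

    def winnow_update(w, x_t, y_t, eta):
        # Multiplicative update, applied only on a mistake:
        # w_{t+1,i} is proportional to w_{t,i} * exp(eta * y_t * x_{t,i}),
        # renormalized by Z_t so the weights remain a distribution.
        if y_t * np.dot(w, x_t) <= 0:
            w = w * np.exp(eta * y_t * x_t)
            return w / w.sum()                  # divide by Z_t
        return w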

Theorem. Under the assumptons above, We have the followng upper bound on the number of mstakes: ln N T ηδ + ln( ) e η +e η If we choose an optmal η to mnmze the bound, we get when η = 1 ln( 1+δ 1 δ ), T ln N δ Proof The approach s smlar to the prevous one. We use a quantty Φ t, whch we both upper and lower-bound. The quantty we use here s Φ t = RE(u w t ). Immedately we have, Φ t 0 for all t. Φ t+1 Φ t = = u u ln( ) w t+1, u ln( w t, w t+1, ) u ln( u w t, ) = Z t u ln( e ηytx ) t, (1) = u ln Z t u ln e ηytx t, = ln Z t ηy t (u x t ) ln Z t ηδ The last nequalty follows from the margn property we assumed. Now let s approxmate Z t. We know that Z s the normalzaton factor and can be computed as: Z = w e ηyx () Note that here we are droppng the subscrpt t for smplcty; Z and w are same as Z t and w t,. We wll bound the exponental term by a lnear functon, as llustrated n fgure : e ηx ( 1 + x )e η + ( 1 x )e η, for 1 x 1. Usng ths bound, we have: Z = w e ηyx = eη + e η = eη + e η eη + e η w ( 1 + yx )e η + w + eη e η y + eη + e η y(w x) w ( 1 yx )e η w x (3) 5

Figure 2: Using a linear function to bound the exponential function.

Combining the two bounds, we have

    Φ_{t+1} - Φ_t ≤ ln Z_t - ηδ ≤ ln( (e^η + e^{-η})/2 ) - ηδ = -C.        (4)

Here ln( (e^η + e^{-η})/2 ) - ηδ is a (negative) constant, which we write as -C; so on each round Φ_t decreases by at least C = ln( 2/(e^η + e^{-η}) ) + ηδ.

In the next class, we will finish the proof of Theorem 2.2, and we will study a modified version of the Winnow algorithm, called the Balanced Winnow algorithm, which gets rid of the assumption that u_i ≥ 0 for all i.
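
To see how the optimized bound in Theorem 2.2 comes out numerically, the short check below (mine, not from the notes) evaluates the per-round decrease C = ηδ + ln(2/(e^η + e^{-η})) at η = (1/2) ln((1+δ)/(1-δ)) for a few margins and compares ln N / C with 2 ln N / δ².

    import numpy as np

    N = 10**6
    for delta in [0.05, 0.1, 0.2, 0.4]:
        eta = 0.5 * np.log((1 + delta) / (1 - delta))                     # optimal eta
        C = eta * delta + np.log(2.0 / (np.exp(eta) + np.exp(-eta)))      # per-round decrease of Phi_t
        print(delta, np.log(N) / C, "<=", 2 * np.log(N) / delta**2)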