Symmetrization and Rademacher Averages


Stat 928: Statistical Learning Theory
Lecture: Symmetrization and Rademacher Averages
Instructor: Sham Kakade

1 Rademacher Averages

Recall that we are interested in bounding the difference between empirical and true expectations uniformly over some function class $\mathcal{G}$. In the context of classification or regression, we are typically interested in a class $\mathcal{G}$ that is the loss class associated with some function class $\mathcal{F}$. That is, given a bounded loss function $\ell : \mathcal{D} \times \mathcal{Y} \to [0,1]$, we consider the class
$$\ell \circ \mathcal{F} := \{ (x,y) \mapsto \ell(f(x), y) : f \in \mathcal{F} \}.$$

Rademacher averages give us a powerful tool to obtain uniform convergence results. We begin by examining the quantity
$$\mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \Big( \mathbb{E}[g(Z)] - \frac{1}{m} \sum_{i=1}^m g(Z_i) \Big) \Big],$$
where $Z, \{Z_i\}$ are i.i.d. random variables taking values in some space $\mathcal{Z}$ and $\mathcal{G} \subseteq [a,b]^{\mathcal{Z}}$ is a set of bounded functions. We will later show that the random quantity we are actually interested in, namely
$$\sup_{g \in \mathcal{G}} \Big( \mathbb{E}[g(Z)] - \frac{1}{m} \sum_{i=1}^m g(Z_i) \Big),$$
is close to the above expectation with high probability.

Let $\epsilon_1, \ldots, \epsilon_m$ be i.i.d. $\{\pm 1\}$-valued random variables with $\Pr(\epsilon_i = +1) = \Pr(\epsilon_i = -1) = 1/2$, also independent of the sample $Z_1, \ldots, Z_m$. Define the empirical Rademacher average of $\mathcal{G}$ as
$$\hat{R}_m(\mathcal{G}) := \mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \frac{1}{m} \sum_{i=1}^m \epsilon_i g(Z_i) \,\Big|\, Z_1, \ldots, Z_m \Big].$$
The Rademacher average of $\mathcal{G}$ is defined as $R_m(\mathcal{G}) := \mathbb{E}\big[ \hat{R}_m(\mathcal{G}) \big]$.
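To make the definition concrete, here is a minimal Monte Carlo sketch (not part of the original notes) that estimates the empirical Rademacher average $\hat{R}_m(\mathcal{G})$ of a finite class, represented simply as the matrix of values $g(Z_i)$. The function and variable names (`empirical_rademacher_average`, `values`, `n_trials`) are illustrative choices, not anything prescribed by the lecture.

```python
# A minimal Monte Carlo sketch of the empirical Rademacher average for a
# finite function class G, represented as the matrix of values g(Z_i).
import numpy as np

def empirical_rademacher_average(values, n_trials=10000, rng=None):
    """values: array of shape (|G|, m) with entry [j, i] = g_j(Z_i).
    Returns a Monte Carlo estimate of E_eps[ sup_g (1/m) sum_i eps_i g(Z_i) | Z_1^m ]."""
    rng = np.random.default_rng(rng)
    n_funcs, m = values.shape
    total = 0.0
    for _ in range(n_trials):
        eps = rng.choice([-1.0, 1.0], size=m)    # i.i.d. Rademacher signs
        total += np.max(values @ eps) / m        # sup over the finite class
    return total / n_trials

# Example: G = class of threshold indicators g_t(z) = 1{z <= t} on a fixed sample.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z = rng.uniform(size=20)                     # sample Z_1, ..., Z_m
    thresholds = np.linspace(0, 1, 11)
    values = np.array([(Z <= t).astype(float) for t in thresholds])
    print("hat R_m(G) approx", empirical_rademacher_average(values, rng=1))
```

For a finite class the supremum is an exact maximum over rows; for infinite classes one needs the structural tools developed below (restrictions, the growth function).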

Theorem 1.1. We have
$$\mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \Big( \mathbb{E}[g(Z)] - \frac{1}{m} \sum_{i=1}^m g(Z_i) \Big) \Big] \le 2 R_m(\mathcal{G}).$$

Proof. Introduce the ghost sample $Z'_1, \ldots, Z'_m$. By that we mean that the $Z'_i$'s are independent of each other and of the $Z_i$'s, and have the same distribution as the latter. Then we have
$$\begin{aligned}
\mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \Big( \mathbb{E}[g(Z)] - \frac{1}{m}\sum_{i=1}^m g(Z_i) \Big) \Big]
&= \mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \Big( \mathbb{E}\Big[ \frac{1}{m}\sum_{i=1}^m g(Z'_i) \,\Big|\, Z_1^m \Big] - \frac{1}{m}\sum_{i=1}^m g(Z_i) \Big) \Big] \\
&= \mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \mathbb{E}\Big[ \frac{1}{m}\sum_{i=1}^m \big( g(Z'_i) - g(Z_i) \big) \,\Big|\, Z_1^m \Big] \Big] \\
&\le \mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^m \big( g(Z'_i) - g(Z_i) \big) \Big] \\
&= \mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^m \epsilon_i \big( g(Z'_i) - g(Z_i) \big) \Big] \\
&\le \mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^m \epsilon_i g(Z'_i) \Big] + \mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^m (-\epsilon_i) g(Z_i) \Big] \\
&= 2 R_m(\mathcal{G}).
\end{aligned}$$
Here the first inequality holds because the supremum of a conditional expectation is at most the conditional expectation of the supremum; the following equality holds because each $g(Z'_i) - g(Z_i)$ is symmetric about zero, so multiplying by $\epsilon_i$ does not change the joint distribution; and the final equality holds because $-\epsilon_i$ has the same distribution as $\epsilon_i$ and the $Z'_i$ have the same distribution as the $Z_i$.

Since $R_m(\mathcal{G}) = R_m(-\mathcal{G})$, we have the following corollary.

Corollary 1.2. We have
$$\mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \Big( \frac{1}{m}\sum_{i=1}^m g(Z_i) - \mathbb{E}[g(Z)] \Big) \Big] \le 2 R_m(\mathcal{G}).$$

Since $g(Z_i) \in [a,b]$, the quantity $\sup_{g \in \mathcal{G}} \big( \mathbb{E}[g(Z)] - \frac{1}{m}\sum_{i=1}^m g(Z_i) \big)$ does not change by more than $(b-a)/m$ if some $Z_i$ is changed to $Z'_i$. Applying the bounded differences inequality, we get the following corollary.

Corollary 1.3. With probability at least $1 - \delta$,
$$\sup_{g \in \mathcal{G}} \Big( \mathbb{E}[g(Z)] - \frac{1}{m}\sum_{i=1}^m g(Z_i) \Big) \le 2 R_m(\mathcal{G}) + (b-a) \sqrt{\frac{\ln(1/\delta)}{2m}}.$$

Recall that we denote the empirical $\ell$-loss minimizer by $\hat{f}_\ell$. We refer to $L_\ell(\hat{f}_\ell) - \inf_{f \in \mathcal{F}} L_\ell(f)$ as the estimation error. The next result bounds the estimation error using Rademacher averages.
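As a sanity check on Theorem 1.1 (again an illustrative sketch, not something prescribed by the notes), the simulation below compares $\mathbb{E}\big[\sup_g \big(\mathbb{E}[g(Z)] - \frac{1}{m}\sum_i g(Z_i)\big)\big]$ with $2 R_m(\mathcal{G})$ for the finite class of threshold indicators $g_t(z) = \mathbf{1}\{z \le t\}$ under $Z \sim \mathrm{Uniform}[0,1]$, where $\mathbb{E}[g_t(Z)] = t$ exactly. The parameters (`m`, `n_samples`, `n_eps`) are arbitrary assumptions.

```python
# Numerical illustration of Theorem 1.1 for a small finite class of thresholds.
import numpy as np

def simulate(m=50, n_samples=2000, n_eps=200, seed=0):
    rng = np.random.default_rng(seed)
    thresholds = np.linspace(0.0, 1.0, 21)        # finite class G
    true_means = thresholds                       # E[1{Z <= t}] = t for Uniform[0,1]
    lhs, rad = 0.0, 0.0
    for _ in range(n_samples):
        Z = rng.uniform(size=m)
        vals = (Z[None, :] <= thresholds[:, None]).astype(float)   # (|G|, m)
        emp_means = vals.mean(axis=1)
        lhs += np.max(true_means - emp_means)     # sup_g (E g - empirical mean)
        eps = rng.choice([-1.0, 1.0], size=(n_eps, m))
        rad += np.max(vals @ eps.T, axis=0).mean() / m  # hat R_m(G) for this sample
    return lhs / n_samples, 2 * rad / n_samples   # averaging hat R_m estimates R_m

if __name__ == "__main__":
    lhs, rhs = simulate()
    print(f"E[sup_g (E g - empirical mean)] ~ {lhs:.4f}  <=  2 R_m(G) ~ {rhs:.4f}")
```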

2 Expected Regret

Now let us examine the expected regret of the empirical risk minimizer (analogous to the statistical risk). Let
$$\hat{g} := \arg\min_{g \in \mathcal{G}} \frac{1}{m} \sum_{i=1}^m g(Z_i),$$
where $\tau = (Z_1, \ldots, Z_m)$ is the training set, and let
$$g^* := \arg\min_{g \in \mathcal{G}} \mathbb{E}[g(Z)]$$
be the true minimizer.

Lemma 2.1. The expected regret satisfies
$$\mathbb{E}\big[ \mathbb{E}[\hat{g}(Z) \mid \tau] \big] - \mathbb{E}[g^*(Z)] \;\le\; 2 R_m(\mathcal{G}) + \mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \Big( \frac{1}{m}\sum_{i=1}^m g(Z_i) - \mathbb{E}[g(Z)] \Big) \Big] \;\le\; 4 R_m(\mathcal{G}),$$
where the outer expectation is with respect to $\hat{g}$, i.e., over the randomness in the training set.

Proof. We have
$$\begin{aligned}
\mathbb{E}[\hat{g}(Z) \mid \tau] - \mathbb{E}[g^*(Z)]
&= \Big( \mathbb{E}[\hat{g}(Z) \mid \tau] - \frac{1}{m}\sum_{i=1}^m \hat{g}(Z_i) \Big) + \Big( \frac{1}{m}\sum_{i=1}^m \hat{g}(Z_i) - \mathbb{E}[g^*(Z)] \Big) \\
&\le \Big( \mathbb{E}[\hat{g}(Z) \mid \tau] - \frac{1}{m}\sum_{i=1}^m \hat{g}(Z_i) \Big) + \Big( \frac{1}{m}\sum_{i=1}^m g^*(Z_i) - \mathbb{E}[g^*(Z)] \Big) \\
&\le \sup_{g \in \mathcal{G}} \Big( \mathbb{E}[g(Z)] - \frac{1}{m}\sum_{i=1}^m g(Z_i) \Big) + \Big( \frac{1}{m}\sum_{i=1}^m g^*(Z_i) - \mathbb{E}[g^*(Z)] \Big),
\end{aligned}$$
where the first inequality uses that $\hat{g}$ minimizes the empirical risk. Taking expectations over the training set, the first term is at most $2 R_m(\mathcal{G})$ by Theorem 1.1, and the second term is at most $\mathbb{E}\big[\sup_{g \in \mathcal{G}} \big( \frac{1}{m}\sum_i g(Z_i) - \mathbb{E}[g(Z)] \big)\big]$; this gives the first inequality of the lemma. The final claim is straightforward from Corollary 1.2.
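The following simulation sketch (illustrative assumptions: $X \sim \mathrm{Uniform}[0,1]$, labels flipped with probability $0.1$, $\mathcal{G}$ the 0-1 loss class of threshold classifiers) estimates the expected regret of empirical risk minimization and compares it with the $4 R_m(\mathcal{G})$ bound of Lemma 2.1. Nothing here is prescribed by the notes; it is only a sanity check, and the function name `expected_regret_vs_bound` is hypothetical.

```python
# Comparing the expected regret of ERM with the 4 R_m(G) bound of Lemma 2.1.
import numpy as np

def expected_regret_vs_bound(m=50, n_trials=1000, n_eps=200, seed=0):
    rng = np.random.default_rng(seed)
    thresholds = np.linspace(0.0, 1.0, 21)
    # Exact risk of h_t(x) = sign(x - t) under this noise model; minimum 0.1 at t = 0.5.
    true_risk = np.where(thresholds <= 0.5, 0.5 - 0.8 * thresholds, 0.8 * thresholds - 0.3)
    regret, rad = 0.0, 0.0
    for _ in range(n_trials):
        X = rng.uniform(size=m)
        Y = np.where(X >= 0.5, 1, -1) * np.where(rng.uniform(size=m) < 0.1, -1, 1)
        preds = np.where(X[None, :] >= thresholds[:, None], 1, -1)
        vals = (preds != Y[None, :]).astype(float)       # loss-class values g(Z_i)
        g_hat = np.argmin(vals.mean(axis=1))             # empirical risk minimizer
        regret += true_risk[g_hat] - true_risk.min()
        eps = rng.choice([-1.0, 1.0], size=(n_eps, m))
        rad += np.max(vals @ eps.T, axis=0).mean() / m   # hat R_m(G) for this sample
    return regret / n_trials, 4 * rad / n_trials

if __name__ == "__main__":
    reg, bound = expected_regret_vs_bound()
    print(f"expected regret ~ {reg:.4f}  <=  4 R_m(G) ~ {bound:.4f}")
```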

3 Growth function

Consider the case $\mathcal{Y} = \{\pm 1\}$ (classification). Let $\ell$ be the 0-1 loss function and $\mathcal{F}$ be a class of $\pm 1$-valued functions. We can relate the Rademacher average of $\ell \circ \mathcal{F}$ to that of $\mathcal{F}$ as follows.

Lemma 3.1. Suppose $\mathcal{F} \subseteq \{\pm 1\}^{\mathcal{X}}$ and let $\ell(y', y) = \mathbf{1}[y' \ne y]$ be the 0-1 loss function. Then we have
$$R_m(\ell \circ \mathcal{F}) = \tfrac{1}{2} R_m(\mathcal{F}).$$

Proof. Note that we can write $\ell(y', y)$ as $(1 - y' y)/2$. Then we have
$$\begin{aligned}
\hat{R}_m(\ell \circ \mathcal{F})
&= \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \epsilon_i \frac{1 - Y_i f(X_i)}{2} \,\Big|\, X_1^m, Y_1^m \Big] \\
&= \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \epsilon_i \frac{- Y_i f(X_i)}{2} \,\Big|\, X_1^m, Y_1^m \Big] \quad (1)\\
&= \frac{1}{2}\, \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m (-\epsilon_i Y_i) f(X_i) \,\Big|\, X_1^m, Y_1^m \Big] \\
&= \frac{1}{2}\, \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \epsilon_i f(X_i) \,\Big|\, X_1^m, Y_1^m \Big] \quad (2)\\
&= \tfrac{1}{2} \hat{R}_m(\mathcal{F}).
\end{aligned}$$
Equation (1) follows because $\mathbb{E}[\epsilon_i \mid X_1^m, Y_1^m] = 0$ (the term $\frac{1}{m}\sum_i \epsilon_i/2$ does not depend on $f$, so it pulls out of the supremum and has zero conditional expectation). Equation (2) follows because the $-\epsilon_i Y_i$ jointly have the same distribution as the $\epsilon_i$, conditionally on the sample. Taking expectations over the sample proves the lemma.

Note that the Rademacher average of the class $\mathcal{F}$ can also be written as
$$R_m(\mathcal{F}) = \mathbb{E}\Big[ \max_{a \in \mathcal{F}|_{X_1^m}} \frac{1}{m}\sum_{i=1}^m \epsilon_i a_i \Big],$$
where $\mathcal{F}|_{X_1^m}$ is the function class $\mathcal{F}$ restricted to the set $X_1, \ldots, X_m$. That is,
$$\mathcal{F}|_{X_1^m} := \{ (f(X_1), \ldots, f(X_m)) : f \in \mathcal{F} \}.$$
Note that $\mathcal{F}|_{X_1^m}$ is finite and $|\mathcal{F}|_{X_1^m}| \le \min\{|\mathcal{F}|, 2^m\}$. Thus we can define the growth function as
$$\Pi_{\mathcal{F}}(m) := \max_{x_1, \ldots, x_m \in \mathcal{X}} \big| \mathcal{F}|_{x_1^m} \big|.$$
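For a concrete feel for the restriction $\mathcal{F}|_{x_1^m}$ and the growth function, the sketch below (not from the notes) enumerates the sign patterns realized by threshold classifiers $f_t(x) = +1$ iff $x \ge t$ on $m$ points; for this class $\Pi_{\mathcal{F}}(m) = m + 1 \ll 2^m$. The helper `restriction` and the choice of cut points are illustrative assumptions.

```python
# Enumerating the restriction F|_{x_1^m} for threshold classifiers on the real line.
import numpy as np

def restriction(points, thresholds):
    """Distinct sign patterns (f_t(x_1), ..., f_t(x_m)) realized on `points`."""
    return {tuple(np.where(points >= t, 1, -1)) for t in thresholds}

if __name__ == "__main__":
    m = 5
    points = np.sort(np.random.default_rng(0).uniform(size=m))
    # Enough thresholds to realize every achievable pattern: one below all points,
    # one above all points, and one between each pair of consecutive points.
    cuts = np.concatenate(([points[0] - 1], (points[:-1] + points[1:]) / 2, [points[-1] + 1]))
    F_x = restriction(points, cuts)
    print(f"|F|_x| = {len(F_x)}  (m + 1 = {m + 1} for thresholds; compare 2^m = {2 ** m})")
```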

The following lemma, due to Massart, allows us to bound the Rademacher average in terms of the growth function.

Lemma 3.2 (Finite Class Lemma). Let $A$ be some finite subset of $\mathbb{R}^m$ and $\epsilon_1, \ldots, \epsilon_m$ be independent Rademacher random variables. Let $r := \max_{a \in A} \|a\|_2$. Then we have
$$\mathbb{E}\Big[ \max_{a \in A} \frac{1}{m}\sum_{i=1}^m \epsilon_i a_i \Big] \le \frac{r \sqrt{2 \ln |A|}}{m}.$$

Proof. Let $\mu := \mathbb{E}\big[ \max_{a \in A} \sum_{i=1}^m \epsilon_i a_i \big]$. We have, for any $\lambda > 0$,
$$\begin{aligned}
e^{\lambda \mu}
&\le \mathbb{E}\Big[ \exp\Big( \lambda \max_{a \in A} \sum_{i=1}^m \epsilon_i a_i \Big) \Big] &&\text{(Jensen's inequality)}\\
&= \mathbb{E}\Big[ \max_{a \in A} \exp\Big( \lambda \sum_{i=1}^m \epsilon_i a_i \Big) \Big] \\
&\le \sum_{a \in A} \mathbb{E}\Big[ \exp\Big( \lambda \sum_{i=1}^m \epsilon_i a_i \Big) \Big] \\
&= \sum_{a \in A} \prod_{i=1}^m \mathbb{E}\big[ e^{\lambda \epsilon_i a_i} \big] \\
&\le \sum_{a \in A} \prod_{i=1}^m e^{\lambda^2 a_i^2 / 2} &&\text{(Hoeffding's lemma)}\\
&\le \sum_{a \in A} e^{\lambda^2 r^2 / 2} \\
&= |A| \, e^{\lambda^2 r^2 / 2}.
\end{aligned}$$
Taking logs and dividing by $\lambda$, we get that, for any $\lambda > 0$,
$$\mu \le \frac{\ln |A|}{\lambda} + \frac{\lambda r^2}{2}.$$
Setting $\lambda = \sqrt{2 \ln |A| / r^2}$ gives
$$\mu \le r \sqrt{2 \ln |A|},$$
which proves the lemma after dividing by $m$.
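Finally, a hedged numerical check of the Finite Class Lemma (an illustration, not part of the notes): for a randomly generated finite set $A \subset \mathbb{R}^m$ we estimate $\mathbb{E}\big[\max_{a \in A} \frac{1}{m}\sum_i \epsilon_i a_i\big]$ by Monte Carlo and compare it with $r\sqrt{2\ln|A|}/m$. The parameters (`m`, `n_vectors`, `n_trials`) are arbitrary assumptions.

```python
# Monte Carlo check of Massart's Finite Class Lemma for a random finite set A.
import numpy as np

def massart_check(m=100, n_vectors=50, n_trials=5000, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(n_vectors, m))     # finite subset of R^m
    r = np.linalg.norm(A, axis=1).max()                 # r = max_a ||a||_2
    eps = rng.choice([-1.0, 1.0], size=(n_trials, m))
    lhs = np.max(A @ eps.T, axis=0).mean() / m          # E[max_a (1/m) <eps, a>]
    rhs = r * np.sqrt(2 * np.log(n_vectors)) / m
    return lhs, rhs

if __name__ == "__main__":
    lhs, rhs = massart_check()
    print(f"E[max_a (1/m) sum eps_i a_i] ~ {lhs:.4f}  <=  r sqrt(2 ln|A|)/m ~ {rhs:.4f}")
```

Combining Lemma 3.2 with the restriction viewpoint above bounds $R_m(\mathcal{F})$ in terms of the growth function, since $r \le \sqrt{m}$ for $\pm 1$-valued classes and $|\mathcal{F}|_{X_1^m}| \le \Pi_{\mathcal{F}}(m)$.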