Rademacher Complexity. Examples

Algorithmic Foundations of Learning    Lecture 3: Rademacher Complexity. Examples
Lecturer: Patrick Rebeschini    Version: October 16th, 2018

3.1 Introduction

In the last lecture we introduced the notion of Rademacher complexity and showed that it yields an upper bound on the expected value of the uniform (over the choice of actions/rules) deviation between the expected risk $r$ and the empirical risk $R$, namely,

$$\mathbf{E} \sup_{a \in \mathcal{A}} \{ r(a) - R(a) \} \le 2\,\mathbf{E}\,\mathrm{Rad}(\mathcal{L} \circ \{Z_1, \ldots, Z_n\}),$$

where we recall the notation

$$\mathcal{L} := \{ z \in \mathcal{Z} \to \ell(a, z) \in \mathbf{R} : a \in \mathcal{A} \}.$$

In this lecture we establish bounds for $\mathrm{Rad}(\mathcal{L} \circ \{z_1, \ldots, z_n\})$ for any $z_1, \ldots, z_n \in \mathcal{Z}$ in the setting of regression. In supervised learning, the observed examples correspond to pairs of points, i.e., $Z = (X, Y) \in \mathcal{X} \times \mathcal{Y}$. The point $X$ is called feature or covariate, and the point $Y$ is its corresponding label. The set of admissible decisions is a subset of the set of functions from $\mathcal{X}$ to $\mathcal{Y}$, i.e., $\mathcal{A} \subseteq \mathcal{B} := \{ a : \mathcal{X} \to \mathcal{Y} \}$, and the loss function is of the form $\ell(a, (x, y)) = \varphi(a(x), y)$, for a function $\varphi : \mathcal{Y} \times \mathcal{Y} \to \mathbf{R}_+$. The regression setting is represented by the choice $\mathcal{X} = \mathbf{R}^d$ for a given dimension $d$, and $\mathcal{Y} = \mathbf{R}$. We have $S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, and $s = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ represents a realization of the training sample. Let us recall the following notation:

$$\mathcal{A} \circ \{x_1, \ldots, x_n\} := \{ (a(x_1), \ldots, a(x_n)) \in \mathcal{Y}^n : a \in \mathcal{A} \}.$$

Proposition 3.1 Let the function $\hat{y} \to \varphi(\hat{y}, y)$ be $\gamma$-Lipschitz for any $y \in \mathcal{Y}$. Then, for any $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$,

$$\mathrm{Rad}(\mathcal{L} \circ \{(x_1, y_1), \ldots, (x_n, y_n)\}) \le \gamma\,\mathrm{Rad}(\mathcal{A} \circ \{x_1, \ldots, x_n\}).$$

Proof: By the contraction property of Rademacher complexity, Lemma 2.10, we get

$$\mathrm{Rad}(\mathcal{L} \circ s) = \mathbf{E} \sup_{a \in \mathcal{A}} \frac{1}{n} \sum_{i=1}^n \Omega_i\, \varphi(a(x_i), y_i) = \mathrm{Rad}\big( (\varphi(\cdot, y_1), \ldots, \varphi(\cdot, y_n)) \circ \mathcal{A} \circ \{x_1, \ldots, x_n\} \big) \le \gamma\,\mathrm{Rad}(\mathcal{A} \circ \{x_1, \ldots, x_n\}).$$

Below we show how to control the quantity $\mathrm{Rad}(\mathcal{A} \circ \{x_1, \ldots, x_n\})$ for some function classes $\mathcal{A}$ of interest.
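The propositions below evaluate the supremum in the definition of $\mathrm{Rad}(\mathcal{A} \circ \{x_1, \ldots, x_n\})$ in closed form. For a finite class the quantity can also be approximated directly by Monte Carlo over the Rademacher signs, which is a convenient sanity check for the bounds that follow. The Python sketch below is not part of the notes; the function name and the toy data are illustrative only.

```python
import numpy as np

def empirical_rademacher(predictions, num_draws=2000, rng=None):
    """Monte Carlo estimate of Rad(A o {x_1, ..., x_n}) for a finite class.

    predictions: array of shape (|A|, n); row a holds (a(x_1), ..., a(x_n)),
    i.e. one element of A o {x_1, ..., x_n} per row.
    """
    rng = np.random.default_rng(rng)
    _, n = predictions.shape
    total = 0.0
    for _ in range(num_draws):
        omega = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs Omega_1, ..., Omega_n
        total += np.max(predictions @ omega) / n  # sup over the (finite) class
    return total / num_draws

# Toy usage: three fixed predictors evaluated at n = 5 points.
preds = np.array([[1.0, -1.0, 1.0, 1.0, -1.0],
                  [0.5,  0.5, 0.5, 0.5,  0.5],
                  [0.0,  0.0, 0.0, 0.0,  0.0]])
print(empirical_rademacher(preds, rng=0))
```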

3.2 Linear predictors: $\ell_2/\ell_2$ constraints

In the case of $\ell_2/\ell_2$ constraints, the Rademacher complexity of linear predictors does not depend explicitly on the dimension $d$ (the dependence on $d$ is implicit, via the term $\max_i \|x_i\|_2$).

Proposition 3.2 Let $\mathcal{A}_2 := \{ x \in \mathbf{R}^d \to w^\top x : w \in \mathbf{R}^d, \|w\|_2 \le 1 \}$. Then, for any $x_1, \ldots, x_n \in \mathbf{R}^d$,

$$\mathrm{Rad}(\mathcal{A}_2 \circ \{x_1, \ldots, x_n\}) \le \frac{\max_i \|x_i\|_2}{\sqrt{n}}.$$

Proof: We have

$$\mathrm{Rad}(\mathcal{A}_2 \circ \{x_1, \ldots, x_n\}) = \mathbf{E} \sup_{w \in \mathbf{R}^d : \|w\|_2 \le 1} \frac{1}{n} \sum_{i=1}^n \Omega_i\, w^\top x_i = \frac{1}{n}\, \mathbf{E} \sup_{w \in \mathbf{R}^d : \|w\|_2 \le 1} w^\top \Big( \sum_{i=1}^n \Omega_i x_i \Big)$$
$$\le \frac{1}{n}\, \mathbf{E} \Big\| \sum_{i=1}^n \Omega_i x_i \Big\|_2 \qquad \text{by Cauchy-Schwarz's inequality, } x^\top y \le \|x\|_2 \|y\|_2$$
$$\le \frac{1}{n} \sqrt{ \mathbf{E} \Big\| \sum_{i=1}^n \Omega_i x_i \Big\|_2^2 } \qquad \text{by Jensen's inequality, as } x \to \sqrt{x} \text{ is concave}$$
$$= \frac{1}{n} \sqrt{ \sum_{j=1}^d \mathbf{E} \Big( \sum_{i=1}^n \Omega_i x_{i,j} \Big)^2 } = \frac{1}{n} \sqrt{ \sum_{j=1}^d \sum_{i=1}^n x_{i,j}^2 } \qquad \text{as the } \Omega_i\text{'s are independent, } \mathbf{E}\,\Omega_i = 0 \text{ and } \Omega_i^2 = 1$$
$$= \frac{1}{n} \sqrt{ \sum_{i=1}^n \|x_i\|_2^2 } \le \frac{\max_i \|x_i\|_2}{\sqrt{n}}.$$

Remark 3.3 Note that as the predictors that we are considering are linear, i.e., $x \in \mathbf{R}^d \to w^\top x$, the constraint $\|w\|_2 \le 1$ in the definition of $\mathcal{A}_2$ in Proposition 3.2 is without loss of generality. In fact, if $\|w\|_2 \le c$ for a given constant $c \ge 0$, then we can rescale $w^\top x = \big(\tfrac{w}{\|w\|_2}\big)^\top (\|w\|_2\, x)$ and we have the equivalence

$$\{ x \in \mathbf{R}^d \to w^\top x : w \in \mathbf{R}^d, \|w\|_2 \le c \} = \{ x \in \mathbf{R}^d \to w^\top (c\,x) : w \in \mathbf{R}^d, \|w\|_2 \le 1 \}.$$

Proposition 3.2 still applies, with a constant $c$ on the right-hand side of the bound.
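As the proof shows, for the $\ell_2$ ball the inner supremum is available in closed form, $\sup_{\|w\|_2 \le 1} w^\top v = \|v\|_2$, so $\mathrm{Rad}(\mathcal{A}_2 \circ \{x_1, \ldots, x_n\})$ can be estimated by averaging $\frac{1}{n}\|\sum_i \Omega_i x_i\|_2$ over random sign vectors and compared with the bound of Proposition 3.2. A minimal sketch (not from the notes, names and data illustrative):

```python
import numpy as np

def rad_l2_ball(X, num_draws=5000, rng=None):
    """Monte Carlo estimate of Rad(A_2 o {x_1, ..., x_n}) for the class
    {x -> w.x : ||w||_2 <= 1}.  By Cauchy-Schwarz the inner supremum equals
    ||sum_i Omega_i x_i||_2.  X has shape (n, d), one sample per row."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    vals = [np.linalg.norm(rng.choice([-1.0, 1.0], size=n) @ X) / n
            for _ in range(num_draws)]
    return float(np.mean(vals))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                            # n = 200 points in R^50
estimate = rad_l2_ball(X, rng=1)
bound = np.max(np.linalg.norm(X, axis=1)) / np.sqrt(len(X))
print(estimate, bound)   # up to Monte Carlo error, estimate <= bound (Proposition 3.2)
```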

3.3 Linear predictors: $\ell_1/\ell_\infty$ constraints ($\ell_1$ Boosting)

In the case of $\ell_1/\ell_\infty$ constraints, the Rademacher complexity of linear predictors only depends logarithmically on the dimension $d$.

Proposition 3.4 Let $\mathcal{A}_1 := \{ x \in \mathbf{R}^d \to w^\top x : w \in \mathbf{R}^d, \|w\|_1 \le 1 \}$. Then, for any $x_1, \ldots, x_n \in \mathbf{R}^d$,

$$\mathrm{Rad}(\mathcal{A}_1 \circ \{x_1, \ldots, x_n\}) \le \max_i \|x_i\|_\infty \sqrt{\frac{2 \log(2d)}{n}}.$$

Proof: We have

$$\mathrm{Rad}(\mathcal{A}_1 \circ \{x_1, \ldots, x_n\}) = \mathbf{E} \sup_{w \in \mathbf{R}^d : \|w\|_1 \le 1} \frac{1}{n} \sum_{i=1}^n \Omega_i\, w^\top x_i = \frac{1}{n}\, \mathbf{E} \sup_{w \in \mathbf{R}^d : \|w\|_1 \le 1} w^\top \Big( \sum_{i=1}^n \Omega_i x_i \Big)$$
$$\le \frac{1}{n}\, \mathbf{E} \sup_{w \in \mathbf{R}^d : \|w\|_1 \le 1} \|w\|_1 \Big\| \sum_{i=1}^n \Omega_i x_i \Big\|_\infty = \frac{1}{n}\, \mathbf{E} \Big\| \sum_{i=1}^n \Omega_i x_i \Big\|_\infty \qquad \text{by Hölder's inequality, } x^\top y \le \|x\|_1 \|y\|_\infty.$$

Let $t_j := (x_{1,j}, \ldots, x_{n,j}) \in \mathbf{R}^n$ for any $j \in \{1, \ldots, d\}$, and let $T = \{t_1, \ldots, t_d\}$. Then,

$$\Big\| \sum_{i=1}^n \Omega_i x_i \Big\|_\infty = \max_{j \in \{1, \ldots, d\}} \Big| \sum_{i=1}^n \Omega_i x_{i,j} \Big| = \max_{j \in \{1, \ldots, d\}} |\Omega^\top t_j| = \max_{t \in T} |\Omega^\top t|,$$

whose expectation looks like a Rademacher complexity apart from the absolute value (and the normalization by $1/n$). To remove the absolute value, note that for any $\omega_1, \ldots, \omega_n \in \{-1, 1\}$ we have $\max_{t \in T} |\omega^\top t| = \max_{t \in T \cup (-T)} \omega^\top t$, where we have defined $-T = \{-t_1, \ldots, -t_d\}$, with $-t_j = (-x_{1,j}, \ldots, -x_{n,j})$. Hence, we have

$$\mathrm{Rad}(\mathcal{A}_1 \circ \{x_1, \ldots, x_n\}) \le \mathrm{Rad}(T \cup (-T)),$$

and the proof follows by Massart's lemma as

$$\mathrm{Rad}(T \cup (-T)) \le \max_{t \in T \cup (-T)} \|t\|_2\, \frac{\sqrt{2 \log |T \cup (-T)|}}{n} \le \max_i \|x_i\|_\infty \sqrt{\frac{2 \log(2d)}{n}}.$$

Remark 3.5 Note that as the predictors that we are considering are linear, i.e., $x \in \mathbf{R}^d \to w^\top x$, the constraint $\|w\|_1 \le 1$ in the definition of $\mathcal{A}_1$ is without loss of generality. In fact, if $\|w\|_1 \le c$ for a given constant $c \ge 0$, then we can rescale $w^\top x = \big(\tfrac{w}{\|w\|_1}\big)^\top (\|w\|_1\, x)$ and we have the equivalence

$$\{ x \in \mathbf{R}^d \to w^\top x : w \in \mathbf{R}^d, \|w\|_1 \le c \} = \{ x \in \mathbf{R}^d \to w^\top (c\,x) : w \in \mathbf{R}^d, \|w\|_1 \le 1 \}.$$

Proposition 3.4 still applies, with a constant $c$ on the right-hand side of the bound.
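The same Monte Carlo check works here, since by Hölder's inequality the inner supremum over the $\ell_1$ ball is $\|\sum_i \Omega_i x_i\|_\infty$. The sketch below (illustrative, not from the notes) makes the logarithmic dependence on $d$ visible by taking $d$ much larger than $n$.

```python
import numpy as np

def rad_l1_ball(X, num_draws=5000, rng=None):
    """Monte Carlo estimate of Rad(A_1 o {x_1, ..., x_n}) for the class
    {x -> w.x : ||w||_1 <= 1}.  By Holder's inequality the inner supremum
    equals ||sum_i Omega_i x_i||_inf."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    vals = [np.max(np.abs(rng.choice([-1.0, 1.0], size=n) @ X)) / n
            for _ in range(num_draws)]
    return float(np.mean(vals))

rng = np.random.default_rng(0)
n, d = 200, 10_000                        # d >> n: the bound only grows like sqrt(log d)
X = rng.normal(size=(n, d))
estimate = rad_l1_ball(X, rng=1)
bound = np.max(np.abs(X)) * np.sqrt(2 * np.log(2 * d) / n)
print(estimate, bound)   # up to Monte Carlo error, estimate <= bound (Proposition 3.4)
```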

3.4 Linear predictors: simplex/$\ell_\infty$ constraints (Boosting)

Proposition 3.6 Let $\Delta_d := \{ w \in \mathbf{R}^d : \|w\|_1 = 1,\ w_1, \ldots, w_d \ge 0 \}$ and let $\mathcal{A}_\Delta := \{ x \in \mathbf{R}^d \to w^\top x : w \in \Delta_d \}$. Then, for any $x_1, \ldots, x_n \in \mathbf{R}^d$,

$$\mathrm{Rad}(\mathcal{A}_\Delta \circ \{x_1, \ldots, x_n\}) \le \max_i \|x_i\|_\infty \sqrt{\frac{2 \log d}{n}}.$$

Proof: We have

$$\mathrm{Rad}(\mathcal{A}_\Delta \circ \{x_1, \ldots, x_n\}) = \mathbf{E} \sup_{w \in \Delta_d} \frac{1}{n} \sum_{i=1}^n \Omega_i\, w^\top x_i = \frac{1}{n}\, \mathbf{E} \sup_{w \in \Delta_d} w^\top \Big( \sum_{i=1}^n \Omega_i x_i \Big).$$

Note that for any vector $v = (v_1, \ldots, v_d) \in \mathbf{R}^d$ we have

$$\sup_{w \in \Delta_d} w^\top v = \max_{j \in \{1, \ldots, d\}} v_j.$$

Then,

$$\frac{1}{n}\, \mathbf{E} \sup_{w \in \Delta_d} w^\top \Big( \sum_{i=1}^n \Omega_i x_i \Big) = \frac{1}{n}\, \mathbf{E} \max_{j \in \{1, \ldots, d\}} \sum_{i=1}^n \Omega_i x_{i,j} = \mathrm{Rad}(T),$$

with $T = \{t_1, \ldots, t_d\}$, where $t_j = (x_{1,j}, \ldots, x_{n,j})$ for any $j \in \{1, \ldots, d\}$. The proof follows by Massart's lemma as

$$\mathrm{Rad}(T) \le \max_{t \in T} \|t\|_2\, \frac{\sqrt{2 \log |T|}}{n} \le \max_i \|x_i\|_\infty \sqrt{\frac{2 \log d}{n}}.$$
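For the simplex the inner supremum is the largest coordinate of $\sum_i \Omega_i x_i$, without the absolute value, so a numerical check differs from the $\ell_1$ case in only one line. A brief sketch (illustrative, not from the notes):

```python
import numpy as np

def rad_simplex(X, num_draws=5000, rng=None):
    """Monte Carlo estimate of Rad(A_Delta o {x_1, ..., x_n}) for predictors
    x -> w.x with w in the probability simplex: the inner supremum is the
    largest coordinate of sum_i Omega_i x_i (no absolute value)."""
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    vals = [np.max(rng.choice([-1.0, 1.0], size=n) @ X) / n
            for _ in range(num_draws)]
    return float(np.mean(vals))

rng = np.random.default_rng(0)
n, d = 200, 1_000
X = rng.normal(size=(n, d))
print(rad_simplex(X, rng=1),
      np.max(np.abs(X)) * np.sqrt(2 * np.log(d) / n))   # Proposition 3.6 bound
```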

3.5 Feed-forward neural networks

Let us define a feed-forward neural network with activation functions applied element-wise to its units. A layer $l^{(k)} : \mathbf{R}^{d_{k-1}} \to \mathbf{R}^{d_k}$ consists of a coordinate-wise composition of an activation function $\sigma^{(k)} : \mathbf{R} \to \mathbf{R}$ and an affine map, namely,

$$l^{(k)}(x) := \sigma^{(k)}(W^{(k)} x + b^{(k)}),$$

for a given interaction matrix $W^{(k)}$ and bias vector $b^{(k)}$. A neural network with depth $p$ (and $p-1$ hidden layers) is given by the function $f^{(p)} : \mathbf{R}^d \to \mathbf{R}$ defined as

$$f^{(p)}(x) := l^{(p)} \circ \cdots \circ l^{(1)}(x) = l^{(p)}\big( \cdots l^{(2)}(l^{(1)}(x)) \big),$$

with $d_0 = d$, $d_p = 1$, $\sigma^{(r)} = \sigma$ for a given function $\sigma$ for all $r < p$, and $\sigma^{(p)}(x) = x$ (i.e., the last layer is simply an affine map). The activation function $\sigma$ is known to the designer, while the interaction matrices and the bias vectors are treated as parameters to tune. For instance, a class of neural networks with depth $p$ is given by

$$\mathcal{A}^{(p)} := \{ x \in \mathbf{R}^d \to f^{(p)}(x) : \|W^{(k)}\|_\infty \le \omega,\ \|b^{(k)}\|_\infty \le \beta\ \ \forall k \}, \qquad (3.1)$$

where for a given matrix $M$, the $\ell_\infty$ norm is defined as $\|M\|_\infty := \max_l \sum_j |M_{l,j}|$.

The Rademacher complexity of a feed-forward neural network can be bounded recursively by considering each layer at a time. A bound that can be used for the recursion is given by the following proposition, which expresses the Rademacher complexity of the outputs of one layer in terms of the outputs of the previous layer.

Proposition 3.7 Let $\mathcal{L}$ be a class of functions from $\mathbf{R}^d$ to $\mathbf{R}$ that includes the zero function. Let $\sigma : \mathbf{R} \to \mathbf{R}$ be $\alpha$-Lipschitz and define

$$\mathcal{L}' := \Big\{ x \in \mathbf{R}^d \to \sigma\Big( \sum_{j=1}^m w_j l_j(x) + b \Big) \in \mathbf{R} : |b| \le \beta,\ \|w\|_1 \le \omega,\ l_1, \ldots, l_m \in \mathcal{L} \Big\}.$$

Then, for any $x_1, \ldots, x_n \in \mathbf{R}^d$,

$$\mathrm{Rad}(\mathcal{L}' \circ \{x_1, \ldots, x_n\}) \le \alpha \Big( \frac{\beta}{\sqrt{n}} + 2\omega\, \mathrm{Rad}(\mathcal{L} \circ \{x_1, \ldots, x_n\}) \Big). \qquad (3.2)$$

If $\mathcal{L} = -\mathcal{L}$, then

$$\mathrm{Rad}(\mathcal{L}' \circ \{x_1, \ldots, x_n\}) \le \alpha \Big( \frac{\beta}{\sqrt{n}} + \omega\, \mathrm{Rad}(\mathcal{L} \circ \{x_1, \ldots, x_n\}) \Big). \qquad (3.3)$$

Proof: We give a proof that makes use of many of the properties of the Rademacher complexity described in the previous lecture. Let

$$\mathcal{F} := \Big\{ x \in \mathbf{R}^d \to \sum_{j=1}^m w_j l_j(x) \in \mathbf{R} : \|w\|_1 \le \omega,\ l_1, \ldots, l_m \in \mathcal{L} \Big\}, \qquad \mathcal{G} := \{ x \in \mathbf{R}^d \to b \in \mathbf{R} : |b| \le \beta \}.$$

By the contraction property and the summation property of Rademacher complexities, we have

$$\mathrm{Rad}(\mathcal{L}' \circ \{x_1, \ldots, x_n\}) \le \alpha \big( \mathrm{Rad}(\mathcal{F} \circ \{x_1, \ldots, x_n\}) + \mathrm{Rad}(\mathcal{G} \circ \{x_1, \ldots, x_n\}) \big).$$

On the one hand, as $\mathcal{L}$ contains the zero function we have $\mathcal{F} \circ \{x_1, \ldots, x_n\} \subseteq \omega\, \mathrm{conv}(\mathcal{L} - \mathcal{L}) \circ \{x_1, \ldots, x_n\}$, where $\mathcal{L} - \mathcal{L} = \{ l - l' : l \in \mathcal{L},\ l' \in \mathcal{L} \}$. In fact, first of all note that

$$\mathrm{Rad}(\mathcal{F} \circ \{x_1, \ldots, x_n\}) = \mathrm{Rad}(\bar{\mathcal{F}} \circ \{x_1, \ldots, x_n\}), \quad \text{where} \quad \bar{\mathcal{F}} := \Big\{ x \in \mathbf{R}^d \to \sum_{j=1}^m w_j l_j(x) \in \mathbf{R} : \|w\|_1 = \omega,\ l_1, \ldots, l_m \in \mathcal{L} \Big\}$$

(this is because the supremum over $\|w\|_1 \le \omega$ is achieved at values satisfying $\|w\|_1 = \omega$). Then, note that for any $w \in \mathbf{R}^m$ such that $\|w\|_1 = 1$ we have

$$\sum_{j=1}^m w_j l_j = \sum_{j : w_j \ge 0} |w_j| (l_j - 0) + \sum_{j : w_j < 0} |w_j| (0 - l_j),$$

where $0$ represents the zero function. The right-hand side is a convex combination of elements of $\mathcal{L} - \mathcal{L}$. Hence, by the convex hull property of Rademacher complexity we find

$$\mathrm{Rad}(\mathcal{F} \circ \{x_1, \ldots, x_n\}) \le \omega\, \mathrm{Rad}(\mathrm{conv}(\mathcal{L} - \mathcal{L}) \circ \{x_1, \ldots, x_n\}) = \omega\, \mathrm{Rad}((\mathcal{L} - \mathcal{L}) \circ \{x_1, \ldots, x_n\})$$
$$= \omega\, \mathrm{Rad}(\mathcal{L} \circ \{x_1, \ldots, x_n\}) + \omega\, \mathrm{Rad}(-\mathcal{L} \circ \{x_1, \ldots, x_n\}) = 2\omega\, \mathrm{Rad}(\mathcal{L} \circ \{x_1, \ldots, x_n\}),$$

where the factor $2$ is not necessary if $\mathcal{L} = -\mathcal{L}$: in that case $\sum_j w_j l_j = \sum_j |w_j|\, (\mathrm{sign}(w_j)\, l_j)$ is already a convex combination of elements of $\mathcal{L}$ itself. On the other hand,

$$\mathrm{Rad}(\mathcal{G} \circ \{x_1, \ldots, x_n\}) = \mathbf{E} \sup_{b : |b| \le \beta} \frac{1}{n} \sum_{i=1}^n \Omega_i\, b \le \frac{\beta}{n}\, \mathbf{E} \Big| \sum_{i=1}^n \Omega_i \Big| \le \frac{\beta}{\sqrt{n}},$$

where the last inequality follows by Jensen's inequality, as $\mathbf{E} |\sum_{i} \Omega_i| \le \sqrt{\mathbf{E} (\sum_{i} \Omega_i)^2} = \sqrt{n}$, using the independence of the $\Omega_i$'s and that $\Omega_i^2 = 1$.

We are now ready to give a bound for the full neural network. We can use Proposition 3.7 to run the recursion, noticing that the last layer involves a linear function (which is 1-Lipschitz). The first layer requires a different treatment, and we can use Proposition 3.4. Using Proposition 3.7 we can establish the following bound for the Rademacher complexity of a layered neural network.
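Proposition 3.7 can also be applied numerically, layer by layer, to propagate a complexity bound through a network. The sketch below is not from the notes; the parameter values and the starting first-layer bound are assumptions for illustration. It iterates (3.2), or (3.3) when the previous class is symmetric.

```python
import numpy as np

def layer_bound(rad_prev, alpha, omega, beta, n, symmetric=False):
    """One application of Proposition 3.7: bounds Rad(L' o {x_1,...,x_n}) in
    terms of Rad(L o {x_1,...,x_n}).  Uses (3.3) when the previous class
    satisfies L = -L (symmetric=True), otherwise the general bound (3.2)."""
    factor = 1.0 if symmetric else 2.0
    return alpha * (beta / np.sqrt(n) + factor * omega * rad_prev)

# Toy usage: iterate the recursion over a depth-p network, starting from an
# assumed bound rad_first on the complexity of the first-layer outputs
# (for instance one derived from Proposition 3.4).
n, omega, beta, lam, p = 200, 0.5, 0.1, 1.0, 4
rad = rad_first = 0.05                         # assumed first-layer bound
for _ in range(p - 2):                         # hidden layers 2, ..., p-1
    rad = layer_bound(rad, lam, omega, beta, n)
rad = layer_bound(rad, 1.0, omega, beta, n)    # output layer: affine, alpha = 1
print(rad)
```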

Proposition 3.8 Let $\sigma$ be $\lambda$-Lipschitz. Let $\mathcal{A}^{(p)}$ be defined as in (3.1). Then, for any $x_1, \ldots, x_n \in \mathbf{R}^d$,

$$\mathrm{Rad}(\mathcal{A}^{(p)} \circ \{x_1, \ldots, x_n\}) \le \frac{1}{\sqrt{n}}\, (\beta + 2\omega\beta\lambda) \sum_{k=0}^{p-3} (2\omega\lambda)^k + 2\omega\, (2\omega\lambda)^{p-2} \max_i \|x_i\|_\infty \sqrt{\frac{2 \log(2d)}{n}}.$$

If $\lambda = 1$ and $\sigma$ is anti-symmetric, namely, $\sigma(x) = -\sigma(-x)$ (so that each class of unit outputs satisfies $\mathcal{L} = -\mathcal{L}$ and (3.3) applies), we have

$$\mathrm{Rad}(\mathcal{A}^{(p)} \circ \{x_1, \ldots, x_n\}) \le \frac{1}{\sqrt{n}}\, \beta \sum_{k=0}^{p-2} \omega^k + \omega^{p-1} \max_i \|x_i\|_\infty \sqrt{\frac{2 \log(2d)}{n}}.$$

Proof: As the last layer of the neural network is linear, i.e., $\sigma^{(p)}(x) = x$, we can apply Proposition 3.7 with $\alpha = 1$ (as $\sigma^{(p)}$ is 1-Lipschitz) once, and then apply (3.2) in Proposition 3.7 with $\alpha = \lambda$ for $p - 2$ times. We find

$$\mathrm{Rad}(\mathcal{A}^{(p)} \circ \{x_1, \ldots, x_n\}) \le \frac{\beta}{\sqrt{n}} + 2\omega \Big( \frac{\beta\lambda}{\sqrt{n}} \sum_{k=0}^{p-3} (2\omega\lambda)^k + (2\omega\lambda)^{p-2}\, \mathrm{Rad}(\mathcal{A}_1 \circ \{x_1, \ldots, x_n\}) \Big).$$

The result of the first inequality follows by Proposition 3.4. The second inequality can be proved using the same strategy, using (3.3) instead of (3.2).
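The first bound of Proposition 3.8 can be evaluated directly; the sketch below (not from the notes, parameter values illustrative) does so and makes the geometric dependence on the depth $p$ through the factor $(2\omega\lambda)^{p-2}$ explicit.

```python
import numpy as np

def network_rad_bound(n, d, x_inf, omega, beta, lam, p):
    """Evaluates the first bound of Proposition 3.8 for a depth-p network with
    ||W^(k)||_inf <= omega, ||b^(k)||_inf <= beta and lambda-Lipschitz sigma;
    x_inf plays the role of max_i ||x_i||_inf."""
    geometric = sum((2 * omega * lam) ** k for k in range(p - 2))  # k = 0, ..., p-3
    massart = x_inf * np.sqrt(2 * np.log(2 * d) / n)               # Proposition 3.4
    return ((beta + 2 * omega * beta * lam) / np.sqrt(n)) * geometric \
        + 2 * omega * (2 * omega * lam) ** (p - 2) * massart

# The bound grows geometrically with the depth p unless 2*omega*lam <= 1.
for p in (3, 4, 5, 6):
    print(p, network_rad_bound(n=200, d=1_000, x_inf=1.0,
                               omega=0.4, beta=0.1, lam=1.0, p=p))
```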