Robustness of classifiers to uniform $\ell_p$ and Gaussian noise: Supplementary material

Robustness of classifiers to uniform $\ell_p$ and Gaussian noise: Supplementary material

Jean-Yves Franceschi (École Normale Supérieure de Lyon, LIP UMR 5668), Omar Fawzi (École Normale Supérieure de Lyon, LIP UMR 5668), Alhussein Fawzi (UCLA Vision Lab; now at DeepMind)

In these appendices, we prove the theoretical results stated in the main article.

A Preliminary Results

In this section, we explicitly compute $r_p(x)$ for a linear classifier, as described in the main article.

Lemma 1. For all $p \in [1, \infty]$, the $\ell_p$-distance from any point $x$ to the decision hyperplane $H$ defined by $f(z) = 0$ is:

if $p = \infty$: $r_\infty(x) = \dfrac{|f(x)|}{\|w\|_1}$; if $p = 1$: $r_1(x) = \dfrac{|f(x)|}{\|w\|_\infty}$; if $1 < p < \infty$: $r_p(x) = \dfrac{|f(x)|}{\|w\|_q}$.

Overall, for all $p \in [1, \infty]$, the $\ell_p$-distance from any point $x$ to the decision hyperplane $H : f(z) = 0$ is:
$$r_p(x) = \frac{|f(x)|}{\|w\|_q}.$$

Proof. We distinguish between the three cases.

Suppose $p = \infty$. The distance from $x$ to $H$ is equal to the minimum radius $\alpha$ of a ball (i.e., for $\ell_\infty$, a hypercube) centered at $x$ that intersects $H$. This intersection with minimum radius necessarily contains a vertex of the hypercube. To determine which one, it suffices to determine which vector $x + \alpha\varepsilon$, with $\varepsilon \in \{-1, 1\}^d$, first intersects $H$ when $\alpha$ increases, starting at $0$. Such an intersection arises when $w^T(x + \alpha\varepsilon) + b = 0$, so $\alpha = -\frac{f(x)}{w^T\varepsilon}$, and since $\alpha$ must be non-negative:
$$r_\infty(x) = \min_{\varepsilon :\, f(x)\, w^T\varepsilon \le 0} \left(-\frac{f(x)}{w^T\varepsilon}\right) = \frac{|f(x)|}{\|w\|_1},$$
because, for $\varepsilon \in \{-1, 1\}^d$, it suffices to choose $\varepsilon_i = -\operatorname{sign}(f(x))\operatorname{sign}(w_i)$.
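Lemma 1's closed form can be sanity-checked numerically. The following is a minimal sketch with hypothetical values of $w$, $b$, $x$ (illustrative only, not from the paper), covering the $p = \infty$ case just proved and the familiar $p = 2$ case:

```python
import math

# Hypothetical linear classifier f(z) = w^T z + b in d = 3 dimensions.
w = [3.0, -1.0, 2.0]
b = 0.5
x = [1.0, 2.0, -1.0]

f_x = sum(wi * xi for wi, xi in zip(w, x)) + b  # f(x)

def dual_norm(w, p):
    """l_q norm of w, with 1/p + 1/q = 1."""
    if p == float("inf"):
        return sum(abs(wi) for wi in w)          # q = 1
    if p == 1:
        return max(abs(wi) for wi in w)          # q = infinity
    q = p / (p - 1)
    return sum(abs(wi) ** q for wi in w) ** (1 / q)

# Lemma 1: r_p(x) = |f(x)| / ||w||_q.
r_inf = abs(f_x) / dual_norm(w, float("inf"))

# p = infinity: the hypercube vertex x + r * eps, with
# eps_i = -sign(f(x)) * sign(w_i), lies exactly on the hyperplane.
eps = [-math.copysign(1.0, f_x) * math.copysign(1.0, wi) for wi in w]
z = [xi + r_inf * ei for xi, ei in zip(x, eps)]
assert abs(sum(wi * zi for wi, zi in zip(w, z)) + b) < 1e-12

# p = 2: the Euclidean projection of x onto the hyperplane
# sits at distance |f(x)| / ||w||_2 from x.
r_2 = abs(f_x) / dual_norm(w, 2)
norm_w2 = math.sqrt(sum(wi * wi for wi in w))
proj = [xi - f_x * wi / norm_w2 ** 2 for xi, wi in zip(x, w)]
assert abs(sum(wi * zi for wi, zi in zip(w, proj)) + b) < 1e-12
dist = math.sqrt(sum((zi - xi) ** 2 for zi, xi in zip(proj, x)))
assert abs(dist - r_2) < 1e-12
```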

Suppose $p = 1$. In this case, the proof is symmetric to the one for $p = \infty$, with $\varepsilon \in \{-1, 0, 1\}^d$ having exactly one non-zero coordinate.

Suppose $1 < p < \infty$. The distance from $x$ to $H$ is equal to the minimum radius $\alpha$ of an $\ell_p$ ball $B$ centered at $x$ that intersects $H$. This ball is described by the following equation, where $z$ is the variable:
$$\sum_i |z_i - x_i|^p \le \alpha^p.$$
For such a minimum radius, the plane described by $f(z) = 0$ is tangent to $B$ at some point $x + n$. Let us assume, without loss of generality, that every coordinate of $n$ is non-negative. We also know that this hyperplane is described by the following equations, where $z$ is the variable:
$$\nabla\!\left(\sum_i |z_i - x_i|^p\right)\!\Bigg|_{x+n}^{T} \big(z - (x + n)\big) = 0 \iff \sum_i n_i^{p-1}(z_i - x_i) - \sum_i n_i^p = 0 \iff \sum_i n_i^{p-1}(z_i - x_i) = \alpha^p,$$
because $n$ belongs to the boundary of $B$. The last equation thus describes the same hyperplane as $w^T z + b = 0$. Therefore, there exists $\lambda \in \mathbb{R} \setminus \{0\}$ such that $n_i^{p-1} = \lambda w_i$ for all $i$. Then, since $x + n \in B$:
$$\sum_i |(x_i + n_i) - x_i|^p = \lambda^q \sum_i |w_i|^q = \alpha^p,$$
and since $x + n \in H$:
$$w^T(x + n) + b = f(x) + w^T n = f(x) + \lambda^{\frac{1}{p-1}} \sum_i |w_i|^q = 0,$$
we have $|\lambda|^{\frac{1}{p-1}} = \frac{|f(x)|}{\|w\|_q^q}$. Finally, using $q(p-1) = p$:
$$\alpha = \left(\lambda^q\, \|w\|_q^q\right)^{1/p} = \left(\frac{|f(x)|^p}{\|w\|_q^{pq}}\, \|w\|_q^q\right)^{1/p} = \frac{|f(x)|}{\|w\|_q}. \qquad\blacksquare$$

B Robustness of Linear Classifiers to $\ell_p$ Noise

B1 Main Theorem

Theorem 1. Let $p \in [1, \infty]$. Let $q \in [1, \infty]$ be such that $\frac{1}{p} + \frac{1}{q} = 1$. Then there exist universal constants $C, c_1, c_2 > 0$ such that, for all $\varepsilon < \frac{1}{c_2}$:
$$\zeta_1(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x) \;\le\; r_{p,\varepsilon}(x) \;\le\; \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x),$$
where $\zeta_1(\varepsilon) = C\varepsilon$ and $\zeta_2(\varepsilon) = c_1\left(1 - \sqrt{c_2\,\varepsilon}\right)^{-1/2}$.

Theorem 1 is proved by the following lemmas.

Lemma 2. There exists a universal constant $C > 0$ such that, with $\zeta_1(\varepsilon) = C\varepsilon$:
$$r_{p,\varepsilon}(x) \ge \zeta_1(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x).$$

Proof. Let us first express conveniently $P_{v\sim B_p}(g(x) \ne g(x + \alpha v))$, where $v \sim B_p$ means that $v$ is chosen uniformly at random in the unit $\ell_p$ ball $B_p$:
$$\begin{aligned}
P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) &= P_{v\sim B_p}\big(f(x)\, f(x+\alpha v) \le 0\big)\\
&= P_{v\sim B_p}\big(\operatorname{sign}(f(x))\,(w^T x + b) \le -\operatorname{sign}(f(x))\,\alpha\, w^T v\big)\\
&= P_{v\sim B_p}\left(-\operatorname{sign}(f(x))\, w^T v \ge \frac{\|w\|_q\, r_p(x)}{\alpha}\right) &(1)\\
&= P_{v\sim B_p}\left(w^T v \ge \frac{\|w\|_q\, r_p(x)}{\alpha}\right) &(2)\\
&\le P_{v\sim B_p}\left(|w^T v| \ge \frac{\|w\|_q\, r_p(x)}{\alpha}\right), &(3)
\end{aligned}$$
where Eq. (1) is given by Lemma 1, and Eqs. (2) and (3) follow from $v \sim B_p \iff -v \sim B_p$. Markov's inequality gives, from Eq. (3):
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le \frac{\alpha\, E_{v\sim B_p}\!\left[\left|\sum_{i=1}^d w_i v_i\right|\right]}{\|w\|_q\, r_p(x)}.$$
In [Barthe et al., 2005, Theorem 7], it is proved that there is a constant $C_0 > 0$ such that:
$$E_{v\sim B_p}\!\left[\left|\sum_{i=1}^d w_i v_i\right|\right] \le \frac{C_0\, \|w\|_2}{d^{1/p}}.$$
Therefore:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le \frac{C_0\, \alpha\, \|w\|_2}{d^{1/p}\, \|w\|_q\, r_p(x)}.$$
So, if $\alpha < \varepsilon\, d^{1/p}\, \frac{\|w\|_q}{C_0\, \|w\|_2}\, r_p(x)$, then $P_{v\sim B_p}(g(x) \ne g(x+\alpha v)) < \varepsilon$. Thus, there is a universal constant $C = \frac{1}{C_0} > 0$ such that:
$$r_{p,\varepsilon}(x) \ge \zeta_1(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x). \qquad\blacksquare$$

Lemma 3. There exist universal constants $c_1, c_2 > 0$ such that, for all $\varepsilon < \frac{1}{c_2}$, with $\zeta_2(\varepsilon) = c_1\left(1 - \sqrt{c_2\,\varepsilon}\right)^{-1/2}$:
$$r_{p,\varepsilon}(x) \le \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x).$$

Proof. We first transform the expression of $P_{v\sim B_p}(g(x) \ne g(x + \alpha v))$. As in the proof of Lemma 2, and up to a universal constant factor absorbed in $c_1, c_2$ below, we may work with the two-sided event:
$$P_{v\sim B_p}\left(\frac{(w^T v)^2}{\operatorname{Var}(w^T v)} > \frac{\|w\|_q^2\, r_p(x)^2}{\alpha^2\, \operatorname{Var}(w^T v)}\right).$$
Paley–Zygmund's inequality states that, if $X$ is a non-negative random variable with finite variance and $t \in [0, 1]$, then:
$$P\big(X > t\, E[X]\big) \ge (1 - t)^2\, \frac{E[X]^2}{E[X^2]}.$$
Note that $E_{v\sim B_p}\!\left[\frac{(w^T v)^2}{\operatorname{Var}(w^T v)}\right] = 1$, because $E_{v\sim B_p}[w^T v] = 0$. So, by using Paley–Zygmund's inequality with $X = \frac{(w^T v)^2}{\operatorname{Var}(w^T v)}$ and $t = \frac{\|w\|_q^2\, r_p(x)^2}{\alpha^2\, \operatorname{Var}(w^T v)}$, when $\alpha^2\, \operatorname{Var}(w^T v) \ge \|w\|_q^2\, r_p(x)^2$:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \ge \left(1 - \frac{\|w\|_q^2\, r_p(x)^2}{\alpha^2\, \operatorname{Var}(w^T v)}\right)^2 \frac{\operatorname{Var}(w^T v)^2}{E_{v\sim B_p}\big[(w^T v)^4\big]}.$$
So, if $\alpha^2 > \frac{\|w\|_q^2\, r_p(x)^2}{\operatorname{Var}(w^T v)} \left(1 - \sqrt{\frac{\varepsilon\, E_{v\sim B_p}[(w^T v)^4]}{\operatorname{Var}(w^T v)^2}}\right)^{-1}$, then $P_{v\sim B_p}(g(x) \ne g(x+\alpha v)) > \varepsilon$. According to [Barthe et al., 2005, Theorem 7], there is a universal constant $c_0 > 0$ such that:

for $\operatorname{Var}(w^T v)$: $\operatorname{Var}(w^T v) \ge \frac{c_0^2\, \|w\|_2^2}{d^{2/p}}$;

for $E\big[(w^T v)^4\big]$: $E\big[(w^T v)^4\big] \le \frac{(2C_0)^4\, \|w\|_2^4}{d^{4/p}}$.

So there are universal constants $c_1, c_2 > 0$, depending only on $c_0$ and $C_0$, such that:
$$r_{p,\varepsilon}(x) \le \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x). \qquad\blacksquare$$

B2 Alternative Lower Bound

Actually, the lower bound of Theorem 1 may be improved for most $p$-norms by the following result.

Lemma 4. There exists a universal constant $C > 0$ such that, with $\bar q = \min(q, \log d)$ and $\zeta_1'(\varepsilon) = \frac{C}{\sqrt{\bar q\, \log\frac{3}{\varepsilon}}}$:
$$r_{p,\varepsilon}(x) \ge \zeta_1'(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x).$$
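The Paley–Zygmund inequality invoked in the proof of Lemma 3 can be checked directly on a simple distribution. A deterministic sketch with $X$ uniform on $[0, 1]$ (so $E[X] = \frac12$, $E[X^2] = \frac13$, and $P(X > t\,E[X]) = 1 - \frac{t}{2}$):

```python
# Sanity check of Paley-Zygmund:
#   P(X > t E[X]) >= (1 - t)^2 E[X]^2 / E[X^2],  t in [0, 1],
# for X uniform on [0, 1].
for i in range(101):
    t = i / 100.0
    lhs = 1.0 - t / 2.0                                # P(X > t * E[X])
    rhs = (1.0 - t) ** 2 * (0.5 ** 2) / (1.0 / 3.0)    # Paley-Zygmund bound
    assert lhs >= rhs - 1e-12
```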

Proof. Let $\bar q = \min(q, \log d)$. We have, where $t = \frac{r_p(x)\, \|w\|_q}{\alpha}$:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le P_{v\sim B_p}\big(w^T v \ge t\big) \le e^{-\theta t}\, E_{v\sim B_p}\big[\exp(\theta\, w^T v)\big]$$
for any $\theta > 0$. Markov's inequality gives:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le e^{-\theta t} \sum_{k \ge 0} \frac{\theta^{2k}}{(2k)!}\, E_{v\sim B_p}\big[(w^T v)^{2k}\big],$$
since $w^T v$ is symmetric. In [Barthe et al., 2005, Theorem 7], moment bounds for $E_{v\sim B_p}\big[\big|\sum_i w_i v_i\big|^k\big]$ are proved, distinguishing the cases $k \le d$ or $k > d$, and $p \le 2$ or $p > 2$; they can be summarized, up to the universal constant $C_0$ (the same as in the proof of Lemma 2), as:
$$E_{v\sim B_p}\left[\left|\sum_i w_i v_i\right|^k\right] \le \left(\frac{C_0\, \sqrt{k\, \bar q}}{d^{1/p}}\right)^{k} \|w\|_2^k.$$
Thus:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le e^{-\theta t} \sum_{k \ge 0} \frac{1}{(2k)!} \left(\frac{C_0\, \theta\, \sqrt{2k\, \bar q}\, \|w\|_2}{d^{1/p}}\right)^{2k}.$$

We can bound the preceding power series using Stirling-like bounds [Robbins, 1955], namely $(2k)! \ge \sqrt{2\pi}\, (2k)^{2k + \frac12}\, e^{-2k}$, which give, for $x \ge 0$:
$$\sum_{k \ge 0} \frac{(2k)^k}{(2k)!}\, x^k \;\le\; \sum_{k \ge 0} \left(\frac{e^2 x}{2k}\right)^k \;\le\; \sum_{k \ge 0} \frac{1}{k!} \left(\frac{e^2 x}{2}\right)^k \;\le\; 3 \exp\!\left(\frac{e^2 x}{2}\right).$$
Therefore:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le 3\, e^{-\theta t} \exp\!\left(\frac{e^2\, C_0^2\, \theta^2\, \bar q\, \|w\|_2^2}{2\, d^{2/p}}\right).$$
By choosing $\theta = \frac{t\, d^{2/p}}{e^2\, C_0^2\, \bar q\, \|w\|_2^2}$:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le 3 \exp\!\left(-\frac{t^2\, d^{2/p}}{2\, e^2\, C_0^2\, \bar q\, \|w\|_2^2}\right) = 3 \exp\!\left(-\frac{d^{2/p}\, \|w\|_q^2\, r_p(x)^2}{2\, e^2\, C_0^2\, \bar q\, \|w\|_2^2\, \alpha^2}\right).$$
So, if $\alpha < \frac{C\, d^{1/p}\, \|w\|_q}{\sqrt{\bar q\, \ln\frac{3}{\varepsilon}}\, \|w\|_2}\, r_p(x)$, then $P_{v\sim B_p}(g(x) \ne g(x+\alpha v)) < \varepsilon$, where $C = \frac{1}{\sqrt{2}\, e\, C_0} > 0$ is a universal constant, and:
$$r_{p,\varepsilon}(x) \ge \zeta_1'(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x). \qquad\blacksquare$$

B3 Typical Value of the Multiplicative Factor

Proposition 1. For any $p \in (1, \infty]$, if $w$ is a random direction uniformly distributed over the unit $\ell_2$-sphere, then, as $d \to \infty$:
$$d^{1/p}\, \frac{\|w\|_q}{\|w\|_2} \;\underset{d\to\infty}{\sim}\; \sqrt{d}\, \left(\frac{2^{q/2}\, \Gamma\!\left(\frac{q+1}{2}\right)}{\sqrt{\pi}}\right)^{1/q} \quad \text{a.s.}$$
Moreover, for $p = 1$:
$$d\, \frac{\|w\|_\infty}{\|w\|_2} \;\underset{d\to\infty}{\sim}\; \sqrt{2\, d \ln d} \quad \text{a.s.}$$

Proof. $w$ can be written as $\frac{g}{\|g\|_2}$, where $g = (g_1, \ldots, g_d)$ are i.i.d. with normal distribution $\mathcal{N}(\mu = 0, \sigma = 1)$. The law of large numbers gives that, for $q < \infty$:
$$\frac{1}{d} \sum_{i=1}^d |g_i|^q \;\xrightarrow[d\to\infty]{\text{a.s.}}\; E\big[|g_1|^q\big] = \frac{2^{q/2}\, \Gamma\!\left(\frac{q+1}{2}\right)}{\sqrt{\pi}}.$$
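The absolute Gaussian moment $E|g|^q = \frac{2^{q/2}\,\Gamma((q+1)/2)}{\sqrt{\pi}}$ used in this law-of-large-numbers step can be checked against well-known special cases; a minimal deterministic sketch:

```python
import math

def abs_gauss_moment(q):
    """E|g|^q for g ~ N(0,1): 2^(q/2) * Gamma((q+1)/2) / sqrt(pi)."""
    return 2 ** (q / 2) * math.gamma((q + 1) / 2) / math.sqrt(math.pi)

# Known special cases of the absolute Gaussian moments.
assert abs(abs_gauss_moment(1) - math.sqrt(2 / math.pi)) < 1e-12  # E|g|  = sqrt(2/pi)
assert abs(abs_gauss_moment(2) - 1.0) < 1e-12                     # E g^2 = 1
assert abs(abs_gauss_moment(4) - 3.0) < 1e-12                     # E g^4 = 3
```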

Thus:
$$\frac{\|g\|_q}{d^{1/q}} \;\xrightarrow[d\to\infty]{\text{a.s.}}\; \left(\frac{2^{q/2}\, \Gamma\!\left(\frac{q+1}{2}\right)}{\sqrt{\pi}}\right)^{1/q},$$
and, for $p \in (1, \infty]$:
$$d^{1/p}\, \frac{\|w\|_q}{\|w\|_2} = d^{1/p}\, \frac{\|g\|_q}{\|g\|_2} \;\underset{d\to\infty}{\sim}\; \sqrt{d}\, \left(\frac{2^{q/2}\, \Gamma\!\left(\frac{q+1}{2}\right)}{\sqrt{\pi}}\right)^{1/q} \quad \text{a.s.},$$
because $\frac{\|g\|_2}{\sqrt{d}} \xrightarrow[d\to\infty]{\text{a.s.}} 1$ and $\frac{1}{p} + \frac{1}{q} = 1$. For $p = 1$, we use a result proved in [Galambos, 1987, Example 4.4], directly implying that $\frac{\|g\|_\infty}{\sqrt{2 \ln d}} \xrightarrow[d\to\infty]{\text{a.s.}} 1$. Using the previous computations for $\|g\|_2$, we find:
$$d\, \frac{\|w\|_\infty}{\|w\|_2} \;\underset{d\to\infty}{\sim}\; \sqrt{2\, d \ln d} \quad \text{a.s.} \qquad\blacksquare$$

C Robustness of Linear Classifiers to Gaussian Noise

C1 Main Theorem

Theorem 2. For $\varepsilon < \frac{1}{3}$, with $\zeta_1(\varepsilon) = \frac{1}{\sqrt{2 \ln\frac{1}{\varepsilon}}}$ and $\zeta_2(\varepsilon) = \frac{1}{\sqrt{1 - \sqrt{3\varepsilon}}}$:
$$\zeta_1(\varepsilon)\, \frac{\|w\|_2}{\|\Sigma^{1/2} w\|_2}\, r_2(x) \;\le\; r_{\Sigma,\varepsilon}(x) \;\le\; \zeta_2(\varepsilon)\, \frac{\|w\|_2}{\|\Sigma^{1/2} w\|_2}\, r_2(x).$$

Theorem 2 is proved by the following lemmas.

Lemma 5. For $\zeta_1(\varepsilon) = \frac{1}{\sqrt{2 \ln\frac{1}{\varepsilon}}}$:
$$r_{\Sigma,\varepsilon}(x) \ge \zeta_1(\varepsilon)\, \frac{\|w\|_2}{\|\Sigma^{1/2} w\|_2}\, r_2(x).$$

Proof. As in the proof of Lemma 2:
$$P_{v\sim\mathcal{N}(0,\Sigma)}\big(g(x + \alpha v) \ne g(x)\big) \le P_{v\sim\mathcal{N}(0,\Sigma)}\left(|w^T v| \ge \frac{\|w\|_2\, r_2(x)}{\alpha}\right).$$
Since $v \sim \mathcal{N}(0, \Sigma)$ follows a multivariate normal distribution with a positive definite covariance matrix $\Sigma$, if $\Sigma^{1/2}$ is the symmetric square root of $\Sigma$, then $v = \Sigma^{1/2} v'$ with $v' \sim \mathcal{N}(0, I_d)$. So:
$$P_{v\sim\mathcal{N}(0,\Sigma)}\big(g(x+\alpha v) \ne g(x)\big) \le P_{v'\sim\mathcal{N}(0,I_d)}\left(\big|(\Sigma^{1/2} w)^T v'\big| \ge \frac{\|w\|_2\, r_2(x)}{\alpha}\right).$$
If $v' \sim \mathcal{N}(0, I_d)$, then $(\Sigma^{1/2} w)^T v' \sim \mathcal{N}\big(0, \|\Sigma^{1/2} w\|_2^2\big)$. Therefore:
$$P_{v\sim\mathcal{N}(0,\Sigma)}\big(g(x+\alpha v) \ne g(x)\big) \le \exp\!\left(-\frac{\|w\|_2^2\, r_2(x)^2}{2\, \alpha^2\, \|\Sigma^{1/2} w\|_2^2}\right).$$
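The Gaussian tail bound used here, $P(|Z| \ge u) \le e^{-u^2/2}$ for $Z \sim \mathcal{N}(0, 1)$, can be verified numerically via the complementary error function, since $P(|Z| \ge u) = \operatorname{erfc}\!\left(u/\sqrt{2}\right)$; a quick deterministic check:

```python
import math

# Check P(|Z| >= u) = erfc(u / sqrt(2)) <= exp(-u^2 / 2) for Z ~ N(0,1), u >= 0.
for i in range(81):
    u = i / 10.0
    tail = math.erfc(u / math.sqrt(2.0))   # exact two-sided Gaussian tail
    assert tail <= math.exp(-u * u / 2.0) + 1e-15
```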

So, if $\alpha < \frac{\|w\|_2\, r_2(x)}{\sqrt{2 \ln\frac{1}{\varepsilon}}\, \|\Sigma^{1/2} w\|_2}$, then $P_{v\sim\mathcal{N}(0,\Sigma)}(g(x+\alpha v) \ne g(x)) < \varepsilon$. Thus:
$$r_{\Sigma,\varepsilon}(x) \ge \zeta_1(\varepsilon)\, \frac{\|w\|_2}{\|\Sigma^{1/2} w\|_2}\, r_2(x). \qquad\blacksquare$$

Lemma 6. For $\varepsilon < \frac{1}{3}$ and $\zeta_2(\varepsilon) = \frac{1}{\sqrt{1 - \sqrt{3\varepsilon}}}$:
$$r_{\Sigma,\varepsilon}(x) \le \zeta_2(\varepsilon)\, \frac{\|w\|_2}{\|\Sigma^{1/2} w\|_2}\, r_2(x).$$

Proof. As previously:
$$P_{v\sim\mathcal{N}(0,\Sigma)}\big(g(x+\alpha v) \ne g(x)\big) = P_{v'\sim\mathcal{N}(0,I_d)}\left(\frac{\big((\Sigma^{1/2} w)^T v'\big)^2}{\|\Sigma^{1/2} w\|_2^2} > \frac{\|w\|_2^2\, r_2(x)^2}{\alpha^2\, \|\Sigma^{1/2} w\|_2^2}\right).$$
Note that $E_{v'\sim\mathcal{N}(0,I_d)}\!\left[\frac{((\Sigma^{1/2} w)^T v')^2}{\|\Sigma^{1/2} w\|_2^2}\right] = 1$ and $E_{v'\sim\mathcal{N}(0,I_d)}\!\left[\frac{((\Sigma^{1/2} w)^T v')^4}{\|\Sigma^{1/2} w\|_2^4}\right] = 3$. So, by using Paley–Zygmund's inequality as in Lemma 3, when $\alpha^2\, \|\Sigma^{1/2} w\|_2^2 \ge \|w\|_2^2\, r_2(x)^2$:
$$P_{v\sim\mathcal{N}(0,\Sigma)}\big(g(x+\alpha v) \ne g(x)\big) \ge \frac{1}{3}\left(1 - \frac{\|w\|_2^2\, r_2(x)^2}{\alpha^2\, \|\Sigma^{1/2} w\|_2^2}\right)^2.$$
So, if $\alpha^2 > \frac{\|w\|_2^2\, r_2(x)^2}{\left(1 - \sqrt{3\varepsilon}\right) \|\Sigma^{1/2} w\|_2^2}$, then $P_{v\sim\mathcal{N}(0,\Sigma)}(g(x+\alpha v) \ne g(x)) > \varepsilon$. Therefore:
$$r_{\Sigma,\varepsilon}(x) \le \zeta_2(\varepsilon)\, \frac{\|w\|_2}{\|\Sigma^{1/2} w\|_2}\, r_2(x). \qquad\blacksquare$$

C2 Typical Value of the Multiplicative Factor

Proposition 2. Let $\Sigma$ be a $d \times d$ positive semidefinite matrix with $\operatorname{Tr}\Sigma = 1$. If $w$ is a random direction uniformly distributed over the unit $\ell_2$-sphere, then, for $0 < t \le \frac{\pi}{8}$ and $t' = \frac{t}{5}$:
$$P\left(\left|\sqrt{d}\, \|\Sigma^{1/2} w\|_2 - 1\right| \ge t\right) \le 2 \exp\!\left(-\frac{t'^2}{8 \operatorname{Tr}(\Sigma^2)}\right) + 2 \exp\!\left(-\frac{d\, t'^2}{8}\right) + \exp\!\left(-\frac{1}{100 \operatorname{Tr}(\Sigma^2)}\right).$$

Proof. Suppose that $w$ is a random direction uniformly distributed over the unit $\ell_2$-sphere. Then $w$ can be written as $\frac{g}{\|g\|_2}$, where $g = (g_1, \ldots, g_d)$ are i.i.d. with normal distribution $\mathcal{N}(\mu = 0, \sigma = 1)$. By using this representation in the orthogonal basis in which $\Sigma$ is diagonal, we get:
$$\|\Sigma^{1/2} w\|_2^2 = \frac{g^T \Sigma g}{\|g\|_2^2} = \frac{\sum_{i=1}^d \lambda_i g_i^2}{\sum_{i=1}^d g_i^2},$$
where $\Sigma = \operatorname{Diag}(\lambda_i)$ in the previously mentioned orthogonal basis. Let us focus on the concentration of $\sum_i \lambda_i g_i^2$. We have:
$$\sum_i \lambda_i g_i^2 - 1 = \sum_i \lambda_i (g_i^2 - 1),$$
since $\sum_i \lambda_i = \operatorname{Tr}\Sigma = 1$. One of the Bernstein-type inequalities [Bernstein, 1927] can be applied¹:
$$P\left(\left|\sum_i \lambda_i (g_i^2 - 1)\right| \ge t'\right) \le 2 \exp\!\left(-\frac{t'^2}{8 \sum_i \lambda_i^2}\right) = 2 \exp\!\left(-\frac{t'^2}{8 \operatorname{Tr}(\Sigma^2)}\right)$$
for $t' \le \beta\, \operatorname{Tr}\Sigma$, where $\beta = \frac{\pi}{8}$ is a constant, i.e., for $t' \le \beta$. $\|g\|_2^2$ has a chi-squared distribution, so, using a simple concentration inequality for the chi-squared distribution²:
$$P\left(\left|\frac{\|g\|_2^2}{d} - 1\right| \ge t'\right) \le 2 \exp\!\left(-\frac{d\, t'^2}{8}\right).$$
Overall, for $t \le \beta$ and $t' = \frac{t}{5}$ (the factor $\frac{1}{5}$ accounting for passing from deviations of $g^T \Sigma g$ and $\frac{\|g\|_2^2}{d}$ to deviations of the square root of their ratio), a union bound over the deviation events of $g^T \Sigma g$ and $\frac{\|g\|_2^2}{d}$, together with the previous inequalities, gives:
$$P\left(\left|\sqrt{d}\, \|\Sigma^{1/2} w\|_2 - 1\right| \ge t\right) \le 2 \exp\!\left(-\frac{t'^2}{8 \operatorname{Tr}(\Sigma^2)}\right) + 2 \exp\!\left(-\frac{d\, t'^2}{8}\right) + \exp\!\left(-\frac{1}{100 \operatorname{Tr}(\Sigma^2)}\right). \qquad\blacksquare$$

¹ Because $E\big[|g_i^2 - 1|^k\big] \le E\big[g_i^{2k}\big] = \frac{2^k\, \Gamma\!\left(k + \frac{1}{2}\right)}{\sqrt{\pi}} \le \frac{4^k\, k!}{\sqrt{\pi}}$ for all $k > 1$.

² Using the fact that $\|g\|_2^2$ is a sum of independent sub-exponential random variables; see https://www.stat.berkeley.edu/~mjwain/stat210b/Chap2_TailBounds_Jan22_2015.pdf (Example 2.5), for instance.
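Proposition 2's concentration claim can be illustrated by simulation. The following is a sketch with a hypothetical spectrum $\lambda_i \propto 1/i$ (illustrative, not from the paper), checking that $d\,\|\Sigma^{1/2} w\|_2^2$ stays close to $\operatorname{Tr}\Sigma = 1$ on average for $w$ uniform on the unit sphere:

```python
import math
import random

random.seed(1)
d, n = 500, 200

# Hypothetical spectrum: lambda_i proportional to 1/i, normalized so Tr(Sigma) = 1.
lam = [1.0 / (i + 1) for i in range(d)]
s = sum(lam)
lam = [l / s for l in lam]

samples = []
for _ in range(n):
    g = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm2 = sum(x * x for x in g)
    # (sqrt(d) * ||Sigma^(1/2) w||_2)^2 for w = g / ||g||_2, uniform on the sphere.
    samples.append(d * sum(l * x * x for l, x in zip(lam, g)) / norm2)

mean = sum(samples) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / n)

# The statistic concentrates around Tr(Sigma) = 1 with modest dispersion.
assert abs(mean - 1.0) < 0.1
assert sd < 0.5
```

Tightening the spectrum (e.g., $\lambda_i = 1/d$) makes $\operatorname{Tr}(\Sigma^2)$ smaller and the concentration in the simulation visibly sharper, in line with the bound.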

D Robustness of LAF Classifiers to $\ell_p$ and Gaussian Noise

Theorem 3. Let $p \in [1, \infty]$. Let $q \in [1, \infty]$ be such that $\frac{1}{p} + \frac{1}{q} = 1$. Let $\varepsilon_0$, $\zeta_1(\varepsilon)$, $\zeta_2(\varepsilon)$ be as in Theorem 1. Then, for all $\varepsilon < \varepsilon_0$, the following holds. Assume $f$ is a classifier that is $(\gamma, \eta)$-LAF at point $x$, and let $x^\star$ be such that $r_p(x) = \|x^\star - x\|_p$. Then:
$$(1 - \gamma)\, \zeta_1(\varepsilon)\, d^{1/p}\, \frac{\|\nabla f(x^\star)\|_q}{\|\nabla f(x^\star)\|_2}\, r_p(x) \;\le\; r_{p,\varepsilon}(x),$$
and, provided $(1 + \gamma)\, \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|\nabla f(x^\star)\|_q}{\|\nabla f(x^\star)\|_2}\, r_p(x) \le \eta_{\lim}$:
$$r_{p,\varepsilon}(x) \;\le\; (1 + \gamma)\, \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|\nabla f(x^\star)\|_q}{\|\nabla f(x^\star)\|_2}\, r_p(x).$$

Proof. Let $f^-$ and $f^+$ be linear functions such that the separating hyperplanes of, respectively, $H_\gamma^-(x, x^\star)$ and $H_\gamma^+(x, x^\star)$ are described by the equations, respectively, $f^-(z) = 0$ and $f^+(z) = 0$. By definition, we know that $r_p^{f^-}(x) = (1 - \gamma)\, r_p(x)$ and $r_p^{f^+}(x) = (1 + \gamma)\, r_p(x)$.

From the definition of LAF classifiers, since, for all $\eta \le \frac{\gamma}{1+\gamma}\, \eta_{\lim}$, $z \in H_\gamma^-(x, x^\star) \cap B_p(x, \eta) \Rightarrow f(z)\, f(x) > 0$, we have $r_{p,\varepsilon}^{f^-}(x) \le r_{p,\varepsilon}(x)$; indeed, if $x + \alpha v$, with $\alpha \le \frac{\gamma}{1+\gamma}\, \eta_{\lim}$, is not misclassified by $f^-$, then it is not misclassified by $f$. Therefore, by applying Lemma 2 to $f^-$, we get:
$$(1 - \gamma)\, \zeta_1(\varepsilon)\, d^{1/p}\, \frac{\|\nabla f(x^\star)\|_q}{\|\nabla f(x^\star)\|_2}\, r_p(x) \le r_{p,\varepsilon}(x).$$
Since, as long as $\eta \le \eta_{\lim}$, $z \in H_\gamma^+(x, x^\star) \cap B_p(x, \eta) \Rightarrow f(z)\, f(x) < 0$, we can apply a symmetric reasoning for $f^+$ and get:
$$r_{p,\varepsilon}(x) \le (1 + \gamma)\, \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|\nabla f(x^\star)\|_q}{\|\nabla f(x^\star)\|_2}\, r_p(x). \qquad\blacksquare$$

Theorem 4. Let $\Sigma$ be a $d \times d$ positive semidefinite matrix with $\operatorname{Tr}\Sigma = 1$. Let $\varepsilon_0$, $\zeta_1(\varepsilon)$, $\zeta_2(\varepsilon)$ be as in Theorem 2. Then, for all $\varepsilon < \varepsilon_0$, the following holds. Assume $f$ is a classifier that is $(\gamma, \eta)$-LAF at point $x$, and let $x^\star$ be such that $r_2(x) = \|x^\star - x\|_2$. Then:
$$(1 - \gamma)\, \zeta_1(\varepsilon)\, \frac{\|\nabla f(x^\star)\|_2}{\|\Sigma^{1/2} \nabla f(x^\star)\|_2}\, r_2(x) \;\le\; r_{\Sigma,\varepsilon}(x),$$
and, provided $\left(1 + \sqrt{8 \operatorname{Tr}\Sigma\, \ln\frac{4}{\varepsilon}}\right) (1 + \gamma)\, \zeta_2\!\left(\frac{3\varepsilon}{4}\right) \frac{\|\nabla f(x^\star)\|_2}{\|\Sigma^{1/2} \nabla f(x^\star)\|_2}\, r_2(x) \le \eta_{\lim}$:
$$r_{\Sigma,\varepsilon}(x) \;\le\; (1 + \gamma)\, \zeta_2\!\left(\frac{3\varepsilon}{4}\right) \frac{\|\nabla f(x^\star)\|_2}{\|\Sigma^{1/2} \nabla f(x^\star)\|_2}\, r_2(x).$$

Proof. This proof can be directly adapted from the proof of Theorem 3. The difference in the Gaussian case is that $v$ is no longer sampled from the unit ball, so its norm is not bounded anymore. However, its norm can be bounded with high probability, and this enables us to adapt the bounds of Theorem 3 to the Gaussian case. Indeed, using a Bernstein inequality as in the proof of Proposition 2, we have:
$$P\big(\|\Sigma^{1/2} v\|_2 \ge t\big) \le \exp\!\left(-\frac{t^2}{8 \operatorname{Tr}\Sigma}\right) \le \frac{\varepsilon}{4} \quad \text{for } t \ge \psi(\varepsilon) = \sqrt{8 \operatorname{Tr}\Sigma\, \ln\frac{4}{\varepsilon}}.$$

Let us focus on the upper bound for this proof; the lower bound follows by a similar reasoning. From the definition of LAF classifiers, since, for all $\eta \le \eta_{\lim}$, $z \in H_\gamma^+(x, x^\star) \cap B_2(x, \eta) \Rightarrow f(z)\, f(x) < 0$, we have $r_{\Sigma,\varepsilon}(x) \le r^{f^+}_{\Sigma, 3\varepsilon/4}(x)$; indeed, if $x + \alpha v$, with $\alpha \le \frac{\eta_{\lim}}{1 + \psi(\varepsilon)}$, is misclassified by $f^+$, then it is misclassified by $f$ if $\alpha \|v\|_2 \le \eta_{\lim}$. Therefore, by applying Lemma 6 to $f^+$:
$$r_{\Sigma,\varepsilon}(x) \le (1 + \gamma)\, \zeta_2\!\left(\frac{3\varepsilon}{4}\right) \frac{\|\nabla f(x^\star)\|_2}{\|\Sigma^{1/2} \nabla f(x^\star)\|_2}\, r_2(x). \qquad\blacksquare$$

E Generalization to Multi-class Classifiers

We present in this section a generalization of Theorem 1 to multi-class linear classifiers, and discuss the generalization of the other results to the multi-class case. A classifier $f$ is said to be linear if, for all $k \in \{1, \ldots, L\}$, there are vectors $w_k$ and scalars $b_k$ such that $f_k(x) = w_k^T x + b_k$. In this setting, Theorem 1 can be generalized by replacing $\zeta_1(\varepsilon)$ by $\zeta_1\!\left(\frac{\varepsilon}{L-1}\right)$ in the lower bound.

Theorem 5. Let $p \in [1, \infty]$. Let $q \in [1, \infty]$ be such that $\frac{1}{p} + \frac{1}{q} = 1$. Let $\varepsilon_0$, $\zeta_1(\varepsilon)$, $\zeta_2(\varepsilon)$ be the constants as defined in Theorem 1. Let $k = g(x)$ be the label attributed to $x$ by $f$, let $j$ be a class such that $x + r^\star$, with $\|r^\star\|_p = r_p(x)$, lies on the decision boundary between classes $k$ and $j$ (i.e., $j$ is the class of the adversarial perturbation of $x$), and let:
$$\hat{j} = \operatorname*{argmin}_{l \ne k}\, \frac{\|w_k - w_l\|_q}{\|w_k - w_l\|_2}.$$
Then, for all $\varepsilon < \varepsilon_0$:
$$\zeta_1\!\left(\frac{\varepsilon}{L-1}\right) d^{1/p}\, \frac{\|w_k - w_{\hat{j}}\|_q}{\|w_k - w_{\hat{j}}\|_2}\, r_p(x) \;\le\; r_{p,\varepsilon}(x) \;\le\; \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|w_k - w_j\|_q}{\|w_k - w_j\|_2}\, r_p(x).$$

Proof. We first define, for the sake of the demonstration, for any class $l$, the adversarial perturbation in the binary case where only classes $k$ and $l$ are considered:
$$r^\star(x \mid l) = \operatorname*{argmin}_{r}\, \|r\|_p \quad \text{s.t.} \quad f_k(x + r) \le f_l(x + r),$$
and write $r_p(x \mid l) = \|r^\star(x \mid l)\|_p$. It is then possible to express conveniently $P_{v\sim B_p}(g(x) \ne g(x + \alpha v))$:
$$\begin{aligned}
P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) &= P_{v\sim B_p}\big(\exists l \ne k,\ f_k(x + \alpha v) < f_l(x + \alpha v)\big)\\
&= P_{v\sim B_p}\big(\exists l \ne k,\ \alpha\, (w_l - w_k)^T v > f_k(x) - f_l(x)\big)\\
&= P_{v\sim B_p}\big(\exists l \ne k,\ \alpha\, (w_l - w_k)^T v > r_p(x \mid l)\, \|w_l - w_k\|_q\big).
\end{aligned}$$
Let us first prove the inequality on the upper bound, as in Lemma 3:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \ge P_{v\sim B_p}\big(\alpha\, (w_j - w_k)^T v > r_p(x \mid j)\, \|w_j - w_k\|_q\big) = P_{v\sim B_p}\big(\alpha\, (w_j - w_k)^T v > r_p(x)\, \|w_j - w_k\|_q\big),$$
by definition of $j$. Then, using the same reasoning as in Lemma 3 leads to:
$$r_{p,\varepsilon}(x) \le \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|w_k - w_j\|_q}{\|w_k - w_j\|_2}\, r_p(x).$$
Let us then prove the inequality on the lower bound, as in Lemma 2. We use the union bound to derive the inequality:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le \sum_{l \ne k} P_{v\sim B_p}\big(\alpha\, (w_l - w_k)^T v > r_p(x \mid l)\, \|w_l - w_k\|_q\big) \le \sum_{l \ne k} P_{v\sim B_p}\big(\alpha\, (w_l - w_k)^T v > r_p(x)\, \|w_l - w_k\|_q\big),$$
because $r_p(x) \le r_p(x \mid l)$ for all $l$. Moreover, for $\alpha < \zeta_1\!\left(\frac{\varepsilon}{L-1}\right) d^{1/p}\, \frac{\|w_k - w_{\hat{j}}\|_q}{\|w_k - w_{\hat{j}}\|_2}\, r_p(x)$, the reasoning of Lemma 2 gives, for each $l \ne k$:
$$P_{v\sim B_p}\big(\alpha\, (w_l - w_k)^T v > r_p(x)\, \|w_l - w_k\|_q\big) \le \frac{\varepsilon}{L-1},$$
by definition of $\hat{j}$, so that $P_{v\sim B_p}(g(x) \ne g(x+\alpha v)) \le (L-1) \cdot \frac{\varepsilon}{L-1} = \varepsilon$. Therefore:
$$r_{p,\varepsilon}(x) \ge \zeta_1\!\left(\frac{\varepsilon}{L-1}\right) d^{1/p}\, \frac{\|w_k - w_{\hat{j}}\|_q}{\|w_k - w_{\hat{j}}\|_2}\, r_p(x). \qquad\blacksquare$$

The proof of this theorem uses the union bound to obtain the lower bound, explaining why $\zeta_1(\varepsilon)$ in the binary case becomes $\zeta_1\!\left(\frac{\varepsilon}{L-1}\right)$ in the multi-class setting. However, this inequality represents a worst case in the majoration used in the proof, and we observed in our experiments that using the coefficient $\zeta_1(\varepsilon)$ instead of $\zeta_1\!\left(\frac{\varepsilon}{L-1}\right)$ gives a proper lower bound on $\frac{r_{p,\varepsilon}(x)}{r_p(x)}$.

Notice that it is possible to generalize the other results that we proved in the binary case (Lemma 4, Theorems 2 and 3) to the multi-class problem, with a similar transformation of the inequalities, replacing $\zeta_1(\varepsilon)$ by $\zeta_1\!\left(\frac{\varepsilon}{L-1}\right)$, and using similar definitions of $j$ and $\hat{j}$.

References

[Barthe et al., 2005] Barthe, F., Guédon, O., Mendelson, S., and Naor, A. (2005). A probabilistic approach to the geometry of the $\ell_p^n$-ball. The Annals of Probability, 33(2):480–513.

[Bernstein, 1927] Bernstein, S. (1927). Theory of Probability.

[Galambos, 1987] Galambos, J. (1987). The Asymptotic Theory of Extreme Order Statistics. R. E. Krieger, second edition.

[Robbins, 1955] Robbins, H. (1955). A remark on Stirling's formula. The American Mathematical Monthly, 62:26–29.