Robustness of classifiers to uniform $\ell_p$ and Gaussian noise — Supplementary material

Jean-Yves Franceschi, École Normale Supérieure de Lyon, LIP UMR 5668
Omar Fawzi, École Normale Supérieure de Lyon, LIP UMR 5668
Alhussein Fawzi, UCLA Vision Lab (now at DeepMind)

In these appendices, we prove the theoretical results stated in the main article.

A Preliminary Results

In this section, we explicitly compute $r_p(x)$ for a linear classifier, as described in the main article.

Lemma 1. For all $p \in [1, \infty]$, the $\ell_p$-distance from any point $x$ to the decision hyperplane $H$ defined by $f(z) = 0$, with $f(z) = w^\top z + b$, is:
- if $p = \infty$: $r_\infty(x) = \dfrac{|f(x)|}{\|w\|_1}$;
- if $p = 1$: $r_1(x) = \dfrac{|f(x)|}{\|w\|_\infty}$;
- if $p \in (1, \infty)$: $r_p(x) = \dfrac{|f(x)|}{\|w\|_q}$, where $\frac1p + \frac1q = 1$.

Overall, for all $p \in [1, \infty]$, the $\ell_p$-distance from any point $x$ to the decision hyperplane $H : f(z) = 0$ is:
$$r_p(x) = \frac{|f(x)|}{\|w\|_q}.$$

Proof. We distinguish between the three cases.

Suppose $p = \infty$. The distance from $x$ to $H$ is equal to the minimum radius $\alpha$ of a ball (i.e., for $\ell_\infty$, a hypercube) centered at $x$ that intersects $H$. This intersection with minimum radius necessarily contains a vertex of the hypercube. To determine which one, it suffices to determine which vector $x + \alpha\varepsilon$, with $\varepsilon \in \{-1, 1\}^d$, first intersects $H$ when $\alpha$ increases starting at $0$. Such an intersection arises when $w^\top(x + \alpha\varepsilon) + b = 0$, so $\alpha = -f(x)/(w^\top\varepsilon)$, and since $\alpha$ must be non-negative:
$$r_\infty(x) = \min_{\varepsilon :\, f(x)\, w^\top\varepsilon < 0} \frac{-f(x)}{w^\top\varepsilon} = \frac{|f(x)|}{\max_\varepsilon |w^\top\varepsilon|} = \frac{|f(x)|}{\|w\|_1},$$
because, over $\varepsilon \in \{-1,1\}^d$, it suffices to simply choose $\varepsilon_i = -\operatorname{sign}(f(x))\operatorname{sign}(w_i)$.
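The closed forms claimed in Lemma 1 can be checked numerically for the case just treated ($p = \infty$) as well as for $p \in \{1, 2\}$. The sketch below is illustrative only (a random hyperplane and point, not taken from the paper): it builds the minimizing perturbation described in the proof for each norm and verifies that it reaches the hyperplane at distance $|f(x)|/\|w\|_q$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w = rng.normal(size=d)   # arbitrary hyperplane normal (illustrative)
b = 0.7
x = rng.normal(size=d)   # arbitrary point
f = lambda z: w @ z + b

# p = 2 (q = 2): the minimizer is the orthogonal projection onto H.
r2 = abs(f(x)) / np.linalg.norm(w, 2)
x_proj = x - f(x) * w / np.linalg.norm(w, 2) ** 2
assert abs(f(x_proj)) < 1e-9                               # lands on H
assert abs(np.linalg.norm(x_proj - x, 2) - r2) < 1e-9      # at distance |f(x)|/||w||_2

# p = inf (q = 1): move every coordinate by the same amount,
# in direction eps_i = -sign(f(x)) * sign(w_i), as in the proof.
r_inf = abs(f(x)) / np.linalg.norm(w, 1)
eps = -np.sign(f(x)) * np.sign(w)
x_vertex = x + r_inf * eps
assert abs(f(x_vertex)) < 1e-9
assert abs(np.linalg.norm(x_vertex - x, np.inf) - r_inf) < 1e-9

# p = 1 (q = inf): move a single coordinate, the one with largest |w_i|.
r1 = abs(f(x)) / np.linalg.norm(w, np.inf)
i = int(np.argmax(np.abs(w)))
x_axis = x.copy()
x_axis[i] -= np.sign(f(x)) * np.sign(w[i]) * r1
assert abs(f(x_axis)) < 1e-9
assert abs(np.linalg.norm(x_axis - x, 1) - r1) < 1e-9
```

Each perturbation is feasible (it reaches $H$) and attains the claimed distance, which is also a lower bound by the duality $|w^\top v| \le \|w\|_q \|v\|_p$.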
Suppose $p = 1$. In this case, the proof is symmetric to the one for $p = \infty$, with $\varepsilon \in \{-1, 0, 1\}^d$ having exactly one non-zero coordinate; this yields $r_1(x) = |f(x)|/\|w\|_\infty$.

Suppose $p \in (1, \infty)$. The distance from $x$ to $H$ is equal to the minimum radius $\alpha$ of an $\ell_p$ ball $B$ centered at $x$ that intersects $H$. This ball is described by the following equation, where $z$ is the variable:
$$\sum_i |z_i - x_i|^p \le \alpha^p.$$
For such a minimum radius, the plane described by $f(z) = 0$ is tangent to $B$ at some point $x + n$. Let us assume, without loss of generality, that every coordinate of $n$ is non-negative. We also know that this hyperplane is described by the following equations, where $z$ is the variable:
$$\nabla\Big(z \mapsto \sum_i (z_i - x_i)^p\Big)\Big|_{x+n}^{\top} \big(z - (x+n)\big) = 0 \;\iff\; \sum_i n_i^{p-1}\,(z_i - x_i - n_i) = 0 \;\iff\; \sum_i n_i^{p-1}(z_i - x_i) = \alpha^p,$$
because $n$ belongs to the boundary of $B$, i.e., $\sum_i n_i^p = \alpha^p$. The last equation thus describes the same hyperplane as $w^\top z + b = 0$. Therefore, there exists $\lambda \in \mathbb{R} \setminus \{0\}$ such that, for all $i$, $n_i^{p-1} = \lambda w_i$. Then, since $x + n \in \partial B$:
$$\alpha^p = \sum_i n_i^p = \sum_i (\lambda w_i)^{\frac{p}{p-1}} = |\lambda|^q\, \|w\|_q^q,$$
and since $x + n \in H$:
$$f(x) + w^\top n = 0, \qquad \text{i.e.,} \qquad \sum_i w_i n_i = \frac{1}{\lambda} \sum_i n_i^p = \frac{\alpha^p}{\lambda} = -f(x),$$
we have $\lambda = -\alpha^p / f(x)$. Finally:
$$\alpha^p = |\lambda|^q\, \|w\|_q^q = \frac{\alpha^{pq}}{|f(x)|^q}\, \|w\|_q^q \implies \alpha^{-q} = \frac{\|w\|_q^q}{|f(x)|^q} \implies \alpha = \frac{|f(x)|}{\|w\|_q},$$
using $p(q - 1) = q$. $\square$

B Robustness of Linear Classifiers to $\ell_p$ Noise

B.1 Main Theorem

Theorem 1. Let $p \in [1, \infty]$. Let $q \in [1, \infty]$ be such that $\frac1p + \frac1q = 1$. Then there exist universal constants $C, c_1, c_2 > 0$ such that, for all $\varepsilon < (c_1/c_2)^2$:
$$\zeta_1(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x) \;\le\; r_{p,\varepsilon}(x) \;\le\; \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x),$$
where $\zeta_1(\varepsilon) = C\varepsilon$ and $\zeta_2(\varepsilon) = \big(c_1 - c_2\sqrt{\varepsilon}\big)^{-1/2}$.

Theorem 1 is proved by the following lemmas.
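The scaling predicted by Theorem 1 — a linear classifier tolerates uniform $\ell_p$ noise of magnitude roughly $d^{1/p}\,\|w\|_q/\|w\|_2$ times the adversarial distance $r_p(x)$ — can be illustrated by Monte Carlo simulation. The sketch below is illustrative only (arbitrary $d$, $p$, thresholds and seed, not from the paper); the sampler uses the probabilistic representation of the uniform distribution on the $\ell_p$ ball from Barthe et al. (2005).

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_lp_ball(n, d, p, rng):
    # Barthe et al. (2005): if the g_i have density proportional to exp(-|t|^p)
    # and z ~ Exp(1), then g / (||g||_p^p + z)^(1/p) is uniform in the unit l_p ball.
    g = rng.gamma(1.0 / p, size=(n, d)) ** (1.0 / p)
    g *= rng.choice([-1.0, 1.0], size=(n, d))
    z = rng.exponential(size=(n, 1))
    return g / (np.sum(np.abs(g) ** p, axis=1, keepdims=True) + z) ** (1.0 / p)

d, p = 400, 1.5
q = p / (p - 1.0)                       # conjugate exponent
w, b = rng.normal(size=d), 0.05         # arbitrary linear classifier
fx = b                                  # value of f at x = 0
r_p = abs(fx) / np.linalg.norm(w, q)    # adversarial l_p distance (Lemma 1)
factor = d ** (1.0 / p) * np.linalg.norm(w, q) / np.linalg.norm(w, 2)

def misclass_prob(alpha, n=5000):
    v = uniform_lp_ball(n, d, p, rng)
    return float(np.mean(fx * (fx + alpha * (v @ w)) <= 0))

# Well below the critical scale, random noise almost never flips the decision;
# well above it, the decision flips with appreciable probability.
p_small = misclass_prob(0.05 * factor * r_p)
p_big = misclass_prob(20.0 * factor * r_p)
assert p_small < 0.05
assert p_big > 0.2
```

The contrast between `p_small` and `p_big` at radii that differ only through the factor $d^{1/p}\|w\|_q/\|w\|_2$ is exactly the gap between random and adversarial robustness quantified by the theorem.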
Lemma 2. There exists a universal constant $C > 0$ such that
$$r_{p,\varepsilon}(x) \ge \zeta_1(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x), \qquad \text{where } \zeta_1(\varepsilon) = C\varepsilon.$$

Proof. Let us first express conveniently $P_{v\sim B_p}\big(g(x) \ne g(x + \alpha v)\big)$, where $v \sim B_p$ means that $v$ is chosen uniformly at random in the unit $\ell_p$ ball $B_p$:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) = P_{v\sim B_p}\big(\operatorname{sign}(f(x))\,(f(x) + \alpha\, w^\top v) \le 0\big) \tag{1}$$
$$\le P_{v\sim B_p}\big(\alpha\, |w^\top v| \ge |f(x)|\big) \tag{2}$$
$$= P_{v\sim B_p}\left(|w^\top v| \ge \frac{\|w\|_q\, r_p(x)}{\alpha}\right), \tag{3}$$
where Eq. (3) is given by Lemma 1, and Eqs. (2) and (3) use that $v \in B_p \iff -v \in B_p$. Markov's inequality gives, from Eq. (3):
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le \frac{\alpha\; E_{v\sim B_p}\Big[\big|\sum_{i=1}^d w_i v_i\big|\Big]}{\|w\|_q\, r_p(x)}.$$
In [Barthe et al., 2005, Theorem 7], it is proved that there is a constant $C_0 > 0$ such that:
$$E_{v\sim B_p}\Big[\Big|\sum_i w_i v_i\Big|\Big] \le \frac{C_0\, \|w\|_2}{d^{1/p}}.$$
Therefore:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le \frac{C_0\, \alpha\, \|w\|_2}{d^{1/p}\, \|w\|_q\, r_p(x)}.$$
So if $\alpha < \dfrac{\varepsilon\, d^{1/p}\, \|w\|_q\, r_p(x)}{C_0\, \|w\|_2}$, then $P_{v\sim B_p}(g(x) \ne g(x + \alpha v)) < \varepsilon$. Thus there is a universal constant $C = 1/C_0 > 0$ such that:
$$r_{p,\varepsilon}(x) \ge \zeta_1(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x). \qquad \square$$

Lemma 3. There exist universal constants $c_1, c_2 > 0$ such that, for all $\varepsilon < (c_1/c_2)^2$:
$$r_{p,\varepsilon}(x) \le \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x), \qquad \text{where } \zeta_2(\varepsilon) = \big(c_1 - c_2\sqrt{\varepsilon}\big)^{-1/2}.$$
Proof. We first transform the expression of $P_{v\sim B_p}(g(x) \ne g(x + \alpha v))$; by the symmetry of $w^\top v$:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) = \frac12\, P_{v\sim B_p}\left(|w^\top v| \ge \frac{\|w\|_q\, r_p(x)}{\alpha}\right) = \frac12\, P_{v\sim B_p}\left(\frac{(w^\top v)^2}{\operatorname{Var}(w^\top v)} \ge \frac{\|w\|_q^2\, r_p(x)^2}{\alpha^2\, \operatorname{Var}(w^\top v)}\right).$$
Paley–Zygmund's inequality states that, if $X \ge 0$ is a random variable with finite variance and $t \in [0, 1]$, then:
$$P\big(X > t\, E[X]\big) \ge (1 - t)^2\, \frac{E[X]^2}{E[X^2]}.$$
Note that $E_{v\sim B_p}\left[\frac{(w^\top v)^2}{\operatorname{Var}(w^\top v)}\right] = 1$, because $E_{v\sim B_p}[w^\top v] = 0$. So, by using Paley–Zygmund's inequality with $X = \frac{(w^\top v)^2}{\operatorname{Var}(w^\top v)}$ and $t = \frac{\|w\|_q^2\, r_p(x)^2}{\alpha^2\, \operatorname{Var}(w^\top v)}$, when $\alpha^2\, \operatorname{Var}(w^\top v) \ge \|w\|_q^2\, r_p(x)^2$:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \ge \frac12 \left(1 - \frac{\|w\|_q^2\, r_p(x)^2}{\alpha^2\, \operatorname{Var}(w^\top v)}\right)^2 \frac{\operatorname{Var}(w^\top v)^2}{E_{v\sim B_p}\big[(w^\top v)^4\big]}.$$
So if $\alpha^2 > \dfrac{\|w\|_q^2\, r_p(x)^2}{\operatorname{Var}(w^\top v)\left(1 - \sqrt{2\varepsilon\, E_{v\sim B_p}[(w^\top v)^4]}\big/\operatorname{Var}(w^\top v)\right)}$, then $P_{v\sim B_p}(g(x) \ne g(x + \alpha v)) > \varepsilon$. According to [Barthe et al., 2005, Theorem 7], there is a universal constant $c_0 > 0$ such that:
- for $\operatorname{Var}(w^\top v)$: $\quad \operatorname{Var}(w^\top v) \ge \dfrac{c_0\, \|w\|_2^2}{d^{2/p}}$;
- for $E\big[(w^\top v)^4\big]$: $\quad E\big[(w^\top v)^4\big] \le \dfrac{16\, C_0^4\, \|w\|_2^4}{d^{4/p}}$ (the case $k = 4$ of the moment bound recalled in the proof of Lemma 4).

So there are universal constants $c_1, c_2 > 0$, depending only on $c_0$ and $C_0$, such that, for all $\varepsilon < (c_1/c_2)^2$:
$$r_{p,\varepsilon}(x) \le \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x). \qquad \square$$

B.2 Alternative Lower Bound

Actually, the lower bound of Theorem 1 may be improved for most $p$-norms by the following result.

Lemma 4. There exists a universal constant $C > 0$ such that
$$r_{p,\varepsilon}(x) \ge \zeta_3(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x), \qquad \text{where } \zeta_3(\varepsilon) = C\, \min\!\left(1,\, \frac{1}{\sqrt{\ln(3/\varepsilon)}}\right).$$
Proof. We have, where $t = \dfrac{\|w\|_q\, r_p(x)}{\alpha}$:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le P_{v\sim B_p}\left(\exp\big(\theta\, \operatorname{sign}(f(x))\, w^\top v\big) \ge e^{\theta t}\right) \quad \text{for any } \theta > 0.$$
Markov's inequality gives:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le e^{-\theta t}\, E_{v\sim B_p}\Big[\exp\big(\theta\, \operatorname{sign}(f(x))\, w^\top v\big)\Big] = e^{-\theta t} \sum_{k \ge 0} \frac{\theta^k}{k!}\, E_{v\sim B_p}\big[(w^\top v)^k\big],$$
since $w^\top v$ is symmetric (the odd moments vanish, and the sign of $f(x)$ is irrelevant). In [Barthe et al., 2005, Theorem 7], it is proved that, with the same universal constant $C_0$ as in the proof of Lemma 2, and distinguishing the four cases ($k \le d$ or $k > d$, $p \le 2$ or $p > 2$), the moment bounds are in every case dominated as follows:
$$E_{v\sim B_p}\Big[\Big|\sum_i w_i v_i\Big|^k\Big] \le \left(\frac{C_0\, \sqrt{k}}{d^{1/p}}\right)^{k} \|w\|_2^k.$$
Thus:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le e^{-\theta t} \sum_{k\ge 0} \frac{k^{k/2}}{k!} \left(\frac{\theta\, C_0\, \|w\|_2}{d^{1/p}}\right)^k.$$
We can bound the above power series using Stirling-like bounds [Robbins, 1955],
$$\sqrt{2\pi k}\, \left(\frac{k}{e}\right)^k \le k! \le e\sqrt{k}\, \left(\frac{k}{e}\right)^k, \tag{4}$$
which give, for all $x \ge 0$:
$$\sum_{k\ge 0} \frac{k^{k/2}}{k!}\, x^k \le 1 + \sum_{k\ge 1} \frac{1}{\sqrt{2\pi k}} \left(\frac{e\, x}{\sqrt{k}}\right)^k \le 3\, \exp\big(e^2 x^2\big). \tag{5}$$
Therefore:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le 3\, e^{-\theta t}\, \exp\left(\frac{e^2\, C_0^2\, \theta^2\, \|w\|_2^2}{d^{2/p}}\right).$$
By choosing $\theta = \dfrac{t\, d^{2/p}}{2\, e^2\, C_0^2\, \|w\|_2^2}$:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le 3\, \exp\left(-\frac{t^2\, d^{2/p}}{4\, e^2\, C_0^2\, \|w\|_2^2}\right) = 3\, \exp\left(-\left(\frac{d^{1/p}\, \|w\|_q\, r_p(x)}{2\, e\, C_0\, \|w\|_2\, \alpha}\right)^2\right).$$
So if $\alpha < C\, \dfrac{d^{1/p}\, \|w\|_q\, r_p(x)}{\sqrt{\ln(3/\varepsilon)}\, \|w\|_2}$, then $P_{v\sim B_p}(g(x) \ne g(x+\alpha v)) < \varepsilon$, where $C = \frac{1}{2 e C_0} > 0$ is a universal constant, and:
$$r_{p,\varepsilon}(x) \ge \zeta_3(\varepsilon)\, d^{1/p}\, \frac{\|w\|_q}{\|w\|_2}\, r_p(x). \qquad \square$$

B.3 Typical Value of the Multiplicative Factor

Proposition 1. For any $q \in [1, \infty)$, if $w$ is a random direction uniformly distributed over the unit $\ell_2$-sphere, then, as $d \to \infty$:
$$d^{1/p}\, \frac{\|w\|_q}{\|w\|_2} \;\underset{d\to\infty}{\sim}\; \sqrt{d}\, \left(\frac{2^{q/2}\, \Gamma\!\left(\frac{q+1}{2}\right)}{\sqrt{\pi}}\right)^{1/q} \quad \text{a.s.}$$
Moreover, for $q = \infty$ (i.e., $p = 1$):
$$d^{1/p}\, \frac{\|w\|_\infty}{\|w\|_2} \;\underset{d\to\infty}{\sim}\; \sqrt{2\, d \ln d} \quad \text{a.s.}$$

Proof. $w$ can be written as $\frac{g}{\|g\|_2}$, where $g = (g_1, \ldots, g_d)$ are i.i.d. with normal distribution $\mathcal{N}(\mu = 0, \sigma^2 = 1)$. The law of large numbers gives that, for $q < \infty$:
$$\frac{1}{d} \sum_{i=1}^d |g_i|^q \;\xrightarrow[d\to\infty]{\text{a.s.}}\; E\big[|g_1|^q\big] = \frac{2^{q/2}\, \Gamma\!\left(\frac{q+1}{2}\right)}{\sqrt{\pi}}. \tag{6}$$
Thus:
$$\|g\|_q \;\underset{d\to\infty}{\sim}\; d^{1/q} \left(\frac{2^{q/2}\, \Gamma\!\left(\frac{q+1}{2}\right)}{\sqrt{\pi}}\right)^{1/q} \quad \text{a.s.},$$
and, for $q \in [1, \infty)$:
$$d^{1/p}\, \frac{\|w\|_q}{\|w\|_2} = d^{1/p}\, \frac{\|g\|_q}{\|g\|_2} \;\underset{d\to\infty}{\sim}\; \sqrt{d} \left(\frac{2^{q/2}\, \Gamma\!\left(\frac{q+1}{2}\right)}{\sqrt{\pi}}\right)^{1/q} \quad \text{a.s.},$$
because $\|g\|_2 \sim \sqrt{d}$ a.s. For $q = \infty$, we use a result proved in [Galambos, 1987, Example 4.4], directly implying that
$$\frac{\|g\|_\infty}{\sqrt{2 \ln d}} \;\xrightarrow[d\to\infty]{\text{a.s.}}\; 1.$$
Using the previous computations, for $p = 1$ we find:
$$d\, \frac{\|w\|_\infty}{\|w\|_2} \;\underset{d\to\infty}{\sim}\; \sqrt{2\, d \ln d} \quad \text{a.s.} \qquad \square$$

C Robustness of Linear Classifiers to Gaussian Noise

C.1 Main Theorem

Theorem 2. For $\varepsilon < 1/3$, with $\zeta_1(\varepsilon) = \dfrac{1}{\sqrt{2 \ln(1/\varepsilon)}}$ and $\zeta_2(\varepsilon) = \dfrac{1}{\sqrt{1 - \sqrt{3\varepsilon}}}$:
$$\zeta_1(\varepsilon)\, \frac{\|w\|_2}{\sqrt{w^\top \Sigma w}} \;\le\; \frac{r_{\Sigma,\varepsilon}(x)}{r_2(x)} \;\le\; \zeta_2(\varepsilon)\, \frac{\|w\|_2}{\sqrt{w^\top \Sigma w}},$$
where $r_2(x)$ is the $\ell_2$ adversarial distance of Lemma 1. Theorem 2 is proved by the following lemmas.

Lemma 5. For $\zeta_1(\varepsilon) = \dfrac{1}{\sqrt{2 \ln(1/\varepsilon)}}$:
$$r_{\Sigma,\varepsilon}(x) \ge \zeta_1(\varepsilon)\, \frac{\|w\|_2}{\sqrt{w^\top \Sigma w}}\, r_2(x).$$

Proof. As in the proof of Lemma 2:
$$P_{v\sim\mathcal{N}(0,\Sigma)}\big(g(x + \alpha v) \ne g(x)\big) \le P_{v\sim\mathcal{N}(0,\Sigma)}\left(|w^\top v| \ge \frac{\|w\|_2\, r_2(x)}{\alpha}\right).$$
Since $v \sim \mathcal{N}(0, \Sigma)$ follows a multivariate normal distribution with a positive definite covariance matrix $\Sigma$, if $\Sigma^{1/2}$ is the symmetric square root of $\Sigma$, then $v = \Sigma^{1/2} v'$ with $v' \sim \mathcal{N}(0, I_d)$. So:
$$P_{v\sim\mathcal{N}(0,\Sigma)}\big(g(x+\alpha v) \ne g(x)\big) \le P_{v'\sim\mathcal{N}(0,I_d)}\left(\big|(\Sigma^{1/2} w)^\top v'\big| \ge \frac{\|w\|_2\, r_2(x)}{\alpha}\right).$$
If $v' \sim \mathcal{N}(0, I_d)$, then $(\Sigma^{1/2} w)^\top v' \sim \mathcal{N}\big(0, \|\Sigma^{1/2} w\|_2^2\big) = \mathcal{N}\big(0, w^\top \Sigma w\big)$. Therefore, by the standard Gaussian tail bound:
$$P_{v\sim\mathcal{N}(0,\Sigma)}\big(g(x+\alpha v) \ne g(x)\big) \le \exp\left(-\frac{\|w\|_2^2\, r_2(x)^2}{2\, \alpha^2\, w^\top \Sigma w}\right).$$
So if $\alpha < \dfrac{\|w\|_2\, r_2(x)}{\sqrt{2 \ln(1/\varepsilon)}\, \sqrt{w^\top \Sigma w}}$, then $P_{v\sim\mathcal{N}(0,\Sigma)}(g(x+\alpha v) \ne g(x)) < \varepsilon$. Thus:
$$r_{\Sigma,\varepsilon}(x) \ge \zeta_1(\varepsilon)\, \frac{\|w\|_2}{\sqrt{w^\top \Sigma w}}\, r_2(x). \qquad \square$$

Lemma 6. For $\varepsilon < 1/3$ and $\zeta_2(\varepsilon) = \dfrac{1}{\sqrt{1 - \sqrt{3\varepsilon}}}$:
$$r_{\Sigma,\varepsilon}(x) \le \zeta_2(\varepsilon)\, \frac{\|w\|_2}{\sqrt{w^\top \Sigma w}}\, r_2(x).$$

Proof. As above,
$$P_{v\sim\mathcal{N}(0,\Sigma)}\big(g(x+\alpha v) \ne g(x)\big) = P_{v'\sim\mathcal{N}(0,I_d)}\left(\frac{\big((\Sigma^{1/2}w)^\top v'\big)^2}{\|\Sigma^{1/2}w\|_2^2} \ge \frac{\|w\|_2^2\, r_2(x)^2}{\alpha^2\, \|\Sigma^{1/2}w\|_2^2}\right)\!\Big/\,2 \cdot 2.$$
Note that, for $X = \frac{((\Sigma^{1/2}w)^\top v')^2}{\|\Sigma^{1/2}w\|_2^2}$, we have $E_{v'\sim\mathcal{N}(0,I_d)}[X] = 1$ and $E_{v'\sim\mathcal{N}(0,I_d)}[X^2] = 3$ (fourth moment of a standard Gaussian). So, by using Paley–Zygmund's inequality with $t = \frac{\|w\|_2^2\, r_2(x)^2}{\alpha^2\, w^\top \Sigma w}$, when $\alpha^2\, w^\top\Sigma w \ge \|w\|_2^2\, r_2(x)^2$:
$$P_{v\sim\mathcal{N}(0,\Sigma)}\big(g(x+\alpha v) \ne g(x)\big) \ge \frac{1}{3} \left(1 - \frac{\|w\|_2^2\, r_2(x)^2}{\alpha^2\, w^\top \Sigma w}\right)^2.$$
So if $\alpha > \dfrac{\|w\|_2\, r_2(x)}{\sqrt{\big(1 - \sqrt{3\varepsilon}\big)\, w^\top \Sigma w}}$, then $P_{v\sim\mathcal{N}(0,\Sigma)}(g(x+\alpha v) \ne g(x)) > \varepsilon$. Therefore:
$$r_{\Sigma,\varepsilon}(x) \le \zeta_2(\varepsilon)\, \frac{\|w\|_2}{\sqrt{w^\top \Sigma w}}\, r_2(x). \qquad \square$$

C.2 Typical Value of the Multiplicative Factor

Proposition 2. Let $\Sigma$ be a $d \times d$ positive semidefinite matrix with $\operatorname{Tr}(\Sigma) = 1$. If $w$ is a random direction uniformly distributed over the unit $\ell_2$-sphere, then, for $t \le \frac{\pi}{8}\sqrt{d}$:
$$P\left(\Big|\sqrt{d}\, \big\|\Sigma^{1/2} w\big\|_2 - 1\Big| \ge t\right) \le \exp\left(-\frac{\bar t^{\,2}}{8 \operatorname{Tr}(\Sigma^2)}\right) + \exp\left(-\frac{d\, \bar t^{\,2}}{8}\right) + \exp\left(-\frac{1}{200 \operatorname{Tr}(\Sigma^2)}\right), \qquad \text{where } \bar t = \frac{t}{5}.$$
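Before the proof, Proposition 2's claim — that for a random direction $w$ the quantity $\sqrt{d}\,\|\Sigma^{1/2}w\|_2$ concentrates around $\sqrt{\operatorname{Tr}(\Sigma)} = 1$ — can be sanity-checked by simulation. The sketch below is illustrative only; the spectrum of $\Sigma$, the dimension, and the thresholds are arbitrary choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 1000, 2000

# A PSD Sigma with Tr(Sigma) = 1: work directly in its eigenbasis,
# with an arbitrary random spectrum normalized to sum to one.
lam = rng.random(d)
lam /= lam.sum()

g = rng.normal(size=(n, d))
w = g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform on the unit l2-sphere
quad = d * np.sum(lam * w**2, axis=1)              # d * w^T Sigma w in the eigenbasis

# Concentration of d * w^T Sigma w around Tr(Sigma) = 1, as the proposition predicts.
assert abs(float(np.mean(quad)) - 1.0) < 0.05
assert float(np.mean(np.abs(quad - 1.0) > 0.5)) < 0.05
```

Since $\sqrt{d}\,\|\Sigma^{1/2}w\|_2 = \sqrt{d\, w^\top\Sigma w}$, concentration of `quad` around $1$ is equivalent to the event controlled in Proposition 2.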
Proof. Suppose that $w$ is a random direction uniformly distributed over the unit $\ell_2$-sphere. Then $w$ can be written as $\frac{g}{\|g\|_2}$, where $g = (g_1, \ldots, g_d)$ are i.i.d. with normal distribution $\mathcal{N}(\mu = 0, \sigma^2 = 1)$. By using this representation in the orthogonal basis in which $\Sigma$ is diagonal, we get:
$$\big\|\Sigma^{1/2} w\big\|_2^2 = \frac{g^\top \Sigma g}{\|g\|_2^2} = \frac{\sum_i \lambda_i g_i^2}{\sum_i g_i^2}, \qquad \text{where } \Sigma = \operatorname{Diag}(\lambda_i) \text{ in the previously mentioned orthogonal basis.}$$
Let us focus on the concentration of $\sum_i \lambda_i g_i^2$. We have:
$$E\Big[\sum_i \lambda_i g_i^2\Big] = \sum_i \lambda_i = \operatorname{Tr}(\Sigma) = 1, \qquad \operatorname{Var}\Big(\sum_i \lambda_i g_i^2\Big) = \operatorname{Var}(g_1^2) \sum_i \lambda_i^2 = 2 \operatorname{Tr}(\Sigma^2).$$
One of the Bernstein-type inequalities [Bernstein, 1927] can be applied to the sub-exponential variables $\lambda_i (g_i^2 - E[g_i^2])$:¹
$$P\left(\Big|\sum_i \lambda_i g_i^2 - 1\Big| \ge t\right) \le \exp\left(-\frac{t^2}{8 \operatorname{Tr}(\Sigma^2)}\right)$$
for $t \le \beta\, \operatorname{Tr}(\Sigma^2)/\max_i \lambda_i$, where $\beta = \pi/8$ is a constant. $\|g\|_2^2$ has a chi-squared distribution, so, using a simple concentration inequality for the chi-squared distribution:²
$$P\left(\Big|\frac{1}{d} \|g\|_2^2 - 1\Big| \ge t\right) \le \exp\left(-\frac{d\, t^2}{8}\right).$$
Overall, for $t \le \beta\sqrt{d}$ and $\bar t = t/5$, a union bound gives:
$$P\left(\Big|\sqrt{d}\, \big\|\Sigma^{1/2} w\big\|_2 - 1\Big| \ge t\right) \le P\left(\Big|\sum_i \lambda_i g_i^2 - 1\Big| \ge \bar t\right) + P\left(\Big|\frac{1}{d}\|g\|_2^2 - 1\Big| \ge \bar t\right) + P\left(\Big|\sum_i \lambda_i g_i^2 - 1\Big| \ge \frac{1}{5}\right),$$
so, using the previous inequalities:
$$P\left(\Big|\sqrt{d}\, \big\|\Sigma^{1/2} w\big\|_2 - 1\Big| \ge t\right) \le \exp\left(-\frac{\bar t^{\,2}}{8 \operatorname{Tr}(\Sigma^2)}\right) + \exp\left(-\frac{d\, \bar t^{\,2}}{8}\right) + \exp\left(-\frac{1}{200 \operatorname{Tr}(\Sigma^2)}\right). \qquad \square$$

¹ Because $E\big[g_i^{2k}\big] = \frac{2^k\, \Gamma\!\left(k + \frac{1}{2}\right)}{\sqrt{\pi}} \le 4^k\, k!$ for all $k \ge 1$, the variables $\lambda_i g_i^2$ are sub-exponential.
² Using the fact that $\|g\|_2^2$ is a sum of independent sub-exponential random variables (see https://www.stat.berkeley.edu/~mjwain/stat0b/chap_tailbounds_jan_05.pdf, Example 5, for instance).
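The Gaussian tail bound at the heart of Lemma 5 can also be checked empirically. The sketch below is illustrative only (arbitrary dimension, noise level, covariance construction and seed): it verifies that the empirical misclassification probability of a linear classifier under $\mathcal{N}(0, \Sigma)$ noise stays below $\exp\!\big(-\|w\|_2^2\, r_2(x)^2 / (2\alpha^2\, w^\top\Sigma w)\big)$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 50000
w = rng.normal(size=d)                   # arbitrary linear classifier
b = 1.0
fx = b                                   # f at x = 0, so ||w||_2 * r_2(x) = |f(x)|

# A positive semidefinite Sigma with Tr(Sigma) = 1.
A = rng.normal(size=(d, d))
Sigma = A @ A.T
Sigma /= np.trace(Sigma)

sigma2 = float(w @ Sigma @ w)            # Var(w^T v) for v ~ N(0, Sigma)
alpha = 2.0
v = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
p_emp = float(np.mean(fx * (fx + alpha * (v @ w)) <= 0))
p_bound = float(np.exp(-fx**2 / (2 * alpha**2 * sigma2)))
assert p_emp <= p_bound                  # empirical tail below the Lemma 5 bound
```

The margin is comfortable here because the bound controls the two-sided Gaussian tail, while only one side can flip the decision.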
D Robustness of LAF Classifiers to $\ell_p$ and Gaussian Noise

Theorem 3. Let $p \in [1, \infty]$. Let $q \in [1, \infty]$ be such that $\frac1p + \frac1q = 1$. Let $\varepsilon_0$, $\zeta_1(\varepsilon)$, $\zeta_2(\varepsilon)$ be as in Theorem 1. Then, for all $\varepsilon < \varepsilon_0$, the following holds. Assume $f$ is a classifier that is $(\gamma, \eta)$-LAF at point $x$, and let $\hat x$ be such that $r(x) = \|x - \hat x\|_2$. Then:
$$(1 - \gamma)\, \zeta_1(\varepsilon)\, d^{1/p}\, \frac{\|\nabla f(\hat x)\|_q}{\|\nabla f(\hat x)\|_2}\, r_p(x) \le r_{p,\varepsilon}(x),$$
and, provided $(1 + \gamma)\, \zeta_2(\varepsilon)\, d^{1/p}\, \dfrac{\|\nabla f(\hat x)\|_q}{\|\nabla f(\hat x)\|_2}\, r_p(x) \le \eta_{\lim}$:
$$r_{p,\varepsilon}(x) \le (1 + \gamma)\, \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|\nabla f(\hat x)\|_q}{\|\nabla f(\hat x)\|_2}\, r_p(x).$$

Proof. Let $f^-$ and $f^+$ be linear functions such that the separating hyperplanes of, respectively, $H_\gamma^-(x, \hat x)$ and $H_\gamma^+(x, \hat x)$ are described by the equations, respectively, $f^-(z) = 0$ and $f^+(z) = 0$. By definition, we know that $r_p^{f^-}(x) = (1 - \gamma)\, r_p(x)$ and $r_p^{f^+}(x) = (1 + \gamma)\, r_p(x)$.

From the definition of LAF classifiers, since, for all $\eta \le \frac{1-\gamma}{1+\gamma}\, \eta_{\lim}$ and all $z \in H_\gamma^-(x, \hat x) \cap B(x, \eta)$, $f(z)\, f(x) > 0$, we have $r_{p,\varepsilon}^{f^-}(x) \le r_{p,\varepsilon}(x)$; indeed, if $x + \alpha v$ with $\alpha \le \frac{1-\gamma}{1+\gamma}\, \eta_{\lim}$ is not misclassified by $f^-$, then it is not misclassified by $f$. Therefore, by applying Lemma 2 to $f^-$, we get:
$$(1 - \gamma)\, \zeta_1(\varepsilon)\, d^{1/p}\, \frac{\|\nabla f(\hat x)\|_q}{\|\nabla f(\hat x)\|_2}\, r_p(x) \le r_{p,\varepsilon}(x).$$
Since, as long as $\eta \le \eta_{\lim}$, for all $z \in H_\gamma^+(x, \hat x) \cap B(x, \eta)$, $f(z)\, f(x) < 0$, we can apply a symmetric reasoning for $f^+$ and get:
$$r_{p,\varepsilon}(x) \le (1 + \gamma)\, \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|\nabla f(\hat x)\|_q}{\|\nabla f(\hat x)\|_2}\, r_p(x). \qquad \square$$

Theorem 4. Let $\Sigma$ be a $d \times d$ positive semidefinite matrix with $\operatorname{Tr}(\Sigma) = 1$. Let $\varepsilon_0$, $\zeta_1(\varepsilon)$, $\zeta_2(\varepsilon)$ be as in Theorem 2. Then, for all $\varepsilon < \varepsilon_0/3$ (so that $\zeta_2(3\varepsilon)$ is defined), the following holds. Assume $f$ is a classifier that is $(\gamma, \eta)$-LAF at point $x$, and let $\hat x$ be such that $r(x) = \|x - \hat x\|_2$. Then:
$$(1 - \gamma)\, \zeta_1(\varepsilon)\, \frac{\|\nabla f(\hat x)\|_2}{\sqrt{\nabla f(\hat x)^\top \Sigma\, \nabla f(\hat x)}}\, r_2(x) \le r_{\Sigma,\varepsilon}(x),$$
and, provided $(1 + \gamma) \left(1 + \sqrt{8 \operatorname{Tr}(\Sigma^2) \ln(4/\varepsilon)}\right) \zeta_2(3\varepsilon)\, \dfrac{\|\nabla f(\hat x)\|_2}{\sqrt{\nabla f(\hat x)^\top \Sigma\, \nabla f(\hat x)}}\, r_2(x) \le \eta_{\lim}$:
$$r_{\Sigma,\varepsilon}(x) \le (1 + \gamma)\, \zeta_2(3\varepsilon)\, \frac{\|\nabla f(\hat x)\|_2}{\sqrt{\nabla f(\hat x)^\top \Sigma\, \nabla f(\hat x)}}\, r_2(x).$$

Proof. This proof can be directly adapted from the proof of Theorem 3. The difference in the Gaussian case is that $v$ is no longer sampled from the unit ball, so its norm is not bounded anymore. However, its norm can be bounded with high probability, and this enables us to adapt the bounds of Theorem 3 to the Gaussian case. Indeed, using a Bernstein inequality as in the proof of Proposition 2, we have:
$$P\big(\|v\|_2 \ge 1 + t\big) \le \exp\left(-\frac{t^2}{8 \operatorname{Tr}(\Sigma^2)}\right) \le \frac{\varepsilon}{4} \qquad \text{for } t \ge \psi(\varepsilon) := \sqrt{8 \operatorname{Tr}(\Sigma^2) \ln(4/\varepsilon)}.$$
Let us focus on the upper bound for this proof; the lower bound follows by a similar reasoning. From the definition of LAF classifiers, since, for all $\eta \le \eta_{\lim}$ and all $z \in H_\gamma^+(x, \hat x) \cap B(x, \eta)$, $f(z)\, f(x) < 0$, we have $r_{\Sigma,\varepsilon}(x) \le r_{\Sigma,3\varepsilon}^{f^+}(x)$; indeed, if $x + \alpha v$ with $\alpha \le \frac{\eta_{\lim}}{1 + \psi(\varepsilon)}$ is misclassified by $f^+$, then it is misclassified by $f$ whenever $\alpha \|v\|_2 \le \eta_{\lim}$, which fails with probability at most $\varepsilon/4$. Therefore, by applying Lemma 6 to $f^+$:
$$r_{\Sigma,\varepsilon}(x) \le (1 + \gamma)\, \zeta_2(3\varepsilon)\, \frac{\|\nabla f(\hat x)\|_2}{\sqrt{\nabla f(\hat x)^\top \Sigma\, \nabla f(\hat x)}}\, r_2(x). \qquad \square$$

E Generalization to Multi-class Classifiers

We present in this section a generalization of Theorem 1 to multi-class linear classifiers, and discuss the generalization of the other results to the multi-class case.

A classifier $f$ is said to be linear if, for all $k \in \{1, \ldots, L\}$, there are a vector $w_k$ and a scalar $b_k$ such that $f_k(x) = w_k^\top x + b_k$. In this setting, Theorem 1 can be generalized by replacing $\zeta_1(\varepsilon)$ by $\zeta_1\!\left(\frac{\varepsilon}{L-1}\right)$ in the lower bound.

Theorem 5. Let $p \in [1, \infty]$. Let $q \in [1, \infty]$ be such that $\frac1p + \frac1q = 1$. Let $\varepsilon_0$, $\zeta_1(\varepsilon)$, $\zeta_2(\varepsilon)$ be the constants as defined in Theorem 1. Let $k = g(x)$ be the label attributed to $x$ by $f$, let $j$ be a class such that $x + r(x)$ lies on the decision boundary between classes $k$ and $j$ (i.e., the class of the adversarial perturbation of $x$), and let
$$\tilde\jmath = \operatorname*{argmin}_{l \ne k}\; \frac{\|w_k - w_l\|_q}{\|w_k - w_l\|_2}.$$
Then, for all $\varepsilon < \varepsilon_0$:
$$\zeta_1\!\left(\frac{\varepsilon}{L-1}\right) d^{1/p}\, \frac{\|w_k - w_{\tilde\jmath}\|_q}{\|w_k - w_{\tilde\jmath}\|_2}\, r_p(x) \;\le\; r_{p,\varepsilon}(x) \;\le\; \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|w_k - w_j\|_q}{\|w_k - w_j\|_2}\, r_p(x).$$

Proof. We first define, for the sake of the demonstration, for any class $l$, the adversarial perturbation in the binary case where only classes $k$ and $l$ are considered:
$$r(x, l) = \operatorname*{argmin}_{r}\; \|r\|_p \quad \text{s.t.} \quad f_k(x + r) < f_l(x + r),$$
and write $r_p(x, l) = \|r(x, l)\|_p$. It is then possible to express conveniently $P_{v\sim B_p}(g(x) \ne g(x + \alpha v))$:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) = P_{v\sim B_p}\big(\exists\, l \ne k,\; f_k(x + \alpha v) < f_l(x + \alpha v)\big) = P_{v\sim B_p}\big(\exists\, l \ne k,\; \alpha\, (w_l - w_k)^\top v > f_k(x) - f_l(x)\big)$$
$$= P_{v\sim B_p}\left(\exists\, l \ne k,\; (w_l - w_k)^\top v > \frac{\|w_l - w_k\|_q\, r_p(x, l)}{\alpha}\right).$$
Let us first prove the inequality on the upper bound, as in Lemma 3:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \ge P_{v\sim B_p}\left((w_j - w_k)^\top v > \frac{\|w_j - w_k\|_q\, r_p(x, j)}{\alpha}\right) = P_{v\sim B_p}\left((w_j - w_k)^\top v > \frac{\|w_j - w_k\|_q\, r_p(x)}{\alpha}\right),$$
by definition of $j$. Then, using the same reasoning as in Lemma 3 leads to:
$$r_{p,\varepsilon}(x) \le \zeta_2(\varepsilon)\, d^{1/p}\, \frac{\|w_k - w_j\|_q}{\|w_k - w_j\|_2}\, r_p(x).$$
Let us then prove the inequality on the lower bound, as in Lemma 2. We use the union bound to derive the inequality:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le \sum_{l \ne k} P_{v\sim B_p}\left((w_l - w_k)^\top v > \frac{\|w_l - w_k\|_q\, r_p(x, l)}{\alpha}\right) \le \sum_{l \ne k} P_{v\sim B_p}\left((w_l - w_k)^\top v > \frac{\|w_l - w_k\|_q\, r_p(x)}{\alpha}\right),$$
because $r_p(x) \le r_p(x, l)$ for all $l$. Moreover, for $\alpha < \zeta_1\!\left(\frac{\varepsilon}{L-1}\right) d^{1/p}\, \frac{\|w_k - w_{\tilde\jmath}\|_q}{\|w_k - w_{\tilde\jmath}\|_2}\, r_p(x)$, the reasoning of Lemma 2 applied to each $l \ne k$ (using the minimality of $\tilde\jmath$) gives:
$$P_{v\sim B_p}\big(g(x) \ne g(x+\alpha v)\big) \le \sum_{l \ne k} \frac{\varepsilon}{L-1} = \varepsilon.$$
Therefore:
$$r_{p,\varepsilon}(x) \ge \zeta_1\!\left(\frac{\varepsilon}{L-1}\right) d^{1/p}\, \frac{\|w_k - w_{\tilde\jmath}\|_q}{\|w_k - w_{\tilde\jmath}\|_2}\, r_p(x). \qquad \square$$

The proof of this theorem uses the union bound to obtain the lower bound, explaining why $\zeta_1(\varepsilon)$ in the binary case becomes $\zeta_1\!\left(\frac{\varepsilon}{L-1}\right)$ in the multi-class setting. However, this inequality represents a worst case in the majoration used in the proof, and we observed in our experiments that using the coefficient $\zeta_1(\varepsilon)$ instead of $\zeta_1\!\left(\frac{\varepsilon}{L-1}\right)$ gives a proper lower bound on $\frac{r_{p,\varepsilon}(x)}{r_p(x)}$.

Notice that it is possible to generalize the other results that we proved in the binary case (Lemma 4, Theorems 2 and 3) to the multi-class problem, with a similar transformation of the inequalities (replacing $\zeta(\varepsilon)$ by $\zeta\!\left(\frac{\varepsilon}{L-1}\right)$) and using similar definitions of $j$ and $\tilde\jmath$.

References

[Barthe et al., 2005] Barthe, F., Guédon, O., Mendelson, S., and Naor, A. (2005). A probabilistic approach to the geometry of the $\ell_p^n$-ball. The Annals of Probability, 33:480–513.

[Bernstein, 1927] Bernstein, S. (1927). Theory of Probability.

[Galambos, 1987] Galambos, J. (1987). The Asymptotic Theory of Extreme Order Statistics. R. E. Krieger, second edition.

[Robbins, 1955] Robbins, H. (1955). A remark on Stirling's formula. The American Mathematical Monthly, 62:26–29.