A Proof of Lemma 2

Proof: Since the support of Laplace distributions is $\mathbb{R}$, two such distributions are absolutely continuous with respect to each other, and the divergence is well defined. We start by calculating the following integral, assuming $\mu_1 \geq \mu_2$:

$$ I = \int_{\mathbb{R}} \frac{|\omega - \mu_2|}{\sigma_2} \cdot \frac{1}{2\sigma_1} \exp\left\{ -\frac{|\omega - \mu_1|}{\sigma_1} \right\} d\omega . $$

Splitting the integral at the points $\mu_2$ and $\mu_1$, and changing variables $y = (\omega - \mu_1)/\sigma_1$, yields elementary integrals of the forms $\int \exp\{y\}\,dy$ and $\int y \exp\{y\}\,dy$, whose sum evaluates to $\frac{1}{\sigma_2}\left( \mu_1 - \mu_2 + \sigma_1 e^{-(\mu_1-\mu_2)/\sigma_1} \right)$. We thus conclude, for the general case,

$$ I = \frac{1}{\sigma_2} \left( |\mu_1 - \mu_2| + \sigma_1 \exp\left\{ -\frac{|\mu_1 - \mu_2|}{\sigma_1} \right\} \right) . \qquad (5) $$

As for the Kullback-Leibler divergence, we use the chain formula for independent random variables,

$$ D_{KL}\left( P_{\mu,\sigma} \,\middle\|\, P_{\mu_p,\sigma_p} \right) = \sum_{i=1}^{d} \int_{\mathbb{R}} \frac{1}{2\sigma_i}\, e^{-|\omega - \mu_i|/\sigma_i} \, \ln \frac{P_{\mu_i,\sigma_i}(\omega)}{P_{\mu_{p,i},\sigma_{p,i}}(\omega)} \, d\omega . $$

Writing the log-ratio as $\ln(\sigma_{p,i}/\sigma_i) + |\omega - \mu_{p,i}|/\sigma_{p,i} - |\omega - \mu_i|/\sigma_i$, the first term of the integral is given in (5) — its summands include exactly the $d$-dimensional $\sigma$-weighted $\ell_1$-norm of the means — and the last term integrates to one. Therefore,

$$ D_{KL}\left( P_{\mu,\sigma} \,\middle\|\, P_{\mu_p,\sigma_p} \right) = \sum_{i=1}^{d} \left[ \ln \frac{\sigma_{p,i}}{\sigma_i} + \frac{|\mu_i - \mu_{p,i}| + \sigma_i\, e^{-|\mu_i - \mu_{p,i}|/\sigma_i}}{\sigma_{p,i}} - 1 \right] , $$

which completes the proof.

B Proof of Lemma 3

Proof: We prove that

$$ \Pr\left[ y\,\omega \cdot x < 0 \right] = \Pr\left[ y\,(\omega - \mu) \cdot x < -\,y\,\mu \cdot x \right] . $$

The random variable$^4$ $Z = y\,(\omega - \mu)\cdot x = \sum_j Z_j$ is a sum of independent zero-mean Laplace distributed random variables, $Z_j \sim \mathrm{Laplace}(0, \sigma_j |x_j|)$, each equal in distribution to a difference between two i.i.d. exponential random variables. Therefore,

$$ \Pr\left[ y\,\omega \cdot x < 0 \right] = \Pr\left[ A - B < -\,y\,\mu \cdot x \right] , \qquad (6) $$

where $A = \sum_j A_j$ and $B = \sum_j B_j$ with $A_j, B_j \sim \mathrm{Exp}(\lambda_j)$, and

$$ \lambda_j = \lambda_j(x) = \frac{1}{\sigma_j\, |x_j|} , \qquad j = 1, \dots, d . $$

Without loss of generality we assume that the coordinates of $x$ are sorted, i.e., $\lambda_1 < \lambda_2 < \dots < \lambda_d$. Calculating the convolution for $x_1$, $x_2$ and $z \geq 0$,

$$ f_{A_1 + A_2}(z) = \int_0^z \lambda_1 e^{-\lambda_1 t}\, \lambda_2 e^{-\lambda_2 (z - t)}\, dt = \frac{\lambda_1 \lambda_2}{\lambda_2 - \lambda_1} \left( e^{-\lambda_1 z} - e^{-\lambda_2 z} \right) . $$

Exploiting the structure of the resulting convolution — a linear combination of exponentials — we convolve it with the $l$-th density and again obtain a linear combination of the exponentials $e^{-\lambda_1 z}, \dots, e^{-\lambda_l z}$, with partial-fraction coefficients. Performing the convolution for all densities yields

$$ f_A(z) = \sum_{j=1}^{d} \xi_j\, e^{-\lambda_j z} \quad \text{for } z \geq 0 , \qquad \text{where } \xi_j = \xi_j(x) = \prod_{n=1}^{d} \lambda_n \prod_{n \neq j} \frac{1}{\lambda_n - \lambda_j} . $$

Similarly, we get the same result for $f_{-B}(z)$, yet it is defined for $z \leq 0$.

$^4$ Notice that if $x_j = 0$, the random variable $\omega_j x_j$ equals zero too; therefore we assume without loss of generality that $x_j \neq 0$.
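Stepping aside from the proofs for a moment: the closed form obtained in Appendix A for the one-dimensional Kullback-Leibler divergence between Laplace distributions can be sanity-checked numerically. The sketch below is ours, not the authors' code; it assumes the parameterization $\mathrm{Laplace}(\mu, b)$ with density $e^{-|x-\mu|/b}/(2b)$, and the helper names are illustrative:

```python
import math

def laplace_pdf(x, mu, b):
    """Density of the Laplace(mu, b) distribution at x."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

def kl_laplace_closed(mu1, b1, mu2, b2):
    """Closed-form KL( Laplace(mu1,b1) || Laplace(mu2,b2) ):
    ln(b2/b1) + (|mu1-mu2| + b1*exp(-|mu1-mu2|/b1))/b2 - 1."""
    delta = abs(mu1 - mu2)
    return math.log(b2 / b1) + (delta + b1 * math.exp(-delta / b1)) / b2 - 1.0

def kl_laplace_numeric(mu1, b1, mu2, b2, lo=-30.0, hi=30.0, n=200_000):
    """Trapezoidal approximation of the KL integral, for comparison."""
    h = (hi - lo) / n

    def integrand(x):
        p = laplace_pdf(x, mu1, b1)
        q = laplace_pdf(x, mu2, b2)
        return p * math.log(p / q)

    total = 0.5 * (integrand(lo) + integrand(hi))
    for i in range(1, n):
        total += integrand(lo + i * h)
    return total * h
```

By the chain formula used above, the $d$-dimensional divergence is simply the sum of such one-dimensional terms over the coordinates.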
Asaf Noy, Koby Crammer

From (6), we convolve the difference and get, for $z \geq 0$,

$$ f_{A-B}(z) = \int_{z}^{\infty} \sum_m \xi_m e^{-\lambda_m t} \sum_j \xi_j e^{-\lambda_j (t - z)}\, dt = \sum_m \psi_m\, e^{-\lambda_m z} , \qquad \psi_m = \psi_m(x) = \sum_j \frac{\xi_m \xi_j}{\lambda_m + \lambda_j} . $$

By symmetry, $f_{A-B}(z) = f_{A-B}(-z)$, so for $z \leq 0$ we have $f_{A-B}(z) = \sum_m \psi_m e^{\lambda_m z}$. We integrate to get the CDF: for $v \leq 0$,

$$ \Pr\left[ A - B < v \right] = \int_{-\infty}^{v} \sum_m \psi_m\, e^{\lambda_m z}\, dz = \sum_m \frac{\psi_m}{\lambda_m}\, e^{\lambda_m v} . $$

Finally, we define $\alpha_j(x) = \psi_j(x)/\lambda_j(x)$ (with the coordinates of $x$ sorted as above), and obtain

$$ \Pr\left[ y\,\omega\cdot x < 0 \right] = \begin{cases} \sum_j \alpha_j(x)\, e^{-\lambda_j\, y\,\mu\cdot x} & y\,\mu\cdot x \geq 0 \\[2pt] 1 - \sum_j \alpha_j(x)\, e^{\lambda_j\, y\,\mu\cdot x} & y\,\mu\cdot x < 0 . \end{cases} $$

In particular, from the symmetry of $f_{A-B}(z)$, we have for $\mu = 0$ that $\Pr\left[ y\,\omega\cdot x < 0 \right] = \sum_j \alpha_j(x) = \frac{1}{2}$, which concludes the proof.

C Proof of Theorem 4

Proof: From the assumption that the data is linearly separable, we conclude that the set $\left\{ \mu : y_i\,\mu\cdot x_i \geq 1,\ i = 1, \dots, m \right\}$ is not empty. Additionally, the set is defined via linear constraints and is thus convex. The objective (7) is convex in $\sigma$, as its second derivative with respect to $\sigma_j$ is $\sigma_j^{-2} > 0$. The regularization term of (7) is convex in $\mu$, as the second derivative of $z \mapsto z + \exp\{-z\}$ is always positive and well defined for all values of $z$ (see also the Remark for a discussion of this function for negative values of $z$).

Figure 6: Illustration of the cumulative sums, $\sum_{i=1}^{j} \alpha_i(x)$, for five $d$-dimensional vectors.

As for the loss term $\ell\left( y_i\,\mu\cdot x_i \right)$, we use the following auxiliary lemma.

Lemma: The following set of probability density functions over the reals,

$$ S = \left\{ f \text{ pdf} \;:\; f \in C^1,\ f(z) = f(-z),\ \text{and } \forall z_1, z_2:\ |z_2| > |z_1| \Rightarrow f(z_2) < f(z_1) \right\} , $$

is closed under convolution, i.e., $f, g \in S \Rightarrow f * g \in S$.

Since the random variables $\omega_1, \dots, \omega_d$ are independent, the density $f_{Z_i}(z)$ of the centered margin $Z_i = y_i\,(\omega - \mu)\cdot x_i$ is obtained by convolving the densities of the independent zero-mean Laplace distributed random variables $y_i\,(\omega_j - \mu_j)\, x_{i,j}$. Since the one-dimensional Laplace pdf is in $S$, it follows from the Lemma, by induction, that so is $f_{Z_i}$. As a member of $S$, the positivity of the derivative $f'_{Z_i}(z)$ for $z < 0$ is concluded from the Lemma. Finally, we note that the integral of the density is $\ell_{cdf}$, the cumulative density function,

$$ E(x_i, y_i, \mu, \sigma) = \int_{-\infty}^{-y_i\,\mu\cdot x_i} f_{Z_i}(z)\, dz . $$

Thus, the second derivative of $E(x_i, y_i, \mu, \sigma)$ with respect to the margin, for positive values of the margin, equals $f'_{Z_i}(z)$ at $z = -y_i\,\mu\cdot x_i < 0$, and is hence positive. Changing variables according to (6) completes the proof.

D Proof of the Auxiliary Lemma

Proof: Assume $f, g \in S$.
Denote by $h = f * g$. The derivative of a convolution between two differentiable functions always exists, and equals $(f * g)'(z) = \int_{\mathbb{R}} f(z - t)\, g'(t)\, dt$.
We compute for the convolution derivative,

$$ h'(z) = \int_{\mathbb{R}} f(z - t)\, g'(t)\, dt = \int_0^{\infty} g'(t) \left[ f(z - t) - f(z + t) \right] dt , $$

where the last equality follows from the fact that $g'(t)$ is an odd function, as the derivative of an even function. Since $f, g \in S$, $h(z) \in C^1$, i.e., continuously differentiable almost everywhere, and since $h'(z)$ is odd, we have that $h(z)$ is even. Using the monotonicity property of $f, g$, i.e., $|z_2| > |z_1| \Rightarrow f(z_2) < f(z_1)$, we get

$$ \mathrm{sign}\left( \int_0^{\infty} g'(t) \left[ f(z - t) - f(z + t) \right] dt \right) = -\,\mathrm{sign}(z) , $$

since for $t > 0$ the factor $g'(t)$ is negative, while $f(z - t) - f(z + t)$ has the sign of $z$. Since $f, g$ are pdfs, the integral is always defined, and thus the sign of the derivative of $h$ depends on the sign of its argument; in particular, $h$ is an increasing function for $z < 0$ and decreasing for $z > 0$, yielding the third property for $h$. Thus $h \in S$, as desired.

E Proof of Lemma 5

Proof: Setting $\mu = 0$ and $\sigma = 1$, the objective becomes $cm\eta$. Since the loss is non-negative, we get that at the minimizers the regularization term is at most $cm\eta$. Substituting the optimal value of $\sigma$ from (8) and rearranging, we get $\exp\{-|\mu_j|\} \geq \exp\{-cm\eta\}$ for every coordinate $j$, and we can conclude $\sigma_j \geq \exp\{-cm\eta\}$.

F Proof of Theorem 6

Proof: While the empirical loss term depends only on $\mu$, and was proved to be strictly convex for examples that satisfy $y_i\,\mu\cdot x_i > 0$ in Theorem 4, the regularization term is optimized over both $\mu$ and $\sigma$. Incorporating the optimal value of $\sigma$ from (8) into the objective yields the following:

$$ F(\mu) = \sum_j \sigma_j(\mu)\, e^{-|\mu_j|/\sigma_j(\mu)} + c \sum_i \ell\left( y_i\,\mu\cdot x_i \right) , $$

with $\sigma(\mu)$ given by (8). Differentiating the regularization term twice with respect to $\mu$ results in the Hessian matrix

$$ H = \frac{\mathrm{diag}\left( \exp\{-|\mu|\} \right)}{\sum_k e^{-|\mu_k|}} - v\, v^{\top} , \qquad v = \frac{\mathrm{sign}(\mu) \odot \exp\{-|\mu|\}}{\sum_k e^{-|\mu_k|}} , $$

for the $d$-dimensional vector $v$, where $\mathrm{diag}\left( \exp\{-|\mu|\} \right)$ is a diagonal matrix whose $i$th element equals $\exp\{-|\mu_i|\}$. The Hessian $H$ is a difference of two positive semi-definite matrices. We upper bound the maximal eigenvalue of the second term by its trace; indeed,

$$ \lambda_{\max}\left( v\, v^{\top} \right) = \|v\|_2^2 = \frac{\sum_j e^{-2|\mu_j|}}{\left( \sum_k e^{-|\mu_k|} \right)^2} < 1 . $$
Thus, the minimal eigenvalue of $H$ is bounded from below by $-1$, and the Hessian of the sum of the objective and $\frac{1}{2}\|\mu\|_2^2$ has positive eigenvalues; the sum is therefore strictly convex.

For the second part, we use [7, Corollary 7.2.3], stating that a diagonally-dominant matrix with non-negative diagonal values is PSD. We next show that the condition of the theorem is indeed a sufficient condition for the Hessian to be diagonally dominant. It is straightforward to verify that both conditions follow from a set of single-coordinate inequalities, one for each $j = 1, \dots, d$, comparing $e^{-|\mu_j|}$ with a sum of the remaining terms $e^{-|\mu_k|}$. Fixing $\mu_j$, the left-hand side is decomposed into a sum of one-variable convex functions. We minimize it for each coordinate by taking the derivative and setting it to zero, yielding a closed-form stationarity condition. From here we conclude that the inequalities are satisfied if $|\mu_j| \leq a$ for all $j$, for a scalar $a$ that satisfies

$$ g(a) = 2e^{-a} + a\,e^{-a} - a \geq 0 . $$

The function $g(a)$ is monotonically decreasing and continuous, with $g(1) = 3/e - 1 > 0$, which completes the proof. In fact, one can compute numerically and find that $a = 1.046$ satisfies $g(a) \geq 0$, which leads to a slightly better constant than stated in the theorem.

G Proof of Lemma 7

Proof: We first need to compute $\ell_{lin}$ directly, as $\alpha(x)$ is not defined on the standard basis, which contains a few elements of the same value (the rates $\lambda_j$ are not distinct). For a standard basis vector $e_j$,

$$ \Pr\left[ e_j \cdot \omega \leq 0 \right] = \Pr\left[ \omega_j \leq 0 \right] = \int_{-\infty}^{0} \frac{1}{2\sigma_j}\, e^{-|\omega - \mu_j|/\sigma_j}\, d\omega = \frac{1}{2} \exp\left\{ -\frac{\mu_j}{\sigma_j} \right\} \quad \text{for } \mu_j \geq 0 . $$

Thus, if the margin is non-negative we get the convex part, $\Pr\left[ e_j \cdot \omega \leq 0 \right] = \frac{1}{2}\exp\{-|\mu_j|/\sigma_j\}$. Otherwise, we bound $\Pr\left[ e_j \cdot \omega \leq 0 \right]$ with the linear extension and get $\frac{1}{2}\left( 1 + |\mu_j|/\sigma_j \right)$. To conclude, for each element we get

$$ \sum_{y = \pm 1} \ell_{lin}\left( y\,\mu \cdot e_j \right) = \frac{1}{2}\left( 1 + \frac{|\mu_j|}{\sigma_j} \right) + \frac{1}{2} \exp\left\{ -\frac{|\mu_j|}{\sigma_j} \right\} . $$

Taking the sum over $j$ and multiplying by $\frac{1}{2}$ yields the above regularization term.

H RobuCoP Pseudo-code

Input: Training set $S = \{(x_i, y_i)\}_{i=1}^{m}$; $c > 0$; artificial set $A = \{ (e_j, y) : j = 1, \dots, d,\ y \in Y \}$.
Initialization: $\mu^0 = 0$.
Loop, until convergence criterion met:
    Set: $\sigma^n = \exp\{-|\mu^n|\}$ (element-wise, the optimal value from (8)).
    Solve: $\mu^{n+1} = \arg\min_{\mu} \sum_{(x_i, y_i) \in S \cup A} c_i\, \ell_{lin}\left( y_i\,\mu\cdot x_i \right)$,
        for $c_i = c$ if $(x_i, y_i) \in S$, and $c_i = 1$ if $(x_i, y_i) \in A$.
Output: $\mu$, $\sigma$.

I Proof of Lemma 8

Proof: Denote the change of the loss term between consecutive iterations by

$$ \Delta_t = \sum_i D_i\, e^{-y_i\, x_i \cdot \mu^t} - \sum_i D_i\, e^{-y_i\, x_i \cdot \mu^{t+1}} . $$

We start by bounding $\Delta_t$ from below, and then add to it the difference of the regularization term before and after the update.
Bounding the improvement for a single example, we write $q_i^t = D_i\, e^{-y_i\, x_i \cdot \mu^t}$ and get

$$ \Delta_{t,i} = q_i^t \left( 1 - e^{-y_i\, x_i \cdot (\mu^{t+1} - \mu^t)} \right) . $$

Convexity of the exponent, for every $j = 1, \dots, d$, yields

$$ e^{-y_i x_{i,j} \delta_j} \leq \frac{1 + y_i x_{i,j}}{2}\, e^{-\delta_j} + \frac{1 - y_i x_{i,j}}{2}\, e^{\delta_j} , \qquad \delta_j = \mu_j^{t+1} - \mu_j^t . $$

Summing over the examples, and collecting per coordinate the weighted edges

$$ \gamma_j^{+} = \sum_{i\,:\, y_i x_{i,j} > 0} q_i^t\, |x_{i,j}| , \qquad \gamma_j^{-} = \sum_{i\,:\, y_i x_{i,j} < 0} q_i^t\, |x_{i,j}| , $$

we obtain a lower bound of the form

$$ \Delta_t \geq c \sum_j \left[ \gamma_j^{+} \left( 1 - e^{-\delta_j} \right) + \gamma_j^{-} \left( 1 - e^{\delta_j} \right) \right] . $$

Adding the regularization terms completes the proof.

J Proof of Lemma 9

Proof: Without loss of generality we assume that $\gamma_j^{+}\, e^{-\mu_j^t} - \gamma_j^{-}\, e^{\mu_j^t} > 0$, and in addition we assume, for the sake of contradiction, that $\mu_j^{t+1} - \mu_j^t < 0$. Differentiating the objective with respect to $\mu_j^{t+1}$ and equating to zero, and then rearranging the terms, the right-hand side (which involves $\sigma$) is strictly positive by assumption, while the contradiction hypothesis $\mu_j^{t+1} < \mu_j^t$ forces the left-hand side — built from $\gamma_j^{\pm}$ and the exponentials $e^{\pm \mu_j^{t+1}}, e^{\pm \mu_j^t}$ — to be strictly negative. This is a contradiction, so we must have $\mu_j^{t+1} \geq \mu_j^t$. The proof for the symmetric case follows similarly.

K Experiments: Data Details

Synthetic data: We generated 4,000 vectors $x_i \in \mathbb{R}^8$ sampled from a zero-mean isotropic normal distribution, $x_i \sim N(0, I)$. Labels were assigned by generating, once per run, $\omega \in \mathbb{R}^8$ at random and using $y_i = \mathrm{sign}(\omega \cdot x_i)$. Each input $x_i$ of the training data was then corrupted, with probability $p$, by adding to it a random vector sampled from a zero-mean isotropic Gaussian, $\epsilon_i \sim N(0, \sigma^2 I)$, with some positive standard deviation $\sigma$. Each run was repeated 20 times, and the results are the averaged test error over the 20 runs. All boosting algorithms were run for 1,000 iterations, except for the RobuCoP algorithm, which was executed until a convergence criterion was met, often after about 20 rounds.

Vocal Joystick: For each problem, we picked three sets of size 2,000 each, for training, parameter tuning, and testing. Each example is a frame of a spoken vowel described by 13 MFCC coefficients transformed into 27 features. In order to examine the robustness of the different algorithms, we contaminated 10% of the data with
additive zero-mean i.i.d. Gaussian noise, for different values of the standard deviation $\sigma$.
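The synthetic-data protocol above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the function name, the default noise level, and the seed are ours:

```python
import random

def make_synthetic(m=4000, d=8, p=0.2, noise_std=2.0, seed=0):
    """Generate the synthetic set of Section K: isotropic Gaussian inputs,
    labels from a random linear separator, then input corruption with
    probability p by additive isotropic Gaussian noise (labels are assigned
    from the clean inputs, as in the text)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]      # separator, drawn once per run
    X, y, corrupted = [], [], 0
    for _ in range(m):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # clean input x_i ~ N(0, I)
        margin = sum(wi * xi for wi, xi in zip(w, x))
        y.append(1 if margin >= 0 else -1)           # label from the clean input
        if rng.random() < p:                          # corrupt the input only
            x = [xi + rng.gauss(0.0, noise_std) for xi in x]
            corrupted += 1
        X.append(x)
    return X, y, corrupted
```

Repeating this generator over independent seeds and averaging the resulting test errors reproduces the run-averaging protocol described above.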