Robust Forward Algorithms via PAC-Bayes and Laplace Distributions

Asaf Noy, Koby Crammer

Supplementary Material


A  Proof of Lemma 2

Proof: Since the support of Laplace distributions is $\mathbb{R}$, two such distributions are always absolutely continuous with respect to each other and the divergence is well-defined. We start by calculating the following integral, assuming $\mu_1 \leq \mu_2$,

\[ I = \int_{\mathbb{R}} |\omega - \mu_2| \exp\left\{-\frac{|\omega - \mu_1|}{\sigma}\right\} d\omega . \]

Changing variables $y = (\omega - \mu_1)/\sigma$ yields,

\[ I = \sigma^2 \int_{\mathbb{R}} \left| y - \frac{\mu_2 - \mu_1}{\sigma} \right| e^{-|y|}\, dy , \]

and evaluating the integral separately on the regions $y < 0$, $0 \leq y \leq (\mu_2 - \mu_1)/\sigma$ and $y > (\mu_2 - \mu_1)/\sigma$ gives, for the general case,

\[ I = 2\sigma |\mu_1 - \mu_2| + 2\sigma^2 \exp\left\{-\frac{|\mu_1 - \mu_2|}{\sigma}\right\}. \tag{5} \]

As for the Kullback–Leibler divergence, we use the chain formula for independent random variables,

\[ D_{KL}\big( P_{\mu,\sigma} \,\|\, P_{0,\sigma} \big) = \sum_{j=1}^{d} D_{KL}\big( P_{\mu_j,\sigma} \,\|\, P_{0,\sigma} \big) = \sum_{j=1}^{d} \int_{\mathbb{R}} \frac{1}{2\sigma}\, e^{-|\omega - \mu_j|/\sigma}\; \frac{|\omega| - |\omega - \mu_j|}{\sigma}\, d\omega . \]

The first term of the integral is given in (5), and the second term is exactly the $d$-dimensional $\sigma$-weighted $\ell_1$-norm, therefore,

\[ D_{KL}\big( P_{\mu,\sigma} \,\|\, P_{0,\sigma} \big) = \sum_{j=1}^{d} \left( \frac{|\mu_j|}{\sigma} + e^{-|\mu_j|/\sigma} - 1 \right) = \frac{\|\mu\|_1}{\sigma} + \sum_{j=1}^{d} e^{-|\mu_j|/\sigma} - d , \]

which completes the proof. ∎

B  Proof of Lemma 3

Proof: We prove that,

\[ E(x, y, \mu, \sigma) = \Pr\big( y(\omega \cdot x) < 0 \big) = \Pr\big( y\,(\omega - \mu) \cdot x < -y(\mu \cdot x) \big) . \]

The random variable⁴ $Z = y\,(\omega - \mu)\cdot x$ is a sum of independent zero-mean Laplace distributed random variables, $Z_j \sim \mathrm{Laplace}(0, \sigma|x_j|)$, each equal in distribution to a difference between two i.i.d. exponential random variables. Therefore,

\[ \Pr\big( y(\omega \cdot x) < 0 \big) = \Pr\left( \sum_{j=1}^{d} (A_j - B_j) < -y(\mu \cdot x) \right), \]

where $A_j, B_j \sim \mathrm{Exp}(\lambda_j)$ and,

\[ \lambda_j = \lambda_j(x) = \frac{1}{\sigma |x_j|}, \qquad j = 1, \dots, d. \tag{6} \]

Without loss of generality we assume that the coordinates of $x$ are sorted, i.e. $\lambda_1 < \lambda_2 < \cdots < \lambda_d$. Calculating the convolution of the first two densities, for $z \geq 0$,

\[ f_{A_1 + A_2}(z) = \int_0^z \lambda_1 e^{-\lambda_1 t}\, \lambda_2 e^{-\lambda_2 (z - t)}\, dt = \frac{\lambda_1 \lambda_2}{\lambda_2 - \lambda_1} \left( e^{-\lambda_1 z} - e^{-\lambda_2 z} \right). \]

Exploiting the structure of the resulting convolution, we convolve it with the $l$-th density and get,

\[ f_{A_1 + A_2 + A_l}(z) = \frac{\lambda_1 \lambda_2 \lambda_l}{\lambda_2 - \lambda_1} \left( \frac{e^{-\lambda_1 z} - e^{-\lambda_l z}}{\lambda_l - \lambda_1} - \frac{e^{-\lambda_2 z} - e^{-\lambda_l z}}{\lambda_l - \lambda_2} \right). \]

Performing the convolution for all densities yields,

\[ f_{A}(z) = \sum_{j=1}^{d} \xi_j\, \lambda_j\, e^{-\lambda_j z} \quad \text{for } z \geq 0, \qquad \text{where we define } \xi_j = \xi_j(x) = \prod_{n = 1,\, n \neq j}^{d} \frac{\lambda_n}{\lambda_n - \lambda_j}, \]

for $A = \sum_j A_j$. Similarly, we get the same result for $f_{B}(z)$ with $B = \sum_j B_j$, yet, for the difference, it is defined for $z \leq 0$. From (6) we convolve the difference and get,

\[ f_{A - B}(z) = \int_{\max(z, 0)}^{\infty} f_{A}(t)\, f_{B}(t - z)\, dt = \sum_{j=1}^{d} \psi_j\, \lambda_j\, e^{-\lambda_j |z|}, \qquad \psi_j = \psi_j(x) = \xi_j \sum_{m=1}^{d} \frac{\xi_m \lambda_m}{\lambda_m + \lambda_j} . \]

We integrate to get the CDF: for $y(\mu \cdot x) \geq 0$,

\[ \Pr\big( A - B < -y(\mu \cdot x) \big) = \int_{-\infty}^{-y(\mu \cdot x)} f_{A-B}(z)\, dz = \sum_{j=1}^{d} \psi_j\, e^{-\lambda_j\, y(\mu \cdot x)} . \]

Finally, we define $\alpha_j(x) = \psi_j(\mathrm{sort}(x))$, and obtain,

\[ l_{cdf}\big( y(\mu \cdot x) \big) = \begin{cases} \displaystyle\sum_{j} \alpha_j(x)\, e^{-\lambda_j\, y(\mu \cdot x)} & y(\mu \cdot x) \geq 0 \\[2mm] 1 - \displaystyle\sum_{j} \alpha_j(x)\, e^{\lambda_j\, y(\mu \cdot x)} & y(\mu \cdot x) < 0 . \end{cases} \]

In particular, from the symmetry of $f_{A-B}(z)$, we have for $\mu = 0$, that

\[ \frac{1}{2} = \Pr\big( y(\omega \cdot x) < 0 \big) = \sum_{j} \alpha_j(x), \]

which concludes the proof. ∎

⁴ Notice that if $x = 0$ the random variable $\omega \cdot x$ equals zero too; therefore we assume without loss of generality that $x \neq 0$.
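A short numerical check of the closed form of Lemma 2, added here for convenience (not part of the original appendix; it assumes NumPy and SciPy are available). It compares the per-coordinate expression $|\mu|/\sigma + e^{-|\mu|/\sigma} - 1$ against direct numerical integration of the KL integrand:

# Numerical sanity check (our addition): compare the closed-form KL between
# Laplace(mu, sigma) and Laplace(0, sigma) with direct numerical integration
# of p(w) * log(p(w)/q(w)) = p(w) * (|w| - |w - mu|) / sigma.
import numpy as np
from scipy.integrate import quad

def laplace_pdf(w, mu, sigma):
    return np.exp(-np.abs(w - mu) / sigma) / (2.0 * sigma)

def kl_closed_form(mu, sigma):
    # Lemma 2, per coordinate: |mu|/sigma + exp(-|mu|/sigma) - 1
    return np.abs(mu) / sigma + np.exp(-np.abs(mu) / sigma) - 1.0

def kl_numeric(mu, sigma):
    integrand = lambda w: laplace_pdf(w, mu, sigma) * (
        np.abs(w) - np.abs(w - mu)) / sigma
    val, _ = quad(integrand, -50 * sigma, 50 * sigma)
    return val

for mu, sigma in [(0.0, 1.0), (0.5, 1.0), (3.0, 2.0)]:
    print(mu, sigma, kl_closed_form(mu, sigma), kl_numeric(mu, sigma))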

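The closed form of Lemma 3 can likewise be spot-checked by Monte Carlo simulation (our addition; the rates below are arbitrary distinct test values, and $\xi_j$, $\psi_j$ follow the definitions in the proof):

# Monte Carlo check (our addition) of the hypoexponential representation in
# Lemma 3: Z = sum_j (A_j - B_j), A_j, B_j ~ Exp(lambda_j), has the symmetric
# density f(z) = sum_j psi_j * lambda_j * exp(-lambda_j |z|).
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([0.7, 1.3, 2.9])          # distinct rates; the proof sorts them
n = 1_000_000

A = rng.exponential(1.0 / lam, size=(n, lam.size)).sum(axis=1)
B = rng.exponential(1.0 / lam, size=(n, lam.size)).sum(axis=1)

xi = np.array([np.prod([lam[m] / (lam[m] - lam[j])
                        for m in range(lam.size) if m != j])
               for j in range(lam.size)])
psi = np.array([xi[j] * np.sum(xi * lam / (lam + lam[j]))
                for j in range(lam.size)])

print("sum psi_j =", psi.sum())                 # should be 1/2, by symmetry
t = 1.5                                         # arbitrary threshold, t >= 0
print("closed form :", np.sum(psi * np.exp(-lam * t)))   # Pr(A - B >= t)
print("monte carlo :", np.mean(A - B >= t))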
C  Proof of Theorem 4

Proof: From the assumption that the data is linearly separable we conclude that the set $\left\{ \mu \,:\, y_i(\mu \cdot x_i) \geq 1,\ i = 1, \dots, m \right\}$ is not empty. Additionally, the set is defined via linear constraints and is thus convex. The objective (7) is convex in $\sigma$, as its second derivative with respect to $\sigma$ is positive. The regularization term of (7) is convex in $\mu$, as the second derivative of $z + \exp(-z)$ is always positive and well defined for all values of $z$ (see also Remark 1 for a discussion of this function for values $z \leq 0$).

[Figure 6: Illustration of the cumulative sums $\sum_{i \leq j} \alpha_i(x)$, plotted against the feature index $j$, for five $d$-dimensional vectors.]

As for the loss term $l(y_i\, \mu \cdot x_i)$, we use the following auxiliary lemma.

Lemma. The following set of probability density functions over the reals,

\[ S = \left\{ f \ \mathrm{pdf} \ \middle|\ f \in C^1,\ f(z) = f(-z),\ \text{and } \forall z_1, z_2:\ |z_2| > |z_1| \Rightarrow f(z_2) < f(z_1) \right\}, \]

is closed under convolution, i.e. $f, g \in S \Rightarrow f * g \in S$.

Since the random variables $\omega_1, \dots, \omega_d$ are independent, the density $f_{Z_i}(z)$ of the centered margin $Z_i = y_i\,(\omega - \mu)\cdot x_i$ is obtained by convolving the densities of the independent zero-mean Laplace distributed random variables $y_i (\omega_j - \mu_j) x_{i,j}$, $j = 1, \dots, d$. Since the one-dimensional Laplace pdf is in $S$, it follows from the Lemma by induction that so is $f_{Z_i}$. As a member of $S$, the positivity of the derivative $f'_{Z_i}(z)$ for $z < 0$ is concluded from the Lemma. Finally, we note that the loss is the integral of the density, i.e. $l_{cdf}$, the cumulative density function,

\[ E(x_i, y_i, \mu, \sigma) = \int_{-\infty}^{-y_i(\mu \cdot x_i)} f_{Z_i}(z)\, dz . \]

Thus, the second derivative of $E(x_i, y_i, \mu, \sigma)$ with respect to the margin, for positive values of the margin, equals $f'_{Z_i}\big( -y_i(\mu \cdot x_i) \big)$, and is hence positive. Changing variables according to (6) completes the proof. ∎
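The auxiliary lemma is easy to illustrate numerically; a minimal grid-based sketch (our addition, with an arbitrary pair of densities from $S$):

# Numerical illustration (our addition) of the auxiliary lemma: the
# convolution of two even, strictly unimodal densities is again even and
# unimodal. Here we convolve two Laplace densities on a grid.
import numpy as np

z = np.linspace(-20.0, 20.0, 4001)
dz = z[1] - z[0]
f = 0.5 * np.exp(-np.abs(z))             # Laplace(0, 1), a member of S
g = 0.25 * np.exp(-np.abs(z) / 2.0)      # Laplace(0, 2), a member of S
h = np.convolve(f, g, mode="same") * dz  # grid approximation of h = f * g

print("total mass ~ 1 :", h.sum() * dz)
print("even           :", np.allclose(h, h[::-1]))
core_neg = (z > -10) & (z < 0)           # stay away from grid edge effects
core_pos = (z > 0) & (z < 10)
print("unimodal       :", np.all(np.diff(h[core_neg]) > 0) and
                          np.all(np.diff(h[core_pos]) < 0))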

D  Proof of the Auxiliary Lemma

Proof: Assume $f, g \in S$ and denote $h = f * g$. The derivative of a convolution between two differentiable functions always exists, and equals $h' = f * g'$. We compute for the convolution derivative,

\[ h'(z) = \int_{-\infty}^{\infty} f(z - t)\, g'(t)\, dt = \int_{0}^{\infty} g'(t) \big[ f(z - t) - f(z + t) \big]\, dt , \]

where the last equality follows from the fact that $g'(t)$ is an odd function, as the derivative of an even function. Since $f, g \in S$, $h \in C^1$, i.e. it is continuously differentiable almost everywhere, and since $h'(z)$ is odd, we have that $h(z)$ is even. Using the monotonicity property of $f, g$, i.e. $|z_2| > |z_1| \Rightarrow f(z_2) < f(z_1)$, we get for every $t > 0$,

\[ \mathrm{sign}\big( f(z - t) - f(z + t) \big) = \mathrm{sign}(z), \qquad g'(t) \leq 0 . \]

Since $f, g$ are pdfs, the integral is always defined, and thus the sign of the derivative of $h$ depends on (and is opposite to) the sign of its argument; in particular, $h$ is an increasing function for $z < 0$ and a decreasing function for $z > 0$, yielding the third property for $h$. Thus, $h \in S$, as desired. ∎

E  Proof of Lemma 5

Proof: Setting $\mu = 0$ and $\sigma = 1$, the objective becomes $cm\eta$. Since the loss is non-negative, we get that the minimizers $(\mu^*, \sigma^*)$ satisfy,

\[ cm\eta \;\geq\; \sigma^* \sum_{j=1}^{d} e^{-|\mu^*_j| / \sigma^*} . \]

Substituting the optimal value of $\sigma$ from (8) and rearranging, we get $\sigma^* \exp\{cm\eta\} \geq e^{-1}$, and we can conclude,

\[ \sigma^* \;\geq\; \exp\left\{ -cm\eta - 1 \right\} . \]

∎

F  Proof of Theorem 6

Proof: While the empirical loss term depends only on $\mu$, and was proved to be strictly convex for examples that satisfy $y_i(\mu \cdot x_i) > 0$ in Theorem 4, the regularization term is optimized over both $\mu$ and $\sigma$. Incorporating the optimal value for sigma from (8) into the objective yields the following,

\[ F(\mu, \sigma^*) = e^{-\beta(\mu)} + c \sum_{i} l\big( y_i\, \mu \cdot x_i \big), \]

where $\beta = \beta(\mu)$ abbreviates the exponent induced by (8). Differentiating the regularization term twice with respect to $\mu$ results in the following Hessian matrix,

\[ H = e^{-\beta}\, \mathrm{diag}\big( \exp\{-|\mu|\} \big) - e^{-2\beta}\, v\, v^{\top}, \]

for the $d$-dimensional vector $v = \mathrm{sign}(\mu) \odot \exp\{-|\mu|\}$, where $\mathrm{diag}(\exp\{-|\mu|\})$ is a diagonal matrix whose $i$-th diagonal element equals $\exp\{-|\mu_i|\}$. The Hessian $H$ is a difference of two positive semi-definite matrices. We upper bound the maximal eigenvalue of the second term by its trace; indeed,

\[ \lambda_{\max}\big( e^{-2\beta}\, v v^{\top} \big) = e^{-2\beta} \|v\|_2^2 = e^{-2\beta} \sum_{j} e^{-2|\mu_j|} < \frac{1}{2} . \]
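The matrix fact invoked in the second part of this proof ([7, Corollary 7.2.3]) can be spot-checked numerically; a minimal sketch (our addition):

# Numerical spot-check (our addition) of the fact used below: a symmetric
# diagonally dominant matrix with non-negative diagonal entries is positive
# semi-definite (this also follows from the Gershgorin circle theorem).
import numpy as np

rng = np.random.default_rng(1)
ok = True
for _ in range(1000):
    d = int(rng.integers(2, 8))
    M = rng.normal(size=(d, d))
    M = (M + M.T) / 2.0                           # symmetrize
    row_off = np.abs(M).sum(axis=1) - np.abs(M.diagonal())
    np.fill_diagonal(M, row_off + rng.random(d))  # diag >= off-diag row sum
    ok &= np.linalg.eigvalsh(M).min() >= -1e-9
print("diagonally dominant + non-negative diagonal => PSD :", ok)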

Thus, the minimal eigenvalue of $H$ is bounded from below by $-\frac{1}{2}$, and the Hessian of the sum of the objective and $\frac{1}{2}\|\mu\|_2^2$ has positive eigenvalues; the sum is therefore strictly convex.

For the second part, we use [7, Corollary 7.2.3], stating that a diagonally-dominant matrix with non-negative diagonal values is PSD. We next show that $\|\mu\|_\infty \leq a$ is indeed a sufficient condition for the Hessian to be diagonally dominant. It is straightforward to verify that both conditions follow from a set of $d$ inequalities, one per row, between the diagonal entries of $H$ and the corresponding off-diagonal row sums. Fixing $\mu_j$, the left-hand side of each inequality is decomposed to a sum of one-variable convex functions; we minimize it for each $k$ by taking the derivative and setting it to zero. From here we conclude that the inequalities are satisfied if $|\mu_k| \leq a$ for all $k$, for a scalar $a$ that satisfies,

\[ g(a) = 2 e^{-a} + a\, e^{-a} - a \;\geq\; 0 . \]

The function $g(a)$ is monotonically decreasing and continuous, with $g(1) = 3/e - 1 > 0$, which completes the proof. In fact, one can compute numerically the largest $a$ satisfying $g(a) \geq 0$, which leads to a slightly better constant than stated in the theorem. ∎

G  Proof of Lemma 7

Proof: We first need to compute $l_{lin}$ directly, as $\alpha_j(x)$ is not defined on the standard basis, which contains a few elements of the same value. For a standard basis vector $e_j$, with $\omega_j \sim \mathrm{Laplace}(\mu_j, \sigma)$,

\[ \Pr\big( y(\omega \cdot e_j) \leq 0 \big) = \Pr\big( y\, \omega_j \leq 0 \big) = \begin{cases} \frac{1}{2} \exp\left\{ -\frac{|\mu_j|}{\sigma} \right\} & y \mu_j \geq 0 \\[1mm] 1 - \frac{1}{2} \exp\left\{ -\frac{|\mu_j|}{\sigma} \right\} & y \mu_j < 0 . \end{cases} \]

Thus, if $y \mu_j \geq 0$ we get the convex part, $\frac{1}{2} \exp\{-|\mu_j|/\sigma\}$. Otherwise, we bound $\Pr( y\,\omega_j \leq 0 )$ with the linear extension and get $\frac{1}{2}\left( 1 + \frac{|\mu_j|}{\sigma} \right)$. To conclude, for each element we get that,

\[ \sum_{y = \pm 1} l_{lin}\big( y\, (\mu \cdot e_j) \big) = \frac{1}{2} \exp\left\{ -\frac{|\mu_j|}{\sigma} \right\} + \frac{1}{2}\left( 1 + \frac{|\mu_j|}{\sigma} \right). \]

Taking the sum over $j$ and multiplying by $2\sigma$ yields the above regularization term. ∎

H  RobuCoP Pseudo-code

Input: training set $S = \{(x_i, y_i)\}_{i=1}^{m}$; trade-off parameter $c > 0$; artificial set $A = \{(e_j, y) : j = 1, \dots, d,\ y \in \mathcal{Y}\}$.
Initialization: $\mu^0 = 0$.
Loop for $n = 1, 2, \dots$ until the convergence criterion is met:
    Set: $\sigma^n$ to the optimal value given by (8).
    Solve:
    \[ \mu^{n} = \arg\min_{\mu} \sum_{(x_i, y_i) \in S \cup A} c_i\, l_{lin}\big( y_i (\mu \cdot x_i) \big), \qquad c_i = \begin{cases} c & (x_i, y_i) \in S \\ 2\sigma^n & (x_i, y_i) \in A . \end{cases} \]
Output: $\mu$, $\sigma$.
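A rough, self-contained Python sketch of this alternating scheme (our addition, not the paper's reference implementation; the $\sigma$ update below is a simplified placeholder for the closed form (8), while `l_lin` follows the surrogate of Lemma 7):

# Illustrative sketch (our addition) of the RobuCoP alternating scheme.
# The sigma update is a placeholder for the paper's closed form (8); the
# weighted l_lin minimization follows the pseudo-code above.
import numpy as np
from scipy.optimize import minimize

def l_lin(z):
    # Convex surrogate: exponential for z >= 0, linear extension for z < 0.
    z = np.asarray(z, dtype=float)
    out = 0.5 * (1.0 - z)
    pos = z >= 0
    out[pos] = 0.5 * np.exp(-z[pos])
    return out

def robucop(X, y, c=1.0, n_iter=20):
    m, d = X.shape
    XA = np.vstack([np.eye(d), np.eye(d)])      # artificial set A: (e_j, +/-1)
    yA = np.concatenate([np.ones(d), -np.ones(d)])
    Xall, yall = np.vstack([X, XA]), np.concatenate([y, yA])
    mu, sigma = np.zeros(d), 1.0
    for _ in range(n_iter):
        sigma = np.exp(-np.abs(mu).mean())      # placeholder for eq. (8)
        w = np.concatenate([np.full(m, c), np.full(2 * d, 2.0 * sigma)])
        obj = lambda u: float(np.sum(w * l_lin(yall * (Xall @ u))))
        mu = minimize(obj, mu, method="L-BFGS-B").x
    return mu, sigma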

I  Proof of Lemma 8

Proof: Let the update change coordinate $j$ by $\delta_t = \mu_j^{t+1} - \mu_j^t$, and let $q_i^t = D_i\, e^{-y_i(\mu^t \cdot x_i)}$. Denote the change of the loss term of (2) by,

\[ \Delta_t = \sum_{i} D_i\, e^{-y_i(\mu^t \cdot x_i)} - \sum_{i} D_i\, e^{-y_i(\mu^{t+1} \cdot x_i)} . \]

We start by bounding $\Delta_t$ from below, then add to it the difference of the regularization term before and after the update. Bounding the improvement for a single example, we get, using $D_i\, e^{-y_i(\mu^{t+1} \cdot x_i)} = q_i^t\, e^{-y_i x_{i,j} \delta_t}$,

\[ \Delta_{t,i} = D_i\, e^{-y_i(\mu^t \cdot x_i)} - D_i\, e^{-y_i(\mu^{t+1} \cdot x_i)} = q_i^t \left( 1 - e^{-y_i x_{i,j}\, \delta_t} \right). \]

Convexity of the exponent, for every $j$ with $|x_{i,j}| \leq 1$, yields,

\[ e^{-y_i x_{i,j}\, \delta_t} \;\leq\; \big( 1 - |x_{i,j}| \big) + |x_{i,j}|\, e^{-\mathrm{sign}(y_i x_{i,j})\, \delta_t} . \]

Summing over the examples, with

\[ \gamma_+ = \sum_{i:\, y_i x_{i,j} \geq 0} q_i^t\, |x_{i,j}|, \qquad \gamma_- = \sum_{i:\, y_i x_{i,j} < 0} q_i^t\, |x_{i,j}|, \]

we get,

\[ \Delta_t \;\geq\; c \left[ \gamma_+ \left( 1 - e^{-\delta_t} \right) + \gamma_- \left( 1 - e^{\delta_t} \right) \right]. \]

Adding the regularization terms completes the proof. ∎

J  Proof of Lemma 9

Proof: Without loss of generality we assume that $c\,\gamma_+ e^{-\mu_j^t} - c\,\gamma_- e^{\mu_j^t} \geq \frac{1}{\sigma} > 0$, and in addition we assume, for the sake of contradiction, that $\mu_j^{t+1} - \mu_j^t < 0$. Differentiating the objective with respect to $\mu_j$ and equating to zero at $\mu_j^{t+1}$ yields:

\[ c\,\gamma_+\, e^{-\mu_j^{t+1}} - c\,\gamma_-\, e^{\mu_j^{t+1}} = \frac{\mathrm{sign}\big( \mu_j^{t+1} \big)}{\sigma} \left( 1 - e^{-|\mu_j^{t+1}|/\sigma} \right). \]

Arranging the terms: the right-hand side is strictly smaller than $\frac{1}{\sigma}$, while for the left-hand side, since $\mu_j^{t+1} < \mu_j^t$,

\[ c\,\gamma_+\, e^{-\mu_j^{t+1}} - c\,\gamma_-\, e^{\mu_j^{t+1}} \;>\; c\,\gamma_+\, e^{-\mu_j^{t}} - c\,\gamma_-\, e^{\mu_j^{t}} \;\geq\; \frac{1}{\sigma} . \]

This is a contradiction, so we must have that $\mu_j^{t+1} \geq \mu_j^t$. The proof for the symmetric case follows similarly. ∎

K  Experiments: Data Details

Synthetic data: We generated 4,000 vectors $x_i \in \mathbb{R}^8$ sampled from a zero-mean isotropic normal distribution, $x_i \sim N(0, I)$. Labels were assigned by generating, once per run, $\omega \in \mathbb{R}^8$ at random, and using $y_i = \mathrm{sign}(\omega \cdot x_i)$. Each input $x_i$ in the training data was then corrupted, with probability $p$, by adding to it a random vector sampled from a zero-mean isotropic Gaussian, $\epsilon_i \sim N(0, \sigma I)$, with some positive standard deviation $\sigma$. Each run was repeated 20 times, and the reported results are the average test error over the 20 runs. All boosting algorithms were run for 1,000 iterations, except for the RobuCoP algorithm, which was executed until its convergence criterion was met, often after about 20 rounds.

Vocal Joystick: For each problem, we picked three sets of 2,000 examples each, for training, parameter tuning, and testing. Each example is a frame of a spoken vowel, described by 13 MFCC coefficients transformed into 27 features. In order to examine the robustness of the different algorithms, we contaminated 10% of the data with additive zero-mean i.i.d. Gaussian noise, for different values of the standard deviation $\sigma$.
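A minimal sketch of the synthetic-data protocol above (our addition; the corruption probability $p$ and the noise level passed below are illustrative placeholder values):

# Sketch (our addition) of the synthetic data protocol described above:
# Gaussian inputs, labels from a random linear separator drawn once per run,
# and additive Gaussian corruption applied to a random subset of the inputs.
import numpy as np

def make_synthetic(n=4000, d=8, p=0.2, noise_std=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))              # x_i ~ N(0, I)
    w = rng.normal(size=d)                   # separator, drawn once per run
    y = np.sign(X @ w)
    corrupt = rng.random(n) < p              # each input corrupted w.p. p
    X = X + corrupt[:, None] * rng.normal(scale=noise_std, size=(n, d))
    return X, y

X, y = make_synthetic()
print(X.shape, y[:10])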