Supplementary Material

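Wenzhuo Yang, a0096049@nus.edu.sg
Department of Mechanical Engineering, National University of Singapore, Singapore 117576

Huan Xu, mpexuh@nus.edu.sg
Department of Mechanical Engineering, National University of Singapore, Singapore 117576

1. Notation

Table of notation:

$[m]$ : the set $\{1, \dots, m\}$
$g$ : a subset of $[m]$
$g^c$ : the complement of $g$, $g^c = [m] \setminus g$
$I$ : the identity matrix
$X$ : the sample matrix, $X \in \mathbb{R}^{n \times m}$
$X_i$ : the $i$th column of $X$
$\beta$ : vector, $\beta \in \mathbb{R}^m$
$\beta_i$ : the $i$th element of $\beta$
$\beta_g$ : the vector whose $i$th element is $\beta_i$ if $i \in g$ or 0 otherwise
$\Delta^{(i)}$ : the $i$th disturbance matrix
$\Delta^{(i)}_j$ : the $j$th column of $\Delta^{(i)}$
$\Delta_g$ : the matrix whose $i$th column is $\Delta_i$ if $i \in g$ or 0 otherwise
$W_i$ : a real matrix
$\mathrm{vec}(\cdot)$ : the operator vectorizing a matrix by stacking its columns
$\|X\|_p$ : the $\ell_p$-norm of $\mathrm{vec}(X)$, i.e. $\|\mathrm{vec}(X)\|_p$

2. Proofs in Section 2

To prove the corollaries in Section 2, we give the following lemma.

Lemma 1. If any two different groups $g_p$ and $g_q$ in $G_i$ in the uncertainty set $U$ (4) are non-overlapping for $i = 1, \dots, t$, that is, $g_p \cap g_q = \emptyset$, then the optimization problem (5) is equivalent to

$$\min_{\beta \in \mathbb{R}^m} \Big\{ \|y - X\beta\|_p + \sum_{i=1}^{t} \sum_{g \in G_i} c_g \|(W_i\beta)_g\|_{p^*} \Big\}, \quad (1)$$

where $\|\cdot\|_{p^*}$ denotes the dual norm of $\|\cdot\|_p$.

Proof. Since any two different groups $g_p$ and $g_q$ in $G_i$ are non-overlapping, the maximization separates across groups, and we have

$$\max_{\forall g \in G_i,\ \|\alpha^{(i)}_g\|_p \le c_g} \alpha^{(i)\top} W_i \beta = \sum_{g \in G_i}\ \max_{\|\alpha^{(i)}_g\|_p \le c_g} \alpha^{(i)\top}_g (W_i\beta)_g = \sum_{g \in G_i} c_g \|(W_i\beta)_g\|_{p^*}. \quad (2)$$

Hence the lemma holds.

By using Theorem 3 and Lemma 1, we obtain the proofs below.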

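Submission and Formatting Instructions for ICML 2013

1. Proof of Corollary 1: $G_1 = \{[m]\}$ satisfies the condition of Lemma 1, so we have

$$\sum_{i=1}^{t}\sum_{g \in G_i} c_g \|(W_i\beta)_g\|_{p^*} = \sum_{g \in G_1} c_g \|\beta_g\|_2 = c\|\beta\|_2. \quad (3)$$

2. Proof of Corollary 2: $G_1 = \{\{1\}, \dots, \{m\}\}$ satisfies the condition of Lemma 1, so

$$\sum_{i=1}^{t}\sum_{g \in G_i} c_g \|(W_i\beta)_g\|_{p^*} = \sum_{g \in G_1} c_g \|\beta_g\|_{p^*} = \sum_{i=1}^{m} c_i |\beta_i|. \quad (4)$$

3. Proof of Corollary 3: $G_1 = \{g_1, \dots, g_k\}$ satisfies the condition of Lemma 1, so we have

$$\sum_{i=1}^{t}\sum_{g \in G_i} c_g \|(W_i\beta)_g\|_{p^*} = \sum_{i=1}^{k} c_{g_i} \|\beta_{g_i}\|_{p^*}. \quad (5)$$

4. Proof of Theorem 2: $G_i = \{g_i, g_i^c\}$ satisfies the condition of Lemma 1 and $c_{g_i^c} = 0$, so that

$$\sum_{i=1}^{k}\sum_{g \in G_i} c_g \|(W_i\beta)_g\|_{p^*} = \sum_{i=1}^{k} \big(c_{g_i}\|\beta_{g_i}\|_{p^*} + c_{g_i^c}\|\beta_{g_i^c}\|_{p^*}\big) = \sum_{i=1}^{k} c_{g_i}\|\beta_{g_i}\|_{p^*}. \quad (6)$$

5. Proof of Corollary 4: The dual problem of the optimization problem

$$\min_{\sum_i v_i = \beta,\ \mathrm{supp}(v_i) \subseteq g_i}\ \sum_{i=1}^{k} c_{g_i}\|v_i\|_{p^*}$$

can be formulated as

$$\begin{aligned}
&\max_{\alpha}\ \min_{\forall i,\ \mathrm{supp}(v_i) \subseteq g_i} \Big\{ \sum_{i=1}^{k} c_{g_i}\|v_i\|_{p^*} - \alpha^\top \sum_{i=1}^{k} v_i + \alpha^\top\beta \Big\} \\
&= \max_{\alpha} \Big\{ \alpha^\top\beta + \min_{\forall i,\ \mathrm{supp}(v_i) \subseteq g_i} \Big\{ \sum_{i=1}^{k} c_{g_i}\|v_i\|_{p^*} - \sum_{i=1}^{k} \alpha_{g_i}^\top v_i \Big\} \Big\} \\
&= \max_{\alpha} \Big\{ \alpha^\top\beta - \max_{\forall i,\ \mathrm{supp}(v_i) \subseteq g_i} \Big\{ \sum_{i=1}^{k} \alpha_{g_i}^\top v_i - \sum_{i=1}^{k} c_{g_i}\|v_i\|_{p^*} \Big\} \Big\} \\
&= \max_{\forall i,\ \|\alpha_{g_i}\|_p \le c_{g_i}} \alpha^\top\beta. \quad (7)
\end{aligned}$$

The last step holds because the inner maximum is 0 when $\|\alpha_{g_i}\|_p \le c_{g_i}$ for all $i$ and is unbounded otherwise. Since the constraints in the primal problem satisfy Slater's condition, strong duality holds. From the duality and the condition in Corollary 4, we have

$$\begin{aligned}
&\min_{\beta \in \mathbb{R}^m} \Big\{ \|y - X\beta\|_p + \sum_{i=1}^{t}\ \max_{\forall g \in G_i,\ \|\alpha^{(i)}_g\|_p \le c_g} \alpha^{(i)\top} W_i\beta \Big\} \\
&= \min_{\beta \in \mathbb{R}^m} \Big\{ \|y - X\beta\|_p + \max_{\forall g \in G_1,\ \|\alpha_g\|_p \le c_g} \alpha^\top\beta \Big\} \\
&= \min_{\beta \in \mathbb{R}^m} \Big\{ \|y - X\beta\|_p + \min_{\sum_i v_i = \beta,\ \mathrm{supp}(v_i) \subseteq g_i}\ \sum_{i=1}^{k} c_{g_i}\|v_i\|_{p^*} \Big\}. \quad (8)
\end{aligned}$$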
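The proofs above repeatedly use the dual-norm identity $\max_{\|\alpha\|_p \le c} \alpha^\top v = c\|v\|_{p^*}$, applied group by group when the groups do not overlap. The following is a minimal numerical sketch of that identity, not code from the paper. The helper names (`lp_norm`, `dual_exponent`, `maximizer`, `group_maximum`, `group_penalty`), the choice $p = 3$, the example groups, and the bounds are all illustrative assumptions, and the closed-form maximizer is only valid for $1 < p < \infty$.

```python
import random

def lp_norm(v, p):
    """l_p norm of a list of floats (1 <= p < infinity)."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def dual_exponent(p):
    """Conjugate exponent p* with 1/p + 1/p* = 1, for 1 < p < infinity."""
    return p / (p - 1.0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maximizer(v, p, c):
    """Closed-form alpha attaining max alpha.v subject to ||alpha||_p <= c."""
    q = dual_exponent(p)
    nq = lp_norm(v, q)
    if nq == 0.0:
        return [0.0] * len(v)
    # alpha_j = c * sign(v_j) * |v_j|^(q-1) / ||v||_q^(q-1)
    return [c * (1.0 if x >= 0 else -1.0) * abs(x) ** (q - 1.0) / nq ** (q - 1.0)
            for x in v]

def group_maximum(v, groups, bounds, p):
    """max alpha.v with ||alpha_g||_p <= c_g, solved separately on each group."""
    total = 0.0
    for g, c in zip(groups, bounds):
        v_g = [v[j] for j in g]
        total += dot(maximizer(v_g, p, c), v_g)
    return total

def group_penalty(v, groups, bounds, p):
    """sum_g c_g * ||v_g||_{p*} for non-overlapping index groups."""
    q = dual_exponent(p)
    return sum(c * lp_norm([v[j] for j in g], q) for g, c in zip(groups, bounds))

random.seed(0)
p = 3.0                                  # assumed exponent, so p* = 1.5
v = [random.uniform(-1.0, 1.0) for _ in range(6)]
groups = [[0, 1], [2, 3, 4], [5]]        # hypothetical disjoint groups
bounds = [0.5, 1.0, 2.0]                 # hypothetical bounds c_g
lhs = group_maximum(v, groups, bounds, p)
rhs = group_penalty(v, groups, bounds, p)
print(abs(lhs - rhs) < 1e-9)
```

With overlapping groups the maximization no longer separates, which is why Corollary 4 goes through the dual problem (7) instead.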

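6. Proof of Corollary 5: From Theorem 2 and Lemma 1, we have

$$\sum_{i=1}^{t}\sum_{g \in G_i} c_g \|(W_i\beta)_g\|_{p^*} = \sum_{g \in G_1} c_g\|\beta_g\|_{p^*} + \sum_{g \in G_2} c_g\|(W_2\beta)_g\|_{p^*} = \sum_{i=1}^{m} c_i|\beta_i| + \sum_{i=1}^{m-1} c_i|\beta_i - \beta_{i+1}|. \quad (9)$$

7. Proof of Corollary 6: By using the proofs of Corollary 1 and Corollary 3, we can obtain Corollary 6.

8. Proof of Corollary 7: $G_1 = \{\{1\}, \dots, \{m\}\}$ satisfies the condition of Lemma 1. Since $t = 1$, $c_{\{i\}} = \lambda$ and $W_1 = D$, we have

$$\sum_{i=1}^{t}\sum_{g \in G_i} c_g \|(W_i\beta)_g\|_{p^*} = \sum_{g \in G_1} \lambda \|(D\beta)_g\|_{p^*} = \lambda \sum_{i} |(D\beta)_i| = \lambda\|D\beta\|_1. \quad (10)$$

3. Proofs in Section 3

3.1. Proof of Theorem 4:

From the definition of $\hat{U}$, we have

$$\begin{aligned}
\max_{\Delta \in \hat{U}} \|y - (X + \Delta)\beta\|_p
&= \max_{c \in Z}\ \max_{\forall i,\ \forall g \in G_i,\ \|\Delta^{(i)}_g\|_p \le c_g} \|y - (X + \Delta)\beta\|_p \\
&= \|y - X\beta\|_p + \max_{c \in Z}\ \sum_{i=1}^{t}\ \max_{\forall g \in G_i,\ \|\alpha^{(i)}_g\|_p \le c_g} \alpha^{(i)\top} W_i\beta \\
&= \|y - X\beta\|_p + \max_{c \ge 0;\ f_i(c) \le 0}\ \sum_{i=1}^{t}\ \max_{\forall g \in G_i,\ \|\alpha^{(i)}_g\|_p \le c_g} \alpha^{(i)\top} W_i\beta \\
&= \|y - X\beta\|_p + \min_{\lambda \in \mathbb{R}^q_+,\ \kappa \in \mathbb{R}^k_+}\ \max_{c \in \mathbb{R}^k} \Big\{ \sum_{i=1}^{t}\ \max_{\forall g \in G_i,\ \|\alpha^{(i)}_g\|_p \le c_g} \alpha^{(i)\top} W_i\beta + \kappa^\top c - \sum_{i=1}^{q} \lambda_i f_i(c) \Big\} \\
&= \|y - X\beta\|_p + \min_{\lambda \in \mathbb{R}^q_+,\ \kappa \in \mathbb{R}^k_+} \upsilon(\lambda, \kappa, \beta). \quad (11)
\end{aligned}$$

Hence we establish the theorem by taking the minimum over $\beta$ on both sides.

Now we show that the optimization problem is convex and tractable. We first prove that $\upsilon(\lambda, \kappa, \beta)$ is a convex function of $\lambda, \kappa, \beta$. We have

$$\upsilon(\lambda, \kappa, \beta) = \max_{c \in \mathbb{R}^k,\ \forall i,\ \forall g \in G_i,\ \|\alpha^{(i)}_g\|_p \le c_g} \Big\{ \sum_{i=1}^{t} \alpha^{(i)\top} W_i\beta + \kappa^\top c - \sum_{i=1}^{q} \lambda_i f_i(c) \Big\} = \max_{c \in \mathbb{R}^k,\ \forall i,\ \forall g \in G_i,\ \|\alpha^{(i)}_g\|_p \le c_g} \mu(\lambda, \kappa, \beta). \quad (12)$$

For fixed $c$ and $\alpha^{(i)}$, $\mu(\lambda, \kappa, \beta)$ is a linear function of $\lambda, \kappa, \beta$. Thus $\upsilon(\lambda, \kappa, \beta)$, being a pointwise maximum of linear functions, is convex, which implies that the optimization problem is convex. By choosing a parameter $\gamma$, the optimization problem can be reformulated as

$$\min\ \|y - X\beta\|_p \quad \text{s.t.} \quad \upsilon(\lambda, \kappa, \beta) \le \gamma,\quad \lambda \in \mathbb{R}^q_+,\ \kappa \in \mathbb{R}^k_+,\ \beta \in \mathbb{R}^m.$$

To show that the problem is tractable, it suffices to construct a polynomial-time separation oracle for the feasible set $S$ (Grötschel et al., 1988). A separation oracle is a routine that, for a solution $(\lambda_0, \kappa_0, \beta_0)$, can determine in polynomial time (a) whether $(\lambda_0, \kappa_0, \beta_0)$ belongs to $S$ or not; and (b) if $(\lambda_0, \kappa_0, \beta_0) \notin S$, a hyperplane that separates $(\lambda_0, \kappa_0, \beta_0)$ from $S$.

To verify the feasibility of $(\lambda_0, \kappa_0, \beta_0)$, notice that $(\lambda_0, \kappa_0, \beta_0) \in S$ if and only if the optimal value of the optimization problem (12) is smaller than or equal to $\gamma$, which can be verified in polynomial time. If $(\lambda_0, \kappa_0, \beta_0) \notin S$, then by solving (12) we can find in polynomial time $c_0$ and $\alpha^{(i)}_0$ such that, at $(\lambda, \kappa, \beta) = (\lambda_0, \kappa_0, \beta_0)$,

$$\sum_{i=1}^{t} \alpha^{(i)\top}_0 W_i\beta + \kappa^\top c_0 - \sum_{i=1}^{q} \lambda_i f_i(c_0) > \gamma,$$

which gives the hyperplane that separates $(\lambda_0, \kappa_0, \beta_0)$ from $S$.

3.2. Extension of Corollary 8:

Theorem 1. Let $g_1, \dots, g_t$ be $t$ groups such that $\bigcup_{i=1}^{t} g_i = [m]$, and let $\Delta_i$ be a matrix whose columns except the $i$th one are all zero. Suppose that $c_{g_i}$ is a $|g_i|$-dimensional vector whose elements give the norm bounds of $\Delta_j$ for $j \in g_i$, e.g. $\|\Delta_j\|_2 \le c^j_{g_i}$, and $c = (c_{g_1}, \dots, c_{g_t})$. We define the uncertainty set as

$$\hat{U} = \Big\{ \sum_{i=1}^{t}\sum_{j \in g_i} \Delta_j \ \Big|\ \exists c \text{ such that } c \ge 0 \text{ and } \|c_{g_i}\|_q \le s_i,\ \forall i \in [t];\ \|\Delta_j\|_2 \le c^j_{g_i},\ \forall i \in [t],\ \forall j \in g_i \Big\},$$

then the equivalent linear regularized regression problem is

$$\min_{\beta \in \mathbb{R}^m} \Big\{ \|y - X\beta\|_p + \sum_{i=1}^{t} s_i \|\beta_{g_i}\|_{q^*} \Big\},$$

where $q^*$ is the dual norm of $q$.

Proof. From Theorem 3 and Theorem 4, we have

$$\min_{\lambda \ge 0,\ \kappa \ge 0} \upsilon(\lambda, \kappa, \beta) = \min_{\lambda \ge 0,\ \kappa \ge 0}\ \max_{c} \Big\{ \sum_{i=1}^{t}\sum_{j \in g_i} (\kappa_j + |\beta_j|)\, c^j_{g_i} + \sum_{i=1}^{t} \lambda_i \big(-\|c_{g_i}\|_q + s_i\big) \Big\}.$$

Define $r_i$ as the vector whose elements are $\kappa_j + |\beta_j|$ for $j \in g_i$. The inner maximum over $c_{g_i}$ is finite only if $\|r_i\|_{q^*} \le \lambda_i$, so the equation above is equivalent to

$$\min_{\lambda \ge 0,\ \kappa \ge 0,\ \|r_i\|_{q^*} \le \lambda_i\ \forall i \in [t]} \lambda^\top s = \sum_{i=1}^{t} s_i \|\beta_{g_i}\|_{q^*},$$

which establishes the theorem.

4. Proofs in Section 5

Recall that the uncertainty set considered in this paper is

$$U = \big\{ \Delta^{(1)} W_1 + \cdots + \Delta^{(t)} W_t \ \big|\ \forall i,\ \forall g \in G_i,\ \|\Delta^{(i)}_g\|_2 \le c_g \big\}, \quad (13)$$

where $G_i$ is the set of the groups of $\Delta^{(i)}$ and $c_g$ gives the bound of $\Delta^{(i)}_g$ for group $g$. We denote by $\bar{G}_i$ and $\bar{G}^c_i$ the sets $\{g \in G_i \mid c_g \ne 0\}$ and $G_i \setminus \bar{G}_i$, respectively. In this theorem, we restrict our discussion to the case where $W_i = I$ for $i = 1, \dots, t$ and the bound $c_g$ of $\Delta^{(i)}_g$ for each group $g$ equals $c_n$ or 0, so the uncertainty set can be rewritten as

$$U = \big\{ \Delta^{(1)} + \cdots + \Delta^{(t)} \ \big|\ \forall i,\ \forall g \in \bar{G}_i,\ \|\Delta^{(i)}_g\|_2 \le c_n \big\}. \quad (14)$$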

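Note that the constraint $\|\Delta\|_2 \le c$ can be reformulated as the union of several element-wise constraints. Denote $\mathcal{D} = \{D \mid \sum_i \sum_j D_{ij}^2 = c^2,\ D_{ij} \ge 0\}$ (we call an element $D \in \mathcal{D}$ a decomposition), then we have

$$\{\Delta \mid \|\Delta\|_2 \le c\} = \bigcup_{D \in \mathcal{D}} \{\Delta \mid \forall i, j,\ |\Delta_{ij}| \le D_{ij}\}.$$

Similarly, the uncertainty set $\{\Delta_g \mid \|\Delta_g\|_2 \le c_g\}$ is equivalent to

$$\bigcup_{D_g \in \mathcal{D}_g} \{\Delta_g \mid \forall i,\ \forall j \in g,\ |\Delta_{ij}| \le D_{ij}\},$$

where $\mathcal{D}_g = \{D \mid \sum_i \sum_{j \in g} D_{ij}^2 = c_g^2,\ D_{ij} \ge 0\}$.

After the constraints of the uncertainty sets are decomposed into element-wise constraints, the set $\{X + \Delta^{(1)} + \cdots + \Delta^{(t)}\}$ can also be represented in an element-wise way. The notation is a little complicated, so we first consider three simple cases:

One uncertainty set $\Delta$ such that $\|\Delta\|_2 \le c$: for fixed $D \in \mathcal{D}$, we have $\{X_{ij} + \Delta_{ij}\} = [X_{ij} - D_{ij},\ X_{ij} + D_{ij}]$.

Two uncertainty sets $\Delta^{(1)}$ and $\Delta^{(2)}$ such that $\|\Delta^{(1)}\|_2 \le c$ and $\|\Delta^{(2)}\|_2 \le c$: for fixed $D^{(1)} \in \mathcal{D}$ and $D^{(2)} \in \mathcal{D}$, we have $\{X_{ij} + \Delta^{(1)}_{ij} + \Delta^{(2)}_{ij}\} = [X_{ij} - D^{(1)}_{ij} - D^{(2)}_{ij},\ X_{ij} + D^{(1)}_{ij} + D^{(2)}_{ij}]$.

One uncertainty set $\Delta$ and two overlapping groups $g_p$ and $g_q$ such that $\|\Delta_{g_p}\|_2 \le c$ and $\|\Delta_{g_q}\|_2 \le c$: for fixed $P \in \mathcal{D}_{g_p}$ and $Q \in \mathcal{D}_{g_q}$, we have

$$\{X_{ij} + \Delta_{ij}\} = \begin{cases} [X_{ij} - P_{ij},\ X_{ij} + P_{ij}] & j \in g_p,\ j \notin g_q, \\ [X_{ij} - Q_{ij},\ X_{ij} + Q_{ij}] & j \notin g_p,\ j \in g_q, \\ [X_{ij} - \min\{P_{ij}, Q_{ij}\},\ X_{ij} + \min\{P_{ij}, Q_{ij}\}] & j \in g_p,\ j \in g_q. \end{cases}$$

Thus, if the decomposition $D_g \in \mathcal{D}_g$ for each $\Delta^{(i)}_g$ is fixed, we have $\{X_{ij} + \Delta^{(1)}_{ij} + \cdots + \Delta^{(t)}_{ij}\} = [X_{ij} - \gamma_{ij},\ X_{ij} + \gamma_{ij}]$, where $\gamma_{ij}$ is determined by the decompositions $D_g$. Since the number of elements of $\Delta^{(i)}_g$ is less than or equal to $nm$ ($m$ is the feature dimension and $n$ is the number of samples), there exists a decomposition $D_g$ for each $\Delta^{(i)}_g$ such that

$$\Big[X_{ij} - \frac{c_n}{\sqrt{nm}},\ X_{ij} + \frac{c_n}{\sqrt{nm}}\Big] \subseteq [X_{ij} - \gamma_{ij},\ X_{ij} + \gamma_{ij}].$$

We now prove the theorem.

Proposition 1. (Xu et al., 2010) Given a function $h : \mathbb{R}^{m+1} \to \mathbb{R}$ and Borel sets $Z_1, \dots, Z_n \subseteq \mathbb{R}^{m+1}$, let

$$\mathcal{P}_n = \Big\{ \mu \in \mathcal{P} \ \Big|\ \forall S \subseteq \{1, \dots, n\} : \mu\Big(\bigcup_{i \in S} Z_i\Big) \ge |S|/n \Big\}.$$

The following holds:

$$\frac{1}{n}\sum_{i=1}^{n}\ \sup_{(b_i, r_i) \in Z_i} h(b_i, r_i) = \sup_{\mu \in \mathcal{P}_n} \int_{\mathbb{R}^{m+1}} h(b_i, r_i)\, d\mu(b_i, r_i).$$

Step 1: Using the notation above, we first give the following corollary:

Corollary 1. Given $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times m}$, the following equation holds for any $\beta \in \mathbb{R}^m$,

$$\|y - X\beta\|_2 + c_n + \sum_{i=1}^{t}\ \max_{\forall g \in \bar{G}_i,\ \|\alpha^{(i)}_g\|_2 \le c_n} \alpha^{(i)\top}\beta = \sup_{\mu \in \hat{\mathcal{P}}(n)} \sqrt{ n \int_{\mathbb{R}^{m+1}} (b - r^\top\beta)^2\, d\mu(b, r) }. \quad (15)$$

Here,

$$\hat{\mathcal{P}}(n) = \bigcup_{S = \{D^{(i)}_g\},\ D^{(i)}_g \in \mathcal{D}_g,\ \forall i,\ \forall g \in \bar{G}_i} \mathcal{P}_n(X, S, y, c_n),$$

$$\mathcal{P}_n(X, S, y, c_n) = \Big\{ \mu \in \mathcal{P} \ \Big|\ Z_i = \Big[y_i - \frac{c_n}{\sqrt{n}},\ y_i + \frac{c_n}{\sqrt{n}}\Big] \times \prod_{j=1}^{m} [X_{ij} - \gamma_{ij},\ X_{ij} + \gamma_{ij}];\ \forall S' \subseteq \{1, \dots, n\} : \mu\Big(\bigcup_{i \in S'} Z_i\Big) \ge |S'|/n \Big\},$$

where $\gamma_{ij}$ depends on the decomposition set $S$.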
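The element-wise decomposition of the constraint $\|\Delta\|_2 \le c$ used earlier in this section can be checked numerically. The sketch below is an illustration under stated assumptions rather than anything from the paper: the function names (`frob_sq`, `covering_decomposition`, `in_box`) and the particular way of spreading the unused norm budget evenly over all entries are hypothetical choices. It shows both inclusions: any $\Delta$ inside the ball is covered by the box of some decomposition $D$, and any point inside such a box stays inside the ball.

```python
import math
import random

def frob_sq(M):
    """Squared l_2 norm of vec(M) for a matrix given as a list of rows."""
    return sum(x * x for row in M for x in row)

def covering_decomposition(delta, c):
    """Return D with D_ij >= |delta_ij| and sum_ij D_ij^2 == c^2.

    Assumes ||delta||_2 <= c; the unused budget c^2 - ||delta||_2^2 is
    spread evenly over all entries (one choice among many).
    """
    slack = c * c - frob_sq(delta)
    if slack < 0:
        raise ValueError("delta violates the norm bound")
    count = sum(len(row) for row in delta)
    return [[math.sqrt(x * x + slack / count) for x in row] for row in delta]

def in_box(delta, D, tol=1e-12):
    """Check the element-wise constraint |delta_ij| <= D_ij."""
    return all(abs(x) <= d + tol
               for row, drow in zip(delta, D)
               for x, d in zip(row, drow))

random.seed(1)
c = 1.0
raw = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(4)]
scale = 0.8 * c / math.sqrt(frob_sq(raw))
delta = [[scale * x for x in row] for row in raw]     # ||delta||_2 = 0.8 c

D = covering_decomposition(delta, c)                  # ball -> some box
inside = [[random.uniform(-d, d) for d in row] for row in D]  # box -> ball

print(in_box(delta, D),
      abs(frob_sq(D) - c * c) < 1e-9,
      frob_sq(inside) <= c * c + 1e-9)
```

The even split of the budget is the same idea behind the uniform decomposition with entries $c_n/\sqrt{nm}$ mentioned before Proposition 1.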

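Proof. The right-hand side of Equation (15) is equal to

$$\sup_{S = \{D^{(i)}_g\},\ \forall i,\ \forall g \in \bar{G}_i,\ D^{(i)}_g \in \mathcal{D}_g} \Big\{ \sup_{\mu \in \mathcal{P}_n(X, S, y, c_n)} \sqrt{ n \int_{\mathbb{R}^{m+1}} (b - r^\top\beta)^2\, d\mu(b, r) } \Big\}.$$

From Theorem 2, we know that the left-hand side is equal to

$$\begin{aligned}
&\sup_{\forall i,\ \forall g \in \bar{G}_i,\ \|\delta_y\|_2 \le c_n,\ \|\Delta^{(i)}_g\|_2 \le c_n} \|y + \delta_y - (X + \Delta)\beta\|_2 \\
&= \sup_{\forall i,\ \forall g \in \bar{G}_i,\ D^{(i)}_g \in \mathcal{D}_g} \Big\{ \sup_{\|\delta_y\|_2^2 \le c_n^2,\ |\Delta^{(i)}_g| \le D^{(i)}_g} \|y + \delta_y - (X + \Delta)\beta\|_2 \Big\} \\
&= \sup_{\forall i,\ \forall g \in \bar{G}_i,\ D^{(i)}_g \in \mathcal{D}_g} \sqrt{ \sum_{i=1}^{n}\ \sup_{(b_i, r_i) \in [y_i - c_n/\sqrt{n},\ y_i + c_n/\sqrt{n}] \times \prod_{j=1}^{m} [X_{ij} - \gamma_{ij},\ X_{ij} + \gamma_{ij}]} (b_i - r_i^\top\beta)^2 }.
\end{aligned}$$

Furthermore, applying Proposition 1 yields

$$\sqrt{ \sum_{i=1}^{n}\ \sup_{(b_i, r_i) \in [y_i - c_n/\sqrt{n},\ y_i + c_n/\sqrt{n}] \times \prod_{j=1}^{m} [X_{ij} - \gamma_{ij},\ X_{ij} + \gamma_{ij}]} (b_i - r_i^\top\beta)^2 } = \sqrt{ n \sup_{\mu \in \mathcal{P}_n(X, S, y, c_n)} \int_{\mathbb{R}^{m+1}} (b - r^\top\beta)^2\, d\mu(b, r) } = \sup_{\mu \in \mathcal{P}_n(X, S, y, c_n)} \sqrt{ n \int_{\mathbb{R}^{m+1}} (b - r^\top\beta)^2\, d\mu(b, r) },$$

which proves the corollary.

Step 2: As in (Xu et al., 2010), we consider the following kernel estimator given samples $(b_i, r_i)_{i=1}^{n}$,

$$\hat{h}_n(b, r) = (n c^{m+1})^{-1} \sum_{i=1}^{n} K\Big(\frac{(b - b_i,\ r - r_i)}{c}\Big), \quad \text{where } K(x) = I_{[-1,1]^{m+1}}(x)/2^{m+1} \text{ and } c = \frac{c_n}{\sqrt{nm}}. \quad (16)$$

Observe that the estimated distribution above belongs to the set of distributions

$$\mathcal{P}_n(X, S, y, c_n) = \Big\{ \mu \in \mathcal{P} \ \Big|\ Z_i = \Big[y_i - \frac{c_n}{\sqrt{n}},\ y_i + \frac{c_n}{\sqrt{n}}\Big] \times \prod_{j=1}^{m} [X_{ij} - \gamma_{ij},\ X_{ij} + \gamma_{ij}];\ \forall S' \subseteq \{1, \dots, n\} : \mu\Big(\bigcup_{i \in S'} Z_i\Big) \ge |S'|/n \Big\}$$

and hence belongs to $\hat{\mathcal{P}}(n) = \bigcup_{S = \{D^{(i)}_g\},\ D^{(i)}_g \in \mathcal{D}_g,\ \forall i,\ \forall g \in \bar{G}_i} \mathcal{P}_n(X, S, y, c_n)$.

Step 3: Combining the last two steps, and using the fact that $\int |\hat{h}_n(b, r) - h(b, r)|\, d(b, r)$ goes to zero almost surely when $c \to 0$ and $n c^{m+1} \to \infty$, or equivalently $c_n/\sqrt{nm} \to 0$ and $n (c_n/\sqrt{nm})^{m+1} \to \infty$, we now prove consistency of robust regression.

Proof. Let $f(\cdot)$ be the true probability density function of the samples, let $\hat{\mu}_n$ be the estimated distribution using Equation (16) given $S_n$ and $c_n$, and denote its density function by $f_n(\cdot)$. The condition that $\|\beta(c_n, S_n)\|_2 \le H$ almost surely and $P$ has a bounded support implies that there exists a universal constant $C$ such that

$$\max_{b, r}\ (b - r^\top\beta(c_n, S_n))^2 \le C$$

almost surely. By Corollary 1 and $\hat{\mu}_n \in \hat{\mathcal{P}}(n)$, we have

$$\begin{aligned}
\sqrt{ \int (b - r^\top\beta(c_n, S_n))^2\, d\hat{\mu}_n(b, r) }
&\le \sup_{\mu \in \hat{\mathcal{P}}(n)} \sqrt{ \int (b - r^\top\beta(c_n, S_n))^2\, d\mu(b, r) } \\
&= \sqrt{ \frac{1}{n}\sum_{i=1}^{n} (b_i - r_i^\top\beta(c_n, S_n))^2 } + \frac{1}{\sqrt{n}} \Big( \sum_{i=1}^{t}\ \max_{\forall g \in \bar{G}_i,\ \|\alpha^{(i)}_g\|_2 \le c_n} \alpha^{(i)\top}\beta(c_n, S_n) + c_n \Big) \\
&\le \sqrt{ \frac{1}{n}\sum_{i=1}^{n} (b_i - r_i^\top\beta(P))^2 } + \frac{1}{\sqrt{n}} \Big( \sum_{i=1}^{t}\ \max_{\forall g \in \bar{G}_i,\ \|\alpha^{(i)}_g\|_2 \le c_n} \alpha^{(i)\top}\beta(P) + c_n \Big).
\end{aligned}$$

Notice that $\frac{1}{\sqrt{n}} \big( \sum_{i=1}^{t} \max_{\forall g \in \bar{G}_i,\ \|\alpha^{(i)}_g\|_2 \le c_n} \alpha^{(i)\top}\beta(P) + c_n \big)$ converges to 0 as $c_n \to 0$ almost surely, so the right-hand side converges to $\sqrt{ \int (b - r^\top\beta(P))^2\, dP(b, r) }$ as $n \to \infty$ and $c_n \to 0$ almost surely. Furthermore, we have

$$\begin{aligned}
\sqrt{ \int (b - r^\top\beta(c_n, S_n))^2\, dP(b, r) }
&\le \sqrt{ \int (b - r^\top\beta(c_n, S_n))^2\, d\hat{\mu}_n(b, r) } + \sqrt{ \max_{b, r}\ (b - r^\top\beta(c_n, S_n))^2 \int |f_n(b, r) - f(b, r)|\, d(b, r) } \\
&\le \sqrt{ \int (b - r^\top\beta(c_n, S_n))^2\, d\hat{\mu}_n(b, r) } + \sqrt{ C \int |f_n(b, r) - f(b, r)|\, d(b, r) },
\end{aligned}$$

where the last inequality follows from the definition of $C$. Notice that $\int |f_n(b, r) - f(b, r)|\, d(b, r)$ goes to zero almost surely when $c \to 0$ and $n c^{m+1} \to \infty$. Hence the theorem follows.

As mentioned in the paper, the assumption that $\|\beta(c_n, S_n)\|_2 \le H$ in Theorem 7 can be removed; we then have

Theorem 2. Let $\{c_n\}$ converge to zero sufficiently slowly. Then

$$\lim_{n \to \infty} \sqrt{ \int (b - r^\top\beta(c_n, S_n))^2\, dP(b, r) } = \sqrt{ \int (b - r^\top\beta(P))^2\, dP(b, r) }$$

almost surely.

We now prove this theorem. We establish the following lemma first.

Lemma 2. Partition the support of $P$ as $V_1, \dots, V_T$ such that the $\ell_\infty$ radius of each set is less than $c$, with $c$ as in (16). If a distribution $\mu$ satisfies

$$\mu(V_t) = \#\big((b_i, r_i) \in V_t\big)/n; \quad t = 1, \dots, T, \quad (17)$$

then $\mu \in \hat{\mathcal{P}}(n)$.

Proof. Let $Z_i = [y_i - c,\ y_i + c] \times \prod_{j=1}^{m} [X_{ij} - c,\ X_{ij} + c]$; recall that $X_{ij}$ is the $j$th element of $r_i$. Since the $\ell_\infty$ radius of $V_t$ is less than $c$, we have

$$(b_i, r_i) \in V_t \ \Rightarrow\ V_t \subseteq Z_i.$$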

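Therefore, for any $S' \subseteq \{1, \dots, n\}$, the following holds:

$$\mu\Big(\bigcup_{i \in S'} Z_i\Big) \ge \mu\Big(\bigcup \{V_t \mid \exists i \in S' : (b_i, r_i) \in V_t\}\Big) = \sum_{t \mid \exists i \in S' : (b_i, r_i) \in V_t} \mu(V_t) = \sum_{t \mid \exists i \in S' : (b_i, r_i) \in V_t} \#\big((b_i, r_i) \in V_t\big)/n \ge |S'|/n.$$

Hence $\mu \in \mathcal{P}_n(X, S, y, c_n)$, which implies $\mu \in \hat{\mathcal{P}}(n)$.

We now return to the proof of the theorem. Partition the support of $P$ into $T$ subsets such that the $\ell_\infty$ radius of each set is less than $c$. Denote by $\mathcal{P}(n)$ the set of probability measures satisfying Equation (17). Hence $\mathcal{P}(n) \subseteq \hat{\mathcal{P}}(n)$ by Lemma 2. Further notice that there exists a universal constant $K$ such that $\|\beta(c_n, S_n)\|_2 \le K/c_n$, because the square loss of the solution $\beta = 0$ is bounded by a constant that depends only on the support of $P$. Thus, there exists a constant $C$ such that $\max_{b, r}\ (b - r^\top\beta(c_n, S_n))^2 \le C/c_n^2$.

Following an argument similar to the proof of Theorem 6, we have

$$\sup_{\mu_n \in \mathcal{P}(n)} \sqrt{ \int (b - r^\top\beta(c_n, S_n))^2\, d\mu_n(b, r) } \le \sqrt{ \frac{1}{n}\sum_{i=1}^{n} (b_i - r_i^\top\beta(P))^2 } + \frac{1}{\sqrt{n}} \Big( \sum_{i=1}^{t}\ \max_{\forall g \in \bar{G}_i,\ \|\alpha^{(i)}_g\|_2 \le c_n} \alpha^{(i)\top}\beta(P) + c_n \Big) \quad (18)$$

and

$$\begin{aligned}
\sqrt{ \int (b - r^\top\beta(c_n, S_n))^2\, dP(b, r) }
&\le \inf_{\mu_n \in \mathcal{P}(n)} \Big\{ \sqrt{ \int (b - r^\top\beta(c_n, S_n))^2\, d\mu_n(b, r) } + \sqrt{ \max_{b, r}\ (b - r^\top\beta(c_n, S_n))^2 \int |f_{\mu_n}(b, r) - f(b, r)|\, d(b, r) } \Big\} \\
&\le \sup_{\mu_n \in \mathcal{P}(n)} \sqrt{ \int (b - r^\top\beta(c_n, S_n))^2\, d\mu_n(b, r) } + \sqrt{ \frac{2C}{c_n^2} \inf_{\mu_n \in \mathcal{P}(n)} \int |f_{\mu_n}(b, r) - f(b, r)|\, d(b, r) },
\end{aligned}$$

where $f_\mu$ stands for the density function of a measure $\mu$. Notice that $\mathcal{P}(n)$ is the set of distributions satisfying Equation (17), hence $\inf_{\mu_n \in \mathcal{P}(n)} \int |f_{\mu_n}(b, r) - f(b, r)|\, d(b, r)$ is upper-bounded by $\sum_{t=1}^{T} \big| P(V_t) - \#((b_i, r_i) \in V_t)/n \big|$, which goes to zero as $n$ increases for any fixed $c_n$. Therefore,

$$\frac{2C}{c_n^2} \inf_{\mu_n \in \mathcal{P}(n)} \int |f_{\mu_n}(b, r) - f(b, r)|\, d(b, r) \to 0$$

if $c_n \downarrow 0$ sufficiently slowly. Combining this with Inequality (18) proves the theorem.

References

Grötschel, Martin, Lovász, László, and Schrijver, Alexander. Geometric Algorithms and Combinatorial Optimization, volume 2. Springer, 1988.

Xu, H., Caramanis, C., and Mannor, S. Robust regression and lasso. IEEE Transactions on Information Theory, 56(7):3561–3574, 2010.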