Supplemental Material: Proofs


Proof to Theorem 1

Proof. Let $n$ be the minimal number of training items to ensure a unique solution $\theta^*$. First consider the case $n = 0$. It happens if and only if $\theta^* = 0$ and $\mathrm{Rank}(A) = d$, which is a special case of $A\theta^* = 0$. Clearly, this case is consistent with LB1. Next consider the case $n \ge 1$. Since $\theta^*$ solves (1), the KKT condition holds:

    $\lambda A \theta^* + \sum_{i=1}^n \partial l(x_i^\top \theta^*, y_i)\, x_i \ni 0.$    (29)

We seek all $\delta$ such that $\theta^* + \delta$ satisfies

    $A(\theta^* + \delta) = A\theta^* \quad\text{and}\quad x_i^\top(\theta^* + \delta) = x_i^\top \theta^* \quad \forall i = 1, \ldots, n.$    (30)

For any such $\delta$, simple algebra verifies that $\theta^* + t\delta$ satisfies the KKT condition (29) for any $t \in [0, 1]$. Consequently, $\theta^* + \delta$ also solves the problem in (1). To see this, we consider two situations:

- If the loss function $l(\cdot, \cdot)$ is convex in the first argument, the KKT condition is a sufficient optimality condition, which means that $\theta^* + \delta$ solves (1).
- If the loss function $l(\cdot, \cdot)$ is smooth (not necessarily convex) in the first argument, we have $f(\theta^*) = f(\theta^* + \delta)$ by using the Taylor expansion (recall $f$ is the objective defined in (1)):

    $f(\theta^* + \delta) = f(\theta^*) + \langle \nabla f(\theta^* + t\delta), \delta \rangle$  (for some $t \in [0, 1]$)
    $= f(\theta^*) + \big\langle \textstyle\sum_{i=1}^n \nabla l(x_i^\top(\theta^* + t\delta), y_i)\, x_i + \lambda A(\theta^* + t\delta),\ \delta \big\rangle$
    $= f(\theta^*) + \big\langle \textstyle\sum_{i=1}^n \nabla l(x_i^\top \theta^*, y_i)\, x_i + \lambda A\theta^*,\ \delta \big\rangle$  (applying (30); the inner product is $0$ due to the KKT condition (29))
    $= f(\theta^*).$

Therefore, $\theta^* + \delta$ also solves (1). However, the uniqueness of $\theta^*$ requires $\delta = 0$ to be the only value satisfying (30). This is equivalent to saying

    $\mathrm{Null}(A) \cap \mathrm{Null}(\mathrm{Span}\{x_1, \ldots, x_n\}) = \{0\}.$    (31)

It indicates that

    $\mathrm{Rank}(A) + \mathrm{Dim}(\mathrm{Span}\{x_1, \ldots, x_n\}) \ge d.$

From $\mathrm{Dim}(\mathrm{Span}\{x_1, \ldots, x_n\}) \le n$, we have $n \ge d - \mathrm{Rank}(A)$. We proved the general case for LB1.

If we have $A\theta^* \ne 0$, we can further improve LB1. Let $g^* = (g_1^*, \ldots, g_n^*)$ be the vector satisfying

    $\lambda A\theta^* = -\sum_{i=1}^n g_i^* x_i \quad\text{and}\quad g_i^* \in \partial l(x_i^\top \theta^*, y_i) \quad \forall i = 1, 2, \ldots, n.$    (32)

Since $\theta^*$ satisfies the KKT condition, such a vector $g^*$ must exist. Applying $A\theta^* \ne 0$ to (32), we have $g^* \ne 0$ and

    $\mathrm{Dim}\,\big(\mathrm{Span}\{A_{\cdot 1}, A_{\cdot 2}, \ldots, A_{\cdot d}\} \cap \mathrm{Span}\{x_1, \ldots, x_n\}\big) \ge 1.$    (33)

To satisfy (31), we must have

    $d \le \mathrm{Dim}\,\big(\mathrm{Span}\{A_{\cdot 1}, A_{\cdot 2}, \ldots, A_{\cdot d}, x_1, \ldots, x_n\}\big).$

Using the fact in linear algebra that

    $\mathrm{Dim}\,\big(\mathrm{Span}\{A_{\cdot 1}, \ldots, A_{\cdot d}, x_1, \ldots, x_n\}\big)$
    $= \mathrm{Dim}\,\big(\mathrm{Span}\{A_{\cdot 1}, \ldots, A_{\cdot d}\}\big) + \mathrm{Dim}\,\big(\mathrm{Span}\{x_1, \ldots, x_n\}\big) - \mathrm{Dim}\,\big(\mathrm{Span}\{A_{\cdot 1}, \ldots, A_{\cdot d}\} \cap \mathrm{Span}\{x_1, \ldots, x_n\}\big)$
    $\le \mathrm{Rank}(A) + n - 1$  (from (33)),

we conclude that $n \ge d + 1 - \mathrm{Rank}(A)$. We completed the proof for LB1.

Proof to Theorem 2

Proof. When $A$ has full rank we have an equivalent expression for the KKT condition (29):

    $\lambda A^{1/2}\theta^* + \sum_{i=1}^n A^{-1/2} x_i\, g_i = 0, \quad g_i \in \partial l(x_i^\top \theta^*, y_i), \quad i = 1, \ldots, n.$    (34)

Let us decompose $A^{-1/2} x_i$ for all $i = 1, \ldots, n$ into $A^{-1/2} x_i = \alpha_i A^{1/2}\theta^* + u_i$, where $u_i$ is orthogonal to $A^{1/2}\theta^*$: $u_i \perp A^{1/2}\theta^*$. Equivalently $x_i = \alpha_i A\theta^* + A^{1/2} u_i$. Applying this decomposition, we have

    $x_i^\top \theta^* = \alpha_i \|\theta^*\|_A^2 + u_i^\top A^{1/2}\theta^* = \alpha_i \|\theta^*\|_A^2.$

Putting it back in (34) we obtain

    $\lambda A^{1/2}\theta^* + \sum_{i=1}^n \big(\alpha_i A^{1/2}\theta^* + u_i\big)\, \partial l(\alpha_i \|\theta^*\|_A^2, y_i) \ni 0.$    (35)

Since $u_i$ is orthogonal to $A^{1/2}\theta^*$, (35) can be rewritten as: there exist

    $\alpha_i \in \mathbb{R}, \quad y_i \in \mathcal{Y}, \quad g_i \in \partial l(\alpha_i \|\theta^*\|_A^2, y_i) \quad \forall i = 1, \ldots, n$    (36)

satisfying

    $\sum_{i=1}^n g_i u_i = 0 \quad\text{and}\quad \lambda A^{1/2}\theta^* = -A^{1/2}\theta^* \sum_{i=1}^n \alpha_i g_i.$    (37)

Since $A\theta^* \ne 0$, we have $A^{1/2}\theta^* \ne 0$ and (37) is equivalent to $\lambda = -\sum_{i=1}^n \alpha_i g_i$. It follows that

    $\lambda = -\sum_{i=1}^n \alpha_i g_i \le \sum_{i=1}^n \sup_{\alpha \in \mathbb{R},\, y \in \mathcal{Y},\, g \in \partial l(\alpha\|\theta^*\|_A^2,\, y)} (-\alpha g) = n \sup_{\alpha \in \mathbb{R},\, y \in \mathcal{Y},\, g \in \partial l(\alpha\|\theta^*\|_A^2,\, y)} (-\alpha g).$

It indicates the lower bound

    $n \ge \frac{\lambda}{\sup_{\alpha \in \mathbb{R},\, y \in \mathcal{Y},\, g \in \partial l(\alpha\|\theta^*\|_A^2,\, y)} (-\alpha g)},$

which is LB2.
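
As a quick, non-authoritative sanity check of the LB2 argument, the following Python sketch approximates the denominator $\sup(-\alpha g)$ on a grid for the hinge loss $l(a, y) = \max(1 - ya, 0)$ and compares $\lambda/\sup$ against $\lambda\|\theta^*\|_A^2$. The dimension, regularizer $A$, $\theta^*$, and $\lambda$ below are arbitrary assumed values, and the grid search is only approximate.

```python
import numpy as np

# Approximate sup(-alpha * g) over alpha in R, y in {-1, +1}, and
# g in the subdifferential of a -> max(1 - y*a, 0) evaluated at
# a = alpha * ||theta*||_A^2. Theory predicts 1 / ||theta*||_A^2,
# hence the bound n >= lambda * ||theta*||_A^2.

rng = np.random.default_rng(0)
d = 5
theta_star = rng.normal(size=d)
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)            # an arbitrary full-rank PSD regularizer
lam = 0.7
t = theta_star @ A @ theta_star    # ||theta*||_A^2

best = 0.0
for y in (-1.0, 1.0):
    for alpha in np.linspace(-10.0, 10.0, 100001) / t:
        a = alpha * t
        # extreme points of the subdifferential of max(1 - y*a, 0) in a
        grads = [-y] if y * a < 1 else ([-y, 0.0] if y * a == 1 else [0.0])
        best = max(best, max(-alpha * g for g in grads))

print("sup(-alpha*g) ~", best, "   expected:", 1.0 / t)
print("lambda / sup  ~", lam / best, "   expected LB2 level:", lam * t)
```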

Proof to Theorem 3

Proof. Let $D = \{(x_i, y_i)\}_{i=1,\ldots,n}$ be a teaching set for $[w^*; b^*]$. The following KKT condition needs to be satisfied:

    $\sum_{i=1}^n \partial l\big(y_i (x_i^\top w^* + b^*)\big)\, y_i\, [x_i; 1] + [\lambda A w^*; 0] \ni 0.$    (38)

If we construct a new training set

    $\hat{D} = \Big\{ \hat{x}_i = x_i + \frac{b^*}{\|w^*\|_A^2} A w^*, \;\; \hat{y}_i = y_i \Big\}, \quad i = 1, \ldots, n,$

then $[w^*; 0]$ satisfies the KKT condition defined on $\hat{D}$. This can be verified as follows:

    $\sum_{i=1}^n \partial l(\hat{y}_i \hat{x}_i^\top w^*)\, \hat{y}_i\, [\hat{x}_i; 1] + [\lambda A w^*; 0]$
    $= \sum_{i=1}^n \partial l\big(y_i (x_i^\top w^* + b^*)\big)\, y_i\, \Big[x_i + \frac{b^*}{\|w^*\|_A^2} A w^*;\ 1\Big] + [\lambda A w^*; 0]$
    $= \underbrace{\sum_{i=1}^n \partial l\big(y_i (x_i^\top w^* + b^*)\big)\, y_i\, [x_i; 1] + [\lambda A w^*; 0]}_{\ni\, 0 \text{ from (38)}} + \frac{b^*}{\|w^*\|_A^2} \underbrace{\Big(\sum_{i=1}^n \partial l\big(y_i (x_i^\top w^* + b^*)\big)\, y_i\Big)}_{\ni\, 0 \text{ from the bias dimension in (38)}} [A w^*; 0]$
    $\ni 0,$    (39)

where $\hat{y}_i \hat{x}_i^\top w^* = y_i (x_i^\top w^* + b^*)$ by the construction of $\hat{D}$. It follows that

    $\sum_{i=1}^n \partial l(\hat{y}_i \hat{x}_i^\top w^*)\, \hat{y}_i \hat{x}_i + \lambda A w^* \ni 0,$

which is equivalent to (writing $z_i := \hat{y}_i \hat{x}_i$)

    $\sum_{i=1}^n \partial l(z_i^\top w^*)\, A^{-1/2} z_i + \lambda A^{1/2} w^* \ni 0.$    (40)

We decompose $A^{-1/2} z_i = \alpha_i A^{1/2} w^* + u_i$ where $u_i$ satisfies $u_i \perp A^{1/2} w^*$. Applying this decomposition to (40), we have

    $\lambda A^{1/2} w^* \in -\sum_{i=1}^n \partial l(\alpha_i \|w^*\|_A^2)\,\big(\alpha_i A^{1/2} w^* + u_i\big).$    (41)

Since $u_i$ is orthogonal to $A^{1/2} w^*$, (41) implies that

    $\lambda A^{1/2} w^* \in -\sum_{i=1}^n \partial l(\alpha_i \|w^*\|_A^2)\, \alpha_i\, A^{1/2} w^*.$

Since $w^* \ne 0$ we have

    $\lambda \in -\sum_{i=1}^n \partial l(\alpha_i \|w^*\|_A^2)\, \alpha_i, \quad\text{hence}\quad \lambda \le n \sup_{\alpha \in \mathbb{R},\, g \in \partial l(\alpha\|w^*\|_A^2)} (-\alpha g).$

Together with the integrality of $n$, we obtain LB3.
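
The bias-elimination step above is easy to check mechanically: shifting every $x_i$ by $(b^*/\|w^*\|_A^2) A w^*$ leaves the signed margins unchanged while zeroing the offset. A minimal numpy sketch follows; all data below are arbitrary assumed values.

```python
import numpy as np

# Verify y_i_hat * x_i_hat^T w*  ==  y_i * (x_i^T w* + b*)
# for the shift x_i_hat = x_i + (b* / ||w*||_A^2) * A w*.
rng = np.random.default_rng(1)
d, n = 4, 6
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
w_star = rng.normal(size=d)
b_star = 0.8
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)                        # arbitrary full-rank PSD A

norm2_A = w_star @ A @ w_star                  # ||w*||_A^2
X_hat = X + (b_star / norm2_A) * (A @ w_star)  # shifted teaching points

print(np.allclose(y * (X_hat @ w_star),        # homogeneous margins on D_hat
                  y * (X @ w_star + b_star)))  # affine margins on D -> True
```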

Proof to Proposition 1

Proof. We simply verify the KKT condition to see that $\theta^*$ is a solution to (10) by applying the construction in (11). The uniqueness of $\theta^*$ is guaranteed by the strong convexity of (10).

Proof to Proposition 2

Proof. We only need to verify that the KKT condition holds for $\theta^*$. Due to the strong convexity of (12), uniqueness is guaranteed automatically. We denote the subgradient $\partial_a \max(1 - a, 0) = -I(a)$, where

    $I(a) = \begin{cases} 1, & \text{if } a < 1 \\ [0, 1], & \text{if } a = 1 \\ 0, & \text{otherwise.} \end{cases}$    (42)

The KKT condition is, substituting the teaching set (13) with $n = \lceil \lambda\|\theta^*\|^2 \rceil$ items satisfying $y_i x_i = (\lambda/n)\theta^*$,

    $-\sum_{i=1}^n y_i x_i\, I(y_i x_i^\top \theta^*) + \lambda\theta^*$
    $= \lambda\theta^* - \lceil \lambda\|\theta^*\|^2 \rceil \cdot \frac{\lambda\theta^*}{\lceil \lambda\|\theta^*\|^2 \rceil} \cdot I\Big(\frac{\lambda\|\theta^*\|^2}{\lceil \lambda\|\theta^*\|^2 \rceil}\Big)$
    $= \lambda\theta^* \Big(1 - I\Big(\frac{\lambda\|\theta^*\|^2}{\lceil \lambda\|\theta^*\|^2 \rceil}\Big)\Big) \ni 0,$

where the last line is due to $I\big(\lambda\|\theta^*\|^2 / \lceil \lambda\|\theta^*\|^2 \rceil\big)$ giving either the set $[0, 1]$ or the value $1$, both of which contain $1$.

Proof to Corollary 2

Proof. We show this number matches LB2. Let $A = I$, $l(a, b) = \max(1 - ab, 0)$, and consider the denominator of (7):

    $\sup_{\alpha \in \mathbb{R},\, y \in \mathcal{Y},\, g \in \partial l(\alpha\|\theta^*\|^2,\, y)} (-\alpha g) = \sup_{\alpha,\, y \in \{-1, 1\},\, g \in -y I(y\alpha\|\theta^*\|^2)} (-\alpha g) = \sup_{\alpha,\, g \in I(\alpha\|\theta^*\|^2)} \alpha g = \frac{1}{\|\theta^*\|^2},$

where the first equality is due to $\partial_a l(a, b) = -b\, I(ab)$. Therefore, LB2 $= \lceil \lambda\|\theta^*\|^2 \rceil$, which matches the construction in (13).
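
A small numerical sketch of the Proposition 2 verification follows. The construction used here is my reading of (13) ($n = \lceil \lambda\|\theta^*\|^2 \rceil$ identical items with $y_i x_i = (\lambda/n)\theta^*$); treat that scaling as an assumption. The script checks that a subgradient vanishes at $\theta^*$ and that subgradient descent on the teaching set recovers $\theta^*$.

```python
import numpy as np

# Homogeneous SVM: f(theta) = sum_i max(1 - y_i x_i^T theta, 0)
#                             + (lam/2) ||theta||^2.
rng = np.random.default_rng(2)
d = 3
theta_star = rng.normal(size=d)
lam = 2.5
n = int(np.ceil(lam * (theta_star @ theta_star)))
X = np.tile((lam / n) * theta_star, (n, 1))   # assumed teaching items
y = np.ones(n)                                # all labels +1

def subgrad(theta):
    # one element of the subdifferential of f at theta
    active = (y * (X @ theta) < 1).astype(float)
    return -(active * y) @ X + lam * theta

print("margin:", X[0] @ theta_star, "(should be <= 1)")
print("||subgradient at theta*||:", np.linalg.norm(subgrad(theta_star)))

theta = np.zeros(d)                 # independent check: subgradient descent
for k in range(1000):
    theta -= subgrad(theta) / (lam * (k + 1))
print("||theta_hat - theta*||:", np.linalg.norm(theta - theta_star))
```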

Proof to Proposition 3

Proof. We first verify that $\theta^*$ is a solution to (14) based on the teaching set construction in (16); we only need to verify that the gradient of (14) is zero at $\theta^*$. Recall from (16) that the teaching set consists of $n = \lceil \lambda\|\theta^*\|^2/\tau_{\max} \rceil$ copies of the item with $x_i = \tau^{-1}(\lambda\|\theta^*\|^2/n)\, \theta^*/\|\theta^*\|^2$ and $y_i = 1$, where $\tau(t) = t/(1 + \exp\{t\})$ and $\tau_{\max} = \max_t \tau(t)$. Computing the gradient of (14), we have

    $-\sum_{i=1}^n \frac{y_i x_i}{1 + \exp\{y_i x_i^\top \theta^*\}} + \lambda\theta^*$
    $= -n\, \frac{\tau^{-1}(\lambda\|\theta^*\|^2/n)}{1 + \exp\{\tau^{-1}(\lambda\|\theta^*\|^2/n)\}} \cdot \frac{\theta^*}{\|\theta^*\|^2} + \lambda\theta^*$
    $= -n \cdot \frac{\lambda\|\theta^*\|^2}{n} \cdot \frac{\theta^*}{\|\theta^*\|^2} + \lambda\theta^*$
    $= -\lambda\theta^* + \lambda\theta^* = 0,$

where the second equality uses the fact $\lambda\|\theta^*\|^2/n \le \tau_{\max}$ (so that $\tau^{-1}$ is well defined) and the property $\tau^{-1}(a)/\big(1 + \exp\{\tau^{-1}(a)\}\big) = a$. The strong convexity of (14) automatically implies uniqueness.

Proof to Corollary 3

Proof. We show that the number matches LB2. In (7) let $A = I$ and $l(a, b) = \log(1 + \exp\{-ab\})$. The denominator of LB2 is:

    $\sup_{\alpha \in \mathbb{R},\, y \in \mathcal{Y},\, g \in \partial l(\alpha\|\theta^*\|^2,\, y)} (-\alpha g) = \sup_{\alpha,\, y \in \{-1, 1\},\, g = -y/(1 + \exp\{y\alpha\|\theta^*\|^2\})} (-\alpha g) = \sup_{\alpha} \frac{\alpha}{1 + \exp\{\alpha\|\theta^*\|^2\}} = \frac{1}{\|\theta^*\|^2} \sup_t \frac{t}{1 + \exp\{t\}} = \frac{\tau_{\max}}{\|\theta^*\|^2},$

by applying the change of variable $t = \alpha\|\theta^*\|^2$. This implies LB2 $= \lceil \lambda\|\theta^*\|^2/\tau_{\max} \rceil$, which matches the size of the teaching set in (16).
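
The logistic construction can be checked the same way. The sketch below follows my reading of (16) ($n = \lceil \lambda\|\theta^*\|^2/\tau_{\max} \rceil$ copies of $(x, +1)$ with $x = \tau^{-1}(\lambda\|\theta^*\|^2/n)\,\theta^*/\|\theta^*\|^2$) and inverts $\tau$ numerically with scipy; all concrete values are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import brentq

tau = lambda t: t / (1.0 + np.exp(t))
# argmax of tau solves 1 + e^t - t e^t = 0; tau_max = tau(t_max)
t_max = brentq(lambda t: 1 + np.exp(t) - t * np.exp(t), 1.0, 2.0)

rng = np.random.default_rng(3)
d = 4
theta_star = rng.normal(size=d)
lam = 0.3
s = lam * (theta_star @ theta_star)                   # lam * ||theta*||^2
n = int(np.ceil(s / tau(t_max)))
t_star = brentq(lambda t: tau(t) - s / n, 0.0, t_max)  # tau^{-1}(s/n)
x = t_star * theta_star / (theta_star @ theta_star)   # assumed item, y = +1

# gradient of sum_i log(1 + exp(-y_i x_i^T theta)) + (lam/2)||theta||^2
grad = -n * x / (1.0 + np.exp(x @ theta_star)) + lam * theta_star
print("||gradient at theta*||:", np.linalg.norm(grad))  # ~0
```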

Proof to Proposition 4

Proof. We first prove the case for $w^* = 0$. We can verify that the KKT condition is satisfied by designing $x_1$ and $y_1$ as in (18):

    $(x_1^\top w^* + b^* - y_1)\, x_1 + \lambda w^* = 0, \qquad x_1^\top w^* + b^* - y_1 = 0.$

The uniqueness of $[w^*; b^*]$ is indicated by the strong convexity of (17) when $n = 1$.

We then prove the case for $w^* \ne 0$. With simple algebra, we can verify the KKT condition holds via the construction in (19):

    $(x_1^\top w^* + b^* - y_1)\, x_1 + (x_2^\top w^* + b^* - y_2)\, x_2 + \lambda w^* = 0,$
    $(x_1^\top w^* + b^* - y_1) + (x_2^\top w^* + b^* - y_2) = 0.$

Similarly, the uniqueness is implied by the strong convexity of (17) when $n = 2$.

Proof to Corollary 4

Proof. We match the lower bound LB1 in (6). Note $\theta^* = [w^*; b^*] \in \mathbb{R}^{d+1}$, and $A$ in this case is the $(d+1) \times (d+1)$ matrix given by the $d \times d$ identity matrix $I_d$ padded with one additional row and column of zeros for the offset. Therefore $\mathrm{Rank}(A) = \mathrm{Rank}(I_d) = d$. When $w^* = 0$, $A\theta^* = 0$ and LB1 $= (d+1) - \mathrm{Rank}(A) = 1$. When $w^* \ne 0$, $A\theta^* \ne 0$ and LB1 $= (d+1) - \mathrm{Rank}(A) + 1 = 2$. These lower bounds match the teaching set sizes in (18) and (19), respectively.
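
For the ridge learner with bias, the two stationarity equations above pin down a valid two-item set directly. The pair below is my own instantiation rather than necessarily the paper's (19): $x_1 = w^*$, $x_2 = -w^*$, with targets chosen so the residuals are $\mp\lambda/2$; the closed-form normal equations then recover $[w^*; b^*]$.

```python
import numpy as np

# Objective: (1/2) sum_i (x_i^T w + b - y_i)^2 + (lam/2)||w||^2.
rng = np.random.default_rng(4)
d = 3
w_star = rng.normal(size=d)
b_star = -0.4
lam = 1.1

X = np.vstack([w_star, -w_star])                 # assumed two-item set
y = np.array([w_star @ w_star + b_star + lam / 2,
              -(w_star @ w_star) + b_star - lam / 2])

# (d+1)x(d+1) normal equations for [w; b]
n = len(y)
H = np.zeros((d + 1, d + 1))
H[:d, :d] = X.T @ X + lam * np.eye(d)
H[:d, d] = H[d, :d] = X.sum(axis=0)
H[d, d] = n
rhs = np.concatenate([X.T @ y, [y.sum()]])
sol = np.linalg.solve(H, rhs)

print("recovered [w; b]:", np.round(sol, 6))
print("target    [w; b]:", np.round(np.concatenate([w_star, [b_star]]), 6))
```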

Proof to Proposition 5

Proof. Unlike in previous learners (including the homogeneous SVM), we no longer have strong convexity w.r.t. $b$. In order to prove that (21) is a teaching set, we need to verify the KKT condition and verify solution uniqueness.

We first verify the KKT condition to show that the solution under (21) includes the target model $[w^*; b^*]$. From (21), we have

    $x_1^\top w^* + b^* = 1, \qquad x_2^\top w^* + b^* = -1.$    (43)

Applying them to the KKT condition and using the notation in (42) we obtain

    $-\frac{n}{2}\, I(x_1^\top w^* + b^*)\, [x_1; 1] + \frac{n}{2}\, I\big(-(x_2^\top w^* + b^*)\big)\, [x_2; 1] + [\lambda w^*; 0]$
    $= -\frac{n}{2}\, I(1)\, [x_1; 1] + \frac{n}{2}\, I(1)\, [x_2; 1] + [\lambda w^*; 0]$  (applying (43))
    $\ni \frac{n}{2} \cdot \frac{\lambda\|w^*\|^2}{n}\, [x_2 - x_1; 0] + [\lambda w^*; 0]$  (choosing the same $g = \lambda\|w^*\|^2/n \in I(1) = [0, 1]$ in both terms, which sets the last dimension to $0$; this is feasible by observing $\lambda\|w^*\|^2 \le n$)
    $= -[\lambda w^*; 0] + [\lambda w^*; 0] = 0,$

using $x_1 - x_2 = 2 w^*/\|w^*\|^2$ from the construction (21). It proves that $[w^*; b^*]$ solves (20) by our teaching set construction.

Next we prove uniqueness by contradiction. We use $f(w, b)$ to denote the objective function in (20) under the teaching set. It is easy to verify that $f(w^*, b^*) = \frac{\lambda}{2}\|w^*\|^2$. Assume that there exists another solution $[\tilde{w}; \tilde{b}]$ different from $[w^*; b^*]$. We can obtain $\|\tilde{w}\|^2 \le \|w^*\|^2$ due to

    $\frac{\lambda}{2}\|w^*\|^2 = f(w^*, b^*) = f(\tilde{w}, \tilde{b}) \ge \frac{\lambda}{2}\|\tilde{w}\|^2.$

The second equality is due to $[\tilde{w}; \tilde{b}]$ being a solution; the inequality is due to the whole-part relationship. Therefore, there are only two possibilities for the norm of $\tilde{w}$: $\|\tilde{w}\| = \|w^*\|$ or $\|\tilde{w}\| = t\|w^*\|$ for some $t < 1$. Next we will show that both cases are impossible.

(Case 1) For the case $\|\tilde{w}\| = \|w^*\|$, we have

    $f(\tilde{w}, \tilde{b}) = \frac{n}{2}\max\big(1 - (x_1^\top \tilde{w} + \tilde{b}), 0\big) + \frac{n}{2}\max\big(1 + (x_2^\top \tilde{w} + \tilde{b}), 0\big) + \frac{\lambda}{2}\|\tilde{w}\|^2$
    $= \frac{n}{2}\max\big(\underbrace{x_1^\top(w^* - \tilde{w}) + (b^* - \tilde{b})}_{=: \Delta_1}, 0\big) + \frac{n}{2}\max\big(\underbrace{-x_2^\top(w^* - \tilde{w}) - (b^* - \tilde{b})}_{=: \Delta_2}, 0\big) + \frac{\lambda}{2}\|\tilde{w}\|^2$
    $= \frac{n}{2}\max(\Delta_1, 0) + \frac{n}{2}\max(\Delta_2, 0) + f(w^*, b^*) \ge f(w^*, b^*).$

From $f(\tilde{w}, \tilde{b}) = f(w^*, b^*)$, it follows $\Delta_1 \le 0$ and $\Delta_2 \le 0$. Since

    $\Delta_1 + \Delta_2 = (x_1 - x_2)^\top(w^* - \tilde{w}) = \frac{2 (w^*)^\top(w^* - \tilde{w})}{\|w^*\|^2} = 2 - \frac{2 (w^*)^\top \tilde{w}}{\|w^*\|^2} \ge 2 - \frac{2\|w^*\|\|\tilde{w}\|}{\|w^*\|^2} = 0,$

we have $\Delta_1 = \Delta_2 = 0$ and $(w^*)^\top \tilde{w} = \|w^*\|^2$. But because $\|\tilde{w}\| = \|w^*\|$, we must have $\tilde{w} = w^*$. Applying this new observation to $\Delta_1 = 0$ and $\Delta_2 = 0$, we obtain $\tilde{b} = b^*$. It means that $[w^*; b^*] = [\tilde{w}; \tilde{b}]$, contradicting our assumption $[w^*; b^*] \ne [\tilde{w}; \tilde{b}]$.

(Case 2) Next we turn to the case $\|\tilde{w}\| = t\|w^*\|$ for some $t \in [0, 1)$. Recall our assumption that $[\tilde{w}; \tilde{b}]$ solves (20). Then it follows that the following specific construction $[\hat{w}; \hat{b}]$ solves (20) as well:

    $\hat{w} = t w^*, \qquad \hat{b} = t b^*.$    (44)

To see this, we consider the following optimization problem:

    $\min_{w, b} \;\; L(w, b) := \frac{n}{2}\max\big(1 - (x_1^\top w + b), 0\big) + \frac{n}{2}\max\big(1 + (x_2^\top w + b), 0\big)$
    $\text{s.t.} \;\; \|w\| \le t\|w^*\|.$    (45)

Since $[\tilde{w}; \tilde{b}]$ solves (20), it is easy to see that $[\tilde{w}; \tilde{b}]$ solves (45) too, otherwise there exists a solution of (45) which gives a lower function value in (20). Then we can verify that $[\hat{w}; \hat{b}]$ solves (45) as well by showing the following optimality condition holds:

    $-\Big[\frac{\partial L(w, b)}{\partial w}; \frac{\partial L(w, b)}{\partial b}\Big]\Big|_{[\hat{w}; \hat{b}]} \;\cap\; N_{\|w\| \le t\|w^*\|}(\hat{w}, \hat{b}) \ne \emptyset,$

where $N_{\|w\| \le t\|w^*\|}(\hat{w}, \hat{b})$ is the normal cone to the set $\{[w; b] : \|w\| \le t\|w^*\|\}$ at $[\hat{w}; \hat{b}]$. Given a convex closed set $\Omega$ and a point $\theta \in \Omega$, the normal cone at point $\theta$ is defined to be the set

    $N_\Omega(\theta) = \{\phi : \langle \phi, \psi - \theta \rangle \le 0 \;\; \forall \psi \in \Omega\}.$    (46)

The optimality condition basically suggests that at the optimal point, the negative (sub)gradient direction overlaps with the normal cone. In other words, there is no direction that decreases the objective at the optimal point. Readers can refer to Bertsekas & Nedic (2003) for more details about the geometric optimality condition.

Because of (43) and (44), we have $x_1^\top \hat{w} + \hat{b} = t < 1$. Thus at $[\hat{w}; \hat{b}]$ the subgradient is

    $\Big[\frac{\partial L(w, b)}{\partial w}; \frac{\partial L(w, b)}{\partial b}\Big]\Big|_{[\hat{w}; \hat{b}]} = \frac{n}{2}\,[x_2 - x_1; 0] = -\frac{n}{\|w^*\|^2}\,[w^*; 0].$    (47)

And the normal cone is

    $N_{\|w\| \le t\|w^*\|}(\hat{w}, \hat{b}) = \big\{ s\,[w^*; 0] : s \ge 0 \big\}.$    (48)

The intersection is non-empty by choosing $s = n/\|w^*\|^2$. Since both $[\hat{w}; \hat{b}]$ and $[\tilde{w}; \tilde{b}]$ solve (45), we have $L(\hat{w}, \hat{b}) = L(\tilde{w}, \tilde{b})$. Together with $\|\hat{w}\| = \|\tilde{w}\|$, we have

    $f(\hat{w}, \hat{b}) = L(\hat{w}, \hat{b}) + \frac{\lambda}{2}\|\hat{w}\|^2 = f(\tilde{w}, \tilde{b}) = f(w^*, b^*).$

Therefore, we proved that $[\hat{w}; \hat{b}]$ solves (20) as well. To see the contradiction, let us check the function value of $f(\hat{w}, \hat{b})$

via a different route:

    $f(\hat{w}, \hat{b}) = f(t w^*, t b^*)$
    $= \frac{n}{2}\max\big(1 - t(x_1^\top w^* + b^*), 0\big) + \frac{n}{2}\max\big(1 + t(x_2^\top w^* + b^*), 0\big) + \frac{\lambda}{2}\|w^*\|^2 t^2$
    $= \frac{n}{2}\max(1 - t, 0) + \frac{n}{2}\max(1 - t, 0) + \frac{\lambda}{2}\|w^*\|^2 t^2$
    $= n(1 - t) + \frac{\lambda}{2}\|w^*\|^2 t^2$
    $= f(w^*, b^*) + n(1 - t) - \frac{\lambda}{2}\|w^*\|^2 (1 - t^2)$
    $\ge f(w^*, b^*) + \lambda\|w^*\|^2 (1 - t) - \frac{\lambda}{2}\|w^*\|^2 (1 - t^2)$
    $= f(w^*, b^*) + \frac{\lambda}{2}\|w^*\|^2 (1 - t)^2$
    $> f(w^*, b^*),$

where the first inequality uses the fact that $n \ge \lambda\|w^*\|^2$. It contradicts our earlier assertion $f(\hat{w}, \hat{b}) = f(w^*, b^*)$.

Putting Cases 1 and 2 together we prove uniqueness.

Proof to Corollary 5

Proof. The upper bound directly follows Proposition 5. We only need to show the lower bound LB3 $= \lceil \lambda\|w^*\|^2 \rceil$ in Theorem 3. Let $A = I$, $l(a) = \max(1 - a, 0)$, and consider the denominator of (9):

    $\sup_{\alpha \in \mathbb{R},\, g \in \partial l(\alpha\|w^*\|^2)} (-\alpha g) = \sup_{\alpha,\, g \in I(\alpha\|w^*\|^2)} \alpha g = \frac{1}{\|w^*\|^2},$

where the first equality is due to $\partial l(a) = -I(a)$. Therefore, LB3 $= \lceil \lambda\|w^*\|^2 \rceil$, which proves the lower bound.
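
The KKT verification in Proposition 5 is concrete enough to check in code. The instantiation below is my reading of (21): $n/2$ copies each of $(x_1, +1)$ and $(x_2, -1)$ with $x_1 = (1 - b^*) w^*/\|w^*\|^2$, $x_2 = -(1 + b^*) w^*/\|w^*\|^2$, and $n = 2\lceil \lambda\|w^*\|^2/2 \rceil$; treat those formulas as assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3
w_star = rng.normal(size=d)
b_star = 0.3
lam = 1.7
nrm2 = w_star @ w_star
n = 2 * int(np.ceil(lam * nrm2 / 2))

x1 = (1 - b_star) * w_star / nrm2     # margin +1 point
x2 = -(1 + b_star) * w_star / nrm2    # margin -1 point

print("margins:", x1 @ w_star + b_star, -(x2 @ w_star + b_star))  # both 1
g = lam * nrm2 / n                    # chosen hinge subgradient, in I(1)=[0,1]
print("g in [0,1]:", 0.0 <= g <= 1.0)
stat_w = -(n / 2) * g * x1 + (n / 2) * g * x2 + lam * w_star  # w-stationarity
stat_b = -(n / 2) * g + (n / 2) * g                           # b-stationarity
print("||stationarity in w||:", np.linalg.norm(stat_w), " in b:", stat_b)
```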

Proof to Proposition 6

Proof. We first point out that for $t^*$ to be well-defined the argument to $\tau^{-1}(\cdot)$ has to be bounded by $\tau_{\max}$: $\lambda\|w^*\|^2/n \le \tau_{\max}$. This implies $n \ge \lambda\|w^*\|^2/\tau_{\max}$. The size of our proposed teaching set is the smallest among all such symmetric constructions that satisfy this constraint.

We verify the KKT condition to show that the construction in (23) includes the solution $[w^*; b^*]$. From (23), we have

    $x_1^\top w^* + b^* = t^*, \qquad x_2^\top w^* + b^* = -t^*.$

We apply them and the teaching set construction to compute the gradient of (22):

    $-\frac{n}{2} \cdot \frac{[x_1; 1]}{1 + \exp\{x_1^\top w^* + b^*\}} + \frac{n}{2} \cdot \frac{[x_2; 1]}{1 + \exp\{-(x_2^\top w^* + b^*)\}} + [\lambda w^*; 0]$
    $= \frac{n}{2} \cdot \frac{[x_2 - x_1; 0]}{1 + \exp\{t^*\}} + [\lambda w^*; 0]$
    $= -\frac{n\, t^*}{(1 + \exp\{t^*\})\|w^*\|^2}\, [w^*; 0] + [\lambda w^*; 0]$  (using $x_1 - x_2 = 2 t^* w^*/\|w^*\|^2$ from (23))
    $= -[\lambda w^*; 0] + [\lambda w^*; 0] = 0$  (using $n\,\tau(t^*) = \lambda\|w^*\|^2$, i.e., $t^* = \tau^{-1}(\lambda\|w^*\|^2/n)$).

This verifies the KKT condition.

Finally we show uniqueness. The Hessian matrix of the objective function (22) under our training set (23) is

    $\underbrace{\frac{n}{2} \cdot \frac{\exp\{t^*\}}{(1 + \exp\{t^*\})^2}}_{=:\, a}\ \underbrace{\big([x_1; 1][x_1; 1]^\top + [x_2; 1][x_2; 1]^\top\big)}_{=:\, A} + \lambda \underbrace{\begin{bmatrix} I_d & 0 \\ 0 & 0 \end{bmatrix}}_{=:\, B}.$

Note $a > 0$ and $A = [x_1; 1][x_1; 1]^\top + [x_2; 1][x_2; 1]^\top$ is positive semi-definite. We show that $aA + \lambda B$ is positive definite. Suppose not. Then there exists $[u; v] \ne 0$ such that $[u; v]^\top (aA + \lambda B)[u; v] = 0$. This implies $[u; v]^\top (aA)[u; v] = 0$ and $\lambda u^\top u = 0$. Since the first term is non-negative due to $A$ being positive semi-definite, $u = 0$. But then we have $2 a v^2 = 0$, which implies $[u; v] = 0$, a contradiction. Therefore uniqueness is guaranteed.

Proof to Corollary 6

Proof. The upper bound directly follows Proposition 6. We only need to show the lower bound LB3 in Theorem 3. Let $A = I$ and $l(a) = \log(1 + \exp\{-a\})$ and consider the denominator of (9):

    $\sup_{\alpha \in \mathbb{R},\, g \in \partial l(\alpha\|w^*\|^2)} (-\alpha g) = \sup_{\alpha} \frac{\alpha}{1 + \exp\{\alpha\|w^*\|^2\}} = \frac{1}{\|w^*\|^2} \sup_t \frac{t}{1 + \exp\{t\}} = \frac{\tau_{\max}}{\|w^*\|^2},$

using $g = -1/(1 + \exp\{\alpha\|w^*\|^2\})$ and the change of variable $t = \alpha\|w^*\|^2$, which implies LB3 $= \lceil \lambda\|w^*\|^2/\tau_{\max} \rceil$ by applying Theorem 3.
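
To close, an end-to-end numerical sketch of Proposition 6 under my instantiation of (23): $x_1 = (t^* - b^*) w^*/\|w^*\|^2$, $x_2 = -(t^* + b^*) w^*/\|w^*\|^2$, $n = 2\lceil \lambda\|w^*\|^2/(2\tau_{\max}) \rceil$, and $t^* = \tau^{-1}(\lambda\|w^*\|^2/n)$. All formulas and constants here are assumptions for illustration; a generic convex solver should then return $[w^*; b^*]$ as the unique minimizer.

```python
import numpy as np
from scipy.optimize import brentq, minimize

tau = lambda t: t / (1.0 + np.exp(t))
t_max = brentq(lambda t: 1 + np.exp(t) - t * np.exp(t), 1.0, 2.0)

rng = np.random.default_rng(6)
d = 3
w_star = rng.normal(size=d)
b_star = 0.5
lam = 0.4
nrm2 = w_star @ w_star
n = 2 * int(np.ceil(lam * nrm2 / (2 * tau(t_max))))
t_star = brentq(lambda t: tau(t) - lam * nrm2 / n, 0.0, t_max)

x1 = (t_star - b_star) * w_star / nrm2
x2 = -(t_star + b_star) * w_star / nrm2
X = np.vstack([np.tile(x1, (n // 2, 1)), np.tile(x2, (n // 2, 1))])
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

def f(wb):  # logistic loss with unregularized bias
    w, b = wb[:d], wb[d]
    return np.log1p(np.exp(-y * (X @ w + b))).sum() + 0.5 * lam * w @ w

res = minimize(f, np.zeros(d + 1))
target = np.concatenate([w_star, [b_star]])
print("max |recovered - target|:", np.max(np.abs(res.x - target)))  # ~0
```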