Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation


Instructor: Moritz Hardt
Email: hardt+ee227c@berkeley.edu
Graduate Instructor: Max Simchowitz
Email: msimchow+ee227c@berkeley.edu

October 15, 2018

15 Fenchel duality and algorithms

In this section, we introduce the Fenchel conjugate. First, recall that for a real-valued convex function of a single variable $f(x)$, we call $f^*(p) := \sup_x \, px - f(x)$ its Legendre transform. This is illustrated in Figure 1. The generalization of the Legendre transform to (possibly nonconvex) functions of multiple variables is the Fenchel conjugate.

Definition 15.1 (Fenchel conjugate). The Fenchel conjugate of $f \colon \mathbb{R}^n \to \mathbb{R}$ is
$$f^*(p) = \sup_x \, \langle p, x \rangle - f(x).$$

We now present several useful facts about the Fenchel conjugate. The proofs are left as an exercise.

Fact 15.2. $f^*$ is convex.

Indeed, $f^*$ is the supremum of affine functions and therefore convex. Thus, the Fenchel conjugate of $f$ is also known as its convex conjugate.

Fact 15.3. $f^{**} = f$ if $f$ is convex.

In other words, the Fenchel conjugate is its own inverse for convex functions.
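As a quick sanity check of Facts 15.2 and 15.3, consider a standard example (not from the original notes): the scaled quadratic $f(x) = \frac{a}{2}x^2$ with $a > 0$. A short computation gives
$$f^*(p) = \sup_x \, px - \frac{a}{2}x^2 = \frac{p^2}{2a}, \qquad f^{**}(x) = \sup_p \, xp - \frac{p^2}{2a} = \frac{a}{2}x^2 = f(x),$$
so the biconjugate recovers $f$, and for $a = 1$ the function $\frac{1}{2}x^2$ is its own conjugate.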

Figure 1: Legendre transform. The function $f^*(p)$ is how much we need to vertically shift the epigraph of $f$ so that the linear function $px$ is tangent to $f$ at $x$.

Now, we can also relate the subdifferential of a function to that of its Fenchel conjugate. Intuitively, observe that
$$x \in \partial f^*(p) \iff 0 \in p - \partial f(x) \iff p \in \partial f(x).$$
This is summarized more generally in the following fact.

Fact 15.4. The subdifferential of $f^*$ at $p$ is $\partial f^*(p) = \{x : p \in \partial f(x)\}$.

Indeed, $\partial f^*(0)$ is the set of minimizers of $f$.

In the following theorem, we introduce Fenchel duality.

Theorem 15.5 (Fenchel duality). Suppose we have $f$ proper convex and $g$ proper concave. Then
$$\min_x f(x) - g(x) = \max_p g_*(p) - f^*(p),$$
where $g_*$ is the concave conjugate of $g$, defined as $g_*(p) = \inf_x \, \langle p, x \rangle - g(x)$.

In the one-dimensional case, we can illustrate Fenchel duality with Figure 2. In the minimization problem, we want to find $x$ such that the vertical distance between $f$ and $g$ at $x$ is as small as possible. In the (dual) maximization problem, we draw tangents to the graphs of $f$ and $g$ such that the tangent lines have the same slope $p$, and we want to find $p$ such that the vertical distance between the tangent lines is as large as possible. The duality theorem above states that strong duality holds, that is, the two problems have the same solution.

We can recover Fenchel duality from Lagrangian duality, which we have already studied. To do so, we need to introduce a constraint to our minimization problem in Theorem 15.5. A natural reformulation of the problem with a constraint is as follows:
$$\min_{x,z} \; f(x) - g(z) \quad \text{subject to} \quad x = z. \tag{1}$$
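To see Theorem 15.5 in action, here is a minimal numerical sketch (not part of the original notes) that approximates both conjugates on a grid for one convex/concave pair and compares the primal and dual optimal values; the particular choices $f(x) = \frac{1}{2}x^2$, $g(x) = -|x-1|$, and the grid are arbitrary illustrative assumptions.

```python
import numpy as np

# One-dimensional Fenchel duality check on a grid (illustrative sketch).
# Theorem 15.5:  min_x f(x) - g(x) = max_p g_*(p) - f^*(p).

xs = np.linspace(-5, 5, 2001)   # grid over x
ps = np.linspace(-5, 5, 2001)   # grid over slopes p

f = 0.5 * xs**2                 # f(x) = x^2 / 2  (convex)
g = -np.abs(xs - 1.0)           # g(x) = -|x - 1| (concave)

# Convex conjugate  f^*(p) = sup_x  p*x - f(x)
f_star = np.max(ps[:, None] * xs[None, :] - f[None, :], axis=1)
# Concave conjugate g_*(p) = inf_x  p*x - g(x)
g_star = np.min(ps[:, None] * xs[None, :] - g[None, :], axis=1)

primal = np.min(f - g)          # min_x f(x) - g(x)
dual = np.max(g_star - f_star)  # max_p g_*(p) - f^*(p)

print(f"primal optimum ~ {primal:.4f}, dual optimum ~ {dual:.4f}")
# Both values come out close to 0.5, attained at x = 1 and p = 1.
```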

Figure 2: Fenchel duality in one dimension.

15.1 Deriving the dual problem for empirical risk minimization

In empirical risk minimization, we often want to minimize a function of the following form:
$$P(w) = \frac{1}{m} \sum_{i=1}^{m} \phi_i(\langle w, x_i \rangle) + R(w). \tag{2}$$
We can think of $w \in \mathbb{R}^n$ as the model parameter that we want to optimize over (in this case it corresponds to picking a hyperplane), and $x_i$ as the features of the $i$-th example in the training set. $\phi_i(\langle \cdot, x_i \rangle)$ is the loss function for the $i$-th training example and may depend on its label. $R(w)$ is the regularizer, and we typically choose it to be of the form $R(w) = \frac{\lambda}{2}\|w\|^2$. The primal problem, $\min_{w \in \mathbb{R}^n} P(w)$, can be equivalently written as follows:
$$\min_{w,z} \; \frac{1}{m}\sum_{i=1}^{m} \phi_i(z_i) + R(w) \quad \text{subject to} \quad X^\top w = z, \tag{3}$$
where $X \in \mathbb{R}^{n \times m}$ is the matrix whose $i$-th column is $x_i$.

By Lagrangian duality, we know that the dual problem is the following:
$$
\begin{aligned}
&\max_{\alpha} \, \min_{z,w} \; \frac{1}{m}\sum_{i=1}^{m}\phi_i(z_i) + R(w) - \frac{1}{m}\alpha^\top\bigl(X^\top w - z\bigr) \\
&\quad= \max_{\alpha} \, \min_{w,z} \; \frac{1}{m}\sum_{i=1}^{m}\bigl(\phi_i(z_i) + \alpha_i z_i\bigr) + R(w) - \frac{1}{m}(X\alpha)^\top w \\
&\quad= \max_{\alpha} \; \frac{1}{m}\sum_{i=1}^{m}\Bigl(-\max_{z_i}\bigl(-\alpha_i z_i - \phi_i(z_i)\bigr)\Bigr) - \max_{w}\Bigl(\Bigl\langle \tfrac{1}{m}X\alpha, w\Bigr\rangle - R(w)\Bigr) \\
&\quad= \max_{\alpha} \; \frac{1}{m}\sum_{i=1}^{m}-\phi_i^*(-\alpha_i) - R^*\Bigl(\tfrac{1}{m}X\alpha\Bigr),
\end{aligned}
$$
where $\phi_i^*$ and $R^*$ are the Fenchel conjugates of $\phi_i$ and $R$, respectively. Let us denote the dual objective as
$$D(\alpha) = \frac{1}{m}\sum_{i=1}^{m}-\phi_i^*(-\alpha_i) - R^*\Bigl(\tfrac{1}{m}X\alpha\Bigr).$$
By weak duality, $D(\alpha) \le P(w)$.

For $R(w) = \frac{\lambda}{2}\|w\|^2$, we have $R^*(p) = \frac{\lambda}{2}\bigl\|\tfrac{1}{\lambda}p\bigr\|^2$. So $R$ is its own convex conjugate (up to correction by a constant factor). In this case the dual becomes:
$$\max_{\alpha} \; \frac{1}{m}\sum_{i=1}^{m}-\phi_i^*(-\alpha_i) - \frac{\lambda}{2}\Bigl\|\frac{1}{\lambda m}\sum_{i=1}^{m}\alpha_i x_i\Bigr\|^2.$$
We can relate the primal and dual variables by the map $w(\alpha) = \frac{1}{\lambda m}\sum_{i=1}^{m}\alpha_i x_i$. In particular, this shows that the optimal hyperplane is in the span of the data.

Here are some examples of models we can use under this framework.

Example 15.6 (Linear SVM). We would use the hinge loss for $\phi_i$. This corresponds to
$$\phi_i(w) = \max(0,\, 1 - y_i x_i^\top w), \qquad \phi_i^*(-\alpha_i) = -\alpha_i y_i \quad \text{(for } \alpha_i y_i \in [0,1]\text{)}.$$

Example 15.7 (Least-squares linear regression). We would use the squared loss for $\phi_i$. This corresponds to
$$\phi_i(w) = (w^\top x_i - y_i)^2, \qquad \phi_i^*(-\alpha_i) = -\alpha_i y_i + \alpha_i^2/4.$$

We end with a fact that relates the smoothness of $\phi_i$ to the strong convexity of $\phi_i^*$.

Fact 15.8. If $\phi_i$ is $\frac{1}{\gamma}$-smooth, then $\phi_i^*$ is $\gamma$-strongly convex.
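The following is a minimal numerical sketch (not from the notes) of the primal and dual objectives for the linear SVM of Example 15.6. The synthetic data, the helper names `primal` and `dual`, and the particular dual-feasible $\alpha$ are illustrative assumptions; the point is simply to see the weak duality inequality $D(\alpha) \le P(w)$ hold.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 50, 5, 0.1

X = rng.normal(size=(n, m))                      # columns are the examples x_i
X /= np.maximum(1.0, np.linalg.norm(X, axis=0))  # enforce ||x_i|| <= 1
y = np.sign(rng.normal(size=m))                  # labels in {-1, +1}

def primal(w):
    """P(w) = (1/m) sum_i max(0, 1 - y_i <w, x_i>) + (lam/2) ||w||^2."""
    margins = y * (X.T @ w)
    return np.mean(np.maximum(0.0, 1.0 - margins)) + 0.5 * lam * (w @ w)

def dual(alpha):
    """D(alpha) = (1/m) sum_i -phi_i^*(-alpha_i) - (lam/2) ||w(alpha)||^2,
    where for hinge loss phi_i^*(-alpha_i) = -alpha_i y_i (alpha_i y_i in [0,1])."""
    w = X @ alpha / (lam * m)   # w(alpha) = (1/(lam m)) sum_i alpha_i x_i
    return np.mean(alpha * y) - 0.5 * lam * (w @ w)

# Any dual-feasible alpha (alpha_i y_i in [0, 1]) lower-bounds any primal value.
alpha = 0.5 * y                 # alpha_i y_i = 0.5 for every i
w = X @ alpha / (lam * m)
print(f"D(alpha) = {dual(alpha):.4f} <= P(w(alpha)) = {primal(w):.4f}")
```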

15.2 Stochastic dual coordinate ascent (SDCA)

In this section we discuss a particular algorithm for empirical risk minimization which makes use of Fenchel duality. Its main idea is picking an index $i \in [m]$ at random, and then solving the dual problem at coordinate $i$, while keeping the other coordinates fixed. More precisely, the algorithm performs the following steps:

1. Start from $w^0 := w(\alpha^0)$.

2. For $t = 1, \ldots, T$:

   (a) Randomly pick $i \in [m]$.

   (b) Find $\Delta\alpha_i$ which maximizes
   $$-\phi_i^*\bigl(-(\alpha_i^{t-1} + \Delta\alpha_i)\bigr) - \frac{\lambda m}{2}\Bigl\|w^{t-1} + \frac{1}{\lambda m}\Delta\alpha_i x_i\Bigr\|^2.$$

3. Update the dual and primal solution:

   (a) $\alpha^t = \alpha^{t-1} + \Delta\alpha_i e_i$, where $e_i$ is the $i$-th standard basis vector,

   (b) $w^t = w^{t-1} + \frac{1}{\lambda m}\Delta\alpha_i x_i$.

For certain loss functions, the maximizer $\Delta\alpha_i$ is given in closed form. For example, for hinge loss it is given explicitly by:
$$\Delta\alpha_i = y_i \max\Bigl(0,\, \min\Bigl(1,\; \frac{1 - x_i^\top w^{t-1} y_i}{\|x_i\|^2/(\lambda m)} + \alpha_i^{t-1} y_i\Bigr)\Bigr) - \alpha_i^{t-1},$$
and for squared loss it is given by:
$$\Delta\alpha_i = \frac{y_i - x_i^\top w^{t-1} - 0.5\,\alpha_i^{t-1}}{0.5 + \|x_i\|^2/(\lambda m)}.$$
Note that these updates require both the primal and dual solutions to perform the update.
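Below is a compact sketch of the SDCA loop for the hinge loss, plugging in the closed-form $\Delta\alpha_i$ above. The function name `sdca_hinge`, the data layout (columns of `X` are the $x_i$), and the hyperparameters are illustrative assumptions rather than part of the notes.

```python
import numpy as np

def sdca_hinge(X, y, lam, T, seed=0):
    """SDCA sketch for hinge-loss ERM, using the closed-form coordinate update.

    X has shape (n, m) with columns x_i; y has entries in {-1, +1}.
    Maintains alpha in R^m and the primal iterate w = (1/(lam m)) sum_i alpha_i x_i.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    alpha = np.zeros(m)
    w = np.zeros(n)                    # w^0 = w(alpha^0) = 0
    sq_norms = np.sum(X**2, axis=0)    # precomputed ||x_i||^2

    for _ in range(T):
        i = rng.integers(m)            # (a) pick a coordinate at random
        # (b) closed-form maximizer for the hinge loss:
        #     Delta = y_i max(0, min(1, (1 - x_i^T w y_i)/(||x_i||^2/(lam m))
        #                               + alpha_i y_i)) - alpha_i
        margin = (X[:, i] @ w) * y[i]
        delta = y[i] * max(0.0, min(1.0, (1.0 - margin) / (sq_norms[i] / (lam * m))
                                         + alpha[i] * y[i])) - alpha[i]
        # (3) update the dual and primal iterates
        alpha[i] += delta
        w += delta * X[:, i] / (lam * m)
    return w, alpha

# Illustrative usage on the synthetic (X, y, lam) from the previous sketch:
# w, alpha = sdca_hinge(X, y, lam=0.1, T=2000)
```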

Now we state a lemma given in [SSZ13], which implies linear convergence of SDCA. In what follows, assume that $\|x_i\| \le 1$, $\phi_i(x) \ge 0$ for all $x$, and $\phi_i(0) \le 1$.

Lemma 15.9. Assume $\phi_i^*$ is $\gamma$-strongly convex, where $\gamma > 0$. Then:
$$\mathbb{E}[D(\alpha^t) - D(\alpha^{t-1})] \ge \frac{s}{m}\,\mathbb{E}[P(w^{t-1}) - D(\alpha^{t-1})],$$
where $s = \frac{\lambda m \gamma}{1 + \lambda m \gamma}$.

We leave out the proof of this result, but give a short argument that proves linear convergence of SDCA using this lemma. Denote $\epsilon_D^t := D(\alpha^*) - D(\alpha^t)$. Because the dual solution provides a lower bound on the primal solution, it follows that:
$$\epsilon_D^t \le P(w^t) - D(\alpha^t).$$
Further, we can write:
$$D(\alpha^t) - D(\alpha^{t-1}) = \epsilon_D^{t-1} - \epsilon_D^t.$$
By taking expectations on both sides of this equality and applying Lemma 15.9, we obtain:
$$\mathbb{E}[\epsilon_D^{t-1} - \epsilon_D^t] = \mathbb{E}[D(\alpha^t) - D(\alpha^{t-1})] \ge \frac{s}{m}\,\mathbb{E}[P(w^{t-1}) - D(\alpha^{t-1})] \ge \frac{s}{m}\,\mathbb{E}[\epsilon_D^{t-1}].$$
Rearranging terms and recursively applying the previous argument yields:
$$\mathbb{E}[\epsilon_D^t] \le \Bigl(1 - \frac{s}{m}\Bigr)\mathbb{E}[\epsilon_D^{t-1}] \le \Bigl(1 - \frac{s}{m}\Bigr)^t \epsilon_D^0.$$
From this inequality we can conclude that we need $O\bigl((m + \frac{1}{\lambda\gamma})\log(1/\epsilon)\bigr)$ steps to achieve $\epsilon$ dual error.

Using Lemma 15.9, we can also bound the primal error. Again using the fact that the dual solution underestimates the primal solution, we obtain the bound in the following way:
$$\mathbb{E}[P(w^t) - P(w^*)] \le \mathbb{E}[P(w^t) - D(\alpha^t)] \le \frac{m}{s}\,\mathbb{E}[D(\alpha^{t+1}) - D(\alpha^t)] \le \frac{m}{s}\,\mathbb{E}[\epsilon_D^t],$$
where the last inequality ignores the negative term $-\mathbb{E}[\epsilon_D^{t+1}]$.

References

[SSZ13] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567-599, 2013.