Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Instructor: Moritz Hardt
Email: hardt+ee227c@berkeley.edu

Graduate Instructor: Max Simchowitz
Email: msimchow+ee227c@berkeley.edu

October 15, 2018

10 Stochastic optimization

The goal in stochastic optimization is to minimize functions of the form

$$f(x) = \mathbb{E}_{z \sim D}\, g(x, z)$$

which have a stochastic component given by a distribution $D$. In the case where the distribution has finite support, the function can be written as

$$f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x).$$

To solve these kinds of problems, we examine the stochastic gradient descent method and some of its many applications.

10.1 The stochastic gradient method

Following Robbins-Monro [RM51], we define the stochastic gradient method as follows.

Definition 10.1 (Stochastic gradient method). The stochastic gradient method starts from a point $x_0 \in \Omega$ and proceeds according to the update rule

$$x_{t+1} = x_t - \eta_t \nabla f_{i_t}(x_t)$$

where $i_t \in \{1, \dots, m\}$ is either selected at random at each step, or cycled through a random permutation of $\{1, \dots, m\}$.
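
To make the update rule concrete, here is a minimal sketch of the stochastic gradient method for a finite-sum objective, assuming the unconstrained case $\Omega = \mathbb{R}^n$ (so no projection step); the function names and the random index selection are illustrative choices rather than part of the definition.

```python
import numpy as np

def stochastic_gradient_method(grads, x0, step_size, steps, rng=None):
    """Sketch of Definition 10.1 for f(x) = (1/m) * sum_i f_i(x).

    grads     : list of callables, grads[i](x) = gradient of f_i at x
    x0        : starting point x_0 (assumed unconstrained, Omega = R^n)
    step_size : callable t -> eta_t
    steps     : number of iterations
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    m = len(grads)
    for t in range(1, steps + 1):
        i_t = rng.integers(m)               # i_t selected at random at each step
        x = x - step_size(t) * grads[i_t](x)
    return x
```

For instance, with `grads = [lambda x, p=p_i: x - p for p_i in points]` and `step_size = lambda t: 1.0 / t`, this recovers the averaging behavior checked in the sanity-check example below.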

Either of the two methods for selecting $i_t$ above leads to the fact that

$$\mathbb{E}\, \nabla f_{i_t}(x) = \nabla f(x).$$

This is also true when $f(x) = \mathbb{E}\, g(x, z)$ and at each step we update according to $\nabla g(x, z)$ for randomly drawn $z \sim D$.

10.1.1 Sanity check

Let us check on a simple problem that the stochastic gradient method yields the optimum. Let $p_1, \dots, p_m \in \mathbb{R}^n$, and define $f \colon \mathbb{R}^n \to \mathbb{R}_+$ by

$$\forall x \in \mathbb{R}^n, \quad f(x) = \frac{1}{2m} \sum_{i=1}^{m} \|x - p_i\|_2^2.$$

Note that here $f_i(x) = \frac{1}{2}\|x - p_i\|_2^2$ and $\nabla f_i(x) = x - p_i$. Moreover,

$$x^* = \operatorname*{argmin}_{x \in \mathbb{R}^n} f(x) = \frac{1}{m} \sum_{i=1}^{m} p_i.$$

Now, run SGM with $\eta_t = \frac{1}{t}$ in cyclic order, i.e. $i_t = t$, and $x_0 = 0$:

$$x_0 = 0$$
$$x_1 = 0 - \tfrac{1}{1}(0 - p_1) = p_1$$
$$x_2 = p_1 - \tfrac{1}{2}(p_1 - p_2) = \frac{p_1 + p_2}{2}$$
$$\vdots$$
$$x_m = \frac{1}{m} \sum_{i=1}^{m} p_i = x^*$$

10.2 The Perceptron

The New York Times wrote in 1958 that the Perceptron [Ros58] was:

"the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."

So, let's see.

Definition 10.2 (Perceptron). Given labeled points $((x_1, y_1), \dots, (x_m, y_m)) \in (\mathbb{R}^n \times \{-1, 1\})^m$, and an initial point $w_0 \in \mathbb{R}^n$, the Perceptron is the following algorithm. For $i_t \in \{1, \dots, m\}$ selected uniformly at random,

$$w_{t+1} = w_t(1 - \gamma) + \begin{cases} \eta\, y_{i_t} x_{i_t} & \text{if } y_{i_t} \langle w_t, x_{i_t} \rangle < 1 \\ 0 & \text{otherwise.} \end{cases}$$
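
A short sketch of the update in Definition 10.2; the number of iterations and the particular values of $\gamma$ and $\eta$ are illustrative assumptions, not prescribed by the definition.

```python
import numpy as np

def perceptron(X, y, gamma=0.01, eta=1.0, steps=1000, rng=None):
    """Sketch of the Perceptron update of Definition 10.2.

    X : (m, n) array of points x_i;  y : (m,) array of labels in {-1, +1}.
    gamma, eta, steps are illustrative hyperparameter choices.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = X.shape
    w = np.zeros(n)                          # initial point w_0
    for _ in range(steps):
        i = rng.integers(m)                  # i_t selected uniformly at random
        if y[i] * np.dot(w, X[i]) < 1:       # margin violated: take a correction step
            w = (1 - gamma) * w + eta * y[i] * X[i]
        else:
            w = (1 - gamma) * w
    return w
```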

Reverse-engineering the algorithm, the Perceptron is equivalent to running the SGM on the Support Vector Machine (SVM) objective function.

Definition 10.3 (SVM). Given labeled points $((x_1, y_1), \dots, (x_m, y_m)) \in (\mathbb{R}^n \times \{-1, 1\})^m$, the SVM objective function is:

$$f(w) = \frac{1}{m} \sum_{i=1}^{m} \max(1 - y_i \langle w, x_i \rangle, 0) + \lambda \|w\|_2^2$$

The loss function $\max(1 - z, 0)$ is known as the hinge loss. The extra term $\lambda \|w\|_2^2$ is known as the regularization term.

10.3 Empirical risk minimization

We have two spaces of objects $X$ and $Y$, where we think of $X$ as the space of instances or examples, and $Y$ is the set of labels or classes. Our goal is to learn a function $h \colon X \to Y$ which outputs an object $y \in Y$, given $x \in X$. Assume there is a joint distribution $D$ over the space $X \times Y$ and the training set consists of $m$ instances $S = ((x_1, y_1), \dots, (x_m, y_m))$ drawn i.i.d. from $D$. We also define a non-negative real-valued loss function $\ell(\hat{y}, y)$ to measure the difference between the prediction $\hat{y}$ and the true outcome $y$.

Definition 10.4. The risk of a function $h \colon X \to Y$ is defined as

$$R[h] = \mathbb{E}_{(x,y) \sim D}\, \ell(h(x), y).$$

The ultimate goal of a learning algorithm is to find $h^*$ among a class of functions $H$ that minimizes $R[h]$:

$$h^* \in \operatorname*{argmin}_{h \in H} R[h]$$

In general, the risk $R[h]$ cannot be computed because the joint distribution is unknown. Therefore, we instead minimize a proxy objective known as the empirical risk, defined by averaging the loss function over the training set:

$$R_S[h] = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i)$$

An empirical risk minimizer is any point $\hat{h} \in \operatorname*{argmin}_{h \in H} R_S[h]$.

The stochastic gradient method can be thought of as minimizing the risk directly, if each example is only used once. In cases where we make multiple passes over the training set, it is better to think of it as minimizing the empirical risk, which can give different solutions than minimizing the risk. We'll develop tools to relate risk and empirical risk in the next lecture.
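
To connect Definitions 10.3 and 10.4, the following sketch evaluates the empirical risk $R_S[w]$ of a linear classifier under the hinge loss with $\ell_2$ regularization; averaging over the $m$ training points is the convention assumed here.

```python
import numpy as np

def svm_empirical_risk(w, X, y, lam):
    """Empirical risk R_S[w]: average hinge loss plus l2 regularization.

    X : (m, n) array of points x_i;  y : (m,) labels in {-1, +1};  lam : lambda.
    """
    margins = y * (X @ w)                        # y_i <w, x_i> for each example
    hinge = np.maximum(1.0 - margins, 0.0)       # hinge loss max(1 - z, 0)
    return hinge.mean() + lam * np.dot(w, w)     # average loss + lambda ||w||_2^2
```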

10.4 Online learning

An interesting variant of this learning setup is called online learning. It arises when we do not have a set of training data, but rather must make decisions one-by-one.

Taking advice from experts. Imagine we have access to the predictions of $n$ experts. We start from an initial distribution over experts, given by weights $w_1 \in \Delta_n = \{w \in \mathbb{R}^n : \sum_i w_i = 1, w_i \geq 0\}$. At each step $t = 1, \dots, T$:

- we randomly choose an expert according to $w_t$,
- nature deals us a loss function $f_t \in [-1, 1]^n$, specifying for each expert $i$ the loss $f_t[i]$ incurred by the prediction of expert $i$ at time $t$,
- we incur the expected loss $\mathbb{E}_{i \sim w_t} f_t[i] = \langle w_t, f_t \rangle$,
- we get to update our distribution from $w_t$ to $w_{t+1}$.

At the end of the day, we measure how well we performed relative to the best fixed distribution over experts in hindsight. This is called regret:

$$R_T = \sum_{t=1}^{T} \langle w_t, f_t \rangle - \min_{w \in \Delta_n} \sum_{t=1}^{T} \langle w, f_t \rangle$$

This is a relative benchmark. Small regret does not say that the loss is necessarily small. It only says that had we played the same strategy at all steps, we couldn't have done much better even with the benefit of hindsight.

10.5 Multiplicative weights update

Perhaps the most important online learning algorithm is the multiplicative weights update. Starting from the uniform distribution $w_1$, proceed according to the following simple update rule for $t > 1$:

$$v_t[i] = w_{t-1}[i]\, e^{-\eta f_t[i]} \qquad \text{(exponential weights update)}$$
$$w_t = v_t \Big/ \Big(\textstyle\sum_i v_t[i]\Big) \qquad \text{(normalize)}$$

The question is how we bound the regret of the multiplicative weights update. We could do a direct analysis, but instead we'll relate multiplicative weights to gradient descent and use the convergence guarantees we already know. Specifically, we will interpret multiplicative weights as an instance of mirror descent. Recall that mirror descent requires a mirror map $\phi \colon \Omega \to \mathbb{R}$ over a domain $\Omega \subseteq \mathbb{R}^n$, where $\phi$ is strongly convex and continuously differentiable.
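
Below is a minimal sketch of the multiplicative weights update applied to a given sequence of loss vectors; returning the whole weight history is just an illustrative choice.

```python
import numpy as np

def multiplicative_weights(losses, eta):
    """Sketch of the multiplicative weights update of Section 10.5.

    losses : (T, n) array with losses[t] = f_t, each entry in [-1, 1]
    eta    : step size
    Returns the distributions w_1, ..., w_T and the total expected loss.
    """
    T, n = losses.shape
    w = np.full(n, 1.0 / n)                        # uniform initial distribution w_1
    history, total_loss = [], 0.0
    for t in range(T):
        history.append(w.copy())
        total_loss += float(np.dot(w, losses[t]))  # expected loss <w_t, f_t>
        v = w * np.exp(-eta * losses[t])           # exponential weights update
        w = v / v.sum()                            # normalize back onto the simplex
    return np.array(history), total_loss
```

Comparing `total_loss` to `losses.sum(axis=0).min()` measures the regret against the best single expert, which for linear losses coincides with the best fixed distribution in hindsight.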

The associated projection is

$$\Pi^{\phi}_{\Omega}(y) = \operatorname*{argmin}_{x \in \Omega} D_{\phi}(x, y)$$

where $D_{\phi}(x, y)$ is the Bregman divergence.

Definition 10.5. The Bregman divergence measures how good the first-order approximation of the function $\phi$ is:

$$D_{\phi}(x, y) = \phi(x) - \phi(y) - \nabla \phi(y)^{\top}(x - y)$$

The mirror descent update rule is:

$$\nabla \phi(y_{t+1}) = \nabla \phi(x_t) - \eta g_t$$
$$x_{t+1} = \Pi^{\phi}_{\Omega}(y_{t+1})$$

where $g_t \in \partial f(x_t)$. In the first homework, we proved the following result.

Theorem 10.6. Let $\|\cdot\|$ be an arbitrary norm and suppose that $\phi$ is $\alpha$-strongly convex w.r.t. $\|\cdot\|$ on $\Omega$. Suppose that $f_t$ is $L$-Lipschitz w.r.t. $\|\cdot\|$. Then, we have

$$\frac{1}{T} \sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \leq \frac{D_{\phi}(x^*, x_0)}{\eta T} + \eta \frac{L^2}{2\alpha}.$$

Multiplicative weights is an instance of mirror descent where

$$\phi(w) = \sum_i w_i \log(w_i)$$

is the negative entropy function. We have that $\nabla \phi(w) = 1 + \log(w)$, where the logarithm is elementwise. The update rule for mirror descent is

$$\nabla \phi(v_{t+1}) = \nabla \phi(w_t) - \eta_t f_t,$$

which implies that

$$v_{t+1} = w_t e^{-\eta_t f_t}$$

and thus recovers the multiplicative weights update. Now comes the projection step. The Bregman divergence corresponding to $\phi$ is

$$D_{\phi}(x, y) = \phi(x) - \phi(y) - \nabla \phi(y)^{\top}(x - y) = \sum_i x_i \log(x_i / y_i) - \sum_i x_i + \sum_i y_i,$$

which we recognize as the relative entropy or Kullback-Leibler divergence over the probability simplex. We thus choose the domain $\Omega$ to be the probability simplex $\Omega = \{w \in \mathbb{R}^n : \sum_i w_i = 1, w_i \geq 0\}$. The projection

$$\Pi^{\phi}_{\Omega}(y) = \operatorname*{argmin}_{x \in \Omega} D_{\phi}(x, y)$$

turns out to just correspond to the normalization step in the update rule of the multiplicative weights algorithm.
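
As a quick numerical sanity check of this last claim (illustrative, not part of the notes): for the negative-entropy mirror map, the Bregman projection of a positive vector onto the simplex should coincide with simple normalization, so no random point on the simplex should achieve a smaller divergence.

```python
import numpy as np

def negentropy_bregman(x, y):
    """Bregman divergence of phi(w) = sum_i w_i log w_i:
    D_phi(x, y) = sum_i x_i log(x_i / y_i) - sum_i x_i + sum_i y_i."""
    return float(np.sum(x * np.log(x / y) - x + y))

rng = np.random.default_rng(0)
y = rng.random(5) + 0.1                 # arbitrary positive vector, not normalized
proj = y / y.sum()                      # claimed Bregman projection: just normalize
for _ in range(1000):
    x = rng.dirichlet(np.ones(5))       # random point on the probability simplex
    assert negentropy_bregman(proj, y) <= negentropy_bregman(x, y) + 1e-9
```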

Concrete rate of convergence. To get a concrete rate of convergence from the preceding theorem, we still need to determine what value of the strong convexity constant $\alpha$ we get in our setting. Here, we choose the norm to be the $\ell_1$-norm. It follows from Pinsker's inequality that $\phi$ is $1/2$-strongly convex with respect to the $\ell_1$-norm. Moreover, in the dual $\ell_\infty$-norm all gradients are bounded by $1$, since the loss ranges in $[-1, 1]$. Finally, the relative entropy between the initial uniform distribution and any other distribution is at most $\log(n)$. Putting these facts together ($D_{\phi}(w^*, w_1) \leq \log n$, $L = 1$, $\alpha = 1/2$) and balancing the step size by choosing $\eta$ proportional to $\sqrt{\log n / T}$, we get the normalized regret bound

$$O\left(\sqrt{\frac{\log n}{T}}\right).$$

In particular, this shows that the normalized regret of the multiplicative weights update rule is vanishing over time.

References

[RM51] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.

[Ros58] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408, 1958.