Fast Rates for Exp-concave Empirical Risk Minimization

Fast Rates for Exp-concave Empirical Risk Minimization
Tomer Koren, Kfir Y. Levy
Presenter: Zhe Li
September 16, 2016

High-Level Idea
- Why can we use regularized empirical risk minimization? (A learning-theory question.)
- If we can use regularized empirical risk minimization, how do we solve it? (Optimization algorithms.)

Problem Setting
Consider the problem of minimizing a stochastic objective

    F(w) = E[f(w, Z)]

- w ∈ W ⊆ R^d, where W is a closed and convex domain
- the random variable Z is distributed according to an unknown distribution over a parameter space Z
- we are given n i.i.d. samples z_1, ..., z_n of the random variable Z
- the goal: produce an estimate ŵ ∈ W such that the excess risk E[F(ŵ)] − min_{w ∈ W} F(w) is small.

Assumptions
- f(·, z) is α-exp-concave over the domain W for some α > 0 (definition discussed later).
- f(·, z) is β-smooth over W with respect to the Euclidean norm.
- f(·, z) is C-bounded over W.
For example, the squared loss f(w, (x, y)) = (xᵀw − y)² with bounded features satisfies all three over a bounded domain.

Main Results
How do we construct an estimate ŵ? Based on the samples z_1, ..., z_n, construct

    ŵ = argmin_{w ∈ W} F̂(w),   where   F̂(w) = (1/n) Σ_{i=1}^n f(w, z_i) + (1/n) R(w)

- R(w) : W → R is a regularizer, 1-strongly convex w.r.t. the Euclidean norm.
- Assume that R(w) − R(w') ≤ B for all w, w' ∈ W, for a constant B > 0.
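To make the construction concrete, here is a minimal numpy/scipy sketch of the regularized ERM estimate for a hypothetical squared loss f(w, z) = (xᵀw − y)² with R(w) = (1/2)‖w‖², which is 1-strongly convex; the synthetic data and the unconstrained minimization (W = R^d) are simplifying assumptions for illustration only.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))                     # hypothetical samples z_i = (x_i, y_i)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def f(w, x_i, y_i):                             # per-sample loss f(w, z_i)
    return (x_i @ w - y_i) ** 2

def F_hat(w):                                   # F̂(w) = (1/n) Σ_i f(w, z_i) + (1/n) R(w)
    return np.mean((X @ w - y) ** 2) + (1.0 / n) * 0.5 * np.dot(w, w)

w_hat = minimize(F_hat, x0=np.zeros(d)).x       # regularized ERM estimate ŵ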

Main Results
Theorem. Let f : W × Z → R be a loss function defined over a closed and convex domain W ⊆ R^d that is α-exp-concave, β-smooth, and C-bounded w.r.t. its first argument. Let R : W → R be a 1-strongly-convex and B-bounded regularization function. Then for the regularized ERM estimate ŵ based on an i.i.d. sample z_1, ..., z_n, the expected excess loss is bounded as

    E[F(ŵ)] − min_{w ∈ W} F(w) ≤ 24βd/(αn) + 100Cd/n + B/n = O(d/n).

The result does not depend on which optimization algorithm computes ŵ; what matters is the learning framework (regularized ERM itself).
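As a quick sanity check on the scaling (the constants below are arbitrary illustrative values, not from the paper), one can plug numbers into the bound:

alpha, beta, C, B, d = 0.5, 2.0, 1.0, 1.0, 10   # illustrative constants only
for n in (100, 1000, 10000):
    bound = 24 * beta * d / (alpha * n) + 100 * C * d / n + B / n
    print(n, bound)                             # decays like d/n as n grows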

Proof Techniques
- Uniform stability
- Average leave-one-out stability
- Rademacher complexity
- Local Rademacher complexity

Average Leave-One-Out Stability
Define the empirical leave-one-out risk for each i = 1, ..., n:

    F̂_i(w) = (1/n) Σ_{j ≠ i} f(w, z_j) + (1/n) R(w)

and let

    ŵ_i = argmin_{w ∈ W} F̂_i(w).

The average leave-one-out stability of ŵ is defined as

    (1/n) Σ_{i=1}^n (f(ŵ_i, z_i) − f(ŵ, z_i)).
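Continuing the earlier squared-loss sketch (same X, y, n, f, F_hat, w_hat, and scipy's minimize), the average leave-one-out stability can be estimated by brute force, refitting with each sample held out; this is purely illustrative and not how the quantity is used in the proof.

def F_hat_loo(w, i):                            # F̂_i(w) = (1/n) Σ_{j≠i} f(w, z_j) + (1/n) R(w)
    mask = np.arange(n) != i
    return np.sum((X[mask] @ w - y[mask]) ** 2) / n + (1.0 / n) * 0.5 * np.dot(w, w)

stability = 0.0
for i in range(n):
    w_i = minimize(F_hat_loo, x0=w_hat, args=(i,)).x    # leave-one-out minimizer ŵ_i
    stability += f(w_i, X[i], y[i]) - f(w_hat, X[i], y[i])
stability /= n                                  # average leave-one-out stability of ŵ
print(stability)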

Average Leave-One-Out Stability
Theorem (Average leave-one-out stability). For any z_1, ..., z_n ∈ Z and for ŵ_1, ..., ŵ_n and ŵ as defined previously, we have

    (1/n) Σ_{i=1}^n (f(ŵ_i, z_i) − f(ŵ, z_i)) ≤ 24βd/(αn) + 100Cd/n.

Proof of Main Theorem
Fix an arbitrary w* ∈ W. Since ŵ minimizes F̂,

    F(w*) + (1/n) R(w*) = E[F̂(w*)] ≥ E[F̂(ŵ)],

hence

    E[F(ŵ_n)] − F(w*) ≤ E[F(ŵ_n) − F̂(ŵ)] + (1/n) R(w*).

Since the random variables ŵ_1, ..., ŵ_n have the same distribution, and each ŵ_i is independent of z_i,

    E[F(ŵ_n)] = (1/n) Σ_{i=1}^n E[F(ŵ_i)] = (1/n) Σ_{i=1}^n E[f(ŵ_i, z_i)].

Proof of Main Theorem
On the other hand,

    E[F̂(ŵ)] = (1/n) Σ_{i=1}^n E[f(ŵ, z_i)] + (1/n) E[R(ŵ)].

Combining the above inequalities,

    E[F(ŵ_n)] − F(w*) ≤ (1/n) Σ_{i=1}^n E[f(ŵ_i, z_i) − f(ŵ, z_i)] + (1/n) E[R(w*) − R(ŵ)]
                      ≤ 24βd/(αn) + 100Cd/n + B/n = O(d/n),

using the average leave-one-out stability theorem and the B-boundedness of R.

Proof of the Average Leave-One-Out Stability Theorem

Local Strong Convexity and Stability
Definition (Local strong convexity). We say that a function g : K → R is locally δ-strongly convex over a domain K ⊆ R^d at x with respect to a norm ‖·‖ if for all y ∈ K,

    g(y) ≥ g(x) + ∇g(x)ᵀ(y − x) + (δ/2) ‖y − x‖².

Local Strong Convexity and Stability
Lemma (Lemma 5). Let g_1, g_2 : K → R be two convex functions defined over a closed and convex domain K ⊆ R^d, and let x_1 ∈ argmin_{x ∈ K} g_1(x) and x_2 ∈ argmin_{x ∈ K} g_2(x). Assume that g_2 is locally δ-strongly convex at x_1 with respect to a norm ‖·‖. Then, for h = g_2 − g_1 we have

    ‖x_2 − x_1‖ ≤ (2/δ) ‖∇h(x_1)‖_*.

Furthermore, if h is convex, then

    0 ≤ h(x_1) − h(x_2) ≤ (2/δ) (‖∇h(x_1)‖_*)².        (1)

Here ‖·‖_* denotes the dual norm.
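As a hedged numerical illustration of Lemma 5 (not part of the original proof), take g_1(x) = ‖x − a‖² and g_2(x) = ‖x − b‖² over K = R^d: they are minimized at x_1 = a and x_2 = b, g_2 is 2-strongly convex (so δ = 2 in the Euclidean norm, which is self-dual), and h = g_2 − g_1 is affine, hence convex.

import numpy as np

rng = np.random.default_rng(3)
d = 4
a, b = rng.normal(size=d), rng.normal(size=d)
delta = 2.0
h = lambda x: np.sum((x - b) ** 2) - np.sum((x - a) ** 2)   # h = g_2 - g_1 (affine)
grad_h_x1 = 2.0 * (a - b)                                   # ∇h is constant in x

print(np.linalg.norm(b - a) <= (2 / delta) * np.linalg.norm(grad_h_x1))        # ‖x_2 − x_1‖ bound
print(0 <= h(a) - h(b) <= (2 / delta) * np.linalg.norm(grad_h_x1) ** 2)        # inequality (1)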

Average Stability Analysis
Some definitions:
- f_i(·) = f(·, z_i) for all i, and h_i = ∇f_i(ŵ)
- H = (1/δ) I_d + Σ_{i=1}^n h_i h_iᵀ and H_i = (1/δ) I_d + Σ_{j ≠ i} h_j h_jᵀ
- ‖x‖_M = √(xᵀ M x) denotes the norm induced by a positive definite matrix M, with dual norm ‖x‖*_M = √(xᵀ M⁻¹ x)

Lemma (Lemma 6). For all i = 1, ..., n it holds that

    f_i(ŵ_i) − f_i(ŵ) ≤ (6β/δ) (‖h_i‖*_{H_i})².
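A standalone numpy sketch of these matrices and norms (the vectors h_i below are random stand-ins for the gradients ∇f_i(ŵ), and delta is an arbitrary illustrative value):

import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 50
delta = 0.1
h = rng.normal(size=(n, d))                      # stand-ins for h_i = ∇f_i(ŵ)

H = np.eye(d) / delta + h.T @ h                  # H = (1/δ) I_d + Σ_i h_i h_iᵀ

def H_minus(i):                                  # H_i = H − h_i h_iᵀ = (1/δ) I_d + Σ_{j≠i} h_j h_jᵀ
    return H - np.outer(h[i], h[i])

def dual_norm_sq(x, M):                          # (‖x‖*_M)² = xᵀ M⁻¹ x
    return x @ np.linalg.solve(M, x)

print(dual_norm_sq(h[0], H_minus(0)))            # (‖h_0‖*_{H_0})², the quantity appearing in Lemma 6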

Average Stability Analysis
Lemma (Lemma 8). Let I = {i ∈ [n] : (‖h_i‖*_H)² > 1/2}. Then |I| ≤ 2d and

    Σ_{i ∉ I} (‖h_i‖*_{H_i})² ≤ 2d.

Lemma 6 + Lemma 8 ⇒ Stability Theorem.
Proof:
- (1/n) Σ_{i ∈ I} (f_i(ŵ_i) − f_i(ŵ)) ≤ C |I| / n ≤ 2Cd/n
- (1/n) Σ_{i ∉ I} (f_i(ŵ_i) − f_i(ŵ)) ≤ (6β/(δn)) Σ_{i ∉ I} (‖h_i‖*_{H_i})² ≤ 12βd/(δn)
Summing up the two bounds gives the stability theorem.

It remains to prove Lemma 6 and Lemma 8.

Proof of Lemma 8
Lemma (Lemma 8). Let I = {i ∈ [n] : (‖h_i‖*_H)² > 1/2}. Then |I| ≤ 2d and Σ_{i ∉ I} (‖h_i‖*_{H_i})² ≤ 2d.
Proof:
- Let a_i = h_iᵀ H⁻¹ h_i for i = 1, ..., n; note a_i > 0.
- Σ_i a_i = tr(H⁻¹ Σ_i h_i h_iᵀ) ≤ tr(H⁻¹ H) = d, and each i ∈ I contributes more than 1/2 to this sum, so |I| ≤ 2d.
- For i ∉ I (so a_i ≤ 1/2), by the Sherman-Morrison formula applied to H = H_i + h_i h_iᵀ,

      (‖h_i‖*_{H_i})² = h_iᵀ H_i⁻¹ h_i = a_i + a_i²/(1 − a_i) ≤ 2a_i.

- Therefore Σ_{i ∉ I} (‖h_i‖*_{H_i})² ≤ 2 Σ_{i ∉ I} a_i ≤ 2 Σ_i a_i ≤ 2d.
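Continuing the previous sketch (same h, H, H_minus, d, n), both the Sherman-Morrison identity h_iᵀ H_i⁻¹ h_i = a_i + a_i²/(1 − a_i) and the bound Σ_i a_i ≤ d can be checked numerically:

a = np.array([h[i] @ np.linalg.solve(H, h[i]) for i in range(n)])   # a_i = h_iᵀ H⁻¹ h_i
print(bool(a.sum() <= d))                        # Σ_i a_i = tr(H⁻¹ Σ_i h_i h_iᵀ) ≤ tr(H⁻¹ H) = d

i = 0
lhs = h[i] @ np.linalg.solve(H_minus(i), h[i])   # h_iᵀ H_i⁻¹ h_i = (‖h_i‖*_{H_i})²
rhs = a[i] + a[i] ** 2 / (1 - a[i])              # Sherman-Morrison prediction
print(np.isclose(lhs, rhs))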

Proof of Lemma 6
Lemma (Lemma 6). For all i = 1, ..., n it holds that f_i(ŵ_i) − f_i(ŵ) ≤ (6β/δ) (‖h_i‖*_{H_i})².
Proof ingredients:
- the α-exp-concavity of f
- the smoothness assumption
- Lemma 5

α-Exp-concave Functions
Definition. A function f(w) is α-exp-concave over the domain W for some α > 0 if the function exp(−α f(w)) is concave over W.

Lemma (Lemma 7). Let f : K → R be an α-exp-concave function over a convex domain K ⊆ R^d such that |f(x) − f(y)| ≤ C for any x, y ∈ K. Then for any δ ≤ (1/2) min{1/(4C), α}, it holds that for all x, y ∈ K,

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (δ/2) (∇f(x)ᵀ(y − x))².
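For a concrete instance (an assumption for illustration, not from the slides), the squared loss f(w) = (xᵀw − y)² restricted to the unit ball is α-exp-concave for small enough α; a crude midpoint-concavity spot check of exp(−αf) in numpy:

import numpy as np

rng = np.random.default_rng(2)
x_feat = np.array([0.6, 0.0, 0.8])               # fixed unit-norm feature vector (hypothetical sample)
y_lab = 0.5
alpha = 0.2                                      # small enough on the unit ball, where |xᵀw − y| ≤ 1.5

def f(w):                                        # squared loss for the single sample (x_feat, y_lab)
    return (x_feat @ w - y_lab) ** 2

def g(w):                                        # exp-concavity means g is concave over the domain
    return np.exp(-alpha * f(w))

ok = True
for _ in range(1000):                            # midpoint-concavity check inside the unit ball
    u, v = rng.normal(size=3), rng.normal(size=3)
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    ok &= bool(g((u + v) / 2) + 1e-12 >= (g(u) + g(v)) / 2)
print(ok)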

Proof of Lemma 6
Proof: Let g_1 = F̂ and g_2 = F̂_i, so h = g_2 − g_1 = −(1/n) f_i.
- Lemma 7 ⇒ F̂_i is locally (δ/n)-strongly convex at ŵ w.r.t. ‖·‖_{H_i}.
- Lemma 5 ⇒ ‖ŵ_i − ŵ‖_{H_i} ≤ (2n/δ) ‖∇h(ŵ)‖*_{H_i} = (2/δ) ‖h_i‖*_{H_i}.
- Since f_i is convex,

      f_i(ŵ_i) − f_i(ŵ) ≤ ∇f_i(ŵ_i)ᵀ(ŵ_i − ŵ)
                        = ∇f_i(ŵ)ᵀ(ŵ_i − ŵ) + (∇f_i(ŵ_i) − ∇f_i(ŵ))ᵀ(ŵ_i − ŵ).

Proof of Lemma 6
- ∇f_i(ŵ)ᵀ(ŵ_i − ŵ) = h_iᵀ(ŵ_i − ŵ) ≤ ‖h_i‖*_{H_i} ‖ŵ_i − ŵ‖_{H_i} ≤ (2/δ) (‖h_i‖*_{H_i})²
- (∇f_i(ŵ_i) − ∇f_i(ŵ))ᵀ(ŵ_i − ŵ) ≤ β ‖ŵ_i − ŵ‖²_2 ≤ βδ ‖ŵ_i − ŵ‖²_{H_i} ≤ (4β/δ) (‖h_i‖*_{H_i})², by using H_i ⪰ (1/δ) I_d.
Summing the two bounds gives f_i(ŵ_i) − f_i(ŵ) ≤ ((2 + 4β)/δ) (‖h_i‖*_{H_i})² ≤ (6β/δ) (‖h_i‖*_{H_i})² whenever β ≥ 1 (which may be assumed, since a β-smooth function is also β'-smooth for any β' ≥ β).

Open Questions
- Is the smoothness assumption necessary? It is not needed by the online-to-batch conversion approach; is it a limitation of the present analysis?
- Can the excess-risk bound be made to hold with high probability rather than only in expectation (beyond what Markov's inequality gives)?