Estimators based on non-convex programs: Statistical and computational guarantees


1 Estimators based on non-convex programs: Statistical and computational guarantees
Martin Wainwright, UC Berkeley, Statistics and EECS
Based on joint work with: Po-Ling Loh (UC Berkeley)

2 Introduction
Prediction/regression problems arise throughout statistics:
vector of predictors/covariates $x \in \mathbb{R}^p$
response variable $y \in \mathbb{R}$
In the big-data setting, both the sample size $n$ and the ambient dimension $p$ are large; many problems have $p \gg n$.

3 Introduction
Prediction/regression problems arise throughout statistics:
vector of predictors/covariates $x \in \mathbb{R}^p$
response variable $y \in \mathbb{R}$
In the big-data setting, both the sample size $n$ and the ambient dimension $p$ are large; many problems have $p \gg n$.
Regularization is essential:
$$\widehat\theta \in \arg\min_{\theta} \Big\{ \underbrace{\tfrac{1}{n}\textstyle\sum_{i=1}^n L(\theta; x_i, y_i)}_{L_n(\theta;\, x_1^n,\, y_1^n)} + \underbrace{R_\lambda(\theta)}_{\text{Regularizer}} \Big\}.$$

4 Introduction
Prediction/regression problems arise throughout statistics.
In the big-data setting, both the sample size $n$ and the ambient dimension $p$ are large; many problems have $p \gg n$.
Regularization is essential:
$$\widehat\theta \in \arg\min_{\theta} \Big\{ \underbrace{\tfrac{1}{n}\textstyle\sum_{i=1}^n L(\theta; x_i, y_i)}_{L_n(\theta;\, x_1^n,\, y_1^n)} + \underbrace{R_\lambda(\theta)}_{\text{Regularizer}} \Big\}.$$
For non-convex problems: Mind the gap!
Any global optimum is statistically good... but efficient algorithms only find local optima.

5 Introduction
Prediction/regression problems arise throughout statistics.
In the big-data setting, both the sample size $n$ and the ambient dimension $p$ are large; many problems have $p \gg n$.
Regularization is essential:
$$\widehat\theta \in \arg\min_{\theta} \Big\{ \underbrace{\tfrac{1}{n}\textstyle\sum_{i=1}^n L(\theta; x_i, y_i)}_{L_n(\theta;\, x_1^n,\, y_1^n)} + \underbrace{R_\lambda(\theta)}_{\text{Regularizer}} \Big\}.$$
For non-convex problems: Mind the gap!
Any global optimum is statistically good... but efficient algorithms only find local optima.
Question: How to close this undesirable gap between statistics and computation?

6 Vignette A: Regression with non-convex penalties
Set-up: Observe $(y_i, x_i)$ pairs for $i = 1, 2, \ldots, n$, where $y_i \sim \mathbb{Q}(\cdot \mid \langle \theta^*, x_i\rangle)$ and $\theta^* \in \mathbb{R}^p$ has low-dimensional structure.
[Figure: $y \in \mathbb{R}^n$ generated via $\mathbb{Q}(\cdot)$ from $X \in \mathbb{R}^{n \times p}$ and a sparse $\theta^*$ supported on $S$, zero on $S^c$.]

7 Vignette A: Regression with non-convex penalties
Set-up: Observe $(y_i, x_i)$ pairs for $i = 1, 2, \ldots, n$, where $y_i \sim \mathbb{Q}(\cdot \mid \langle \theta^*, x_i\rangle)$ and $\theta^* \in \mathbb{R}^p$ has low-dimensional structure.
[Figure: $y \in \mathbb{R}^n$ generated via $\mathbb{Q}(\cdot)$ from $X \in \mathbb{R}^{n \times p}$ and a sparse $\theta^*$ supported on $S$, zero on $S^c$.]
Estimator: $R_\lambda$-regularized likelihood
$$\widehat\theta \in \arg\min_{\theta} \Big\{ -\tfrac{1}{n}\textstyle\sum_{i=1}^n \log \mathbb{Q}(y_i \mid x_i, \theta) + R_\lambda(\theta) \Big\}.$$

8 Vignette A: Regression with non-convex penalties
[Figure: $y \in \mathbb{R}^n$ generated via $\mathbb{Q}(\cdot)$ from $X \in \mathbb{R}^{n \times p}$ and a sparse $\theta^*$ supported on $S$, zero on $S^c$.]
Example: Logistic regression for binary responses $y_i \in \{0,1\}$:
$$\widehat\theta \in \arg\min_{\theta} \Big\{ \tfrac{1}{n}\textstyle\sum_{i=1}^n \big[\log(1 + e^{\langle x_i, \theta\rangle}) - y_i \langle x_i, \theta\rangle\big] + R_\lambda(\theta) \Big\}.$$
Many non-convex penalties are possible: the capped $\ell_1$-penalty, the SCAD penalty (Fan & Li, 2001), and the MCP penalty (Zhang, 2006).
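
The numpy sketch below (not from the talk) spells out the penalized logistic objective above; `penalized_objective` accepts any separable penalty ($\ell_1$, SCAD, MCP, ...) as a callable, and all function names are illustrative.

```python
import numpy as np

def logistic_loss(theta, X, y):
    """L_n(theta) = (1/n) * sum_i [ log(1 + exp(<x_i, theta>)) - y_i * <x_i, theta> ]."""
    u = X @ theta
    return np.mean(np.logaddexp(0.0, u) - y * u)

def logistic_loss_grad(theta, X, y):
    """Gradient of L_n: (1/n) * X^T (sigmoid(X theta) - y)."""
    u = X @ theta
    return X.T @ (1.0 / (1.0 + np.exp(-u)) - y) / len(y)

def penalized_objective(theta, X, y, penalty):
    """L_n(theta) + R_lambda(theta), where `penalty` maps a vector to elementwise penalty values."""
    return logistic_loss(theta, X, y) + penalty(theta).sum()
```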

9 Convex and non-convex regularizers
[Figure: plots of the regularizers $R_\lambda(t)$ versus $t$: $\ell_1$, capped $\ell_1$, and MCP.]
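
As a concrete reference for the curves on this slide, here are standard univariate forms of these penalties; the parameter values ($a = 3.7$ for SCAD, $b = 3$ for MCP, cap $a = 1$ for capped $\ell_1$) are illustrative defaults of mine, not values taken from the talk.

```python
import numpy as np

def l1_penalty(t, lam=1.0):
    """Convex baseline: R_lambda(t) = lambda * |t|."""
    return lam * np.abs(t)

def capped_l1_penalty(t, lam=1.0, a=1.0):
    """Capped l1: lambda * min(|t|, a); flat (hence non-convex) beyond |t| = a."""
    return lam * np.minimum(np.abs(t), a)

def scad_penalty(t, lam=1.0, a=3.7):
    """SCAD (Fan & Li, 2001): linear near 0, quadratic transition, constant for |t| > a*lambda."""
    t = np.abs(t)
    return np.where(t <= lam, lam * t,
                    np.where(t <= a * lam,
                             (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                             lam ** 2 * (a + 1) / 2))

def mcp_penalty(t, lam=1.0, b=3.0):
    """MCP (Zhang): lambda*|t| - t^2/(2b) near 0, constant b*lambda^2/2 for |t| > b*lambda."""
    t = np.abs(t)
    return np.where(t <= b * lam, lam * t - t ** 2 / (2 * b), b * lam ** 2 / 2)

# Evaluating each penalty on a grid, e.g. np.linspace(-3, 3, 401), reproduces curves like those above.
```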

10 Statistical error versus optimization error
Algorithm generating a sequence of iterates $\{\theta^t\}_{t=0}^{\infty}$ to solve
$$\widehat\theta \in \arg\min_{\theta} \Big\{ \tfrac{1}{n}\textstyle\sum_{i=1}^n L(\theta; x_i, y_i) + R_\lambda(\theta) \Big\}.$$
Global minimizer of the population risk:
$$\theta^* := \arg\min_{\theta} \underbrace{\mathbb{E}_{X,Y}\big[L(\theta; X, Y)\big]}_{\bar L(\theta)}.$$

11 Statistical error versus optimization error
Algorithm generating a sequence of iterates $\{\theta^t\}_{t=0}^{\infty}$ to solve
$$\widehat\theta \in \arg\min_{\theta} \Big\{ \tfrac{1}{n}\textstyle\sum_{i=1}^n L(\theta; x_i, y_i) + R_\lambda(\theta) \Big\}.$$
Global minimizer of the population risk:
$$\theta^* := \arg\min_{\theta} \underbrace{\mathbb{E}_{X,Y}\big[L(\theta; X, Y)\big]}_{\bar L(\theta)}.$$
Goal of the statistician: provide bounds on the statistical error $\|\theta^t - \theta^*\|$ or $\|\widehat\theta - \theta^*\|$.

12 Statistical error versus optimization error
Algorithm generating a sequence of iterates $\{\theta^t\}_{t=0}^{\infty}$ to solve
$$\widehat\theta \in \arg\min_{\theta} \Big\{ \tfrac{1}{n}\textstyle\sum_{i=1}^n L(\theta; x_i, y_i) + R_\lambda(\theta) \Big\}.$$
Global minimizer of the population risk:
$$\theta^* := \arg\min_{\theta} \underbrace{\mathbb{E}_{X,Y}\big[L(\theta; X, Y)\big]}_{\bar L(\theta)}.$$
Goal of the statistician: provide bounds on the statistical error $\|\theta^t - \theta^*\|$ or $\|\widehat\theta - \theta^*\|$.
Goal of the optimization theorist: provide bounds on the optimization error $\|\theta^t - \widehat\theta\|$.

13 Logistic regression with non-convex regularizer
[Figure: log-error plot for logistic regression with SCAD, $a = 3.7$: $\log(\|\beta^t - \beta^*\|_2)$ versus iteration count, with optimization-error and statistical-error curves.]

14 What phenomena need to be explained? Empirical observation #1: From a statistical perspective, all local optima are essentially as good as a global optimum.

15 What phenomena need to be explained?
Empirical observation #1: From a statistical perspective, all local optima are essentially as good as a global optimum.
Some past work:
for the least-squares loss, certain local optima are good (Zhang & Zhang, 2012)
if initialized at a Lasso solution with $\ell_\infty$-guarantees, a local algorithm has good behavior (Fan et al., 2012)

16 What phenomena need to be explained?
Empirical observation #1: From a statistical perspective, all local optima are essentially as good as a global optimum.
Some past work:
for the least-squares loss, certain local optima are good (Zhang & Zhang, 2012)
if initialized at a Lasso solution with $\ell_\infty$-guarantees, a local algorithm has good behavior (Fan et al., 2012)
Empirical observation #2: First-order methods converge as fast as possible up to statistical precision.

17 Vignette B: Error-in-variables regression
Begin with high-dimensional sparse regression: $y = X\theta^* + \varepsilon$, where $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, and $\theta^*$ is supported on $S$ (zero on $S^c$).

18 Vignette B: Error-in-variables regression
Begin with high-dimensional sparse regression: $y = X\theta^* + \varepsilon$, where $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, and $\theta^*$ is supported on $S$ (zero on $S^c$).
Observe $y \in \mathbb{R}^n$ and $Z \in \mathbb{R}^{n \times p}$, where $Z = F(X, W)$; here $W \in \mathbb{R}^{n \times p}$ is a stochastic perturbation.

19 Example: Missing data
$Z = X \odot W$, where $W \in \mathbb{R}^{n \times p}$ is a multiplicative perturbation (e.g., $W_{ij} \sim \mathrm{Ber}(\alpha)$), and $y = X\theta^* + \varepsilon$ with sparse $\theta^*$ as before.

20 Example: Additive perturbations
Additive noise in the covariates: $Z = X + W$, where $W \in \mathbb{R}^{n \times p}$ is an additive perturbation (e.g., $W_{ij} \sim N(0, \sigma^2)$), and $y = X\theta^* + \varepsilon$ with sparse $\theta^*$ as before.

21 A second look at regularized least-squares
Equivalent formulation:
$$\widehat\theta \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \tfrac{1}{2}\theta^T \big(\tfrac{X^T X}{n}\big)\theta - \big\langle \theta, \tfrac{X^T y}{n}\big\rangle + R_\lambda(\theta) \Big\}.$$

22 A second look at regularized least-squares
Equivalent formulation:
$$\widehat\theta \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \tfrac{1}{2}\theta^T \big(\tfrac{X^T X}{n}\big)\theta - \big\langle \theta, \tfrac{X^T y}{n}\big\rangle + R_\lambda(\theta) \Big\}.$$
Population view: unbiased estimators
$$\mathrm{cov}(x_1) = \mathbb{E}\Big[\tfrac{X^T X}{n}\Big], \qquad \mathrm{cov}(x_1, y_1) = \mathbb{E}\Big[\tfrac{X^T y}{n}\Big].$$

23 Corrected estimators
Equivalent formulation:
$$\widehat\theta \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \tfrac{1}{2}\theta^T \big(\tfrac{X^T X}{n}\big)\theta - \big\langle \theta, \tfrac{X^T y}{n}\big\rangle + R_\lambda(\theta) \Big\}.$$
Population view: unbiased estimators
$$\mathrm{cov}(x_1) = \mathbb{E}\Big[\tfrac{X^T X}{n}\Big], \qquad \mathrm{cov}(x_1, y_1) = \mathbb{E}\Big[\tfrac{X^T y}{n}\Big].$$
A general family of estimators:
$$\widehat\theta \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \tfrac{1}{2}\theta^T \widehat\Gamma\, \theta - \langle \widehat\gamma, \theta\rangle + \lambda_n \|\theta\|_1 \Big\},$$
where $(\widehat\Gamma, \widehat\gamma)$ are unbiased estimators of $\mathrm{cov}(x_1)$ and $\mathrm{cov}(x_1, y_1)$.

24 Example: Estimator for missing data
Observe a corrupted version $Z \in \mathbb{R}^{n \times p}$:
$$Z_{ij} = \begin{cases} X_{ij} & \text{with probability } 1 - \alpha,\\ \star \text{ (missing)} & \text{with probability } \alpha.\end{cases}$$

25 Example: Estimator for missing data
Observe a corrupted version $Z \in \mathbb{R}^{n \times p}$:
$$Z_{ij} = \begin{cases} X_{ij} & \text{with probability } 1 - \alpha,\\ \star \text{ (missing)} & \text{with probability } \alpha.\end{cases}$$
Natural unbiased estimates: set the missing entries to $0$ and define $\widehat Z := Z/(1-\alpha)$; then
$$\widehat\Gamma = \frac{\widehat Z^T \widehat Z}{n} - \alpha\, \mathrm{diag}\Big(\frac{\widehat Z^T \widehat Z}{n}\Big), \qquad \widehat\gamma = \frac{\widehat Z^T y}{n}.$$

26 Example: Estimator for missing data
Observe a corrupted version $Z \in \mathbb{R}^{n \times p}$:
$$Z_{ij} = \begin{cases} X_{ij} & \text{with probability } 1 - \alpha,\\ \star \text{ (missing)} & \text{with probability } \alpha.\end{cases}$$
Natural unbiased estimates: set the missing entries to $0$ and define $\widehat Z := Z/(1-\alpha)$; then
$$\widehat\Gamma = \frac{\widehat Z^T \widehat Z}{n} - \alpha\, \mathrm{diag}\Big(\frac{\widehat Z^T \widehat Z}{n}\Big), \qquad \widehat\gamma = \frac{\widehat Z^T y}{n}.$$
Then solve the (doubly non-convex) optimization problem (Loh & W., 2012):
$$\widehat\theta \in \arg\min_{\theta \in \Omega} \Big\{ \tfrac{1}{2}\theta^T \widehat\Gamma\, \theta - \langle \widehat\gamma, \theta\rangle + R_\lambda(\theta) \Big\}.$$
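
A minimal numpy sketch of the corrected moments on this slide, assuming missing entries of $Z$ are recorded as NaN (that encoding, and the function name, are my own choices):

```python
import numpy as np

def corrected_moments_missing(Z, y, alpha):
    """Unbiased surrogates (Gamma_hat, gamma_hat) for cov(x_1) and cov(x_1, y_1) under the
    missing-data model: each entry observed w.p. 1 - alpha, missing (NaN here) w.p. alpha."""
    n = Z.shape[0]
    Zhat = np.nan_to_num(Z, nan=0.0) / (1.0 - alpha)     # zero-fill missing entries, then rescale
    M = Zhat.T @ Zhat / n
    Gamma_hat = M - alpha * np.diag(np.diag(M))          # correct the bias on the diagonal
    gamma_hat = Zhat.T @ y / n
    return Gamma_hat, gamma_hat
```

When $p > n$, the matrix $\widehat\Gamma$ typically has negative eigenvalues, which is exactly what makes the quadratic program above non-convex.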

27 Non-convex quadratic and non-convex regularizer
[Figure: log-error plot for corrected linear regression with MCP: $\log(\|\beta^t - \beta^*\|_2)$ versus iteration count, with optimization-error and statistical-error curves.]

28 Remainder of talk
1. Why are all local optima statistically good?
   Restricted strong convexity
   A general theorem
   Various examples
2. Why do first-order gradient methods converge quickly?
   Composite gradient methods
   Statistical versus optimization error
   Fast convergence for non-convex problems

29 Geometry of a non-convex quadratic loss Loss function has directions of both positive and negative curvature.

30 Geometry of a non-convex quadratic loss Loss function has directions of both positive and negative curvature. Negative directions must be forbidden by regularizer.

31 Restricted strong convexity
Here defined with respect to the $\ell_1$-norm:
Definition. The loss function $L_n$ satisfies RSC with parameters $(\alpha_j, \tau_j)$, $j = 1, 2$, if
$$\underbrace{\langle \nabla L_n(\theta^* + \Delta) - \nabla L_n(\theta^*),\, \Delta\rangle}_{\text{Measure of curvature}} \;\ge\; \begin{cases} \alpha_1\|\Delta\|_2^2 - \tau_1\tfrac{\log p}{n}\|\Delta\|_1^2 & \text{if } \|\Delta\|_2 \le 1,\\ \alpha_2\|\Delta\|_2 - \tau_2\sqrt{\tfrac{\log p}{n}}\,\|\Delta\|_1 & \text{if } \|\Delta\|_2 > 1. \end{cases}$$

32 Restricted strong convexity
Here defined with respect to the $\ell_1$-norm:
Definition. The loss function $L_n$ satisfies RSC with parameters $(\alpha_j, \tau_j)$, $j = 1, 2$, if
$$\underbrace{\langle \nabla L_n(\theta^* + \Delta) - \nabla L_n(\theta^*),\, \Delta\rangle}_{\text{Measure of curvature}} \;\ge\; \begin{cases} \alpha_1\|\Delta\|_2^2 - \tau_1\tfrac{\log p}{n}\|\Delta\|_1^2 & \text{if } \|\Delta\|_2 \le 1,\\ \alpha_2\|\Delta\|_2 - \tau_2\sqrt{\tfrac{\log p}{n}}\,\|\Delta\|_1 & \text{if } \|\Delta\|_2 > 1. \end{cases}$$
This holds with $\tau_1 = \tau_2 = 0$ for any function that is locally strongly convex around $\theta^*$.

33 Restricted strong convexity
Here defined with respect to the $\ell_1$-norm:
Definition. The loss function $L_n$ satisfies RSC with parameters $(\alpha_j, \tau_j)$, $j = 1, 2$, if
$$\underbrace{\langle \nabla L_n(\theta^* + \Delta) - \nabla L_n(\theta^*),\, \Delta\rangle}_{\text{Measure of curvature}} \;\ge\; \begin{cases} \alpha_1\|\Delta\|_2^2 - \tau_1\tfrac{\log p}{n}\|\Delta\|_1^2 & \text{if } \|\Delta\|_2 \le 1,\\ \alpha_2\|\Delta\|_2 - \tau_2\sqrt{\tfrac{\log p}{n}}\,\|\Delta\|_1 & \text{if } \|\Delta\|_2 > 1. \end{cases}$$
This holds with $\tau_1 = \tau_2 = 0$ for any function that is locally strongly convex around $\theta^*$.
It holds for a variety of loss functions (convex and non-convex):
ordinary least-squares (Raskutti, W. & Yu, 2010)
likelihoods for generalized linear models (Negahban et al., 2012)
certain non-convex quadratic functions (Loh & W., 2012)
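
As a quick numerical illustration of the definition (my own sketch, for the least-squares loss $L_n(\theta) = \|y - X\theta\|_2^2/(2n)$): the curvature term equals $\|X\Delta\|_2^2/n$, which vanishes along null-space directions of $X$ when $p > n$ but is typically of order $\|\Delta\|_2^2$ along sparse directions, which is why the $\ell_1$ tolerance terms are needed.

```python
import numpy as np

def rsc_curvature(grad_Ln, theta_star, Delta):
    """Left side of the RSC inequality: <grad L_n(theta* + Delta) - grad L_n(theta*), Delta>."""
    return (grad_Ln(theta_star + Delta) - grad_Ln(theta_star)) @ Delta

rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
grad_Ln = lambda theta: X.T @ (X @ theta - y) / n     # gradient of ||y - X theta||^2 / (2n)
theta_star = np.zeros(p)

Delta_sparse = np.zeros(p); Delta_sparse[:5] = 1.0 / np.sqrt(5)   # unit-norm, 5-sparse direction
_, _, Vt = np.linalg.svd(X, full_matrices=True)
Delta_null = Vt[-1]                                               # unit-norm, in the null space of X

print(rsc_curvature(grad_Ln, theta_star, Delta_sparse))  # order 1: curvature holds along this sparse direction
print(rsc_curvature(grad_Ln, theta_star, Delta_null))    # ~0: global strong convexity fails when p > n
```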

34 Well-behaved regularizers
Properties defined at the univariate level, $R_\lambda : \mathbb{R} \to [0, \infty]$:
Satisfies $R_\lambda(0) = 0$, and is symmetric around zero ($R_\lambda(t) = R_\lambda(-t)$).
Non-decreasing and subadditive: $R_\lambda(s + t) \le R_\lambda(s) + R_\lambda(t)$.
The function $t \mapsto R_\lambda(t)/t$ is non-increasing for $t > 0$.
Differentiable for all $t \neq 0$, subdifferentiable at $t = 0$ with subgradients bounded in absolute value by $\lambda L$.
For some $\mu > 0$, the function $\bar R_\lambda(t) = R_\lambda(t) + \mu t^2$ is convex.

35 Well-behaved regularizers
Properties defined at the univariate level, $R_\lambda : \mathbb{R} \to [0, \infty]$:
Satisfies $R_\lambda(0) = 0$, and is symmetric around zero ($R_\lambda(t) = R_\lambda(-t)$).
Non-decreasing and subadditive: $R_\lambda(s + t) \le R_\lambda(s) + R_\lambda(t)$.
The function $t \mapsto R_\lambda(t)/t$ is non-increasing for $t > 0$.
Differentiable for all $t \neq 0$, subdifferentiable at $t = 0$ with subgradients bounded in absolute value by $\lambda L$.
For some $\mu > 0$, the function $\bar R_\lambda(t) = R_\lambda(t) + \mu t^2$ is convex.
This class includes (among others):
the rescaled $\ell_1$ penalty $R_\lambda(t) = \lambda|t|$
the MCP and SCAD penalties (Fan & Li, 2001; Zhang, 2006)
It does not include the capped $\ell_1$-penalty.
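
The last property can be checked numerically; the sketch below (my own, with illustrative parameters) verifies on a grid that adding $\mu t^2$ with $\mu = 1/(2b)$ convexifies MCP, while no quadratic repairs the concave kink of the capped $\ell_1$ penalty at $|t| = a$, consistent with its exclusion above.

```python
import numpy as np

def mcp(t, lam, b):
    t = np.abs(t)
    return np.where(t <= b * lam, lam * t - t ** 2 / (2 * b), b * lam ** 2 / 2)

def capped_l1(t, lam, a):
    return lam * np.minimum(np.abs(t), a)

def convex_on_grid(f, grid):
    """Numerical convexity check: second differences on a uniform grid are (almost) nonnegative."""
    v = f(grid)
    return np.all(v[2:] - 2 * v[1:-1] + v[:-2] >= -1e-9)

grid = np.linspace(-5, 5, 2001)
lam, b, a = 1.0, 2.0, 1.0
mu = 1.0 / (2 * b)   # for MCP, R_lambda(t) + mu * t^2 is convex once mu >= 1/(2b)

print(convex_on_grid(lambda t: mcp(t, lam, b) + mu * t ** 2, grid))          # True
print(convex_on_grid(lambda t: capped_l1(t, lam, a) + 10.0 * t ** 2, grid))  # False (no finite mu fixes the kink)
```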

36 Main statistical guarantee
Regularized M-estimator:
$$\widehat\theta \in \arg\min_{\|\theta\|_1 \le M} \big\{ L_n(\theta) + R_\lambda(\theta) \big\}.$$
The loss function satisfies $(\alpha, \tau)$-RSC, and the regularizer is regular (with parameters $(\mu, L)$).

37 Main statistical guarantee
Regularized M-estimator:
$$\widehat\theta \in \arg\min_{\|\theta\|_1 \le M} \big\{ L_n(\theta) + R_\lambda(\theta) \big\}.$$
The loss function satisfies $(\alpha, \tau)$-RSC, and the regularizer is regular (with parameters $(\mu, L)$).
A local optimum $\tilde\theta$ is defined by the condition
$$\big\langle \nabla L_n(\tilde\theta) + \nabla R_\lambda(\tilde\theta),\, \theta - \tilde\theta \big\rangle \ge 0 \quad \text{for all feasible } \theta.$$

38 Main statistical guarantee
Regularized M-estimator:
$$\widehat\theta \in \arg\min_{\|\theta\|_1 \le M} \big\{ L_n(\theta) + R_\lambda(\theta) \big\}.$$
The loss function satisfies $(\alpha, \tau)$-RSC, and the regularizer is regular (with parameters $(\mu, L)$).
A local optimum $\tilde\theta$ is defined by the condition
$$\big\langle \nabla L_n(\tilde\theta) + \nabla R_\lambda(\tilde\theta),\, \theta - \tilde\theta \big\rangle \ge 0 \quad \text{for all feasible } \theta.$$
Theorem (Loh & W., 2013). Suppose $M$ is chosen such that $\theta^*$ is feasible, and $\lambda$ satisfies the bounds
$$\max\Big\{ \|\nabla L_n(\theta^*)\|_\infty,\; \alpha\sqrt{\tfrac{\log p}{n}} \Big\} \;\le\; \lambda \;\le\; \frac{\alpha^2}{6LM}.$$
Then any local optimum $\tilde\theta$ satisfies the bound
$$\|\tilde\theta - \theta^*\|_2 \;\le\; \frac{6\lambda\sqrt{s}}{4(\alpha - \mu)}, \qquad \text{where } s = \|\theta^*\|_0.$$

39 Geometry of local/global optima
[Figure: all local optima $\tilde\theta$ and the global optimum $\widehat\theta$ lie in a ball of radius $\epsilon_{\mathrm{stat}}$ around the target $\theta^*$.]
Consequence: All {local, global} optima are within distance $\epsilon_{\mathrm{stat}}$ of the target $\theta^*$.

40 Geometry of local/global optima
[Figure: all local optima $\tilde\theta$ and the global optimum $\widehat\theta$ lie in a ball of radius $\epsilon_{\mathrm{stat}}$ around the target $\theta^*$.]
Consequence: All {local, global} optima are within distance $\epsilon_{\mathrm{stat}}$ of the target $\theta^*$.
With $\lambda = c\sqrt{\tfrac{\log p}{n}}$, the statistical error scales as $\epsilon_{\mathrm{stat}} \lesssim \sqrt{\tfrac{s \log p}{n}}$, which is minimax optimal.
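
For a sense of scale (numbers are made up and constants are ignored): with $n = 1000$ samples, $p = 20{,}000$ covariates, and sparsity $s = 10$, the prescription $\lambda \asymp \sqrt{\log p / n}$ and the rate $\sqrt{s \log p / n}$ work out roughly as follows.

```python
import numpy as np

n, p, s = 1000, 20_000, 10
lam_scale = np.sqrt(np.log(p) / n)        # about 0.10: order of the regularization level
err_scale = np.sqrt(s * np.log(p) / n)    # about 0.31: order of the l2 statistical error
print(lam_scale, err_scale)
```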

41 Empirical results (unrescaled)
[Figure: mean-squared error versus raw sample size $n$, for $p = 128, 256, 512, 1024$.]

42 Empirical results (rescaled)
[Figure: mean-squared error versus rescaled sample size, for $p = 128, 256, 512, 1024$.]

43 Comparisons between different penalties
[Figure: comparing penalties for corrected linear regression: $\ell_2$-norm error versus $n/(k \log p)$, for $p = 64, 128, 256$.]

44 First-order algorithms and fast convergence
Thus far, we have shown that all local optima are statistically good.
How to obtain a local optimum quickly?

45 First-order algorithms and fast convergence
Thus far, we have shown that all local optima are statistically good.
How to obtain a local optimum quickly?
Composite gradient descent for regularized objectives (Nesterov, 2007):
$$\min_{\theta \in \Omega} \big\{ f(\theta) + g(\theta) \big\},$$
where $f$ is differentiable and $g$ is convex and sub-differentiable.

46 First-order algorithms and fast convergence
Thus far, we have shown that all local optima are statistically good.
How to obtain a local optimum quickly?
Composite gradient descent for regularized objectives (Nesterov, 2007):
$$\min_{\theta \in \Omega} \big\{ f(\theta) + g(\theta) \big\},$$
where $f$ is differentiable and $g$ is convex and sub-differentiable.
Simple updates:
$$\theta^{t+1} = \arg\min_{\theta \in \Omega} \Big\{ \tfrac{1}{2}\big\|\theta - \big(\theta^t - \alpha_t \nabla f(\theta^t)\big)\big\|_2^2 + \alpha_t\, g(\theta) \Big\}.$$
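
For the convex special case $g(\theta) = \lambda\|\theta\|_1$ (and ignoring the constraint set $\Omega$), this update reduces to the familiar gradient-plus-soft-thresholding step; a minimal sketch, with illustrative names:

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1: elementwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def composite_gradient_step(theta, grad_f, step, lam):
    """One composite gradient update for min_theta f(theta) + lam * ||theta||_1:
    take a gradient step on the smooth part f, then apply the prox of the l1 term."""
    return soft_threshold(theta - step * grad_f(theta), step * lam)
```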

47 First-order algorithms and fast convergence
Thus far, we have shown that all local optima are statistically good.
How to obtain a local optimum quickly?
Composite gradient descent for regularized objectives (Nesterov, 2007):
$$\min_{\theta \in \Omega} \big\{ f(\theta) + g(\theta) \big\},$$
where $f$ is differentiable and $g$ is convex and sub-differentiable.
Simple updates:
$$\theta^{t+1} = \arg\min_{\theta \in \Omega} \Big\{ \tfrac{1}{2}\big\|\theta - \big(\theta^t - \alpha_t \nabla f(\theta^t)\big)\big\|_2^2 + \alpha_t\, g(\theta) \Big\}.$$
Not directly applicable with $f = L_n$ and $g = R_\lambda$ (since $R_\lambda$ can be non-convex).

48 Composite gradient on a convenient splitting
Define modified loss functions and regularizers:
$$\bar L_n(\theta) := \underbrace{L_n(\theta) - \mu\|\theta\|_2^2}_{\text{non-convex}}, \qquad \bar R_\lambda(\theta) := \underbrace{R_\lambda(\theta) + \mu\|\theta\|_2^2}_{\text{convex}}.$$

49 Composite gradient on a convenient splitting
Define modified loss functions and regularizers:
$$\bar L_n(\theta) := \underbrace{L_n(\theta) - \mu\|\theta\|_2^2}_{\text{non-convex}}, \qquad \bar R_\lambda(\theta) := \underbrace{R_\lambda(\theta) + \mu\|\theta\|_2^2}_{\text{convex}}.$$
Apply composite gradient descent to the objective
$$\min_{\theta \in \Omega} \big\{ \bar L_n(\theta) + \bar R_\lambda(\theta) \big\}.$$

50 Composite gradient on a convenient splitting
Define modified loss functions and regularizers:
$$\bar L_n(\theta) := \underbrace{L_n(\theta) - \mu\|\theta\|_2^2}_{\text{non-convex}}, \qquad \bar R_\lambda(\theta) := \underbrace{R_\lambda(\theta) + \mu\|\theta\|_2^2}_{\text{convex}}.$$
Apply composite gradient descent to the objective
$$\min_{\theta \in \Omega} \big\{ \bar L_n(\theta) + \bar R_\lambda(\theta) \big\}.$$
This converges to a local optimum $\tilde\theta$ (Nesterov, 2007):
$$\big\langle \nabla \bar L_n(\tilde\theta) + \nabla \bar R_\lambda(\tilde\theta),\, \theta - \tilde\theta \big\rangle \ge 0 \quad \text{for all feasible } \theta.$$

51 Composite gradient on a convenient splitting
Define modified loss functions and regularizers:
$$\bar L_n(\theta) := \underbrace{L_n(\theta) - \mu\|\theta\|_2^2}_{\text{non-convex}}, \qquad \bar R_\lambda(\theta) := \underbrace{R_\lambda(\theta) + \mu\|\theta\|_2^2}_{\text{convex}}.$$
Apply composite gradient descent to the objective
$$\min_{\theta \in \Omega} \big\{ \bar L_n(\theta) + \bar R_\lambda(\theta) \big\}.$$
This converges to a local optimum $\tilde\theta$ (Nesterov, 2007):
$$\big\langle \nabla \bar L_n(\tilde\theta) + \nabla \bar R_\lambda(\tilde\theta),\, \theta - \tilde\theta \big\rangle \ge 0 \quad \text{for all feasible } \theta.$$
We will show that convergence is geometrically fast with a constant stepsize.
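
The sketch below (my own, not the talk's code) runs the constant-stepsize scheme on the corrected quadratic objective from Vignette B with an $\ell_1$ penalty, for which the $(\bar L_n, \bar R_\lambda)$ splitting is trivial ($\mu = 0$); for MCP or SCAD one would instead take gradient steps on $\bar L_n$ and use the prox of the convexified $\bar R_\lambda$. The side constraint $\|\theta\|_1 \le M$ from the theory is omitted here.

```python
import numpy as np

def soft_threshold(v, kappa):
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def corrected_composite_gradient(Gamma_hat, gamma_hat, lam, step=None, max_iter=500):
    """Constant-stepsize composite gradient descent for the (possibly non-convex) quadratic
    objective (1/2) theta' Gamma_hat theta - <gamma_hat, theta> + lam * ||theta||_1."""
    p = len(gamma_hat)
    if step is None:
        # crude constant stepsize from the largest eigenvalue magnitude of the quadratic part
        step = 1.0 / max(np.abs(np.linalg.eigvalsh(Gamma_hat)).max(), 1e-12)
    theta = np.zeros(p)
    for _ in range(max_iter):
        grad = Gamma_hat @ theta - gamma_hat      # gradient of the quadratic loss
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```

In practice one would feed in the pair $(\widehat\Gamma, \widehat\gamma)$ built from the corrupted data, e.g. by the missing-data construction sketched earlier.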

52 Theoretical guarantees on computational error
Implement Nesterov's composite method with constant stepsize, applied to the $(\bar L_n, \bar R_\lambda)$ split.
A fixed global optimum $\widehat\theta$ defines the statistical error $\epsilon_{\mathrm{stat}}^2 = \|\widehat\theta - \theta^*\|_2^2$.
The population minimizer $\theta^*$ is $s$-sparse.

53 Theoretical guarantees on computational error
Implement Nesterov's composite method with constant stepsize, applied to the $(\bar L_n, \bar R_\lambda)$ split.
A fixed global optimum $\widehat\theta$ defines the statistical error $\epsilon_{\mathrm{stat}}^2 = \|\widehat\theta - \theta^*\|_2^2$.
The population minimizer $\theta^*$ is $s$-sparse.
The loss function satisfies $(\alpha, \tau)$-RSC and smoothness conditions, and the regularizer is $(\mu, L)$-good.

54 Theoretical guarantees on computational error
Implement Nesterov's composite method with constant stepsize, applied to the $(\bar L_n, \bar R_\lambda)$ split.
A fixed global optimum $\widehat\theta$ defines the statistical error $\epsilon_{\mathrm{stat}}^2 = \|\widehat\theta - \theta^*\|_2^2$.
The population minimizer $\theta^*$ is $s$-sparse.
The loss function satisfies $(\alpha, \tau)$-RSC and smoothness conditions, and the regularizer is $(\mu, L)$-good.
Theorem (Loh & W., 2013). If $n \gtrsim s \log p$, there is a contraction factor $\kappa \in (0,1)$ such that for any $\delta \ge \epsilon_{\mathrm{stat}}$,
$$\|\theta^t - \widehat\theta\|_2^2 \;\lesssim\; \frac{\delta^2 + \tau\,\tfrac{s\log p}{n}\,\epsilon_{\mathrm{stat}}^2}{\alpha - \mu} \qquad \text{for all } t \ge T(\delta) \text{ iterations, where } T(\delta) \asymp \frac{\log(1/\delta)}{\log(1/\kappa)}.$$

55 Geometry of result
[Figure: iterates $\theta^0, \theta^1, \ldots, \theta^t$ converging toward $\widehat\theta$, which lies within $\epsilon_{\mathrm{stat}}$ of $\theta^*$.]
The optimization error $\Delta^t := \theta^t - \widehat\theta$ decreases geometrically up to the statistical tolerance:
$$\|\theta^{t+1} - \widehat\theta\|_2^2 \;\le\; \kappa^t \|\theta^0 - \widehat\theta\|_2^2 + \underbrace{o\big(\|\widehat\theta - \theta^*\|_2^2\big)}_{\text{Stat. error } \epsilon_{\mathrm{stat}}^2} \qquad \text{for all } t = 0, 1, 2, \ldots$$
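
A self-contained simulation (my own, on plain sparse least squares with an $\ell_1$ penalty rather than the talk's corrected/non-convex examples) reproduces this picture: the optimization error decays roughly geometrically, while the error to $\theta^*$ flattens out at the statistical precision.

```python
import numpy as np

def soft_threshold(v, kappa):
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

rng = np.random.default_rng(1)
n, p, s = 200, 500, 10
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:s] = 1.0
y = X @ theta_star + 0.5 * rng.standard_normal(n)

lam = 2 * np.sqrt(np.log(p) / n)
step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
grad = lambda th: X.T @ (X @ th - y) / n

theta, iterates = np.zeros(p), []
for _ in range(300):
    theta = soft_threshold(theta - step * grad(theta), step * lam)
    iterates.append(theta)
theta_hat = iterates[-1]                                   # proxy for the global optimum

opt_err = [np.linalg.norm(th - theta_hat) for th in iterates[:100]]
stat_err = [np.linalg.norm(th - theta_star) for th in iterates[:100]]
print(opt_err[::20])    # decays roughly geometrically toward 0
print(stat_err[::20])   # levels off near ||theta_hat - theta_star||, i.e. the statistical error
```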

56 Non-convex linear regression with SCAD
[Figure: log-error plot for corrected linear regression with SCAD: $\log(\|\beta^t - \beta^*\|_2)$ versus iteration count, with optimization-error and statistical-error curves.]

57 Non-convex linear regression with SCAD
[Figure: a second log-error plot for corrected linear regression with SCAD: $\log(\|\beta^t - \beta^*\|_2)$ versus iteration count, with optimization-error and statistical-error curves.]

58 Summary
M-estimators based on non-convex programs arise frequently.
Under suitable regularity conditions, we showed that:
all local optima are well-behaved from the statistical point of view
simple first-order methods converge as fast as possible

59 Summary
M-estimators based on non-convex programs arise frequently.
Under suitable regularity conditions, we showed that:
all local optima are well-behaved from the statistical point of view
simple first-order methods converge as fast as possible
Many open questions remain:
similar guarantees for more general problems?
geometry of non-convex problems in statistics?

60 Summary
M-estimators based on non-convex programs arise frequently.
Under suitable regularity conditions, we showed that:
all local optima are well-behaved from the statistical point of view
simple first-order methods converge as fast as possible
Many open questions remain:
similar guarantees for more general problems?
geometry of non-convex problems in statistics?
Papers and pre-prints:
Loh & W. (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Pre-print, arXiv.
Loh & W. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Annals of Statistics, 40.
Negahban et al. (2012). A unified framework for high-dimensional analysis of M-estimators. Statistical Science, 27(4).
