Estimators based on non-convex programs: Statistical and computational guarantees
1 Estimators based on non-convex programs: Statistical and computational guarantees
Martin Wainwright, UC Berkeley, Statistics and EECS
Based on joint work with: Po-Ling Loh (UC Berkeley)
Martin Wainwright (UC Berkeley), May
2 Introduction
Prediction/regression problems arise throughout statistics:
- vector of predictors/covariates x ∈ R^p
- response variable y ∈ R
In the big-data setting, both the sample size n and the ambient dimension p are large; many problems have p ≫ n.
Regularization is essential:

    θ̂ ∈ argmin_θ { (1/n) Σ_{i=1}^n L(θ; x_i, y_i) + R_λ(θ) },

where the first term is the empirical loss L_n(θ; x_1^n, y_1^n) and R_λ is the regularizer.
For non-convex problems: mind the gap! Any global optimum is statistically good, but efficient algorithms only find local optima.
Question: How to close this undesirable gap between statistics and computation?
6 Vignette A: Regression with non-convex penalties
Set-up: observe (y_i, x_i) pairs for i = 1, 2, ..., n, where y_i ~ Q(· | ⟨θ*, x_i⟩) and θ* ∈ R^p has low-dimensional structure (support S, complement S^c).
Estimator: R_λ-regularized maximum likelihood,

    θ̂ ∈ argmin_θ { −(1/n) Σ_{i=1}^n log Q(y_i | x_i, θ) + R_λ(θ) }.

Example: logistic regression for binary responses y_i ∈ {0, 1}:

    θ̂ ∈ argmin_θ { (1/n) Σ_{i=1}^n [ log(1 + e^{⟨x_i, θ⟩}) − y_i ⟨x_i, θ⟩ ] + R_λ(θ) }.

Many non-convex penalties are possible: the capped ℓ₁ penalty, the SCAD penalty (Fan & Li, 2001), and the MCP penalty (Zhang, 2006).
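As a concrete sketch of this objective (my own illustration, not from the slides; the toy data X, y below are arbitrary), the logistic loss and its gradient can be written in a few lines of NumPy:

```python
import numpy as np

def logistic_loss(theta, X, y):
    """(1/n) * sum_i [ log(1 + exp(<x_i, theta>)) - y_i * <x_i, theta> ]."""
    u = X @ theta
    return np.mean(np.logaddexp(0.0, u) - y * u)

def logistic_grad(theta, X, y):
    """Gradient of the loss above: (1/n) * X^T (sigmoid(X theta) - y)."""
    u = X @ theta
    return X.T @ (1.0 / (1.0 + np.exp(-u)) - y) / len(y)

# Tiny illustrative data set
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
```

A penalty R_λ (e.g., SCAD or MCP) would then be added to this smooth loss before optimization.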
9 Convex and non-convex regularizers
[Figure: plot of regularizers R_λ(t) versus t, comparing the ℓ₁ penalty, the capped ℓ₁ penalty, and MCP.]
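For reference, the two non-convex penalties in the figure have simple closed forms. A minimal NumPy sketch (the parameter names lam, a, b and their defaults are mine; a = 3.7 and b = 3 are conventional choices):

```python
import numpy as np

def mcp(t, lam, b=3.0):
    """MCP: lam*|t| - t^2/(2b) for |t| <= b*lam, constant b*lam^2/2 beyond."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= b * lam, lam * t - t**2 / (2 * b), 0.5 * b * lam**2)

def scad(t, lam, a=3.7):
    """SCAD: lam*|t| near zero, a quadratic transition, then constant."""
    t = np.abs(np.asarray(t, dtype=float))
    mid = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return np.where(t <= lam, lam * t,
                    np.where(t <= a * lam, mid, 0.5 * lam**2 * (a + 1)))
```

Both agree with λ|t| near the origin and flatten to a constant for large |t|, which is what removes the bias that the ℓ₁ penalty imposes on large coefficients.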
10 Statistical error versus optimization error
An algorithm generates a sequence of iterates {θ^t}_{t≥0} to solve

    θ̂ ∈ argmin_θ { (1/n) Σ_{i=1}^n L(θ; x_i, y_i) + R_λ(θ) }.

The global minimizer of the population risk is

    θ* := argmin_θ E_{X,Y}[ L(θ; X, Y) ],

where L̄(θ) := E_{X,Y}[L(θ; X, Y)] denotes the population loss.
Goal of the statistician: bound the statistical error ‖θ^t − θ*‖ or ‖θ̂ − θ*‖.
Goal of the optimization theorist: bound the optimization error ‖θ^t − θ̂‖.
13 Logistic regression with non-convex regularizer
[Figure: log error plot for logistic regression with SCAD (a = 3.7): log ‖β^t − β*‖₂ versus iteration count, showing both optimization error and statistical error.]
14 What phenomena need to be explained?
Empirical observation #1: from a statistical perspective, all local optima are essentially as good as a global optimum.
Some past work:
- for the least-squares loss, certain local optima are good (Zhang & Zhang, 2012)
- if initialized at a Lasso solution with ℓ∞ guarantees, a local algorithm has good behavior (Fan et al., 2012)
Empirical observation #2: first-order methods converge as fast as possible, up to statistical precision.
17 Vignette B: Error-in-variables regression
Begin with high-dimensional sparse regression: y = Xθ* + ε, with y ∈ R^n, X ∈ R^{n×p}, and θ* supported on S (complement S^c).
Now observe y ∈ R^n and a corrupted matrix Z ∈ R^{n×p}, where Z = F(X; W) and W ∈ R^{n×p} is a stochastic perturbation.
19 Example: Missing data
Z = X ⊙ W, where W ∈ R^{n×p} is a multiplicative perturbation (e.g., W_ij ~ Ber(α)), applied to the linear model y = Xθ* + ε.
20 Example: Additive perturbations
Additive noise in the covariates: Z = X + W, where W ∈ R^{n×p} is an additive perturbation (e.g., W_ij ~ N(0, σ²)), again with y = Xθ* + ε.
21 A second look at regularized least-squares
Equivalent formulation:

    θ̂ ∈ argmin_{θ ∈ R^p} { (1/2) θ^T (X^T X / n) θ − ⟨X^T y / n, θ⟩ + R_λ(θ) }.

Population view: the sample quantities are unbiased estimators,

    cov(x_1) = E[X^T X / n],  and  cov(x_1, y_1) = E[X^T y / n].

A general family of estimators:

    θ̂ ∈ argmin_{θ ∈ R^p} { (1/2) θ^T Γ̂ θ − ⟨γ̂, θ⟩ + λ_n ‖θ‖₁ },

where (Γ̂, γ̂) are unbiased estimators of cov(x_1) and cov(x_1, y_1).
24 Example: Estimator for missing data
Observe the corrupted version Z ∈ R^{n×p}:

    Z_ij = X_ij with probability 1 − α, and missing with probability α.

Natural unbiased estimates: set missing entries to 0, define Ẑ := Z / (1 − α), and take

    Γ̂ = Ẑ^T Ẑ / n − α diag(Ẑ^T Ẑ / n),  and  γ̂ = Ẑ^T y / n.

Then solve the (doubly non-convex) optimization problem (Loh & W., 2012):

    θ̂ ∈ argmin_{θ ∈ Ω} { (1/2) θ^T Γ̂ θ − ⟨γ̂, θ⟩ + R_λ(θ) }.
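A small simulation sketch of these surrogates (my own illustration, not from the slides; the dimensions and noise level are arbitrary), assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha = 200, 10, 0.2          # alpha = probability an entry is missing

X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:3] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Observed matrix Z: each entry of X independently zeroed out w.p. alpha.
observed = rng.random((n, p)) >= alpha
Z = X * observed

# Unbiased surrogates from the slides: Zhat = Z / (1 - alpha),
# Gamma = Zhat^T Zhat / n - alpha * diag(Zhat^T Zhat / n), gamma = Zhat^T y / n.
Zhat = Z / (1 - alpha)
M = Zhat.T @ Zhat / n
Gamma = M - alpha * np.diag(np.diag(M))
gamma = Zhat.T @ y / n
```

Unlike the usual X^T X / n, the matrix Γ̂ can fail to be positive semidefinite when p > n, which is exactly the source of the non-convexity in the quadratic program above.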
27 Non-convex quadratic and non-convex regularizer
[Figure: log error plot for corrected linear regression with MCP: log ‖β^t − β*‖₂ versus iteration count, showing both optimization error and statistical error.]
28 Remainder of talk
1. Why are all local optima statistically good?
   - Restricted strong convexity
   - A general theorem
   - Various examples
2. Why do first-order gradient methods converge quickly?
   - Composite gradient methods
   - Statistical versus optimization error
   - Fast convergence for non-convex problems
29 Geometry of a non-convex quadratic loss
The loss function has directions of both positive and negative curvature. The negative-curvature directions must be forbidden by the regularizer.
31 Restricted strong convexity
Here defined with respect to the ℓ₁-norm:
Definition. The loss function L_n satisfies RSC with parameters (α_j, τ_j), j = 1, 2, if for all Δ,

    ⟨∇L_n(θ* + Δ) − ∇L_n(θ*), Δ⟩  ≥  α₁ ‖Δ‖₂² − τ₁ (log p / n) ‖Δ‖₁²   if ‖Δ‖₂ ≤ 1,
                                      α₂ ‖Δ‖₂ − τ₂ √(log p / n) ‖Δ‖₁   if ‖Δ‖₂ > 1,

where the left-hand side is a measure of curvature.
This condition:
- holds with τ₁ = τ₂ = 0 for any function that is locally strongly convex around θ*
- holds for a variety of loss functions (convex and non-convex):
  - ordinary least-squares (Raskutti, W. & Yu, 2010)
  - likelihoods for generalized linear models (Negahban et al., 2012)
  - certain non-convex quadratic functions (Loh & W., 2012)
34 Well-behaved regularizers
Properties are defined at the univariate level, R_λ : R → [0, ∞]:
- R_λ(0) = 0, and R_λ is symmetric around zero (R_λ(t) = R_λ(−t))
- non-decreasing on [0, ∞) and subadditive: R_λ(s + t) ≤ R_λ(s) + R_λ(t)
- the function t ↦ R_λ(t)/t is non-increasing for t > 0
- differentiable for all t ≠ 0, and subdifferentiable at t = 0 with subgradients bounded in absolute value by λL
- for some μ > 0, the function R̄_λ(t) := R_λ(t) + μt² is convex
Includes (among others):
- the rescaled ℓ₁ penalty: R_λ(t) = λ|t|
- the MCP and SCAD penalties (Fan & Li, 2001; Zhang, 2006)
- but not the capped ℓ₁ penalty
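These properties can be spot-checked numerically. A small sketch for SCAD (my own check, not from the slides), verifying subadditivity and that R_λ(t)/t is non-increasing on a grid:

```python
import numpy as np

def scad(t, lam=1.0, a=3.7):
    """SCAD penalty: lam*|t| near zero, a quadratic transition, then constant."""
    t = np.abs(np.asarray(t, dtype=float))
    mid = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return np.where(t <= lam, lam * t,
                    np.where(t <= a * lam, mid, 0.5 * lam**2 * (a + 1)))

grid = np.linspace(0.01, 5.0, 200)
S, T = np.meshgrid(grid, grid)

# Subadditivity: R(s + t) <= R(s) + R(t) at all grid points.
subadditive = bool(np.all(scad(S + T) <= scad(S) + scad(T) + 1e-12))

# t -> R(t)/t non-increasing for t > 0.
ratio = scad(grid) / grid
nonincreasing = bool(np.all(np.diff(ratio) <= 1e-12))
```

Both checks pass because SCAD is concave on [0, ∞) with R_λ(0) = 0; the capped ℓ₁ penalty instead fails the differentiability and weak-convexity conditions at its cap.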
36 Main statistical guarantee
Regularized M-estimator:

    θ̂ ∈ argmin_{‖θ‖₁ ≤ M} { L_n(θ) + R_λ(θ) },

where the loss function satisfies (α, τ)-RSC, and the regularizer is regular with parameters (μ, L).
A local optimum θ̃ is defined by the first-order condition

    ⟨∇L_n(θ̃) + ∇R_λ(θ̃), θ − θ̃⟩ ≥ 0 for all feasible θ.

Theorem (Loh & W., 2013). Suppose M is chosen such that θ* is feasible, and λ satisfies the bounds

    max{ ‖∇L_n(θ*)‖_∞, α₂ √(log p / n) } ≤ λ ≤ α₂ / (6LM).

Then any local optimum θ̃ satisfies the bound

    ‖θ̃ − θ*‖₂ ≤ 6λ√s / (4(α − μ)),  where s = ‖θ*‖₀.
39 Geometry of local/global optima
[Figure: all local optima θ̃ lie within a ball of radius ε_stat around the target θ*.]
Consequence: all { local, global } optima are within distance ε_stat of the target θ*.
With λ = c √(log p / n), the statistical error scales as

    ε_stat ≍ √(s log p / n),

which is minimax optimal.
41 Empirical results (unrescaled)
[Figure: mean squared error versus raw sample size, for p = 128, 256, 512, 1024.]
42 Empirical results (rescaled)
[Figure: mean squared error versus rescaled sample size, for p = 128, 256, 512, 1024.]
43 Comparisons between different penalties
[Figure: comparing penalties for corrected linear regression: ℓ₂-norm error versus n/(k log p), for p = 64, 128, 256.]
44 First-order algorithms and fast convergence
Thus far, we have shown that all local optima are statistically good. How do we obtain a local optimum quickly?
Composite gradient descent for regularized objectives (Nesterov, 2007):

    min_{θ ∈ Ω} { f(θ) + g(θ) },

where f is differentiable and g is convex and subdifferentiable.
Simple updates:

    θ^{t+1} = argmin_{θ ∈ Ω} { (1/(2α_t)) ‖θ − (θ^t − α_t ∇f(θ^t))‖₂² + g(θ) },

where α_t is the stepsize.
This is not directly applicable with f = L_n and g = R_λ, since R_λ can be non-convex.
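A minimal sketch of these updates (my own illustration, not the slides' code), using the convex pair f = least squares and g = λ‖·‖₁, whose proximal step is entrywise soft-thresholding:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: entrywise shrinkage toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_gradient(grad_f, prox_g, theta0, step, iters=500):
    """theta^{t+1} = prox_{step * g}( theta^t - step * grad_f(theta^t) )."""
    theta = theta0.copy()
    for _ in range(iters):
        theta = prox_g(theta - step * grad_f(theta), step)
    return theta

# Lasso instance: f(theta) = ||X theta - y||_2^2 / (2n), g = lam * ||theta||_1.
rng = np.random.default_rng(1)
n, p, lam = 100, 20, 0.1
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:2] = 2.0
y = X @ theta_star + 0.05 * rng.standard_normal(n)

grad_f = lambda th: X.T @ (X @ th - y) / n
step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()   # 1/L for the smooth part
theta_hat = composite_gradient(grad_f,
                               lambda v, s: soft_threshold(v, s * lam),
                               np.zeros(p), step)
```

With a non-convex R_λ, the same template applies only after the convex splitting described on the following slides.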
48 Composite gradient on a convenient splitting
Define a modified loss function and regularizer:

    L̄_n(θ) := L_n(θ) − μ‖θ‖₂²  (non-convex),  and  R̄_λ(θ) := R_λ(θ) + μ‖θ‖₂²  (convex).

Apply composite gradient descent to the objective

    min_{θ ∈ Ω} { L̄_n(θ) + R̄_λ(θ) }.

This converges to a local optimum θ̃ (Nesterov, 2007):

    ⟨∇L̄_n(θ̃) + ∇R̄_λ(θ̃), θ − θ̃⟩ ≥ 0 for all feasible θ.

We will show that convergence is geometrically fast with a constant stepsize.
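To see why the splitting works, here is a small numerical check (my own sketch, not from the slides): MCP with parameter b has weakest curvature −1/b, so R̄_λ(t) = R_λ(t) + μt² becomes convex once μ ≥ 1/(2b). Discrete second differences verify this:

```python
import numpy as np

def mcp(t, lam=1.0, b=3.0):
    """MCP penalty: lam*|t| - t^2/(2b) for |t| <= b*lam, else b*lam^2/2."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= b * lam, lam * t - t**2 / (2 * b), 0.5 * b * lam**2)

lam, b = 1.0, 3.0
mu = 1.0 / (2 * b)                 # smallest mu that convexifies this MCP

t = np.linspace(-5.0, 5.0, 2001)
R_bar = mcp(t, lam, b) + mu * t**2

# Second differences of a convex function are nonnegative; MCP alone fails this.
convexified = bool(np.all(np.diff(R_bar, 2) >= -1e-9))
still_nonconvex = bool(np.any(np.diff(mcp(t, lam, b), 2) < -1e-9))
```

The quadratic added to R_λ is exactly the one subtracted from L_n, so the overall objective is unchanged; only the split between "smooth part" and "prox part" moves.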
52 Theoretical guarantees on computational error
- implement Nesterov's composite method with constant stepsize on the (L̄_n, R̄_λ) splitting
- a fixed global optimum θ̂ defines the statistical error ε²_stat = ‖θ̂ − θ*‖₂²
- the population minimizer θ* is s-sparse
- the loss function satisfies (α, τ)-RSC and smoothness conditions, and the regularizer is (μ, L)-good
Theorem (Loh & W., 2013). If n ≳ s log p, there is a contraction factor κ ∈ (0, 1) such that for any δ ≥ ε_stat,

    ‖θ^t − θ̂‖₂² ≲ (1 / (α − μ)) ( δ² + τ (s log p / n) ε²_stat )

for all t ≥ T(δ) iterations, where T(δ) ≍ log(1/δ) / log(1/κ).
55 Geometry of result
[Figure: iterates θ^0, θ^1, ..., θ^t contracting toward θ̂, up to the tolerance ε around θ*.]
The optimization error Δ^t := θ^t − θ̂ decreases geometrically up to the statistical tolerance:

    ‖θ^{t+1} − θ̂‖₂² ≲ κ^t ‖θ^0 − θ̂‖₂² + o(‖θ̂ − θ*‖₂²) for all t = 0, 1, 2, ...,

where the last term is of the order of the statistical error ε²_stat.
56 Non-convex linear regression with SCAD
[Figures: log error plots for corrected linear regression with SCAD, for two settings of the parameter a: log ‖β^t − β*‖₂ versus iteration count, showing both optimization error and statistical error.]
58 Summary
M-estimators based on non-convex programs arise frequently. Under suitable regularity conditions, we showed that:
- all local optima are well-behaved from the statistical point of view
- simple first-order methods converge as fast as possible
Many open questions remain:
- similar guarantees for more general problems?
- geometry of non-convex problems in statistics?
Papers and pre-prints:
- Loh & W. (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. arXiv pre-print.
- Loh & W. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Annals of Statistics, 40.
- Negahban et al. (2012). A unified framework for high-dimensional analysis of M-estimators. Statistical Science, 27(4).
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationStatistics 300B Winter 2018 Final Exam Due 24 Hours after receiving it
Statistics 300B Winter 08 Final Exam Due 4 Hours after receiving it Directions: This test is open book and open internet, but must be done without consulting other students. Any consultation of other students
More informationarxiv: v1 [math.st] 10 Sep 2015
Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees Department of Statistics Yudong Chen Martin J. Wainwright, Department of Electrical Engineering and
More informationSOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu
SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationarxiv: v5 [cs.lg] 23 Dec 2017 Abstract
Nonconvex Sparse Learning via Stochastic Optimization with Progressive Variance Reduction Xingguo Li, Raman Arora, Han Liu, Jarvis Haupt, and Tuo Zhao arxiv:1605.02711v5 [cs.lg] 23 Dec 2017 Abstract We
More informationConvex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013
Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for
More informationsparse and low-rank tensor recovery Cubic-Sketching
Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 11, 2016 Paper presentations and final project proposal Send me the names of your group member (2 or 3 students) before October 15 (this Friday)
More informationGLOBAL SOLUTIONS TO FOLDED CONCAVE PENALIZED NONCONVEX LEARNING. BY HONGCHENG LIU 1,TAO YAO 1 AND RUNZE LI 2 Pennsylvania State University
The Annals of Statistics 2016, Vol. 44, No. 2, 629 659 DOI: 10.1214/15-AOS1380 Institute of Mathematical Statistics, 2016 GLOBAL SOLUTIONS TO FOLDED CONCAVE PENALIZED NONCONVEX LEARNING BY HONGCHENG LIU
More informationNesterov s Acceleration
Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x
More informationSVRG++ with Non-uniform Sampling
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationConvex relaxation for Combinatorial Penalties
Convex relaxation for Combinatorial Penalties Guillaume Obozinski Equipe Imagine Laboratoire d Informatique Gaspard Monge Ecole des Ponts - ParisTech Joint work with Francis Bach Fête Parisienne in Computation,
More informationThe MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010
Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationComposite Loss Functions and Multivariate Regression; Sparse PCA
Composite Loss Functions and Multivariate Regression; Sparse PCA G. Obozinski, B. Taskar, and M. I. Jordan (2009). Joint covariate selection and joint subspace selection for multiple classification problems.
More informationRestricted Strong Convexity Implies Weak Submodularity
Restricted Strong Convexity Implies Weak Submodularity Ethan R. Elenberg Rajiv Khanna Alexandros G. Dimakis Department of Electrical and Computer Engineering The University of Texas at Austin {elenberg,rajivak}@utexas.edu
More informationComposite nonlinear models at scale
Composite nonlinear models at scale Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with D. Davis (Cornell), M. Fazel (UW), A.S. Lewis (Cornell) C. Paquette (Lehigh), and S. Roy (UW)
More informationNearest Neighbor based Coordinate Descent
Nearest Neighbor based Coordinate Descent Pradeep Ravikumar, UT Austin Joint with Inderjit Dhillon, Ambuj Tewari Simons Workshop on Information Theory, Learning and Big Data, 2015 Modern Big Data Across
More informationRandomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity
Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationOWL to the rescue of LASSO
OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,
More informationStatistical Sparse Online Regression: A Diffusion Approximation Perspective
Statistical Sparse Online Regression: A Diffusion Approximation Perspective Jianqing Fan Wenyan Gong Chris Junchi Li Qiang Sun Princeton University Princeton University Princeton University University
More informationEE364b Convex Optimization II May 30 June 2, Final exam
EE364b Convex Optimization II May 30 June 2, 2014 Prof. S. Boyd Final exam By now, you know how it works, so we won t repeat it here. (If not, see the instructions for the EE364a final exam.) Since you
More informationADMM and Fast Gradient Methods for Distributed Optimization
ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work
More informationConvex Optimization Algorithms for Machine Learning in 10 Slides
Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,
More informationInformation theoretic perspectives on learning algorithms
Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez
More informationLecture 1: September 25
0-725: Optimization Fall 202 Lecture : September 25 Lecturer: Geoff Gordon/Ryan Tibshirani Scribes: Subhodeep Moitra Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have
More informationProbabilistic Graphical Models & Applications
Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with
More informationAn iterative hard thresholding estimator for low rank matrix recovery
An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationarxiv: v3 [math.oc] 29 Jun 2016
MOCCA: mirrored convex/concave optimization for nonconvex composite functions Rina Foygel Barber and Emil Y. Sidky arxiv:50.0884v3 [math.oc] 9 Jun 06 04.3.6 Abstract Many optimization problems arising
More informationBoosting Methods: Why They Can Be Useful for High-Dimensional Data
New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationStochastic gradient descent and robustness to ill-conditioning
Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,
More information