On Computational Thinking, Inferential Thinking and Data Science

1 On Computational Thinking, Inferential Thinking and Data Science Michael I. Jordan University of California, Berkeley December 17, 2016

2-6 A Job Description, circa 2015
Your Boss: I need a Big Data system that will replace our classic service with a personalized service
It should work reasonably well for anyone and everyone; I can tolerate a few errors, but not too many dumb ones that will embarrass us
It should run just as fast as our classic service
It should only improve as we collect more data; in particular, it shouldn't slow down
There are serious privacy concerns, of course, and they vary across the clients

7-9 Some Challenges Driven by Big Data
Big Data analysis requires a thorough blending of computational thinking and inferential thinking
What I mean by computational thinking: abstraction, modularity, scalability, robustness, etc.
Inferential thinking means (1) considering the real-world phenomenon behind the data, (2) considering the sampling pattern that gave rise to the data, and (3) developing procedures that will go backwards from the data to the underlying phenomenon

10 The Challenges Are Daunting
The core theories in computer science and statistics developed separately, and there is an oil-and-water problem:
Core statistical theory doesn't have a place for runtime and other computational resources
Core computational theory doesn't have a place for statistical risk

11 Outline Inference under privacy constraints Inference under communication constraints Lower bounds, the variational perspective and symplectic integration

12 Part I: Inference and Privacy with John Duchi and Martin Wainwright

13 Privacy and Data Analysis
Individuals are not generally willing to allow their personal data to be used without control over how it will be used and how much privacy loss they will incur
Privacy loss can be quantified via differential privacy
We want to trade privacy loss against the value we obtain from data analysis
The question becomes that of quantifying such value and juxtaposing it with privacy loss

14-19 Privacy
[diagram: a query applied to the raw database returns an answer θ; a privatization channel Q produces a privatized database, and the same query applied to it returns a private answer θ̂]
Classical problem in differential privacy: show that θ̂ and θ are close under constraints on the channel Q

20-23 Inference
[diagram: a population P is sampled via a sampling mechanism S to produce the database; a query applied to the database returns an estimate θ̂ of the population quantity θ]
Classical problem in statistical theory: show that θ̂ and θ are close under constraints on S

24 Privacy and Inference
[diagram: population P, sampling S, database, privatization channel Q, privatized database, query answer θ̂]
The privacy-meets-inference problem: show that θ̂ and θ are close under constraints on Q and on S

25-26 Background on Inference
In the 1930s, Wald laid the foundations of statistical decision theory
Given a family of distributions P, a parameter θ(P) for each P ∈ P, an estimator θ̂, and a loss ℓ(θ̂, θ(P)), define the risk:
R_P(θ̂) := E_P[ ℓ(θ̂, θ(P)) ]
Minimax principle [Wald, '39, '43]: choose the estimator minimizing the worst-case risk:
sup over P ∈ P of E_P[ ℓ(θ̂, θ(P)) ]
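Wald's risk can be made concrete with a small Monte Carlo sketch. The setup here is my own illustration, not from the talk: squared-error loss ℓ(θ̂, θ) = (θ̂ − θ)², a Gaussian location family P = {N(θ, 1)}, and two competing estimators.

```python
import random
import statistics

def monte_carlo_risk(estimator, theta=0.0, n=50, trials=2000, seed=0):
    """Approximate R_P(theta_hat) = E_P[ l(theta_hat, theta(P)) ]
    under squared-error loss, for P = N(theta, 1) with n i.i.d. draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += (estimator(sample) - theta) ** 2
    return total / trials

risk_mean = monte_carlo_risk(statistics.fmean)     # ~ 1/n = 0.02 for the sample mean
risk_median = monte_carlo_risk(statistics.median)  # ~ (pi/2)/n, strictly larger here
```

For a single Gaussian the sample mean beats the median; the minimax principle then asks for the estimator whose worst such risk over the whole family P is smallest.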

27-28 Local Privacy
[diagram: individuals with private data communicate through a private channel to an estimator]

29 Differential Privacy
Definition: a channel Q is ε-differentially private if, for every output z and every pair of inputs x, x′,
Q(z | x) / Q(z | x′) ≤ e^ε
[Dwork, McSherry, Nissim, Smith 06]
Given the output Z, one cannot reliably discriminate between x and x′
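The canonical example of such a channel (a sketch I am adding; the slide itself names no mechanism) is randomized response on a single bit. The helper names are mine.

```python
import math
import random

def randomized_response(x, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.
    For every output z and inputs x, x': Q(z|x)/Q(z|x') <= e^eps,
    so given the output one cannot reliably discriminate between x and x'."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return x if rng.random() < p_truth else 1 - x

def debiased_proportion(private_bits, epsilon):
    """Invert the channel's known bias to get an unbiased estimate
    of the true proportion of 1-bits."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    raw = sum(private_bits) / len(private_bits)
    return (raw - (1.0 - p)) / (2.0 * p - 1.0)
```

The debiasing step is where the statistical price of privacy shows up: the variance of the estimate blows up as ε shrinks.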

30 Private Minimax Risk
Ingredients: a parameter θ(P) of the distribution, a family of distributions P, a loss ℓ measuring error, and a family Q_ε of ε-private channels
Minimax risk under the privacy constraint (optimizing over the best ε-private channel):
inf over Q ∈ Q_ε of inf over estimators θ̂ of sup over P ∈ P of E[ ℓ(θ̂, θ(P)) ]

31 Vignette: Private Mean Estimation
Example: estimate reasons for hospital visits
Patients admitted to hospital for substance abuse; estimate the prevalence of different substances
One patient's record vs. population proportions:
Alcohol 1 (proportion .45)
Cocaine 1 (.32)
Heroin 0 (.16)
Cannabis 0 (.20)
LSD 0 (.00)
Amphetamines 0 (.02)

32-34 Vignette: Mean Estimation
Consider estimation of the mean, with errors measured in the ℓ∞-norm
Proposition: minimax rate (achieved by the sample mean)
Proposition: private minimax rate
Note: effective sample size

35 Optimal Mechanism?
Idea 1: add independent noise to the non-private observation (e.g., the Laplace mechanism) [Dwork et al. 06]
Problem: the noise magnitude is much too large (and this is unavoidable: the approach is provably sub-optimal)
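Idea 1 can be sketched in a few lines. This is a hedged illustration: the scale d/ε below assumes each record is a 0/1 vector over d substances, so the ℓ1 sensitivity between any two records is at most d.

```python
import math
import random

def laplace_mechanism(x, epsilon, rng):
    """Release x plus independent Laplace noise on each coordinate.
    For x in {0,1}^d the L1 sensitivity is d, forcing per-coordinate noise
    of scale d/epsilon -- a magnitude that swamps the unit-scale signal
    as the dimension d grows (the slide's 'much too large' problem)."""
    scale = len(x) / epsilon
    def lap():
        u = rng.random() - 0.5                               # uniform on [-1/2, 1/2)
        sign = 1.0 if u >= 0 else -1.0
        return -scale * sign * math.log(1.0 - 2.0 * abs(u))  # inverse-CDF sampling
    return [xi + lap() for xi in x]
```

With d = 6 and ε = 1 the noise standard deviation per coordinate is 6·√2 ≈ 8.5, against a signal taking values in {0, 1}.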

36 Empirical Evidence
[plot: estimation error versus sample size n]
Estimate the proportion of emergency-room visits involving different substances
Data source: Drug Abuse Warning Network

37 Additional Examples
Fixed-design regression
Convex risk minimization
Multinomial estimation
Nonparametric density estimation
Almost always, the same effective sample size reduction appears

38 Part II: Inference and Compression
with Yuchen Zhang, John Duchi and Martin Wainwright

39-40 Computation and Inference
How does inferential quality trade off against classical computational resources such as time and space?
Hard!

41 Computation and Inference: Mechanisms and Bounds
Tradeoffs via convex relaxations: linking runtime to convex geometry and risk to convex geometry
Tradeoffs via concurrency control: optimistic concurrency control
Bounds via optimization oracles: number of accesses to a gradient as a surrogate for computation
Bounds via communication complexity
Tradeoffs via subsampling: bag of little bootstraps, variational consensus Monte Carlo

42 A Variational Framework for Accelerated Methods in Optimization with Andre Wibisono and Ashia Wilson July 12, 2016

43-45 Accelerated Gradient Descent
Setting: unconstrained convex optimization
min over x ∈ R^d of f(x)
Classical gradient descent:
x_{k+1} = x_k − β ∇f(x_k)
obtains a convergence rate of O(1/k)
Accelerated gradient descent:
y_{k+1} = x_k − β ∇f(x_k)
x_{k+1} = (1 − λ_k) y_{k+1} + λ_k y_k
obtains the (optimal) convergence rate of O(1/k²)
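The two updates above can be run side by side. A minimal sketch (the Python and the test function are mine; taking λ_k = −(k−1)/(k+2) recovers the usual Nesterov momentum form):

```python
def gradient_descent(grad, x0, beta, iters):
    """x_{k+1} = x_k - beta * grad f(x_k): the O(1/k) method."""
    x = list(x0)
    for _ in range(iters):
        x = [xi - beta * gi for xi, gi in zip(x, grad(x))]
    return x

def accelerated_gradient_descent(grad, x0, beta, iters):
    """y_{k+1} = x_k - beta * grad f(x_k);
    x_{k+1} = (1 - lam_k) y_{k+1} + lam_k y_k with lam_k = -(k-1)/(k+2),
    i.e. Nesterov momentum: the O(1/k^2) method."""
    x, y_old = list(x0), list(x0)
    for k in range(1, iters + 1):
        y_new = [xi - beta * gi for xi, gi in zip(x, grad(x))]
        lam = -(k - 1) / (k + 2)
        x = [(1 - lam) * yn + lam * yo for yn, yo in zip(y_new, y_old)]
        y_old = y_new
    return y_old
```

On an ill-conditioned quadratic, the accelerated iterate reaches a markedly lower objective in the same number of gradient evaluations, despite not decreasing f monotonically.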

46-47 The Acceleration Phenomenon
Two classes of algorithms:
Gradient methods: gradient descent, mirror descent, cubic-regularized Newton's method (Nesterov and Polyak 06), etc. Greedy descent methods, relatively well understood
Accelerated methods: Nesterov's accelerated gradient descent, accelerated mirror descent, accelerated cubic-regularized Newton's method (Nesterov 08), etc. Important for both theory (optimal rate for first-order methods) and practice (many extensions: FISTA, stochastic setting, etc.) Not descent methods, faster than gradient methods, still mysterious

48-50 Accelerated Methods
Analysis using Nesterov's estimate sequence technique
Common interpretation as momentum methods (Euclidean case)
Many proposed explanations:
Chebyshev polynomials (Hardt 13)
Linear coupling (Allen-Zhu, Orecchia 14)
Optimized first-order method (Drori, Teboulle 14; Kim, Fessler 15)
Geometric shrinking (Bubeck, Lee, Singh 15)
Universal catalyst (Lin, Mairal, Harchaoui 15)...
But only for strongly convex functions, or first-order methods
Question: What is the underlying mechanism that generates acceleration (including for higher-order methods)?

51-53 Accelerated Methods: Continuous-Time Perspective
Gradient descent is a discretization of gradient flow
Ẋ_t = −∇f(X_t)
(and mirror descent is a discretization of natural gradient flow)
Su, Boyd, Candes 14: the continuous-time limit of accelerated gradient descent is a second-order ODE
Ẍ_t + (3/t) Ẋ_t + ∇f(X_t) = 0
These ODEs are obtained by taking continuous-time limits. Is there a deeper generative mechanism?
Our work: a general variational approach to acceleration, and a systematic discretization methodology
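The second-order ODE can be integrated directly. A sketch: the semi-implicit Euler scheme, step sizes, and quadratic test function below are my own choices, not from the talk.

```python
def integrate_accelerated_ode(grad, x0, t0=0.1, t_end=20.0, dt=1e-3):
    """Semi-implicit Euler integration of  X'' + (3/t) X' + grad f(X) = 0,
    started from rest at t0 > 0 (the damping coefficient 3/t blows up at t = 0)."""
    x = list(x0)
    v = [0.0] * len(x0)  # X'(t0) = 0
    t = t0
    while t < t_end:
        g = grad(x)
        # update velocity first, then position with the new velocity
        v = [vi + dt * (-(3.0 / t) * vi - gi) for vi, gi in zip(v, g)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x
```

For f(x) = ½‖x‖² the trajectory oscillates toward the minimizer under the vanishing 3/t damping, with the objective shrinking like O(1/t²).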

54-60 Bregman Lagrangian
Define the Bregman Lagrangian:
L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) )
A function of position x, velocity ẋ, and time t
D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence
h is the convex distance-generating function
f is the convex objective function
α_t, β_t, γ_t ∈ R are arbitrary smooth functions
In the Euclidean setting, it simplifies to the damped Lagrangian
L(x, ẋ, t) = e^{γ_t − α_t} ( ½ ‖ẋ‖² − e^{2α_t + β_t} f(x) )
Ideal scaling conditions: β̇_t ≤ e^{α_t} and γ̇_t = e^{α_t}
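As a consistency check on the Euclidean claim: taking h(x) = ½‖x‖² gives D_h(y, x) = ½‖y − x‖², and the Bregman Lagrangian collapses step by step to the damped form:

```latex
L(x,\dot{x},t)
  = e^{\gamma_t+\alpha_t}\left(\tfrac{1}{2}\,e^{-2\alpha_t}\|\dot{x}\|^2
      - e^{\beta_t} f(x)\right)
  = e^{\gamma_t-\alpha_t}\left(\tfrac{1}{2}\|\dot{x}\|^2
      - e^{2\alpha_t+\beta_t} f(x)\right),
```

i.e., kinetic energy minus a time-weighted potential f: the classical damped-oscillator structure.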

61-62 Bregman Lagrangian
L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) )
Variational problem over curves:
min over curves X of ∫ L(X_t, Ẋ_t, t) dt
The optimal curve is characterized by the Euler-Lagrange equation:
d/dt { ∂L/∂ẋ (X_t, Ẋ_t, t) } = ∂L/∂x (X_t, Ẋ_t, t)
E-L equation for the Bregman Lagrangian under ideal scaling:
Ẍ_t + (e^{α_t} − α̇_t) Ẋ_t + e^{2α_t + β_t} [∇²h(X_t + e^{−α_t} Ẋ_t)]^{−1} ∇f(X_t) = 0

63-64 General Convergence Rate
Theorem: Under ideal scaling, the E-L equation has convergence rate
f(X_t) − f(x*) ≤ O(e^{−β_t})
Proof: Exhibit a Lyapunov function for the dynamics:
E_t = D_h(x*, X_t + e^{−α_t} Ẋ_t) + e^{β_t} (f(X_t) − f(x*))
Ė_t = −e^{α_t + β_t} D_f(x*, X_t) + (β̇_t − e^{α_t}) e^{β_t} (f(X_t) − f(x*)) ≤ 0
Note: this only requires convexity and differentiability of f and h

65-66 Polynomial Convergence Rate
For p > 0, choose parameters:
α_t = log p − log t
β_t = p log t + log C
γ_t = p log t
The E-L equation has an O(e^{−β_t}) = O(1/t^p) convergence rate:
Ẍ_t + ((p+1)/t) Ẋ_t + C p² t^{p−2} [∇²h(X_t + (t/p) Ẋ_t)]^{−1} ∇f(X_t) = 0
For p = 2: recover the result of Krichene et al. with O(1/t²) convergence rate
In the Euclidean case, recover the ODE of Su et al.:
Ẍ_t + (3/t) Ẋ_t + ∇f(X_t) = 0

67-68 Time Dilation Property (Reparameterizing Time)
(p = 2: accelerated gradient descent)
O(1/t²): Ẍ_t + (3/t) Ẋ_t + 4C [∇²h(X_t + (t/2) Ẋ_t)]^{−1} ∇f(X_t) = 0
speed up time: Y_t = X_{t^{3/2}}
O(1/t³): Ÿ_t + (4/t) Ẏ_t + 9C t [∇²h(Y_t + (t/3) Ẏ_t)]^{−1} ∇f(Y_t) = 0
(p = 3: accelerated cubic-regularized Newton's method)
All accelerated methods are traveling the same curve in space-time at different speeds
Gradient methods don't have this property
From gradient flow to rescaled gradient flow: replace ∇f(X_t) with ∇f(X_t) / ‖∇f(X_t)‖^{(p−2)/(p−1)}

69-70 Time Dilation for the General Bregman Lagrangian
O(e^{−β_t}): E-L equation for the Lagrangian L_{α,β,γ}
speed up time: Y_t = X_{τ(t)}
O(e^{−β_{τ(t)}}): E-L equation for the Lagrangian L_{α̃,β̃,γ̃}, where
α̃_t = α_{τ(t)} + log τ̇(t)
β̃_t = β_{τ(t)}
γ̃_t = γ_{τ(t)}
Question: How to discretize the E-L equation while preserving the convergence rate?

71-72 Discretizing the Dynamics (Naive Approach)
Write the E-L equation as a system of first-order equations:
Z_t = X_t + (t/p) Ẋ_t
d/dt ∇h(Z_t) = −C p t^{p−1} ∇f(X_t)
Euler discretization with time step δ > 0 (i.e., set x_k = X_t, x_{k+1} = X_{t+δ}):
x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) x_k
z_k = argmin over z of { C p k^{(p−1)} ⟨∇f(x_k), z⟩ + (1/ɛ) D_h(z, z_{k−1}) }
with step size ɛ = δ^p, where k^{(p−1)} = k(k+1)···(k+p−2) is the rising factorial

73 Naive Discretization Doesn't Work
x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) x_k
z_k = argmin over z of { C p k^{(p−1)} ⟨∇f(x_k), z⟩ + (1/ɛ) D_h(z, z_{k−1}) }
Cannot obtain a convergence guarantee, and empirically unstable
[plot: f(x_k) versus k]

74-75 Modified Discretization
Introduce an auxiliary sequence y_k:
x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) y_k
z_k = argmin over z of { C p k^{(p−1)} ⟨∇f(y_k), z⟩ + (1/ɛ) D_h(z, z_{k−1}) }
Sufficient condition: ⟨∇f(y_k), x_k − y_k⟩ ≥ M ɛ^{1/(p−1)} ‖∇f(y_k)‖^{p/(p−1)}
Assume h is uniformly convex: D_h(y, x) ≥ (1/p) ‖y − x‖^p
Theorem: f(y_k) − f(x*) ≤ O(1/(ɛ k^p))
Note: matching convergence rates, 1/(ɛ k^p) = 1/(δk)^p = 1/t^p
Proof using a generalization of Nesterov's estimate sequence technique

76 Accelerated Higher-Order Gradient Method
x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) y_k
y_k = argmin over y of { f_{p−1}(y; x_k) + (2/(ɛp)) ‖y − x_k‖^p }
z_k = argmin over z of { C p k^{(p−1)} ⟨∇f(y_k), z⟩ + (1/ɛ) D_h(z, z_{k−1}) }
(f_{p−1}(y; x_k) is the (p−1)-st-order Taylor expansion of f around x_k)
If ∇^{p−1} f is (1/ɛ)-Lipschitz and h is uniformly convex of order p, then:
f(y_k) − f(x*) ≤ O(1/(ɛ k^p)) (accelerated rate)
p = 2: accelerated gradient/mirror descent
p = 3: accelerated cubic-regularized Newton's method (Nesterov 08)
p ≥ 2: accelerated higher-order methods
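For p = 2 in the Euclidean setting (h = ½‖·‖², so the mirror step becomes a plain gradient step and f_1(y; x_k) is the first-order Taylor expansion), the three sequences above reduce to a concrete method. This is a hedged sketch: the choice C = 1/4 and the variable names are mine.

```python
def accelerated_method_p2(grad, x0, eps, iters):
    """p = 2, Euclidean instance of the three-sequence scheme:
       y_k     = x_k - (eps/2) grad f(x_k)           (gradient step)
       z_k     = z_{k-1} - (eps k / 2) grad f(y_k)   (weighted step; C = 1/4)
       x_{k+1} = 2/(k+2) z_k + k/(k+2) y_k           (coupling)"""
    x, z, y = list(x0), list(x0), list(x0)
    for k in range(1, iters + 1):
        y = [xi - 0.5 * eps * gi for xi, gi in zip(x, grad(x))]
        z = [zi - 0.5 * eps * k * gi for zi, gi in zip(z, grad(y))]
        x = [(2.0 * zi + k * yi) / (k + 2) for zi, yi in zip(z, y)]
    return y
```

With ɛ = 1/L for an L-smooth objective, this runs at the accelerated O(1/(ɛk²)) rate; it is essentially Nesterov's method written in linear-coupling form.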

77-78 Recap: Gradient vs. Accelerated Methods
How to design dynamics for minimizing a convex function f?
Rescaled gradient flow: Ẋ_t = −∇f(X_t) / ‖∇f(X_t)‖^{(p−2)/(p−1)}, rate O(1/t^{p−1})
Higher-order gradient method: O(1/(ɛ k^{p−1})) when ∇^{p−1} f is (1/ɛ)-Lipschitz; matching rate with ɛ = δ^{p−1}, i.e., δ = ɛ^{1/(p−1)}
Polynomial Euler-Lagrange equation: Ẍ_t + ((p+1)/t) Ẋ_t + C p² t^{p−2} [∇²h(X_t + (t/p) Ẋ_t)]^{−1} ∇f(X_t) = 0, rate O(1/t^p)
Accelerated higher-order method: O(1/(ɛ k^p)) when ∇^{p−1} f is (1/ɛ)-Lipschitz; matching rate with ɛ = δ^p, i.e., δ = ɛ^{1/p}

79 Summary: Bregman Lagrangian
Bregman Lagrangian family with a general convergence guarantee:
L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) )
The polynomial subfamily generates accelerated higher-order methods: O(1/t^p) convergence rate via higher-order smoothness
Exponential subfamily: O(e^{−ct}) rate via uniform convexity
Understand the structure and properties of the Bregman Lagrangian: gauge invariance, symmetry, gradient flows as limit points, etc.
Bregman Hamiltonian:
H(x, p, t) = e^{α_t + γ_t} ( D_{h*}(∇h(x) + e^{−γ_t} p, ∇h(x)) + e^{β_t} f(x) )

80 Discussion
Many conceptual and mathematical challenges arise in taking seriously the problem of Big Data
Facing these challenges will require a rapprochement between computational thinking and inferential thinking: bringing computational and inferential fields together at the level of their foundations


More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

A Second-Order Path-Following Algorithm for Unconstrained Convex Optimization

A Second-Order Path-Following Algorithm for Unconstrained Convex Optimization A Second-Order Path-Following Algorithm for Unconstrained Convex Optimization Yinyu Ye Department is Management Science & Engineering and Institute of Computational & Mathematical Engineering Stanford

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

ADMM and Fast Gradient Methods for Distributed Optimization

ADMM and Fast Gradient Methods for Distributed Optimization ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization

This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization x = prox_f(x)+prox_{f^*}(x) use to get prox of norms! PROXIMAL METHODS WHY PROXIMAL METHODS Smooth

More information

Large Deviations for Small-Noise Stochastic Differential Equations

Large Deviations for Small-Noise Stochastic Differential Equations Chapter 22 Large Deviations for Small-Noise Stochastic Differential Equations This lecture is at once the end of our main consideration of diffusions and stochastic calculus, and a first taste of large

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

Large Deviations for Small-Noise Stochastic Differential Equations

Large Deviations for Small-Noise Stochastic Differential Equations Chapter 21 Large Deviations for Small-Noise Stochastic Differential Equations This lecture is at once the end of our main consideration of diffusions and stochastic calculus, and a first taste of large

More information

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization References --- a tentative list of papers to be mentioned in the ICML 2017 tutorial Recent Advances in Stochastic Convex and Non-Convex Optimization Disclaimer: in a quite arbitrary order. 1. [ShalevShwartz-Zhang,

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):

More information

An Optimal Affine Invariant Smooth Minimization Algorithm.

An Optimal Affine Invariant Smooth Minimization Algorithm. An Optimal Affine Invariant Smooth Minimization Algorithm. Alexandre d Aspremont, CNRS & École Polytechnique. Joint work with Martin Jaggi. Support from ERC SIPA. A. d Aspremont IWSL, Moscow, June 2013,

More information

Duality in Linear Programs. Lecturer: Ryan Tibshirani Convex Optimization /36-725

Duality in Linear Programs. Lecturer: Ryan Tibshirani Convex Optimization /36-725 Duality in Linear Programs Lecturer: Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: proximal gradient descent Consider the problem x g(x) + h(x) with g, h convex, g differentiable, and

More information

Unconstrained minimization

Unconstrained minimization CSCI5254: Convex Optimization & Its Applications Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions 1 Unconstrained

More information

Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

More information

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Jialin Dong ShanghaiTech University 1 Outline Introduction FourVignettes: System Model and Problem Formulation Problem Analysis

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

Randomized Smoothing for Stochastic Optimization

Randomized Smoothing for Stochastic Optimization Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

Adaptive Online Gradient Descent

Adaptive Online Gradient Descent University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

GRADIENT = STEEPEST DESCENT

GRADIENT = STEEPEST DESCENT GRADIENT METHODS GRADIENT = STEEPEST DESCENT Convex Function Iso-contours gradient 0.5 0.4 4 2 0 8 0.3 0.2 0. 0 0. negative gradient 6 0.2 4 0.3 2.5 0.5 0 0.5 0.5 0 0.5 0.4 0.5.5 0.5 0 0.5 GRADIENT DESCENT

More information

Fast Algorithms for Segmented Regression

Fast Algorithms for Segmented Regression Fast Algorithms for Segmented Regression Jayadev Acharya 1 Ilias Diakonikolas 2 Jerry Li 1 Ludwig Schmidt 1 1 MIT 2 USC June 21, 2016 1 / 21 Statistical vs computational tradeoffs? General Motivating Question

More information

ADMM and Accelerated ADMM as Continuous Dynamical Systems

ADMM and Accelerated ADMM as Continuous Dynamical Systems Guilherme França 1 Daniel P. Robinson 1 René Vidal 1 Abstract Recently, there has been an increasing interest in using tools from dynamical systems to analyze the behavior of simple optimization algorithms

More information

CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi. Gradient Descent

CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi. Gradient Descent CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi Gradient Descent This lecture introduces Gradient Descent a meta-algorithm for unconstrained minimization. Under convexity of the

More information

Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra

Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra Michael W. Mahoney ICSI and Dept of Statistics, UC Berkeley ( For more info,

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Answering Many Queries with Differential Privacy

Answering Many Queries with Differential Privacy 6.889 New Developments in Cryptography May 6, 2011 Answering Many Queries with Differential Privacy Instructors: Shafi Goldwasser, Yael Kalai, Leo Reyzin, Boaz Barak, and Salil Vadhan Lecturer: Jonathan

More information

Lecture 14: Newton s Method

Lecture 14: Newton s Method 10-725/36-725: Conve Optimization Fall 2016 Lecturer: Javier Pena Lecture 14: Newton s ethod Scribes: Varun Joshi, Xuan Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

More information

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. 1. Prediction with expert advice. 2. With perfect

More information

How hard is this function to optimize?

How hard is this function to optimize? How hard is this function to optimize? John Duchi Based on joint work with Sabyasachi Chatterjee, John Lafferty, Yuancheng Zhu Stanford University West Coast Optimization Rumble October 2016 Problem minimize

More information

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il

More information

Generalized Conditional Gradient and Its Applications

Generalized Conditional Gradient and Its Applications Generalized Conditional Gradient and Its Applications Yaoliang Yu University of Alberta UBC Kelowna, 04/18/13 Y-L. Yu (UofA) GCG and Its Apps. UBC Kelowna, 04/18/13 1 / 25 1 Introduction 2 Generalized

More information

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

More information

Logistic Regression Trained with Different Loss Functions. Discussion

Logistic Regression Trained with Different Loss Functions. Discussion Logistic Regression Trained with Different Loss Functions Discussion CS640 Notations We restrict our discussions to the binary case. g(z) = g (z) = g(z) z h w (x) = g(wx) = + e z = g(z)( g(z)) + e wx =

More information

arxiv: v1 [cs.lg] 27 May 2014

arxiv: v1 [cs.lg] 27 May 2014 Private Empirical Risk Minimization, Revisited Raef Bassily Adam Smith Abhradeep Thakurta May 29, 2014 arxiv:1405.7085v1 [cs.lg] 27 May 2014 Abstract In this paper, we initiate a systematic investigation

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

Accelerated gradient methods

Accelerated gradient methods ELE 538B: Large-Scale Optimization for Data Science Accelerated gradient methods Yuxin Chen Princeton University, Spring 018 Outline Heavy-ball methods Nesterov s accelerated gradient methods Accelerated

More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,

More information

Information-theoretic foundations of differential privacy

Information-theoretic foundations of differential privacy Information-theoretic foundations of differential privacy Darakhshan J. Mir Rutgers University, Piscataway NJ 08854, USA, mir@cs.rutgers.edu Abstract. We examine the information-theoretic foundations of

More information

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation Zhouchen Lin Peking University April 22, 2018 Too Many Opt. Problems! Too Many Opt. Algorithms! Zero-th order algorithms:

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

arxiv: v1 [math.oc] 20 Jul 2018

arxiv: v1 [math.oc] 20 Jul 2018 Continuous-Time Accelerated Methods via a Hybrid Control Lens ARMAN SHARIFI KOLARIJANI, PEYMAN MOHAJERIN ESFAHANI, TAMÁS KEVICZKY arxiv:1807.07805v1 [math.oc] 20 Jul 2018 Abstract. Treating optimization

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions

More information

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College Distributed Estimation, Information Loss and Exponential Families Qiang Liu Department of Computer Science Dartmouth College Statistical Learning / Estimation Learning generative models from data Topic

More information

Converse Lyapunov theorem and Input-to-State Stability

Converse Lyapunov theorem and Input-to-State Stability Converse Lyapunov theorem and Input-to-State Stability April 6, 2014 1 Converse Lyapunov theorem In the previous lecture, we have discussed few examples of nonlinear control systems and stability concepts

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

Integration Methods and Accelerated Optimization Algorithms

Integration Methods and Accelerated Optimization Algorithms Integration Methods and Accelerated Optimization Algorithms Damien Scieur Vincent Roulet Francis Bach Alexandre d Aspremont INRIA - Sierra Project-team Département d Informatique de l Ecole Normale Supérieure

More information