Big Data Analytics: Optimization and Randomization


1 Big Data Analytics: Optimization and Randomization. Tianbao Yang, Department of Computer Science, The University of Iowa, IA, USA. ACML 2015, Hong Kong, Nov. 20, 2015.

2 URL: tyng/acml15-tutorial.pdf

3 Some Claims: This tutorial is not an exhaustive literature survey, and it is not a survey of different machine learning algorithms. It is about how to efficiently solve machine learning problems, formulated as optimization, for big data.

4 Outline: Part I: Basics; Part II: Optimization; Part III: Randomization

5 Big Data Analytics: Optimization and Randomization. Part I: Basics

6 Basics Introduction Outline: 1 Basics: Introduction; Notations and Definitions

7 Basics Introduction Three Steps for Machine Learning: Data, Model, Optimization. [Figure: distance to optimal objective vs. iterations for different convergence rates]

8 Basics Introduction Big Data Challenge: Big Data

9 Basics Introduction Big Data Challenge: Big Model (60 million parameters)

10 Basics Introduction Learning as Optimization. Ridge Regression Problem:
$$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2}_{\text{Empirical Loss}} + \underbrace{\frac{\lambda}{2}\|w\|_2^2}_{\text{Regularization}}$$
$x_i \in \mathbb{R}^d$: $d$-dimensional feature vector; $y_i \in \mathbb{R}$: target variable; $w \in \mathbb{R}^d$: model parameters; $n$: number of data points.
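To make the notation concrete, here is a minimal NumPy sketch (not from the tutorial) that evaluates the ridge objective and its gradient on synthetic data; the data and the regularization value are assumptions for illustration.

```python
# A minimal sketch of the ridge regression objective and gradient.
import numpy as np

def ridge_objective(w, X, y, lam):
    """(1/n) * sum_i (y_i - w^T x_i)^2 + (lam/2) * ||w||_2^2"""
    n = X.shape[0]
    r = y - X @ w
    return r @ r / n + 0.5 * lam * w @ w

def ridge_gradient(w, X, y, lam):
    """Gradient of the objective above: -(2/n) * X^T (y - Xw) + lam * w"""
    n = X.shape[0]
    return -2.0 / n * X.T @ (y - X @ w) + lam * w

# Tiny usage example with random data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)
print(ridge_objective(np.zeros(5), X, y, lam=0.1))
```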

13 Basics Introduction Learning as Optimization. Classification Problems:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$$
$y_i \in \{+1, -1\}$: label. Loss function $\ell(z)$ with $z = y w^\top x$:
1. SVMs: (squared) hinge loss $\ell(z) = \max(0, 1 - z)^p$, where $p = 1, 2$
2. Logistic Regression: $\ell(z) = \log(1 + \exp(-z))$

14 Basics Introduction Learning as Optimization. Feature Selection:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda \|w\|_1$$
$\ell_1$ regularization: $\|w\|_1 = \sum_{i=1}^d |w_i|$; $\lambda$ controls the sparsity level.

15 Basics Introduction Learning as Optimization. Feature Selection using Elastic Net:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda\left(\|w\|_1 + \gamma \|w\|_2^2\right)$$
The elastic net regularizer is more robust than the $\ell_1$ regularizer.

16 Basics Introduction Learning as Optimization. Multi-class/Multi-task Learning: $W \in \mathbb{R}^{K \times d}$
$$\min_{W} \frac{1}{n}\sum_{i=1}^n \ell(W x_i, y_i) + \lambda r(W)$$
$r(W) = \|W\|_F^2 = \sum_{k=1}^K \sum_{j=1}^d W_{kj}^2$: Frobenius norm
$r(W) = \|W\|_* = \sum_i \sigma_i$: nuclear norm (sum of singular values)
$r(W) = \|W\|_{1,\infty} = \sum_{j=1}^d \|W_{:j}\|_\infty$: $\ell_{1,\infty}$ mixed norm

17 Basics Introduction Learning as Optimization. Regularized Empirical Loss Minimization:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$$
Both $\ell$ and $R$ are convex functions. Extensions to matrix cases are possible (sometimes straightforward). Extensions to kernel methods can be combined with randomized approaches. Extensions to non-convex problems (e.g., deep learning) are in progress.

18 Basics Introduction Data Matrices and Machine Learning. The instance-feature matrix: $X \in \mathbb{R}^{n \times d}$, with rows $x_1^\top, x_2^\top, \ldots, x_n^\top$.

19 Basics Introduction Data Matrices and Machine Learning. The output vector: $y = (y_1, y_2, \ldots, y_n)^\top \in \mathbb{R}^{n}$. Continuous $y_i \in \mathbb{R}$: regression (e.g., house price). Discrete, e.g., $y_i \in \{1, 2, 3\}$: classification (e.g., species of iris).

20 Basics Introduction Data Matrices and Machine Learning. The instance-instance matrix: $K \in \mathbb{R}^{n \times n}$ (similarity matrix, kernel matrix).

21 Basics Introduction Data Matrices and Machine Learning. Some machine learning tasks are formulated on the kernel matrix: clustering, kernel methods.

22 Basics Introduction Data Matrices and Machine Learning. The feature-feature matrix: $C \in \mathbb{R}^{d \times d}$ (covariance matrix, distance metric matrix).

23 Basics Introduction Data Matrices and Machine Learning. Some machine learning tasks require the covariance matrix: Principal Component Analysis, i.e., the top-$k$ singular value (eigenvalue) decomposition of the covariance matrix.

24 Basics Introduction Why is Learning from Big Data Challenging? High per-iteration cost; high memory cost; high communication cost; large iteration complexity.

25 Basics Notations and Definitions Outline: 1 Basics: Introduction; Notations and Definitions

26 Basics Notations and Definitions Norms. Vector $x \in \mathbb{R}^d$:
Euclidean vector norm: $\|x\|_2 = \sqrt{x^\top x} = \sqrt{\sum_{i=1}^d x_i^2}$
$\ell_p$-norm of a vector: $\|x\|_p = \left(\sum_{i=1}^d |x_i|^p\right)^{1/p}$, where $p \geq 1$
1. $\ell_2$ norm: $\|x\|_2 = \sqrt{\sum_{i=1}^d x_i^2}$
2. $\ell_1$ norm: $\|x\|_1 = \sum_{i=1}^d |x_i|$
3. $\ell_\infty$ norm: $\|x\|_\infty = \max_i |x_i|$

29 Basics Notations and Definitions Matrix Factorization. Matrix $X \in \mathbb{R}^{n \times d}$. Singular Value Decomposition: $X = U \Sigma V^\top$
1. $U \in \mathbb{R}^{n \times r}$: orthonormal columns ($U^\top U = I$); spans the column space
2. $\Sigma \in \mathbb{R}^{r \times r}$: diagonal matrix, $\Sigma_{ii} = \sigma_i > 0$, $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r$
3. $V \in \mathbb{R}^{d \times r}$: orthonormal columns ($V^\top V = I$); spans the row space
4. $r \leq \min(n, d)$: the largest value such that $\sigma_r > 0$, i.e., the rank of $X$
5. $U_k \Sigma_k V_k^\top$: top-$k$ approximation
Pseudo-inverse: $X^\dagger = V \Sigma^{-1} U^\top$
QR factorization: $X = QR$ ($n \geq d$); $Q \in \mathbb{R}^{n \times d}$: orthonormal columns; $R \in \mathbb{R}^{d \times d}$: upper triangular matrix

32 Basics Notations and Definitions Norms. Matrix $X \in \mathbb{R}^{n \times d}$:
Frobenius norm: $\|X\|_F = \sqrt{\mathrm{tr}(X^\top X)} = \sqrt{\sum_{i=1}^n \sum_{j=1}^d X_{ij}^2}$
Spectral (induced) norm: $\|X\|_2 = \max_{\|u\|_2 = 1} \|X u\|_2 = \sigma_1$ (maximum singular value)

34 Basics Notations and Definitions Convex Optimization: $\min_{x \in \mathcal{X}} f(x)$. $\mathcal{X}$ is a convex domain: for any $x, y \in \mathcal{X}$, their convex combination $\alpha x + (1-\alpha) y \in \mathcal{X}$. $f(x)$ is a convex function.

35 Basics Notations and Definitions Convex Function. Characterization of a convex function:
$f(\alpha x + (1-\alpha) y) \leq \alpha f(x) + (1-\alpha) f(y)$, for all $x, y \in \mathcal{X}$, $\alpha \in [0, 1]$
$f(x) \geq f(y) + \nabla f(y)^\top (x - y)$, for all $x, y \in \mathcal{X}$
A local optimum is a global optimum.

37 Basics Notations and Definitions Convex vs Strongly Convex.
Convex function: $f(x) \geq f(y) + \nabla f(y)^\top (x - y)$, for all $x, y \in \mathcal{X}$
Strongly convex function (with strong convexity constant $\lambda$): $f(x) \geq f(y) + \nabla f(y)^\top (x - y) + \frac{\lambda}{2}\|x - y\|_2^2$, for all $x, y \in \mathcal{X}$
The global optimum is unique. E.g., $\frac{\lambda}{2}\|w\|_2^2$ is $\lambda$-strongly convex.

39 Basics Notations and Definitions Non-smooth function vs Smooth function.
Non-smooth function: Lipschitz continuous with constant $G$: $|f(x) - f(y)| \leq G\|x - y\|_2$; e.g., absolute loss $f(x) = |x|$. Subgradient: $f(x) \geq f(y) + \partial f(y)^\top (x - y)$.
Smooth function: gradient Lipschitz with smoothness constant $L$: $\|\nabla f(x) - \nabla f(y)\|_2 \leq L\|x - y\|_2$; e.g., logistic loss $f(x) = \log(1 + \exp(-x))$.
[Figures: a non-smooth function with a subgradient at a kink; a smooth function bounded above by a quadratic and below by its tangent $f(y) + f'(y)(x - y)$]

42 Basics Notations and Definitions Next... Part II: Optimization
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$$
Stochastic optimization and distributed optimization. Reduce the iteration complexity by utilizing properties of the functions and the structure of the problem.

43 Basics Notations and Definitions Next... Part III: Randomization. Classification, regression; SVD, k-means, kernel methods. Reduce the data size by utilizing properties of the data. Please stay tuned!

44 Optimization Big Data Analytics: Optimization and Randomization. Part II: Optimization

45 Optimization (Sub)Gradient Methods Outline: 2 Optimization: (Sub)Gradient Methods; Stochastic Optimization Algorithms for Big Data (Stochastic Optimization, Distributed Optimization)

46 Optimization (Sub)Gradient Methods Learning as Optimization. Regularized Empirical Loss Minimization:
$$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)}_{F(w)}$$

47 Optimization (Sub)Gradient Methods Convergence Measure. Most optimization algorithms are iterative: $w_{t+1} = w_t + \Delta w_t$.
Iteration Complexity: the number of iterations $T(\epsilon)$ needed to have $F(\hat{w}_T) - \min_w F(w) \leq \epsilon$ ($\epsilon \ll 1$).
Convergence Rate: after $T$ iterations, how good is the solution: $F(\hat{w}_T) - \min_w F(w) \leq \epsilon(T)$.
Total Runtime = Per-iteration Cost × Iteration Complexity.
[Figure: objective gap decreasing to $\epsilon$ after $T$ iterations]

51 Optimization (Sub)Gradient Methods More on Convergence Measure. Big $O(\cdot)$ notation: explicit dependence on $T$ or $\epsilon$.
linear: convergence rate $O(\mu^T)$ ($\mu < 1$), iteration complexity $O(\log(1/\epsilon))$
sub-linear: convergence rate $O(1/T^\alpha)$ ($\alpha > 0$), iteration complexity $O(1/\epsilon^{1/\alpha})$
Why are we interested in bounds?
[Figure: distance to optimum vs. iterations $T$ for rates $0.5^T$ (seconds), $1/T$ (minutes), $1/\sqrt{T}$ (hours)]
Theoretically, we consider $\log(1/\epsilon) \ll 1/\epsilon \ll 1/\epsilon^2$.

57 Optimization (Sub)Gradient Methods Non-smooth vs. Smooth losses $\ell(z)$.
Smooth: squared hinge loss $\ell(w^\top x, y) = \max(0, 1 - y w^\top x)^2$; logistic loss $\ell(w^\top x, y) = \log(1 + \exp(-y w^\top x))$; square loss $\ell(w^\top x, y) = (w^\top x - y)^2$.
Non-smooth: hinge loss $\ell(w^\top x, y) = \max(0, 1 - y w^\top x)$; absolute loss $\ell(w^\top x, y) = |w^\top x - y|$.

58 Optimization (Sub)Gradient Methods Strongly convex vs. Non-strongly convex regularizers $R(w)$.
$\lambda$-strongly convex: $\ell_2$ regularizer $\frac{\lambda}{2}\|w\|_2^2$; elastic net regularizer $\tau\|w\|_1 + \frac{\lambda}{2}\|w\|_2^2$.
Non-strongly convex: unregularized problem $R(w) \equiv 0$; $\ell_1$ regularizer $\tau\|w\|_1$.

59 Optimization (Sub)Gradient Methods Gradient Method in Machine Learning.
$$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
Suppose $\ell(z, y)$ is smooth. Full gradient: $\nabla F(w) = \frac{1}{n}\sum_{i=1}^n \nabla\ell(w^\top x_i, y_i) + \lambda w$. Per-iteration cost: $O(nd)$.
Gradient Descent: $w_t = w_{t-1} - \gamma_t \nabla F(w_{t-1})$, with constant step size $\gamma_t$, e.g., $1/L$.

61 Optimization (Sub)Gradient Methods Gradient Method.
$$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \underbrace{\frac{\lambda}{2}\|w\|_2^2}_{R(w)}$$
If $\lambda = 0$: $R(w)$ is non-strongly convex; iteration complexity $O(1/\epsilon)$.
If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex; iteration complexity $O(\frac{1}{\lambda}\log(1/\epsilon))$.
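A minimal sketch of the full gradient descent update above, applied to the ridge objective; the step size $1/L$ with $L$ estimated from the data is an assumption for illustration, not the tutorial's code.

```python
# Full gradient descent on the ridge objective with constant step size 1/L.
import numpy as np

def gradient_descent(X, y, lam, T=200):
    n, d = X.shape
    w = np.zeros(d)
    # A simple smoothness bound for (1/n)||Xw - y||^2 + (lam/2)||w||^2.
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n + lam
    gamma = 1.0 / L
    for _ in range(T):
        grad = -2.0 / n * X.T @ (y - X @ w) + lam * w   # full gradient, O(nd)
        w = w - gamma * grad
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10)
w_hat = gradient_descent(X, y, lam=0.1)
```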

62 Optimization (Sub)Gradient Methods Accelerated Full Gradient (AFG). Nesterov's Accelerated Gradient Descent:
$w_t = v_{t-1} - \gamma_t \nabla F(v_{t-1})$
$v_t = w_t + \eta_t (w_t - w_{t-1})$ (momentum step)
$w_t$ is the output and $v_t$ is an auxiliary sequence.
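A minimal sketch of Nesterov's accelerated scheme above for the ridge objective; the momentum parameter $\eta_t = (t-1)/(t+2)$ is one common choice and an assumption here, not necessarily the schedule used in the tutorial.

```python
# Nesterov's accelerated gradient (AFG) on the ridge objective.
import numpy as np

def afg(X, y, lam, T=200):
    n, d = X.shape
    w_prev, w, v = np.zeros(d), np.zeros(d), np.zeros(d)
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n + lam
    gamma = 1.0 / L
    for t in range(1, T + 1):
        grad_v = -2.0 / n * X.T @ (y - X @ v) + lam * v   # gradient at v_{t-1}
        w_prev, w = w, v - gamma * grad_v                 # w_t = v_{t-1} - gamma * grad
        v = w + (t - 1) / (t + 2) * (w - w_prev)          # momentum step
    return w
```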

64 Optimization (Sub)Gradient Methods Accelerated Full Gradient (AFG).
$$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
If $\lambda = 0$: $R(w)$ is non-strongly convex; iteration complexity $O(1/\sqrt{\epsilon})$, better than $O(1/\epsilon)$.
If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex; iteration complexity $O(\frac{1}{\sqrt{\lambda}}\log(1/\epsilon))$, better than $O(\frac{1}{\lambda}\log(1/\epsilon))$ for small $\lambda$.

65 Optimization (Sub)Gradient Methods Deal with a non-smooth regularizer. Consider $\ell_1$ norm regularization:
$$\min_{w \in \mathbb{R}^d} F(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i)}_{f(w)} + \underbrace{\tau\|w\|_1}_{R(w)}$$
$f(w)$: smooth; $R(w)$: non-smooth and non-strongly convex.

66 Optimization (Sub)Gradient Methods Accelerated Proximal Gradient (APG). Accelerated Proximal Gradient Descent:
$w_t = \arg\min_{w \in \mathbb{R}^d} \nabla f(v_{t-1})^\top w + \frac{1}{2\gamma_t}\|w - v_{t-1}\|_2^2 + \tau\|w\|_1$ (proximal mapping)
$v_t = w_t + \eta_t (w_t - w_{t-1})$
The proximal mapping has a closed-form solution: soft-thresholding. The iteration complexity and runtime remain the same as for the smooth, non-strongly convex case, i.e., $O(1/\sqrt{\epsilon})$.
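A minimal sketch of the soft-thresholding proximal step and an APG loop for $\ell_1$-regularized least squares; the momentum schedule and synthetic inputs are assumptions for illustration.

```python
# Soft-thresholding prox step plus an accelerated proximal gradient loop.
import numpy as np

def soft_threshold(z, kappa):
    """Closed-form proximal mapping of kappa * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - kappa, 0.0)

def apg_lasso(X, y, tau, T=300):
    n, d = X.shape
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n   # smoothness of f(w) = (1/n)||Xw - y||^2
    gamma = 1.0 / L
    w_prev, w, v = np.zeros(d), np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        grad_v = -2.0 / n * X.T @ (y - X @ v)
        w_prev, w = w, soft_threshold(v - gamma * grad_v, gamma * tau)  # prox step
        v = w + (t - 1) / (t + 2) * (w - w_prev)                        # momentum
    return w
```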

68 Optimization (Sub)Gradient Methods Deal with a non-smooth but strongly convex regularizer. Consider elastic net regularization:
$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \underbrace{\frac{\lambda}{2}\|w\|_2^2 + \tau\|w\|_1}_{R(w)}$$
$R(w)$: non-smooth but strongly convex. Equivalently,
$$\min_{w \in \mathbb{R}^d} F(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2}_{f(w)} + \underbrace{\tau\|w\|_1}_{R'(w)}$$
$f(w)$: smooth and strongly convex; $R'(w)$: non-smooth and non-strongly convex. Iteration complexity: $O(\frac{1}{\sqrt{\lambda}}\log(1/\epsilon))$.

69 Optimization (Sub)Gradient Methods Sub-Gradient Method in Machine Learning.
$$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
Suppose $\ell(z, y)$ is non-smooth. Full sub-gradient: $\partial F(w) = \frac{1}{n}\sum_{i=1}^n \partial\ell(w^\top x_i, y_i) + \lambda w$.
Sub-Gradient Descent: $w_t = w_{t-1} - \gamma_t \partial F(w_{t-1})$, with step size $\gamma_t \to 0$.

71 Optimization (Sub)Gradient Methods Sub-Gradient Method.
$$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
If $\lambda = 0$: $R(w)$ is non-strongly convex (generalizes to the $\ell_1$ norm and other non-strongly convex regularizers); iteration complexity $O(1/\epsilon^2)$.
If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex (generalizes to elastic net and other strongly convex regularizers); iteration complexity $O(\frac{1}{\lambda\epsilon})$.
No efficient acceleration scheme in general.

72 Optimization (Sub)Gradient Methods Problem Classes and Iteration Complexity.
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$$
Iteration complexity (rows: $R(w)$; columns: loss $\ell(z, y)$):
Non-strongly convex: non-smooth $O(1/\epsilon^2)$, smooth $O(1/\sqrt{\epsilon})$
$\lambda$-strongly convex: non-smooth $O(\frac{1}{\lambda\epsilon})$, smooth $O(\frac{1}{\sqrt{\lambda}}\log(1/\epsilon))$
Per-iteration cost: $O(nd)$, too high if $n$ or $d$ is large.

73 Optimization Stochastic Optimization Algorithms for Big Data Outline: 2 Optimization: (Sub)Gradient Methods; Stochastic Optimization Algorithms for Big Data (Stochastic Optimization, Distributed Optimization)

74 Optimization Stochastic Optimization Algorithms for Big Data Stochastic First-Order Methods by Sampling.
Randomly sample an example: 1 Stochastic Gradient Descent (SGD); 2 Stochastic Variance Reduced Gradient (SVRG); 3 Stochastic Average Gradient Algorithm (SAGA); 4 Stochastic Dual Coordinate Ascent (SDCA).
Randomly sample a feature: 1 Randomized Coordinate Descent (RCD); 2 Accelerated Proximal Coordinate Gradient (APCG).

75 Optimization Stochastic Optimization Algorithms for Big Data Basic SGD (Nemirovski & Yudin (1978)).
$$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
Full sub-gradient: $\partial F(w) = \frac{1}{n}\sum_{i=1}^n \partial\ell(w^\top x_i, y_i) + \lambda w$.
Randomly sample $i \in \{1, \ldots, n\}$. Stochastic sub-gradient: $\partial\ell(w^\top x_i, y_i) + \lambda w$, with $E_i[\partial\ell(w^\top x_i, y_i) + \lambda w] = \partial F(w)$.

76 Optimization Stochastic Optimization Algorithms for Big Data Basic SGD (Nemirovski & Yudin (1978)). Applicable in all settings!
$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
sample: $i_t \in \{1, \ldots, n\}$
update: $w_t = w_{t-1} - \gamma_t \left(\partial\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}\right)$
output: $\hat{w}_T = \frac{1}{T}\sum_{t=1}^T w_t$
step size: $\gamma_t \to 0$
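A minimal sketch of the basic SGD loop above for the $\ell_2$-regularized hinge loss (SVM); the $O(1/(\lambda t))$ step size is a standard choice for the strongly convex case and is an assumption here, not the tutorial's exact schedule.

```python
# Basic SGD for the l2-regularized hinge loss with averaged iterates.
import numpy as np

def sgd_svm(X, y, lam, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        margin = y[i] * (w @ X[i])
        g = (-y[i] * X[i] if margin < 1 else np.zeros(d)) + lam * w  # stochastic subgradient
        w -= (1.0 / (lam * t)) * g
        w_avg += (w - w_avg) / t          # running average of iterates
    return w_avg
```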

78 Optimization Stochastic Optimization Algorithms for Big Data Basic SGD (Nemirovski & Yudin (1978)).
If $\lambda = 0$: $R(w)$ is non-strongly convex (generalizes to the $\ell_1$ norm and other non-strongly convex regularizers); iteration complexity $O(1/\epsilon^2)$.
If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex (generalizes to elastic net and other strongly convex regularizers); iteration complexity $O(\frac{1}{\lambda\epsilon})$.
Exactly the same as sub-gradient descent!

79 Optimization Stochastic Optimization Algorithms for Big Data Total Runtime. Per-iteration cost: $O(d)$, much lower than the full gradient method. E.g., hinge loss (SVM) stochastic gradient: $\partial\ell(w^\top x_{i_t}, y_{i_t}) = -y_{i_t} x_{i_t}$ if $1 - y_{i_t} w^\top x_{i_t} > 0$, and $0$ otherwise.

80 Optimization Stochastic Optimization Algorithms for Big Data Total Runtime. Iteration complexity of SGD:
Non-strongly convex: $O(1/\epsilon^2)$ (both non-smooth and smooth losses)
$\lambda$-strongly convex: $O(\frac{1}{\lambda\epsilon})$ (both non-smooth and smooth losses)
For SGD, only strong convexity helps; smoothness does not make any difference! The reason: the step size has to decrease because the stochastic gradient does not approach 0.

81 Optimization Stochastic Optimization Algorithms for Big Data Variance Reduction: Stochastic Variance Reduced Gradient (SVRG); Stochastic Average Gradient Algorithm (SAGA); Stochastic Dual Coordinate Ascent (SDCA).

82 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014).
$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
Applicable when $\ell(z)$ is smooth and $R(w)$ is $\lambda$-strongly convex.
Stochastic gradient: $g_{i_t}(w) = \nabla\ell(w^\top x_{i_t}, y_{i_t}) + \lambda w$; $E_{i_t}[g_{i_t}(w)] = \nabla F(w)$, but $\mathrm{Var}[g_{i_t}(w)] \not\to 0$ even if $w = w_*$.

84 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014). Compute the full gradient at a reference point $\tilde{w}$: $\nabla F(\tilde{w}) = \frac{1}{n}\sum_{i=1}^n g_i(\tilde{w})$.
Stochastic variance reduced gradient: $\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}) + \nabla F(\tilde{w})$.
$E_{i_t}[\tilde{g}_{i_t}(w)] = \nabla F(w)$ and $\mathrm{Var}[\tilde{g}_{i_t}(w)] \to 0$ as $w, \tilde{w} \to w_*$.

85 Optimization Stochastic Optimization Algorithms for Big Data Variance Reduction (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014). At the optimal solution $w_*$: $\nabla F(w_*) = 0$. This does not mean $g_{i_t}(w) \to 0$ as $w \to w_*$. However, $\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}) + \nabla F(\tilde{w}) \to 0$ as $w, \tilde{w} \to w_*$.

86 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014).
Iterate $s = 1, \ldots, T$:
  let $w_0 = \tilde{w}_s$ and compute $\nabla F(\tilde{w}_s)$
  iterate $t = 1, \ldots, m$:
    $\tilde{g}_{i_t}(w_{t-1}) = \nabla F(\tilde{w}_s) - g_{i_t}(\tilde{w}_s) + g_{i_t}(w_{t-1})$
    $w_t = w_{t-1} - \gamma_t \tilde{g}_{i_t}(w_{t-1})$
  $\tilde{w}_{s+1} = \frac{1}{m}\sum_{t=1}^m w_t$
Output: $\tilde{w}_T$. Here $m = O(1/\lambda)$ and $\gamma_t$ is constant.
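A minimal sketch of the SVRG loop above for $\ell_2$-regularized logistic regression; the constant step size and the inner-loop length $m = 2n$ are assumptions for illustration.

```python
# SVRG for l2-regularized logistic regression.
import numpy as np

def svrg_logistic(X, y, lam, gamma=0.1, m=None, S=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or 2 * n

    def grad_i(w, i):                       # gradient of one loss term + regularizer
        z = y[i] * (X[i] @ w)
        return -y[i] * X[i] / (1.0 + np.exp(z)) + lam * w

    w_tilde = np.zeros(d)
    for s in range(S):
        full_grad = np.mean([grad_i(w_tilde, i) for i in range(n)], axis=0)  # reference gradient
        w = w_tilde.copy()
        iterates = np.zeros((m, d))
        for t in range(m):
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_tilde, i) + full_grad   # variance-reduced gradient
            w -= gamma * g
            iterates[t] = w
        w_tilde = iterates.mean(axis=0)      # average of the inner iterates
    return w_tilde
```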

87 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014). Per-iteration cost: $O(d)$.
Iteration complexity: smooth loss + $\lambda$-strongly convex $R$: $O((n + \frac{1}{\lambda})\log(1/\epsilon))$; non-smooth loss: N.A.; non-strongly convex $R$: N.A. (a small trick can fix this).
Total Runtime: $O(d(n + \frac{1}{\lambda})\log(1/\epsilon))$, better than AFG's $O(\frac{nd}{\sqrt{\lambda}}\log(1/\epsilon))$.
Use the proximal mapping for the elastic net regularizer.

89 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)).
$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
A new version of SAG (Roux et al. (2012)). Applicable when $\ell(z)$ is smooth. Strong convexity is not necessary.

90 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)). SAGA also reduces the variance of the stochastic gradient, but with a different technique.
SVRG uses gradients at the same reference point $\tilde{w}$: $\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}) + \nabla F(\tilde{w})$, where $\nabla F(\tilde{w}) = \frac{1}{n}\sum_{i=1}^n g_i(\tilde{w})$.
SAGA uses gradients at different points $\{\tilde{w}_1, \tilde{w}_2, \ldots, \tilde{w}_n\}$: $\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}_{i_t}) + \bar{G}$, where $\bar{G} = \frac{1}{n}\sum_{i=1}^n g_i(\tilde{w}_i)$.

91 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)).
Initialize the average gradient: $\bar{G}_0 = \frac{1}{n}\sum_{i=1}^n \tilde{g}_i$, with $\tilde{g}_i = \nabla\ell(w_0^\top x_i, y_i) + \lambda w_0$.
Average gradient: $\bar{G}_{t-1} = \frac{1}{n}\sum_{i=1}^n \tilde{g}_i$.
Stochastic variance reduced gradient: $\tilde{g}_{i_t}(w_{t-1}) = \left(\nabla\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}\right) - \tilde{g}_{i_t} + \bar{G}_{t-1}$
$w_t = w_{t-1} - \gamma_t \tilde{g}_{i_t}(w_{t-1})$
Update the selected component of the average gradient: $\bar{G}_t = \frac{1}{n}\sum_{i=1}^n \tilde{g}_i$, with $\tilde{g}_{i_t} \leftarrow \nabla\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}$.
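A minimal sketch of the SAGA updates above for $\ell_2$-regularized logistic regression, keeping a table of stored per-example gradients and their running average; the step size is an assumption.

```python
# SAGA with a stored gradient table and an O(d) update of the average gradient.
import numpy as np

def saga_logistic(X, y, lam, gamma=0.05, T=50000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)

    def grad_i(w, i):
        z = y[i] * (X[i] @ w)
        return -y[i] * X[i] / (1.0 + np.exp(z)) + lam * w

    stored = np.array([grad_i(w, i) for i in range(n)])  # g_i at the initial point
    G = stored.mean(axis=0)                              # average gradient
    for _ in range(T):
        i = rng.integers(n)
        g_new = grad_i(w, i)
        w -= gamma * (g_new - stored[i] + G)             # variance-reduced step
        G += (g_new - stored[i]) / n                     # O(d) update of the average
        stored[i] = g_new
    return w
```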

95 Optimization Stochastic Optimization Algorithms for Big Data SAGA: efficient update of the averaged gradient. $\bar{G}_t$ and $\bar{G}_{t-1}$ differ only in $\tilde{g}_i$ for $i = i_t$. Before we update $\tilde{g}_{i_t}$, we update $\bar{G}_t = \frac{1}{n}\sum_{i=1}^n \tilde{g}_i = \bar{G}_{t-1} - \frac{1}{n}\tilde{g}_{i_t} + \frac{1}{n}\left(\nabla\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}\right)$. Computation cost: $O(d)$.

96 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)). Per-iteration cost: $O(d)$.
Iteration complexity: smooth loss + non-strongly convex $R$: $O(n/\epsilon)$; smooth loss + $\lambda$-strongly convex $R$: $O((n + \frac{1}{\lambda})\log(1/\epsilon))$; non-smooth loss: N.A.
Total Runtime (strongly convex): $O(d(n + \frac{1}{\lambda})\log(1/\epsilon))$. Same as SVRG!
Use the proximal mapping for the $\ell_1$ regularizer.

98 Optimization Stochastic Optimization Algorithms for Big Data Compare the Runtime of SGD and SVRG/SAGA.
Smooth but non-strongly convex: SGD $O(\frac{d}{\epsilon^2})$; SAGA $O(\frac{dn}{\epsilon})$.
Smooth and strongly convex: SGD $O(\frac{d}{\lambda\epsilon})$; SVRG/SAGA $O(d(n + \frac{1}{\lambda})\log(1/\epsilon))$.
For small $\epsilon$, use SVRG/SAGA; if satisfied with large $\epsilon$, use SGD.

99 Optimization Stochastic Optimization Algorithms for Big Data Conjugate Duality. Define $\ell_i(z) \triangleq \ell(z, y_i)$. Conjugate function $\ell_i^*(\alpha)$ of $\ell_i(z)$:
$$\ell_i(z) = \max_{\alpha \in \mathbb{R}} [\alpha z - \ell_i^*(\alpha)], \qquad \ell_i^*(\alpha) = \max_{z \in \mathbb{R}} [\alpha z - \ell_i(z)]$$
E.g., hinge loss $\ell_i(z) = \max(0, 1 - y_i z)$: $\ell_i^*(\alpha) = \alpha y_i$ if $-1 \leq \alpha y_i \leq 0$, and $+\infty$ otherwise.
E.g., squared hinge loss $\ell_i(z) = \max(0, 1 - y_i z)^2$: $\ell_i^*(\alpha) = \frac{\alpha^2}{4} + \alpha y_i$ if $\alpha y_i \leq 0$, and $+\infty$ otherwise.

101 Optimization Stochastic Optimization Algorithms for Big Data The Dual Problem. From the primal problem to the dual problem:
$$\min_w \frac{1}{n}\sum_{i=1}^n \ell(\underbrace{w^\top x_i}_{z}, y_i) + \frac{\lambda}{2}\|w\|_2^2 = \min_w \frac{1}{n}\sum_{i=1}^n \max_{\alpha_i \in \mathbb{R}} \left[\alpha_i (w^\top x_i) - \ell_i^*(\alpha_i)\right] + \frac{\lambda}{2}\|w\|_2^2 = \max_{\alpha \in \mathbb{R}^n} \frac{1}{n}\sum_{i=1}^n -\ell_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\right\|_2^2$$
Primal solution: $w_* = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^* x_i$.

102 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)). Stochastic Dual Coordinate Ascent (liblinear (Hsieh et al., 2008)). Applicable when $R(w)$ is $\lambda$-strongly convex; smoothness is not required. Solve the dual problem:
$$\max_{\alpha \in \mathbb{R}^n} \frac{1}{n}\sum_{i=1}^n -\ell_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\right\|_2^2$$
Sample $i_t \in \{1, \ldots, n\}$ and optimize $\alpha_{i_t}$ while fixing the other coordinates.

103 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)). Maintain a primal solution: $w_t = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^t x_i$. Optimize the increment $\Delta\alpha_{i_t}$:
$$\max_{\Delta\alpha_{i_t} \in \mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\!\left(-(\alpha_{i_t}^t + \Delta\alpha_{i_t})\right) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^t x_i + \frac{1}{\lambda n}\Delta\alpha_{i_t} x_{i_t}\right\|_2^2 = \max_{\Delta\alpha_{i_t} \in \mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\!\left(-(\alpha_{i_t}^t + \Delta\alpha_{i_t})\right) - \frac{\lambda}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\alpha_{i_t} x_{i_t}\right\|_2^2$$

104 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)). Dual coordinate updates:
$\Delta\alpha_{i_t} = \arg\max_{\Delta\alpha \in \mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\!\left(-(\alpha_{i_t}^t + \Delta\alpha)\right) - \frac{\lambda}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\alpha\, x_{i_t}\right\|_2^2$
$\alpha_{i_t}^{t+1} = \alpha_{i_t}^t + \Delta\alpha_{i_t}$
$w_{t+1} = w_t + \frac{1}{\lambda n}\Delta\alpha_{i_t} x_{i_t}$

105 Optimization Stochastic Optimization Algorithms for Big Data SDCA updates. Closed-form solution for $\Delta\alpha_i$: hinge loss, squared hinge loss, absolute loss and square loss (Shalev-Shwartz & Zhang (2013)). E.g., square loss: $\Delta\alpha_i = \frac{y_i - w_t^\top x_i - \alpha_i^t}{1 + \|x_i\|_2^2/(\lambda n)}$. Per-iteration cost: $O(d)$. Approximate solution: logistic loss (Shalev-Shwartz & Zhang (2013)).
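A minimal sketch of SDCA using the closed-form update from this slide, which matches the loss $\ell(z, y) = \frac{1}{2}(z - y)^2$ (an assumption consistent with the stated formula); data and iteration budget are illustrative.

```python
# SDCA with the closed-form coordinate update for the (1/2)(z - y)^2 loss.
import numpy as np

def sdca_square_loss(X, y, lam, T=50000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                       # w = (1/(lam*n)) * sum_i alpha_i x_i
    for _ in range(T):
        i = rng.integers(n)
        delta = (y[i] - w @ X[i] - alpha[i]) / (1.0 + X[i] @ X[i] / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)     # keep the primal solution in sync
    return w, alpha
```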

106 Optimization Stochastic Optimization Algorithms for Big Data SDCA. Iteration complexity: non-smooth loss + $\lambda$-strongly convex $R$: $O(n + \frac{1}{\lambda\epsilon})$; smooth loss + $\lambda$-strongly convex $R$: $O((n + \frac{1}{\lambda})\log(1/\epsilon))$; non-strongly convex $R$: N.A. (a small trick can fix this).
Total Runtime (smooth loss): $O(d(n + \frac{1}{\lambda})\log(1/\epsilon))$. The same as SVRG and SAGA! SDCA is also equivalent to a kind of variance reduction. A proximal variant handles the elastic net regularizer. Wang & Lin (2014) show that linear convergence is achievable for non-smooth losses.

108 Optimization Stochastic Optimization Algorithms for Big Data SDCA vs SVRG/SAGA. Advantages of SDCA: can handle non-smooth loss functions; can exploit data sparsity for efficient updates; parameter free.

109 Optimization Stochastic Optimization Algorithms for Big Data Randomized Coordinate Updates: Randomized Coordinate Descent; Accelerated Proximal Coordinate Gradient.

110 Optimization Stochastic Optimization Algorithms for Big Data Randomized Coordinate Updates.
$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$$
Suppose $d \gg n$: the per-iteration cost $O(d)$ is too high. Sample over features instead of data; the per-iteration cost becomes $O(n)$. Applicable when $\ell(z, y)$ is smooth and $R(w)$ is decomposable. Strong convexity is not necessary.

111 Optimization Stochastic Optimization Algorithms for Big Data Randomized Coordinate Descent (Nesterov (2012)).
$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{2}\|Xw - y\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$$
$X = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_d] \in \mathbb{R}^{n \times d}$ (columns are features).
Partial gradient: $\nabla_i F(w) = \bar{x}_i^\top(Xw - y) + \lambda w_i$. Randomly sample $i_t \in \{1, \ldots, d\}$.
RCD update: $w_i^t = w_i^{t-1} - \gamma_t \nabla_i F(w^{t-1})$ if $i = i_t$, and $w_i^t = w_i^{t-1}$ otherwise.
Step size $\gamma_t$: constant. $\nabla_i F(w^t)$ can be updated in $O(n)$.

113 Optimization Stochastic Optimization Algorithms for Big Data Randomized Coordinate Descent (Nesterov (2012)). Partial gradient: $\nabla_i F(w) = \bar{x}_i^\top(Xw - y) + \lambda w_i$. Randomly sample $i_t \in \{1, \ldots, d\}$ and apply the RCD update above. Maintain and update $u = Xw - y \in \mathbb{R}^n$ in $O(n)$: $u^t = u^{t-1} + \bar{x}_{i_t}(w_{i_t}^t - w_{i_t}^{t-1}) = u^{t-1} + \bar{x}_{i_t}\Delta w_{i_t}$. The partial gradient can then be computed in $O(n)$: $\nabla_i F(w^t) = \bar{x}_i^\top u^t + \lambda w_i^t$.
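A minimal sketch of RCD for the ridge objective above, maintaining the residual $u = Xw - y$ so each coordinate update costs $O(n)$; the coordinate-wise step $1/(\|\bar{x}_i\|_2^2 + \lambda)$ is a standard choice and an assumption here.

```python
# Randomized coordinate descent with residual maintenance (O(n) per update).
import numpy as np

def rcd_ridge(X, y, lam, T=20000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    u = X @ w - y                               # residual, kept up to date
    col_sq = (X ** 2).sum(axis=0)               # ||x_bar_i||^2 for coordinate-wise steps
    for _ in range(T):
        i = rng.integers(d)
        grad_i = X[:, i] @ u + lam * w[i]       # partial gradient, O(n)
        step = grad_i / (col_sq[i] + lam)       # exact minimization along coordinate i
        w[i] -= step
        u -= step * X[:, i]                     # O(n) residual update
    return w
```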

114 Optimization Stochastic Optimization Algorithms for Big Data RCD. Per-iteration cost: $O(n)$.
Iteration complexity: smooth loss + non-strongly convex $R$: $O(d/\epsilon)$; smooth loss + $\lambda$-strongly convex $R$: $O(\frac{d}{\lambda}\log(1/\epsilon))$; non-smooth loss: N.A.
Total Runtime (strongly convex): $O(\frac{nd}{\lambda}\log(1/\epsilon))$. The same as the gradient descent method! In practice, it can be much faster.

116 Optimization Stochastic Optimization Algorithms for Big Data Accelerated Proximal Coordinate Gradient (APCG).
$$\min_{w \in \mathbb{R}^d} F(w) = \underbrace{\frac{1}{2}\|Xw - y\|_2^2 + \frac{\lambda}{2}\|w\|_2^2}_{f(w)} + \tau\|w\|_1$$
Uses acceleration and the proximal mapping. APCG (Lin et al., 2014):
$w_i^t = \arg\min_{w_i \in \mathbb{R}} \nabla_i f(v^{t-1}) w_i + \frac{1}{2\gamma_t}(w_i - v_i^{t-1})^2 + \tau|w_i|$ if $i = i_t$, and $w_i^t = w_i^{t-1}$ otherwise.
$v^t = w^t + \eta_t(w^t - w^{t-1})$

117 Optimization Stochastic Optimization Algorithms for Big Data APCG. Per-iteration cost: $O(n)$.
Iteration complexity: smooth loss + non-strongly convex $R$: $O(d/\sqrt{\epsilon})$; smooth loss + $\lambda$-strongly convex $R$: $O(\frac{d}{\sqrt{\lambda}}\log(1/\epsilon))$; non-smooth loss: N.A.
Total Runtime (strongly convex): $O(\frac{nd}{\sqrt{\lambda}}\log(1/\epsilon))$. The same as APG! In practice, it can be much faster.

118 Optimization Stochastic Optimization Algorithms for Big Data APCG applied to the Dual. Recall the acceleration scheme for the full gradient method: an auxiliary sequence ($\beta^t$) and a momentum step. Maintain a primal solution: $w_t = \frac{1}{\lambda n}\sum_{i=1}^n \beta_i^t x_i$.
Dual coordinate update: sample $i_t \in \{1, \ldots, n\}$ and compute
$$\Delta\beta_{i_t} = \arg\max_{\Delta\beta \in \mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\!\left(-(\beta_{i_t}^t + \Delta\beta)\right) - \frac{\lambda}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\beta\, x_{i_t}\right\|_2^2$$
Momentum step: $\alpha_{i_t}^{t+1} = \beta_{i_t}^t + \Delta\beta_{i_t}$, $\beta^{t+1} = \alpha^{t+1} + \eta_t(\alpha^{t+1} - \alpha^t)$.

120 Optimization Stochastic Optimization Algorithms for Big Data APCG applied to the Dual. Per-iteration cost: $O(d)$.
Iteration complexity: non-smooth loss + $\lambda$-strongly convex $R$: $O(n + \sqrt{\frac{n}{\lambda\epsilon}})$; smooth loss + $\lambda$-strongly convex $R$: $O((n + \sqrt{\frac{n}{\lambda}})\log(1/\epsilon))$; non-strongly convex $R$: N.A. (a small trick can fix this).
Total Runtime (smooth): $O(d(n + \sqrt{\frac{n}{\lambda}})\log(1/\epsilon))$; could be faster than SDCA's $O(d(n + \frac{1}{\lambda})\log(1/\epsilon))$ when $\lambda \leq \frac{1}{n}$.

121 Optimization Stochastic Optimization Algorithms for Big Data APCG vs. SDCA. [Figure: experimental comparison from Lin et al. (2014)]

122 Optimization Stochastic Optimization Algorithms for Big Data Summary (rows: regularizer $R(w)$; columns: loss $\ell(z, y)$):
Non-strongly convex $R$, non-smooth $\ell$: SGD
Non-strongly convex $R$, smooth $\ell$: RCD, APCG, SAGA
Strongly convex $R$, non-smooth $\ell$: SDCA, APCG
Strongly convex $R$, smooth $\ell$: RCD, APCG, SDCA, SVRG, SAGA
(In the slides, red = stochastic gradient, primal; blue = randomized coordinate, primal; green = stochastic coordinate, dual.)

123 Optimization Stochastic Optimization Algorithms for Big Data Summary.
Parameters: SGD: $\gamma_t$; SVRG: $\gamma_t, m$; SAGA: $\gamma_t$; SDCA: none; APCG: $\eta_t$.
[Table: for each of SGD, SVRG, SAGA, SDCA and APCG, the slides also mark applicability to non-smooth losses, smooth losses, strongly convex and non-strongly convex regularizers, and primal vs. dual formulations.]

127 Optimization Stochastic Optimization Algorithms for Big Data Trick for generalizing to a non-strongly convex regularizer (Shalev-Shwartz & Zhang, 2012).
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \tau\|w\|_1$$
Issue: not strongly convex. Solution: add $\ell_2^2$ regularization:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \tau\|w\|_1 + \frac{\lambda}{2}\|w\|_2^2$$
If $\|w_*\|_2 \leq B$, we can set $\lambda = \frac{\epsilon}{B^2}$. An $\epsilon/2$-suboptimal solution for the new problem is $\epsilon$-suboptimal for the original problem.

129 Optimization Stochastic Optimization Algorithms for Big Data Outline: 2 Optimization: (Sub)Gradient Methods; Stochastic Optimization Algorithms for Big Data (Stochastic Optimization, Distributed Optimization)

130 Optimization Stochastic Optimization Algorithms for Big Data Big Data and Distributed Optimization. Distributed optimization: data are distributed over a cluster of multiple machines; moving all data to a single machine suffers from low network bandwidth and limited disk or memory. Communication vs. computation: RAM access ≈ 100 nanoseconds; a standard network connection ≈ 250,000 nanoseconds.

131 Optimization Stochastic Optimization Algorithms for Big Data Distributed Data. $n$ data points are partitioned and distributed to $K$ machines: $[x_1, x_2, \ldots, x_n] = S_1 \cup S_2 \cup \cdots \cup S_K$. Machine $j$ only has access to $S_j$. W.l.o.g., $|S_j| = n_k = n/K$.

132 Optimization Stochastic Optimization Algorithms for Big Data A simple solution: Average Solutions.
Global problem: $w_* = \arg\min_{w \in \mathbb{R}^d} \left\{F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)\right\}$.
Machine $j$ solves a local problem: $w_j = \arg\min_{w \in \mathbb{R}^d} f_j(w) = \frac{1}{n_k}\sum_{i \in S_j} \ell(w^\top x_i, y_i) + R(w)$.
The center computes $\hat{w} = \frac{1}{K}\sum_{j=1}^K w_j$. Issue: this will not converge to $w_*$.

134 Optimization Stochastic Optimization Algorithms for Big Data Total Runtime. Single machine: Total Runtime = Per-iteration Cost × Iteration Complexity. Distributed optimization: Total Runtime = (Communication Time per Round + Local Runtime per Round) × Rounds of Communication. Trading computation for communication: increase the local computation to reduce the rounds of communication, and balance computation against communication.

135 Optimization Stochastic Optimization Algorithms for Big Data Distributed SDCA (DisDCA) (Yang, 2013), CoCoA+ (Ma et al., 2015). Applicable when $R(w)$ is strongly convex, e.g., $R(w) = \frac{\lambda}{2}\|w\|_2^2$. Global dual problem:
$$\max_{\alpha \in \mathbb{R}^n} \frac{1}{n}\sum_{i=1}^n -\ell_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\right\|_2^2$$
Incremental variables $\Delta\alpha_i$:
$$\max_{\Delta\alpha} \frac{1}{n}\sum_i -\ell_i^*\!\left(-(\alpha_i^t + \Delta\alpha_i)\right) - \frac{\lambda}{2}\left\|w_t + \frac{1}{\lambda n}\sum_{i=1}^n \Delta\alpha_i x_i\right\|_2^2$$
Primal solution: $w_t = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^t x_i$.

137 Optimization Stochastic Optimization Algorithms for Big Data DisDCA: Trading Computation for Communication.
$$\Delta\alpha_{i_j} = \arg\max_{\Delta\alpha} -\ell_{i_j}^*\!\left(-(\alpha_{i_j}^t + \Delta\alpha)\right) - \frac{\lambda n}{2K}\left\|u_j^t + \frac{K}{\lambda n}\Delta\alpha\, x_{i_j}\right\|_2^2$$
$$u_j^{t+1} = u_j^t + \frac{K}{\lambda n}\Delta\alpha_{i_j} x_{i_j}$$

138 Optimization Stochastic Optimization Algorithms for Big Data DisDCA: Trading Computation for Communication. [Figures: illustration of local dual updates on each machine and the communication rounds between machines and the center]

140 Optimization Stochastic Optimization Algorithms for Big Data CoCoA+ (Ma et al., 2015). Machine $j$ approximately solves
$$\max_{\Delta\alpha_{S_j}} \frac{1}{n}\sum_{i \in S_j} -\ell_i^*\!\left(-(\alpha_i^t + \Delta\alpha_i)\right) - \frac{1}{n}\left\langle w_t, \sum_{i \in S_j}\Delta\alpha_i x_i\right\rangle - \frac{K}{2\lambda n^2}\left\|\sum_{i \in S_j}\Delta\alpha_i x_i\right\|_2^2$$
$\alpha_{S_j}^{t+1} = \alpha_{S_j}^t + \Delta\alpha_{S_j}^t$, $\Delta w_j^t = \frac{1}{\lambda n}\sum_{i \in S_j}\Delta\alpha_i^t x_i$. The center computes $w_{t+1} = w_t + \sum_{j=1}^K \Delta w_j^t$.

141 Optimization Stochastic Optimization Algorithms for Big Data CoCoA+ (Ma et al., 2015). Local objective value:
$$G_j(\Delta\alpha_{S_j}, w^t) = \frac{1}{n}\sum_{i \in S_j} -\ell_i^*\!\left(-(\alpha_i^t + \Delta\alpha_i)\right) - \frac{1}{n}\left\langle w^t, \sum_{i \in S_j}\Delta\alpha_i x_i\right\rangle - \frac{K}{2\lambda n^2}\left\|\sum_{i \in S_j}\Delta\alpha_i x_i\right\|_2^2$$
Solve $\Delta\alpha_{S_j}^t$ by any local solver as long as
$$\max_{\Delta\alpha_{S_j}} G_j(\Delta\alpha_{S_j}, w^t) - G_j(\Delta\alpha_{S_j}^t, w^t) \leq \Theta\left(\max_{\Delta\alpha_{S_j}} G_j(\Delta\alpha_{S_j}, w^t) - G_j(0, w^t)\right), \quad 0 < \Theta < 1$$
CoCoA+ is equivalent to DisDCA when employing SDCA to solve the local problems with $m$ iterations.

142 Optimization Stochastic Optimization Algorithms for Big Data Distributed SDCA in Practice. Choice of $m$ (the number of inner iterations): the larger $m$, the higher the local computation cost and the lower the communication cost. Choice of $K$ (the number of machines): the larger $K$, the lower the local computation cost and the higher the communication cost.

143 Optimization Stochastic Optimization Algorithms for Big Data DisDCA is implemented: tyng/software.html. Classification and regression losses: 1. hinge loss and squared hinge loss (SVM); 2. logistic loss (logistic regression); 3. square loss (ridge regression / LASSO). Regularizers: 1. $\ell_2$ norm; 2. mixture of $\ell_1$ norm and $\ell_2$ norm. Multi-class: one-vs-all.

144 Optimization Stochastic Optimization Algorithms for Big Data Alternating Direction Method of Multipliers (ADMM).
$$\min_{w \in \mathbb{R}^d} F(w) = \sum_{k=1}^K \underbrace{\frac{1}{n}\sum_{i \in S_k} \ell(w^\top x_i, y_i)}_{f_k(w)} + \frac{\lambda}{2}\|w\|_2^2$$
Each $f_k(w)$ resides on an individual machine, but the variables $w$ are coupled together. Reformulate:
$$\min_{w_1, \ldots, w_K, w \in \mathbb{R}^d} F(w) = \sum_{k=1}^K f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 \quad \text{s.t. } w_k = w,\ k = 1, \ldots, K$$

145 Optimization Stochastic Optimization Algorithms for Big Data The Augmented Lagrangian Function.
$$\min_{w_1, \ldots, w_K, w \in \mathbb{R}^d} \sum_{k=1}^K f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 \quad \text{s.t. } w_k = w,\ k = 1, \ldots, K$$
The augmented Lagrangian function (with Lagrangian multipliers $z_k$)
$$L(\{w_k\}, \{z_k\}, w) = \sum_{k=1}^K f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 + \sum_{k=1}^K z_k^\top(w_k - w) + \frac{\rho}{2}\sum_{k=1}^K \|w_k - w\|_2^2$$
is the Lagrangian function of
$$\min_{w_1, \ldots, w_K, w \in \mathbb{R}^d} \sum_{k=1}^K f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 + \frac{\rho}{2}\sum_{k=1}^K \|w_k - w\|_2^2 \quad \text{s.t. } w_k = w,\ k = 1, \ldots, K$$

148 Optimization Stochastic Optimization Algorithms for Big Data ADMM.
$$L(\{w_k\}, \{z_k\}, w) = \sum_{k=1}^K f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 + \sum_{k=1}^K z_k^\top(w_k - w) + \frac{\rho}{2}\sum_{k=1}^K \|w_k - w\|_2^2$$
Update from $(w_k^t, z_k^t, w^t)$ to $(w_k^{t+1}, z_k^{t+1}, w^{t+1})$:
Optimize on individual machines: $w_k^{t+1} = \arg\min_{w_k} f_k(w_k) + (z_k^t)^\top(w_k - w^t) + \frac{\rho}{2}\|w_k - w^t\|_2^2$, $k = 1, \ldots, K$.
Aggregate and update on one machine: $w^{t+1} = \arg\min_w \frac{\lambda}{2}\|w\|_2^2 - \sum_{k=1}^K (z_k^t)^\top w + \frac{\rho}{2}\sum_{k=1}^K \|w_k^{t+1} - w\|_2^2$.
Update on individual machines: $z_k^{t+1} = z_k^t + \rho(w_k^{t+1} - w^{t+1})$.

152 Optimization Stochastic Optimization Algorithms for Big Data ADMM. $w_k^{t+1} = \arg\min_{w_k} f_k(w_k) + (z_k^t)^\top(w_k - w^t) + \frac{\rho}{2}\|w_k - w^t\|_2^2$, $k = 1, \ldots, K$. Each local problem can be solved by a local solver (e.g., SDCA). The optimization can be inexact (trading computation for communication).
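A minimal sketch of the consensus ADMM iteration above for distributed ridge regression with $K$ data blocks; each local subproblem is solved exactly in closed form here, which is an assumption (the tutorial allows inexact local solvers such as SDCA), and $\rho$ is an illustrative choice.

```python
# Consensus ADMM for distributed ridge regression.
import numpy as np

def admm_ridge(blocks, lam, rho=1.0, T=100):
    """blocks: list of (X_k, y_k) pairs, one per machine."""
    K = len(blocks)
    d = blocks[0][0].shape[1]
    n = sum(Xk.shape[0] for Xk, _ in blocks)
    w = np.zeros(d)
    W = np.zeros((K, d))                        # local variables w_k
    Z = np.zeros((K, d))                        # dual variables z_k
    for _ in range(T):
        for k, (Xk, yk) in enumerate(blocks):   # local updates (in parallel in practice)
            # w_k = argmin (1/n)||X_k w - y_k||^2 + z_k^T (w - w_bar) + (rho/2)||w - w_bar||^2
            A = 2.0 / n * Xk.T @ Xk + rho * np.eye(d)
            b = 2.0 / n * Xk.T @ yk - Z[k] + rho * w
            W[k] = np.linalg.solve(A, b)
        # global update: w = argmin (lam/2)||w||^2 - sum_k z_k^T w + (rho/2) sum_k ||w_k - w||^2
        w = (Z.sum(axis=0) + rho * W.sum(axis=0)) / (lam + rho * K)
        Z += rho * (W - w)                      # dual updates
    return w
```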

153 Optimization Stochastic Optimization Algorithms for Big Data Complexity of ADMM. Assume the local problems are solved exactly. Communication complexity: $O(\log(1/\epsilon))$, due to the strong convexity of $R(w)$. Also applicable to a non-strongly convex regularizer $R(w) = \|w\|_1$:
$$\min_{w \in \mathbb{R}^d} F(w) = \sum_{k=1}^K \frac{1}{n}\sum_{i \in S_k}\ell(w^\top x_i, y_i) + \tau\|w\|_1$$
Communication complexity: $O(1/\epsilon)$.

154 Optimization Stochastic Optimization Algorithms for Big Data Thank You! Questions?

155 Optimization Stochastic Optimization Algorithms for Big Data Research Assistant Positions Available for PhD Candidates! Start Fall '16. Topics: Optimization and Randomization, Online Learning, Deep Learning, Machine Learning. Send to:

156 Randomized Dimension Reduction Big Data Analytics: Optimization and Randomization. Part III: Randomization

157 Randomized Dimension Reduction Outline: 1 Basics; 2 Optimization; 3 Randomized Dimension Reduction; 4 Randomized Algorithms; 5 Concluding Remarks

158 Randomized Dimension Reduction Random Sketch: approximate a large data matrix by a much smaller sketch.

159 Randomized Dimension Reduction The Framework of Randomized Algorithms. [Figures: large data matrix → random sketch → approximate solution]

163 Randomized Dimension Reduction Why randomized dimension reduction? Efficient; robust (e.g., dropout); formal guarantees; can exploit parallel algorithms.

164 Randomized Dimension Reduction Randomized Dimension Reduction: Johnson-Lindenstrauss (JL) transforms; subspace embeddings; column sampling.

165 Randomized Dimension Reduction JL Lemma (Johnson & Lindenstrauss, 1984). For any $0 < \epsilon, \delta < 1/2$, there exists a probability distribution on $m \times d$ real matrices $A$ such that there exists a small universal constant $c > 0$ and, for any fixed $x \in \mathbb{R}^d$, with probability at least $1 - \delta$,
$$\left|\|Ax\|_2^2 - \|x\|_2^2\right| \leq c\sqrt{\frac{\log(1/\delta)}{m}}\|x\|_2^2$$
or, for $m = \Theta(\epsilon^{-2}\log(1/\delta))$, with probability at least $1 - \delta$,
$$\left|\|Ax\|_2^2 - \|x\|_2^2\right| \leq \epsilon\|x\|_2^2$$

166 Randomized Dimension Reduction Embedding a set of points into a low dimensional space. Given a set of points $x_1, \ldots, x_n \in \mathbb{R}^d$, we can embed them into a low dimensional space as $Ax_1, \ldots, Ax_n \in \mathbb{R}^m$ such that the pairwise distance between any two points is well preserved:
$$\|Ax_i - Ax_j\|_2^2 = \|A(x_i - x_j)\|_2^2 \leq (1 + \epsilon)\|x_i - x_j\|_2^2, \qquad \|Ax_i - Ax_j\|_2^2 = \|A(x_i - x_j)\|_2^2 \geq (1 - \epsilon)\|x_i - x_j\|_2^2$$
In other words, to preserve all pairwise Euclidean distances up to $1 \pm \epsilon$, only $m = \Theta(\epsilon^{-2}\log(n^2/\delta))$ dimensions are necessary.

167 Randomized Dimension Reduction JL transforms: Gaussian Random Projection (Dasgupta & Gupta, 2003). $A \in \mathbb{R}^{m \times d}$ with $A_{ij} \sim \mathcal{N}(0, 1/m)$ and $m = \Theta(\epsilon^{-2}\log(1/\delta))$. Computational cost of $AX$, where $X \in \mathbb{R}^{d \times n}$: $mnd$ for dense matrices, $\mathrm{nnz}(X)\,m$ for sparse matrices. The computational cost is very high (it could be as high as solving many problems).
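A minimal sketch of the Gaussian random projection above; the dimensions and data are illustrative, and the printed ratios simply check that squared norms concentrate around 1.

```python
# Gaussian JL projection: A has i.i.d. N(0, 1/m) entries.
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 10000, 50, 400
X = rng.standard_normal((d, n))                 # data: columns are points in R^d
A = rng.standard_normal((m, d)) / np.sqrt(m)    # A_ij ~ N(0, 1/m)
AX = A @ X                                      # sketch: costs m*n*d for dense X

ratios = (AX ** 2).sum(axis=0) / (X ** 2).sum(axis=0)
print(ratios.min(), ratios.max())               # concentrated around 1
```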

168 Randomized Dimension Reduction Accelerate JL transforms: using discrete distributions (Achlioptas, 2003). $\Pr(A_{ij} = \pm\frac{1}{\sqrt{m}}) = 0.5$, or $\Pr(A_{ij} = \pm\sqrt{\frac{3}{m}}) = \frac{1}{6}$, $\Pr(A_{ij} = 0) = \frac{2}{3}$. Database friendly: replaces multiplications by additions and subtractions.

169 Randomized Dimension Reduction Accelerate JL transforms: using the Hadamard transform (I). Fast JL transform based on the randomized Hadamard transform. Motivation: can we simply use a random sampling matrix $P \in \mathbb{R}^{m \times d}$ that randomly selects $m$ coordinates out of $d$ coordinates (scaled by $\sqrt{d/m}$)? Unfortunately, by the Chernoff bound,
$$\left|\|Px\|_2^2 - \|x\|_2^2\right| \leq \frac{d\|x\|_\infty^2}{\|x\|_2^2}\sqrt{\frac{3\log(2/\delta)}{m}}\|x\|_2^2$$
Unless $\frac{d\|x\|_\infty^2}{\|x\|_2^2} \leq c$, random sampling does not work. The remedy is given by the randomized Hadamard transform.

172 Randomized Dimension Reduction Randomized Hadamard transform Hadamard transform $H \in \mathbb{R}^{d \times d}$: $H = \frac{1}{\sqrt{d}} H_{2^k}$, where $H_1 = [1]$, $H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$, $H_{2^k} = \begin{bmatrix} H_{2^{k-1}} & H_{2^{k-1}} \\ H_{2^{k-1}} & -H_{2^{k-1}} \end{bmatrix}$ $\|Hx\|_2 = \|x\|_2$ and $H$ is orthogonal Computational cost of $Hx$: $d\log(d)$ Randomized Hadamard transform: $HD$, where $D \in \mathbb{R}^{d \times d}$ is a diagonal matrix with $\Pr(D_{ii} = \pm 1) = 0.5$ $HD$ is orthogonal and $\|HDx\|_2 = \|x\|_2$ Key property: $\frac{\sqrt{d}\,\|HDx\|_\infty}{\|HDx\|_2} \lesssim \sqrt{\log(d/\delta)}$ w.h.p. $1 - \delta$ Yang Tutorial for ACML 15 Nov. 20, / 210

175 Randomized Dimension Reduction Accelerate JL transforms: using Hadamard transform (I) Fast JL transform based on the randomized Hadamard transform (Tropp, 2011): $A = \sqrt{\tfrac{d}{m}}\,PHD$ yields $\big|\|Ax\|_2^2 - \|x\|_2^2\big| \le \sqrt{\tfrac{\log(2/\delta)\log(d/\delta)}{m}}\,\|x\|_2^2$ $m = \Theta(\epsilon^{-2}\log(1/\delta)\log(d/\delta))$ suffices for $1 \pm \epsilon$; the additional factor $\log(d/\delta)$ can be removed Computational cost of $AX$: $O(nd\log(m))$ Yang Tutorial for ACML 15 Nov. 20, / 210
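
A compact numpy sketch of this subsampled randomized Hadamard construction, assuming $d$ is a power of two; the recursive fast Walsh-Hadamard routine and all names are illustrative, meant only to show the $O(d\log d)$ structure.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform of a length-2^k vector (unnormalized)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

def srht(x, m, rng=None):
    """A x with A = sqrt(d/m) * P * H * D: random signs, Hadamard, then sample m coordinates."""
    rng = np.random.default_rng(rng)
    d = len(x)                                   # assumes d is a power of two
    signs = rng.choice([-1.0, 1.0], size=d)      # diagonal of D
    y = fwht(signs * x) / np.sqrt(d)             # H D x with H = H_{2^k} / sqrt(d)
    idx = rng.choice(d, size=m, replace=False)   # P: sampling without replacement
    return np.sqrt(d / m) * y[idx]

x = np.random.default_rng(2).standard_normal(1024)
z = srht(x, m=256, rng=2)
print((z**2).sum() / (x**2).sum())               # close to 1
```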

176 Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (I) Random hashing (Dasgupta et al., 2010): $A = HD$, where $D \in \mathbb{R}^{d \times d}$ and $H \in \mathbb{R}^{m \times d}$ Random hashing: $h(j): \{1, \ldots, d\} \rightarrow \{1, \ldots, m\}$, $H_{ij} = 1$ if $h(j) = i$: a sparse matrix (each column has only one non-zero entry) $D \in \mathbb{R}^{d \times d}$: a diagonal matrix with $\Pr(D_{ii} = \pm 1) = 0.5$ $[Ax]_j = \sum_{i: h(i) = j} D_{ii} x_i$ Technically speaking, random hashing does not satisfy the JL lemma Yang Tutorial for ACML 15 Nov. 20, / 210
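
A minimal numpy sketch of random hashing (essentially a count-sketch / feature-hashing map); the random bucket assignments stand in for the hash function $h$, and the sizes are arbitrary.

```python
import numpy as np

def random_hashing(x, m, rng=None):
    """[Ax]_j = sum over i with h(i)=j of D_ii * x_i, with random signs D and random buckets h."""
    rng = np.random.default_rng(rng)
    d = len(x)
    h = rng.integers(0, m, size=d)               # bucket of each coordinate
    signs = rng.choice([-1.0, 1.0], size=d)      # diagonal of D
    out = np.zeros(m)
    np.add.at(out, h, signs * x)                 # scatter-add: cost O(nnz(x))
    return out

x = np.random.default_rng(3).standard_normal(10_000)
print((random_hashing(x, 2_000, rng=3)**2).sum() / (x**2).sum())  # norms preserved in expectation
```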

178 Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (I) Key properties: $\mathrm{E}[\langle HDx_1, HDx_2 \rangle] = \langle x_1, x_2 \rangle$ and norm preserving: $\big|\|HDx\|_2^2 - \|x\|_2^2\big| \le \epsilon\|x\|_2^2$ only when $\frac{\|x\|_\infty}{\|x\|_2} \le \frac{1}{\sqrt{c}}$ Apply the randomized Hadamard transform $P$ first: $\Theta(c\log(c/\delta))$ blocks of the randomized Hadamard transform give $\frac{\|Px\|_\infty}{\|Px\|_2} \le \frac{1}{\sqrt{c}}$ Yang Tutorial for ACML 15 Nov. 20, / 210

179 Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (II) Sparse JL transform based on block random hashing (Kane & Nelson, 2014): $A = \frac{1}{\sqrt{s}}\begin{bmatrix} Q_1 \\ \vdots \\ Q_s \end{bmatrix}$ Each $Q_i \in \mathbb{R}^{v \times d}$ is an independent random hashing ($HD$) matrix Set $v = \Theta(\epsilon^{-1})$ and $s = \Theta(\epsilon^{-1}\log(1/\delta))$ Computational cost of $AX$: $O\!\left(\frac{\mathrm{nnz}(X)}{\epsilon}\log\!\left[\frac{1}{\delta}\right]\right)$ Yang Tutorial for ACML 15 Nov. 20, / 210

180 Randomized Dimension Reduction Randomized Dimension Reduction Johnson-Lindenstrauss (JL) transforms Subspace embeddings Column sampling Yang Tutorial for ACML 15 Nov. 20, / 210

181 Randomized Dimension Reduction Subspace Embeddings Definition: a subspace embedding, given some parameters $0 < \epsilon, \delta < 1$ and $k \le d$, is a distribution $\mathcal{D}$ over matrices $A \in \mathbb{R}^{m \times d}$ such that for any fixed linear subspace $W \subseteq \mathbb{R}^d$ with $\dim(W) = k$ it holds that $\Pr_{A \sim \mathcal{D}}\big(\forall x \in W,\ \|Ax\|_2 \in (1 \pm \epsilon)\|x\|_2\big) \ge 1 - \delta$ It implies: if $U \in \mathbb{R}^{d \times k}$ is an orthogonal matrix (contains the orthonormal bases of $W$), then $AU \in \mathbb{R}^{m \times k}$ is of full column rank, the singular values of $AU$ lie in $(1 \pm \epsilon)$, and $(1-\epsilon)^2 I \preceq U^\top A^\top A U \preceq (1+\epsilon)^2 I$ These are key properties in the theoretical analysis of many algorithms (e.g., low-rank matrix approximation, randomized least-squares regression, randomized classification) Yang Tutorial for ACML 15 Nov. 20, / 210

184 Randomized Dimension Reduction Subspace Embeddings From a JL transform to a Subspace Embedding (Sarlós, 2006). Let $A \in \mathbb{R}^{m \times d}$ be a JL transform. If $m = O\!\left(\frac{k\log\frac{k}{\delta\epsilon}}{\epsilon^2}\right)$, then w.h.p. $1 - \delta$, $A \in \mathbb{R}^{m \times d}$ is a subspace embedding w.r.t. a $k$-dimensional subspace of $\mathbb{R}^d$ Yang Tutorial for ACML 15 Nov. 20, / 210
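
A small numerical illustration (not from the tutorial) of the implied property: for a random $k$-dimensional subspace with orthonormal basis $U$ and a Gaussian JL matrix $A$, the singular values of $AU$ concentrate around 1. All sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, m = 2000, 10, 400

# orthonormal basis U of a random k-dimensional subspace of R^d
U, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Gaussian JL matrix A in R^{m x d}
A = rng.standard_normal((m, d)) / np.sqrt(m)

sv = np.linalg.svd(A @ U, compute_uv=False)
print("singular values of AU in [%.3f, %.3f]" % (sv.min(), sv.max()))  # concentrated around 1
```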

185 Randomized Dimension Reduction Subspace Embeddings Making block random hashing a Subspace Embedding (Nelson & Nguyen, 2013). $A = \frac{1}{\sqrt{s}}\begin{bmatrix} Q_1 \\ \vdots \\ Q_s \end{bmatrix}$ Each $Q_i \in \mathbb{R}^{v \times d}$ is an independent random hashing ($HD$) matrix Set $v = \Theta(k\epsilon^{-1}\log^5(k/\delta))$ and $s = \Theta(\epsilon^{-1}\log^3(k/\delta))$ W.h.p. $1 - \delta$, $A \in \mathbb{R}^{m \times d}$ with $m = \Theta\!\left(\frac{k\log^8(k/\delta)}{\epsilon^2}\right)$ is a subspace embedding w.r.t. a $k$-dimensional subspace of $\mathbb{R}^d$ Computational cost of $AX$: $O\!\left(\frac{\mathrm{nnz}(X)}{\epsilon}\log^3\!\left[\frac{k}{\delta}\right]\right)$ Yang Tutorial for ACML 15 Nov. 20, / 210

186 Randomized Dimension Reduction Sparse Subspace Embedding (SSE) Random hashing is an SSE with a constant probability (Nelson & Nguyen, 2013): $A = HD$, where $D \in \mathbb{R}^{d \times d}$ and $H \in \mathbb{R}^{m \times d}$ $m = \Omega(k^2/\epsilon^2)$ suffices for a subspace embedding with probability $2/3$ Computational cost of $AX$: $O(\mathrm{nnz}(X))$ Yang Tutorial for ACML 15 Nov. 20, / 210

187 Randomized Dimension Reduction Randomized Dimensionality Reduction Johnson-Lindenstrauss (JL) transforms Subspace embeddings Column (Row) sampling Yang Tutorial for ACML 15 Nov. 20, / 210

188 Randomized Dimension Reduction Column sampling Column subset selection (feature selection) More interpretable Uniform sampling usually does not work (not a JL transform) Non-oblivious sampling (data-dependent sampling) leverage-score sampling Yang Tutorial for ACML 15 Nov. 20, / 210

191 Randomized Dimension Reduction Leverage-score sampling (Drineas et al., 2006) Let $X \in \mathbb{R}^{d \times n}$ be a rank-$k$ matrix, $X = U\Sigma V^\top$: $U \in \mathbb{R}^{d \times k}$, $\Sigma \in \mathbb{R}^{k \times k}$ Leverage scores: $\|U_i\|_2^2$, $i = 1, \ldots, d$ Let $p_i = \frac{\|U_i\|_2^2}{\sum_{i=1}^d \|U_i\|_2^2}$, $i = 1, \ldots, d$ Let $i_1, \ldots, i_m \in \{1, \ldots, d\}$ denote $m$ indices selected by sampling according to $p_i$ Let $A \in \mathbb{R}^{m \times d}$ be the sampling-and-rescaling matrix: $A_{tj} = \frac{1}{\sqrt{m p_{i_t}}}$ if $j = i_t$ and $0$ otherwise $AX \in \mathbb{R}^{m \times n}$ is a small sketch of $X$ Yang Tutorial for ACML 15 Nov. 20, / 210
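
A minimal numpy sketch of leverage-score sampling as defined above; $k$, $m$, and the synthetic matrix are illustrative, and the snippet also reports the coherence measure $\mu_k$ used on a later slide.

```python
import numpy as np

def leverage_score_sketch(X, k, m, rng=None):
    """Sample m rows of X (d x n) with probabilities proportional to rank-k leverage scores."""
    rng = np.random.default_rng(rng)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U = U[:, :k]                                  # top-k left singular vectors
    lev = (U**2).sum(axis=1)                      # leverage scores ||U_i||_2^2 (sum to k)
    p = lev / lev.sum()
    idx = rng.choice(X.shape[0], size=m, p=p)     # sample with replacement following p
    scale = 1.0 / np.sqrt(m * p[idx])             # sampling-and-rescaling
    coherence = X.shape[0] / k * lev.max()        # mu_k = (d/k) max_i ||U_i||_2^2
    return scale[:, None] * X[idx], coherence

rng = np.random.default_rng(5)
X = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 300))   # rank-20 matrix
sketch, mu = leverage_score_sketch(X, k=20, m=120, rng=5)
print(sketch.shape, "coherence:", round(mu, 2))
```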

196 Randomized Dimension Reduction Properties of Leverage-score sampling When $m = \Theta\!\left(\frac{k}{\epsilon^2}\log\!\left[\frac{2k}{\delta}\right]\right)$, w.h.p. $1 - \delta$, $AU \in \mathbb{R}^{m \times k}$ is of full column rank and $(1-\epsilon) \le \sigma_i(AU) \le (1+\epsilon)$, i.e., $(1-\epsilon)^2 \le \sigma_i^2(AU) \le (1+\epsilon)^2$ Leverage-score sampling performs like a subspace embedding (only for $U$, the top singular vector matrix of $X$) Computational cost: compute the top-$k$ SVD of $X$, expensive Randomized algorithms to compute approximate leverage scores Yang Tutorial for ACML 15 Nov. 20, / 210

198 Randomized Dimension Reduction When does uniform sampling make sense? Coherence measure: $\mu_k = \frac{d}{k}\max_{1 \le i \le d}\|U_i\|_2^2$ Valid when the coherence measure is small (some real data mining datasets have small coherence measures) The Nyström method usually uses uniform sampling (Gittens, 2011) Yang Tutorial for ACML 15 Nov. 20, / 210

199 Randomized Algorithms Randomized Classification (Regression) Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang Tutorial for ACML 15 Nov. 20, / 210

200 Randomized Algorithms Randomized Classification (Regression) Classification Classification problems: $\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$ $y_i \in \{+1, -1\}$: label Loss function $\ell(z)$, $z = y w^\top x$: 1. SVMs: (squared) hinge loss $\ell(z) = \max(0, 1-z)^p$, where $p = 1, 2$ 2. Logistic Regression: $\ell(z) = \log(1 + \exp(-z))$ Yang Tutorial for ACML 15 Nov. 20, / 210

201 Randomized Algorithms Randomized Classification (Regression) Randomized Classification For large-scale high-dimensional problems, the computational cost of optimization is $O((nd + d\kappa)\log(1/\epsilon))$. Using a random reduction $A \in \mathbb{R}^{d \times m}$ ($m \ll d$), e.g., JL transforms or sparse subspace embeddings, we reduce $X \in \mathbb{R}^{n \times d}$ to $\widetilde{X} = XA \in \mathbb{R}^{n \times m}$. Then solve $\min_{u \in \mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell(y_i u^\top \widetilde{x}_i) + \frac{\lambda}{2}\|u\|_2^2$ Yang Tutorial for ACML 15 Nov. 20, / 210
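
A toy numpy illustration of the reduce-then-solve recipe; for simplicity it substitutes the squared loss for the hinge/logistic losses so that the reduced problem has a closed-form solution, and the low-rank synthetic data and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, r, m, lam = 1000, 5000, 50, 300, 1e-3

# low-rank, high-dimensional data with +/-1 labels
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d)) / np.sqrt(r)
w_true = rng.standard_normal(d) / np.sqrt(d)
y = np.sign(X @ w_true)

# random reduction: Gaussian A in R^{d x m}, reduced data X_tilde = X A in R^{n x m}
A = rng.standard_normal((d, m)) / np.sqrt(m)
Xt = X @ A

# solve the reduced ridge problem  min_u (1/n)||Xt u - y||^2 + lam ||u||^2  in closed form
u = np.linalg.solve(Xt.T @ Xt / n + lam * np.eye(m), Xt.T @ y / n)

acc = np.mean(np.sign(Xt @ u) == y)
print("training accuracy in the reduced space:", acc)
```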

202 Randomized Algorithms Randomized Classification (Regression) Randomized Classification Two questions: Is there any performance guarantee? The margin is preserved if the data is linearly separable (Balcan et al., 2006), as long as $m \ge \frac{12}{\epsilon}\log\!\left(\frac{6m^2}{\delta}\right)$; the generalization performance is preserved if the data matrix is of low rank and $m = \Omega\!\left(\frac{k\,\mathrm{poly}(\log(k/\delta\epsilon))}{\epsilon^2}\right)$ (Paul et al., 2013) How to recover an accurate model in the original high-dimensional space? Dual Recovery (Zhang et al., 2014) and Dual Sparse Recovery (Yang et al., 2015) Yang Tutorial for ACML 15 Nov. 20, / 210

207 Randomized Algorithms Randomized Classification (Regression) The Dual Problem Using the Fenchel conjugate: $\ell_i^*(\alpha_i) = \max_z \alpha_i z - \ell(z, y_i)$ Primal: $w_* = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$ Dual: $\alpha_* = \arg\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top X X^\top \alpha$ From dual to primal: $w_* = \frac{1}{\lambda n} X^\top \alpha_*$ Yang Tutorial for ACML 15 Nov. 20, / 210

208 Randomized Algorithms Randomized Classification (Regression) Dual Recovery for Randomized Reduction From the dual formulation: $w_*$ lies in the row space of the data matrix $X \in \mathbb{R}^{n \times d}$ Dual Recovery: $\hat{w} = \frac{1}{\lambda n} X^\top \hat{\alpha}$, where $\hat{\alpha} = \arg\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widetilde{X}\widetilde{X}^\top \alpha$ and $\widetilde{X} = XA \in \mathbb{R}^{n \times m}$ Subspace embedding $A$ with $m = \Theta(r\log(r/\delta)\epsilon^{-2})$ Guarantee: under a low-rank assumption on the data matrix $X$ (e.g., $\mathrm{rank}(X) = r$), with a high probability $1 - \delta$, $\|\hat{w} - w_*\|_2 \le \frac{\epsilon}{1-\epsilon}\|w_*\|_2$ Yang Tutorial for ACML 15 Nov. 20, / 210
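
An illustrative numpy sketch of dual recovery under the same squared-loss simplification; the closed-form expression for $\hat{\alpha}$ below is the ridge-regression dual solution derived under that assumption, not a formula from the tutorial, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, r, m, lam = 300, 4000, 20, 400, 1e-2

# low-rank data matrix X (n x d) with rank r, and real-valued targets y
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d)) / np.sqrt(r * d)
y = np.sign(rng.standard_normal(n))

# exact primal ridge solution:  w* = X^T (X X^T + lam*n I)^{-1} y
w_star = X.T @ np.linalg.solve(X @ X.T + lam * n * np.eye(n), y)

# reduce features, solve the reduced dual in closed form, then recover a model in R^d
A = rng.standard_normal((d, m)) / np.sqrt(m)          # JL / subspace-embedding matrix
Xt = X @ A
alpha_hat = lam * n * np.linalg.solve(Xt @ Xt.T + lam * n * np.eye(n), y)
w_hat = X.T @ alpha_hat / (lam * n)                   # dual recovery uses the original X

print("relative recovery error:", np.linalg.norm(w_hat - w_star) / np.linalg.norm(w_star))
```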

211 Randomized Algorithms Randomized Classification (Regression) Dual Sparse Recovery for Randomized Reduction Assume the optimal dual solution $\alpha_*$ is sparse (i.e., the number of support vectors is small) Dual Sparse Recovery: $\hat{w} = \frac{1}{\lambda n} X^\top \hat{\alpha}$, where $\hat{\alpha} = \arg\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widetilde{X}\widetilde{X}^\top \alpha - \frac{\tau}{n}\|\alpha\|_1$ and $\widetilde{X} = XA \in \mathbb{R}^{n \times m}$ JL transform $A$ with $m = \Theta(s\log(n/\delta)\epsilon^{-2})$ Guarantee: if $\alpha_*$ is $s$-sparse, with a high probability $1 - \delta$, $\|\hat{w} - w_*\|_2 \le \epsilon\|w_*\|_2$ Yang Tutorial for ACML 15 Nov. 20, / 210

214 Randomized Algorithms Randomized Classification (Regression) Dual Sparse Recovery RCV1 text data, n = 677,399 and d = 47,236 [Plots: relative dual error and relative primal error (L2 norm) versus τ for a fixed λ, with m = 1024, 2048, 4096, 8192] Yang Tutorial for ACML 15 Nov. 20, / 210

215 Randomized Algorithms Randomized Least-Squares Regression Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang Tutorial for ACML 15 Nov. 20, / 210

216 Randomized Algorithms Randomized Least-Squares Regression Least-squares regression Let $X \in \mathbb{R}^{n \times d}$ with $d \ll n$ and $b \in \mathbb{R}^n$. The least-squares regression problem is to find $w_*$ such that $w_* = \arg\min_{w \in \mathbb{R}^d} \|Xw - b\|_2$ Computational cost: $O(nd^2)$ Goal of RA: $o(nd^2)$ Yang Tutorial for ACML 15 Nov. 20, / 210

217 Randomized Algorithms Randomized Least-Squares Regression Randomized Least-squares regression Let $A \in \mathbb{R}^{m \times n}$ be a random reduction matrix. Solve $\hat{w} = \arg\min_{w \in \mathbb{R}^d} \|A(Xw - b)\|_2 = \arg\min_{w \in \mathbb{R}^d} \|AXw - Ab\|_2$ Computational cost: $O(md^2)$ + reduction time Yang Tutorial for ACML 15 Nov. 20, / 210
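
A minimal sketch-and-solve example for least squares in numpy; a Gaussian reduction is used here for simplicity (a sparse subspace embedding would give the nnz(X)-time variant on the next slide), and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, m = 10_000, 50, 500

X = rng.standard_normal((n, d))
b = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)

# exact solution and sketched solution
w_exact, *_ = np.linalg.lstsq(X, b, rcond=None)
A = rng.standard_normal((m, n)) / np.sqrt(m)            # random reduction A in R^{m x n}
w_sketch, *_ = np.linalg.lstsq(A @ X, A @ b, rcond=None)

res = lambda w: np.linalg.norm(X @ w - b)
print("residual ratio ||X w_hat - b|| / ||X w* - b||:", res(w_sketch) / res(w_exact))  # about 1 + eps
```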

218 Randomized Algorithms Randomized Least-Squares Regression Randomized Least-squares regression Theoretical Guarantees (Sarlós, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012): $\|X\hat{w} - b\|_2 \le (1 + \epsilon)\|Xw_* - b\|_2$ Total time: $O(\mathrm{nnz}(X) + d^3\log(d/\epsilon)\epsilon^{-2})$ Yang Tutorial for ACML 15 Nov. 20, / 210

219 Randomized Algorithms Randomized K-means Clustering Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang Tutorial for ACML 15 Nov. 20, / 210

220 Randomized Algorithms Randomized K-means Clustering K-means Clustering Let $x_1, \ldots, x_n \in \mathbb{R}^d$ be a set of data points. K-means clustering aims to solve $\min_{C_1, \ldots, C_k}\sum_{j=1}^k\sum_{x_i \in C_j}\|x_i - \mu_j\|_2^2$ Computational cost: $O(ndkt)$, where $t$ is the number of iterations. Yang Tutorial for ACML 15 Nov. 20, / 210

221 Randomized Algorithms Randomized K-means Clustering Randomized Algorithms for K-means Clustering Let $X = (x_1, \ldots, x_n) \in \mathbb{R}^{n \times d}$ be the data matrix. High-dimensional data: Random sketch: $\widetilde{X} = XA \in \mathbb{R}^{n \times m}$, $m \ll d$ Approximate K-means: $\min_{C_1, \ldots, C_k}\sum_{j=1}^k\sum_{\widetilde{x}_i \in C_j}\|\widetilde{x}_i - \widetilde{\mu}_j\|_2^2$ Yang Tutorial for ACML 15 Nov. 20, / 210
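
A small numpy illustration: sketch the features with a Gaussian map and cluster the reduced data with a few plain Lloyd's iterations (written out here only to keep the example dependency-free); the synthetic clusters and all sizes are arbitrary.

```python
import numpy as np

def lloyd(X, k, iters=20, rng=None):
    """A few plain Lloyd's iterations; returns cluster assignments."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(-1)   # n x k squared distances
        assign = d2.argmin(axis=1)
        centers = np.stack([X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
                            for j in range(k)])                    # keep old center if cluster empties
    return assign

rng = np.random.default_rng(9)
n, d, m, k = 3000, 2000, 100, 5

# k well-separated clusters in d dimensions
means = 5 * rng.standard_normal((k, d))
labels = rng.integers(0, k, size=n)
X = means[labels] + rng.standard_normal((n, d))

A = rng.standard_normal((d, m)) / np.sqrt(m)      # random sketch of the features
Xs = X @ A                                        # n x m sketched data
assign = lloyd(Xs, k, rng=9)
print("sketched data shape:", Xs.shape)
```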

223 Randomized Algorithms Randomized K-means Clustering Randomized Algorithms for K-means Clustering For the random sketch: JL transforms and sparse subspace embeddings all work JL transform: $m = O\!\left(\frac{k\log(k/(\epsilon\delta))}{\epsilon^2}\right)$ Sparse subspace embedding: $m = O\!\left(\frac{k^2}{\epsilon^2\delta}\right)$ $\epsilon$ relates to the approximation accuracy The analysis of the approximation error for K-means can be formulated as Constrained Low-rank Approximation (Cohen et al., 2015): $\min_{Q^\top Q = I}\|X - QQ^\top X\|_F^2$, where $Q$ is orthonormal. Yang Tutorial for ACML 15 Nov. 20, / 210

224 Randomized Algorithms Randomized Kernel methods Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang Tutorial for ACML 15 Nov. 20, / 210

225 Randomized Algorithms Randomized Kernel methods Kernel methods Kernel function $\kappa(\cdot, \cdot)$ and a set of examples $x_1, \ldots, x_n$ Kernel matrix: $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = \kappa(x_i, x_j)$; $K$ is a PSD matrix Computational and memory costs: $\Omega(n^2)$ Approximation methods: the Nyström method, random Fourier features Yang Tutorial for ACML 15 Nov. 20, / 210

227 Randomized Algorithms Randomized Kernel methods The Nyström method Let $A \in \mathbb{R}^{n \times \ell}$ be a uniform sampling matrix. $B = KA \in \mathbb{R}^{n \times \ell}$, $C = A^\top B = A^\top K A$ The Nyström approximation (Drineas & Mahoney, 2005): $\widetilde{K} = B C^{\dagger} B^\top$ Computational cost: $O(\ell^3 + n\ell^2)$ Yang Tutorial for ACML 15 Nov. 20, / 210
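
A minimal numpy sketch of the Nyström approximation with an RBF kernel; the kernel choice, bandwidth, and $\ell$ are illustrative, and the full kernel matrix is formed only to measure the approximation error.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    """K_ij = exp(-gamma * ||x_i - z_j||^2)."""
    d2 = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(10)
n, l = 2000, 100
X = rng.standard_normal((n, 10))

idx = rng.choice(n, size=l, replace=False)       # uniform column sampling
B = rbf_kernel(X, X[idx])                        # B = K A  (n x l), never forms the full K
C = B[idx]                                       # C = A^T K A  (l x l)
K_nys = B @ np.linalg.pinv(C) @ B.T              # Nystrom approximation B C^+ B^T

K = rbf_kernel(X, X)                             # full kernel, only for the error check
print("relative error:", np.linalg.norm(K - K_nys) / np.linalg.norm(K))
```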

229 Randomized Algorithms Randomized Kernel methods The Nyström based kernel machine The dual problem: $\arg\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top B C^{\dagger} B^\top \alpha$ Solve it like a linear method: let $\widetilde{X} = B C^{-1/2} \in \mathbb{R}^{n \times \ell}$, then $\arg\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widetilde{X}\widetilde{X}^\top \alpha$ Yang Tutorial for ACML 15 Nov. 20, / 210
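
A self-contained sketch of the same idea: form the explicit Nyström feature map $\widetilde{X} = BC^{-1/2}$ via an eigendecomposition of $C$, after which any linear (dual) solver can be applied to the $n \times \ell$ features. The kernel, names, and sizes are again illustrative.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    d2 = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(11)
n, l = 2000, 100
X = rng.standard_normal((n, 10))
idx = rng.choice(n, size=l, replace=False)
B = rbf_kernel(X, X[idx])                 # K A
C = B[idx]                                # A^T K A

# X_tilde = B C^{-1/2} via the eigendecomposition of the PSD matrix C
eigval, eigvec = np.linalg.eigh(C)
keep = eigval > 1e-10                     # drop numerically zero directions
C_inv_sqrt = eigvec[:, keep] @ np.diag(eigval[keep]**-0.5) @ eigvec[:, keep].T
X_tilde = B @ C_inv_sqrt                  # n x l features; X_tilde @ X_tilde.T approximates K
print(X_tilde.shape)                      # plug into any linear (dual) solver
```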

230 Randomized Algorithms Randomized Kernel methods The Nyström based kernel machine Yang Tutorial for ACML 15 Nov. 20, / 210
