Big Data Analytics: Optimization and Randomization


1 Big Data Analytics: Optimization and Randomization. Tianbao Yang, Department of Computer Science, The University of Iowa, IA, USA. ACML 2015, Hong Kong, Nov. 20, 2015.

2 URL: tyng/acml15-tutorial.pdf

3 Some Claims: This tutorial is not an exhaustive literature survey, and it is not a survey of different machine learning algorithms. It is about how to efficiently solve machine learning problems, formulated as optimization, for big data.

4 Outline: Part I: Basics; Part II: Optimization; Part III: Randomization

5 Big Data Analytics: Optimization and Randomization. Part I: Basics

6 Basics Introduction Outline: 1 Basics: Introduction; Notations and Definitions

7 Basics Introduction Three Steps for Machine Learning: Data, Model, Optimization. [Figure: distance to optimal objective vs. iterations for different convergence rates]

8 Basics Introduction Big Data Challenge: Big Data

9 Basics Introduction Big Data Challenge: Big Model (60 million parameters)

10 Basics Introduction Learning as Optimization. Ridge Regression Problem:
$$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2}_{\text{Empirical Loss}} + \underbrace{\frac{\lambda}{2}\|w\|_2^2}_{\text{Regularization}}$$
$x_i \in \mathbb{R}^d$: $d$-dimensional feature vector; $y_i \in \mathbb{R}$: target variable; $w \in \mathbb{R}^d$: model parameters; $n$: number of data points.
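To make the notation concrete, here is a minimal NumPy sketch (not from the tutorial) that evaluates the ridge objective and its gradient on synthetic data; the data and the regularization value are assumptions for illustration.

```python
# A minimal sketch of the ridge regression objective and gradient.
import numpy as np

def ridge_objective(w, X, y, lam):
    """(1/n) * sum_i (y_i - w^T x_i)^2 + (lam/2) * ||w||_2^2"""
    n = X.shape[0]
    r = y - X @ w
    return r @ r / n + 0.5 * lam * w @ w

def ridge_gradient(w, X, y, lam):
    """Gradient of the objective above: -(2/n) * X^T (y - Xw) + lam * w"""
    n = X.shape[0]
    return -2.0 / n * X.T @ (y - X @ w) + lam * w

# Tiny usage example with random data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)
print(ridge_objective(np.zeros(5), X, y, lam=0.1))
```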

13 Basics Introduction Learning as Optimization. Classification Problems:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$$
$y_i \in \{+1, -1\}$: label. Loss function $\ell(z)$ with $z = y w^\top x$:
1. SVMs: (squared) hinge loss $\ell(z) = \max(0, 1 - z)^p$, where $p = 1, 2$
2. Logistic Regression: $\ell(z) = \log(1 + \exp(-z))$

14 Basics Introduction Learning as Optimization. Feature Selection:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda \|w\|_1$$
$\ell_1$ regularization: $\|w\|_1 = \sum_{i=1}^d |w_i|$; $\lambda$ controls the sparsity level.

15 Basics Introduction Learning as Optimization. Feature Selection using Elastic Net:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda\left(\|w\|_1 + \gamma \|w\|_2^2\right)$$
The elastic net regularizer is more robust than the $\ell_1$ regularizer.

16 Basics Introduction Learning as Optimization. Multi-class/Multi-task Learning: $W \in \mathbb{R}^{K \times d}$
$$\min_{W} \frac{1}{n}\sum_{i=1}^n \ell(W x_i, y_i) + \lambda r(W)$$
$r(W) = \|W\|_F^2 = \sum_{k=1}^K \sum_{j=1}^d W_{kj}^2$: Frobenius norm
$r(W) = \|W\|_* = \sum_i \sigma_i$: nuclear norm (sum of singular values)
$r(W) = \|W\|_{1,\infty} = \sum_{j=1}^d \|W_{:j}\|_\infty$: $\ell_{1,\infty}$ mixed norm

17 Basics Introduction Learning as Optimization. Regularized Empirical Loss Minimization:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$$
Both $\ell$ and $R$ are convex functions. Extensions to matrix cases are possible (sometimes straightforward). Extensions to kernel methods can be combined with randomized approaches. Extensions to non-convex problems (e.g., deep learning) are in progress.

18 Basics Introduction Data Matrices and Machine Learning. The instance-feature matrix: $X \in \mathbb{R}^{n \times d}$, with rows $x_1^\top, x_2^\top, \ldots, x_n^\top$.

19 Basics Introduction Data Matrices and Machine Learning. The output vector: $y = (y_1, y_2, \ldots, y_n)^\top \in \mathbb{R}^{n}$. Continuous $y_i \in \mathbb{R}$: regression (e.g., house price). Discrete, e.g., $y_i \in \{1, 2, 3\}$: classification (e.g., species of iris).

20 Basics Introduction Data Matrices and Machine Learning. The instance-instance matrix: $K \in \mathbb{R}^{n \times n}$ (similarity matrix, kernel matrix).

21 Basics Introduction Data Matrices and Machine Learning. Some machine learning tasks are formulated on the kernel matrix: clustering, kernel methods.

22 Basics Introduction Data Matrices and Machine Learning. The feature-feature matrix: $C \in \mathbb{R}^{d \times d}$ (covariance matrix, distance metric matrix).

23 Basics Introduction Data Matrices and Machine Learning. Some machine learning tasks require the covariance matrix: Principal Component Analysis, i.e., the top-$k$ singular value (eigenvalue) decomposition of the covariance matrix.

24 Basics Introduction Why is Learning from Big Data Challenging? High per-iteration cost; high memory cost; high communication cost; large iteration complexity.

25 Basics Notations and Definitions Outline: 1 Basics: Introduction; Notations and Definitions

26 Basics Notations and Definitions Norms. Vector $x \in \mathbb{R}^d$:
Euclidean vector norm: $\|x\|_2 = \sqrt{x^\top x} = \sqrt{\sum_{i=1}^d x_i^2}$
$\ell_p$-norm of a vector: $\|x\|_p = \left(\sum_{i=1}^d |x_i|^p\right)^{1/p}$, where $p \geq 1$
1. $\ell_2$ norm: $\|x\|_2 = \sqrt{\sum_{i=1}^d x_i^2}$
2. $\ell_1$ norm: $\|x\|_1 = \sum_{i=1}^d |x_i|$
3. $\ell_\infty$ norm: $\|x\|_\infty = \max_i |x_i|$

29 Basics Notations and Definitions Matrix Factorization. Matrix $X \in \mathbb{R}^{n \times d}$. Singular Value Decomposition: $X = U \Sigma V^\top$
1. $U \in \mathbb{R}^{n \times r}$: orthonormal columns ($U^\top U = I$); spans the column space
2. $\Sigma \in \mathbb{R}^{r \times r}$: diagonal matrix, $\Sigma_{ii} = \sigma_i > 0$, $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r$
3. $V \in \mathbb{R}^{d \times r}$: orthonormal columns ($V^\top V = I$); spans the row space
4. $r \leq \min(n, d)$: the largest value such that $\sigma_r > 0$, i.e., the rank of $X$
5. $U_k \Sigma_k V_k^\top$: top-$k$ approximation
Pseudo-inverse: $X^\dagger = V \Sigma^{-1} U^\top$
QR factorization: $X = QR$ ($n \geq d$); $Q \in \mathbb{R}^{n \times d}$: orthonormal columns; $R \in \mathbb{R}^{d \times d}$: upper triangular matrix

32 Basics Notations and Definitions Norms. Matrix $X \in \mathbb{R}^{n \times d}$:
Frobenius norm: $\|X\|_F = \sqrt{\mathrm{tr}(X^\top X)} = \sqrt{\sum_{i=1}^n \sum_{j=1}^d X_{ij}^2}$
Spectral (induced) norm: $\|X\|_2 = \max_{\|u\|_2 = 1} \|X u\|_2 = \sigma_1$ (maximum singular value)

34 Basics Notations and Definitions Convex Optimization: $\min_{x \in \mathcal{X}} f(x)$. $\mathcal{X}$ is a convex domain: for any $x, y \in \mathcal{X}$, their convex combination $\alpha x + (1-\alpha) y \in \mathcal{X}$. $f(x)$ is a convex function.

35 Basics Notations and Definitions Convex Function. Characterization of a convex function:
$f(\alpha x + (1-\alpha) y) \leq \alpha f(x) + (1-\alpha) f(y)$, for all $x, y \in \mathcal{X}$, $\alpha \in [0, 1]$
$f(x) \geq f(y) + \nabla f(y)^\top (x - y)$, for all $x, y \in \mathcal{X}$
A local optimum is a global optimum.

37 Basics Notations and Definitions Convex vs Strongly Convex.
Convex function: $f(x) \geq f(y) + \nabla f(y)^\top (x - y)$, for all $x, y \in \mathcal{X}$
Strongly convex function (with strong convexity constant $\lambda$): $f(x) \geq f(y) + \nabla f(y)^\top (x - y) + \frac{\lambda}{2}\|x - y\|_2^2$, for all $x, y \in \mathcal{X}$
The global optimum is unique. E.g., $\frac{\lambda}{2}\|w\|_2^2$ is $\lambda$-strongly convex.

39 Basics Notations and Definitions Non-smooth function vs Smooth function.
Non-smooth function: Lipschitz continuous with constant $G$: $|f(x) - f(y)| \leq G\|x - y\|_2$; e.g., absolute loss $f(x) = |x|$. Subgradient: $f(x) \geq f(y) + \partial f(y)^\top (x - y)$.
Smooth function: gradient Lipschitz with smoothness constant $L$: $\|\nabla f(x) - \nabla f(y)\|_2 \leq L\|x - y\|_2$; e.g., logistic loss $f(x) = \log(1 + \exp(-x))$.
[Figures: a non-smooth function with a subgradient at a kink; a smooth function bounded above by a quadratic and below by its tangent $f(y) + f'(y)(x - y)$]

42 Basics Notations and Definitions Next... Part II: Optimization
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$$
Stochastic optimization and distributed optimization. Reduce the iteration complexity by utilizing properties of the functions and the structure of the problem.

43 Basics Notations and Definitions Next... Part III: Randomization. Classification, regression; SVD, k-means, kernel methods. Reduce the data size by utilizing properties of the data. Please stay tuned!

44 Optimization Big Data Analytics: Optimization and Randomization. Part II: Optimization

45 Optimization (Sub)Gradient Methods Outline: 2 Optimization: (Sub)Gradient Methods; Stochastic Optimization Algorithms for Big Data (Stochastic Optimization, Distributed Optimization)

46 Optimization (Sub)Gradient Methods Learning as Optimization. Regularized Empirical Loss Minimization:
$$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)}_{F(w)}$$

47 Optimization (Sub)Gradient Methods Convergence Measure. Most optimization algorithms are iterative: $w_{t+1} = w_t + \Delta w_t$.
Iteration Complexity: the number of iterations $T(\epsilon)$ needed to have $F(\hat{w}_T) - \min_w F(w) \leq \epsilon$ ($\epsilon \ll 1$).
Convergence Rate: after $T$ iterations, how good is the solution: $F(\hat{w}_T) - \min_w F(w) \leq \epsilon(T)$.
Total Runtime = Per-iteration Cost × Iteration Complexity.
[Figure: objective gap decreasing to $\epsilon$ after $T$ iterations]

51 Optimization (Sub)Gradient Methods More on Convergence Measure. Big $O(\cdot)$ notation: explicit dependence on $T$ or $\epsilon$.
linear: convergence rate $O(\mu^T)$ ($\mu < 1$), iteration complexity $O(\log(1/\epsilon))$
sub-linear: convergence rate $O(1/T^\alpha)$ ($\alpha > 0$), iteration complexity $O(1/\epsilon^{1/\alpha})$
Why are we interested in bounds?
[Figure: distance to optimum vs. iterations $T$ for rates $0.5^T$ (seconds), $1/T$ (minutes), $1/\sqrt{T}$ (hours)]
Theoretically, we consider $\log(1/\epsilon) \ll 1/\epsilon \ll 1/\epsilon^2$.

57 Optimization (Sub)Gradient Methods Non-smooth vs. Smooth losses $\ell(z)$.
Smooth: squared hinge loss $\ell(w^\top x, y) = \max(0, 1 - y w^\top x)^2$; logistic loss $\ell(w^\top x, y) = \log(1 + \exp(-y w^\top x))$; square loss $\ell(w^\top x, y) = (w^\top x - y)^2$.
Non-smooth: hinge loss $\ell(w^\top x, y) = \max(0, 1 - y w^\top x)$; absolute loss $\ell(w^\top x, y) = |w^\top x - y|$.

58 Optimization (Sub)Gradient Methods Strongly convex vs. Non-strongly convex regularizers $R(w)$.
$\lambda$-strongly convex: $\ell_2$ regularizer $\frac{\lambda}{2}\|w\|_2^2$; elastic net regularizer $\tau\|w\|_1 + \frac{\lambda}{2}\|w\|_2^2$.
Non-strongly convex: unregularized problem $R(w) \equiv 0$; $\ell_1$ regularizer $\tau\|w\|_1$.

59 Optimization (Sub)Gradient Methods Gradient Method in Machine Learning.
$$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
Suppose $\ell(z, y)$ is smooth. Full gradient: $\nabla F(w) = \frac{1}{n}\sum_{i=1}^n \nabla\ell(w^\top x_i, y_i) + \lambda w$. Per-iteration cost: $O(nd)$.
Gradient Descent: $w_t = w_{t-1} - \gamma_t \nabla F(w_{t-1})$, with constant step size $\gamma_t$, e.g., $1/L$.

61 Optimization (Sub)Gradient Methods Gradient Method.
$$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \underbrace{\frac{\lambda}{2}\|w\|_2^2}_{R(w)}$$
If $\lambda = 0$: $R(w)$ is non-strongly convex; iteration complexity $O(1/\epsilon)$.
If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex; iteration complexity $O(\frac{1}{\lambda}\log(1/\epsilon))$.
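A minimal sketch of the full gradient descent update above, applied to the ridge objective; the step size $1/L$ with $L$ estimated from the data is an assumption for illustration, not the tutorial's code.

```python
# Full gradient descent on the ridge objective with constant step size 1/L.
import numpy as np

def gradient_descent(X, y, lam, T=200):
    n, d = X.shape
    w = np.zeros(d)
    # A simple smoothness bound for (1/n)||Xw - y||^2 + (lam/2)||w||^2.
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n + lam
    gamma = 1.0 / L
    for _ in range(T):
        grad = -2.0 / n * X.T @ (y - X @ w) + lam * w   # full gradient, O(nd)
        w = w - gamma * grad
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10)
w_hat = gradient_descent(X, y, lam=0.1)
```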

62 Optimization (Sub)Gradient Methods Accelerated Full Gradient (AFG). Nesterov's Accelerated Gradient Descent:
$w_t = v_{t-1} - \gamma_t \nabla F(v_{t-1})$
$v_t = w_t + \eta_t (w_t - w_{t-1})$ (momentum step)
$w_t$ is the output and $v_t$ is an auxiliary sequence.
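A minimal sketch of Nesterov's accelerated scheme above for the ridge objective; the momentum parameter $\eta_t = (t-1)/(t+2)$ is one common choice and an assumption here, not necessarily the schedule used in the tutorial.

```python
# Nesterov's accelerated gradient (AFG) on the ridge objective.
import numpy as np

def afg(X, y, lam, T=200):
    n, d = X.shape
    w_prev, w, v = np.zeros(d), np.zeros(d), np.zeros(d)
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n + lam
    gamma = 1.0 / L
    for t in range(1, T + 1):
        grad_v = -2.0 / n * X.T @ (y - X @ v) + lam * v   # gradient at v_{t-1}
        w_prev, w = w, v - gamma * grad_v                 # w_t = v_{t-1} - gamma * grad
        v = w + (t - 1) / (t + 2) * (w - w_prev)          # momentum step
    return w
```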

64 Optimization (Sub)Gradient Methods Accelerated Full Gradient (AFG).
$$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
If $\lambda = 0$: $R(w)$ is non-strongly convex; iteration complexity $O(1/\sqrt{\epsilon})$, better than $O(1/\epsilon)$.
If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex; iteration complexity $O(\frac{1}{\sqrt{\lambda}}\log(1/\epsilon))$, better than $O(\frac{1}{\lambda}\log(1/\epsilon))$ for small $\lambda$.

65 Optimization (Sub)Gradient Methods Deal with a non-smooth regularizer. Consider $\ell_1$ norm regularization:
$$\min_{w \in \mathbb{R}^d} F(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i)}_{f(w)} + \underbrace{\tau\|w\|_1}_{R(w)}$$
$f(w)$: smooth; $R(w)$: non-smooth and non-strongly convex.

66 Optimization (Sub)Gradient Methods Accelerated Proximal Gradient (APG). Accelerated Proximal Gradient Descent:
$w_t = \arg\min_{w \in \mathbb{R}^d} \nabla f(v_{t-1})^\top w + \frac{1}{2\gamma_t}\|w - v_{t-1}\|_2^2 + \tau\|w\|_1$ (proximal mapping)
$v_t = w_t + \eta_t (w_t - w_{t-1})$
The proximal mapping has a closed-form solution: soft-thresholding. The iteration complexity and runtime remain the same as for the smooth, non-strongly convex case, i.e., $O(1/\sqrt{\epsilon})$.
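A minimal sketch of the soft-thresholding proximal step and an APG loop for $\ell_1$-regularized least squares; the momentum schedule and synthetic inputs are assumptions for illustration.

```python
# Soft-thresholding prox step plus an accelerated proximal gradient loop.
import numpy as np

def soft_threshold(z, kappa):
    """Closed-form proximal mapping of kappa * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - kappa, 0.0)

def apg_lasso(X, y, tau, T=300):
    n, d = X.shape
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n   # smoothness of f(w) = (1/n)||Xw - y||^2
    gamma = 1.0 / L
    w_prev, w, v = np.zeros(d), np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        grad_v = -2.0 / n * X.T @ (y - X @ v)
        w_prev, w = w, soft_threshold(v - gamma * grad_v, gamma * tau)  # prox step
        v = w + (t - 1) / (t + 2) * (w - w_prev)                        # momentum
    return w
```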

68 Optimization (Sub)Gradient Methods Deal with a non-smooth but strongly convex regularizer. Consider elastic net regularization:
$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \underbrace{\frac{\lambda}{2}\|w\|_2^2 + \tau\|w\|_1}_{R(w)}$$
$R(w)$: non-smooth but strongly convex. Equivalently,
$$\min_{w \in \mathbb{R}^d} F(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2}_{f(w)} + \underbrace{\tau\|w\|_1}_{R'(w)}$$
$f(w)$: smooth and strongly convex; $R'(w)$: non-smooth and non-strongly convex. Iteration complexity: $O(\frac{1}{\sqrt{\lambda}}\log(1/\epsilon))$.

69 Optimization (Sub)Gradient Methods Sub-Gradient Method in Machine Learning.
$$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
Suppose $\ell(z, y)$ is non-smooth. Full sub-gradient: $\partial F(w) = \frac{1}{n}\sum_{i=1}^n \partial\ell(w^\top x_i, y_i) + \lambda w$.
Sub-Gradient Descent: $w_t = w_{t-1} - \gamma_t \partial F(w_{t-1})$, with step size $\gamma_t \to 0$.

71 Optimization (Sub)Gradient Methods Sub-Gradient Method.
$$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
If $\lambda = 0$: $R(w)$ is non-strongly convex (generalizes to the $\ell_1$ norm and other non-strongly convex regularizers); iteration complexity $O(1/\epsilon^2)$.
If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex (generalizes to elastic net and other strongly convex regularizers); iteration complexity $O(\frac{1}{\lambda\epsilon})$.
No efficient acceleration scheme in general.

72 Optimization (Sub)Gradient Methods Problem Classes and Iteration Complexity.
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$$
Iteration complexity (rows: $R(w)$; columns: loss $\ell(z, y)$):
Non-strongly convex: non-smooth $O(1/\epsilon^2)$, smooth $O(1/\sqrt{\epsilon})$
$\lambda$-strongly convex: non-smooth $O(\frac{1}{\lambda\epsilon})$, smooth $O(\frac{1}{\sqrt{\lambda}}\log(1/\epsilon))$
Per-iteration cost: $O(nd)$, too high if $n$ or $d$ is large.

73 Optimization Stochastic Optimization Algorithms for Big Data Outline: 2 Optimization: (Sub)Gradient Methods; Stochastic Optimization Algorithms for Big Data (Stochastic Optimization, Distributed Optimization)

74 Optimization Stochastic Optimization Algorithms for Big Data Stochastic First-Order Methods by Sampling.
Randomly sample an example: 1 Stochastic Gradient Descent (SGD); 2 Stochastic Variance Reduced Gradient (SVRG); 3 Stochastic Average Gradient Algorithm (SAGA); 4 Stochastic Dual Coordinate Ascent (SDCA).
Randomly sample a feature: 1 Randomized Coordinate Descent (RCD); 2 Accelerated Proximal Coordinate Gradient (APCG).

75 Optimization Stochastic Optimization Algorithms for Big Data Basic SGD (Nemirovski & Yudin (1978)).
$$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
Full sub-gradient: $\partial F(w) = \frac{1}{n}\sum_{i=1}^n \partial\ell(w^\top x_i, y_i) + \lambda w$.
Randomly sample $i \in \{1, \ldots, n\}$. Stochastic sub-gradient: $\partial\ell(w^\top x_i, y_i) + \lambda w$, with $E_i[\partial\ell(w^\top x_i, y_i) + \lambda w] = \partial F(w)$.

76 Optimization Stochastic Optimization Algorithms for Big Data Basic SGD (Nemirovski & Yudin (1978)). Applicable in all settings!
$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
sample: $i_t \in \{1, \ldots, n\}$
update: $w_t = w_{t-1} - \gamma_t \left(\partial\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}\right)$
output: $\hat{w}_T = \frac{1}{T}\sum_{t=1}^T w_t$
step size: $\gamma_t \to 0$
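A minimal sketch of the basic SGD loop above for the $\ell_2$-regularized hinge loss (SVM); the $O(1/(\lambda t))$ step size is a standard choice for the strongly convex case and is an assumption here, not the tutorial's exact schedule.

```python
# Basic SGD for the l2-regularized hinge loss with averaged iterates.
import numpy as np

def sgd_svm(X, y, lam, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        margin = y[i] * (w @ X[i])
        g = (-y[i] * X[i] if margin < 1 else np.zeros(d)) + lam * w  # stochastic subgradient
        w -= (1.0 / (lam * t)) * g
        w_avg += (w - w_avg) / t          # running average of iterates
    return w_avg
```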

78 Optimization Stochastic Optimization Algorithms for Big Data Basic SGD (Nemirovski & Yudin (1978)).
If $\lambda = 0$: $R(w)$ is non-strongly convex (generalizes to the $\ell_1$ norm and other non-strongly convex regularizers); iteration complexity $O(1/\epsilon^2)$.
If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex (generalizes to elastic net and other strongly convex regularizers); iteration complexity $O(\frac{1}{\lambda\epsilon})$.
Exactly the same as sub-gradient descent!

79 Optimization Stochastic Optimization Algorithms for Big Data Total Runtime. Per-iteration cost: $O(d)$, much lower than the full gradient method. E.g., hinge loss (SVM) stochastic gradient: $\partial\ell(w^\top x_{i_t}, y_{i_t}) = -y_{i_t} x_{i_t}$ if $1 - y_{i_t} w^\top x_{i_t} > 0$, and $0$ otherwise.

80 Optimization Stochastic Optimization Algorithms for Big Data Total Runtime. Iteration complexity of SGD:
Non-strongly convex: $O(1/\epsilon^2)$ (both non-smooth and smooth losses)
$\lambda$-strongly convex: $O(\frac{1}{\lambda\epsilon})$ (both non-smooth and smooth losses)
For SGD, only strong convexity helps; smoothness does not make any difference! The reason: the step size has to decrease because the stochastic gradient does not approach 0.

81 Optimization Stochastic Optimization Algorithms for Big Data Variance Reduction: Stochastic Variance Reduced Gradient (SVRG); Stochastic Average Gradient Algorithm (SAGA); Stochastic Dual Coordinate Ascent (SDCA).

82 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014).
$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
Applicable when $\ell(z)$ is smooth and $R(w)$ is $\lambda$-strongly convex.
Stochastic gradient: $g_{i_t}(w) = \nabla\ell(w^\top x_{i_t}, y_{i_t}) + \lambda w$; $E_{i_t}[g_{i_t}(w)] = \nabla F(w)$, but $\mathrm{Var}[g_{i_t}(w)] \not\to 0$ even if $w = w_*$.

84 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014). Compute the full gradient at a reference point $\tilde{w}$: $\nabla F(\tilde{w}) = \frac{1}{n}\sum_{i=1}^n g_i(\tilde{w})$.
Stochastic variance reduced gradient: $\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}) + \nabla F(\tilde{w})$.
$E_{i_t}[\tilde{g}_{i_t}(w)] = \nabla F(w)$ and $\mathrm{Var}[\tilde{g}_{i_t}(w)] \to 0$ as $w, \tilde{w} \to w_*$.

85 Optimization Stochastic Optimization Algorithms for Big Data Variance Reduction (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014). At the optimal solution $w_*$: $\nabla F(w_*) = 0$. This does not mean $g_{i_t}(w) \to 0$ as $w \to w_*$. However, $\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}) + \nabla F(\tilde{w}) \to 0$ as $w, \tilde{w} \to w_*$.

86 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014).
Iterate $s = 1, \ldots, T$:
  let $w_0 = \tilde{w}_s$ and compute $\nabla F(\tilde{w}_s)$
  iterate $t = 1, \ldots, m$:
    $\tilde{g}_{i_t}(w_{t-1}) = \nabla F(\tilde{w}_s) - g_{i_t}(\tilde{w}_s) + g_{i_t}(w_{t-1})$
    $w_t = w_{t-1} - \gamma_t \tilde{g}_{i_t}(w_{t-1})$
  $\tilde{w}_{s+1} = \frac{1}{m}\sum_{t=1}^m w_t$
Output: $\tilde{w}_T$. Here $m = O(1/\lambda)$ and $\gamma_t$ is constant.
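A minimal sketch of the SVRG loop above for $\ell_2$-regularized logistic regression; the constant step size and the inner-loop length $m = 2n$ are assumptions for illustration.

```python
# SVRG for l2-regularized logistic regression.
import numpy as np

def svrg_logistic(X, y, lam, gamma=0.1, m=None, S=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or 2 * n

    def grad_i(w, i):                       # gradient of one loss term + regularizer
        z = y[i] * (X[i] @ w)
        return -y[i] * X[i] / (1.0 + np.exp(z)) + lam * w

    w_tilde = np.zeros(d)
    for s in range(S):
        full_grad = np.mean([grad_i(w_tilde, i) for i in range(n)], axis=0)  # reference gradient
        w = w_tilde.copy()
        iterates = np.zeros((m, d))
        for t in range(m):
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_tilde, i) + full_grad   # variance-reduced gradient
            w -= gamma * g
            iterates[t] = w
        w_tilde = iterates.mean(axis=0)      # average of the inner iterates
    return w_tilde
```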

87 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014). Per-iteration cost: $O(d)$.
Iteration complexity: smooth loss + $\lambda$-strongly convex $R$: $O((n + \frac{1}{\lambda})\log(1/\epsilon))$; non-smooth loss: N.A.; non-strongly convex $R$: N.A. (a small trick can fix this).
Total Runtime: $O(d(n + \frac{1}{\lambda})\log(1/\epsilon))$, better than AFG's $O(\frac{nd}{\sqrt{\lambda}}\log(1/\epsilon))$.
Use the proximal mapping for the elastic net regularizer.

89 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)).
$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
A new version of SAG (Roux et al. (2012)). Applicable when $\ell(z)$ is smooth. Strong convexity is not necessary.

90 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)). SAGA also reduces the variance of the stochastic gradient, but with a different technique.
SVRG uses gradients at the same reference point $\tilde{w}$: $\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}) + \nabla F(\tilde{w})$, where $\nabla F(\tilde{w}) = \frac{1}{n}\sum_{i=1}^n g_i(\tilde{w})$.
SAGA uses gradients at different points $\{\tilde{w}_1, \tilde{w}_2, \ldots, \tilde{w}_n\}$: $\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}_{i_t}) + \bar{G}$, where $\bar{G} = \frac{1}{n}\sum_{i=1}^n g_i(\tilde{w}_i)$.

91 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)).
Initialize the average gradient: $\bar{G}_0 = \frac{1}{n}\sum_{i=1}^n \tilde{g}_i$, with $\tilde{g}_i = \nabla\ell(w_0^\top x_i, y_i) + \lambda w_0$.
Average gradient: $\bar{G}_{t-1} = \frac{1}{n}\sum_{i=1}^n \tilde{g}_i$.
Stochastic variance reduced gradient: $\tilde{g}_{i_t}(w_{t-1}) = \left(\nabla\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}\right) - \tilde{g}_{i_t} + \bar{G}_{t-1}$
$w_t = w_{t-1} - \gamma_t \tilde{g}_{i_t}(w_{t-1})$
Update the selected component of the average gradient: $\bar{G}_t = \frac{1}{n}\sum_{i=1}^n \tilde{g}_i$, with $\tilde{g}_{i_t} \leftarrow \nabla\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}$.
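A minimal sketch of the SAGA updates above for $\ell_2$-regularized logistic regression, keeping a table of stored per-example gradients and their running average; the step size is an assumption.

```python
# SAGA with a stored gradient table and an O(d) update of the average gradient.
import numpy as np

def saga_logistic(X, y, lam, gamma=0.05, T=50000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)

    def grad_i(w, i):
        z = y[i] * (X[i] @ w)
        return -y[i] * X[i] / (1.0 + np.exp(z)) + lam * w

    stored = np.array([grad_i(w, i) for i in range(n)])  # g_i at the initial point
    G = stored.mean(axis=0)                              # average gradient
    for _ in range(T):
        i = rng.integers(n)
        g_new = grad_i(w, i)
        w -= gamma * (g_new - stored[i] + G)             # variance-reduced step
        G += (g_new - stored[i]) / n                     # O(d) update of the average
        stored[i] = g_new
    return w
```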

95 Optimization Stochastic Optimization Algorithms for Big Data SAGA: efficient update of the averaged gradient. $\bar{G}_t$ and $\bar{G}_{t-1}$ differ only in $\tilde{g}_i$ for $i = i_t$. Before we update $\tilde{g}_{i_t}$, we update $\bar{G}_t = \frac{1}{n}\sum_{i=1}^n \tilde{g}_i = \bar{G}_{t-1} - \frac{1}{n}\tilde{g}_{i_t} + \frac{1}{n}\left(\nabla\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}\right)$. Computation cost: $O(d)$.

96 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)). Per-iteration cost: $O(d)$.
Iteration complexity: smooth loss + non-strongly convex $R$: $O(n/\epsilon)$; smooth loss + $\lambda$-strongly convex $R$: $O((n + \frac{1}{\lambda})\log(1/\epsilon))$; non-smooth loss: N.A.
Total Runtime (strongly convex): $O(d(n + \frac{1}{\lambda})\log(1/\epsilon))$. Same as SVRG!
Use the proximal mapping for the $\ell_1$ regularizer.

98 Optimization Stochastic Optimization Algorithms for Big Data Compare the Runtime of SGD and SVRG/SAGA.
Smooth but non-strongly convex: SGD $O(\frac{d}{\epsilon^2})$; SAGA $O(\frac{dn}{\epsilon})$.
Smooth and strongly convex: SGD $O(\frac{d}{\lambda\epsilon})$; SVRG/SAGA $O(d(n + \frac{1}{\lambda})\log(1/\epsilon))$.
For small $\epsilon$, use SVRG/SAGA; if satisfied with large $\epsilon$, use SGD.

99 Optimization Stochastic Optimization Algorithms for Big Data Conjugate Duality. Define $\ell_i(z) \triangleq \ell(z, y_i)$. Conjugate function $\ell_i^*(\alpha)$ of $\ell_i(z)$:
$$\ell_i(z) = \max_{\alpha \in \mathbb{R}} [\alpha z - \ell_i^*(\alpha)], \qquad \ell_i^*(\alpha) = \max_{z \in \mathbb{R}} [\alpha z - \ell_i(z)]$$
E.g., hinge loss $\ell_i(z) = \max(0, 1 - y_i z)$: $\ell_i^*(\alpha) = \alpha y_i$ if $-1 \leq \alpha y_i \leq 0$, and $+\infty$ otherwise.
E.g., squared hinge loss $\ell_i(z) = \max(0, 1 - y_i z)^2$: $\ell_i^*(\alpha) = \frac{\alpha^2}{4} + \alpha y_i$ if $\alpha y_i \leq 0$, and $+\infty$ otherwise.

101 Optimization Stochastic Optimization Algorithms for Big Data The Dual Problem. From the primal problem to the dual problem:
$$\min_w \frac{1}{n}\sum_{i=1}^n \ell(\underbrace{w^\top x_i}_{z}, y_i) + \frac{\lambda}{2}\|w\|_2^2 = \min_w \frac{1}{n}\sum_{i=1}^n \max_{\alpha_i \in \mathbb{R}} \left[\alpha_i (w^\top x_i) - \ell_i^*(\alpha_i)\right] + \frac{\lambda}{2}\|w\|_2^2 = \max_{\alpha \in \mathbb{R}^n} \frac{1}{n}\sum_{i=1}^n -\ell_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\right\|_2^2$$
Primal solution: $w_* = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^* x_i$.

102 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)). Stochastic Dual Coordinate Ascent (liblinear (Hsieh et al., 2008)). Applicable when $R(w)$ is $\lambda$-strongly convex; smoothness is not required. Solve the dual problem:
$$\max_{\alpha \in \mathbb{R}^n} \frac{1}{n}\sum_{i=1}^n -\ell_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\right\|_2^2$$
Sample $i_t \in \{1, \ldots, n\}$ and optimize $\alpha_{i_t}$ while fixing the other coordinates.

103 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)). Maintain a primal solution: $w_t = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^t x_i$. Optimize the increment $\Delta\alpha_{i_t}$:
$$\max_{\Delta\alpha_{i_t} \in \mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\!\left(-(\alpha_{i_t}^t + \Delta\alpha_{i_t})\right) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^t x_i + \frac{1}{\lambda n}\Delta\alpha_{i_t} x_{i_t}\right\|_2^2 = \max_{\Delta\alpha_{i_t} \in \mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\!\left(-(\alpha_{i_t}^t + \Delta\alpha_{i_t})\right) - \frac{\lambda}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\alpha_{i_t} x_{i_t}\right\|_2^2$$

104 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)). Dual coordinate updates:
$\Delta\alpha_{i_t} = \arg\max_{\Delta\alpha \in \mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\!\left(-(\alpha_{i_t}^t + \Delta\alpha)\right) - \frac{\lambda}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\alpha\, x_{i_t}\right\|_2^2$
$\alpha_{i_t}^{t+1} = \alpha_{i_t}^t + \Delta\alpha_{i_t}$
$w_{t+1} = w_t + \frac{1}{\lambda n}\Delta\alpha_{i_t} x_{i_t}$

105 Optimization Stochastic Optimization Algorithms for Big Data SDCA updates. Closed-form solution for $\Delta\alpha_i$: hinge loss, squared hinge loss, absolute loss and square loss (Shalev-Shwartz & Zhang (2013)). E.g., square loss: $\Delta\alpha_i = \frac{y_i - w_t^\top x_i - \alpha_i^t}{1 + \|x_i\|_2^2/(\lambda n)}$. Per-iteration cost: $O(d)$. Approximate solution: logistic loss (Shalev-Shwartz & Zhang (2013)).
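A minimal sketch of SDCA using the closed-form update from this slide, which matches the loss $\ell(z, y) = \frac{1}{2}(z - y)^2$ (an assumption consistent with the stated formula); data and iteration budget are illustrative.

```python
# SDCA with the closed-form coordinate update for the (1/2)(z - y)^2 loss.
import numpy as np

def sdca_square_loss(X, y, lam, T=50000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                       # w = (1/(lam*n)) * sum_i alpha_i x_i
    for _ in range(T):
        i = rng.integers(n)
        delta = (y[i] - w @ X[i] - alpha[i]) / (1.0 + X[i] @ X[i] / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)     # keep the primal solution in sync
    return w, alpha
```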

106 Optimization Stochastic Optimization Algorithms for Big Data SDCA. Iteration complexity: non-smooth loss + $\lambda$-strongly convex $R$: $O(n + \frac{1}{\lambda\epsilon})$; smooth loss + $\lambda$-strongly convex $R$: $O((n + \frac{1}{\lambda})\log(1/\epsilon))$; non-strongly convex $R$: N.A. (a small trick can fix this).
Total Runtime (smooth loss): $O(d(n + \frac{1}{\lambda})\log(1/\epsilon))$. The same as SVRG and SAGA! SDCA is also equivalent to a kind of variance reduction. A proximal variant handles the elastic net regularizer. Wang & Lin (2014) show that linear convergence is achievable for non-smooth losses.

108 Optimization Stochastic Optimization Algorithms for Big Data SDCA vs SVRG/SAGA. Advantages of SDCA: can handle non-smooth loss functions; can exploit data sparsity for efficient updates; parameter free.

109 Optimization Stochastic Optimization Algorithms for Big Data Randomized Coordinate Updates: Randomized Coordinate Descent; Accelerated Proximal Coordinate Gradient.

110 Optimization Stochastic Optimization Algorithms for Big Data Randomized Coordinate Updates.
$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$$
Suppose $d \gg n$: the per-iteration cost $O(d)$ is too high. Sample over features instead of data; the per-iteration cost becomes $O(n)$. Applicable when $\ell(z, y)$ is smooth and $R(w)$ is decomposable. Strong convexity is not necessary.

111 Optimization Stochastic Optimization Algorithms for Big Data Randomized Coordinate Descent (Nesterov (2012)).
$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{2}\|Xw - y\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$$
$X = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_d] \in \mathbb{R}^{n \times d}$ (columns are features).
Partial gradient: $\nabla_i F(w) = \bar{x}_i^\top(Xw - y) + \lambda w_i$. Randomly sample $i_t \in \{1, \ldots, d\}$.
RCD update: $w_i^t = w_i^{t-1} - \gamma_t \nabla_i F(w^{t-1})$ if $i = i_t$, and $w_i^t = w_i^{t-1}$ otherwise.
Step size $\gamma_t$: constant. $\nabla_i F(w^t)$ can be updated in $O(n)$.

113 Optimization Stochastic Optimization Algorithms for Big Data Randomized Coordinate Descent (Nesterov (2012)). Partial gradient: $\nabla_i F(w) = \bar{x}_i^\top(Xw - y) + \lambda w_i$. Randomly sample $i_t \in \{1, \ldots, d\}$ and apply the RCD update above. Maintain and update $u = Xw - y \in \mathbb{R}^n$ in $O(n)$: $u^t = u^{t-1} + \bar{x}_{i_t}(w_{i_t}^t - w_{i_t}^{t-1}) = u^{t-1} + \bar{x}_{i_t}\Delta w_{i_t}$. The partial gradient can then be computed in $O(n)$: $\nabla_i F(w^t) = \bar{x}_i^\top u^t + \lambda w_i^t$.
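A minimal sketch of RCD for the ridge objective above, maintaining the residual $u = Xw - y$ so each coordinate update costs $O(n)$; the coordinate-wise step $1/(\|\bar{x}_i\|_2^2 + \lambda)$ is a standard choice and an assumption here.

```python
# Randomized coordinate descent with residual maintenance (O(n) per update).
import numpy as np

def rcd_ridge(X, y, lam, T=20000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    u = X @ w - y                               # residual, kept up to date
    col_sq = (X ** 2).sum(axis=0)               # ||x_bar_i||^2 for coordinate-wise steps
    for _ in range(T):
        i = rng.integers(d)
        grad_i = X[:, i] @ u + lam * w[i]       # partial gradient, O(n)
        step = grad_i / (col_sq[i] + lam)       # exact minimization along coordinate i
        w[i] -= step
        u -= step * X[:, i]                     # O(n) residual update
    return w
```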

114 Optimization Stochastic Optimization Algorithms for Big Data RCD. Per-iteration cost: $O(n)$.
Iteration complexity: smooth loss + non-strongly convex $R$: $O(d/\epsilon)$; smooth loss + $\lambda$-strongly convex $R$: $O(\frac{d}{\lambda}\log(1/\epsilon))$; non-smooth loss: N.A.
Total Runtime (strongly convex): $O(\frac{nd}{\lambda}\log(1/\epsilon))$. The same as the gradient descent method! In practice, it can be much faster.

116 Optimization Stochastic Optimization Algorithms for Big Data Accelerated Proximal Coordinate Gradient (APCG).
$$\min_{w \in \mathbb{R}^d} F(w) = \underbrace{\frac{1}{2}\|Xw - y\|_2^2 + \frac{\lambda}{2}\|w\|_2^2}_{f(w)} + \tau\|w\|_1$$
Uses acceleration and the proximal mapping. APCG (Lin et al., 2014):
$w_i^t = \arg\min_{w_i \in \mathbb{R}} \nabla_i f(v^{t-1}) w_i + \frac{1}{2\gamma_t}(w_i - v_i^{t-1})^2 + \tau|w_i|$ if $i = i_t$, and $w_i^t = w_i^{t-1}$ otherwise.
$v^t = w^t + \eta_t(w^t - w^{t-1})$

117 Optimization Stochastic Optimization Algorithms for Big Data APCG. Per-iteration cost: $O(n)$.
Iteration complexity: smooth loss + non-strongly convex $R$: $O(d/\sqrt{\epsilon})$; smooth loss + $\lambda$-strongly convex $R$: $O(\frac{d}{\sqrt{\lambda}}\log(1/\epsilon))$; non-smooth loss: N.A.
Total Runtime (strongly convex): $O(\frac{nd}{\sqrt{\lambda}}\log(1/\epsilon))$. The same as APG! In practice, it can be much faster.

118 Optimization Stochastic Optimization Algorithms for Big Data APCG applied to the Dual. Recall the acceleration scheme for the full gradient method: an auxiliary sequence ($\beta^t$) and a momentum step. Maintain a primal solution: $w_t = \frac{1}{\lambda n}\sum_{i=1}^n \beta_i^t x_i$.
Dual coordinate update: sample $i_t \in \{1, \ldots, n\}$ and compute
$$\Delta\beta_{i_t} = \arg\max_{\Delta\beta \in \mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\!\left(-(\beta_{i_t}^t + \Delta\beta)\right) - \frac{\lambda}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\beta\, x_{i_t}\right\|_2^2$$
Momentum step: $\alpha_{i_t}^{t+1} = \beta_{i_t}^t + \Delta\beta_{i_t}$, $\beta^{t+1} = \alpha^{t+1} + \eta_t(\alpha^{t+1} - \alpha^t)$.

120 Optimization Stochastic Optimization Algorithms for Big Data APCG applied to the Dual. Per-iteration cost: $O(d)$.
Iteration complexity: non-smooth loss + $\lambda$-strongly convex $R$: $O(n + \sqrt{\frac{n}{\lambda\epsilon}})$; smooth loss + $\lambda$-strongly convex $R$: $O((n + \sqrt{\frac{n}{\lambda}})\log(1/\epsilon))$; non-strongly convex $R$: N.A. (a small trick can fix this).
Total Runtime (smooth): $O(d(n + \sqrt{\frac{n}{\lambda}})\log(1/\epsilon))$; could be faster than SDCA's $O(d(n + \frac{1}{\lambda})\log(1/\epsilon))$ when $\lambda \leq \frac{1}{n}$.

121 Optimization Stochastic Optimization Algorithms for Big Data APCG vs. SDCA. [Figure: experimental comparison from Lin et al. (2014)]

122 Optimization Stochastic Optimization Algorithms for Big Data Summary (rows: regularizer $R(w)$; columns: loss $\ell(z, y)$):
Non-strongly convex $R$, non-smooth $\ell$: SGD
Non-strongly convex $R$, smooth $\ell$: RCD, APCG, SAGA
Strongly convex $R$, non-smooth $\ell$: SDCA, APCG
Strongly convex $R$, smooth $\ell$: RCD, APCG, SDCA, SVRG, SAGA
(In the slides, red = stochastic gradient, primal; blue = randomized coordinate, primal; green = stochastic coordinate, dual.)

123 Optimization Stochastic Optimization Algorithms for Big Data Summary.
Parameters: SGD: $\gamma_t$; SVRG: $\gamma_t, m$; SAGA: $\gamma_t$; SDCA: none; APCG: $\eta_t$.
[Table: for each of SGD, SVRG, SAGA, SDCA and APCG, the slides also mark applicability to non-smooth losses, smooth losses, strongly convex and non-strongly convex regularizers, and primal vs. dual formulations.]

127 Optimization Stochastic Optimization Algorithms for Big Data Trick for generalizing to a non-strongly convex regularizer (Shalev-Shwartz & Zhang, 2012).
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \tau\|w\|_1$$
Issue: not strongly convex. Solution: add $\ell_2^2$ regularization:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \tau\|w\|_1 + \frac{\lambda}{2}\|w\|_2^2$$
If $\|w_*\|_2 \leq B$, we can set $\lambda = \frac{\epsilon}{B^2}$. An $\epsilon/2$-suboptimal solution for the new problem is $\epsilon$-suboptimal for the original problem.

129 Optimization Stochastic Optimization Algorithms for Big Data Outline: 2 Optimization: (Sub)Gradient Methods; Stochastic Optimization Algorithms for Big Data (Stochastic Optimization, Distributed Optimization)

130 Optimization Stochastic Optimization Algorithms for Big Data Big Data and Distributed Optimization. Distributed optimization: data are distributed over a cluster of multiple machines; moving all data to a single machine suffers from low network bandwidth and limited disk or memory. Communication vs. computation: RAM access ≈ 100 nanoseconds; a standard network connection ≈ 250,000 nanoseconds.

131 Optimization Stochastic Optimization Algorithms for Big Data Distributed Data. $n$ data points are partitioned and distributed to $K$ machines: $[x_1, x_2, \ldots, x_n] = S_1 \cup S_2 \cup \cdots \cup S_K$. Machine $j$ only has access to $S_j$. W.l.o.g., $|S_j| = n_k = n/K$.

132 Optimization Stochastic Optimization Algorithms for Big Data A simple solution: Average Solutions.
Global problem: $w_* = \arg\min_{w \in \mathbb{R}^d} \left\{F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)\right\}$.
Machine $j$ solves a local problem: $w_j = \arg\min_{w \in \mathbb{R}^d} f_j(w) = \frac{1}{n_k}\sum_{i \in S_j} \ell(w^\top x_i, y_i) + R(w)$.
The center computes $\hat{w} = \frac{1}{K}\sum_{j=1}^K w_j$. Issue: this will not converge to $w_*$.

134 Optimization Stochastic Optimization Algorithms for Big Data Total Runtime. Single machine: Total Runtime = Per-iteration Cost × Iteration Complexity. Distributed optimization: Total Runtime = (Communication Time per Round + Local Runtime per Round) × Rounds of Communication. Trading computation for communication: increase the local computation to reduce the rounds of communication, and balance computation against communication.

135 Optimization Stochastic Optimization Algorithms for Big Data Distributed SDCA (DisDCA) (Yang, 2013), CoCoA+ (Ma et al., 2015). Applicable when $R(w)$ is strongly convex, e.g., $R(w) = \frac{\lambda}{2}\|w\|_2^2$. Global dual problem:
$$\max_{\alpha \in \mathbb{R}^n} \frac{1}{n}\sum_{i=1}^n -\ell_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\right\|_2^2$$
Incremental variables $\Delta\alpha_i$:
$$\max_{\Delta\alpha} \frac{1}{n}\sum_i -\ell_i^*\!\left(-(\alpha_i^t + \Delta\alpha_i)\right) - \frac{\lambda}{2}\left\|w_t + \frac{1}{\lambda n}\sum_{i=1}^n \Delta\alpha_i x_i\right\|_2^2$$
Primal solution: $w_t = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^t x_i$.

137 Optimization Stochastic Optimization Algorithms for Big Data DisDCA: Trading Computation for Communication.
$$\Delta\alpha_{i_j} = \arg\max_{\Delta\alpha} -\ell_{i_j}^*\!\left(-(\alpha_{i_j}^t + \Delta\alpha)\right) - \frac{\lambda n}{2K}\left\|u_j^t + \frac{K}{\lambda n}\Delta\alpha\, x_{i_j}\right\|_2^2$$
$$u_j^{t+1} = u_j^t + \frac{K}{\lambda n}\Delta\alpha_{i_j} x_{i_j}$$

138 Optimization Stochastic Optimization Algorithms for Big Data DisDCA: Trading Computation for Communication. [Figures: illustration of local dual updates on each machine and the communication rounds between machines and the center]

140 Optimization Stochastic Optimization Algorithms for Big Data CoCoA+ (Ma et al., 2015). Machine $j$ approximately solves
$$\max_{\Delta\alpha_{S_j}} \frac{1}{n}\sum_{i \in S_j} -\ell_i^*\!\left(-(\alpha_i^t + \Delta\alpha_i)\right) - \frac{1}{n}\left\langle w_t, \sum_{i \in S_j}\Delta\alpha_i x_i\right\rangle - \frac{K}{2\lambda n^2}\left\|\sum_{i \in S_j}\Delta\alpha_i x_i\right\|_2^2$$
$\alpha_{S_j}^{t+1} = \alpha_{S_j}^t + \Delta\alpha_{S_j}^t$, $\Delta w_j^t = \frac{1}{\lambda n}\sum_{i \in S_j}\Delta\alpha_i^t x_i$. The center computes $w_{t+1} = w_t + \sum_{j=1}^K \Delta w_j^t$.

141 Optimization Stochastic Optimization Algorithms for Big Data CoCoA+ (Ma et al., 2015). Local objective value:
$$G_j(\Delta\alpha_{S_j}, w^t) = \frac{1}{n}\sum_{i \in S_j} -\ell_i^*\!\left(-(\alpha_i^t + \Delta\alpha_i)\right) - \frac{1}{n}\left\langle w^t, \sum_{i \in S_j}\Delta\alpha_i x_i\right\rangle - \frac{K}{2\lambda n^2}\left\|\sum_{i \in S_j}\Delta\alpha_i x_i\right\|_2^2$$
Solve $\Delta\alpha_{S_j}^t$ by any local solver as long as
$$\max_{\Delta\alpha_{S_j}} G_j(\Delta\alpha_{S_j}, w^t) - G_j(\Delta\alpha_{S_j}^t, w^t) \leq \Theta\left(\max_{\Delta\alpha_{S_j}} G_j(\Delta\alpha_{S_j}, w^t) - G_j(0, w^t)\right), \quad 0 < \Theta < 1$$
CoCoA+ is equivalent to DisDCA when employing SDCA to solve the local problems with $m$ iterations.

142 Optimization Stochastic Optimization Algorithms for Big Data Distributed SDCA in Practice. Choice of $m$ (the number of inner iterations): the larger $m$, the higher the local computation cost and the lower the communication cost. Choice of $K$ (the number of machines): the larger $K$, the lower the local computation cost and the higher the communication cost.

143 Optimization Stochastic Optimization Algorithms for Big Data DisDCA is implemented: tyng/software.html. Classification and regression losses: 1. hinge loss and squared hinge loss (SVM); 2. logistic loss (logistic regression); 3. square loss (ridge regression / LASSO). Regularizers: 1. $\ell_2$ norm; 2. mixture of $\ell_1$ norm and $\ell_2$ norm. Multi-class: one-vs-all.

144 Optimization Stochastic Optimization Algorithms for Big Data Alternating Direction Method of Multipliers (ADMM).
$$\min_{w \in \mathbb{R}^d} F(w) = \sum_{k=1}^K \underbrace{\frac{1}{n}\sum_{i \in S_k} \ell(w^\top x_i, y_i)}_{f_k(w)} + \frac{\lambda}{2}\|w\|_2^2$$
Each $f_k(w)$ resides on an individual machine, but the variables $w$ are coupled together. Reformulate:
$$\min_{w_1, \ldots, w_K, w \in \mathbb{R}^d} F(w) = \sum_{k=1}^K f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 \quad \text{s.t. } w_k = w,\ k = 1, \ldots, K$$

145 Optimization Stochastic Optimization Algorithms for Big Data The Augmented Lagrangian Function.
$$\min_{w_1, \ldots, w_K, w \in \mathbb{R}^d} \sum_{k=1}^K f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 \quad \text{s.t. } w_k = w,\ k = 1, \ldots, K$$
The augmented Lagrangian function (with Lagrangian multipliers $z_k$)
$$L(\{w_k\}, \{z_k\}, w) = \sum_{k=1}^K f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 + \sum_{k=1}^K z_k^\top(w_k - w) + \frac{\rho}{2}\sum_{k=1}^K \|w_k - w\|_2^2$$
is the Lagrangian function of
$$\min_{w_1, \ldots, w_K, w \in \mathbb{R}^d} \sum_{k=1}^K f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 + \frac{\rho}{2}\sum_{k=1}^K \|w_k - w\|_2^2 \quad \text{s.t. } w_k = w,\ k = 1, \ldots, K$$

148 Optimization Stochastic Optimization Algorithms for Big Data ADMM.
$$L(\{w_k\}, \{z_k\}, w) = \sum_{k=1}^K f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 + \sum_{k=1}^K z_k^\top(w_k - w) + \frac{\rho}{2}\sum_{k=1}^K \|w_k - w\|_2^2$$
Update from $(w_k^t, z_k^t, w^t)$ to $(w_k^{t+1}, z_k^{t+1}, w^{t+1})$:
Optimize on individual machines: $w_k^{t+1} = \arg\min_{w_k} f_k(w_k) + (z_k^t)^\top(w_k - w^t) + \frac{\rho}{2}\|w_k - w^t\|_2^2$, $k = 1, \ldots, K$.
Aggregate and update on one machine: $w^{t+1} = \arg\min_w \frac{\lambda}{2}\|w\|_2^2 - \sum_{k=1}^K (z_k^t)^\top w + \frac{\rho}{2}\sum_{k=1}^K \|w_k^{t+1} - w\|_2^2$.
Update on individual machines: $z_k^{t+1} = z_k^t + \rho(w_k^{t+1} - w^{t+1})$.

152 Optimization Stochastic Optimization Algorithms for Big Data ADMM. $w_k^{t+1} = \arg\min_{w_k} f_k(w_k) + (z_k^t)^\top(w_k - w^t) + \frac{\rho}{2}\|w_k - w^t\|_2^2$, $k = 1, \ldots, K$. Each local problem can be solved by a local solver (e.g., SDCA). The optimization can be inexact (trading computation for communication).
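A minimal sketch of the consensus ADMM iteration above for distributed ridge regression with $K$ data blocks; each local subproblem is solved exactly in closed form here, which is an assumption (the tutorial allows inexact local solvers such as SDCA), and $\rho$ is an illustrative choice.

```python
# Consensus ADMM for distributed ridge regression.
import numpy as np

def admm_ridge(blocks, lam, rho=1.0, T=100):
    """blocks: list of (X_k, y_k) pairs, one per machine."""
    K = len(blocks)
    d = blocks[0][0].shape[1]
    n = sum(Xk.shape[0] for Xk, _ in blocks)
    w = np.zeros(d)
    W = np.zeros((K, d))                        # local variables w_k
    Z = np.zeros((K, d))                        # dual variables z_k
    for _ in range(T):
        for k, (Xk, yk) in enumerate(blocks):   # local updates (in parallel in practice)
            # w_k = argmin (1/n)||X_k w - y_k||^2 + z_k^T (w - w_bar) + (rho/2)||w - w_bar||^2
            A = 2.0 / n * Xk.T @ Xk + rho * np.eye(d)
            b = 2.0 / n * Xk.T @ yk - Z[k] + rho * w
            W[k] = np.linalg.solve(A, b)
        # global update: w = argmin (lam/2)||w||^2 - sum_k z_k^T w + (rho/2) sum_k ||w_k - w||^2
        w = (Z.sum(axis=0) + rho * W.sum(axis=0)) / (lam + rho * K)
        Z += rho * (W - w)                      # dual updates
    return w
```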

153 Optimization Stochastic Optimization Algorithms for Big Data Complexity of ADMM. Assume the local problems are solved exactly. Communication complexity: $O(\log(1/\epsilon))$, due to the strong convexity of $R(w)$. Also applicable to a non-strongly convex regularizer $R(w) = \|w\|_1$:
$$\min_{w \in \mathbb{R}^d} F(w) = \sum_{k=1}^K \frac{1}{n}\sum_{i \in S_k}\ell(w^\top x_i, y_i) + \tau\|w\|_1$$
Communication complexity: $O(1/\epsilon)$.

154 Optimization Stochastic Optimization Algorithms for Big Data Thank You! Questions?

155 Optimization Stochastic Optimization Algorithms for Big Data Research Assistant Positions Available for PhD Candidates! Start Fall '16. Topics: Optimization and Randomization, Online Learning, Deep Learning, Machine Learning. Send to:

156 Randomized Dimension Reduction Big Data Analytics: Optimization and Randomization. Part III: Randomization

157 Randomized Dimension Reduction Outline: 1 Basics; 2 Optimization; 3 Randomized Dimension Reduction; 4 Randomized Algorithms; 5 Concluding Remarks

158 Randomized Dimension Reduction Random Sketch: approximate a large data matrix by a much smaller sketch.

159 Randomized Dimension Reduction The Framework of Randomized Algorithms. [Figures: large data matrix → random sketch → approximate solution]

163 Randomized Dimension Reduction Why randomized dimension reduction? Efficient; robust (e.g., dropout); formal guarantees; can exploit parallel algorithms.

164 Randomized Dimension Reduction Randomized Dimension Reduction: Johnson-Lindenstrauss (JL) transforms; subspace embeddings; column sampling.

165 Randomized Dimension Reduction JL Lemma (Johnson & Lindenstrauss, 1984). For any $0 < \epsilon, \delta < 1/2$, there exists a probability distribution on $m \times d$ real matrices $A$ such that there exists a small universal constant $c > 0$ and, for any fixed $x \in \mathbb{R}^d$, with probability at least $1 - \delta$,
$$\left|\|Ax\|_2^2 - \|x\|_2^2\right| \leq c\sqrt{\frac{\log(1/\delta)}{m}}\|x\|_2^2$$
or, for $m = \Theta(\epsilon^{-2}\log(1/\delta))$, with probability at least $1 - \delta$,
$$\left|\|Ax\|_2^2 - \|x\|_2^2\right| \leq \epsilon\|x\|_2^2$$

166 Randomized Dimension Reduction Embedding a set of points into a low dimensional space. Given a set of points $x_1, \ldots, x_n \in \mathbb{R}^d$, we can embed them into a low dimensional space as $Ax_1, \ldots, Ax_n \in \mathbb{R}^m$ such that the pairwise distance between any two points is well preserved:
$$\|Ax_i - Ax_j\|_2^2 = \|A(x_i - x_j)\|_2^2 \leq (1 + \epsilon)\|x_i - x_j\|_2^2, \qquad \|Ax_i - Ax_j\|_2^2 = \|A(x_i - x_j)\|_2^2 \geq (1 - \epsilon)\|x_i - x_j\|_2^2$$
In other words, to preserve all pairwise Euclidean distances up to $1 \pm \epsilon$, only $m = \Theta(\epsilon^{-2}\log(n^2/\delta))$ dimensions are necessary.

167 Randomized Dimension Reduction JL transforms: Gaussian Random Projection (Dasgupta & Gupta, 2003). $A \in \mathbb{R}^{m \times d}$ with $A_{ij} \sim \mathcal{N}(0, 1/m)$ and $m = \Theta(\epsilon^{-2}\log(1/\delta))$. Computational cost of $AX$, where $X \in \mathbb{R}^{d \times n}$: $mnd$ for dense matrices, $\mathrm{nnz}(X)\,m$ for sparse matrices. The computational cost is very high (it could be as high as solving many problems).
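A minimal sketch of the Gaussian random projection above; the dimensions and data are illustrative, and the printed ratios simply check that squared norms concentrate around 1.

```python
# Gaussian JL projection: A has i.i.d. N(0, 1/m) entries.
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 10000, 50, 400
X = rng.standard_normal((d, n))                 # data: columns are points in R^d
A = rng.standard_normal((m, d)) / np.sqrt(m)    # A_ij ~ N(0, 1/m)
AX = A @ X                                      # sketch: costs m*n*d for dense X

ratios = (AX ** 2).sum(axis=0) / (X ** 2).sum(axis=0)
print(ratios.min(), ratios.max())               # concentrated around 1
```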

168 Randomized Dimension Reduction Accelerate JL transforms: using discrete distributions (Achlioptas, 2003). $\Pr(A_{ij} = \pm\frac{1}{\sqrt{m}}) = 0.5$, or $\Pr(A_{ij} = \pm\sqrt{\frac{3}{m}}) = \frac{1}{6}$, $\Pr(A_{ij} = 0) = \frac{2}{3}$. Database friendly: replaces multiplications by additions and subtractions.

169 Randomized Dimension Reduction Accelerate JL transforms: using the Hadamard transform (I). Fast JL transform based on the randomized Hadamard transform. Motivation: can we simply use a random sampling matrix $P \in \mathbb{R}^{m \times d}$ that randomly selects $m$ coordinates out of $d$ coordinates (scaled by $\sqrt{d/m}$)? Unfortunately, by the Chernoff bound,
$$\left|\|Px\|_2^2 - \|x\|_2^2\right| \leq \frac{d\|x\|_\infty^2}{\|x\|_2^2}\sqrt{\frac{3\log(2/\delta)}{m}}\|x\|_2^2$$
Unless $\frac{d\|x\|_\infty^2}{\|x\|_2^2} \leq c$, random sampling does not work. The remedy is given by the randomized Hadamard transform.

172 Randomized Dimension Reduction Randomized Hadamard transform Hadamard transform $H \in \mathbb{R}^{d \times d}$: $H = \frac{1}{\sqrt{d}} H_{2^k}$, where $H_1 = [1]$, $H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$, $H_{2^k} = \begin{bmatrix} H_{2^{k-1}} & H_{2^{k-1}} \\ H_{2^{k-1}} & -H_{2^{k-1}} \end{bmatrix}$ $\|Hx\|_2 = \|x\|_2$ and $H$ is orthogonal Computational cost of $Hx$: $d\log(d)$ Randomized Hadamard transform: $HD$, where $D \in \mathbb{R}^{d \times d}$ is a diagonal matrix with $\Pr(D_{ii} = \pm 1) = 0.5$ $HD$ is orthogonal and $\|HDx\|_2 = \|x\|_2$ Key property: $\frac{\sqrt{d}\,\|HDx\|_\infty}{\|HDx\|_2} \lesssim \sqrt{\log(d/\delta)}$ w.h.p. $1 - \delta$ Yang Tutorial for ACML 15 Nov. 20, / 210

175 Randomized Dimension Reduction Accelerate JL transforms: using Hadamard transform (I) Fast JL transform based on the randomized Hadamard transform (Tropp, 2011): $A = \sqrt{\tfrac{d}{m}}\,PHD$ yields $\big|\|Ax\|_2^2 - \|x\|_2^2\big| \le \sqrt{\tfrac{\log(2/\delta)\log(d/\delta)}{m}}\,\|x\|_2^2$ $m = \Theta(\epsilon^{-2}\log(1/\delta)\log(d/\delta))$ suffices for $1 \pm \epsilon$; the additional factor $\log(d/\delta)$ can be removed Computational cost of $AX$: $O(nd\log(m))$ Yang Tutorial for ACML 15 Nov. 20, / 210
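
A compact numpy sketch of this subsampled randomized Hadamard construction, assuming $d$ is a power of two; the recursive fast Walsh-Hadamard routine and all names are illustrative, meant only to show the $O(d\log d)$ structure.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform of a length-2^k vector (unnormalized)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

def srht(x, m, rng=None):
    """A x with A = sqrt(d/m) * P * H * D: random signs, Hadamard, then sample m coordinates."""
    rng = np.random.default_rng(rng)
    d = len(x)                                   # assumes d is a power of two
    signs = rng.choice([-1.0, 1.0], size=d)      # diagonal of D
    y = fwht(signs * x) / np.sqrt(d)             # H D x with H = H_{2^k} / sqrt(d)
    idx = rng.choice(d, size=m, replace=False)   # P: sampling without replacement
    return np.sqrt(d / m) * y[idx]

x = np.random.default_rng(2).standard_normal(1024)
z = srht(x, m=256, rng=2)
print((z**2).sum() / (x**2).sum())               # close to 1
```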

176 Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (I) Random hashing (Dasgupta et al., 2010): $A = HD$, where $D \in \mathbb{R}^{d \times d}$ and $H \in \mathbb{R}^{m \times d}$ Random hashing: $h(j): \{1, \ldots, d\} \rightarrow \{1, \ldots, m\}$, $H_{ij} = 1$ if $h(j) = i$: a sparse matrix (each column has only one non-zero entry) $D \in \mathbb{R}^{d \times d}$: a diagonal matrix with $\Pr(D_{ii} = \pm 1) = 0.5$ $[Ax]_j = \sum_{i: h(i) = j} D_{ii} x_i$ Technically speaking, random hashing does not satisfy the JL lemma Yang Tutorial for ACML 15 Nov. 20, / 210
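
A minimal numpy sketch of random hashing (essentially a count-sketch / feature-hashing map); the random bucket assignments stand in for the hash function $h$, and the sizes are arbitrary.

```python
import numpy as np

def random_hashing(x, m, rng=None):
    """[Ax]_j = sum over i with h(i)=j of D_ii * x_i, with random signs D and random buckets h."""
    rng = np.random.default_rng(rng)
    d = len(x)
    h = rng.integers(0, m, size=d)               # bucket of each coordinate
    signs = rng.choice([-1.0, 1.0], size=d)      # diagonal of D
    out = np.zeros(m)
    np.add.at(out, h, signs * x)                 # scatter-add: cost O(nnz(x))
    return out

x = np.random.default_rng(3).standard_normal(10_000)
print((random_hashing(x, 2_000, rng=3)**2).sum() / (x**2).sum())  # norms preserved in expectation
```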

178 Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (I) Key properties: $\mathrm{E}[\langle HDx_1, HDx_2 \rangle] = \langle x_1, x_2 \rangle$ and norm preserving: $\big|\|HDx\|_2^2 - \|x\|_2^2\big| \le \epsilon\|x\|_2^2$ only when $\frac{\|x\|_\infty}{\|x\|_2} \le \frac{1}{\sqrt{c}}$ Apply the randomized Hadamard transform $P$ first: $\Theta(c\log(c/\delta))$ blocks of the randomized Hadamard transform give $\frac{\|Px\|_\infty}{\|Px\|_2} \le \frac{1}{\sqrt{c}}$ Yang Tutorial for ACML 15 Nov. 20, / 210

179 Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (II) Sparse JL transform based on block random hashing (Kane & Nelson, 2014): $A = \frac{1}{\sqrt{s}}\begin{bmatrix} Q_1 \\ \vdots \\ Q_s \end{bmatrix}$ Each $Q_i \in \mathbb{R}^{v \times d}$ is an independent random hashing ($HD$) matrix Set $v = \Theta(\epsilon^{-1})$ and $s = \Theta(\epsilon^{-1}\log(1/\delta))$ Computational cost of $AX$: $O\!\left(\frac{\mathrm{nnz}(X)}{\epsilon}\log\!\left[\frac{1}{\delta}\right]\right)$ Yang Tutorial for ACML 15 Nov. 20, / 210

180 Randomized Dimension Reduction Randomized Dimension Reduction Johnson-Lindenstrauss (JL) transforms Subspace embeddings Column sampling Yang Tutorial for ACML 15 Nov. 20, / 210

181 Randomized Dimension Reduction Subspace Embeddings Definition: a subspace embedding, given some parameters $0 < \epsilon, \delta < 1$ and $k \le d$, is a distribution $\mathcal{D}$ over matrices $A \in \mathbb{R}^{m \times d}$ such that for any fixed linear subspace $W \subseteq \mathbb{R}^d$ with $\dim(W) = k$ it holds that $\Pr_{A \sim \mathcal{D}}\big(\forall x \in W,\ \|Ax\|_2 \in (1 \pm \epsilon)\|x\|_2\big) \ge 1 - \delta$ It implies: if $U \in \mathbb{R}^{d \times k}$ is an orthogonal matrix (contains the orthonormal bases of $W$), then $AU \in \mathbb{R}^{m \times k}$ is of full column rank, the singular values of $AU$ lie in $(1 \pm \epsilon)$, and $(1-\epsilon)^2 I \preceq U^\top A^\top A U \preceq (1+\epsilon)^2 I$ These are key properties in the theoretical analysis of many algorithms (e.g., low-rank matrix approximation, randomized least-squares regression, randomized classification) Yang Tutorial for ACML 15 Nov. 20, / 210

184 Randomized Dimension Reduction Subspace Embeddings From a JL transform to a Subspace Embedding (Sarlós, 2006). Let $A \in \mathbb{R}^{m \times d}$ be a JL transform. If $m = O\!\left(\frac{k\log\frac{k}{\delta\epsilon}}{\epsilon^2}\right)$, then w.h.p. $1 - \delta$, $A \in \mathbb{R}^{m \times d}$ is a subspace embedding w.r.t. a $k$-dimensional subspace of $\mathbb{R}^d$ Yang Tutorial for ACML 15 Nov. 20, / 210
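
A small numerical illustration (not from the tutorial) of the implied property: for a random $k$-dimensional subspace with orthonormal basis $U$ and a Gaussian JL matrix $A$, the singular values of $AU$ concentrate around 1. All sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, m = 2000, 10, 400

# orthonormal basis U of a random k-dimensional subspace of R^d
U, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Gaussian JL matrix A in R^{m x d}
A = rng.standard_normal((m, d)) / np.sqrt(m)

sv = np.linalg.svd(A @ U, compute_uv=False)
print("singular values of AU in [%.3f, %.3f]" % (sv.min(), sv.max()))  # concentrated around 1
```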

185 Randomized Dimension Reduction Subspace Embeddings Making block random hashing a Subspace Embedding (Nelson & Nguyen, 2013). $A = \frac{1}{\sqrt{s}}\begin{bmatrix} Q_1 \\ \vdots \\ Q_s \end{bmatrix}$ Each $Q_i \in \mathbb{R}^{v \times d}$ is an independent random hashing ($HD$) matrix Set $v = \Theta(k\epsilon^{-1}\log^5(k/\delta))$ and $s = \Theta(\epsilon^{-1}\log^3(k/\delta))$ W.h.p. $1 - \delta$, $A \in \mathbb{R}^{m \times d}$ with $m = \Theta\!\left(\frac{k\log^8(k/\delta)}{\epsilon^2}\right)$ is a subspace embedding w.r.t. a $k$-dimensional subspace of $\mathbb{R}^d$ Computational cost of $AX$: $O\!\left(\frac{\mathrm{nnz}(X)}{\epsilon}\log^3\!\left[\frac{k}{\delta}\right]\right)$ Yang Tutorial for ACML 15 Nov. 20, / 210

186 Randomized Dimension Reduction Sparse Subspace Embedding (SSE) Random hashing is an SSE with a constant probability (Nelson & Nguyen, 2013): $A = HD$, where $D \in \mathbb{R}^{d \times d}$ and $H \in \mathbb{R}^{m \times d}$ $m = \Omega(k^2/\epsilon^2)$ suffices for a subspace embedding with probability $2/3$ Computational cost of $AX$: $O(\mathrm{nnz}(X))$ Yang Tutorial for ACML 15 Nov. 20, / 210

187 Randomized Dimension Reduction Randomized Dimensionality Reduction Johnson-Lindenstrauss (JL) transforms Subspace embeddings Column (Row) sampling Yang Tutorial for ACML 15 Nov. 20, / 210

188 Randomized Dimension Reduction Column sampling Column subset selection (feature selection) More interpretable Uniform sampling usually does not work (not a JL transform) Non-oblivious sampling (data-dependent sampling) leverage-score sampling Yang Tutorial for ACML 15 Nov. 20, / 210

191 Randomized Dimension Reduction Leverage-score sampling (Drineas et al., 2006) Let $X \in \mathbb{R}^{d \times n}$ be a rank-$k$ matrix, $X = U\Sigma V^\top$: $U \in \mathbb{R}^{d \times k}$, $\Sigma \in \mathbb{R}^{k \times k}$ Leverage scores: $\|U_i\|_2^2$, $i = 1, \ldots, d$ Let $p_i = \frac{\|U_i\|_2^2}{\sum_{i=1}^d \|U_i\|_2^2}$, $i = 1, \ldots, d$ Let $i_1, \ldots, i_m \in \{1, \ldots, d\}$ denote $m$ indices selected by sampling according to $p_i$ Let $A \in \mathbb{R}^{m \times d}$ be the sampling-and-rescaling matrix: $A_{tj} = \frac{1}{\sqrt{m p_{i_t}}}$ if $j = i_t$ and $0$ otherwise $AX \in \mathbb{R}^{m \times n}$ is a small sketch of $X$ Yang Tutorial for ACML 15 Nov. 20, / 210
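
A minimal numpy sketch of leverage-score sampling as defined above; $k$, $m$, and the synthetic matrix are illustrative, and the snippet also reports the coherence measure $\mu_k$ used on a later slide.

```python
import numpy as np

def leverage_score_sketch(X, k, m, rng=None):
    """Sample m rows of X (d x n) with probabilities proportional to rank-k leverage scores."""
    rng = np.random.default_rng(rng)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U = U[:, :k]                                  # top-k left singular vectors
    lev = (U**2).sum(axis=1)                      # leverage scores ||U_i||_2^2 (sum to k)
    p = lev / lev.sum()
    idx = rng.choice(X.shape[0], size=m, p=p)     # sample with replacement following p
    scale = 1.0 / np.sqrt(m * p[idx])             # sampling-and-rescaling
    coherence = X.shape[0] / k * lev.max()        # mu_k = (d/k) max_i ||U_i||_2^2
    return scale[:, None] * X[idx], coherence

rng = np.random.default_rng(5)
X = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 300))   # rank-20 matrix
sketch, mu = leverage_score_sketch(X, k=20, m=120, rng=5)
print(sketch.shape, "coherence:", round(mu, 2))
```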

196 Randomized Dimension Reduction Properties of Leverage-score sampling When $m = \Theta\!\left(\frac{k}{\epsilon^2}\log\!\left[\frac{2k}{\delta}\right]\right)$, w.h.p. $1 - \delta$, $AU \in \mathbb{R}^{m \times k}$ is of full column rank and $(1-\epsilon) \le \sigma_i(AU) \le (1+\epsilon)$, i.e., $(1-\epsilon)^2 \le \sigma_i^2(AU) \le (1+\epsilon)^2$ Leverage-score sampling performs like a subspace embedding (only for $U$, the top singular vector matrix of $X$) Computational cost: compute the top-$k$ SVD of $X$, expensive Randomized algorithms to compute approximate leverage scores Yang Tutorial for ACML 15 Nov. 20, / 210

198 Randomized Dimension Reduction When does uniform sampling make sense? Coherence measure: $\mu_k = \frac{d}{k}\max_{1 \le i \le d}\|U_i\|_2^2$ Valid when the coherence measure is small (some real data mining datasets have small coherence measures) The Nyström method usually uses uniform sampling (Gittens, 2011) Yang Tutorial for ACML 15 Nov. 20, / 210

199 Randomized Algorithms Randomized Classification (Regression) Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang Tutorial for ACML 15 Nov. 20, / 210

200 Randomized Algorithms Randomized Classification (Regression) Classification Classification problems: $\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$ $y_i \in \{+1, -1\}$: label Loss function $\ell(z)$, $z = y w^\top x$: 1. SVMs: (squared) hinge loss $\ell(z) = \max(0, 1-z)^p$, where $p = 1, 2$ 2. Logistic Regression: $\ell(z) = \log(1 + \exp(-z))$ Yang Tutorial for ACML 15 Nov. 20, / 210

201 Randomized Algorithms Randomized Classification (Regression) Randomized Classification For large-scale high-dimensional problems, the computational cost of optimization is $O((nd + d\kappa)\log(1/\epsilon))$. Using a random reduction $A \in \mathbb{R}^{d \times m}$ ($m \ll d$), e.g., JL transforms or sparse subspace embeddings, we reduce $X \in \mathbb{R}^{n \times d}$ to $\widetilde{X} = XA \in \mathbb{R}^{n \times m}$. Then solve $\min_{u \in \mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell(y_i u^\top \widetilde{x}_i) + \frac{\lambda}{2}\|u\|_2^2$ Yang Tutorial for ACML 15 Nov. 20, / 210
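
A toy numpy illustration of the reduce-then-solve recipe; for simplicity it substitutes the squared loss for the hinge/logistic losses so that the reduced problem has a closed-form solution, and the low-rank synthetic data and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, r, m, lam = 1000, 5000, 50, 300, 1e-3

# low-rank, high-dimensional data with +/-1 labels
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d)) / np.sqrt(r)
w_true = rng.standard_normal(d) / np.sqrt(d)
y = np.sign(X @ w_true)

# random reduction: Gaussian A in R^{d x m}, reduced data X_tilde = X A in R^{n x m}
A = rng.standard_normal((d, m)) / np.sqrt(m)
Xt = X @ A

# solve the reduced ridge problem  min_u (1/n)||Xt u - y||^2 + lam ||u||^2  in closed form
u = np.linalg.solve(Xt.T @ Xt / n + lam * np.eye(m), Xt.T @ y / n)

acc = np.mean(np.sign(Xt @ u) == y)
print("training accuracy in the reduced space:", acc)
```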

202 Randomized Algorithms Randomized Classification (Regression) Randomized Classification Two questions: Is there any performance guarantee? The margin is preserved if the data is linearly separable (Balcan et al., 2006), as long as $m \ge \frac{12}{\epsilon}\log\!\left(\frac{6m^2}{\delta}\right)$; the generalization performance is preserved if the data matrix is of low rank and $m = \Omega\!\left(\frac{k\,\mathrm{poly}(\log(k/\delta\epsilon))}{\epsilon^2}\right)$ (Paul et al., 2013) How to recover an accurate model in the original high-dimensional space? Dual Recovery (Zhang et al., 2014) and Dual Sparse Recovery (Yang et al., 2015) Yang Tutorial for ACML 15 Nov. 20, / 210

207 Randomized Algorithms Randomized Classification (Regression) The Dual Problem Using the Fenchel conjugate: $\ell_i^*(\alpha_i) = \max_z \alpha_i z - \ell(z, y_i)$ Primal: $w_* = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$ Dual: $\alpha_* = \arg\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top X X^\top \alpha$ From dual to primal: $w_* = \frac{1}{\lambda n} X^\top \alpha_*$ Yang Tutorial for ACML 15 Nov. 20, / 210

208 Randomized Algorithms Randomized Classification (Regression) Dual Recovery for Randomized Reduction From the dual formulation: $w_*$ lies in the row space of the data matrix $X \in \mathbb{R}^{n \times d}$ Dual Recovery: $\hat{w} = \frac{1}{\lambda n} X^\top \hat{\alpha}$, where $\hat{\alpha} = \arg\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widetilde{X}\widetilde{X}^\top \alpha$ and $\widetilde{X} = XA \in \mathbb{R}^{n \times m}$ Subspace embedding $A$ with $m = \Theta(r\log(r/\delta)\epsilon^{-2})$ Guarantee: under a low-rank assumption on the data matrix $X$ (e.g., $\mathrm{rank}(X) = r$), with a high probability $1 - \delta$, $\|\hat{w} - w_*\|_2 \le \frac{\epsilon}{1-\epsilon}\|w_*\|_2$ Yang Tutorial for ACML 15 Nov. 20, / 210
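
An illustrative numpy sketch of dual recovery under the same squared-loss simplification; the closed-form expression for $\hat{\alpha}$ below is the ridge-regression dual solution derived under that assumption, not a formula from the tutorial, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, r, m, lam = 300, 4000, 20, 400, 1e-2

# low-rank data matrix X (n x d) with rank r, and real-valued targets y
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d)) / np.sqrt(r * d)
y = np.sign(rng.standard_normal(n))

# exact primal ridge solution:  w* = X^T (X X^T + lam*n I)^{-1} y
w_star = X.T @ np.linalg.solve(X @ X.T + lam * n * np.eye(n), y)

# reduce features, solve the reduced dual in closed form, then recover a model in R^d
A = rng.standard_normal((d, m)) / np.sqrt(m)          # JL / subspace-embedding matrix
Xt = X @ A
alpha_hat = lam * n * np.linalg.solve(Xt @ Xt.T + lam * n * np.eye(n), y)
w_hat = X.T @ alpha_hat / (lam * n)                   # dual recovery uses the original X

print("relative recovery error:", np.linalg.norm(w_hat - w_star) / np.linalg.norm(w_star))
```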

211 Randomized Algorithms Randomized Classification (Regression) Dual Sparse Recovery for Randomized Reduction Assume the optimal dual solution $\alpha_*$ is sparse (i.e., the number of support vectors is small) Dual Sparse Recovery: $\hat{w} = \frac{1}{\lambda n} X^\top \hat{\alpha}$, where $\hat{\alpha} = \arg\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widetilde{X}\widetilde{X}^\top \alpha - \frac{\tau}{n}\|\alpha\|_1$ and $\widetilde{X} = XA \in \mathbb{R}^{n \times m}$ JL transform $A$ with $m = \Theta(s\log(n/\delta)\epsilon^{-2})$ Guarantee: if $\alpha_*$ is $s$-sparse, with a high probability $1 - \delta$, $\|\hat{w} - w_*\|_2 \le \epsilon\|w_*\|_2$ Yang Tutorial for ACML 15 Nov. 20, / 210

214 Randomized Algorithms Randomized Classification (Regression) Dual Sparse Recovery RCV1 text data, n = 677,399 and d = 47,236 [Plots: relative dual error and relative primal error (L2 norm) versus τ for a fixed λ, with m = 1024, 2048, 4096, 8192] Yang Tutorial for ACML 15 Nov. 20, / 210

215 Randomized Algorithms Randomized Least-Squares Regression Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang Tutorial for ACML 15 Nov. 20, / 210

216 Randomized Algorithms Randomized Least-Squares Regression Least-squares regression Let $X \in \mathbb{R}^{n \times d}$ with $d \ll n$ and $b \in \mathbb{R}^n$. The least-squares regression problem is to find $w_*$ such that $w_* = \arg\min_{w \in \mathbb{R}^d} \|Xw - b\|_2$ Computational cost: $O(nd^2)$ Goal of RA: $o(nd^2)$ Yang Tutorial for ACML 15 Nov. 20, / 210

217 Randomized Algorithms Randomized Least-Squares Regression Randomized Least-squares regression Let $A \in \mathbb{R}^{m \times n}$ be a random reduction matrix. Solve $\hat{w} = \arg\min_{w \in \mathbb{R}^d} \|A(Xw - b)\|_2 = \arg\min_{w \in \mathbb{R}^d} \|AXw - Ab\|_2$ Computational cost: $O(md^2)$ + reduction time Yang Tutorial for ACML 15 Nov. 20, / 210
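
A minimal sketch-and-solve example for least squares in numpy; a Gaussian reduction is used here for simplicity (a sparse subspace embedding would give the nnz(X)-time variant on the next slide), and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, m = 10_000, 50, 500

X = rng.standard_normal((n, d))
b = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)

# exact solution and sketched solution
w_exact, *_ = np.linalg.lstsq(X, b, rcond=None)
A = rng.standard_normal((m, n)) / np.sqrt(m)            # random reduction A in R^{m x n}
w_sketch, *_ = np.linalg.lstsq(A @ X, A @ b, rcond=None)

res = lambda w: np.linalg.norm(X @ w - b)
print("residual ratio ||X w_hat - b|| / ||X w* - b||:", res(w_sketch) / res(w_exact))  # about 1 + eps
```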

218 Randomized Algorithms Randomized Least-Squares Regression Randomized Least-squares regression Theoretical Guarantees (Sarlós, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012): $\|X\hat{w} - b\|_2 \le (1 + \epsilon)\|Xw_* - b\|_2$ Total time: $O(\mathrm{nnz}(X) + d^3\log(d/\epsilon)\epsilon^{-2})$ Yang Tutorial for ACML 15 Nov. 20, / 210

219 Randomized Algorithms Randomized K-means Clustering Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang Tutorial for ACML 15 Nov. 20, / 210

220 Randomized Algorithms Randomized K-means Clustering K-means Clustering Let $x_1, \ldots, x_n \in \mathbb{R}^d$ be a set of data points. K-means clustering aims to solve $\min_{C_1, \ldots, C_k}\sum_{j=1}^k\sum_{x_i \in C_j}\|x_i - \mu_j\|_2^2$ Computational cost: $O(ndkt)$, where $t$ is the number of iterations. Yang Tutorial for ACML 15 Nov. 20, / 210

221 Randomized Algorithms Randomized K-means Clustering Randomized Algorithms for K-means Clustering Let $X = (x_1, \ldots, x_n) \in \mathbb{R}^{n \times d}$ be the data matrix. High-dimensional data: Random sketch: $\widetilde{X} = XA \in \mathbb{R}^{n \times m}$, $m \ll d$ Approximate K-means: $\min_{C_1, \ldots, C_k}\sum_{j=1}^k\sum_{\widetilde{x}_i \in C_j}\|\widetilde{x}_i - \widetilde{\mu}_j\|_2^2$ Yang Tutorial for ACML 15 Nov. 20, / 210
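
A small numpy illustration: sketch the features with a Gaussian map and cluster the reduced data with a few plain Lloyd's iterations (written out here only to keep the example dependency-free); the synthetic clusters and all sizes are arbitrary.

```python
import numpy as np

def lloyd(X, k, iters=20, rng=None):
    """A few plain Lloyd's iterations; returns cluster assignments."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(-1)   # n x k squared distances
        assign = d2.argmin(axis=1)
        centers = np.stack([X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
                            for j in range(k)])                    # keep old center if cluster empties
    return assign

rng = np.random.default_rng(9)
n, d, m, k = 3000, 2000, 100, 5

# k well-separated clusters in d dimensions
means = 5 * rng.standard_normal((k, d))
labels = rng.integers(0, k, size=n)
X = means[labels] + rng.standard_normal((n, d))

A = rng.standard_normal((d, m)) / np.sqrt(m)      # random sketch of the features
Xs = X @ A                                        # n x m sketched data
assign = lloyd(Xs, k, rng=9)
print("sketched data shape:", Xs.shape)
```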

223 Randomized Algorithms Randomized K-means Clustering Randomized Algorithms for K-means Clustering For the random sketch: JL transforms and sparse subspace embeddings all work JL transform: $m = O\!\left(\frac{k\log(k/(\epsilon\delta))}{\epsilon^2}\right)$ Sparse subspace embedding: $m = O\!\left(\frac{k^2}{\epsilon^2\delta}\right)$ $\epsilon$ relates to the approximation accuracy The analysis of the approximation error for K-means can be formulated as Constrained Low-rank Approximation (Cohen et al., 2015): $\min_{Q^\top Q = I}\|X - QQ^\top X\|_F^2$, where $Q$ is orthonormal. Yang Tutorial for ACML 15 Nov. 20, / 210

224 Randomized Algorithms Randomized Kernel methods Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang Tutorial for ACML 15 Nov. 20, / 210

225 Randomized Algorithms Randomized Kernel methods Kernel methods Kernel function $\kappa(\cdot, \cdot)$ and a set of examples $x_1, \ldots, x_n$ Kernel matrix: $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = \kappa(x_i, x_j)$; $K$ is a PSD matrix Computational and memory costs: $\Omega(n^2)$ Approximation methods: the Nyström method, random Fourier features Yang Tutorial for ACML 15 Nov. 20, / 210

227 Randomized Algorithms Randomized Kernel methods The Nyström method Let $A \in \mathbb{R}^{n \times \ell}$ be a uniform sampling matrix. $B = KA \in \mathbb{R}^{n \times \ell}$, $C = A^\top B = A^\top K A$ The Nyström approximation (Drineas & Mahoney, 2005): $\widetilde{K} = B C^{\dagger} B^\top$ Computational cost: $O(\ell^3 + n\ell^2)$ Yang Tutorial for ACML 15 Nov. 20, / 210
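
A minimal numpy sketch of the Nyström approximation with an RBF kernel; the kernel choice, bandwidth, and $\ell$ are illustrative, and the full kernel matrix is formed only to measure the approximation error.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    """K_ij = exp(-gamma * ||x_i - z_j||^2)."""
    d2 = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(10)
n, l = 2000, 100
X = rng.standard_normal((n, 10))

idx = rng.choice(n, size=l, replace=False)       # uniform column sampling
B = rbf_kernel(X, X[idx])                        # B = K A  (n x l), never forms the full K
C = B[idx]                                       # C = A^T K A  (l x l)
K_nys = B @ np.linalg.pinv(C) @ B.T              # Nystrom approximation B C^+ B^T

K = rbf_kernel(X, X)                             # full kernel, only for the error check
print("relative error:", np.linalg.norm(K - K_nys) / np.linalg.norm(K))
```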

229 Randomized Algorithms Randomized Kernel methods The Nyström based kernel machine The dual problem: $\arg\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top B C^{\dagger} B^\top \alpha$ Solve it like a linear method: let $\widetilde{X} = B C^{-1/2} \in \mathbb{R}^{n \times \ell}$, then $\arg\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widetilde{X}\widetilde{X}^\top \alpha$ Yang Tutorial for ACML 15 Nov. 20, / 210
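
A self-contained sketch of the same idea: form the explicit Nyström feature map $\widetilde{X} = BC^{-1/2}$ via an eigendecomposition of $C$, after which any linear (dual) solver can be applied to the $n \times \ell$ features. The kernel, names, and sizes are again illustrative.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    d2 = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(11)
n, l = 2000, 100
X = rng.standard_normal((n, 10))
idx = rng.choice(n, size=l, replace=False)
B = rbf_kernel(X, X[idx])                 # K A
C = B[idx]                                # A^T K A

# X_tilde = B C^{-1/2} via the eigendecomposition of the PSD matrix C
eigval, eigvec = np.linalg.eigh(C)
keep = eigval > 1e-10                     # drop numerically zero directions
C_inv_sqrt = eigvec[:, keep] @ np.diag(eigval[keep]**-0.5) @ eigvec[:, keep].T
X_tilde = B @ C_inv_sqrt                  # n x l features; X_tilde @ X_tilde.T approximates K
print(X_tilde.shape)                      # plug into any linear (dual) solver
```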

230 Randomized Algorithms Randomized Kernel methods The Nyström based kernel machine Yang Tutorial for ACML 15 Nov. 20, / 210
