Big Data Analytics: Optimization and Randomization


Big Data Analytics: Optimization and Randomization
Tianbao Yang, Tutorial@ACML 2015, Hong Kong
Department of Computer Science, The University of Iowa, IA, USA
Nov. 20, 2015

URL: http://www.cs.uiowa.edu/~tyng/acml15-tutorial.pdf

Some Claims
This tutorial is not an exhaustive literature survey.
It is not a survey of different machine learning algorithms.
It is about how to efficiently solve machine learning problems (formulated as optimization problems) for big data.

Outline
Part I: Basics
Part II: Optimization
Part III: Randomization

Big Data Analytics: Optimization and Randomization. Part I: Basics

Outline
1 Basics: Introduction; Notations and Definitions

Three Steps for Machine Learning: Data, Model, Optimization.
(Figure: distance to optimal objective vs. iterations for convergence rates 0.5^T, 1/T^2, and 1/T.)

Big Data Challenge: Big Data.

Big Data Challenge: Big Model (60 million parameters).

Learning as Optimization
Ridge Regression Problem:
$\min_{w \in \mathbb{R}^d} \; \underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2}_{\text{Empirical Loss}} + \underbrace{\frac{\lambda}{2}\|w\|_2^2}_{\text{Regularization}}$
$x_i \in \mathbb{R}^d$: d-dimensional feature vector
$y_i \in \mathbb{R}$: target variable
$w \in \mathbb{R}^d$: model parameters
$n$: number of data points

Learning as Optimization
Classification Problems:
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$
$y_i \in \{+1, -1\}$: label
Loss function $\ell(z)$ with $z = y w^\top x$:
1. SVMs: (squared) hinge loss $\ell(z) = \max(0, 1-z)^p$, where $p = 1, 2$
2. Logistic Regression: $\ell(z) = \log(1 + \exp(-z))$

Learning as Optimization
Feature Selection:
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda\|w\|_1$
$\ell_1$ regularization: $\|w\|_1 = \sum_{i=1}^d |w_i|$
$\lambda$ controls the sparsity level

Learning as Optimization
Feature Selection using the Elastic Net:
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda\left(\|w\|_1 + \gamma\|w\|_2^2\right)$
The elastic net regularizer is more robust than the $\ell_1$ regularizer alone

Learning as Optimization
Multi-class/Multi-task Learning ($W \in \mathbb{R}^{K \times d}$):
$\min_{W} \frac{1}{n}\sum_{i=1}^n \ell(W x_i, y_i) + \lambda r(W)$
$r(W) = \|W\|_F^2 = \sum_{k=1}^K \sum_{j=1}^d W_{kj}^2$: Frobenius norm
$r(W) = \|W\|_* = \sum_i \sigma_i$: nuclear norm (sum of singular values)
$r(W) = \|W\|_{1,\infty} = \sum_{j=1}^d \|W_{:j}\|_\infty$: $\ell_{1,\infty}$ mixed norm

Learning as Optimization
Regularized Empirical Loss Minimization:
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$
Both $\ell$ and $R$ are convex functions
Extensions to matrix cases are possible (sometimes straightforward)
Extensions to kernel methods can be combined with randomized approaches
Extensions to non-convex problems (e.g., deep learning) are in progress

Data Matrices and Machine Learning
The instance-feature matrix: $X \in \mathbb{R}^{n \times d}$, with rows $x_1^\top, x_2^\top, \ldots, x_n^\top$
The output vector: $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$; continuous $y_i \in \mathbb{R}$: regression (e.g., house price); discrete, e.g., $y_i \in \{1, 2, 3\}$: classification (e.g., species of iris)
The instance-instance matrix: $K \in \mathbb{R}^{n \times n}$ (similarity matrix, kernel matrix); some machine learning tasks are formulated on the kernel matrix, e.g., clustering and kernel methods
The feature-feature matrix: $C \in \mathbb{R}^{d \times d}$ (covariance matrix, distance metric matrix); some machine learning tasks require the covariance matrix, e.g., Principal Component Analysis via a top-k singular value (eigenvalue) decomposition of the covariance matrix

Why is Learning from Big Data Challenging?
High per-iteration cost
High memory cost
High communication cost
Large iteration complexity

Outline
1 Basics: Introduction; Notations and Definitions

Norms
Vector $x \in \mathbb{R}^d$
Euclidean vector norm: $\|x\|_2 = \sqrt{x^\top x} = \sqrt{\sum_{i=1}^d x_i^2}$
$\ell_p$-norm of a vector: $\|x\|_p = \left(\sum_{i=1}^d |x_i|^p\right)^{1/p}$, where $p \ge 1$
1. $\ell_2$ norm: $\|x\|_2 = \sqrt{\sum_{i=1}^d x_i^2}$
2. $\ell_1$ norm: $\|x\|_1 = \sum_{i=1}^d |x_i|$
3. $\ell_\infty$ norm: $\|x\|_\infty = \max_i |x_i|$
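As a quick numerical illustration of these definitions (a small sketch added here, not part of the original slides):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l2 = np.sqrt(np.sum(x ** 2))       # Euclidean norm: sqrt(9 + 16 + 1)
l1 = np.sum(np.abs(x))             # l1 norm: 3 + 4 + 1 = 8
linf = np.max(np.abs(x))           # l_infinity norm: 4

# The built-in helper gives the same values.
assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(linf, np.linalg.norm(x, np.inf))
```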

Matrix Factorization
Matrix $X \in \mathbb{R}^{n \times d}$
Singular Value Decomposition: $X = U\Sigma V^\top$
1. $U \in \mathbb{R}^{n \times r}$: orthonormal columns ($U^\top U = I$); spans the column space
2. $\Sigma \in \mathbb{R}^{r \times r}$: diagonal matrix, $\Sigma_{ii} = \sigma_i > 0$, $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$
3. $V \in \mathbb{R}^{d \times r}$: orthonormal columns ($V^\top V = I$); spans the row space
4. $r \le \min(n, d)$: the largest value such that $\sigma_r > 0$, i.e., the rank of $X$
5. $U_k \Sigma_k V_k^\top$: top-$k$ approximation
Pseudo-inverse: $X^\dagger = V\Sigma^{-1}U^\top$
QR factorization: $X = QR$ ($n \ge d$); $Q \in \mathbb{R}^{n \times d}$: orthonormal columns; $R \in \mathbb{R}^{d \times d}$: upper triangular matrix

Norms
Matrix $X \in \mathbb{R}^{n \times d}$
Frobenius norm: $\|X\|_F = \sqrt{\mathrm{tr}(X^\top X)} = \sqrt{\sum_{i=1}^n \sum_{j=1}^d X_{ij}^2}$
Spectral (induced) norm: $\|X\|_2 = \max_{\|u\|_2 = 1} \|Xu\|_2 = \sigma_1$ (the maximum singular value)

Convex Optimization
$\min_{x \in \mathcal{X}} f(x)$
$\mathcal{X}$ is a convex domain: for any $x, y \in \mathcal{X}$, their convex combination $\alpha x + (1-\alpha)y \in \mathcal{X}$
$f(x)$ is a convex function

Convex Function
Characterizations of a convex function:
$f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha)f(y)$ for all $x, y \in \mathcal{X}$, $\alpha \in [0, 1]$
$f(x) \ge f(y) + \nabla f(y)^\top (x - y)$ for all $x, y \in \mathcal{X}$
A local optimum is a global optimum

Convex vs. Strongly Convex
Convex function: $f(x) \ge f(y) + \nabla f(y)^\top (x - y)$ for all $x, y \in \mathcal{X}$
Strongly convex function: $f(x) \ge f(y) + \nabla f(y)^\top (x - y) + \frac{\lambda}{2}\|x - y\|_2^2$ for all $x, y \in \mathcal{X}$, where $\lambda$ is the strong convexity constant
The global optimum is unique
e.g., $\frac{\lambda}{2}\|w\|_2^2$ is $\lambda$-strongly convex

Non-smooth Function vs. Smooth Function
Non-smooth function: Lipschitz continuous with constant $G$, i.e., $|f(x) - f(y)| \le G\|x - y\|_2$; e.g., absolute loss $f(x) = |x|$. Subgradient: $f(x) \ge f(y) + \partial f(y)^\top (x - y)$.
Smooth function: the gradient is Lipschitz with smoothness constant $L$, i.e., $\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$; e.g., logistic loss $f(x) = \log(1 + \exp(-x))$.
(Figures: a non-smooth function with one of its subgradients; the logistic loss bounded above by a quadratic function and below by the tangent line $f(y) + f'(y)(x - y)$.)
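To make the two constants concrete, here is a short check (added for clarity, not from the slides) that the logistic loss is smooth with $L = \frac{1}{4}$, writing $\sigma(x) = 1/(1 + e^{-x})$:

```latex
f(x) = \log(1 + e^{-x}), \qquad
f'(x) = -\frac{1}{1 + e^{x}}, \qquad
f''(x) = \frac{e^{x}}{(1 + e^{x})^{2}} = \sigma(x)\,(1 - \sigma(x)) \le \frac{1}{4}
\quad\Longrightarrow\quad |f'(x) - f'(y)| \le \tfrac{1}{4}\,|x - y|.
```

By contrast, the absolute loss $f(x) = |x|$ is $1$-Lipschitz ($G = 1$) but not differentiable at $0$, so only a subgradient is available there.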

Next... Part II: Optimization
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$
Stochastic optimization; distributed optimization
Reduce the iteration complexity by utilizing properties of the functions and the structure of the problem

Next... Part III: Randomization
Classification, regression; SVD, k-means, kernel methods
Reduce the data size by utilizing properties of the data
Please stay tuned!

Big Data Analytics: Optimization and Randomization. Part II: Optimization

Outline
2 Optimization: (Sub)Gradient Methods; Stochastic Optimization Algorithms for Big Data (Stochastic Optimization, Distributed Optimization)

Learning as Optimization
Regularized Empirical Loss Minimization:
$\min_{w \in \mathbb{R}^d} F(w) := \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$

Convergence Measure
Most optimization algorithms are iterative: $w_{t+1} = w_t + \Delta w_t$
Iteration complexity: the number of iterations $T(\epsilon)$ needed to have $F(\hat{w}_T) - \min_w F(w) \le \epsilon$ ($\epsilon \ll 1$)
Convergence rate: after $T$ iterations, how good is the solution: $F(\hat{w}_T) - \min_w F(w) \le \epsilon(T)$
Total Runtime = Per-iteration Cost $\times$ Iteration Complexity
(Figure: objective vs. iterations, reaching accuracy $\epsilon$ after $T$ iterations.)

More on Convergence Measure
Big $O(\cdot)$ notation: explicit dependence on $T$ or $\epsilon$
Linear rate: convergence rate $O(\mu^T)$ with $\mu < 1$; iteration complexity $O(\log(1/\epsilon))$
Sub-linear rate: convergence rate $O(1/T^\alpha)$ with $\alpha > 0$; iteration complexity $O(1/\epsilon^{1/\alpha})$
Why are we interested in bounds?

More on Convergence Measure
(Figure: distance to optimum vs. iterations for a linear rate $0.5^T$ (seconds), a sub-linear rate $1/T$ (minutes), and a slower sub-linear rate $1/T^{0.5}$ (hours).)
Theoretically, we consider $\log\frac{1}{\epsilon} \ll \frac{1}{\epsilon} \ll \frac{1}{\epsilon^2}$.

Non-smooth vs. Smooth Losses
Smooth $\ell(z)$:
squared hinge loss: $\ell(w^\top x, y) = \max(0, 1 - y w^\top x)^2$
logistic loss: $\ell(w^\top x, y) = \log(1 + \exp(-y w^\top x))$
square loss: $\ell(w^\top x, y) = (w^\top x - y)^2$
Non-smooth $\ell(z)$:
hinge loss: $\ell(w^\top x, y) = \max(0, 1 - y w^\top x)$
absolute loss: $\ell(w^\top x, y) = |w^\top x - y|$

Strongly Convex vs. Non-strongly Convex Regularizers
$\lambda$-strongly convex $R(w)$:
$\ell_2$ regularizer: $\frac{\lambda}{2}\|w\|_2^2$
elastic net regularizer: $\tau\|w\|_1 + \frac{\lambda}{2}\|w\|_2^2$
Non-strongly convex $R(w)$:
unregularized problem: $R(w) \equiv 0$
$\ell_1$ regularizer: $\tau\|w\|_1$

Gradient Method in Machine Learning
$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$
Suppose $\ell(z, y)$ is smooth
Full gradient: $\nabla F(w) = \frac{1}{n}\sum_{i=1}^n \nabla\ell(w^\top x_i, y_i) + \lambda w$
Per-iteration cost: $O(nd)$
Gradient Descent: $w_t = w_{t-1} - \gamma_t \nabla F(w_{t-1})$, with step size $\gamma_t$ constant, e.g., $\frac{1}{L}$
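As a concrete illustration of the full-gradient update, here is a minimal NumPy sketch for $\ell_2$-regularized logistic regression; the function name and the step size $1/L$ with $L = \|X\|_2^2/(4n) + \lambda$ are illustrative choices, not prescribed by the tutorial:

```python
import numpy as np

def gradient_descent(X, y, lam, T=100):
    """Full gradient descent on F(w) = (1/n) sum_i log(1 + exp(-y_i w^T x_i)) + lam/2 ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    # Smoothness constant of F: the logistic loss is 1/4-smooth, plus lam from the regularizer.
    L = np.linalg.norm(X, 2) ** 2 / (4 * n) + lam
    for _ in range(T):
        z = y * (X @ w)
        grad = -(X.T @ (y / (1.0 + np.exp(z)))) / n + lam * w   # full gradient, O(nd) per iteration
        w -= grad / L                                           # constant step size 1/L
    return w
```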

Gradient Method in Machine Learning
$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \underbrace{\frac{\lambda}{2}\|w\|_2^2}_{R(w)}$
If $\lambda = 0$: $R(w)$ is non-strongly convex; iteration complexity $O(\frac{1}{\epsilon})$
If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex; iteration complexity $O(\frac{1}{\lambda}\log(\frac{1}{\epsilon}))$

Accelerated Full Gradient (AFG)
Nesterov's Accelerated Gradient Descent:
$w_t = v_{t-1} - \gamma_t \nabla F(v_{t-1})$
$v_t = w_t + \eta_t (w_t - w_{t-1})$  (momentum step)
$w_t$ is the output and $v_t$ is an auxiliary sequence.
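A sketch of the momentum step, written against a generic gradient oracle; the schedule $\eta_t = \frac{t-1}{t+2}$ is one standard choice for the non-strongly convex case and is an assumption here, not taken from the slides:

```python
import numpy as np

def accelerated_gradient(grad_F, d, L, T=100):
    """Nesterov's accelerated gradient: w_t = v_{t-1} - (1/L) grad F(v_{t-1});
    v_t = w_t + eta_t (w_t - w_{t-1})."""
    w_prev = np.zeros(d)
    v = np.zeros(d)
    for t in range(1, T + 1):
        w = v - grad_F(v) / L           # gradient step taken at the auxiliary point v
        eta = (t - 1) / (t + 2)         # momentum weight
        v = w + eta * (w - w_prev)      # momentum step
        w_prev = w
    return w_prev
```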

Accelerated Full Gradient (AFG)
$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$
If $\lambda = 0$: $R(w)$ is non-strongly convex; iteration complexity $O(\frac{1}{\sqrt{\epsilon}})$, better than $O(\frac{1}{\epsilon})$
If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex; iteration complexity $O(\frac{1}{\sqrt{\lambda}}\log(\frac{1}{\epsilon}))$, better than $O(\frac{1}{\lambda}\log(\frac{1}{\epsilon}))$ for small $\lambda$

Dealing with a Non-smooth Regularizer
Consider $\ell_1$ norm regularization:
$\min_{w \in \mathbb{R}^d} F(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i)}_{f(w)} + \underbrace{\tau\|w\|_1}_{R(w)}$
$f(w)$: smooth
$R(w)$: non-smooth and non-strongly convex

Accelerated Proximal Gradient (APG)
$w_t = \arg\min_{w \in \mathbb{R}^d} \nabla f(v_{t-1})^\top w + \frac{1}{2\gamma_t}\|w - v_{t-1}\|_2^2 + \tau\|w\|_1$  (proximal mapping)
$v_t = w_t + \eta_t (w_t - w_{t-1})$
The proximal mapping has a closed-form solution: soft-thresholding
Iteration complexity and runtime remain the same as in the smooth, non-strongly convex case, i.e., $O(\frac{1}{\sqrt{\epsilon}})$
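The proximal mapping above is the soft-thresholding operator $\mathrm{prox}_{\gamma\tau\|\cdot\|_1}(u) = \mathrm{sign}(u)\max(|u| - \gamma\tau, 0)$ applied to $u = v_{t-1} - \gamma_t\nabla f(v_{t-1})$. A minimal sketch of one APG step (the gradient oracle and parameter values are assumptions for illustration):

```python
import numpy as np

def soft_threshold(u, thresh):
    """Closed-form proximal mapping of thresh * ||.||_1 (soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def apg_step(v, w_prev, grad_f, gamma, tau, eta):
    """One accelerated proximal gradient step for f(w) + tau * ||w||_1."""
    w = soft_threshold(v - gamma * grad_f(v), gamma * tau)   # proximal gradient step
    v_next = w + eta * (w - w_prev)                          # momentum step
    return w, v_next
```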

Dealing with a Non-smooth but Strongly Convex Regularizer
Consider elastic net regularization:
$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \underbrace{\frac{\lambda}{2}\|w\|_2^2 + \tau\|w\|_1}_{R(w)}$
$R(w)$: non-smooth but strongly convex. Regroup the terms:
$\min_{w \in \mathbb{R}^d} F(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2}_{f(w)} + \underbrace{\tau\|w\|_1}_{R'(w)}$
$f(w)$: smooth and strongly convex; $R'(w)$: non-smooth and non-strongly convex
Iteration complexity: $O(\frac{1}{\sqrt{\lambda}}\log(\frac{1}{\epsilon}))$

Sub-Gradient Method in Machine Learning
$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$
Suppose $\ell(z, y)$ is non-smooth
Full sub-gradient: $\partial F(w) = \frac{1}{n}\sum_{i=1}^n \partial\ell(w^\top x_i, y_i) + \lambda w$
Sub-Gradient Descent: $w_t = w_{t-1} - \gamma_t \partial F(w_{t-1})$, with step size $\gamma_t \to 0$

Sub-Gradient Method
If $\lambda = 0$: $R(w)$ is non-strongly convex (generalizes to the $\ell_1$ norm and other non-strongly convex regularizers); iteration complexity $O(\frac{1}{\epsilon^2})$
If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex (generalizes to the elastic net and other strongly convex regularizers); iteration complexity $O(\frac{1}{\lambda\epsilon})$
No efficient acceleration scheme in general

Problem Classes and Iteration Complexity for $\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$:
Non-strongly convex $R(w)$: non-smooth $\ell$: $O(\frac{1}{\epsilon^2})$; smooth $\ell$: $O(\frac{1}{\sqrt{\epsilon}})$
$\lambda$-strongly convex $R(w)$: non-smooth $\ell$: $O(\frac{1}{\lambda\epsilon})$; smooth $\ell$: $O(\frac{1}{\sqrt{\lambda}}\log(\frac{1}{\epsilon}))$
Per-iteration cost: $O(nd)$, too high if $n$ or $d$ is large.

Outline
2 Optimization: (Sub)Gradient Methods; Stochastic Optimization Algorithms for Big Data (Stochastic Optimization, Distributed Optimization)

Stochastic First-Order Methods by Sampling
Randomly sample an example:
1 Stochastic Gradient Descent (SGD)
2 Stochastic Variance Reduced Gradient (SVRG)
3 Stochastic Average Gradient Algorithm (SAGA)
4 Stochastic Dual Coordinate Ascent (SDCA)
Randomly sample a feature:
1 Randomized Coordinate Descent (RCD)
2 Accelerated Proximal Coordinate Gradient (APCG)

Basic SGD (Nemirovski & Yudin (1978))
$F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$
Full sub-gradient: $\partial F(w) = \frac{1}{n}\sum_{i=1}^n \partial\ell(w^\top x_i, y_i) + \lambda w$
Randomly sample $i \in \{1, \ldots, n\}$
Stochastic sub-gradient: $\partial\ell(w^\top x_i, y_i) + \lambda w$, with $\mathbb{E}_i[\partial\ell(w^\top x_i, y_i) + \lambda w] = \partial F(w)$

Basic SGD (Nemirovski & Yudin (1978))
Applicable in all settings!
$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$
Sample: $i_t \in \{1, \ldots, n\}$
Update: $w_t = w_{t-1} - \gamma_t\left(\partial\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}\right)$
Output: $\hat{w}_T = \frac{1}{T}\sum_{t=1}^T w_t$
Step size: $\gamma_t \to 0$
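A minimal sketch of the basic SGD loop for the hinge loss (SVM) with $\ell_2$ regularization; the step size $\gamma_t = \frac{1}{\lambda t}$ is a common choice for the strongly convex case and, like the function name, is an assumption here:

```python
import numpy as np

def sgd_svm(X, y, lam, T):
    """Basic SGD on F(w) = (1/n) sum_i max(0, 1 - y_i w^T x_i) + lam/2 ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    rng = np.random.default_rng(0)
    for t in range(1, T + 1):
        i = rng.integers(n)                  # sample i_t uniformly at random
        g = lam * w                          # stochastic sub-gradient of F at w
        if y[i] * (X[i] @ w) < 1.0:
            g -= y[i] * X[i]
        w -= g / (lam * t)                   # decreasing step size gamma_t = 1/(lam*t)
        w_avg += (w - w_avg) / t             # running average of the iterates (the output)
    return w_avg
```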

Basic SGD (Nemirovski & Yudin (1978))
If $\lambda = 0$: $R(w)$ is non-strongly convex (generalizes to the $\ell_1$ norm and other non-strongly convex regularizers); iteration complexity $O(\frac{1}{\epsilon^2})$
If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex (generalizes to the elastic net and other strongly convex regularizers); iteration complexity $O(\frac{1}{\lambda\epsilon})$
Exactly the same as sub-gradient descent!

Total Runtime
Per-iteration cost: $O(d)$, much lower than the full gradient method
e.g., hinge loss (SVM): stochastic sub-gradient $\partial\ell(w^\top x_{i_t}, y_{i_t}) = -y_{i_t} x_{i_t}$ if $1 - y_{i_t} w^\top x_{i_t} > 0$, and $0$ otherwise

Total Runtime
Iteration complexity of SGD: non-strongly convex $R$: $O(\frac{1}{\epsilon^2})$ for both non-smooth and smooth $\ell$; $\lambda$-strongly convex $R$: $O(\frac{1}{\lambda\epsilon})$ for both non-smooth and smooth $\ell$
For SGD, only strong convexity helps; smoothness makes no difference!
The reason: the step size has to decrease because the stochastic gradient does not approach $0$.

Variance Reduction
Stochastic Variance Reduced Gradient (SVRG)
Stochastic Average Gradient Algorithm (SAGA)
Stochastic Dual Coordinate Ascent (SDCA)

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)
$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$
Applicable when $\ell(z)$ is smooth and $R(w)$ is $\lambda$-strongly convex
Stochastic gradient: $g_{i_t}(w) = \nabla\ell(w^\top x_{i_t}, y_{i_t}) + \lambda w$
$\mathbb{E}_{i_t}[g_{i_t}(w)] = \nabla F(w)$, but $\mathrm{Var}[g_{i_t}(w)] \not\to 0$ even if $w = w_*$

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)
Compute the full gradient at a reference point $\tilde{w}$: $\nabla F(\tilde{w}) = \frac{1}{n}\sum_{i=1}^n g_i(\tilde{w})$
Stochastic variance reduced gradient: $\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}) + \nabla F(\tilde{w})$
$\mathbb{E}_{i_t}[\tilde{g}_{i_t}(w)] = \nabla F(w)$ and $\mathrm{Var}[\tilde{g}_{i_t}(w)] \to 0$ as $w, \tilde{w} \to w_*$

Variance Reduction (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)
At the optimal solution $w_*$: $\nabla F(w_*) = 0$
This does not mean $g_{i_t}(w) \to 0$ as $w \to w_*$
However, we do have $\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}) + \nabla F(\tilde{w}) \to 0$ as $w, \tilde{w} \to w_*$

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)
Iterate $s = 1, \ldots, T$:
  Let $w_0 = \tilde{w}_s$ and compute $\nabla F(\tilde{w}_s)$
  Iterate $t = 1, \ldots, m$:
    $\tilde{g}_{i_t}(w_{t-1}) = \nabla F(\tilde{w}_s) - g_{i_t}(\tilde{w}_s) + g_{i_t}(w_{t-1})$
    $w_t = w_{t-1} - \gamma_t \tilde{g}_{i_t}(w_{t-1})$
  $\tilde{w}_{s+1} = \frac{1}{m}\sum_{t=1}^m w_t$
Output: $\tilde{w}_T$, with $m = O(\frac{1}{\lambda})$ and $\gamma_t$ = constant
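A sketch of SVRG for $\ell_2$-regularized logistic regression; the constant step size and the choice $m = 2n$ are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np

def svrg_logreg(X, y, lam, stages=20, m=None, step=0.1):
    """SVRG: at each stage compute a full gradient at w_tilde, then take m
    variance-reduced stochastic steps g_i(w) - g_i(w_tilde) + grad F(w_tilde)."""
    n, d = X.shape
    m = 2 * n if m is None else m
    rng = np.random.default_rng(0)

    def grad_i(w, i):
        z = y[i] * (X[i] @ w)
        return -y[i] * X[i] / (1.0 + np.exp(z)) + lam * w

    def full_grad(w):
        z = y * (X @ w)
        return -(X.T @ (y / (1.0 + np.exp(z)))) / n + lam * w

    w_tilde = np.zeros(d)
    for _ in range(stages):
        mu = full_grad(w_tilde)                         # full gradient at the reference point
        w = w_tilde.copy()
        iterate_sum = np.zeros(d)
        for _ in range(m):
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_tilde, i) + mu  # variance-reduced gradient
            w -= step * g
            iterate_sum += w
        w_tilde = iterate_sum / m                       # average of the inner iterates
    return w_tilde
```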

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)
Per-iteration cost: $O(d)$
Iteration complexity (smooth loss): non-strongly convex $R$: N.A.*; $\lambda$-strongly convex $R$: $O((n + \frac{1}{\lambda})\log\frac{1}{\epsilon})$; non-smooth loss: N.A.
Total runtime: $O(d(n + \frac{1}{\lambda})\log\frac{1}{\epsilon})$, better than AFG's $O(\frac{nd}{\sqrt{\lambda}}\log\frac{1}{\epsilon})$
Use a proximal mapping for the elastic net regularizer
* A small trick can fix this

SAGA (Defazio et al. (2014))
$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$
A new version of SAG (Roux et al. (2012))
Applicable when $\ell(z)$ is smooth; strong convexity is not necessary.

SAGA (Defazio et al. (2014))
SAGA also reduces the variance of the stochastic gradient, but with a different technique.
SVRG uses gradients at the same reference point $\tilde{w}$: $\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}) + \nabla F(\tilde{w})$, where $\nabla F(\tilde{w}) = \frac{1}{n}\sum_{i=1}^n g_i(\tilde{w})$
SAGA uses gradients at different points $\{\tilde{w}_1, \tilde{w}_2, \ldots, \tilde{w}_n\}$: $\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}_{i_t}) + \bar{G}$, where $\bar{G} = \frac{1}{n}\sum_{i=1}^n g_i(\tilde{w}_i)$

SAGA (Defazio et al. (2014))
Initialize the average gradient: $\bar{G}_0 = \frac{1}{n}\sum_{i=1}^n \bar{g}_i$, with $\bar{g}_i = \nabla\ell(w_0^\top x_i, y_i) + \lambda w_0$
At step $t$, with average gradient $\bar{G}_{t-1} = \frac{1}{n}\sum_{i=1}^n \bar{g}_i$:
Stochastic variance reduced gradient: $\tilde{g}_{i_t}(w_{t-1}) = \left(\nabla\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}\right) - \bar{g}_{i_t} + \bar{G}_{t-1}$
$w_t = w_{t-1} - \gamma_t \tilde{g}_{i_t}(w_{t-1})$
Update the selected component of the average gradient: $\bar{g}_{i_t} \leftarrow \nabla\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}$ and $\bar{G}_t = \frac{1}{n}\sum_{i=1}^n \bar{g}_i$

SAGA: efficient update of the average gradient
$\bar{G}_t$ and $\bar{G}_{t-1}$ differ only in $\bar{g}_i$ for $i = i_t$. Before updating $\bar{g}_{i_t}$, compute
$\bar{G}_t = \frac{1}{n}\sum_{i=1}^n \bar{g}_i = \bar{G}_{t-1} - \frac{1}{n}\bar{g}_{i_t} + \frac{1}{n}\left(\nabla\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}\right)$
Computation cost: $O(d)$
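A sketch of SAGA using the O(d) average-gradient update just described; storing the per-example gradients in an $n \times d$ table is the simplest (memory-heavy) choice and, along with the constant step size, is an illustrative assumption:

```python
import numpy as np

def saga_logreg(X, y, lam, T, step=0.1):
    """SAGA: keep a table of the last gradient seen for each example and its running average."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)

    def grad_i(w, i):
        z = y[i] * (X[i] @ w)
        return -y[i] * X[i] / (1.0 + np.exp(z)) + lam * w

    table = np.array([grad_i(w, i) for i in range(n)])   # stored gradients g_bar_i
    G = table.mean(axis=0)                               # average gradient G_bar

    for _ in range(T):
        i = rng.integers(n)
        g_new = grad_i(w, i)
        g = g_new - table[i] + G                         # variance-reduced gradient
        w -= step * g
        G += (g_new - table[i]) / n                      # O(d) update of the average
        table[i] = g_new                                 # update the selected component
    return w
```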

SAGA (Defazio et al. (2014))
Per-iteration cost: $O(d)$
Iteration complexity (smooth loss): non-strongly convex $R$: $O(\frac{n}{\epsilon})$; $\lambda$-strongly convex $R$: $O((n + \frac{1}{\lambda})\log\frac{1}{\epsilon})$; non-smooth loss: N.A.
Total runtime (strongly convex): $O(d(n + \frac{1}{\lambda})\log\frac{1}{\epsilon})$. Same as SVRG!
Use a proximal mapping for the $\ell_1$ regularizer

Compare the Runtime of SGD and SVRG/SAGA
Smooth but non-strongly convex: SGD: $O(\frac{d}{\epsilon^2})$; SAGA: $O(\frac{dn}{\epsilon})$
Smooth and strongly convex: SGD: $O(\frac{d}{\lambda\epsilon})$; SVRG/SAGA: $O(d(n + \frac{1}{\lambda})\log\frac{1}{\epsilon})$
For small $\epsilon$, use SVRG/SAGA; if a large $\epsilon$ suffices, use SGD

Conjugate Duality
Define $\ell_i(z) \triangleq \ell(z, y_i)$
Conjugate function $\ell_i^*(\alpha)$ of $\ell_i(z)$: $\ell_i(z) = \max_{\alpha \in \mathbb{R}}[\alpha z - \ell_i^*(\alpha)]$, $\ell_i^*(\alpha) = \max_{z \in \mathbb{R}}[\alpha z - \ell_i(z)]$
E.g., hinge loss $\ell_i(z) = \max(0, 1 - y_i z)$: $\ell_i^*(\alpha) = \alpha y_i$ if $-1 \le \alpha y_i \le 0$, and $+\infty$ otherwise
E.g., squared hinge loss $\ell_i(z) = \max(0, 1 - y_i z)^2$: $\ell_i^*(\alpha) = \frac{\alpha^2}{4} + \alpha y_i$ if $\alpha y_i \le 0$, and $+\infty$ otherwise
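For completeness, a short derivation of the hinge-loss conjugate quoted above (added for clarity; it uses $y_i \in \{+1, -1\}$, so substituting $u = y_i z$ is a bijection):

```latex
\ell_i^*(\alpha)
  = \max_{z \in \mathbb{R}} \bigl[\alpha z - \max(0,\, 1 - y_i z)\bigr]
  = \max_{u \in \mathbb{R}} \bigl[\alpha y_i\, u - \max(0,\, 1 - u)\bigr]
  = \begin{cases}
      \alpha y_i & \text{if } -1 \le \alpha y_i \le 0,\\
      +\infty    & \text{otherwise.}
    \end{cases}
```

For $u \le 1$ the objective is $(\alpha y_i + 1)u - 1$, bounded above only if $\alpha y_i \ge -1$; for $u \ge 1$ it is $\alpha y_i u$, bounded only if $\alpha y_i \le 0$. When both hold, the maximum is attained at $u = 1$ and equals $\alpha y_i$.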

The Dual Problem
From the primal problem to the dual problem:
$\min_w \frac{1}{n}\sum_{i=1}^n \ell(\underbrace{w^\top x_i}_{z}, y_i) + \frac{\lambda}{2}\|w\|_2^2$
$= \min_w \frac{1}{n}\sum_{i=1}^n \max_{\alpha_i \in \mathbb{R}}\left[-\alpha_i (w^\top x_i) - \ell_i^*(-\alpha_i)\right] + \frac{\lambda}{2}\|w\|_2^2$
$= \max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\right\|_2^2$
Primal solution: $w_* = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^* x_i$

SDCA (Shalev-Shwartz & Zhang (2013))
Stochastic Dual Coordinate Ascent (liblinear (Hsieh et al., 2008))
Applicable when $R(w)$ is $\lambda$-strongly convex; smoothness is not required
Solve the dual problem: $\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\right\|_2^2$
Sample $i_t \in \{1, \ldots, n\}$ and optimize $\alpha_{i_t}$ while fixing the other coordinates

SDCA (Shalev-Shwartz & Zhang (2013))
Maintain a primal solution: $w_t = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^t x_i$
Optimize the increment $\Delta\alpha_{i_t}$:
$\max_{\Delta\alpha_{i_t} \in \mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\left(-(\alpha_{i_t}^t + \Delta\alpha_{i_t})\right) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\left(\sum_{i=1}^n \alpha_i^t x_i + \Delta\alpha_{i_t} x_{i_t}\right)\right\|_2^2$
$= \max_{\Delta\alpha_{i_t} \in \mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\left(-(\alpha_{i_t}^t + \Delta\alpha_{i_t})\right) - \frac{\lambda}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\alpha_{i_t} x_{i_t}\right\|_2^2$

SDCA (Shalev-Shwartz & Zhang (2013))
Dual coordinate updates:
$\Delta\alpha_{i_t} = \arg\max_{\Delta\alpha_{i_t} \in \mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\left(-(\alpha_{i_t}^t + \Delta\alpha_{i_t})\right) - \frac{\lambda}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\alpha_{i_t} x_{i_t}\right\|_2^2$
$\alpha_{i_t}^{t+1} = \alpha_{i_t}^t + \Delta\alpha_{i_t}$
$w_{t+1} = w_t + \frac{1}{\lambda n}\Delta\alpha_{i_t} x_{i_t}$

SDCA updates
Closed-form solution for $\Delta\alpha_i$: hinge loss, squared hinge loss, absolute loss and square loss (Shalev-Shwartz & Zhang (2013))
E.g., square loss: $\Delta\alpha_i = \frac{y_i - w_t^\top x_i - \alpha_i^t}{1 + \|x_i\|_2^2/(\lambda n)}$
Approximate solution: logistic loss (Shalev-Shwartz & Zhang (2013))
Per-iteration cost: $O(d)$
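A sketch of SDCA for ridge regression with the square loss $\frac{1}{2}(w^\top x_i - y_i)^2$ (the scaling under which the closed-form increment above is exact); maintaining $w = \frac{1}{\lambda n}\sum_i \alpha_i x_i$ keeps each iteration at $O(d)$. Function and variable names are illustrative:

```python
import numpy as np

def sdca_ridge(X, y, lam, T):
    """SDCA for (1/n) sum_i 0.5*(w^T x_i - y_i)^2 + lam/2 ||w||^2,
    using the closed-form coordinate update for the square loss."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                          # primal solution w = (1/(lam*n)) sum_i alpha_i x_i
    rng = np.random.default_rng(0)
    for _ in range(T):
        i = rng.integers(n)
        # Closed-form increment for the square loss (as on the slide).
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + (X[i] @ X[i]) / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)        # keep the primal solution in sync, O(d)
    return w
```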

SDCA
Iteration complexity: non-strongly convex $R$: N.A.*; $\lambda$-strongly convex $R$: non-smooth loss: $O(n + \frac{1}{\lambda\epsilon})$; smooth loss: $O((n + \frac{1}{\lambda})\log\frac{1}{\epsilon})$
Total runtime (smooth loss): $O(d(n + \frac{1}{\lambda})\log\frac{1}{\epsilon})$. The same as SVRG and SAGA!
SDCA is also equivalent to a kind of variance reduction
Proximal variant for the elastic net regularizer
Wang & Lin (2014) show that linear convergence is achievable for non-smooth losses
* A small trick can fix this

SDCA vs. SVRG/SAGA
Advantages of SDCA:
Can handle non-smooth loss functions
Can exploit data sparsity for efficient updates
Parameter free

Randomized Coordinate Updates
Randomized Coordinate Descent
Accelerated Proximal Coordinate Gradient

Randomized Coordinate Updates
$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$
Suppose $d \gg n$: the per-iteration cost $O(d)$ is too high
Sample over features instead of data; the per-iteration cost becomes $O(n)$
Applicable when $\ell(z, y)$ is smooth and $R(w)$ is decomposable; strong convexity is not necessary

Randomized Coordinate Descent (Nesterov (2012))
$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{2}\|Xw - y\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$, with $X = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_d] \in \mathbb{R}^{n \times d}$ (columns are features)
Partial gradient: $\nabla_i F(w) = \bar{x}_i^\top (Xw - y) + \lambda w_i$
Randomly sample $i_t \in \{1, \ldots, d\}$ and update
$w_i^t = w_i^{t-1} - \gamma_t \nabla_i F(w^{t-1})$ if $i = i_t$, and $w_i^t = w_i^{t-1}$ otherwise
Step size $\gamma_t$: constant; $\nabla_i F(w^t)$ can be updated in $O(n)$

Randomized Coordinate Descent (Nesterov (2012))
Maintain and update $u = Xw - y \in \mathbb{R}^n$ in $O(n)$: $u^t = u^{t-1} + \bar{x}_{i_t}(w_{i_t}^t - w_{i_t}^{t-1}) = u^{t-1} + \bar{x}_{i_t}\Delta w_{i_t}$
The partial gradient can then be computed in $O(n)$: $\nabla_i F(w^t) = \bar{x}_i^\top u^t + \lambda w_i^t$
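A sketch of RCD for the ridge-regression objective, maintaining the residual $u = Xw - y$ so that each coordinate update costs $O(n)$; using the coordinate-wise constant $\|\bar{x}_i\|_2^2 + \lambda$ as the inverse step size is an illustrative choice:

```python
import numpy as np

def rcd_ridge(X, y, lam, T):
    """Randomized coordinate descent on F(w) = 0.5*||Xw - y||^2 + lam/2 ||w||^2,
    maintaining the residual u = Xw - y for O(n) updates."""
    n, d = X.shape
    w = np.zeros(d)
    u = -y.astype(float)                     # residual Xw - y with w = 0
    col_sq = (X ** 2).sum(axis=0)            # ||x_bar_i||_2^2 for each feature column
    rng = np.random.default_rng(0)
    for _ in range(T):
        i = rng.integers(d)
        grad_i = X[:, i] @ u + lam * w[i]    # partial gradient, O(n)
        delta = -grad_i / (col_sq[i] + lam)  # coordinate step
        w[i] += delta
        u += delta * X[:, i]                 # O(n) update of the residual
    return w
```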

RCD
Per-iteration cost: $O(n)$
Iteration complexity (smooth loss): non-strongly convex $R$: $O(\frac{d}{\epsilon})$; $\lambda$-strongly convex $R$: $O(\frac{d}{\lambda}\log\frac{1}{\epsilon})$; non-smooth loss: N.A.
Total runtime (strongly convex): $O(\frac{nd}{\lambda}\log\frac{1}{\epsilon})$. The same as the gradient descent method! In practice it can be much faster.

Accelerated Proximal Coordinate Gradient (APCG) (Lin et al., 2014)
$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{2}\|Xw - y\|_2^2 + \frac{\lambda}{2}\|w\|_2^2 + \tau\|w\|_1$
Uses acceleration and a proximal mapping:
$w_i^t = \arg\min_{w_i \in \mathbb{R}} \nabla_i F(v^{t-1}) w_i + \frac{1}{2\gamma_t}(w_i - v_i^{t-1})^2 + \tau|w_i|$ if $i = i_t$, and $w_i^t = w_i^{t-1}$ otherwise
$v^t = w^t + \eta_t(w^t - w^{t-1})$

APCG
Per-iteration cost: $O(n)$
Iteration complexity (smooth loss): non-strongly convex $R$: $O(\frac{d}{\sqrt{\epsilon}})$; $\lambda$-strongly convex $R$: $O(\frac{d}{\sqrt{\lambda}}\log\frac{1}{\epsilon})$; non-smooth loss: N.A.
Total runtime (strongly convex): $O(\frac{nd}{\sqrt{\lambda}}\log\frac{1}{\epsilon})$. The same as APG! In practice it can be much faster.

APCG applied to the Dual
Recall the acceleration scheme for the full gradient method: an auxiliary sequence ($\beta^t$) and a momentum step.
Maintain a primal solution: $w_t = \frac{1}{\lambda n}\sum_{i=1}^n \beta_i^t x_i$
Dual coordinate update: sample $i_t \in \{1, \ldots, n\}$ and solve
$\Delta\beta_{i_t} = \arg\max_{\Delta\beta_{i_t} \in \mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\left(-(\beta_{i_t}^t + \Delta\beta_{i_t})\right) - \frac{\lambda}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\beta_{i_t} x_{i_t}\right\|_2^2$
Momentum step: $\alpha_{i_t}^{t+1} = \beta_{i_t}^t + \Delta\beta_{i_t}$, then $\beta^{t+1} = \alpha^{t+1} + \eta_t(\alpha^{t+1} - \alpha^t)$

APCG applied to the Dual
Per-iteration cost: $O(d)$
Iteration complexity: non-strongly convex $R$: N.A.*; $\lambda$-strongly convex $R$: non-smooth loss: $O(n + \sqrt{\frac{n}{\lambda\epsilon}})$; smooth loss: $O((n + \sqrt{\frac{n}{\lambda}})\log\frac{1}{\epsilon})$
Total runtime (smooth): $O(d(n + \sqrt{\frac{n}{\lambda}})\log\frac{1}{\epsilon})$, which can be faster than SDCA's $O(d(n + \frac{1}{\lambda})\log\frac{1}{\epsilon})$ when $\lambda \le \frac{1}{n}$
* A small trick can fix this

APCG vs. SDCA (Lin et al., 2014)
(Figure: empirical comparison of APCG and SDCA.)

Summary
Non-strongly convex $R(w)$: non-smooth $\ell$: SGD; smooth $\ell$: RCD, APCG, SAGA
Strongly convex $R(w)$: non-smooth $\ell$: SDCA, APCG; smooth $\ell$: RCD, APCG, SDCA, SVRG, SAGA
(Legend on the slide: stochastic gradient methods: primal; randomized coordinate methods: primal; stochastic dual coordinate methods: dual.)

Summary
Parameters: SGD: $\gamma_t$; SVRG: $\gamma_t, m$; SAGA: $\gamma_t$; SDCA: none; APCG: $\eta_t$
(The slide's table also compares the methods on non-smooth vs. smooth losses, strongly vs. non-strongly convex regularizers, and primal vs. dual formulations.)

Trick for generalizing to a non-strongly convex regularizer (Shalev-Shwartz & Zhang, 2012)
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \tau\|w\|_1$
Issue: not strongly convex. Solution: add $\ell_2^2$ regularization:
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \tau\|w\|_1 + \frac{\lambda}{2}\|w\|_2^2$
If $\|w_*\|_2 \le B$, we can set $\lambda = \frac{\epsilon}{B^2}$. An $\epsilon/2$-suboptimal solution of the new problem is $\epsilon$-suboptimal for the original problem.
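A one-line check of that claim (added for clarity): write $\tilde{F}(w) = F(w) + \frac{\lambda}{2}\|w\|_2^2$ for the regularized problem and let $w_*$ minimize the original $F$ with $\|w_*\|_2 \le B$. If $\hat{w}$ is $\epsilon/2$-suboptimal for $\tilde{F}$, then

```latex
F(\hat{w}) - F(w_*)
  \le \tilde{F}(\hat{w}) - \tilde{F}(w_*) + \frac{\lambda}{2}\|w_*\|_2^2
  \le \frac{\epsilon}{2} + \frac{\lambda}{2}B^2
  = \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon
  \qquad \text{for } \lambda = \frac{\epsilon}{B^2},
```

using $F(\hat{w}) \le \tilde{F}(\hat{w})$, $\tilde{F}(w_*) = F(w_*) + \frac{\lambda}{2}\|w_*\|_2^2$, and $\tilde{F}(\hat{w}) - \tilde{F}(w_*) \le \tilde{F}(\hat{w}) - \min_w \tilde{F}(w) \le \epsilon/2$.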

Outline
2 Optimization: (Sub)Gradient Methods; Stochastic Optimization Algorithms for Big Data (Stochastic Optimization, Distributed Optimization)

Big Data and Distributed Optimization
Distributed optimization: data distributed over a cluster of multiple machines
Moving all the data to a single machine suffers from low network bandwidth and limited disk or memory
Communication vs. computation: a RAM access takes about 100 nanoseconds; a standard network round trip takes about 250,000 nanoseconds

Distributed Data
$n$ data points are partitioned and distributed to $K$ machines: $[x_1, x_2, \ldots, x_n] = S_1 \cup S_2 \cup \cdots \cup S_K$
Machine $j$ only has access to $S_j$. W.l.o.g. $|S_j| = n_k = \frac{n}{K}$
(Illustration: partitions $S_1, \ldots, S_6$.)

A Simple Solution: Averaging the Local Solutions
Global problem: $w_* = \arg\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + R(w)$
Machine $j$ solves a local problem: $w_j^* = \arg\min_{w \in \mathbb{R}^d} f_j(w) = \frac{1}{n_k}\sum_{i \in S_j} \ell(w^\top x_i, y_i) + R(w)$
Center computes: $\hat{w} = \frac{1}{K}\sum_{j=1}^K w_j^*$
Issue: $\hat{w}$ will not converge to $w_*$

Total Runtime
Single machine: Total Runtime = Per-iteration Cost $\times$ Iteration Complexity
Distributed optimization: Total Runtime = (Communication Time per Round + Local Runtime per Round) $\times$ Rounds of Communication
Trading computation for communication: increase the local computation per round to reduce the number of communication rounds, balancing communication against computation

Distributed SDCA (DisDCA) (Yang, 2013); CoCoA+ (Ma et al., 2015)
Applicable when $R(w)$ is strongly convex, e.g., $R(w) = \frac{\lambda}{2}\|w\|_2^2$
Global dual problem: $\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\right\|_2^2$
Incremental variables $\Delta\alpha_i$: $\max_{\Delta\alpha} -\frac{1}{n}\sum_i \ell_i^*\left(-(\alpha_i^t + \Delta\alpha_i)\right) - \frac{\lambda}{2}\left\|w_t + \frac{1}{\lambda n}\sum_{i=1}^n \Delta\alpha_i x_i\right\|_2^2$
Primal solution: $w_t = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^t x_i$

DisDCA: Trading Computation for Communication
On machine $j$, for a sampled local example $i_j$:
$\Delta\alpha_{i_j} = \arg\max_{\Delta\alpha_{i_j}} -\ell_{i_j}^*\left(-(\alpha_{i_j}^t + \Delta\alpha_{i_j})\right) - \frac{\lambda n}{2K}\left\|u_j^t + \frac{K}{\lambda n}\Delta\alpha_{i_j} x_{i_j}\right\|_2^2$
$u_j^{t+1} = u_j^t + \frac{K}{\lambda n}\Delta\alpha_{i_j} x_{i_j}$
(Figures: illustration of the local updates and the communication rounds.)

CoCoA+ (Ma et al., 2015)
Machine $j$ approximately solves
$\Delta\alpha_{S_j}^t \approx \arg\max_{\Delta\alpha_{S_j}} -\sum_{i \in S_j}\ell_i^*\left(-(\alpha_i^t + \Delta\alpha_i)\right) - \left\langle w^t, \sum_{i \in S_j}\Delta\alpha_i x_i\right\rangle - \frac{K}{2\lambda n}\left\|\sum_{i \in S_j}\Delta\alpha_i x_i\right\|_2^2$
$\alpha_{S_j}^{t+1} = \alpha_{S_j}^t + \Delta\alpha_{S_j}^t$, and $\Delta w_j^t = \frac{1}{\lambda n}\sum_{i \in S_j}\Delta\alpha_i^t x_i$
Center computes: $w^{t+1} = w^t + \sum_{j=1}^K \Delta w_j^t$

CoCoA+ (Ma et al., 2015)
Local objective:
$G_j(\Delta\alpha_{S_j}, w^t) = -\frac{1}{n}\sum_{i \in S_j}\ell_i^*\left(-(\alpha_i^t + \Delta\alpha_i)\right) - \frac{1}{n}\left\langle w^t, \sum_{i \in S_j}\Delta\alpha_i x_i\right\rangle - \frac{K}{2\lambda n^2}\left\|\sum_{i \in S_j}\Delta\alpha_i x_i\right\|_2^2$
Solve for $\Delta\alpha_{S_j}^t$ by any local solver as long as, for some $0 < \Theta < 1$,
$\max_{\Delta\alpha_{S_j}} G_j(\Delta\alpha_{S_j}, w^t) - G_j(\Delta\alpha_{S_j}^t, w^t) \le \Theta\left(\max_{\Delta\alpha_{S_j}} G_j(\Delta\alpha_{S_j}, w^t) - G_j(0, w^t)\right)$
CoCoA+ is equivalent to DisDCA when SDCA is employed to solve the local problems with $m$ iterations

Distributed SDCA in Practice
Choice of $m$ (the number of inner iterations): the larger $m$, the higher the local computation cost and the lower the communication cost
Choice of $K$ (the number of machines): the larger $K$, the lower the local computation cost and the higher the communication cost

DisDCA is implemented: http://cs.uiowa.edu/~tyng/software.html
Classification and regression losses:
1 Hinge loss and squared hinge loss (SVM)
2 Logistic loss (Logistic Regression)
3 Square loss (Ridge Regression/LASSO)
Regularizers:
1 $\ell_2$ norm
2 Mixture of $\ell_1$ norm and $\ell_2$ norm
Multi-class: one-vs-all

Alternating Direction Method of Multipliers (ADMM)
$\min_{w \in \mathbb{R}^d} F(w) = \sum_{k=1}^K \underbrace{\frac{1}{n}\sum_{i \in S_k}\ell(w^\top x_i, y_i)}_{f_k(w)} + \frac{\lambda}{2}\|w\|_2^2$
Each $f_k(w)$ lives on an individual machine, but the variables $w$ are coupled. Decouple them:
$\min_{w_1, \ldots, w_K, w \in \mathbb{R}^d} F(w) = \sum_{k=1}^K f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2$ s.t. $w_k = w$, $k = 1, \ldots, K$

Optimization Stochastic Optimization Algorithms for Big Data The Augmented Lagrangian Function K min f k (w k ) + λ w 1,...,w K,w R d 2 w 2 2 k=1 Lagrangian Multipliers s.t. w k = w, k = 1,..., K The Augmented Lagrangian function L ({w k }, {z k }, w) K = f k (w k ) + λ 2 w 2 2 + k=1 is the Lagrangian function of K f k (w k ) + λ 2 w 2 2 + ρ 2 min w 1,...,w K,w R d s.t. k=1 K z k (w k w) + ρ K w k w 2 2 2 k=1 k=1 w k = w, k = 1,..., K K w k w 2 2 k=1 Yang Tutorial for ACML 15 Nov. 20, 2015 110 / 210

Optimization Stochastic Optimization Algorithms for Big Data ADMM
$L(\{w_k\},\{z_k\},w) = \sum_{k=1}^K f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 + \sum_{k=1}^K z_k^\top(w_k - w) + \frac{\rho}{2}\sum_{k=1}^K \|w_k - w\|_2^2$
One round updates from $(w_k^t, z_k^t, w^t)$ to $(w_k^{t+1}, z_k^{t+1}, w^{t+1})$:
Optimize on individual machines:
$w_k^{t+1} = \arg\min_{w_k} f_k(w_k) + (z_k^t)^\top(w_k - w^t) + \frac{\rho}{2}\|w_k - w^t\|_2^2, \quad k = 1,\ldots,K$
Aggregate and update on one machine:
$w^{t+1} = \arg\min_w \frac{\lambda}{2}\|w\|_2^2 + \sum_{k=1}^K (z_k^t)^\top(w_k^{t+1} - w) + \frac{\rho}{2}\sum_{k=1}^K \|w_k^{t+1} - w\|_2^2$
Update on individual machines:
$z_k^{t+1} = z_k^t + \rho(w_k^{t+1} - w^{t+1})$
Yang Tutorial for ACML 15 Nov. 20, 2015 111 / 210
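A minimal sketch of one consensus-ADMM round as written above. The local minimization is passed in as a placeholder f_argmin (SDCA, L-BFGS, or anything else), since the slides leave the local solver open; the function names and the sequential loop over machines are assumptions made for a single-process illustration.

    import numpy as np

    def consensus_admm(f_argmin, K, d, lam, rho, iters):
        # f_argmin(k, w, z_k, rho) should return
        #   argmin_{w_k} f_k(w_k) + z_k^T (w_k - w) + rho/2 ||w_k - w||_2^2
        W = np.zeros((K, d))      # local primal variables w_k
        Z = np.zeros((K, d))      # multipliers z_k
        w = np.zeros(d)           # consensus variable
        for _ in range(iters):
            for k in range(K):    # runs in parallel in a real deployment
                W[k] = f_argmin(k, w, Z[k], rho)
            # closed-form w-update of lam/2 ||w||^2 + sum_k z_k^T(w_k - w) + rho/2 sum_k ||w_k - w||^2
            w = (rho * W.sum(axis=0) + Z.sum(axis=0)) / (lam + rho * K)
            for k in range(K):
                Z[k] = Z[k] + rho * (W[k] - w)
        return w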

Optimization Stochastic Optimization Algorithms for Big Data ADMM
$w_k^{t+1} = \arg\min_{w_k} f_k(w_k) + (z_k^t)^\top(w_k - w^t) + \frac{\rho}{2}\|w_k - w^t\|_2^2, \quad k = 1,\ldots,K$
Each local problem can be solved by a local solver (e.g., SDCA)
Optimization can be inexact (trading computation for communication)
Yang Tutorial for ACML 15 Nov. 20, 2015 112 / 210

Optimization Stochastic Optimization Algorithms for Big Data Complexity of ADMM
Assume local problems are solved exactly.
Communication complexity: $O\big(\log\frac{1}{\epsilon}\big)$, due to the strong convexity of R(w)
Applicable to a non-strongly convex regularizer $R(w) = \|w\|_1$:
$\min_{w\in\mathbb{R}^d} F(w) = \sum_{k=1}^K \frac{1}{n}\sum_{i\in S_k}\ell(w^\top x_i, y_i) + \tau\|w\|_1$
Communication complexity: $O\big(\frac{1}{\epsilon}\big)$
Yang Tutorial for ACML 15 Nov. 20, 2015 113 / 210

Optimization Stochastic Optimization Algorithms for Big Data Thank You! Questions? Yang Tutorial for ACML 15 Nov. 20, 2015 114 / 210

Optimization Stochastic Optimization Algorithms for Big Data Research Assistant Positions Available for PhD Candidates! Start Fall 16 Optimization and Randomization Online Learning Deep Learning Machine Learning send email to tianbao-yang@uiowa.edu Yang Tutorial for ACML 15 Nov. 20, 2015 115 / 210

Randomized Dimension Reduction Big Data Analytics: Optimization and Randomization Part III: Randomization Yang Tutorial for ACML 15 Nov. 20, 2015 116 / 210

Randomized Dimension Reduction Outline 1 Basics 2 Optimization 3 Randomized Dimension Reduction 4 Randomized Algorithms 5 Concluding Remarks Yang Tutorial for ACML 15 Nov. 20, 2015 117 / 210

Randomized Dimension Reduction Random Sketch Approximate a large data matrix by a much smaller sketch Yang Tutorial for ACML 15 Nov. 20, 2015 118 / 210

Randomized Dimension Reduction The Framework of Randomized Algorithms Yang Tutorial for ACML 15 Nov. 20, 2015 119 / 210

Randomized Dimension Reduction The Framework of Randomized Algorithms Yang Tutorial for ACML 15 Nov. 20, 2015 120 / 210

Randomized Dimension Reduction The Framework of Randomized Algorithms Yang Tutorial for ACML 15 Nov. 20, 2015 121 / 210

Randomized Dimension Reduction The Framework of Randomized Algorithms Yang Tutorial for ACML 15 Nov. 20, 2015 122 / 210

Randomized Dimension Reduction Why randomized dimension reduction? Efficient Robust (e.g., dropout) Formal Guarantees Can explore parallel algorithms Yang Tutorial for ACML 15 Nov. 20, 2015 123 / 210

Randomized Dimension Reduction Randomized Dimension Reduction Johnson-Lindenstrauss (JL) transforms Subspace embeddings Column sampling Yang Tutorial for ACML 15 Nov. 20, 2015 124 / 210

Randomized Dimension Reduction JL Lemma
JL Lemma (Johnson & Lindenstrauss, 1984): For any $0 < \epsilon, \delta < 1/2$, there exists a probability distribution on $m\times d$ real matrices A and a small universal constant $c > 0$ such that, for any fixed $x\in\mathbb{R}^d$, with probability at least $1-\delta$,
$\big|\|Ax\|_2^2 - \|x\|_2^2\big| \le c\sqrt{\frac{\log(1/\delta)}{m}}\,\|x\|_2^2$
or, for $m = \Theta(\epsilon^{-2}\log(1/\delta))$, with probability at least $1-\delta$,
$\big|\|Ax\|_2^2 - \|x\|_2^2\big| \le \epsilon\,\|x\|_2^2$
Yang Tutorial for ACML 15 Nov. 20, 2015 125 / 210

Randomized Dimension Reduction Embedding a set of points into low dimensional space
Given a set of points $x_1,\ldots,x_n\in\mathbb{R}^d$, we can embed them into a low dimensional space as $Ax_1,\ldots,Ax_n\in\mathbb{R}^m$ such that the pairwise distance between any two points is well preserved:
$\|Ax_i - Ax_j\|_2^2 = \|A(x_i - x_j)\|_2^2 \le (1+\epsilon)\|x_i - x_j\|_2^2$
$\|Ax_i - Ax_j\|_2^2 = \|A(x_i - x_j)\|_2^2 \ge (1-\epsilon)\|x_i - x_j\|_2^2$
In other words, to preserve all pairwise Euclidean distances up to $1\pm\epsilon$, only $m = \Theta(\epsilon^{-2}\log(n^2/\delta))$ dimensions are needed
Yang Tutorial for ACML 15 Nov. 20, 2015 126 / 210

Randomized Dimension Reduction JL transforms: Gaussian Random Projection
Gaussian Random Projection (Dasgupta & Gupta, 2003): $A\in\mathbb{R}^{m\times d}$ with $A_{ij}\sim\mathcal{N}(0, 1/m)$ and $m = \Theta(\epsilon^{-2}\log(1/\delta))$
Computational cost of AX, where $X\in\mathbb{R}^{d\times n}$: $mnd$ for dense matrices, $\mathrm{nnz}(X)\,m$ for sparse matrices
The computational cost is very high (it can be as costly as solving many learning problems)
Yang Tutorial for ACML 15 Nov. 20, 2015 127 / 210
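A small numpy demonstration of the Gaussian JL transform above; the sizes and the random seed are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, m = 10000, 500, 800                        # illustrative sizes only
    X = rng.standard_normal((d, n))

    A = rng.standard_normal((m, d)) / np.sqrt(m)     # A_ij ~ N(0, 1/m)
    AX = A @ X                                       # costs O(mnd) for dense X

    # JL property on one column: ||Ax||^2 / ||x||^2 should be close to 1
    x = X[:, 0]
    print(np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2)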

Randomized Dimension Reduction Accelerate JL transforms: using discrete distributions
Using discrete distributions (Achlioptas, 2003):
$\Pr\big(A_{ij} = \pm\tfrac{1}{\sqrt{m}}\big) = 0.5$, or
$\Pr\big(A_{ij} = \pm\sqrt{\tfrac{3}{m}}\big) = \tfrac{1}{6}, \quad \Pr(A_{ij} = 0) = \tfrac{2}{3}$
Database friendly: replaces multiplications by additions and subtractions
Yang Tutorial for ACML 15 Nov. 20, 2015 128 / 210
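A sketch of the database-friendly construction above; the function name and the option flag are assumptions, and the scaling follows the reconstructed probabilities (entries in {±√(3/m), 0} or ±1/√m).

    import numpy as np

    def achlioptas_matrix(d, m, sparse=True, rng=None):
        # Discrete-distribution JL matrix: multiplications by A reduce to signed
        # additions of (a subset of) the coordinates.
        rng = rng or np.random.default_rng()
        if sparse:
            vals = [np.sqrt(3.0 / m), 0.0, -np.sqrt(3.0 / m)]
            return rng.choice(vals, size=(m, d), p=[1 / 6, 2 / 3, 1 / 6])
        return rng.choice([1.0, -1.0], size=(m, d)) / np.sqrt(m)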

Randomized Dimension Reduction Accelerate JL transforms: using the Hadamard transform (I)
Fast JL transform based on the randomized Hadamard transform.
Motivation: can we simply use a random sampling matrix $P\in\mathbb{R}^{m\times d}$ that randomly selects m coordinates out of d coordinates (scaled by $\sqrt{d/m}$)?
Unfortunately, by the Chernoff bound,
$\big|\|Px\|_2^2 - \|x\|_2^2\big| \le \frac{\sqrt{d}\,\|x\|_\infty}{\|x\|_2}\sqrt{\frac{3\log(2/\delta)}{m}}\,\|x\|_2^2$
Unless $\frac{\sqrt{d}\,\|x\|_\infty}{\|x\|_2} \le c$, random sampling does not work
The remedy is given by the randomized Hadamard transform
Yang Tutorial for ACML 15 Nov. 20, 2015 129 / 210

Randomized Dimension Reduction Randomized Hadamard transform
Hadamard transform $H\in\mathbb{R}^{d\times d}$: $H = \frac{1}{\sqrt{d}}H_{2^k}$, where
$H_1 = [1], \quad H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1\end{bmatrix}, \quad H_{2^k} = \begin{bmatrix} H_{2^{k-1}} & H_{2^{k-1}} \\ H_{2^{k-1}} & -H_{2^{k-1}}\end{bmatrix}$
$\|Hx\|_2 = \|x\|_2$ and H is orthogonal; computational cost of Hx: $d\log(d)$
Randomized Hadamard transform: HD, where $D\in\mathbb{R}^{d\times d}$ is a diagonal matrix with $\Pr(D_{ii} = \pm 1) = 0.5$
HD is orthogonal and $\|HDx\|_2 = \|x\|_2$
Key property: $\frac{\sqrt{d}\,\|HDx\|_\infty}{\|HDx\|_2} \le \sqrt{\log(d/\delta)}$ w.h.p. $1-\delta$
Yang Tutorial for ACML 15 Nov. 20, 2015 130 / 210

Randomized Dimension Reduction Accelerate JL transforms: using the Hadamard transform (I)
Fast JL transform based on the randomized Hadamard transform (Tropp, 2011): $A = \sqrt{\frac{d}{m}}\,PHD$ yields
$\big|\|Ax\|_2^2 - \|x\|_2^2\big| \le \sqrt{\frac{3\log(2/\delta)\log(d/\delta)}{m}}\,\|x\|_2^2$
$m = \Theta(\epsilon^{-2}\log(1/\delta)\log(d/\delta))$ suffices for $1\pm\epsilon$; the additional factor $\log(d/\delta)$ can be removed
Computational cost of AX: $O(nd\log(m))$
Yang Tutorial for ACML 15 Nov. 20, 2015 131 / 210
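A sketch of the subsampled randomized Hadamard transform A = √(d/m) PHD above, using a textbook in-place fast Walsh–Hadamard transform; d is assumed to be a power of two, and the function names are illustrative.

    import numpy as np

    def fwht(x):
        # Unnormalized fast Walsh-Hadamard transform, O(d log d); len(x) must be 2^k.
        x = x.copy()
        d, h = len(x), 1
        while h < d:
            for i in range(0, d, 2 * h):
                a = x[i:i + h].copy()
                b = x[i + h:i + 2 * h].copy()
                x[i:i + h] = a + b
                x[i + h:i + 2 * h] = a - b
            h *= 2
        return x

    def srht(x, m, rng):
        # A x = sqrt(d/m) * P H D x: random signs, Hadamard mixing, uniform subsampling.
        d = len(x)
        signs = rng.choice([1.0, -1.0], size=d)          # diagonal of D
        z = fwht(signs * x) / np.sqrt(d)                 # H D x with H = H_{2^k} / sqrt(d)
        idx = rng.choice(d, size=m, replace=False)       # sampling matrix P
        return np.sqrt(d / m) * z[idx]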

Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (I)
Random hashing (Dasgupta et al., 2010): $A = HD$, where $D\in\mathbb{R}^{d\times d}$ and $H\in\mathbb{R}^{m\times d}$
Random hashing: $h(j): \{1,\ldots,d\}\rightarrow\{1,\ldots,m\}$ with $H_{ij} = 1$ if $h(j) = i$, so H is a sparse matrix (each column has only one non-zero entry)
$D\in\mathbb{R}^{d\times d}$: a diagonal matrix with $\Pr(D_{ii} = \pm 1) = 0.5$
$[Ax]_j = \sum_{i: h(i) = j} x_i D_{ii}$
Technically speaking, random hashing does not satisfy the JL lemma
Yang Tutorial for ACML 15 Nov. 20, 2015 132 / 210
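A sketch of random hashing applied column-wise to a data matrix, following the construction above; the function name and the scatter-add implementation are assumptions.

    import numpy as np

    def random_hashing(X, m, rng):
        # A = H D applied to the columns of X (d x n): coordinate j goes to bucket
        # h(j) with sign D_jj, so [Ax]_b = sum_{j: h(j)=b} D_jj x_j.  Costs O(nnz(X)).
        d, n = X.shape
        h = rng.integers(0, m, size=d)        # hash function h: {1,...,d} -> {1,...,m}
        s = rng.choice([1.0, -1.0], size=d)   # diagonal of D
        AX = np.zeros((m, n))
        np.add.at(AX, h, s[:, None] * X)      # scatter-add each signed row into its bucket
        return AX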

Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (I)
Key properties: $\mathbb{E}[\langle HDx_1, HDx_2\rangle] = \langle x_1, x_2\rangle$, and norm preserving,
$\big|\|HDx\|_2^2 - \|x\|_2^2\big| \le \epsilon\|x\|_2^2$, only when $\frac{\|x\|_\infty}{\|x\|_2} \le \frac{1}{\sqrt{c}}$
Remedy: apply randomized Hadamard transforms P first; $\Theta(c\log(c/\delta))$ blocks of randomized Hadamard transforms give $\frac{\|Px\|_\infty}{\|Px\|_2} \le \frac{1}{\sqrt{c}}$
Yang Tutorial for ACML 15 Nov. 20, 2015 133 / 210

Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (II)
Sparse JL transform based on block random hashing (Kane & Nelson, 2014):
$A = \big[\tfrac{1}{\sqrt{s}}Q_1^\top, \ldots, \tfrac{1}{\sqrt{s}}Q_s^\top\big]^\top$, where each $Q_i\in\mathbb{R}^{v\times d}$ is an independent random hashing (HD) matrix
Set $v = \Theta(\epsilon^{-1})$ and $s = \Theta(\epsilon^{-1}\log(1/\delta))$
Computational cost of AX: $O\big(\mathrm{nnz}(X)\,\epsilon^{-1}\log\tfrac{1}{\delta}\big)$
Yang Tutorial for ACML 15 Nov. 20, 2015 134 / 210

Randomized Dimension Reduction Randomized Dimension Reduction Johnson-Lindenstrauss (JL) transforms Subspace embeddings Column sampling Yang Tutorial for ACML 15 Nov. 20, 2015 135 / 210

Randomized Dimension Reduction Subspace Embeddings
Definition: a subspace embedding, given parameters $0 < \epsilon, \delta < 1$ and $k \le d$, is a distribution D over matrices $A\in\mathbb{R}^{m\times d}$ such that for any fixed linear subspace $W\subseteq\mathbb{R}^d$ with $\dim(W) = k$,
$\Pr_{A\sim D}\big(\forall x\in W,\ \|Ax\|_2 \in (1\pm\epsilon)\|x\|_2\big) \ge 1-\delta$
It implies: if $U\in\mathbb{R}^{d\times k}$ is an orthogonal matrix (contains the orthonormal bases of W), then $AU\in\mathbb{R}^{m\times k}$ is of full column rank, its singular values lie in $(1\pm\epsilon)$, and $(1-\epsilon)^2 I \preceq U^\top A^\top AU \preceq (1+\epsilon)^2 I$
These are key properties in the theoretical analysis of many algorithms (e.g., low-rank matrix approximation, randomized least-squares regression, randomized classification)
Yang Tutorial for ACML 15 Nov. 20, 2015 136 / 210

Randomized Dimension Reduction Subspace Embeddings
From a JL transform to a subspace embedding (Sarlós, 2006). Let $A\in\mathbb{R}^{m\times d}$ be a JL transform. If
$m = O\Big(\frac{k\log\frac{k}{\delta\epsilon}}{\epsilon^2}\Big)$
then w.h.p. $1-\delta$, $A\in\mathbb{R}^{m\times d}$ is a subspace embedding w.r.t. a k-dimensional subspace of $\mathbb{R}^d$
Yang Tutorial for ACML 15 Nov. 20, 2015 137 / 210

Randomized Dimension Reduction Subspace Embeddings
Making block random hashing a subspace embedding (Nelson & Nguyen, 2013): $A = \big[\tfrac{1}{\sqrt{s}}Q_1^\top, \ldots, \tfrac{1}{\sqrt{s}}Q_s^\top\big]^\top$, where each $Q_i\in\mathbb{R}^{v\times d}$ is an independent random hashing (HD) matrix
Set $v = \Theta(k\epsilon^{-1}\log^5(k/\delta))$ and $s = \Theta(\epsilon^{-1}\log^3(k/\delta))$
W.h.p. $1-\delta$, $A\in\mathbb{R}^{m\times d}$ with $m = \Theta\big(\frac{k\log^8(k/\delta)}{\epsilon^2}\big)$ is a subspace embedding w.r.t. a k-dimensional subspace of $\mathbb{R}^d$
Computational cost of AX: $O\big(\mathrm{nnz}(X)\,\epsilon^{-1}\log^3\tfrac{k}{\delta}\big)$
Yang Tutorial for ACML 15 Nov. 20, 2015 138 / 210

Randomized Dimension Reduction Sparse Subspace Embedding (SSE)
Random hashing is an SSE with a constant probability (Nelson & Nguyen, 2013): $A = HD$, where $D\in\mathbb{R}^{d\times d}$ and $H\in\mathbb{R}^{m\times d}$
$m = \Omega(k^2/\epsilon^2)$ suffices for a subspace embedding with probability 2/3
Computational cost of AX: $O(\mathrm{nnz}(X))$
Yang Tutorial for ACML 15 Nov. 20, 2015 139 / 210

Randomized Dimension Reduction Randomized Dimensionality Reduction Johnson-Lindenstrauss (JL) transforms Subspace embeddings Column (Row) sampling Yang Tutorial for ACML 15 Nov. 20, 2015 140 / 210

Randomized Dimension Reduction Column sampling Column subset selection (feature selection) More interpretable Uniform sampling usually does not work (not a JL transform) Non-oblivious sampling (data-dependent sampling) leverage-score sampling Yang Tutorial for ACML 15 Nov. 20, 2015 141 / 210

Randomized Dimension Reduction Leverage-score sampling (Drineas et al., 2006)
Let $X\in\mathbb{R}^{d\times n}$ be a rank-k matrix with $X = U\Sigma V^\top$, $U\in\mathbb{R}^{d\times k}$, $\Sigma\in\mathbb{R}^{k\times k}$
Leverage scores: $\|U_{i*}\|_2^2$, $i = 1,\ldots,d$
Let $p_i = \frac{\|U_{i*}\|_2^2}{\sum_{i'=1}^d \|U_{i'*}\|_2^2}$, $i = 1,\ldots,d$
Let $i_1,\ldots,i_m\in\{1,\ldots,d\}$ denote m indices sampled according to $p_i$
Let $A\in\mathbb{R}^{m\times d}$ be the sampling-and-rescaling matrix: $A_{jl} = \frac{1}{\sqrt{m p_{i_j}}}$ if $l = i_j$, and 0 otherwise
$AX\in\mathbb{R}^{m\times n}$ is a small sketch of X
Yang Tutorial for ACML 15 Nov. 20, 2015 142 / 210
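A sketch of leverage-score row sampling as described above. It computes an exact top-k SVD, which is what the slides flag as expensive; in practice one would plug in approximate leverage scores. The function name and the use of sampling with replacement are assumptions.

    import numpy as np

    def leverage_score_sketch(X, k, m, rng):
        # X is d x n; sample m rows with probability proportional to their
        # leverage scores and rescale, returning the m x n sketch AX.
        U, _, _ = np.linalg.svd(X, full_matrices=False)
        U = U[:, :k]
        lev = np.sum(U ** 2, axis=1)              # leverage scores ||U_{i*}||_2^2
        p = lev / lev.sum()
        idx = rng.choice(X.shape[0], size=m, replace=True, p=p)
        scale = 1.0 / np.sqrt(m * p[idx])         # sampling-and-rescaling
        return scale[:, None] * X[idx, :]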

Randomized Dimension Reduction Properties of Leverage-score sampling
When $m = \Theta\big(\frac{k}{\epsilon^2}\log\frac{2k}{\delta}\big)$, w.h.p. $1-\delta$, $AU\in\mathbb{R}^{m\times k}$ has full column rank and
$1-\epsilon \le \sigma_i(AU) \le 1+\epsilon, \qquad (1-\epsilon)^2 \le \sigma_i^2(AU) \le (1+\epsilon)^2$
Leverage-score sampling performs like a subspace embedding (only for U, the top singular vector matrix of X)
Computational cost: computing the top-k SVD of X is expensive; randomized algorithms can compute approximate leverage scores
Yang Tutorial for ACML 15 Nov. 20, 2015 143 / 210

Randomized Dimension Reduction When does uniform sampling make sense?
Coherence measure: $\mu_k = \frac{d}{k}\max_{1\le i\le d}\|U_{i*}\|_2^2$
Valid when the coherence measure is small (some real data mining datasets have small coherence measures)
The Nyström method usually uses uniform sampling (Gittens, 2011)
Yang Tutorial for ACML 15 Nov. 20, 2015 144 / 210

Randomized Algorithms Randomized Classification (Regression) Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang Tutorial for ACML 15 Nov. 20, 2015 145 / 210

Randomized Algorithms Randomized Classification (Regression) Classification
Classification problems:
$\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$
$y_i\in\{+1,-1\}$: label; loss function $\ell(z)$ with $z = y\,w^\top x$
1. SVMs: (squared) hinge loss $\ell(z) = \max(0, 1-z)^p$, where $p = 1, 2$
2. Logistic Regression: $\ell(z) = \log(1 + \exp(-z))$
Yang Tutorial for ACML 15 Nov. 20, 2015 146 / 210

Randomized Algorithms Randomized Classification (Regression) Randomized Classification
For large-scale high-dimensional problems, the computational cost of optimization is $O\big(d(n + \kappa)\log(1/\epsilon)\big)$
Use a random reduction $A\in\mathbb{R}^{d\times m}$ ($m \ll d$) to reduce $X\in\mathbb{R}^{n\times d}$ to $\widehat{X} = XA\in\mathbb{R}^{n\times m}$, where A can be a JL transform or a sparse subspace embedding. Then solve
$\min_{u\in\mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell(y_i u^\top \widehat{x}_i) + \frac{\lambda}{2}\|u\|_2^2$
Yang Tutorial for ACML 15 Nov. 20, 2015 147 / 210
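A sketch of the reduce-then-solve recipe above with a Gaussian JL matrix (any of the sketches from the previous section would do); the solver argument is a placeholder for whatever regularized linear trainer is used, and all names are illustrative.

    import numpy as np

    def randomized_linear_classifier(X, y, m, lam, solver, rng):
        # Compress X (n x d) to hat{X} = X A (n x m) and train in the reduced space.
        n, d = X.shape
        A = rng.standard_normal((d, m)) / np.sqrt(m)
        X_hat = X @ A
        u = solver(X_hat, y, lam)   # minimizes 1/n sum l(y_i u^T hat{x}_i) + lam/2 ||u||^2
        return u, A                 # predict a new x with sign(u^T (A^T x))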

Randomized Algorithms Randomized Classification (Regression) Randomized Classification
Two questions:
1. Is there any performance guarantee?
  The margin is preserved if the data is linearly separable (Balcan et al., 2006), as long as $m \ge \frac{12}{\epsilon}\log\frac{6m^2}{\delta}$
  The generalization performance is preserved if the data matrix is of low rank and $m = \Omega\big(\frac{k\,\mathrm{poly}(\log(k/(\delta\epsilon)))}{\epsilon^2}\big)$ (Paul et al., 2013)
2. How to recover an accurate model in the original high-dimensional space?
  Dual Recovery (Zhang et al., 2014) and Dual Sparse Recovery (Yang et al., 2015)
Yang Tutorial for ACML 15 Nov. 20, 2015 148 / 210

Randomized Algorithms Randomized Classification (Regression) The Dual Problem
Using the Fenchel conjugate $\ell_i^*(\alpha_i) = \max_{z} \alpha_i z - \ell(z, y_i)$:
Primal: $w_* = \arg\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$
Dual: $\alpha_* = \arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top XX^\top\alpha$
From dual to primal: $w_* = \frac{1}{\lambda n}X^\top\alpha_*$
Yang Tutorial for ACML 15 Nov. 20, 2015 149 / 210

Randomized Algorithms Randomized Classification (Regression) Dual Recovery for Randomized Reduction
From the dual formulation: $w_*$ lies in the row space of the data matrix $X\in\mathbb{R}^{n\times d}$
Dual Recovery: $\widehat{w} = \frac{1}{\lambda n}X^\top\widehat{\alpha}$, where
$\widehat{\alpha} = \arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widehat{X}\widehat{X}^\top\alpha$ and $\widehat{X} = XA\in\mathbb{R}^{n\times m}$
Subspace embedding A with $m = \Theta(r\log(r/\delta)\epsilon^{-2})$
Guarantee: under a low-rank assumption on the data matrix X (e.g., $\mathrm{rank}(X) = r$), with high probability $1-\delta$,
$\|\widehat{w} - w_*\|_2 \le \frac{\epsilon}{1-\epsilon}\|w_*\|_2$
Yang Tutorial for ACML 15 Nov. 20, 2015 150 / 210
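A concrete sketch of dual recovery for the square loss, where the reduced dual has a closed form (that specialization is an assumption for illustration; for the hinge or logistic loss the reduced dual would be solved by SDCA or a similar solver before the same recovery step).

    import numpy as np

    def dual_recovery_ridge(X, A, y, lam):
        # Solve the reduced dual with hat{X} = X A, then map back through the
        # original data: hat{w} = X^T hat{alpha} / (lam * n).
        n = X.shape[0]
        X_hat = X @ A                                   # n x m reduced data
        K_hat = X_hat @ X_hat.T                         # hat{X} hat{X}^T
        alpha_hat = lam * n * np.linalg.solve(K_hat + lam * n * np.eye(n), y)
        return X.T @ alpha_hat / (lam * n)              # recovered high-dimensional model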

Randomized Algorithms Randomized Classification (Regression) Dual Sparse Recovery for Randomized Reduction
Assume the optimal dual solution $\alpha_*$ is sparse (i.e., the number of support vectors is small)
Dual Sparse Recovery: $\widehat{w} = \frac{1}{\lambda n}X^\top\widehat{\alpha}$, where
$\widehat{\alpha} = \arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widehat{X}\widehat{X}^\top\alpha - \frac{\tau}{n}\|\alpha\|_1$ and $\widehat{X} = XA\in\mathbb{R}^{n\times m}$
JL transform A with $m = \Theta(s\log(n/\delta)\epsilon^{-2})$
Guarantee: if $\alpha_*$ is s-sparse, with high probability $1-\delta$, $\|\widehat{w} - w_*\|_2 \le \epsilon\|w_*\|_2$
Yang Tutorial for ACML 15 Nov. 20, 2015 151 / 210

Randomized Algorithms Randomized Classification (Regression) Dual Sparse Recovery
RCV1 text data, n = 677,399 and d = 47,236. [Figure: relative dual error and relative primal error (ℓ2 norm) versus τ, for λ = 0.001 and m ∈ {1024, 2048, 4096, 8192}.]
Yang Tutorial for ACML 15 Nov. 20, 2015 152 / 210

Randomized Algorithms Randomized Least-Squares Regression Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang Tutorial for ACML 15 Nov. 20, 2015 153 / 210

Randomized Algorithms Randomized Least-Squares Regression Least-squares regression
Let $X\in\mathbb{R}^{n\times d}$ with $d \ll n$ and $b\in\mathbb{R}^n$. The least-squares regression problem is to find
$w_* = \arg\min_{w\in\mathbb{R}^d}\|Xw - b\|_2$
Computational cost: $O(nd^2)$; goal of randomized algorithms: $o(nd^2)$
Yang Tutorial for ACML 15 Nov. 20, 2015 154 / 210

Randomized Algorithms Randomized Least-Squares Regression Randomized Least-squares regression
Let $A\in\mathbb{R}^{m\times n}$ be a random reduction matrix. Solve
$\widehat{w} = \arg\min_{w\in\mathbb{R}^d}\|A(Xw - b)\|_2 = \arg\min_{w\in\mathbb{R}^d}\|AXw - Ab\|_2$
Computational cost: $O(md^2)$ + reduction time
Yang Tutorial for ACML 15 Nov. 20, 2015 155 / 210
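A minimal sketch-and-solve demonstration of the reduced problem above; a dense Gaussian A is used only for clarity (an SRHT or sparse subspace embedding would make the reduction itself cheap), and the sizes are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, m = 20000, 50, 1000                      # illustrative sizes, d << n, m >> d
    X = rng.standard_normal((n, d))
    b = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

    A = rng.standard_normal((m, n)) / np.sqrt(m)   # reduction matrix
    w_hat, *_ = np.linalg.lstsq(A @ X, A @ b, rcond=None)
    w_star, *_ = np.linalg.lstsq(X, b, rcond=None)

    # residual ratio should be close to 1 + eps
    print(np.linalg.norm(X @ w_hat - b) / np.linalg.norm(X @ w_star - b))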

Randomized Algorithms Randomized Least-Squares Regression Randomized Least-squares regression
Theoretical guarantees (Sarlós, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012):
$\|X\widehat{w} - b\|_2 \le (1+\epsilon)\|Xw_* - b\|_2$
Total time: $O(\mathrm{nnz}(X) + d^3\log(d/\epsilon)\epsilon^{-2})$
Yang Tutorial for ACML 15 Nov. 20, 2015 156 / 210

Randomized Algorithms Randomized K-means Clustering Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang Tutorial for ACML 15 Nov. 20, 2015 157 / 210

Randomized Algorithms Randomized K-means Clustering K-means Clustering
Let $x_1,\ldots,x_n\in\mathbb{R}^d$ be a set of data points. K-means clustering aims to solve
$\min_{C_1,\ldots,C_k}\sum_{j=1}^k\sum_{x_i\in C_j}\|x_i - \mu_j\|_2^2$
Computational cost: $O(ndkt)$, where t is the number of iterations
Yang Tutorial for ACML 15 Nov. 20, 2015 158 / 210

Randomized Algorithms Randomized K-means Clustering Randomized Algorithms for K-means Clustering
Let $X = (x_1,\ldots,x_n)^\top\in\mathbb{R}^{n\times d}$ be the data matrix (high-dimensional data)
Random sketch: $\widehat{X} = XA\in\mathbb{R}^{n\times m}$, $m \ll d$
Approximate K-means: $\min_{C_1,\ldots,C_k}\sum_{j=1}^k\sum_{\widehat{x}_i\in C_j}\|\widehat{x}_i - \widehat{\mu}_j\|_2^2$
Yang Tutorial for ACML 15 Nov. 20, 2015 159 / 210
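A sketch of random-sketch K-means as described above; using scikit-learn's KMeans on the sketched points is an implementation choice for this illustration, not part of the tutorial.

    import numpy as np
    from sklearn.cluster import KMeans

    def randomized_kmeans(X, k, m, rng):
        # Project the rows of X (n x d) to m dimensions with a JL matrix and
        # cluster the sketched points; any JL / sparse embedding works here.
        n, d = X.shape
        A = rng.standard_normal((d, m)) / np.sqrt(m)
        X_hat = X @ A                          # n x m sketched data
        return KMeans(n_clusters=k, n_init=10).fit_predict(X_hat)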

Randomized Algorithms Randomized K-means Clustering Randomized Algorithms for K-means Clustering
For the random sketch, JL transforms and sparse subspace embeddings all work:
JL transform: $m = O\big(\frac{k\log(k/(\epsilon\delta))}{\epsilon^2}\big)$
Sparse subspace embedding: $m = O\big(\frac{k^2}{\epsilon^2\delta}\big)$
$\epsilon$ relates to the approximation accuracy
The analysis of the approximation error for K-means can be formulated as constrained low-rank approximation (Cohen et al., 2015):
$\min_{Q^\top Q = I}\|X - QQ^\top X\|_F^2$, where Q is orthonormal
Yang Tutorial for ACML 15 Nov. 20, 2015 160 / 210

Randomized Algorithms Randomized Kernel methods Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang Tutorial for ACML 15 Nov. 20, 2015 161 / 210

Randomized Algorithms Randomized Kernel methods Kernel methods
Kernel function: $\kappa(\cdot,\cdot)$; a set of examples $x_1,\ldots,x_n$
Kernel matrix: $K\in\mathbb{R}^{n\times n}$ with $K_{ij} = \kappa(x_i, x_j)$; K is a PSD matrix
Computational and memory costs: $\Omega(n^2)$
Approximation methods: the Nyström method, random Fourier features
Yang Tutorial for ACML 15 Nov. 20, 2015 162 / 210

Randomized Algorithms Randomized Kernel methods The Nyström method
Let $A\in\mathbb{R}^{n\times l}$ be a uniform sampling matrix:
$B = KA\in\mathbb{R}^{n\times l}, \qquad C = A^\top B = A^\top KA$
The Nyström approximation (Drineas & Mahoney, 2005): $\widetilde{K} = BC^\dagger B^\top$
Computational cost: $O(l^3 + nl^2)$
Yang Tutorial for ACML 15 Nov. 20, 2015 163 / 210
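A sketch of forming the Nyström factors above; K_fn is a placeholder kernel routine (the RBF kernel below is only an example), and the pseudo-inverse is taken on the small l x l block.

    import numpy as np

    def nystrom_factors(K_fn, X, l, rng):
        # Uniformly sample l landmarks, form B = K(:, S) and C = K(S, S);
        # the approximation is K_tilde = B @ pinv(C) @ B.T (never formed explicitly).
        n = X.shape[0]
        S = rng.choice(n, size=l, replace=False)
        B = K_fn(X, X[S])                       # n x l
        C = K_fn(X[S], X[S])                    # l x l
        return B, np.linalg.pinv(C)

    def rbf_kernel(X1, X2, gamma=0.1):
        # Example kernel for the sketch above.
        sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)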

Randomized Algorithms Randomized Kernel methods The Nyström based kernel machine
The dual problem:
$\arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top BC^\dagger B^\top\alpha$
Solve it like a linear method with $\widehat{X} = BC^{-1/2}\in\mathbb{R}^{n\times l}$:
$\arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widehat{X}\widehat{X}^\top\alpha$
Yang Tutorial for ACML 15 Nov. 20, 2015 164 / 210
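A sketch of forming the explicit feature matrix hat{X} = B C^{-1/2} so that the kernel dual above can be handed to any linear solver; the eigenvalue cutoff is a numerical safeguard not mentioned on the slide.

    import numpy as np

    def nystrom_features(B, C, eps=1e-10):
        # C^{-1/2} via eigendecomposition, dropping near-zero eigenvalues.
        vals, vecs = np.linalg.eigh(C)
        keep = vals > eps
        C_inv_sqrt = vecs[:, keep] @ np.diag(1.0 / np.sqrt(vals[keep])) @ vecs[:, keep].T
        return B @ C_inv_sqrt      # n x l features; train a linear SVM / ridge model on these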

Randomized Algorithms Randomized Kernel methods The Nyström based kernel machine Yang Tutorial for ACML 15 Nov. 20, 2015 165 / 210