
1 Big Data Machine Learning Wu-Jun Li LAMDA Group National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University liwujun@nju.edu.cn Oct 19, 2016 Wu-Jun Li ( Big Learning CS, NJU 1 / 125

2 Outline 1 Introduction 2 Learning to Hash Isotropic Hashing Scalable Graph Hashing with Feature Transformation Supervised Hashing with Latent Factor Models Column Sampling based Discrete Supervised Hashing Deep Supervised Hashing with Pairwise Labels Supervised Multimodal Hashing with SCM Multiple-Bit Quantization 3 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent SCOPE: Scalable Composite Optimization for Learning 4 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Power-Law Graph Computing Distributed Stochastic ADMM for Matrix Factorization 5 Conclusion Wu-Jun Li ( Big Learning CS, NJU 2 / 125

3 Introduction Outline 1 Introduction 2 Learning to Hash Isotropic Hashing Scalable Graph Hashing with Feature Transformation Supervised Hashing with Latent Factor Models Column Sampling based Discrete Supervised Hashing Deep Supervised Hashing with Pairwise Labels Supervised Multimodal Hashing with SCM Multiple-Bit Quantization 3 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent SCOPE: Scalable Composite Optimization for Learning 4 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Power-Law Graph Computing Distributed Stochastic ADMM for Matrix Factorization 5 Conclusion Wu-Jun Li ( Big Learning CS, NJU 3 / 125

4 Introduction Big Data Big data has attracted much attention from both academia and industry. Facebook: 750 million users Flickr: 6 billion photos Wal-Mart: 267 million items/day; 4PB data warehouse Sloan Digital Sky Survey: New Mexico telescope captures 200 GB image data/day Wu-Jun Li ( Big Learning CS, NJU 4 / 125

5 Introduction Definition of Big Data Gartner (2012): Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. ( 3Vs ) International Data Corporation (IDC) (2011): Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis. ( 4Vs ) McKinsey Global Institute (MGI) (2011): Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. Why did it not become hot until recent years? Big data; Cloud computing; Big data machine learning. Wu-Jun Li ( Big Learning CS, NJU 5 / 125

6 Introduction Big Data Machine Learning Definition: perform machine learning from big data. Role: key for big data Ultimate goal of big data processing is to mine value from data. Machine learning provides fundamental theory and computational techniques for big data mining and analysis. Wu-Jun Li ( Big Learning CS, NJU 6 / 125

7 Introduction Challenge Storage: memory and disk Computation: CPU Communication: network Wu-Jun Li ( Big Learning CS, NJU 7 / 125

8 Introduction Our Contribution Learning to hash: memory/disk/CPU/communication Parallel and distributed stochastic learning: memory/disk/CPU; but increased communication cost Other application-driven distributed learning Wu-Jun Li ( Big Learning CS, NJU 8 / 125

9 Learning to Hash Outline 1 Introduction 2 Learning to Hash Isotropic Hashing Scalable Graph Hashing with Feature Transformation Supervised Hashing with Latent Factor Models Column Sampling based Discrete Supervised Hashing Deep Supervised Hashing with Pairwise Labels Supervised Multimodal Hashing with SCM Multiple-Bit Quantization 3 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent SCOPE: Scalable Composite Optimization for Learning 4 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Power-Law Graph Computing Distributed Stochastic ADMM for Matrix Factorization 5 Conclusion Wu-Jun Li ( Big Learning CS, NJU 9 / 125

10 Learning to Hash Nearest Neighbor Search (Retrieval) Given a query point q, return the points closest (most similar) to q in the database (e.g., image retrieval). Underlies many machine learning, data mining, and information retrieval problems. Challenges in Big Data Applications: Curse of dimensionality Storage cost Query speed Wu-Jun Li ( Big Learning CS, NJU 10 / 125

11 Learning to Hash Similarity Preserving Hashing Wu-Jun Li ( Big Learning CS, NJU 11 / 125

12 Learning to Hash Reduce Dimensionality and Storage Cost Wu-Jun Li ( Big Learning CS, NJU 12 / 125

13 Learning to Hash Fast Query Speed By using hash-code to construct index, we can achieve constant or sub-linear search time complexity. In some cases, exhaustive search with linear time complexity is also acceptable because the distance calculation cost is low with binary representation. Wu-Jun Li ( Big Learning CS, NJU 13 / 125

14 Learning to Hash Two Stages of Hash Function Learning Two main categories: Category I: Projection Stage (Dimension Reduction): project with a real-valued projection function; given a point x, each projected dimension i is associated with a real-valued projection function $f_i(x)$ (e.g., $f_i(x) = w_i^T x$). Quantization Stage: turn real values into binary codes; this is the essential difference between metric learning and learning to hash. Category II: Binary-Code Learning Stage Hash Function Learning Stage Wu-Jun Li ( Big Learning CS, NJU 14 / 125

15 Learning to Hash Our Contribution Unsupervised Hashing Isotropic hashing (IsoHash) [NIPS 2012] Scalable graph hashing with feature transformation [IJCAI 2015] Supervised Hashing Supervised hashing with latent factor models [SIGIR 2014] Column sampling based discrete supervised hashing [AAAI 2016] Feature learning based deep supervised hashing with pairwise labels [IJCAI 2016] Multimodal Hashing Large-scale supervised multimodal hashing with semantic correlation maximization [AAAI 2014] Multiple-Bit Quantization Double-bit quantization (DBQ) [AAAI 2012] Manhattan quantization (MQ) [SIGIR 2012] Wu-Jun Li ( Big Learning CS, NJU 15 / 125

16 Learning to Hash Isotropic Hashing Motivation Problem: All existing methods use the same number of bits for different projected dimensions with different variances. Possible Solutions: Different number of bits for different dimensions (Variable bit quantization (Moran et al, ACL 2013)) Isotropic (equal) variances for all dimensions Wu-Jun Li ( Big Learning CS, NJU 16 / 125

17 Learning to Hash Isotropic Hashing PCA Hash To generate a code of m bits, PCAH performs PCA on X, and then uses the top m eigenvectors of the matrix $XX^T$ as columns of the projection matrix $W \in \mathbb{R}^{d \times m}$. Here, the top m eigenvectors are those corresponding to the m largest eigenvalues $\{\lambda_k\}_{k=1}^m$, generally arranged in non-increasing order $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$. Let $\lambda = [\lambda_1, \lambda_2, \ldots, \lambda_m]^T$. Then $\Lambda = W^T XX^T W = \mathrm{diag}(\lambda)$. Define the hash function $h(x) = \mathrm{sgn}(W^T x)$. Wu-Jun Li ( Big Learning CS, NJU 17 / 125
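As a concrete illustration, here is a minimal NumPy sketch of the PCAH baseline described above (a sketch under the assumption that X is the d x n zero-centered data matrix; the function names pcah_train and pcah_hash are ours, not from the original slides):

```python
import numpy as np

def pcah_train(X, m):
    # X: d x n zero-centered data matrix (columns are points); m: code length.
    S = X @ X.T                                    # d x d matrix XX^T from the slide
    eigvals, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    W = eigvecs[:, np.argsort(eigvals)[::-1][:m]]  # top-m eigenvectors as columns, d x m
    return W

def pcah_hash(x, W):
    # h(x) = sgn(W^T x); sgn(0) is mapped to +1 so codes stay in {-1, +1}
    return np.where(W.T @ x >= 0, 1, -1)
```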

18 Learning to Hash Isotropic Hashing Idea of IsoHash Learn an orthogonal matrix $Q \in \mathbb{R}^{m \times m}$ which makes $Q^T W^T XX^T W Q$ become a matrix with equal diagonal values. Effect of Q: to make each projected dimension have the same variance while keeping the Euclidean distances between any two points unchanged. Wu-Jun Li ( Big Learning CS, NJU 18 / 125

19 Learning to Hash Isotropic Hashing Accuracy (mAP) Table: mAP on CIFAR with varying numbers of bits, comparing IsoHash, PCAH, ITQ, SH, SIKH, and LSH. Wu-Jun Li ( Big Learning CS, NJU 19 / 125

20 Learning to Hash Isotropic Hashing Training Time Figure: Training time (s) vs. number of training data (x 10^4), comparing IsoHash-GF, IsoHash-LP, ITQ, SH, SIKH, LSH, and PCAH. Wu-Jun Li ( Big Learning CS, NJU 20 / 125

21 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) Problem: the memory cost and time complexity are at least O(n^2) for graph hashing if all pairwise similarities are explicitly computed. How to utilize the whole graph while avoiding O(n^2) complexity? Scalable Graph Hashing (SGH): a feature transformation (Shrivastava and Li, NIPS 2014) method to effectively approximate the whole graph without explicitly computing it; a sequential method for bit-wise complementary learning; linear complexity. Wu-Jun Li ( Big Learning CS, NJU 21 / 125

22 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) Objective function: $\min_W \| c\tilde{S} - \mathrm{sgn}(K(X)W^T)\,\mathrm{sgn}(K(X)W^T)^T \|_F^2$ s.t. $W K(X)^T K(X) W^T = I$. $\forall x$, define $K(x) = [\phi(x, x_1) - \sum_{i=1}^n \phi(x_i, x_1)/n, \ldots, \phi(x, x_m) - \sum_{i=1}^n \phi(x_i, x_m)/n]$ and $\tilde{S}_{ij} = 2S_{ij} - 1 \in (-1, 1]$. Notation: $X = \{x_1, \ldots, x_n\}^T \in \mathbb{R}^{n \times d}$: n data points. Pairwise similarity metric defined as $S_{ij} = e^{-\|x_i - x_j\|_F^2/\rho} \in (0, 1]$. Wu-Jun Li ( Big Learning CS, NJU 22 / 125

23 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) $\forall x$, define $P(x)$ and $Q(x)$: $P(x) = \left[\sqrt{\tfrac{2(e^2-1)}{e\rho}}\, e^{-\frac{\|x\|_F^2}{\rho}} x;\ \sqrt{\tfrac{e^2+1}{e}}\, e^{-\frac{\|x\|_F^2}{\rho}};\ 1\right]$, $Q(x) = \left[\sqrt{\tfrac{2(e^2-1)}{e\rho}}\, e^{-\frac{\|x\|_F^2}{\rho}} x;\ \sqrt{\tfrac{e^2+1}{e}}\, e^{-\frac{\|x\|_F^2}{\rho}};\ -1\right]$. $\forall x_i, x_j \in X$: $P(x_i)^T Q(x_j) = 2\left[\tfrac{e^2-1}{2e}\tfrac{2x_i^T x_j}{\rho} + \tfrac{e^2+1}{2e}\right] e^{-\frac{\|x_i\|_F^2 + \|x_j\|_F^2}{\rho}} - 1 \approx 2 e^{-\frac{\|x_i\|_F^2 + \|x_j\|_F^2 - 2x_i^T x_j}{\rho}} - 1 = 2 e^{-\frac{\|x_i - x_j\|_F^2}{\rho}} - 1 = \tilde{S}_{ij}$. Wu-Jun Li ( Big Learning CS, NJU 23 / 125

24 Learning to Hash Scalable Graph Hashing with Feature Transformation Feature Transformation Here, we use the approximation $\frac{e^2-1}{2e} x + \frac{e^2+1}{2e} \approx e^x$ for $x \in [-1, 1]$. We assume $-1 \leq \frac{2}{\rho} x_i^T x_j \leq 1$. It is easy to prove that $\rho = 2\max\{\|x_i\|_F^2\}_{i=1}^n$ can make $-1 \leq \frac{2}{\rho} x_i^T x_j \leq 1$. Then we have $\tilde{S} \approx P(X)^T Q(X)$. Wu-Jun Li ( Big Learning CS, NJU 24 / 125
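For illustration, a minimal NumPy sketch of this feature transformation as reconstructed above (the function name feature_transform and the layout of P and Q are our assumptions; it only sketches the construction, not the full SGH training):

```python
import numpy as np

def feature_transform(X):
    # X: n x d data matrix. Returns P, Q of shape n x (d+2) such that
    # P[i] @ Q[j] approximates 2*exp(-||x_i - x_j||^2 / rho) - 1.
    e = np.e
    sq_norms = np.sum(X ** 2, axis=1)            # ||x_i||^2 for each point
    rho = 2.0 * sq_norms.max()                   # keeps 2*x_i^T x_j / rho inside [-1, 1]
    w = np.exp(-sq_norms / rho)[:, None]         # n x 1 weights e^{-||x||^2 / rho}
    a = np.sqrt(2.0 * (e ** 2 - 1) / (e * rho))
    b = np.sqrt((e ** 2 + 1) / e)
    P = np.hstack([a * w * X, b * w, np.ones((X.shape[0], 1))])
    Q = np.hstack([a * w * X, b * w, -np.ones((X.shape[0], 1))])
    return P, Q

# Quick check against the exact similarity 2*exp(-||xi - xj||^2 / rho) - 1:
# X = np.random.randn(5, 3); P, Q = feature_transform(X); print(P @ Q.T)
```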

25 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) Direct relaxation may lead to poor performance. We adopt a sequential learning strategy in a bit-wise complementary manner. Residual definition: $R_t = c\tilde{S} - \sum_{i=1}^{t-1} \mathrm{sgn}(K(X)w_i)\,\mathrm{sgn}(K(X)w_i)^T$, with $R_1 = c\tilde{S}$. Objective function: $\min_{w_t} \|R_t - \mathrm{sgn}(K(X)w_t)\,\mathrm{sgn}(K(X)w_t)^T\|_F^2$ s.t. $w_t^T K(X)^T K(X) w_t = 1$. By relaxation, we can get: $\max_{w_t} \mathrm{tr}(w_t^T K(X)^T R_t K(X) w_t)$ s.t. $w_t^T K(X)^T K(X) w_t = 1$. Wu-Jun Li ( Big Learning CS, NJU 25 / 125

26 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) Then we obtain a generalized eigenvalue problem: $K(X)^T R_t K(X) w_t = \lambda K(X)^T K(X) w_t$. Key component: $cK(X)^T \tilde{S} K(X) \approx cK(X)^T P(X)^T Q(X) K(X) = c[K(X)^T P(X)^T][Q(X)K(X)]$. Adopting $R_t = c\tilde{S} - \sum_{i=1, i \neq t}^{c} \mathrm{sgn}(K(X)w_i)\,\mathrm{sgn}(K(X)w_i)^T$, we continue the sequential learning procedure. The information in $\tilde{S}$ is fully used, but $\tilde{S}$ is not explicitly computed. Time complexity is decreased from $O(n^2)$ to $O(n)$. Wu-Jun Li ( Big Learning CS, NJU 26 / 125
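Each bit therefore reduces to one generalized eigenvalue solve. A minimal sketch of that single step using SciPy's symmetric generalized eigensolver (leading_gev is our own helper name; A stands for $K(X)^T R_t K(X)$ and Z for the regularized $K(X)^T K(X)$):

```python
import numpy as np
from scipy.linalg import eigh

def leading_gev(A, Z):
    # Solve A w = lambda * Z w for symmetric A and positive-definite Z,
    # and return the eigenvector associated with the largest eigenvalue.
    vals, vecs = eigh(A, Z)      # eigenvalues returned in ascending order
    return vecs[:, -1]
```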

27 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH)
Algorithm 1 Sequential learning algorithm for SGH
Input: Feature vectors $X \in \mathbb{R}^{n \times d}$; code length c; number of kernel bases m.
Output: Weight matrix $W \in \mathbb{R}^{c \times m}$.
Procedure:
Construct $P(X)$ and $Q(X)$;
Construct $K(X)$ based on the kernel bases, which are m points randomly selected from X;
$A_0 = [K(X)^T P(X)^T][Q(X)K(X)]$; $A_1 = cA_0$; $Z = K(X)^T K(X) + \gamma I$;
for t = 1 to c do
  Solve the generalized eigenvalue problem $A_t w_t = \lambda Z w_t$;
  $U = [K(X)^T \mathrm{sgn}(K(X)w_t)][K(X)^T \mathrm{sgn}(K(X)w_t)]^T$;
  $A_{t+1} = A_t - U$;
end for
$\hat{A}_0 = A_{c+1}$;
Randomly permute $\{1, 2, \ldots, c\}$ to generate a random index set M;
for t = 1 to c do
  $\hat{t} = M(t)$;
  $\hat{A}_0 = \hat{A}_0 + K(X)^T \mathrm{sgn}(K(X)w_{\hat{t}})\,\mathrm{sgn}(K(X)w_{\hat{t}})^T K(X)$;
  Solve the generalized eigenvalue problem $\hat{A}_0 v = \lambda Z v$;
  Update $w_{\hat{t}} \leftarrow v$;
  $\hat{A}_0 = \hat{A}_0 - K(X)^T \mathrm{sgn}(K(X)w_{\hat{t}})\,\mathrm{sgn}(K(X)w_{\hat{t}})^T K(X)$;
end for
Wu-Jun Li ( Big Learning CS, NJU 27 / 125

28 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) Table: Top-1k precision on two benchmark datasets with 32, 64, 96, 128, and 256 bits, comparing SGH, ITQ, AGH, DGH-I, DGH-R, PCAH, and LSH. Wu-Jun Li ( Big Learning CS, NJU 28 / 125

29 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) Table: Training time on the two datasets with 32 to 256 bits, comparing SGH, ITQ, AGH, DGH-I, DGH-R, PCAH, and LSH, where the times of AGH, DGH-I, and DGH-R include an additional common term ($t_1$ and $t_2$ on the respective datasets). Wu-Jun Li ( Big Learning CS, NJU 29 / 125

30 Learning to Hash Supervised Hashing with Latent Factor Models Problem Definition Input: Feature vectors $x_i \in \mathbb{R}^D$, $i = 1, \ldots, N$ (compact form: $X \in \mathbb{R}^{N \times D}$). Similarity labels $s_{ij}$, $i, j = 1, \ldots, N$ (compact form: $S = \{s_{ij}\}$): $s_{ij} = 1$ if points i and j belong to the same class; $s_{ij} = 0$ if points i and j belong to different classes. Output: Binary codes $b_i \in \{-1, 1\}^Q$, $i = 1, \ldots, N$ (compact form: $B \in \{-1, 1\}^{N \times Q}$). When $s_{ij} = 1$, the Hamming distance between $b_i$ and $b_j$ should be low. When $s_{ij} = 0$, the Hamming distance between $b_i$ and $b_j$ should be high. Wu-Jun Li ( Big Learning CS, NJU 30 / 125

31 Learning to Hash Supervised Hashing with Latent Factor Models Motivation Existing supervised methods: High training complexity Semantic information is poorly utilized Wu-Jun Li ( Big Learning CS, NJU 31 / 125

32 Learning to Hash Supervised Hashing with Latent Factor Models Model The likelihood on the observed similarity labels S is defined as: $p(S \mid B) = \prod_{s_{ij} \in S} p(s_{ij} \mid B)$, with $p(s_{ij} \mid B) = a_{ij}$ if $s_{ij} = 1$ and $1 - a_{ij}$ if $s_{ij} = 0$, where $a_{ij} = \sigma(\Theta_{ij})$, $\sigma(x) = \frac{1}{1 + e^{-x}}$, and $\Theta_{ij} = \frac{1}{2} b_i^T b_j$. Relationship between the Hamming distance and the inner product: $\mathrm{dist}_H(b_i, b_j) = \frac{1}{2}(Q - b_i^T b_j) = \frac{1}{2}(Q - 2\Theta_{ij})$. Wu-Jun Li ( Big Learning CS, NJU 32 / 125

33 Learning to Hash Supervised Hashing with Latent Factor Models Relaxation Re-define $\Theta_{ij}$ as $\Theta_{ij} = \frac{1}{2} U_i^T U_j$. Then $p(S \mid B)$, $p(B)$, $p(B \mid S)$ become $p(S \mid U)$, $p(U)$, $p(U \mid S)$. Define a normal prior $p(U) = \prod_{d=1}^{Q} \mathcal{N}(U_{*d} \mid 0, \beta I)$. The log posterior of U can be derived as: $L = \log p(U \mid S) = \sum_{s_{ij} \in S} \left( s_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \right) - \frac{1}{2\beta}\|U\|_F^2 + c$. Wu-Jun Li ( Big Learning CS, NJU 33 / 125
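A minimal NumPy sketch of this relaxed log posterior, written with our own names (lfh_log_posterior, S_pairs as a list of observed (i, j, s_ij) triples) and intended only to make the formula concrete:

```python
import numpy as np

def lfh_log_posterior(U, S_pairs, beta):
    # U: N x Q real-valued latent factors; beta: prior variance.
    # Returns sum_{(i,j)} [ s_ij * Theta_ij - log(1 + exp(Theta_ij)) ] - ||U||_F^2 / (2*beta).
    L = 0.0
    for i, j, s in S_pairs:
        theta = 0.5 * U[i] @ U[j]
        # np.logaddexp(0, theta) = log(1 + exp(theta)), computed stably
        L += s * theta - np.logaddexp(0.0, theta)
    return L - np.sum(U ** 2) / (2.0 * beta)
```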

34 Learning to Hash Supervised Hashing with Latent Factor Models Stochastic Learning Furthermore, if we choose the subset of S by randomly selecting O(Q) of its columns and rows, we can further reduce the time cost to $O(NQ^2)$ per iteration. Wu-Jun Li ( Big Learning CS, NJU 34 / 125

35 Learning to Hash Supervised Hashing with Latent Factor Models MAP (CIFAR-10) Figure: MAP vs. code length on CIFAR-10, comparing LFH, KSH, MLH, ITQ, AGH, LSH, PCAH, SH, and SIKH. Wu-Jun Li ( Big Learning CS, NJU 35 / 125

36 Learning to Hash Supervised Hashing with Latent Factor Models Training Time (CIFAR-10) Figure: Log training time vs. code length on CIFAR-10, comparing LFH, KSH, MLH, ITQ, AGH, LSH, PCAH, SH, and SIKH. Wu-Jun Li ( Big Learning CS, NJU 36 / 125

37 Learning to Hash Column Sampling based Discrete Supervised Hashing Motivation Learning to hash is essentially a discrete optimization problem Most existing methods solve relaxed continuous problems FastH [Lin et al., CVPR 2014] illustrates better accuracy with discrete optimization, but it cannot utilize all training points due to high time complexity. We propose a discrete supervised hashing method which can leverage all training points: COlumn Sampling based DIscrete Supervised Hashing (COSDISH). Wu-Jun Li ( Big Learning CS, NJU 37 / 125

38 Learning to Hash Column Sampling based Discrete Supervised Hashing Objective Function KSH [CVPR 12], TSH [ICCV 13], and FastH [CVPR 14] all optimize the following objective function: $\min_{B \in \{-1,1\}^{n \times q}} \|qS - BB^T\|_F^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} (qS_{ij} - B_{i*}B_{j*}^T)^2$, where $S_{ij} \in \{-1, 1\}$ denotes the supervised label. The inner product reflects the opposite of the Hamming distance: $\mathrm{dist}_H(B_{i*}, B_{j*}) = (q - B_{i*}B_{j*}^T)/2$. Wu-Jun Li ( Big Learning CS, NJU 38 / 125

39 Learning to Hash Column Sampling based Discrete Supervised Hashing Column Sampling In each iteration, we randomly choose a subset $\Omega$ of $N = \{1, 2, \ldots, n\}$ and sample $|\Omega|$ columns of S with the column numbers indexed by $\Omega$. We use $\tilde{S} \in \{-1,1\}^{n \times |\Omega|}$ to denote the sampled sub-similarity matrix. Let $\Gamma = N \setminus \Omega$; we can split $\tilde{S}$ and B into two parts: $\Omega$ part: $\tilde{S}^{\Omega} \in \{-1,1\}^{|\Omega| \times |\Omega|}$ and $B^{\Omega} \in \{-1,1\}^{|\Omega| \times q}$; $\Gamma$ part: $\tilde{S}^{\Gamma} \in \{-1,1\}^{|\Gamma| \times |\Omega|}$ and $B^{\Gamma} \in \{-1,1\}^{|\Gamma| \times q}$. Based on the sampled columns in each iteration, the objective can be reformulated as follows: $\min_{B^{\Omega}, B^{\Gamma}} \|q\tilde{S}^{\Gamma} - B^{\Gamma}[B^{\Omega}]^T\|_F^2 + \|q\tilde{S}^{\Omega} - B^{\Omega}[B^{\Omega}]^T\|_F^2$. Our learning strategy is to alternately optimize $B^{\Omega}$ and $B^{\Gamma}$. Wu-Jun Li ( Big Learning CS, NJU 39 / 125

40 Learning to Hash Column Sampling based Discrete Supervised Hashing Alternating Learning Update $B^{\Gamma}$ with $B^{\Omega}$ Fixed: $B^{\Gamma} = \mathrm{sgn}(\tilde{S}^{\Gamma} B^{\Omega})$ reaches the minimum of $f_1(\cdot)$, which changes the F-norm to the L1-norm. Theorem: suppose that $f_1(F_1)$ and $f_2(F_2)$ reach their minima at the points $F_1^*$ and $F_2^*$, respectively. We have $f_2(F_1^*) \leq 2q\, f_2(F_2^*)$. Wu-Jun Li ( Big Learning CS, NJU 40 / 125
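The closed-form update for the Gamma part is just a sign operation; a one-function NumPy sketch (update_B_Gamma is our own name, and sgn(0) is mapped to +1 as an assumption so that entries stay in {-1, +1}):

```python
import numpy as np

def update_B_Gamma(S_Gamma, B_Omega):
    # S_Gamma: |Gamma| x |Omega| sampled sub-similarity matrix with entries in {-1, +1};
    # B_Omega: |Omega| x q binary codes of the sampled points.
    # Closed-form update: B_Gamma = sgn(S_Gamma @ B_Omega).
    return np.where(S_Gamma @ B_Omega >= 0, 1, -1)
```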

41 Learning to Hash Column Sampling based Discrete Supervised Hashing Alternating Learning Update $B^{\Omega}$ with $B^{\Gamma}$ Fixed: when $B^{\Gamma}$ is fixed, the sub-problem of $B^{\Omega}$ is given by: $\min_{B^{\Omega} \in \{-1,1\}^{|\Omega| \times q}} \|q\tilde{S}^{\Gamma} - B^{\Gamma}[B^{\Omega}]^T\|_F^2 + \|q\tilde{S}^{\Omega} - B^{\Omega}[B^{\Omega}]^T\|_F^2$. We can transform the above problem into q binary quadratic programming (BQP) problems. The optimization of the kth bit of $B^{\Omega}$ is given by: $\min_{b^k \in \{-1,1\}^{|\Omega|}} [b^k]^T Q^{(k)} b^k + [b^k]^T p^{(k)}$, where $b^k$ denotes the kth column of $B^{\Omega}$, and $Q^{(k)}_{i,j} = -2\left(q\tilde{S}^{\Omega}_{i,j} - \sum_{m=1}^{k-1} b^m_i b^m_j\right)$ for $i \neq j$, $Q^{(k)}_{i,i} = 0$, $p^{(k)}_i = -2\sum_{l=1}^{|\Gamma|} B^{\Gamma}_{l,k}\left(q\tilde{S}^{\Gamma}_{l,i} - \sum_{m=1}^{k-1} B^{\Gamma}_{l,m} B^{\Omega}_{i,m}\right)$. Wu-Jun Li ( Big Learning CS, NJU 41 / 125

42 Learning to Hash Column Sampling based Discrete Supervised Hashing Experiment Accuracy (MAP) Wu-Jun Li ( Big Learning CS, NJU 42 / 125

43 Learning to Hash Column Sampling based Discrete Supervised Hashing Experiment Training Time Wu-Jun Li ( Big Learning CS, NJU 43 / 125

44 Learning to Hash Deep Supervised Hashing with Pairwise Labels Motivation and Contribution Motivation: Most existing methods are based on hand-crafted features which might not be optimally compatible with the hashing procedure. Deep hashing with simultaneous feature learning and hash-code learning can achieve better performance. Most existing deep hashing methods are supervised with ranking (triplet) labels. For pairwise labels, no existing method performs simultaneous feature learning and hash-code learning. Our contribution: an end-to-end framework, called deep pairwise-supervised hashing (DPSH), to perform simultaneous feature learning and hash-code learning for applications with pairwise labels. The first work for this case. Wu-Jun Li ( Big Learning CS, NJU 44 / 125

45 Learning to Hash Deep Supervised Hashing with Pairwise Labels Model Figure: the end-to-end deep learning architecture for DPSH, with a feature learning part (convolutions + pooling, weights shared across the two branches) and an objective function part (binary codes + pairwise similarity). Feature learning part: a convolutional neural network (CNN) with 7 layers (five convolution + pooling layers, two 4096-dimensional fully connected layers). Wu-Jun Li ( Big Learning CS, NJU 45 / 125

46 Learning to Hash Deep Supervised Hashing with Pairwise Labels Objective Function Part $\min_{B,W,v,\theta} J = -\sum_{s_{ij} \in S} \left(s_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}})\right) + \eta \sum_{i=1}^{n} \|b_i - (W^T\phi(x_i;\theta) + v)\|_2^2$. Here: n points (images) $X = \{x_i\}_{i=1}^n$; pairwise labels $S = \{s_{ij}\}$ with $s_{ij} \in \{0,1\}$; binary code $b_i \in \{-1,1\}^c$ for each point $x_i$, $B = \{b_i\}_{i=1}^n$; $\Theta_{ij} = \frac{1}{2}u_i^T u_j$, where $u_i = W^T\phi(x_i;\theta) + v$; $\theta$ is the CNN parameters of the feature learning part. Wu-Jun Li ( Big Learning CS, NJU 46 / 125
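To make the objective concrete, a minimal NumPy sketch of the loss value given the network outputs (dpsh_loss, U, B, and S_pairs are our own names; the CNN forward pass producing u_i is assumed to have already happened):

```python
import numpy as np

def dpsh_loss(U, B, S_pairs, eta):
    # U: n x c real-valued outputs u_i = W^T phi(x_i; theta) + v;
    # B: n x c binary codes in {-1, +1}; S_pairs: list of (i, j, s_ij), s_ij in {0, 1}.
    J = 0.0
    for i, j, s in S_pairs:
        theta = 0.5 * U[i] @ U[j]
        # negative log-likelihood term: -(s_ij * Theta_ij - log(1 + exp(Theta_ij)))
        J -= s * theta - np.logaddexp(0.0, theta)
    J += eta * np.sum((B - U) ** 2)   # quantization penalty sum_i ||b_i - u_i||_2^2
    return J
```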

47 Learning to Hash Deep Supervised Hashing with Pairwise Labels Accuracy Comparison to state-of-the-art baselines Table: MAP on CIFAR-10 and NUS-WIDE with 12, 24, 32, and 48 bits, comparing DPSH with NINH, CNNH, FastH, SDH, KSH, LFH, SPLH, ITQ, and SH. NINH [CVPR 15]; CNNH [AAAI 14]; FastH [CVPR 14]; SDH [CVPR 15]; KSH [CVPR 12]; LFH [SIGIR 14]; SPLH [ICML 10]; ITQ [CVPR 11]; SH [NIPS 08] Wu-Jun Li ( Big Learning CS, NJU 47 / 125

48 Learning to Hash Deep Supervised Hashing with Pairwise Labels Accuracy Comparison to deep baselines with ranking (triplet) labels Table: MAP on CIFAR-10 and NUS-WIDE with 16, 24, 32, and 48 bits, comparing DPSH with DRSCH, DSCH, and DSRH. DRSCH [TIP 15]; DSCH [TIP 15]; DSRH [CVPR 15] Wu-Jun Li ( Big Learning CS, NJU 48 / 125

49 Learning to Hash Supervised Multimodal Hashing with SCM Supervised Multimodal Similarity Search Given a query of either image or text, return images or texts similar to it in both feature space and semantics (label information). Wu-Jun Li ( Big Learning CS, NJU 49 / 125

50 Learning to Hash Supervised Multimodal Hashing with SCM Motivation and Contribution Motivation: existing supervised methods are not scalable. Contribution: avoids explicitly computing the pairwise similarity matrix, achieving linear time complexity w.r.t. the size of the training data; a sequential learning method with a closed-form solution for each bit, with no hyper-parameters or stopping conditions needed. Wu-Jun Li ( Big Learning CS, NJU 50 / 125

51 Learning to Hash Supervised Multimodal Hashing with SCM In matrix form, we can rewrite the problem as follows: $\min_{W_x, W_y} \|\mathrm{sgn}(XW_x)\,\mathrm{sgn}(YW_y)^T - cS\|_F^2$ s.t. $\mathrm{sgn}(XW_x)^T\mathrm{sgn}(XW_x) = nI_c$ and $\mathrm{sgn}(YW_y)^T\mathrm{sgn}(YW_y) = nI_c$. Wu-Jun Li ( Big Learning CS, NJU 51 / 125

52 Learning to Hash Supervised Multimodal Hashing with SCM Sequential Strategy Assuming that the projection vectors $w_x^{(1)}, \ldots, w_x^{(t-1)}$ and $w_y^{(1)}, \ldots, w_y^{(t-1)}$ have been learned, to learn the next projection vectors $w_x^{(t)}$ and $w_y^{(t)}$: define a residue matrix $R_t = cS - \sum_{k=1}^{t-1} \mathrm{sgn}(Xw_x^{(k)})\,\mathrm{sgn}(Yw_y^{(k)})^T$. The objective function can be written as $\min_{w_x^{(t)}, w_y^{(t)}} \|\mathrm{sgn}(Xw_x^{(t)})\,\mathrm{sgn}(Yw_y^{(t)})^T - R_t\|_F^2$. Wu-Jun Li ( Big Learning CS, NJU 52 / 125

53 Learning to Hash Supervised Multimodal Hashing with SCM
Algorithm 2 Learning Algorithm of the SCM Hashing Method
$C_{xy}^{(0)} \leftarrow 2(X^T L)(Y^T L)^T - (X^T 1_n)(Y^T 1_n)^T$; $C_{xy}^{(1)} \leftarrow c\,C_{xy}^{(0)}$; $C_{xx} \leftarrow X^T X + \gamma I_{d_x}$; $C_{yy} \leftarrow Y^T Y + \gamma I_{d_y}$;
for t = 1 to c do
  Solve the generalized eigenvalue problem $C_{xy}^{(t)} C_{yy}^{-1} [C_{xy}^{(t)}]^T w_x = \lambda^2 C_{xx} w_x$ to obtain the optimal solution $w_x^{(t)}$ corresponding to the largest eigenvalue $\lambda_{\max}$;
  $w_y^{(t)} \leftarrow C_{yy}^{-1} [C_{xy}^{(t)}]^T w_x^{(t)} / \lambda_{\max}$;
  $h_x^{(t)} \leftarrow \mathrm{sgn}(Xw_x^{(t)})$; $h_y^{(t)} \leftarrow \mathrm{sgn}(Yw_y^{(t)})$;
  $C_{xy}^{(t+1)} \leftarrow C_{xy}^{(t)} - (X^T \mathrm{sgn}(Xw_x^{(t)}))(Y^T \mathrm{sgn}(Yw_y^{(t)}))^T$;
end for
Wu-Jun Li ( Big Learning CS, NJU 53 / 125

54 Learning to Hash Supervised Multimodal Hashing with SCM Scalability Table: Training time (in seconds) on the NUS-WIDE dataset by varying the size of the training set, comparing SCM-Seq, SCM-Orth, CCA, CCA-3V, CVH, CRH, and MLBE. Wu-Jun Li ( Big Learning CS, NJU 54 / 125

55 Learning to Hash Supervised Multimodal Hashing with SCM Accuracy Table: MAP results on NUS-WIDE for code lengths c = 16, 24, 32 on two tasks (image query vs. text database, and text query vs. image database), comparing SCM-Seq, SCM-Orth, CCA, CCA-3V, CVH, CRH, and MLBE; the best performance is shown in boldface. Wu-Jun Li ( Big Learning CS, NJU 55 / 125

56 Learning to Hash Multiple-Bit Quantization Double Bit Quantization Figure: Point distribution of the real values computed by PCA on the 22K LabelMe data set, and different coding results based on the distribution: (a) single-bit quantization (SBQ); (b) hierarchical hashing (HH); (c) double-bit quantization (DBQ). Wu-Jun Li ( Big Learning CS, NJU 56 / 125

57 Learning to Hash Multiple-Bit Quantization Experiment Table: mAP on the LabelMe data set for different numbers of bits, comparing the SBQ, HH, and DBQ quantization schemes applied to ITQ, SH, PCA, LSH, and SIKH. Wu-Jun Li ( Big Learning CS, NJU 57 / 125

58 Learning to Hash Multiple-Bit Quantization Manhattan Quantization Wu-Jun Li ( Big Learning CS, NJU 58 / 125

59 Learning to Hash Multiple-Bit Quantization Experiment Table: mAP on the ANN SIFT1M data set for different numbers of bits, comparing SBQ, HQ, and 2-MQ applied to ITQ, SIKH, LSH, SH, and PCA; the best mAP among SBQ, HQ, and 2-MQ under the same setting is shown in boldface. Wu-Jun Li ( Big Learning CS, NJU 59 / 125

60 Parallel and Distributed Stochastic Learning Outline 1 Introduction 2 Learning to Hash Isotropic Hashing Scalable Graph Hashing with Feature Transformation Supervised Hashing with Latent Factor Models Column Sampling based Discrete Supervised Hashing Deep Supervised Hashing with Pairwise Labels Supervised Multimodal Hashing with SCM Multiple-Bit Quantization 3 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent SCOPE: Scalable Composite Optimization for Learning 4 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Power-Law Graph Computing Distributed Stochastic ADMM for Matrix Factorization 5 Conclusion Wu-Jun Li ( Big Learning CS, NJU 60 / 125

61 Parallel and Distributed Stochastic Learning Machine Learning Supervised Learning: given a set of training examples $\{(x_i, y_i)\}_{i=1}^n$, supervised learning tries to solve the following regularized empirical risk minimization problem: $\min_w f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)$, where $f_i(w)$ is the loss function (plus some regularization term) defined on example i, and w is the parameter to learn. Examples: logistic regression: $f(w) = \frac{1}{n}\sum_{i=1}^n [\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2]$; SVM: $f(w) = \frac{1}{n}\sum_{i=1}^n [\max\{0, 1 - y_i x_i^T w\} + \frac{\lambda}{2}\|w\|^2]$; deep learning models. Unsupervised Learning: many unsupervised learning models, such as PCA and matrix factorization, can also be reformulated as similar problems. Wu-Jun Li ( Big Learning CS, NJU 61 / 125
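For concreteness, a minimal NumPy sketch of the regularized logistic-regression objective above (logistic_objective is our own name; labels are assumed to be in {-1, +1}):

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    # f(w) = (1/n) * sum_i [ log(1 + exp(-y_i x_i^T w)) + (lam/2) * ||w||^2 ]
    # X: n x d feature matrix, y: n-vector with entries in {-1, +1}.
    margins = -y * (X @ w)
    return np.mean(np.logaddexp(0.0, margins)) + 0.5 * lam * (w @ w)
```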

62 Parallel and Distributed Stochastic Learning Machine Learning for Big Data For big data applications, first-order methods have become much more popular than higher-order methods for learning (optimization). Gradient descent methods are the most representative first-order methods. (Deterministic) gradient descent (GD): $w_{t+1} \leftarrow w_t - \eta_t \left[\frac{1}{n}\sum_{i=1}^n \nabla f_i(w_t)\right]$, where t is the iteration number. Linear convergence rate: $O(\rho^t)$; iteration cost is O(n). Stochastic gradient descent (SGD): in the t-th iteration, randomly choose an example $i_t \in \{1, 2, \ldots, n\}$, then update $w_{t+1} \leftarrow w_t - \eta_t \nabla f_{i_t}(w_t)$. Iteration cost is O(1); the convergence rate is sublinear: O(1/t). Wu-Jun Li ( Big Learning CS, NJU 62 / 125
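A minimal sketch of plain SGD on the logistic-regression objective from the previous slide (sgd_logistic, the fixed step size, and the iteration count are illustrative choices of ours):

```python
import numpy as np

def sgd_logistic(X, y, lam, eta, T):
    # At step t, pick a random index i_t and move along -grad f_{i_t}(w).
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(T):
        i = rng.integers(n)
        # gradient of f_i(w) = log(1 + exp(-y_i x_i^T w)) + (lam/2) * ||w||^2
        grad = -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w))) + lam * w
        w -= eta * grad
    return w
```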

63 Parallel and Distributed Stochastic Learning Stochastic Learning for Big Data Researchers recently proposed improved versions of SGD: SAG [Roux et al., NIPS 2012], SDCA [Shalev-Shwartz and Zhang, JMLR 2013], SVRG [Johnson and Zhang, NIPS 2013]. Number of gradient ($\nabla f_i$) evaluations to reach $\epsilon$: GD: $O(n\kappa\log(\frac{1}{\epsilon}))$; SGD: $O(\kappa\frac{1}{\epsilon})$; SAG: $O(n\log(\frac{1}{\epsilon}))$ when $n \geq 8\kappa$; SDCA: $O((n + \kappa)\log(\frac{1}{\epsilon}))$; SVRG: $O((n + \kappa)\log(\frac{1}{\epsilon}))$, where $\kappa = \frac{L}{\mu} > 1$ is the condition number of the objective function. Stochastic Learning: stochastic GD; stochastic coordinate descent; stochastic dual coordinate ascent. Wu-Jun Li ( Big Learning CS, NJU 63 / 125

64 Parallel and Distributed Stochastic Learning Parallel and Distributed Stochastic Learning To further improve the learning scalability (speed): Parallel stochastic learning: One machine with multiple cores and a shared memory Distributed stochastic learning: A cluster with multiple machines Key issues: cooperation Parallel stochastic learning: lock vs. lock-free: waiting cost and lock cost Distributed stochastic learning: synchronous vs. asynchronous: waiting cost and communication cost Wu-Jun Li ( Big Learning CS, NJU 64 / 125

65 Parallel and Distributed Stochastic Learning Our Contributions Parallel stochastic learning: AsySVRG Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee. Distributed stochastic learning: SCOPE Scalable Composite Optimization for Learning Wu-Jun Li ( Big Learning CS, NJU 65 / 125

66 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Motivation and Contribution Motivation: Existing asynchronous parallel SGD: Hogwild! [Recht et al. 2011], CoCoA [Jaggi et al. 2014], and PASSCoDe [Hsieh, Yu, and Dhillon 2015] No parallel methods for SVRG. Lock-free: empirically effective, but no theoretical proof. Contribution: A fast asynchronous method to parallelize SVRG, called AsySVRG. A lock-free parallel strategy for both read and write Linear convergence rate with theoretical proof Outperforms Hogwild! in experiments AsySVRG is the first lock-free parallel SGD method with theoretical proof of convergence. Wu-Jun Li ( Big Learning CS, NJU 66 / 125

67 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent AsySVRG: a multi-thread version of SVRG
Initialization: p threads, initialize $w_0$, $\eta$;
for t = 0, 1, 2, ... do
  $u_0 = w_t$;
  All threads compute the full gradient $\nabla f(u_0) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(u_0)$ in parallel;
  $u = w_t$;
  For each thread, do:
  for m = 1 to M do
    Read the current value of u, denoted as $\hat{u}$, from the shared memory, and randomly pick an i from $\{1, \ldots, n\}$;
    Compute the update vector: $\hat{v} = \nabla f_i(\hat{u}) - \nabla f_i(u_0) + \nabla f(u_0)$;
    $u \leftarrow u - \eta\hat{v}$;
  end for
  Take $w_{t+1}$ to be the current value of u in the shared memory;
end for
Wu-Jun Li ( Big Learning CS, NJU 67 / 125
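To show the structure that the p threads share, here is a single-thread SVRG sketch (only the serial skeleton; the multi-thread, lock-free shared-memory behavior that defines AsySVRG is deliberately omitted, and the names svrg, grad_i, full_grad are ours):

```python
import numpy as np

def svrg(grad_i, full_grad, w0, n, eta, T, M):
    # grad_i(w, i): stochastic gradient of f_i at w; full_grad(w): full gradient of f.
    w = w0.copy()
    rng = np.random.default_rng(0)
    for _ in range(T):
        u0 = w.copy()
        mu = full_grad(u0)                        # snapshot full gradient
        u = w.copy()
        for _ in range(M):
            i = rng.integers(n)
            v = grad_i(u, i) - grad_i(u0, i) + mu  # variance-reduced gradient
            u -= eta * v
        w = u                                     # take the last inner iterate
    return w
```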

68 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Lock-free Analysis In all the GD or SGD methods for solving the objective function, the key step can be written as $u \leftarrow u + \Delta$. Notation: $\Delta_{i,j}$: the j-th update vector computed by the i-th thread; $u \in \mathbb{R}^d$ and $u = (u^{(1)}, u^{(2)}, \ldots, u^{(d)})$; $t^{(k)}_{i,j}$: the time that the operation $u^{(k)} \leftarrow u^{(k)} + \Delta^{(k)}_{i,j}$ has been completed (not the time that the operation begins). We assume $\forall i, j$: $t^{(1)}_{i,j} < t^{(2)}_{i,j} < \ldots < t^{(d)}_{i,j}$. Wu-Jun Li ( Big Learning CS, NJU 68 / 125

69 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Lock-free Analysis: update sequence Since $u^{(1)}$ can only be changed by at most one thread at any absolute time, the $t^{(1)}_{i,j}$ are different from each other. So we can sort these $t^{(1)}_{i,j}$ as $t^{(1)}_0 < t^{(1)}_1 < \ldots < t^{(1)}_{\tilde{M}-1}$ (with $\tilde{M} = p \times M$), and let $\Delta_0, \Delta_1, \ldots, \Delta_{\tilde{M}-1}$ be the corresponding update vectors. Since it is lock-free, for each update vector $\Delta_m$ the real update vector is $B_m\Delta_m$ because of over-writing, where $B_m$ is a diagonal matrix whose diagonal elements are 0 or 1. After all the inner loops stop, we get: $w_{t+1} = u_0 + \sum_{m=0}^{\tilde{M}-1} B_m\Delta_m$ (1). Wu-Jun Li ( Big Learning CS, NJU 69 / 125

70 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Lock-free Analysis: update sequence According to (1), we define a sequence $\{u_m\}$ as follows: $u_m = u_0 + \sum_{i=0}^{m-1} B_i\Delta_i$ (2), which means $u_{m+1} = u_m + B_m\Delta_m$. Note: the sequence $\{u_m\}$ ($m = 1, 2, \ldots, \tilde{M}-1$) is synthetic, and the whole $u_m$ may never occur in the shared memory. What we can get is only the final value $u_{\tilde{M}}$. Wu-Jun Li ( Big Learning CS, NJU 70 / 125

71 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Lock-free Analysis: read sequence Assume the old update vectors $\Delta_0, \Delta_1, \ldots, \Delta_{a(m)-1}$ have been completely applied to u when one thread is reading the shared variable. At the same time, some new update vectors might be updating u. So we can write $\hat{u}_m$, read by the thread to compute $\Delta_m$, as follows: $\hat{u}_m = u_{a(m)} + \sum_{i=a(m)}^{b(m)} P_{m,i-a(m)}\Delta_i$, where $P_{m,i-a(m)}$ is a diagonal matrix whose diagonal elements are 0 or 1. According to the principle of the order, $\Delta_i$ ($i \geq m$) should not be read into $\hat{u}_m$. So $b(m) < m$, which means: $\hat{u}_m = u_{a(m)} + \sum_{i=a(m)}^{m-1} P_{m,i-a(m)}\Delta_i$. Wu-Jun Li ( Big Learning CS, NJU 71 / 125

72 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Convergence result With some assumptions, our algorithm achieves a linear convergence rate as follows: $\mathbb{E}f(w_{t+1}) - f(w_*) \leq \left(c_1^{\tilde{M}} + \frac{c_2}{1 - c_1}\right)\left(\mathbb{E}f(w_t) - f(w_*)\right)$, where $c_1 = 1 - \alpha\eta\mu + c_2$, $c_2 = \eta^2\left(\frac{8\tau L^3\eta\rho^2(\rho^\tau - 1)}{\rho - 1} + 2L^2\rho\right)$, and $\tilde{M} = p \times M$ is the total number of iterations of the inner loop. Note: since it is lock-free, we do not know the exact $B_m$, and we cannot take the average of the $B_m u_m$ to be $w_{t+1}$. Wu-Jun Li ( Big Learning CS, NJU 72 / 125

73 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Experiments Experimental platform: a server with 12 Intel cores and 64G memory. Model: logistic regression with L2-norm: $f(w) = \frac{1}{n}\sum_{i=1}^n \log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2$. Data sets (we set λ = ):
dataset instances features memory type
rcv1 20,242 47,236 36M sparse
real-sim 72,309 20,958 90M sparse
news20 19,996 1,355, M sparse
epsilon 400,000 2,000 11G dense
Wu-Jun Li ( Big Learning CS, NJU 73 / 125

74 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Experiments: Computation Cost Figure: log(objective value − min) vs. number of effective passes on (a) rcv1, (b) real-sim, (c) news20, and (d) epsilon, comparing AsySVRG, AsySVRG-lock, Hogwild!, and Hogwild!-lock. Wu-Jun Li ( Big Learning CS, NJU 74 / 125

75 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Experiments: Total Time Cost Figure: log(objective value − min) vs. time on (a) rcv1, (b) real-sim, (c) news20, and (d) epsilon, comparing AsySVRG and Hogwild!. Wu-Jun Li ( Big Learning CS, NJU 75 / 125

76 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Experiments: Speedup Figure: Speedup vs. number of threads on (a) rcv1, (b) real-sim, (c) news20, and (d) epsilon, comparing AsySVRG-lock and lock-free AsySVRG. Wu-Jun Li ( Big Learning CS, NJU 76 / 125

77 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Motivation and Contribution Motivation: Bulk synchronous parallel (BSP) models, such as MapReduce, are commonly considered to be inefficient for distributed stochastic learning. Is there any technique to solve the issues of BSP models? Contribution: A novel distributed stochastic learning method, called scalable composite optimization for learning (SCOPE), on BSP models Both computation-efficient and communication-efficient Linear convergence rate with theoretical proof Can be easily integrated into the data processing pipeline of Spark Outperform other state-of-the-art distributed learning methods on Spark Wu-Jun Li ( Big Learning CS, NJU 77 / 125

78 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Framework of SCOPE Figure: Distributed framework of SCOPE: a Master holds the parameter w, and Workers 1 through p each hold a local data partition. Wu-Jun Li ( Big Learning CS, NJU 78 / 125

79 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Optimization Algorithm: Master Task of the Master in SCOPE:
Initialization: p Workers, $w_0$;
for t = 0, 1, 2, ..., T do
  Send $w_t$ to the Workers;
  Wait until it receives $z_1, z_2, \ldots, z_p$ from the p Workers;
  Compute the full gradient $z = \frac{1}{n}\sum_{k=1}^p z_k$, and then send z to each Worker;
  Wait until it receives $\tilde{u}_1, \tilde{u}_2, \ldots, \tilde{u}_p$ from the p Workers;
  Compute $w_{t+1} = \frac{1}{p}\sum_{k=1}^p \tilde{u}_k$;
end for
Wu-Jun Li ( Big Learning CS, NJU 79 / 125

80 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Optimization Algorithm: Workers Task of the Workers in SCOPE:
Initialization: initialize $\eta$ and $c > 0$;
For Worker k:
for t = 0, 1, 2, ..., T do
  Wait until it gets the newest parameter $w_t$ from the Master;
  Let $u_{k,0} = w_t$, compute the local gradient sum $z_k = \sum_{i \in D_k} \nabla f_i(w_t)$, and then send $z_k$ to the Master;
  Wait until it gets the full gradient z from the Master;
  for m = 0 to M − 1 do
    Randomly pick an instance with index $i_{k,m}$ from $D_k$;
    $u_{k,m+1} = u_{k,m} - \eta\left(\nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + z + c(u_{k,m} - w_t)\right)$;
  end for
  Send $u_{k,M}$, or the average of these $\{u_{k,m}\}$, which is called the locally updated parameter and denoted as $\tilde{u}_k$, to the Master;
end for
Wu-Jun Li ( Big Learning CS, NJU 80 / 125
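A minimal sketch of one Worker's inner loop in plain Python/NumPy (the actual SCOPE implementation runs on Spark; scope_local_update, grad_i, and local_indices are our own names used only to make the update rule concrete):

```python
import numpy as np

def scope_local_update(grad_i, w_t, z, local_indices, eta, c, M, rng):
    # One Worker's inner loop:
    # u <- u - eta * ( grad_i(u) - grad_i(w_t) + z + c*(u - w_t) ),
    # where z is the full gradient broadcast by the Master and c > 0 pulls u toward w_t.
    u = w_t.copy()
    for _ in range(M):
        i = rng.choice(local_indices)
        v = grad_i(u, i) - grad_i(w_t, i) + z + c * (u - w_t)
        u -= eta * v
    return u    # alternatively, return the average of the inner iterates
```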

81 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Convergence Let $\alpha = 1 - \eta(2\mu + c) < 1$, $\beta = c\eta + 3L^2\eta^2$, and $\alpha + \beta < 1$. We have the following theorems: Theorem: if we take $w_{t+1} = \frac{1}{p}\sum_{k=1}^p u_{k,M}$, then we get the following convergence result: $\mathbb{E}\|w_{t+1} - w_*\|^2 \leq \left(\alpha^M + \frac{\beta}{1-\alpha}\right)\mathbb{E}\|w_t - w_*\|^2$. Theorem: if we take $w_{t+1} = \frac{1}{p}\sum_{k=1}^p \tilde{u}_k$ with $\tilde{u}_k = \frac{1}{M}\sum_{m=0}^{M-1} u_{k,m}$, then we get the following convergence result: $\mathbb{E}\|w_{t+1} - w_*\|^2 \leq \left(\frac{1}{M(1-\alpha)} + \frac{\beta}{1-\alpha}\right)\mathbb{E}\|w_t - w_*\|^2$. Wu-Jun Li ( Big Learning CS, NJU 81 / 125

82 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Communication Cost Traditional mini-batch based methods: O(T n) SCOPE: O(T ) Wu-Jun Li ( Big Learning CS, NJU 82 / 125

83 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Experiment Logistic regression (LR) with an L2-norm regularization term: $f(w) = \frac{1}{n}\sum_{i=1}^n \left[\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2\right]$.
Table: Datasets for evaluation (instances, features, memory, λ):
MNIST-8M 8,100, G 1e-4
epsilon 400,000 2,000 11G 1e-4
KDD12 73,209,277 1,427,495 21G 1e-4
Data-A 106,691, G 1e-6
Two Spark clusters: small: 1 Master and 16 Workers; large: 1 Master and 128 Workers. Wu-Jun Li ( Big Learning CS, NJU 83 / 125

84 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Experiment Baselines: MLlib [Meng et al., 2015]: MLlib is an open source library for distributed machine learning on Spark; we compare our method with the distributed L-BFGS of MLlib, which is a batch learning method. LibLinear [Lin et al., 2014a]: LibLinear is a distributed Newton method, which is also a batch learning method. Splash [Zhang and Jordan, 2015]: Splash is a distributed SGD method that uses a local learning strategy to reduce communication cost. CoCoA [Jaggi et al., 2014]: CoCoA is a distributed dual coordinate ascent method. CoCoA+ [Ma et al., 2015]: CoCoA+ is an improved version of CoCoA; CoCoA+ adopts adding rather than averaging to combine local updates. Wu-Jun Li ( Big Learning CS, NJU 84 / 125

85 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Experiment Figure: objective value − optimal vs. CPU time (ms) on (a) MNIST-8M, (b) epsilon, and (c) KDD12 with 16 cores, and (d) Data-A with 128 cores, comparing SCOPE, LibLinear, CoCoA, MLlib (L-BFGS), Splash, and CoCoA+. Wu-Jun Li ( Big Learning CS, NJU 85 / 125

86 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Experiment Figure: (e) speedup of SCOPE vs. the ideal linear speedup as the number of cores grows; (f) synchronization cost: per-iteration CPU time split into computation time, waiting time, and communication time for different numbers of cores. Wu-Jun Li ( Big Learning CS, NJU 86 / 125

87 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Conclusion Stochastic learning is becoming popular for big data machine learning. Lock-free strategy is key to get a good speedup in parallel stochastic learning. With properly designed techniques, BSP models are also efficient for distributed stochastic learning. Wu-Jun Li ( Big Learning CS, NJU 87 / 125

88 Other Application-Driven Distributed Learning Outline 1 Introduction 2 Learning to Hash Isotropic Hashing Scalable Graph Hashing with Feature Transformation Supervised Hashing with Latent Factor Models Column Sampling based Discrete Supervised Hashing Deep Supervised Hashing with Pairwise Labels Supervised Multimodal Hashing with SCM Multiple-Bit Quantization 3 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent SCOPE: Scalable Composite Optimization for Learning 4 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Power-Law Graph Computing Distributed Stochastic ADMM for Matrix Factorization 5 Conclusion Wu-Jun Li ( Big Learning CS, NJU 88 / 125

89 Other Application-Driven Distributed Learning Introduction Distributed learning: Perform machine learning on clusters with several machines (nodes). Our contributions: Coupled Group Lasso for Web-Scale CTR Prediction in Display Advertising (ICML 2014) Distributed Power-law Graph Computing: Theoretical and Empirical Analysis (NIPS 2014) Distributed Stochastic ADMM for Matrix Factorization (CIKM 2014) Wu-Jun Li ( Big Learning CS, NJU 89 / 125

90 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction CTR Prediction for Online Advertising Online advertising is a multi-billion-dollar business on the web and accounts for the majority of the income of major internet companies. Display advertising is a big part of online advertising. Click-through rate (CTR) prediction is the problem of estimating the probability that an ad is clicked when displayed to a user in a specific context. (a) Google (b) Amazon (c) Taobao Wu-Jun Li ( Big Learning CS, NJU 90 / 125

91 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Notation Impression (instance): ad + user + context. Training set $\{(x^{(i)}, y^{(i)}) \mid i = 1, \ldots, N\}$, where $x^T = (x_u^T, x_a^T, x_o^T)$ and $y \in \{0, 1\}$ with y = 1 denoting click and y = 0 denoting non-click. Learn $h(x) = h(x_u, x_a, x_o)$ to predict the CTR. Wu-Jun Li ( Big Learning CS, NJU 91 / 125

92 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Coupled Group Lasso (CGL) 1. Likelihood: $h(x) = \Pr(y = 1 \mid x, W, V, b) = g\left((x_u^T W)(x_a^T V)^T + b^T x_o\right)$, where $W \in \mathbb{R}^{d_u \times k}$, $V \in \mathbb{R}^{d_a \times k}$, and $(x_u^T W)(x_a^T V)^T = x_u^T(WV^T)x_a$. 2. Objective function: $\min_{W,V,b} \sum_{i=1}^N \xi\left(W, V, b; x^{(i)}, y^{(i)}\right) + \lambda\Omega(W, V)$, in which $\xi(W, V, b; x^{(i)}, y^{(i)}) = -\log\left((h(x^{(i)}))^{y^{(i)}}(1 - h(x^{(i)}))^{1 - y^{(i)}}\right)$ and $\Omega(W, V) = \|W\|_{2,1} + \|V\|_{2,1} = \sum_{i=1}^{d_u}\|W_{i*}\|_2 + \sum_{i=1}^{d_a}\|V_{i*}\|_2$. Wu-Jun Li ( Big Learning CS, NJU 92 / 125
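A minimal NumPy sketch of the CGL prediction, assuming g is the sigmoid function (cgl_predict is our own name; the learning of W, V, b is not shown):

```python
import numpy as np

def cgl_predict(x_u, x_a, x_o, W, V, b):
    # h(x) = sigmoid( (x_u^T W)(x_a^T V)^T + b^T x_o )
    # W: d_u x k, V: d_a x k, b: vector matching x_o.
    score = (x_u @ W) @ (x_a @ V) + b @ x_o
    return 1.0 / (1.0 + np.exp(-score))
```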

93 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Learning Framework Compute the gradient $g_p$ locally on each node p in parallel. Compute the gradient $g = \sum_{p=1}^P g_p$ with AllReduce. Add the gradient of the regularization term and take an L-BFGS step on the master node. Broadcast the updated parameters to each slave node. Wu-Jun Li ( Big Learning CS, NJU 93 / 125

94 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Experiment Experiment environment: MPI cluster with hundreds of nodes, each of which is a 24-core server with a 2.2GHz CPU and 96GB of RAM. Baseline and evaluation metric: Baseline: LR (with L2-norm) [Chapelle et al., 2013]. Evaluation metric (discrimination): $\mathrm{RelaImpr} = \left(\frac{\mathrm{AUC(model)} - 0.5}{\mathrm{AUC(baseline)} - 0.5} - 1\right) \times 100\%$. Wu-Jun Li ( Big Learning CS, NJU 94 / 125
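For reference, a one-line Python helper matching the RelaImpr definition as written above (rela_impr is our own name):

```python
def rela_impr(auc_model, auc_baseline):
    # Relative improvement over a random predictor (AUC = 0.5), in percent.
    return ((auc_model - 0.5) / (auc_baseline - 0.5) - 1.0) * 100.0
```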

95 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Data Sets Table: Three real-world train/test data sets collected from Taobao of Alibaba Group, described by number of instances (billions), CTR (%), number of ads, number of users (millions), and storage (TB). Wu-Jun Li ( Big Learning CS, NJU 95 / 125

96 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Performance Figure: Accuracy and scalability: (c) RelaImpr w.r.t. the baseline; (d) speedup. Wu-Jun Li ( Big Learning CS, NJU 96 / 125

97 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Feature Selection Feature selection results: Ad part: important features: Women's clothes, Skirt, Dress, Children's wear, Shoes, Cellphone; useless features: Movie, Act, Takeout, Food booking service. User part: important features: Watch, Underwear, Fur clothing, Furniture; useless features: Stage Costume, Flooring, Pencil, Outdoor sock. Wu-Jun Li ( Big Learning CS, NJU 97 / 125

98 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Graph-based Machine Learning Big graphs emerge in many real applications Graph-based machine learning is a hot research topic with wide applications: relational learning, manifold learning, PageRank, community detection, etc Distributed graph computing frameworks for big graphs (a) Social Network (b) Biological Network Wu-Jun Li ( Big Learning CS, NJU 98 / 125

99 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Graph Partitioning Graph partitioning (GP) plays a key role to affect the performance of distributed graph computing: Workload balance Communication cost Two strategies for graph partitioning. Shaded vertices are ghosts and mirrors, respectively. (a) Edge-Cut (b) Vertex-Cut Theoretical and empirical results show vertex-cut is better than edge-cut. Wu-Jun Li ( Big Learning CS, NJU 99 / 125

100 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Power-Law Graph Partitioning Natural graphs from the real world typically follow skewed power-law degree distributions: $\Pr(d) \propto d^{-\alpha}$. Different vertex-cut methods can result in different performance. Intuition: cut the high-degree vertices. Figure: a sample graph with a bad partitioning and a good partitioning. Wu-Jun Li ( Big Learning CS, NJU 100 / 125

101 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Degree-based Hashing (DBH) Existing GP methods, such as Random in PowerGraph (Joseph E. Gonzalez et al., 2012) and Grid in GraphBuilder (Nilesh Jain et al., 2013), do not make effective use of the power-law degree distribution. We propose a novel GP method called degree-based hashing (DBH):
Algorithm 3 GP with DBH
Input: the set of edges E; the set of vertices V; number of machines p.
Output: the assignment $M(e) \in \{1, \ldots, p\}$ for each edge e.
Initialization: count the degree $d_i$ for each vertex $i \in \{1, \ldots, n\}$ in parallel;
for all $e = (v_i, v_j) \in E$ do
  Hash each edge in parallel:
  if $d_i < d_j$ then $M(e) \leftarrow \mathrm{vertex\_hash}(v_i)$ else $M(e) \leftarrow \mathrm{vertex\_hash}(v_j)$ end if
end for
Wu-Jun Li ( Big Learning CS, NJU 101 / 125
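A small Python sketch of the edge-assignment rule in Algorithm 3 (dbh_assign is our own name, and the modular hash is just one possible choice of vertex hash function):

```python
def dbh_assign(edges, degrees, p):
    # edges: iterable of (i, j) vertex pairs; degrees: mapping vertex -> degree;
    # p: number of machines. Returns a dict edge -> machine id in {0, ..., p-1}.
    def vertex_hash(v):
        return hash(v) % p          # any deterministic hash onto {0, ..., p-1} works

    assignment = {}
    for (i, j) in edges:
        # hash the lower-degree endpoint, so high-degree vertices are the ones cut
        anchor = i if degrees[i] < degrees[j] else j
        assignment[(i, j)] = vertex_hash(anchor)
    return assignment
```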

102 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Theoretical Analysis We theoretically prove that our DBH method can outperform Random and Grid in terms of: reducing replication factor (communication and storage cost) keeping good edge-balance (workload balance) Nice property: Our DBH reduces more replication factor when the power-law graph is more skewed. Wu-Jun Li ( Big Learning CS, NJU 102 / 125

103 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Data Set Table: Datasets
Alias Graph V E
Tw Twitter 42M 1.47B
Arab Arabic M 0.6B
Wiki Wiki 5.7M 130M
LJ LiveJournal 5.4M 79M
WG WebGoogle 0.9M 5.1M
Wu-Jun Li ( Big Learning CS, NJU 103 / 125

104 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Empirical Results (PageRank) Figure: Experiments on real-world graphs (WG, LJ, Wiki, Arab, Tw): (d) replication factor of Random, Grid, and DBH; (e) speedup of DBH relative to the Random and Grid baselines. Figure: Replication factor and execution time (sec) on the Twitter graph as the number of machines ranges from 8 to 64, comparing Random, Grid, and DBH. Wu-Jun Li ( Big Learning CS, NJU 104 / 125

105 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Recommender System Recommend products (items) to customers (users) by utilizing the customers' historical preferences. (a) Amazon (b) Sina Weibo Wu-Jun Li ( Big Learning CS, NJU 105 / 125

106 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Matrix Factorization Popular in recommender systems for its promising performance. Figure: a rating matrix R (m users, e.g., Ann, James, Kate, Helen, by n items) is decomposed as $R \approx U^T V$ into user factors U and item factors V with k latent dimensions. $\min_{U,V} \frac{1}{2}\sum_{(i,j) \in \Omega} \left[(R_{i,j} - U_i^T V_j)^2 + \lambda_1 U_i^T U_i + \lambda_2 V_j^T V_j\right]$. Wu-Jun Li ( Big Learning CS, NJU 106 / 125
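A minimal NumPy sketch of this MF objective over the observed ratings (mf_objective and R_entries are our own names; U and V store one latent column per user/item):

```python
import numpy as np

def mf_objective(R_entries, U, V, lam1, lam2):
    # R_entries: list of observed ratings (i, j, r_ij);
    # U: k x m user factors; V: k x n item factors.
    total = 0.0
    for i, j, r in R_entries:
        err = r - U[:, i] @ V[:, j]
        total += err ** 2 + lam1 * (U[:, i] @ U[:, i]) + lam2 * (V[:, j] @ V[:, j])
    return 0.5 * total
```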

107 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Data Split Strategy (P machines) Decouple U and V as much as possible: Figure: the rating matrix is split by rows into $R^1, R^2, \ldots, R^P$ so that node p holds $R^p \approx [U^p]^T V^p$, where each local item-factor copy $V^p$ is tied to a global V. Wu-Jun Li ( Big Learning CS, NJU 107 / 125

108 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Distributed ADMM Reformulated MF problem: $\min_{U,\mathcal{V},V} \frac{1}{2}\sum_{p=1}^P \sum_{(i,j) \in \Omega^p} \left[(R_{i,j} - U_i^T V_j^p)^2 + \lambda_1 U_i^T U_i + \lambda_2 [V_j^p]^T V_j^p\right]$ s.t. $V^p - V = 0$, $\forall p \in \{1, 2, \ldots, P\}$, where $\mathcal{V} = \{V^p\}_{p=1}^P$ and $\Omega^p$ denotes the (i, j) indices of the ratings located in node p. Wu-Jun Li ( Big Learning CS, NJU 108 / 125

109 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Distributed ADMM Define: $L^p(U, V^p, \Theta^p, V) = f^p(U, V^p) + l^p(V^p, V, \Theta^p) = \sum_{(i,j) \in \Omega^p} \hat{f}_{i,j}(U_i, V_j^p) + \left[\frac{\rho}{2}\|V^p - V\|_F^2 + \mathrm{tr}\left([\Theta^p]^T(V^p - V)\right)\right]$, and $L(U, \mathcal{V}, O, V) = \sum_{p=1}^P L^p(U, V^p, \Theta^p, V)$. Wu-Jun Li ( Big Learning CS, NJU 109 / 125

110 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Distributed ADMM Get the solutions by repeating the following three steps: $U_{t+1}, V^p_{t+1} \leftarrow \arg\min_{U, V^p} L^p(U, V^p, \Theta^p_t, V_t)$, $\forall p \in \{1, 2, \ldots, P\}$; $V_{t+1} \leftarrow \arg\min_{V} L(U_{t+1}, \mathcal{V}_{t+1}, O_t, V)$; $\Theta^p_{t+1} \leftarrow \Theta^p_t + \rho(V^p_{t+1} - V_{t+1})$, $\forall p \in \{1, 2, \ldots, P\}$. The solution for $V_{t+1}$ is $V_{t+1} = \frac{1}{P}\sum_{p=1}^P V^p_{t+1}$, which can be calculated efficiently. The problem lies in getting $U_{t+1}$ and $\mathcal{V}_{t+1}$ efficiently. Wu-Jun Li ( Big Learning CS, NJU 110 / 125

111 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Stochastic Learning Batch learning is still not very efficient: $U_{t+1} = U_t - \tau_t \nabla_U f^p(U_t, V^p_t)$, $V^p_{t+1} = \frac{\tau_t}{1 + \rho\tau_t}\left[\frac{V^p_t}{\tau_t} + \rho V_t - \Theta^p_t - \nabla_{V^p} f^p(U_t, V^p_t)\right]$. Stochastic learning: $(U_i)_{t+1} = (U_i)_t + \tau_t\left(\epsilon_{ij}(V^p_j)_t - \lambda_1(U_i)_t\right)$, $(V^p_j)_{t+1} = \frac{\tau_t}{1 + \rho\tau_t}\left[\frac{1 - \lambda_2\tau_t}{\tau_t}(V^p_j)_t + \epsilon_{ij}(U_i)_t + \rho(V_j)_t - (\Theta^p_j)_t\right]$, where $\epsilon_{ij} = R_{ij} - [(U_i)_t]^T(V^p_j)_t$. Wu-Jun Li ( Big Learning CS, NJU 111 / 125

112 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Scheduler Comparison Baselines: CCD++ (H.-F. Yu et al., ICDM 12); DSGD (R. Gemulla et al., KDD 11). Number of synchronization operations for one iteration to fully scan all the ratings: Figure: (a) CCD++ synchronizes k times; (b) DSGD generates a stratum sequence for each node and synchronizes P times; (c) DS-ADMM synchronizes 1 time. Wu-Jun Li ( Big Learning CS, NJU 112 / 125

113 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Experiment Experiment environment: MPI cluster with 20 nodes, each of which is a 24-core server with a 2.2GHz CPU and 96GB of RAM; one core and 10GB of memory are actually used on each node. Baseline and evaluation metric: Baselines: CCD++, DSGD, DSGD-Bias. Evaluation metric: test RMSE $= \sqrt{\frac{1}{Q}\sum_{(i,j)}(R_{i,j} - U_i^T V_j)^2}$, where Q is the number of test ratings. Wu-Jun Li ( Big Learning CS, NJU 113 / 125

114 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Data Sets Table: Data sets and parameter settings, listing m, n, #Train, #Test, and the hyper-parameters k, η_0/τ, λ_1/λ_2, ρ, α, β, and P for Netflix (m = 480,190, n = 17,770, #Train = 99,072,112, #Test = 1,408,395), Yahoo! Music R1 (m = 1,938,361, #Test = 7,534,592), and Yahoo! Music R2 (m = 1,823,179, #Test = 18,231,790). Wu-Jun Li ( Big Learning CS, NJU 114 / 125

115 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Accuracy and Efficiency Figure: Test RMSE vs. time (s) on (a) Netflix, (b) Yahoo-Music-R1, and (c) Yahoo-Music-R2, comparing CCD++, DSGD, DSGD-Bias, and DS-ADMM; (d) running time to reach a fixed RMSE (0.922) for different numbers of nodes. Wu-Jun Li ( Big Learning CS, NJU 115 / 125

116 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Speedup Figure: Speedup vs. number of nodes, comparing CCD++, DSGD, DS-ADMM, and the ideal linear speedup. Wu-Jun Li ( Big Learning CS, NJU 116 / 125

117 Conclusion Outline 1 Introduction 2 Learning to Hash Isotropic Hashing Scalable Graph Hashing with Feature Transformation Supervised Hashing with Latent Factor Models Column Sampling based Discrete Supervised Hashing Deep Supervised Hashing with Pairwise Labels Supervised Multimodal Hashing with SCM Multiple-Bit Quantization 3 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent SCOPE: Scalable Composite Optimization for Learning 4 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Power-Law Graph Computing Distributed Stochastic ADMM for Matrix Factorization 5 Conclusion Wu-Jun Li ( Big Learning CS, NJU 117 / 125

118 Conclusion Our Contribution Learning to hash: memory/disk/CPU/communication Parallel and distributed stochastic learning: memory/disk/CPU; but increased communication cost Other application-driven distributed learning Wu-Jun Li ( Big Learning CS, NJU 118 / 125

119 Conclusion Future Work Learning to hash for decreasing communication cost Distributed programming models and platforms for machine learning: MPI (fault tolerance, asynchronous), GraphLab, Spark, Parameter Server, Petuum, MapReduce, Storm, GPU, etc Wu-Jun Li ( Big Learning CS, NJU 119 / 125

120 Conclusion Future Work Open source project: LIBBLE: A library for big learning LIBBLE-Spark: Classification: LR, SVM Regression: Linear Regression Dimensionality Reduction: PCA Matrix Decomposition/Factorization: SVD LIBBLE-MPI LIBBLE-GPU LIBBLE-MultiCore Wu-Jun Li ( Big Learning CS, NJU 120 / 125

121 Conclusion Future Work Big data machine learning (big learning) framework Figure: a big data machine learning core engine (AIOS kernel) serving applications such as search, classification, clustering, recommendation, and ranking through prior-knowledge interfaces; built on novel machine learning and data mining algorithms, parallel numerical-computation libraries, and distributed environments such as cloud computing; spanning matrix computation, optimization methods, programming models, and data partitioning/layout for specific data forms such as relational data, high-dimensional data, and dynamic data. Wu-Jun Li ( Big Learning CS, NJU 121 / 125

122 Conclusion Related Publication (1/3)[*indicate my students] Shen-Yi Zhao*, Ru Xiang*, Ying-Hao Shi*, Peng Gao*, Wu-Jun Li. SCOPE: Scalable Composite Optimization for Learning on Spark. arxiv: v4, Shen-Yi Zhao*, Wu-Jun Li. Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee. Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), Wu-Jun Li, Sheng Wang*, Wang-Cheng Kang*. Feature Learning based Deep Supervised Hashing with Pairwise Labels. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), Wang-Cheng Kang*, Wu-Jun Li, Zhi-Hua Zhou. Column Sampling based Discrete Supervised Hashing. Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), Qing-Yuan Jiang*, Wu-Jun Li. Scalable Graph Hashing with Feature Transformation. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Wu-Jun Li ( Big Learning CS, NJU 122 / 125

123 Conclusion Related Publication (2/3)[*indicate my students] Ling Yan*, Wu-Jun Li, Gui-Rong Xue, Dingyi Han. Coupled Group Lasso for Web-Scale CTR Prediction in Display Advertising. Proceedings of the 31st International Conference on Machine Learning (ICML), Cong Xie*, Ling Yan*, Wu-Jun Li, Zhihua Zhang. Distributed Power-law Graph Computing: Theoretical and Empirical Analysis. Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS), Peichao Zhang*, Wei Zhang*, Wu-Jun Li, Minyi Guo. Supervised Hashing with Latent Factor Models. Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Zhi-Qin Yu*, Xing-Jian Shi*, Ling Yan*, Wu-Jun Li. Distributed Stochastic ADMM for Matrix Factorization. Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM), Wu-Jun Li ( Big Learning CS, NJU 123 / 125

124 Conclusion Related Publication (3/3) [* indicates my students]
Dongqing Zhang*, Wu-Jun Li. Large-Scale Supervised Multimodal Hashing with Semantic Correlation Maximization. Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI), 2014.
Weihao Kong*, Wu-Jun Li. Isotropic Hashing. Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012.
Weihao Kong*, Wu-Jun Li, Minyi Guo. Manhattan Hashing for Large-Scale Image Retrieval. Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2012.
Weihao Kong*, Wu-Jun Li. Double-Bit Quantization for Hashing. Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2012.
Wu-Jun Li ( Big Learning CS, NJU 124 / 125

125 Conclusion Q & A Thanks! Wu-Jun Li ( Big Learning CS, NJU 125 / 125
