
1 Big Data Machine Learning Wu-Jun Li LAMDA Group National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University liwujun@nju.edu.cn Oct 19, 2016 Wu-Jun Li ( Big Learning CS, NJU 1 / 125

2 Outline 1 Introduction 2 Learning to Hash Isotropic Hashing Scalable Graph Hashing with Feature Transformation Supervised Hashing with Latent Factor Models Column Sampling based Discrete Supervised Hashing Deep Supervised Hashing with Pairwise Labels Supervised Multimodal Hashing with SCM Multiple-Bit Quantization 3 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent SCOPE: Scalable Composite Optimization for Learning 4 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Power-Law Graph Computing Distributed Stochastic ADMM for Matrix Factorization 5 Conclusion Wu-Jun Li ( Big Learning CS, NJU 2 / 125

3 Introduction Outline 1 Introduction 2 Learning to Hash Isotropic Hashing Scalable Graph Hashing with Feature Transformation Supervised Hashing with Latent Factor Models Column Sampling based Discrete Supervised Hashing Deep Supervised Hashing with Pairwise Labels Supervised Multimodal Hashing with SCM Multiple-Bit Quantization 3 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent SCOPE: Scalable Composite Optimization for Learning 4 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Power-Law Graph Computing Distributed Stochastic ADMM for Matrix Factorization 5 Conclusion Wu-Jun Li ( Big Learning CS, NJU 3 / 125

4 Introduction Big Data Big data has attracted much attention from both academia and industry. Facebook: 750 million users Flickr: 6 billion photos Wal-Mart: 267 million items/day; 4PB data warehouse Sloan Digital Sky Survey: New Mexico telescope captures 200 GB image data/day Wu-Jun Li ( Big Learning CS, NJU 4 / 125

5 Introduction Definition of Big Data Gartner (2012): Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. ( 3Vs ) International Data Corporation (IDC) (2011): Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis. ( 4Vs ) McKinsey Global Institute (MGI) (2011): Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. Why did it not become hot until recent years? Big data; Cloud computing; Big data machine learning. Wu-Jun Li ( Big Learning CS, NJU 5 / 125

6 Introduction Big Data Machine Learning Definition: perform machine learning from big data. Role: key for big data Ultimate goal of big data processing is to mine value from data. Machine learning provides fundamental theory and computational techniques for big data mining and analysis. Wu-Jun Li ( Big Learning CS, NJU 6 / 125

7 Introduction Challenge Storage: memory and disk Computation: CPU Communication: network Wu-Jun Li ( Big Learning CS, NJU 7 / 125

8 Introduction Our Contribution Learning to hash: memory/disk/CPU/communication Parallel and distributed stochastic learning: memory/disk/CPU; but increased communication cost Other application-driven distributed learning Wu-Jun Li ( Big Learning CS, NJU 8 / 125

9 Learning to Hash Outline 1 Introduction 2 Learning to Hash Isotropic Hashing Scalable Graph Hashing with Feature Transformation Supervised Hashing with Latent Factor Models Column Sampling based Discrete Supervised Hashing Deep Supervised Hashing with Pairwise Labels Supervised Multimodal Hashing with SCM Multiple-Bit Quantization 3 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent SCOPE: Scalable Composite Optimization for Learning 4 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Power-Law Graph Computing Distributed Stochastic ADMM for Matrix Factorization 5 Conclusion Wu-Jun Li ( Big Learning CS, NJU 9 / 125

10 Learning to Hash Nearest Neighbor Search (Retrieval) Given a query point q, return the points closest (most similar) to q in the database (e.g., image retrieval). Underlies many machine learning, data mining, and information retrieval problems. Challenges in Big Data Applications: Curse of dimensionality Storage cost Query speed Wu-Jun Li ( Big Learning CS, NJU 10 / 125

11 Learning to Hash Similarity Preserving Hashing Wu-Jun Li ( Big Learning CS, NJU 11 / 125

12 Learning to Hash Reduce Dimensionality and Storage Cost Wu-Jun Li ( Big Learning CS, NJU 12 / 125

13 Learning to Hash Fast Query Speed By using hash-code to construct index, we can achieve constant or sub-linear search time complexity. In some cases, exhaustive search with linear time complexity is also acceptable because the distance calculation cost is low with binary representation. Wu-Jun Li ( Big Learning CS, NJU 13 / 125

14 Learning to Hash Two Stages of Hash Function Learning Two main categories: Category I: Projection Stage (Dimension Reduction): project with a real-valued projection function; given a point x, each projected dimension i is associated with a real-valued projection function $f_i(x)$ (e.g., $f_i(x) = w_i^T x$). Quantization Stage: turn real values into binary codes; this is the essential difference between metric learning and learning to hash. Category II: Binary-Code Learning Stage Hash Function Learning Stage Wu-Jun Li ( Big Learning CS, NJU 14 / 125

15 Learning to Hash Our Contribution Unsupervised Hashing Isotropic hashing (IsoHash) [NIPS 2012] Scalable graph hashing with feature transformation [IJCAI 2015] Supervised Hashing Supervised hashing with latent factor models [SIGIR 2014] Column sampling based discrete supervised hashing [AAAI 2016] Feature learning based deep supervised hashing with pairwise labels [IJCAI 2016] Multimodal Hashing Large-scale supervised multimodal hashing with semantic correlation maximization [AAAI 2014] Multiple-Bit Quantization Double-bit quantization (DBQ) [AAAI 2012] Manhattan quantization (MQ) [SIGIR 2012] Wu-Jun Li ( Big Learning CS, NJU 15 / 125

16 Learning to Hash Isotropic Hashing Motivation Problem: All existing methods use the same number of bits for different projected dimensions with different variances. Possible Solutions: Different number of bits for different dimensions (Variable bit quantization (Moran et al, ACL 2013)) Isotropic (equal) variances for all dimensions Wu-Jun Li ( Big Learning CS, NJU 16 / 125

17 Learning to Hash Isotropic Hashing PCA Hash To generate a code of m bits, PCAH performs PCA on X, and then uses the top m eigenvectors of the matrix $XX^T$ as columns of the projection matrix $W \in \mathbb{R}^{d \times m}$. Here, the top m eigenvectors are those corresponding to the m largest eigenvalues $\{\lambda_k\}_{k=1}^m$, generally arranged in non-increasing order $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$. Let $\lambda = [\lambda_1, \lambda_2, \ldots, \lambda_m]^T$. Then $\Lambda = W^T XX^T W = \mathrm{diag}(\lambda)$. Define the hash function $h(x) = \mathrm{sgn}(W^T x)$. Wu-Jun Li ( Big Learning CS, NJU 17 / 125
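As a concrete illustration, here is a minimal NumPy sketch of the PCAH baseline described above (a sketch under the assumption that X is the d x n zero-centered data matrix; the function names pcah_train and pcah_hash are ours, not from the original slides):

```python
import numpy as np

def pcah_train(X, m):
    # X: d x n zero-centered data matrix (columns are points); m: code length.
    S = X @ X.T                                    # d x d matrix XX^T from the slide
    eigvals, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    W = eigvecs[:, np.argsort(eigvals)[::-1][:m]]  # top-m eigenvectors as columns, d x m
    return W

def pcah_hash(x, W):
    # h(x) = sgn(W^T x); sgn(0) is mapped to +1 so codes stay in {-1, +1}
    return np.where(W.T @ x >= 0, 1, -1)
```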

18 Learning to Hash Isotropic Hashing Idea of IsoHash Learn an orthogonal matrix $Q \in \mathbb{R}^{m \times m}$ which makes $Q^T W^T XX^T W Q$ become a matrix with equal diagonal values. Effect of Q: to make each projected dimension have the same variance while keeping the Euclidean distances between any two points unchanged. Wu-Jun Li ( Big Learning CS, NJU 18 / 125

19 Learning to Hash Isotropic Hashing Accuracy (mAP) Table: mAP on CIFAR with varying numbers of bits, comparing IsoHash, PCAH, ITQ, SH, SIKH, and LSH. Wu-Jun Li ( Big Learning CS, NJU 19 / 125

20 Learning to Hash Isotropic Hashing Training Time Figure: Training time (s) vs. number of training data (x 10^4), comparing IsoHash-GF, IsoHash-LP, ITQ, SH, SIKH, LSH, and PCAH. Wu-Jun Li ( Big Learning CS, NJU 20 / 125

21 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) Problem: the memory cost and time complexity are at least O(n^2) for graph hashing if all pairwise similarities are explicitly computed. How to utilize the whole graph while avoiding O(n^2) complexity? Scalable Graph Hashing (SGH): a feature transformation (Shrivastava and Li, NIPS 2014) method to effectively approximate the whole graph without explicitly computing it; a sequential method for bit-wise complementary learning; linear complexity. Wu-Jun Li ( Big Learning CS, NJU 21 / 125

22 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) Objective function: $\min_W \| c\tilde{S} - \mathrm{sgn}(K(X)W^T)\,\mathrm{sgn}(K(X)W^T)^T \|_F^2$ s.t. $W K(X)^T K(X) W^T = I$. $\forall x$, define $K(x) = [\phi(x, x_1) - \sum_{i=1}^n \phi(x_i, x_1)/n, \ldots, \phi(x, x_m) - \sum_{i=1}^n \phi(x_i, x_m)/n]$ and $\tilde{S}_{ij} = 2S_{ij} - 1 \in (-1, 1]$. Notation: $X = \{x_1, \ldots, x_n\}^T \in \mathbb{R}^{n \times d}$: n data points. Pairwise similarity metric defined as $S_{ij} = e^{-\|x_i - x_j\|_F^2/\rho} \in (0, 1]$. Wu-Jun Li ( Big Learning CS, NJU 22 / 125

23 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) $\forall x$, define $P(x)$ and $Q(x)$: $P(x) = \left[\sqrt{\tfrac{2(e^2-1)}{e\rho}}\, e^{-\frac{\|x\|_F^2}{\rho}} x;\ \sqrt{\tfrac{e^2+1}{e}}\, e^{-\frac{\|x\|_F^2}{\rho}};\ 1\right]$, $Q(x) = \left[\sqrt{\tfrac{2(e^2-1)}{e\rho}}\, e^{-\frac{\|x\|_F^2}{\rho}} x;\ \sqrt{\tfrac{e^2+1}{e}}\, e^{-\frac{\|x\|_F^2}{\rho}};\ -1\right]$. $\forall x_i, x_j \in X$: $P(x_i)^T Q(x_j) = 2\left[\tfrac{e^2-1}{2e}\tfrac{2x_i^T x_j}{\rho} + \tfrac{e^2+1}{2e}\right] e^{-\frac{\|x_i\|_F^2 + \|x_j\|_F^2}{\rho}} - 1 \approx 2 e^{-\frac{\|x_i\|_F^2 + \|x_j\|_F^2 - 2x_i^T x_j}{\rho}} - 1 = 2 e^{-\frac{\|x_i - x_j\|_F^2}{\rho}} - 1 = \tilde{S}_{ij}$. Wu-Jun Li ( Big Learning CS, NJU 23 / 125

24 Learning to Hash Scalable Graph Hashing with Feature Transformation Feature Transformation Here, we use the approximation $\frac{e^2-1}{2e} x + \frac{e^2+1}{2e} \approx e^x$ for $x \in [-1, 1]$. We assume $-1 \leq \frac{2}{\rho} x_i^T x_j \leq 1$. It is easy to prove that $\rho = 2\max\{\|x_i\|_F^2\}_{i=1}^n$ can make $-1 \leq \frac{2}{\rho} x_i^T x_j \leq 1$. Then we have $\tilde{S} \approx P(X)^T Q(X)$. Wu-Jun Li ( Big Learning CS, NJU 24 / 125
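For illustration, a minimal NumPy sketch of this feature transformation as reconstructed above (the function name feature_transform and the layout of P and Q are our assumptions; it only sketches the construction, not the full SGH training):

```python
import numpy as np

def feature_transform(X):
    # X: n x d data matrix. Returns P, Q of shape n x (d+2) such that
    # P[i] @ Q[j] approximates 2*exp(-||x_i - x_j||^2 / rho) - 1.
    e = np.e
    sq_norms = np.sum(X ** 2, axis=1)            # ||x_i||^2 for each point
    rho = 2.0 * sq_norms.max()                   # keeps 2*x_i^T x_j / rho inside [-1, 1]
    w = np.exp(-sq_norms / rho)[:, None]         # n x 1 weights e^{-||x||^2 / rho}
    a = np.sqrt(2.0 * (e ** 2 - 1) / (e * rho))
    b = np.sqrt((e ** 2 + 1) / e)
    P = np.hstack([a * w * X, b * w, np.ones((X.shape[0], 1))])
    Q = np.hstack([a * w * X, b * w, -np.ones((X.shape[0], 1))])
    return P, Q

# Quick check against the exact similarity 2*exp(-||xi - xj||^2 / rho) - 1:
# X = np.random.randn(5, 3); P, Q = feature_transform(X); print(P @ Q.T)
```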

25 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) Direct relaxation may lead to poor performance. We adopt a sequential learning strategy in a bit-wise complementary manner. Residual definition: $R_t = c\tilde{S} - \sum_{i=1}^{t-1} \mathrm{sgn}(K(X)w_i)\,\mathrm{sgn}(K(X)w_i)^T$, with $R_1 = c\tilde{S}$. Objective function: $\min_{w_t} \|R_t - \mathrm{sgn}(K(X)w_t)\,\mathrm{sgn}(K(X)w_t)^T\|_F^2$ s.t. $w_t^T K(X)^T K(X) w_t = 1$. By relaxation, we can get: $\max_{w_t} \mathrm{tr}(w_t^T K(X)^T R_t K(X) w_t)$ s.t. $w_t^T K(X)^T K(X) w_t = 1$. Wu-Jun Li ( Big Learning CS, NJU 25 / 125

26 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) Then we obtain a generalized eigenvalue problem: $K(X)^T R_t K(X) w_t = \lambda K(X)^T K(X) w_t$. Key component: $cK(X)^T \tilde{S} K(X) \approx cK(X)^T P(X)^T Q(X) K(X) = c[K(X)^T P(X)^T][Q(X)K(X)]$. Adopting $R_t = c\tilde{S} - \sum_{i=1, i \neq t}^{c} \mathrm{sgn}(K(X)w_i)\,\mathrm{sgn}(K(X)w_i)^T$, we continue the sequential learning procedure. The information in $\tilde{S}$ is fully used, but $\tilde{S}$ is not explicitly computed. Time complexity is decreased from $O(n^2)$ to $O(n)$. Wu-Jun Li ( Big Learning CS, NJU 26 / 125
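Each bit therefore reduces to one generalized eigenvalue solve. A minimal sketch of that single step using SciPy's symmetric generalized eigensolver (leading_gev is our own helper name; A stands for $K(X)^T R_t K(X)$ and Z for the regularized $K(X)^T K(X)$):

```python
import numpy as np
from scipy.linalg import eigh

def leading_gev(A, Z):
    # Solve A w = lambda * Z w for symmetric A and positive-definite Z,
    # and return the eigenvector associated with the largest eigenvalue.
    vals, vecs = eigh(A, Z)      # eigenvalues returned in ascending order
    return vecs[:, -1]
```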

27 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH)
Algorithm 1 Sequential learning algorithm for SGH
Input: Feature vectors $X \in \mathbb{R}^{n \times d}$; code length c; number of kernel bases m.
Output: Weight matrix $W \in \mathbb{R}^{c \times m}$.
Procedure:
Construct $P(X)$ and $Q(X)$;
Construct $K(X)$ based on the kernel bases, which are m points randomly selected from X;
$A_0 = [K(X)^T P(X)^T][Q(X)K(X)]$; $A_1 = cA_0$; $Z = K(X)^T K(X) + \gamma I$;
for t = 1 to c do
  Solve the generalized eigenvalue problem $A_t w_t = \lambda Z w_t$;
  $U = [K(X)^T \mathrm{sgn}(K(X)w_t)][K(X)^T \mathrm{sgn}(K(X)w_t)]^T$;
  $A_{t+1} = A_t - U$;
end for
$\hat{A}_0 = A_{c+1}$;
Randomly permute $\{1, 2, \ldots, c\}$ to generate a random index set M;
for t = 1 to c do
  $\hat{t} = M(t)$;
  $\hat{A}_0 = \hat{A}_0 + K(X)^T \mathrm{sgn}(K(X)w_{\hat{t}})\,\mathrm{sgn}(K(X)w_{\hat{t}})^T K(X)$;
  Solve the generalized eigenvalue problem $\hat{A}_0 v = \lambda Z v$;
  Update $w_{\hat{t}} \leftarrow v$;
  $\hat{A}_0 = \hat{A}_0 - K(X)^T \mathrm{sgn}(K(X)w_{\hat{t}})\,\mathrm{sgn}(K(X)w_{\hat{t}})^T K(X)$;
end for
Wu-Jun Li ( Big Learning CS, NJU 27 / 125

28 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) Table: Top-1k precision on two benchmark datasets with 32, 64, 96, 128, and 256 bits, comparing SGH, ITQ, AGH, DGH-I, DGH-R, PCAH, and LSH. Wu-Jun Li ( Big Learning CS, NJU 28 / 125

29 Learning to Hash Scalable Graph Hashing with Feature Transformation Scalable Graph Hashing (SGH) Table: Training time on the two datasets with 32 to 256 bits, comparing SGH, ITQ, AGH, DGH-I, DGH-R, PCAH, and LSH, where the times of AGH, DGH-I, and DGH-R include an additional common term ($t_1$ and $t_2$ on the respective datasets). Wu-Jun Li ( Big Learning CS, NJU 29 / 125

30 Learning to Hash Supervised Hashing with Latent Factor Models Problem Definition Input: Feature vectors $x_i \in \mathbb{R}^D$, $i = 1, \ldots, N$ (compact form: $X \in \mathbb{R}^{N \times D}$). Similarity labels $s_{ij}$, $i, j = 1, \ldots, N$ (compact form: $S = \{s_{ij}\}$): $s_{ij} = 1$ if points i and j belong to the same class; $s_{ij} = 0$ if points i and j belong to different classes. Output: Binary codes $b_i \in \{-1, 1\}^Q$, $i = 1, \ldots, N$ (compact form: $B \in \{-1, 1\}^{N \times Q}$). When $s_{ij} = 1$, the Hamming distance between $b_i$ and $b_j$ should be low. When $s_{ij} = 0$, the Hamming distance between $b_i$ and $b_j$ should be high. Wu-Jun Li ( Big Learning CS, NJU 30 / 125

31 Learning to Hash Supervised Hashing with Latent Factor Models Motivation Existing supervised methods: High training complexity Semantic information is poorly utilized Wu-Jun Li ( Big Learning CS, NJU 31 / 125

32 Learning to Hash Supervised Hashing with Latent Factor Models Model The likelihood on the observed similarity labels S is defined as: $p(S \mid B) = \prod_{s_{ij} \in S} p(s_{ij} \mid B)$, with $p(s_{ij} \mid B) = a_{ij}$ if $s_{ij} = 1$ and $1 - a_{ij}$ if $s_{ij} = 0$, where $a_{ij} = \sigma(\Theta_{ij})$, $\sigma(x) = \frac{1}{1 + e^{-x}}$, and $\Theta_{ij} = \frac{1}{2} b_i^T b_j$. Relationship between the Hamming distance and the inner product: $\mathrm{dist}_H(b_i, b_j) = \frac{1}{2}(Q - b_i^T b_j) = \frac{1}{2}(Q - 2\Theta_{ij})$. Wu-Jun Li ( Big Learning CS, NJU 32 / 125

33 Learning to Hash Supervised Hashing with Latent Factor Models Relaxation Re-define $\Theta_{ij}$ as $\Theta_{ij} = \frac{1}{2} U_i^T U_j$. Then $p(S \mid B)$, $p(B)$, $p(B \mid S)$ become $p(S \mid U)$, $p(U)$, $p(U \mid S)$. Define a normal prior $p(U) = \prod_{d=1}^{Q} \mathcal{N}(U_{*d} \mid 0, \beta I)$. The log posterior of U can be derived as: $L = \log p(U \mid S) = \sum_{s_{ij} \in S} \left( s_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \right) - \frac{1}{2\beta}\|U\|_F^2 + c$. Wu-Jun Li ( Big Learning CS, NJU 33 / 125
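A minimal NumPy sketch of this relaxed log posterior, written with our own names (lfh_log_posterior, S_pairs as a list of observed (i, j, s_ij) triples) and intended only to make the formula concrete:

```python
import numpy as np

def lfh_log_posterior(U, S_pairs, beta):
    # U: N x Q real-valued latent factors; beta: prior variance.
    # Returns sum_{(i,j)} [ s_ij * Theta_ij - log(1 + exp(Theta_ij)) ] - ||U||_F^2 / (2*beta).
    L = 0.0
    for i, j, s in S_pairs:
        theta = 0.5 * U[i] @ U[j]
        # np.logaddexp(0, theta) = log(1 + exp(theta)), computed stably
        L += s * theta - np.logaddexp(0.0, theta)
    return L - np.sum(U ** 2) / (2.0 * beta)
```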

34 Learning to Hash Supervised Hashing with Latent Factor Models Stochastic Learning Furthermore, if we choose the subset of S by randomly selecting O(Q) of its columns and rows, we can further reduce the time cost to $O(NQ^2)$ per iteration. Wu-Jun Li ( Big Learning CS, NJU 34 / 125

35 Learning to Hash Supervised Hashing with Latent Factor Models MAP (CIFAR-10) Figure: MAP vs. code length on CIFAR-10, comparing LFH, KSH, MLH, ITQ, AGH, LSH, PCAH, SH, and SIKH. Wu-Jun Li ( Big Learning CS, NJU 35 / 125

36 Learning to Hash Supervised Hashing with Latent Factor Models Training Time (CIFAR-10) Figure: Log training time vs. code length on CIFAR-10, comparing LFH, KSH, MLH, ITQ, AGH, LSH, PCAH, SH, and SIKH. Wu-Jun Li ( Big Learning CS, NJU 36 / 125

37 Learning to Hash Column Sampling based Discrete Supervised Hashing Motivation Learning to hash is essentially a discrete optimization problem Most existing methods solve relaxed continuous problems FastH [Lin et al., CVPR 2014] illustrates better accuracy with discrete optimization, but it cannot utilize all training points due to high time complexity. We propose a discrete supervised hashing method which can leverage all training points: COlumn Sampling based DIscrete Supervised Hashing (COSDISH). Wu-Jun Li ( Big Learning CS, NJU 37 / 125

38 Learning to Hash Column Sampling based Discrete Supervised Hashing Objective Function KSH [CVPR 12], TSH [ICCV 13], and FastH [CVPR 14] all optimize the following objective function: $\min_{B \in \{-1,1\}^{n \times q}} \|qS - BB^T\|_F^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} (qS_{ij} - B_{i*}B_{j*}^T)^2$, where $S_{ij} \in \{-1, 1\}$ denotes the supervised label. The inner product reflects the opposite of the Hamming distance: $\mathrm{dist}_H(B_{i*}, B_{j*}) = (q - B_{i*}B_{j*}^T)/2$. Wu-Jun Li ( Big Learning CS, NJU 38 / 125

39 Learning to Hash Column Sampling based Discrete Supervised Hashing Column Sampling In each iteration, we randomly choose a subset $\Omega$ of $N = \{1, 2, \ldots, n\}$ and sample $|\Omega|$ columns of S with the column numbers indexed by $\Omega$. We use $\tilde{S} \in \{-1,1\}^{n \times |\Omega|}$ to denote the sampled sub-similarity matrix. Let $\Gamma = N \setminus \Omega$; we can split $\tilde{S}$ and B into two parts: $\Omega$ part: $\tilde{S}^{\Omega} \in \{-1,1\}^{|\Omega| \times |\Omega|}$ and $B^{\Omega} \in \{-1,1\}^{|\Omega| \times q}$; $\Gamma$ part: $\tilde{S}^{\Gamma} \in \{-1,1\}^{|\Gamma| \times |\Omega|}$ and $B^{\Gamma} \in \{-1,1\}^{|\Gamma| \times q}$. Based on the sampled columns in each iteration, the objective can be reformulated as follows: $\min_{B^{\Omega}, B^{\Gamma}} \|q\tilde{S}^{\Gamma} - B^{\Gamma}[B^{\Omega}]^T\|_F^2 + \|q\tilde{S}^{\Omega} - B^{\Omega}[B^{\Omega}]^T\|_F^2$. Our learning strategy is to alternately optimize $B^{\Omega}$ and $B^{\Gamma}$. Wu-Jun Li ( Big Learning CS, NJU 39 / 125

40 Learning to Hash Column Sampling based Discrete Supervised Hashing Alternating Learning Update $B^{\Gamma}$ with $B^{\Omega}$ Fixed: $B^{\Gamma} = \mathrm{sgn}(\tilde{S}^{\Gamma} B^{\Omega})$ reaches the minimum of $f_1(\cdot)$, which changes the F-norm to the L1-norm. Theorem: suppose that $f_1(F_1)$ and $f_2(F_2)$ reach their minima at the points $F_1^*$ and $F_2^*$, respectively. We have $f_2(F_1^*) \leq 2q\, f_2(F_2^*)$. Wu-Jun Li ( Big Learning CS, NJU 40 / 125
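The closed-form update for the Gamma part is just a sign operation; a one-function NumPy sketch (update_B_Gamma is our own name, and sgn(0) is mapped to +1 as an assumption so that entries stay in {-1, +1}):

```python
import numpy as np

def update_B_Gamma(S_Gamma, B_Omega):
    # S_Gamma: |Gamma| x |Omega| sampled sub-similarity matrix with entries in {-1, +1};
    # B_Omega: |Omega| x q binary codes of the sampled points.
    # Closed-form update: B_Gamma = sgn(S_Gamma @ B_Omega).
    return np.where(S_Gamma @ B_Omega >= 0, 1, -1)
```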

41 Learning to Hash Column Sampling based Discrete Supervised Hashing Alternating Learning Update $B^{\Omega}$ with $B^{\Gamma}$ Fixed: when $B^{\Gamma}$ is fixed, the sub-problem of $B^{\Omega}$ is given by: $\min_{B^{\Omega} \in \{-1,1\}^{|\Omega| \times q}} \|q\tilde{S}^{\Gamma} - B^{\Gamma}[B^{\Omega}]^T\|_F^2 + \|q\tilde{S}^{\Omega} - B^{\Omega}[B^{\Omega}]^T\|_F^2$. We can transform the above problem into q binary quadratic programming (BQP) problems. The optimization of the kth bit of $B^{\Omega}$ is given by: $\min_{b^k \in \{-1,1\}^{|\Omega|}} [b^k]^T Q^{(k)} b^k + [b^k]^T p^{(k)}$, where $b^k$ denotes the kth column of $B^{\Omega}$, and $Q^{(k)}_{i,j} = -2\left(q\tilde{S}^{\Omega}_{i,j} - \sum_{m=1}^{k-1} b^m_i b^m_j\right)$ for $i \neq j$, $Q^{(k)}_{i,i} = 0$, $p^{(k)}_i = -2\sum_{l=1}^{|\Gamma|} B^{\Gamma}_{l,k}\left(q\tilde{S}^{\Gamma}_{l,i} - \sum_{m=1}^{k-1} B^{\Gamma}_{l,m} B^{\Omega}_{i,m}\right)$. Wu-Jun Li ( Big Learning CS, NJU 41 / 125

42 Learning to Hash Column Sampling based Discrete Supervised Hashing Experiment Accuracy (MAP) Wu-Jun Li ( Big Learning CS, NJU 42 / 125

43 Learning to Hash Column Sampling based Discrete Supervised Hashing Experiment Training Time Wu-Jun Li ( Big Learning CS, NJU 43 / 125

44 Learning to Hash Deep Supervised Hashing with Pairwise Labels Motivation and Contribution Motivation: Most existing methods are based on hand-crafted features which might not be optimally compatible with the hashing procedure. Deep hashing with simultaneous feature learning and hash-code learning can achieve better performance. Most existing deep hashing methods are supervised with ranking (triplet) labels. For pairwise labels, no existing method performs simultaneous feature learning and hash-code learning. Our contribution: an end-to-end framework, called deep pairwise-supervised hashing (DPSH), to perform simultaneous feature learning and hash-code learning for applications with pairwise labels. The first work for this case. Wu-Jun Li ( Big Learning CS, NJU 44 / 125

45 Learning to Hash Deep Supervised Hashing with Pairwise Labels Model Figure: the end-to-end deep learning architecture for DPSH, with a feature learning part (convolutions + pooling, weights shared across the two branches) and an objective function part (binary codes + pairwise similarity). Feature learning part: a convolutional neural network (CNN) with 7 layers (five convolution + pooling layers, two 4096-dimensional fully connected layers). Wu-Jun Li ( Big Learning CS, NJU 45 / 125

46 Learning to Hash Deep Supervised Hashing with Pairwise Labels Objective Function Part $\min_{B,W,v,\theta} J = -\sum_{s_{ij} \in S} \left(s_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}})\right) + \eta \sum_{i=1}^{n} \|b_i - (W^T\phi(x_i;\theta) + v)\|_2^2$. Here: n points (images) $X = \{x_i\}_{i=1}^n$; pairwise labels $S = \{s_{ij}\}$ with $s_{ij} \in \{0,1\}$; binary code $b_i \in \{-1,1\}^c$ for each point $x_i$, $B = \{b_i\}_{i=1}^n$; $\Theta_{ij} = \frac{1}{2}u_i^T u_j$, where $u_i = W^T\phi(x_i;\theta) + v$; $\theta$ is the CNN parameters of the feature learning part. Wu-Jun Li ( Big Learning CS, NJU 46 / 125
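To make the objective concrete, a minimal NumPy sketch of the loss value given the network outputs (dpsh_loss, U, B, and S_pairs are our own names; the CNN forward pass producing u_i is assumed to have already happened):

```python
import numpy as np

def dpsh_loss(U, B, S_pairs, eta):
    # U: n x c real-valued outputs u_i = W^T phi(x_i; theta) + v;
    # B: n x c binary codes in {-1, +1}; S_pairs: list of (i, j, s_ij), s_ij in {0, 1}.
    J = 0.0
    for i, j, s in S_pairs:
        theta = 0.5 * U[i] @ U[j]
        # negative log-likelihood term: -(s_ij * Theta_ij - log(1 + exp(Theta_ij)))
        J -= s * theta - np.logaddexp(0.0, theta)
    J += eta * np.sum((B - U) ** 2)   # quantization penalty sum_i ||b_i - u_i||_2^2
    return J
```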

47 Learning to Hash Deep Supervised Hashing with Pairwise Labels Accuracy Comparison to state-of-the-art baselines Table: MAP on CIFAR-10 and NUS-WIDE with 12, 24, 32, and 48 bits, comparing DPSH with NINH, CNNH, FastH, SDH, KSH, LFH, SPLH, ITQ, and SH. NINH [CVPR 15]; CNNH [AAAI 14]; FastH [CVPR 14]; SDH [CVPR 15]; KSH [CVPR 12]; LFH [SIGIR 14]; SPLH [ICML 10]; ITQ [CVPR 11]; SH [NIPS 08] Wu-Jun Li ( Big Learning CS, NJU 47 / 125

48 Learning to Hash Deep Supervised Hashing with Pairwise Labels Accuracy Comparison to deep baselines with ranking (triplet) labels Table: MAP on CIFAR-10 and NUS-WIDE with 16, 24, 32, and 48 bits, comparing DPSH with DRSCH, DSCH, and DSRH. DRSCH [TIP 15]; DSCH [TIP 15]; DSRH [CVPR 15] Wu-Jun Li ( Big Learning CS, NJU 48 / 125

49 Learning to Hash Supervised Multimodal Hashing with SCM Supervised Multimodal Similarity Search Given a query of either image or text, return images or texts similar to it in both feature space and semantics (label information). Wu-Jun Li ( Big Learning CS, NJU 49 / 125

50 Learning to Hash Supervised Multimodal Hashing with SCM Motivation and Contribution Motivation: existing supervised methods are not scalable. Contribution: avoids explicitly computing the pairwise similarity matrix, achieving linear time complexity w.r.t. the size of the training data; a sequential learning method with a closed-form solution for each bit, with no hyper-parameters or stopping conditions needed. Wu-Jun Li ( Big Learning CS, NJU 50 / 125

51 Learning to Hash Supervised Multimodal Hashing with SCM In matrix form, we can rewrite the problem as follows: $\min_{W_x, W_y} \|\mathrm{sgn}(XW_x)\,\mathrm{sgn}(YW_y)^T - cS\|_F^2$ s.t. $\mathrm{sgn}(XW_x)^T\mathrm{sgn}(XW_x) = nI_c$ and $\mathrm{sgn}(YW_y)^T\mathrm{sgn}(YW_y) = nI_c$. Wu-Jun Li ( Big Learning CS, NJU 51 / 125

52 Learning to Hash Supervised Multimodal Hashing with SCM Sequential Strategy Assuming that the projection vectors $w_x^{(1)}, \ldots, w_x^{(t-1)}$ and $w_y^{(1)}, \ldots, w_y^{(t-1)}$ have been learned, to learn the next projection vectors $w_x^{(t)}$ and $w_y^{(t)}$: define a residue matrix $R_t = cS - \sum_{k=1}^{t-1} \mathrm{sgn}(Xw_x^{(k)})\,\mathrm{sgn}(Yw_y^{(k)})^T$. The objective function can be written as $\min_{w_x^{(t)}, w_y^{(t)}} \|\mathrm{sgn}(Xw_x^{(t)})\,\mathrm{sgn}(Yw_y^{(t)})^T - R_t\|_F^2$. Wu-Jun Li ( Big Learning CS, NJU 52 / 125

53 Learning to Hash Supervised Multimodal Hashing with SCM
Algorithm 2 Learning Algorithm of the SCM Hashing Method
$C_{xy}^{(0)} \leftarrow 2(X^T L)(Y^T L)^T - (X^T 1_n)(Y^T 1_n)^T$; $C_{xy}^{(1)} \leftarrow c\,C_{xy}^{(0)}$; $C_{xx} \leftarrow X^T X + \gamma I_{d_x}$; $C_{yy} \leftarrow Y^T Y + \gamma I_{d_y}$;
for t = 1 to c do
  Solve the generalized eigenvalue problem $C_{xy}^{(t)} C_{yy}^{-1} [C_{xy}^{(t)}]^T w_x = \lambda^2 C_{xx} w_x$ to obtain the optimal solution $w_x^{(t)}$ corresponding to the largest eigenvalue $\lambda_{\max}$;
  $w_y^{(t)} \leftarrow C_{yy}^{-1} [C_{xy}^{(t)}]^T w_x^{(t)} / \lambda_{\max}$;
  $h_x^{(t)} \leftarrow \mathrm{sgn}(Xw_x^{(t)})$; $h_y^{(t)} \leftarrow \mathrm{sgn}(Yw_y^{(t)})$;
  $C_{xy}^{(t+1)} \leftarrow C_{xy}^{(t)} - (X^T \mathrm{sgn}(Xw_x^{(t)}))(Y^T \mathrm{sgn}(Yw_y^{(t)}))^T$;
end for
Wu-Jun Li ( Big Learning CS, NJU 53 / 125

54 Learning to Hash Supervised Multimodal Hashing with SCM Scalability Table: Training time (in seconds) on the NUS-WIDE dataset by varying the size of the training set, comparing SCM-Seq, SCM-Orth, CCA, CCA-3V, CVH, CRH, and MLBE. Wu-Jun Li ( Big Learning CS, NJU 54 / 125

55 Learning to Hash Supervised Multimodal Hashing with SCM Accuracy Table: MAP results on NUS-WIDE for code lengths c = 16, 24, 32 on two tasks (image query vs. text database, and text query vs. image database), comparing SCM-Seq, SCM-Orth, CCA, CCA-3V, CVH, CRH, and MLBE; the best performance is shown in boldface. Wu-Jun Li ( Big Learning CS, NJU 55 / 125

56 Learning to Hash Multiple-Bit Quantization Double Bit Quantization Figure: Point distribution of the real values computed by PCA on the 22K LabelMe data set, and different coding results based on the distribution: (a) single-bit quantization (SBQ); (b) hierarchical hashing (HH); (c) double-bit quantization (DBQ). Wu-Jun Li ( Big Learning CS, NJU 56 / 125

57 Learning to Hash Multiple-Bit Quantization Experiment Table: mAP on the LabelMe data set for different numbers of bits, comparing the SBQ, HH, and DBQ quantization schemes applied to ITQ, SH, PCA, LSH, and SIKH. Wu-Jun Li ( Big Learning CS, NJU 57 / 125

58 Learning to Hash Multiple-Bit Quantization Manhattan Quantization Wu-Jun Li ( Big Learning CS, NJU 58 / 125

59 Learning to Hash Multiple-Bit Quantization Experiment Table: mAP on the ANN SIFT1M data set for different numbers of bits, comparing SBQ, HQ, and 2-MQ applied to ITQ, SIKH, LSH, SH, and PCA; the best mAP among SBQ, HQ, and 2-MQ under the same setting is shown in boldface. Wu-Jun Li ( Big Learning CS, NJU 59 / 125

60 Parallel and Distributed Stochastic Learning Outline 1 Introduction 2 Learning to Hash Isotropic Hashing Scalable Graph Hashing with Feature Transformation Supervised Hashing with Latent Factor Models Column Sampling based Discrete Supervised Hashing Deep Supervised Hashing with Pairwise Labels Supervised Multimodal Hashing with SCM Multiple-Bit Quantization 3 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent SCOPE: Scalable Composite Optimization for Learning 4 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Power-Law Graph Computing Distributed Stochastic ADMM for Matrix Factorization 5 Conclusion Wu-Jun Li ( Big Learning CS, NJU 60 / 125

61 Parallel and Distributed Stochastic Learning Machine Learning Supervised Learning: given a set of training examples $\{(x_i, y_i)\}_{i=1}^n$, supervised learning tries to solve the following regularized empirical risk minimization problem: $\min_w f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)$, where $f_i(w)$ is the loss function (plus some regularization term) defined on example i, and w is the parameter to learn. Examples: logistic regression: $f(w) = \frac{1}{n}\sum_{i=1}^n [\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2]$; SVM: $f(w) = \frac{1}{n}\sum_{i=1}^n [\max\{0, 1 - y_i x_i^T w\} + \frac{\lambda}{2}\|w\|^2]$; deep learning models. Unsupervised Learning: many unsupervised learning models, such as PCA and matrix factorization, can also be reformulated as similar problems. Wu-Jun Li ( Big Learning CS, NJU 61 / 125
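For concreteness, a minimal NumPy sketch of the regularized logistic-regression objective above (logistic_objective is our own name; labels are assumed to be in {-1, +1}):

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    # f(w) = (1/n) * sum_i [ log(1 + exp(-y_i x_i^T w)) + (lam/2) * ||w||^2 ]
    # X: n x d feature matrix, y: n-vector with entries in {-1, +1}.
    margins = -y * (X @ w)
    return np.mean(np.logaddexp(0.0, margins)) + 0.5 * lam * (w @ w)
```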

62 Parallel and Distributed Stochastic Learning Machine Learning for Big Data For big data applications, first-order methods have become much more popular than higher-order methods for learning (optimization). Gradient descent methods are the most representative first-order methods. (Deterministic) gradient descent (GD): $w_{t+1} \leftarrow w_t - \eta_t \left[\frac{1}{n}\sum_{i=1}^n \nabla f_i(w_t)\right]$, where t is the iteration number. Linear convergence rate: $O(\rho^t)$; iteration cost is O(n). Stochastic gradient descent (SGD): in the t-th iteration, randomly choose an example $i_t \in \{1, 2, \ldots, n\}$, then update $w_{t+1} \leftarrow w_t - \eta_t \nabla f_{i_t}(w_t)$. Iteration cost is O(1); the convergence rate is sublinear: O(1/t). Wu-Jun Li ( Big Learning CS, NJU 62 / 125
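A minimal sketch of plain SGD on the logistic-regression objective from the previous slide (sgd_logistic, the fixed step size, and the iteration count are illustrative choices of ours):

```python
import numpy as np

def sgd_logistic(X, y, lam, eta, T):
    # At step t, pick a random index i_t and move along -grad f_{i_t}(w).
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(T):
        i = rng.integers(n)
        # gradient of f_i(w) = log(1 + exp(-y_i x_i^T w)) + (lam/2) * ||w||^2
        grad = -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w))) + lam * w
        w -= eta * grad
    return w
```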

63 Parallel and Distributed Stochastic Learning Stochastic Learning for Big Data Researchers recently proposed improved versions of SGD: SAG [Roux et al., NIPS 2012], SDCA [Shalev-Shwartz and Zhang, JMLR 2013], SVRG [Johnson and Zhang, NIPS 2013]. Number of gradient ($\nabla f_i$) evaluations to reach $\epsilon$: GD: $O(n\kappa\log(\frac{1}{\epsilon}))$; SGD: $O(\kappa\frac{1}{\epsilon})$; SAG: $O(n\log(\frac{1}{\epsilon}))$ when $n \geq 8\kappa$; SDCA: $O((n + \kappa)\log(\frac{1}{\epsilon}))$; SVRG: $O((n + \kappa)\log(\frac{1}{\epsilon}))$, where $\kappa = \frac{L}{\mu} > 1$ is the condition number of the objective function. Stochastic Learning: stochastic GD; stochastic coordinate descent; stochastic dual coordinate ascent. Wu-Jun Li ( Big Learning CS, NJU 63 / 125

64 Parallel and Distributed Stochastic Learning Parallel and Distributed Stochastic Learning To further improve the learning scalability (speed): Parallel stochastic learning: One machine with multiple cores and a shared memory Distributed stochastic learning: A cluster with multiple machines Key issues: cooperation Parallel stochastic learning: lock vs. lock-free: waiting cost and lock cost Distributed stochastic learning: synchronous vs. asynchronous: waiting cost and communication cost Wu-Jun Li ( Big Learning CS, NJU 64 / 125

65 Parallel and Distributed Stochastic Learning Our Contributions Parallel stochastic learning: AsySVRG Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee. Distributed stochastic learning: SCOPE Scalable Composite Optimization for Learning Wu-Jun Li ( Big Learning CS, NJU 65 / 125

66 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Motivation and Contribution Motivation: Existing asynchronous parallel SGD: Hogwild! [Recht et al. 2011], CoCoA [Jaggi et al. 2014], and PASSCoDe [Hsieh, Yu, and Dhillon 2015] No parallel methods for SVRG. Lock-free: empirically effective, but no theoretical proof. Contribution: A fast asynchronous method to parallelize SVRG, called AsySVRG. A lock-free parallel strategy for both read and write Linear convergence rate with theoretical proof Outperforms Hogwild! in experiments AsySVRG is the first lock-free parallel SGD method with theoretical proof of convergence. Wu-Jun Li ( Big Learning CS, NJU 66 / 125

67 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent AsySVRG: a multi-thread version of SVRG
Initialization: p threads, initialize $w_0$, $\eta$;
for t = 0, 1, 2, ... do
  $u_0 = w_t$;
  All threads compute the full gradient $\nabla f(u_0) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(u_0)$ in parallel;
  $u = w_t$;
  For each thread, do:
  for m = 1 to M do
    Read the current value of u, denoted as $\hat{u}$, from the shared memory, and randomly pick an i from $\{1, \ldots, n\}$;
    Compute the update vector: $\hat{v} = \nabla f_i(\hat{u}) - \nabla f_i(u_0) + \nabla f(u_0)$;
    $u \leftarrow u - \eta\hat{v}$;
  end for
  Take $w_{t+1}$ to be the current value of u in the shared memory;
end for
Wu-Jun Li ( Big Learning CS, NJU 67 / 125
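To show the structure that the p threads share, here is a single-thread SVRG sketch (only the serial skeleton; the multi-thread, lock-free shared-memory behavior that defines AsySVRG is deliberately omitted, and the names svrg, grad_i, full_grad are ours):

```python
import numpy as np

def svrg(grad_i, full_grad, w0, n, eta, T, M):
    # grad_i(w, i): stochastic gradient of f_i at w; full_grad(w): full gradient of f.
    w = w0.copy()
    rng = np.random.default_rng(0)
    for _ in range(T):
        u0 = w.copy()
        mu = full_grad(u0)                        # snapshot full gradient
        u = w.copy()
        for _ in range(M):
            i = rng.integers(n)
            v = grad_i(u, i) - grad_i(u0, i) + mu  # variance-reduced gradient
            u -= eta * v
        w = u                                     # take the last inner iterate
    return w
```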

68 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Lock-free Analysis In all the GD or SGD methods for solving the objective function, the key step can be written as $u \leftarrow u + \Delta$. Notation: $\Delta_{i,j}$: the j-th update vector computed by the i-th thread; $u \in \mathbb{R}^d$ and $u = (u^{(1)}, u^{(2)}, \ldots, u^{(d)})$; $t^{(k)}_{i,j}$: the time that the operation $u^{(k)} \leftarrow u^{(k)} + \Delta^{(k)}_{i,j}$ has been completed (not the time that the operation begins). We assume $\forall i, j$: $t^{(1)}_{i,j} < t^{(2)}_{i,j} < \ldots < t^{(d)}_{i,j}$. Wu-Jun Li ( Big Learning CS, NJU 68 / 125

69 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Lock-free Analysis: update sequence Since $u^{(1)}$ can only be changed by at most one thread at any absolute time, the $t^{(1)}_{i,j}$ are different from each other. So we can sort these $t^{(1)}_{i,j}$ as $t^{(1)}_0 < t^{(1)}_1 < \ldots < t^{(1)}_{\tilde{M}-1}$ (with $\tilde{M} = p \times M$), and let $\Delta_0, \Delta_1, \ldots, \Delta_{\tilde{M}-1}$ be the corresponding update vectors. Since it is lock-free, for each update vector $\Delta_m$ the real update vector is $B_m\Delta_m$ because of over-writing, where $B_m$ is a diagonal matrix whose diagonal elements are 0 or 1. After all the inner loops stop, we get: $w_{t+1} = u_0 + \sum_{m=0}^{\tilde{M}-1} B_m\Delta_m$ (1). Wu-Jun Li ( Big Learning CS, NJU 69 / 125

70 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Lock-free Analysis: update sequence According to (1), we define a sequence $\{u_m\}$ as follows: $u_m = u_0 + \sum_{i=0}^{m-1} B_i\Delta_i$ (2), which means $u_{m+1} = u_m + B_m\Delta_m$. Note: the sequence $\{u_m\}$ ($m = 1, 2, \ldots, \tilde{M}-1$) is synthetic, and the whole $u_m$ may never occur in the shared memory. What we can get is only the final value $u_{\tilde{M}}$. Wu-Jun Li ( Big Learning CS, NJU 70 / 125

71 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Lock-free Analysis: read sequence Assume the old update vectors $\Delta_0, \Delta_1, \ldots, \Delta_{a(m)-1}$ have been completely applied to u when one thread is reading the shared variable. At the same time, some new update vectors might be updating u. So we can write $\hat{u}_m$, read by the thread to compute $\Delta_m$, as follows: $\hat{u}_m = u_{a(m)} + \sum_{i=a(m)}^{b(m)} P_{m,i-a(m)}\Delta_i$, where $P_{m,i-a(m)}$ is a diagonal matrix whose diagonal elements are 0 or 1. According to the principle of the order, $\Delta_i$ ($i \geq m$) should not be read into $\hat{u}_m$. So $b(m) < m$, which means: $\hat{u}_m = u_{a(m)} + \sum_{i=a(m)}^{m-1} P_{m,i-a(m)}\Delta_i$. Wu-Jun Li ( Big Learning CS, NJU 71 / 125

72 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Convergence result With some assumptions, our algorithm achieves a linear convergence rate as follows: $\mathbb{E}f(w_{t+1}) - f(w_*) \leq \left(c_1^{\tilde{M}} + \frac{c_2}{1 - c_1}\right)\left(\mathbb{E}f(w_t) - f(w_*)\right)$, where $c_1 = 1 - \alpha\eta\mu + c_2$, $c_2 = \eta^2\left(\frac{8\tau L^3\eta\rho^2(\rho^\tau - 1)}{\rho - 1} + 2L^2\rho\right)$, and $\tilde{M} = p \times M$ is the total number of iterations of the inner loop. Note: since it is lock-free, we do not know the exact $B_m$, and we cannot take the average of the $B_m u_m$ to be $w_{t+1}$. Wu-Jun Li ( Big Learning CS, NJU 72 / 125

73 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Experiments Experimental platform: a server with 12 Intel cores and 64G memory. Model: logistic regression with L2-norm: $f(w) = \frac{1}{n}\sum_{i=1}^n \log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2$. Data sets (we set λ = ):
dataset instances features memory type
rcv1 20,242 47,236 36M sparse
real-sim 72,309 20,958 90M sparse
news20 19,996 1,355, M sparse
epsilon 400,000 2,000 11G dense
Wu-Jun Li ( Big Learning CS, NJU 73 / 125

74 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Experiments: Computation Cost Figure: log(objective value − min) vs. number of effective passes on (a) rcv1, (b) real-sim, (c) news20, and (d) epsilon, comparing AsySVRG, AsySVRG-lock, Hogwild!, and Hogwild!-lock. Wu-Jun Li ( Big Learning CS, NJU 74 / 125

75 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Experiments: Total Time Cost Figure: log(objective value − min) vs. time on (a) rcv1, (b) real-sim, (c) news20, and (d) epsilon, comparing AsySVRG and Hogwild!. Wu-Jun Li ( Big Learning CS, NJU 75 / 125

76 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent Experiments: Speedup Figure: Speedup vs. number of threads on (a) rcv1, (b) real-sim, (c) news20, and (d) epsilon, comparing AsySVRG-lock and lock-free AsySVRG. Wu-Jun Li ( Big Learning CS, NJU 76 / 125

77 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Motivation and Contribution Motivation: Bulk synchronous parallel (BSP) models, such as MapReduce, are commonly considered to be inefficient for distributed stochastic learning. Is there any technique to solve the issues of BSP models? Contribution: A novel distributed stochastic learning method, called scalable composite optimization for learning (SCOPE), on BSP models Both computation-efficient and communication-efficient Linear convergence rate with theoretical proof Can be easily integrated into the data processing pipeline of Spark Outperform other state-of-the-art distributed learning methods on Spark Wu-Jun Li ( Big Learning CS, NJU 77 / 125

78 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Framework of SCOPE Figure: Distributed framework of SCOPE: a Master holds the parameter w, and Workers 1 through p each hold a local data partition. Wu-Jun Li ( Big Learning CS, NJU 78 / 125

79 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Optimization Algorithm: Master Task of the Master in SCOPE:
Initialization: p Workers, $w_0$;
for t = 0, 1, 2, ..., T do
  Send $w_t$ to the Workers;
  Wait until it receives $z_1, z_2, \ldots, z_p$ from the p Workers;
  Compute the full gradient $z = \frac{1}{n}\sum_{k=1}^p z_k$, and then send z to each Worker;
  Wait until it receives $\tilde{u}_1, \tilde{u}_2, \ldots, \tilde{u}_p$ from the p Workers;
  Compute $w_{t+1} = \frac{1}{p}\sum_{k=1}^p \tilde{u}_k$;
end for
Wu-Jun Li ( Big Learning CS, NJU 79 / 125

80 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Optimization Algorithm: Workers Task of the Workers in SCOPE:
Initialization: initialize $\eta$ and $c > 0$;
For Worker k:
for t = 0, 1, 2, ..., T do
  Wait until it gets the newest parameter $w_t$ from the Master;
  Let $u_{k,0} = w_t$, compute the local gradient sum $z_k = \sum_{i \in D_k} \nabla f_i(w_t)$, and then send $z_k$ to the Master;
  Wait until it gets the full gradient z from the Master;
  for m = 0 to M − 1 do
    Randomly pick an instance with index $i_{k,m}$ from $D_k$;
    $u_{k,m+1} = u_{k,m} - \eta\left(\nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + z + c(u_{k,m} - w_t)\right)$;
  end for
  Send $u_{k,M}$, or the average of these $\{u_{k,m}\}$, which is called the locally updated parameter and denoted as $\tilde{u}_k$, to the Master;
end for
Wu-Jun Li ( Big Learning CS, NJU 80 / 125
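A minimal sketch of one Worker's inner loop in plain Python/NumPy (the actual SCOPE implementation runs on Spark; scope_local_update, grad_i, and local_indices are our own names used only to make the update rule concrete):

```python
import numpy as np

def scope_local_update(grad_i, w_t, z, local_indices, eta, c, M, rng):
    # One Worker's inner loop:
    # u <- u - eta * ( grad_i(u) - grad_i(w_t) + z + c*(u - w_t) ),
    # where z is the full gradient broadcast by the Master and c > 0 pulls u toward w_t.
    u = w_t.copy()
    for _ in range(M):
        i = rng.choice(local_indices)
        v = grad_i(u, i) - grad_i(w_t, i) + z + c * (u - w_t)
        u -= eta * v
    return u    # alternatively, return the average of the inner iterates
```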

81 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Convergence Let $\alpha = 1 - \eta(2\mu + c) < 1$, $\beta = c\eta + 3L^2\eta^2$, and $\alpha + \beta < 1$. We have the following theorems: Theorem: if we take $w_{t+1} = \frac{1}{p}\sum_{k=1}^p u_{k,M}$, then we get the following convergence result: $\mathbb{E}\|w_{t+1} - w_*\|^2 \leq \left(\alpha^M + \frac{\beta}{1-\alpha}\right)\mathbb{E}\|w_t - w_*\|^2$. Theorem: if we take $w_{t+1} = \frac{1}{p}\sum_{k=1}^p \tilde{u}_k$ with $\tilde{u}_k = \frac{1}{M}\sum_{m=0}^{M-1} u_{k,m}$, then we get the following convergence result: $\mathbb{E}\|w_{t+1} - w_*\|^2 \leq \left(\frac{1}{M(1-\alpha)} + \frac{\beta}{1-\alpha}\right)\mathbb{E}\|w_t - w_*\|^2$. Wu-Jun Li ( Big Learning CS, NJU 81 / 125

82 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Communication Cost Traditional mini-batch based methods: O(T n) SCOPE: O(T ) Wu-Jun Li ( Big Learning CS, NJU 82 / 125

83 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Experiment Logistic regression (LR) with an L2-norm regularization term: $f(w) = \frac{1}{n}\sum_{i=1}^n \left[\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2\right]$.
Table: Datasets for evaluation (instances, features, memory, λ):
MNIST-8M 8,100, G 1e-4
epsilon 400,000 2,000 11G 1e-4
KDD12 73,209,277 1,427,495 21G 1e-4
Data-A 106,691, G 1e-6
Two Spark clusters: small: 1 Master and 16 Workers; large: 1 Master and 128 Workers. Wu-Jun Li ( Big Learning CS, NJU 83 / 125

84 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Experiment Baselines: MLlib [Meng et al., 2015]: MLlib is an open source library for distributed machine learning on Spark; we compare our method with the distributed L-BFGS of MLlib, which is a batch learning method. LibLinear [Lin et al., 2014a]: LibLinear is a distributed Newton method, which is also a batch learning method. Splash [Zhang and Jordan, 2015]: Splash is a distributed SGD method that uses a local learning strategy to reduce communication cost. CoCoA [Jaggi et al., 2014]: CoCoA is a distributed dual coordinate ascent method. CoCoA+ [Ma et al., 2015]: CoCoA+ is an improved version of CoCoA; CoCoA+ adopts adding rather than averaging to combine local updates. Wu-Jun Li ( Big Learning CS, NJU 84 / 125

85 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Experiment Figure: objective value − optimal vs. CPU time (ms) on (a) MNIST-8M, (b) epsilon, and (c) KDD12 with 16 cores, and (d) Data-A with 128 cores, comparing SCOPE, LibLinear, CoCoA, MLlib (L-BFGS), Splash, and CoCoA+. Wu-Jun Li ( Big Learning CS, NJU 85 / 125

86 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Experiment Figure: (e) speedup of SCOPE vs. the ideal linear speedup as the number of cores grows; (f) synchronization cost: per-iteration CPU time split into computation time, waiting time, and communication time for different numbers of cores. Wu-Jun Li ( Big Learning CS, NJU 86 / 125

87 Parallel and Distributed Stochastic Learning SCOPE: Scalable Composite Optimization for Learning Conclusion Stochastic learning is becoming popular for big data machine learning. Lock-free strategy is key to get a good speedup in parallel stochastic learning. With properly designed techniques, BSP models are also efficient for distributed stochastic learning. Wu-Jun Li ( Big Learning CS, NJU 87 / 125

88 Other Application-Driven Distributed Learning Outline 1 Introduction 2 Learning to Hash Isotropic Hashing Scalable Graph Hashing with Feature Transformation Supervised Hashing with Latent Factor Models Column Sampling based Discrete Supervised Hashing Deep Supervised Hashing with Pairwise Labels Supervised Multimodal Hashing with SCM Multiple-Bit Quantization 3 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent SCOPE: Scalable Composite Optimization for Learning 4 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Power-Law Graph Computing Distributed Stochastic ADMM for Matrix Factorization 5 Conclusion Wu-Jun Li ( Big Learning CS, NJU 88 / 125

89 Other Application-Driven Distributed Learning Introduction Distributed learning: Perform machine learning on clusters with several machines (nodes). Our contributions: Coupled Group Lasso for Web-Scale CTR Prediction in Display Advertising (ICML 2014) Distributed Power-law Graph Computing: Theoretical and Empirical Analysis (NIPS 2014) Distributed Stochastic ADMM for Matrix Factorization (CIKM 2014) Wu-Jun Li ( Big Learning CS, NJU 89 / 125

90 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction CTR Prediction for Online Advertising Online advertising is a multi-billion-dollar business on the web and accounts for the majority of the income of major internet companies. Display advertising is a big part of online advertising. Click-through rate (CTR) prediction is the problem of estimating the probability that an ad is clicked when displayed to a user in a specific context. (a) Google (b) Amazon (c) Taobao Wu-Jun Li ( Big Learning CS, NJU 90 / 125

91 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Notation Impression (instance): ad + user + context. Training set $\{(x^{(i)}, y^{(i)}) \mid i = 1, \ldots, N\}$, where $x^T = (x_u^T, x_a^T, x_o^T)$ and $y \in \{0, 1\}$ with y = 1 denoting click and y = 0 denoting non-click. Learn $h(x) = h(x_u, x_a, x_o)$ to predict the CTR. Wu-Jun Li ( Big Learning CS, NJU 91 / 125

92 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Coupled Group Lasso (CGL) 1. Likelihood: $h(x) = \Pr(y = 1 \mid x, W, V, b) = g\left((x_u^T W)(x_a^T V)^T + b^T x_o\right)$, where $W \in \mathbb{R}^{d_u \times k}$, $V \in \mathbb{R}^{d_a \times k}$, and $(x_u^T W)(x_a^T V)^T = x_u^T(WV^T)x_a$. 2. Objective function: $\min_{W,V,b} \sum_{i=1}^N \xi\left(W, V, b; x^{(i)}, y^{(i)}\right) + \lambda\Omega(W, V)$, in which $\xi(W, V, b; x^{(i)}, y^{(i)}) = -\log\left((h(x^{(i)}))^{y^{(i)}}(1 - h(x^{(i)}))^{1 - y^{(i)}}\right)$ and $\Omega(W, V) = \|W\|_{2,1} + \|V\|_{2,1} = \sum_{i=1}^{d_u}\|W_{i*}\|_2 + \sum_{i=1}^{d_a}\|V_{i*}\|_2$. Wu-Jun Li ( Big Learning CS, NJU 92 / 125
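A minimal NumPy sketch of the CGL prediction, assuming g is the sigmoid function (cgl_predict is our own name; the learning of W, V, b is not shown):

```python
import numpy as np

def cgl_predict(x_u, x_a, x_o, W, V, b):
    # h(x) = sigmoid( (x_u^T W)(x_a^T V)^T + b^T x_o )
    # W: d_u x k, V: d_a x k, b: vector matching x_o.
    score = (x_u @ W) @ (x_a @ V) + b @ x_o
    return 1.0 / (1.0 + np.exp(-score))
```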

93 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Learning Framework Compute the gradient $g_p$ locally on each node p in parallel. Compute the gradient $g = \sum_{p=1}^P g_p$ with AllReduce. Add the gradient of the regularization term and take an L-BFGS step on the master node. Broadcast the updated parameters to each slave node. Wu-Jun Li ( Big Learning CS, NJU 93 / 125

94 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Experiment Experiment environment: MPI cluster with hundreds of nodes, each of which is a 24-core server with a 2.2GHz CPU and 96GB of RAM. Baseline and evaluation metric: Baseline: LR (with L2-norm) [Chapelle et al., 2013]. Evaluation metric (discrimination): $\mathrm{RelaImpr} = \left(\frac{\mathrm{AUC(model)} - 0.5}{\mathrm{AUC(baseline)} - 0.5} - 1\right) \times 100\%$. Wu-Jun Li ( Big Learning CS, NJU 94 / 125
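For reference, a one-line Python helper matching the RelaImpr definition as written above (rela_impr is our own name):

```python
def rela_impr(auc_model, auc_baseline):
    # Relative improvement over a random predictor (AUC = 0.5), in percent.
    return ((auc_model - 0.5) / (auc_baseline - 0.5) - 1.0) * 100.0
```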

95 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Data Sets Table: Three real-world train/test data sets collected from Taobao of Alibaba Group, described by number of instances (billions), CTR (%), number of ads, number of users (millions), and storage (TB). Wu-Jun Li ( Big Learning CS, NJU 95 / 125

96 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Performance Figure: Accuracy and scalability: (c) RelaImpr w.r.t. the baseline; (d) speedup. Wu-Jun Li ( Big Learning CS, NJU 96 / 125

97 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Feature Selection Feature selection results: Ad part: important features: Women's clothes, Skirt, Dress, Children's wear, Shoes, Cellphone; useless features: Movie, Act, Takeout, Food booking service. User part: important features: Watch, Underwear, Fur clothing, Furniture; useless features: Stage Costume, Flooring, Pencil, Outdoor sock. Wu-Jun Li ( Big Learning CS, NJU 97 / 125

98 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Graph-based Machine Learning Big graphs emerge in many real applications Graph-based machine learning is a hot research topic with wide applications: relational learning, manifold learning, PageRank, community detection, etc Distributed graph computing frameworks for big graphs (a) Social Network (b) Biological Network Wu-Jun Li ( Big Learning CS, NJU 98 / 125

99 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Graph Partitioning Graph partitioning (GP) plays a key role to affect the performance of distributed graph computing: Workload balance Communication cost Two strategies for graph partitioning. Shaded vertices are ghosts and mirrors, respectively. (a) Edge-Cut (b) Vertex-Cut Theoretical and empirical results show vertex-cut is better than edge-cut. Wu-Jun Li ( Big Learning CS, NJU 99 / 125

100 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Power-Law Graph Partitioning Natural graphs from the real world typically follow skewed power-law degree distributions: $\Pr(d) \propto d^{-\alpha}$. Different vertex-cut methods can result in different performance. Intuition: cut the high-degree vertices. Figure: a sample graph with a bad partitioning and a good partitioning. Wu-Jun Li ( Big Learning CS, NJU 100 / 125

101 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Degree-based Hashing (DBH) Existing GP methods, such as Random in PowerGraph (Joseph E. Gonzalez et al., 2012) and Grid in GraphBuilder (Nilesh Jain et al., 2013), do not make effective use of the power-law degree distribution. We propose a novel GP method called degree-based hashing (DBH):
Algorithm 3 GP with DBH
Input: the set of edges E; the set of vertices V; number of machines p.
Output: the assignment $M(e) \in \{1, \ldots, p\}$ for each edge e.
Initialization: count the degree $d_i$ for each vertex $i \in \{1, \ldots, n\}$ in parallel;
for all $e = (v_i, v_j) \in E$ do
  Hash each edge in parallel:
  if $d_i < d_j$ then $M(e) \leftarrow \mathrm{vertex\_hash}(v_i)$ else $M(e) \leftarrow \mathrm{vertex\_hash}(v_j)$ end if
end for
Wu-Jun Li ( Big Learning CS, NJU 101 / 125
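A small Python sketch of the edge-assignment rule in Algorithm 3 (dbh_assign is our own name, and the modular hash is just one possible choice of vertex hash function):

```python
def dbh_assign(edges, degrees, p):
    # edges: iterable of (i, j) vertex pairs; degrees: mapping vertex -> degree;
    # p: number of machines. Returns a dict edge -> machine id in {0, ..., p-1}.
    def vertex_hash(v):
        return hash(v) % p          # any deterministic hash onto {0, ..., p-1} works

    assignment = {}
    for (i, j) in edges:
        # hash the lower-degree endpoint, so high-degree vertices are the ones cut
        anchor = i if degrees[i] < degrees[j] else j
        assignment[(i, j)] = vertex_hash(anchor)
    return assignment
```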

102 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Theoretical Analysis We theoretically prove that our DBH method can outperform Random and Grid in terms of: reducing replication factor (communication and storage cost) keeping good edge-balance (workload balance) Nice property: Our DBH reduces more replication factor when the power-law graph is more skewed. Wu-Jun Li ( Big Learning CS, NJU 102 / 125

103 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Data Set Table: Datasets
Alias Graph V E
Tw Twitter 42M 1.47B
Arab Arabic M 0.6B
Wiki Wiki 5.7M 130M
LJ LiveJournal 5.4M 79M
WG WebGoogle 0.9M 5.1M
Wu-Jun Li ( Big Learning CS, NJU 103 / 125

104 Other Application-Driven Distributed Learning Distributed Power-Law Graph Computing Empirical Results (PageRank) Figure: Experiments on real-world graphs (WG, LJ, Wiki, Arab, Tw): (d) replication factor of Random, Grid, and DBH; (e) speedup of DBH relative to the Random and Grid baselines. Figure: Replication factor and execution time (sec) on the Twitter graph as the number of machines ranges from 8 to 64, comparing Random, Grid, and DBH. Wu-Jun Li ( Big Learning CS, NJU 104 / 125

105 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Recommender System Recommend products (items) to customers (users) by utilizing the customers' historical preferences. (a) Amazon (b) Sina Weibo Wu-Jun Li ( Big Learning CS, NJU 105 / 125

106 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Matrix Factorization Popular in recommender systems for its promising performance. Figure: a rating matrix R (m users, e.g., Ann, James, Kate, Helen, by n items) is decomposed as $R \approx U^T V$ into user factors U and item factors V with k latent dimensions. $\min_{U,V} \frac{1}{2}\sum_{(i,j) \in \Omega} \left[(R_{i,j} - U_i^T V_j)^2 + \lambda_1 U_i^T U_i + \lambda_2 V_j^T V_j\right]$. Wu-Jun Li ( Big Learning CS, NJU 106 / 125
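A minimal NumPy sketch of this MF objective over the observed ratings (mf_objective and R_entries are our own names; U and V store one latent column per user/item):

```python
import numpy as np

def mf_objective(R_entries, U, V, lam1, lam2):
    # R_entries: list of observed ratings (i, j, r_ij);
    # U: k x m user factors; V: k x n item factors.
    total = 0.0
    for i, j, r in R_entries:
        err = r - U[:, i] @ V[:, j]
        total += err ** 2 + lam1 * (U[:, i] @ U[:, i]) + lam2 * (V[:, j] @ V[:, j])
    return 0.5 * total
```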

107 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Data Split Strategy (P machines) Decouple U and V as much as possible: Figure: the rating matrix is split by rows into $R^1, R^2, \ldots, R^P$ so that node p holds $R^p \approx [U^p]^T V^p$, where each local item-factor copy $V^p$ is tied to a global V. Wu-Jun Li ( Big Learning CS, NJU 107 / 125

108 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Distributed ADMM Reformulated MF problem: $\min_{U,\mathcal{V},V} \frac{1}{2}\sum_{p=1}^P \sum_{(i,j) \in \Omega^p} \left[(R_{i,j} - U_i^T V_j^p)^2 + \lambda_1 U_i^T U_i + \lambda_2 [V_j^p]^T V_j^p\right]$ s.t. $V^p - V = 0$, $\forall p \in \{1, 2, \ldots, P\}$, where $\mathcal{V} = \{V^p\}_{p=1}^P$ and $\Omega^p$ denotes the (i, j) indices of the ratings located in node p. Wu-Jun Li ( Big Learning CS, NJU 108 / 125

109 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Distributed ADMM Define: $L^p(U, V^p, \Theta^p, V) = f^p(U, V^p) + l^p(V^p, V, \Theta^p) = \sum_{(i,j) \in \Omega^p} \hat{f}_{i,j}(U_i, V_j^p) + \left[\frac{\rho}{2}\|V^p - V\|_F^2 + \mathrm{tr}\left([\Theta^p]^T(V^p - V)\right)\right]$, and $L(U, \mathcal{V}, O, V) = \sum_{p=1}^P L^p(U, V^p, \Theta^p, V)$. Wu-Jun Li ( Big Learning CS, NJU 109 / 125

110 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Distributed ADMM Get the solutions by repeating the following three steps: $U_{t+1}, V^p_{t+1} \leftarrow \arg\min_{U, V^p} L^p(U, V^p, \Theta^p_t, V_t)$, $\forall p \in \{1, 2, \ldots, P\}$; $V_{t+1} \leftarrow \arg\min_{V} L(U_{t+1}, \mathcal{V}_{t+1}, O_t, V)$; $\Theta^p_{t+1} \leftarrow \Theta^p_t + \rho(V^p_{t+1} - V_{t+1})$, $\forall p \in \{1, 2, \ldots, P\}$. The solution for $V_{t+1}$ is $V_{t+1} = \frac{1}{P}\sum_{p=1}^P V^p_{t+1}$, which can be calculated efficiently. The problem lies in getting $U_{t+1}$ and $\mathcal{V}_{t+1}$ efficiently. Wu-Jun Li ( Big Learning CS, NJU 110 / 125

111 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Stochastic Learning Batch learning is still not very efficient: $U_{t+1} = U_t - \tau_t \nabla_U f^p(U_t, V^p_t)$, $V^p_{t+1} = \frac{\tau_t}{1 + \rho\tau_t}\left[\frac{V^p_t}{\tau_t} + \rho V_t - \Theta^p_t - \nabla_{V^p} f^p(U_t, V^p_t)\right]$. Stochastic learning: $(U_i)_{t+1} = (U_i)_t + \tau_t\left(\epsilon_{ij}(V^p_j)_t - \lambda_1(U_i)_t\right)$, $(V^p_j)_{t+1} = \frac{\tau_t}{1 + \rho\tau_t}\left[\frac{1 - \lambda_2\tau_t}{\tau_t}(V^p_j)_t + \epsilon_{ij}(U_i)_t + \rho(V_j)_t - (\Theta^p_j)_t\right]$, where $\epsilon_{ij} = R_{ij} - [(U_i)_t]^T(V^p_j)_t$. Wu-Jun Li ( Big Learning CS, NJU 111 / 125

112 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Scheduler Comparison Baselines: CCD++ (H.-F. Yu et al., ICDM 12); DSGD (R. Gemulla et al., KDD 11). Number of synchronization operations for one iteration to fully scan all the ratings: Figure: (a) CCD++ synchronizes k times; (b) DSGD generates a stratum sequence for each node and synchronizes P times; (c) DS-ADMM synchronizes 1 time. Wu-Jun Li ( Big Learning CS, NJU 112 / 125

113 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Experiment Experiment environment: MPI cluster with 20 nodes, each of which is a 24-core server with a 2.2GHz CPU and 96GB of RAM; one core and 10GB of memory are actually used on each node. Baseline and evaluation metric: Baselines: CCD++, DSGD, DSGD-Bias. Evaluation metric: test RMSE $= \sqrt{\frac{1}{Q}\sum_{(i,j)}(R_{i,j} - U_i^T V_j)^2}$, where Q is the number of test ratings. Wu-Jun Li ( Big Learning CS, NJU 113 / 125

114 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Data Sets Table: Data sets and parameter settings, listing m, n, #Train, #Test, and the hyper-parameters k, η_0/τ, λ_1/λ_2, ρ, α, β, and P for Netflix (m = 480,190, n = 17,770, #Train = 99,072,112, #Test = 1,408,395), Yahoo! Music R1 (m = 1,938,361, #Test = 7,534,592), and Yahoo! Music R2 (m = 1,823,179, #Test = 18,231,790). Wu-Jun Li ( Big Learning CS, NJU 114 / 125

115 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Accuracy and Efficiency Figure: Test RMSE vs. time (s) on (a) Netflix, (b) Yahoo-Music-R1, and (c) Yahoo-Music-R2, comparing CCD++, DSGD, DSGD-Bias, and DS-ADMM; (d) running time to reach a fixed RMSE (0.922) for different numbers of nodes. Wu-Jun Li ( Big Learning CS, NJU 115 / 125

116 Other Application-Driven Distributed Learning Distributed Stochastic ADMM for Matrix Factorization Speedup Figure: Speedup vs. number of nodes, comparing CCD++, DSGD, DS-ADMM, and the ideal linear speedup. Wu-Jun Li ( Big Learning CS, NJU 116 / 125

117 Conclusion Outline 1 Introduction 2 Learning to Hash Isotropic Hashing Scalable Graph Hashing with Feature Transformation Supervised Hashing with Latent Factor Models Column Sampling based Discrete Supervised Hashing Deep Supervised Hashing with Pairwise Labels Supervised Multimodal Hashing with SCM Multiple-Bit Quantization 3 Parallel and Distributed Stochastic Learning Fast Asynchronous Parallel Stochastic Gradient Descent SCOPE: Scalable Composite Optimization for Learning 4 Other Application-Driven Distributed Learning Coupled Group Lasso for Web-Scale CTR Prediction Distributed Power-Law Graph Computing Distributed Stochastic ADMM for Matrix Factorization 5 Conclusion Wu-Jun Li ( Big Learning CS, NJU 117 / 125

118 Conclusion Our Contribution Learning to hash: memory/disk/CPU/communication Parallel and distributed stochastic learning: memory/disk/CPU; but increased communication cost Other application-driven distributed learning Wu-Jun Li ( Big Learning CS, NJU 118 / 125

119 Conclusion Future Work Learning to hash for decreasing communication cost Distributed programming models and platforms for machine learning: MPI (fault tolerance, asynchronous), GraphLab, Spark, Parameter Server, Petuum, MapReduce, Storm, GPU, etc Wu-Jun Li ( Big Learning CS, NJU 119 / 125

120 Conclusion Future Work Open source project: LIBBLE: A library for big learning LIBBLE-Spark: Classification: LR, SVM Regression: Linear Regression Dimensionality Reduction: PCA Matrix Decomposition/Factorization: SVD LIBBLE-MPI LIBBLE-GPU LIBBLE-MultiCore Wu-Jun Li ( Big Learning CS, NJU 120 / 125

121 Conclusion Future Work Big data machine learning (big learning) framework Figure: a big data machine learning core engine (AIOS kernel) serving applications such as search, classification, clustering, recommendation, and ranking through prior-knowledge interfaces; built on novel machine learning and data mining algorithms, parallel numerical-computation libraries, and distributed environments such as cloud computing; spanning matrix computation, optimization methods, programming models, and data partitioning/layout for specific data forms such as relational data, high-dimensional data, and dynamic data. Wu-Jun Li ( Big Learning CS, NJU 121 / 125

122 Conclusion Related Publication (1/3)[*indicate my students] Shen-Yi Zhao*, Ru Xiang*, Ying-Hao Shi*, Peng Gao*, Wu-Jun Li. SCOPE: Scalable Composite Optimization for Learning on Spark. arxiv: v4, Shen-Yi Zhao*, Wu-Jun Li. Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee. Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), Wu-Jun Li, Sheng Wang*, Wang-Cheng Kang*. Feature Learning based Deep Supervised Hashing with Pairwise Labels. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), Wang-Cheng Kang*, Wu-Jun Li, Zhi-Hua Zhou. Column Sampling based Discrete Supervised Hashing. Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), Qing-Yuan Jiang*, Wu-Jun Li. Scalable Graph Hashing with Feature Transformation. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Wu-Jun Li ( Big Learning CS, NJU 122 / 125

123 Conclusion Related Publication (2/3)[*indicate my students] Ling Yan*, Wu-Jun Li, Gui-Rong Xue, Dingyi Han. Coupled Group Lasso for Web-Scale CTR Prediction in Display Advertising. Proceedings of the 31st International Conference on Machine Learning (ICML), Cong Xie*, Ling Yan*, Wu-Jun Li, Zhihua Zhang. Distributed Power-law Graph Computing: Theoretical and Empirical Analysis. Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS), Peichao Zhang*, Wei Zhang*, Wu-Jun Li, Minyi Guo. Supervised Hashing with Latent Factor Models. Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Zhi-Qin Yu*, Xing-Jian Shi*, Ling Yan*, Wu-Jun Li. Distributed Stochastic ADMM for Matrix Factorization. Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM), Wu-Jun Li ( Big Learning CS, NJU 123 / 125

124 Conclusion Related Publication (3/3) [* indicates my students]
Dongqing Zhang*, Wu-Jun Li. Large-Scale Supervised Multimodal Hashing with Semantic Correlation Maximization. Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI), 2014.
Weihao Kong*, Wu-Jun Li. Isotropic Hashing. Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012.
Weihao Kong*, Wu-Jun Li, Minyi Guo. Manhattan Hashing for Large-Scale Image Retrieval. Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2012.
Weihao Kong*, Wu-Jun Li. Double-Bit Quantization for Hashing. Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2012.
Wu-Jun Li ( Big Learning CS, NJU 124 / 125

125 Conclusion Q & A Thanks! Wu-Jun Li ( Big Learning CS, NJU 125 / 125
