Small-variance Asymptotics for Dirichlet Process Mixtures of SVMs

Size: px

Start display at page:

Download "Small-variance Asymptotics for Dirichlet Process Mixtures of SVMs"

Ruth Taylor
6 years ago
Views:

1 Small-variance Asymptotics for Dirichlet Process Mixtures of SVMs Yining Wang Jun Zhu Tsinghua University July, 2014 Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

2 Outline 1 Infinite SVM (Zhu et al., 2011) The Infinite SVM (isvm) Model Gibbs-iSVM 2 The Max-Margin DP-means (M 2 DPM) Algorithm Small-variance Asymptotic (SVA) Analysis The M 2 DPM Algorithm 3 Experiments 4 Summary Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

3 Outline 1 Infinite SVM (Zhu et al., 2011) The Infinite SVM (isvm) Model Gibbs-iSVM 2 The Max-Margin DP-means (M 2 DPM) Algorithm Small-variance Asymptotic (SVA) Analysis The M 2 DPM Algorithm 3 Experiments 4 Summary Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

4 Infinite SVM (Zhu et al., ICML 2011) Solving clustering and classification simultaneously Motivation: Improve classification by finding underlying clusters. Improve clustering by incorporating supervised information Unknown # of clusters a nonparametric treatment. Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

5 Infinite SVM CRP(α) z x µ ρ 2 K y N η K ν 2 Nonparametric prior: p 0 (z i = k α, z i ) Max-margin classification model: { n i,k, if n i,k > 0 α, otherwise φ(y i x i, η, z i = k) = exp( 2c max(0, 1 y i η k x i)) Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

6 Outline 1 Infinite SVM (Zhu et al., 2011) The Infinite SVM (isvm) Model Gibbs-iSVM 2 The Max-Margin DP-means (M 2 DPM) Algorithm Small-variance Asymptotic (SVA) Analysis The M 2 DPM Algorithm 3 Experiments 4 Summary Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

7 Gibbs sampler for isvm Iteratively sample model parameters: 1 Mean parameter µ k : Gaussian distribution. 2 Linear classifier η k : data augmentation. 3 Cluster assignments z i : categorical distribution. Problems: slow! q(z i = z new ) α N (x i ; µ, σ 2 I)dp 0 (µ) φ(y i x i, η)dp 0 (η) } {{ } difficult to evaluate Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

8 Outline 1 Infinite SVM (Zhu et al., 2011) The Infinite SVM (isvm) Model Gibbs-iSVM 2 The Max-Margin DP-means (M 2 DPM) Algorithm Small-variance Asymptotic (SVA) Analysis The M 2 DPM Algorithm 3 Experiments 4 Summary Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

9 Small-variance Asymptotic (SVA) Analysis 1 Assumptions: model variance 0. 2 Consequences: Connections between Bayesian posterior and deterministic loss. Novel inference algorithms. 3 Examples: Probabilistic PCA vs. PCA. (Tipping et al., 1999) DP-means. (Kulis et al., 2012) Efficient feature learning. (Broderick et al., 2013) Asymp-iHMM. (Roychowdhury et al., 2013) Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

10 Small-variance Asymptotic (SVA) Analysis 1 Assumptions: model variance 0. 2 Consequences: Connections between Bayesian posterior and deterministic loss. Novel inference algorithms. 3 Examples: Probabilistic PCA vs. PCA. (Tipping et al., 1999) DP-means. (Kulis et al., 2012) Efficient feature learning. (Broderick et al., 2013) Asymp-iHMM. (Roychowdhury et al., 2013) Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

11 Small-variance Asymptotic (SVA) Analysis 1 Assumptions: model variance 0. 2 Consequences: Connections between Bayesian posterior and deterministic loss. Novel inference algorithms. 3 Examples: Probabilistic PCA vs. PCA. (Tipping et al., 1999) DP-means. (Kulis et al., 2012) Efficient feature learning. (Broderick et al., 2013) Asymp-iHMM. (Roychowdhury et al., 2013) Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

12 SVA on Gibbs-iSVM SVA Assumptions: σ 2, ν 2 0, c, α. Existing clusters: q(z i = k) n i,k exp ( x i µ k 2 ) σ 2 2c max(0, 1 y i ηk x i) Q i (k) = s x i µ k 2 + 2c(1 y }{{} i ηk x i) + ; }{{} Clustering Classification Creating new clusters: q(z i = z new ) α N (x i ; µ, σ 2 I)dp 0 (µ) φ(y i x i, η)dp 0 (η) Q i (new) = }{{} λ + 2c(1 y i η x i ) + + η 2 /ν 2. }{{}}{{} AIC Classification Regularization Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

13 SVA on Gibbs-iSVM SVA Assumptions: σ 2, ν 2 0, c, α. Existing clusters: q(z i = k) n i,k exp ( x i µ k 2 ) σ 2 2c max(0, 1 y i ηk x i) Q i (k) = s x i µ k 2 + 2c(1 y }{{} i ηk x i) + ; }{{} Clustering Classification Creating new clusters: q(z i = z new ) α N (x i ; µ, σ 2 I)dp 0 (µ) φ(y i x i, η)dp 0 (η) Q i (new) = }{{} λ + 2c(1 y i η x i ) + + η 2 /ν 2. }{{}}{{} AIC Classification Regularization Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

14 Outline 1 Infinite SVM (Zhu et al., 2011) The Infinite SVM (isvm) Model Gibbs-iSVM 2 The Max-Margin DP-means (M 2 DPM) Algorithm Small-variance Asymptotic (SVA) Analysis The M 2 DPM Algorithm 3 Experiments 4 Summary Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

15 The M 2 DPM Algorithm 1 Repeat until convergence: For each instance i: z i argmin k Q i (k). For each cluster k: µ k x k. Update {η k } K k=1 using data augmentation. 2 Deterministic loss function (via SVA on posterior): L(z, µ, η) K η k 2 = 2ν 2 k=1 }{{} Regularization + 2c n (ζ z i i=1 i ) + } {{ } Classification + s n x i µ zi 2 + λ }{{ K}. i=1 }{{} AIC Clustering 3 Extension to exponential family distributions and multi-class classification. Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

16 The M 2 DPM Algorithm 1 Repeat until convergence: For each instance i: z i argmin k Q i (k). For each cluster k: µ k x k. Update {η k } K k=1 using data augmentation. 2 Deterministic loss function (via SVA on posterior): L(z, µ, η) K η k 2 = 2ν 2 k=1 }{{} Regularization + 2c n (ζ z i i=1 i ) + } {{ } Classification + s n x i µ zi 2 + λ }{{ K}. i=1 }{{} AIC Clustering 3 Extension to exponential family distributions and multi-class classification. Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

17 The M 2 DPM Algorithm 1 Repeat until convergence: For each instance i: z i argmin k Q i (k). For each cluster k: µ k x k. Update {η k } K k=1 using data augmentation. 2 Deterministic loss function (via SVA on posterior): L(z, µ, η) K η k 2 = 2ν 2 k=1 }{{} Regularization + 2c n (ζ z i i=1 i ) + } {{ } Classification + s n x i µ zi 2 + λ }{{ K}. i=1 }{{} AIC Clustering 3 Extension to exponential family distributions and multi-class classification. Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

18 Outline 1 Infinite SVM (Zhu et al., 2011) The Infinite SVM (isvm) Model Gibbs-iSVM 2 The Max-Margin DP-means (M 2 DPM) Algorithm Small-variance Asymptotic (SVA) Analysis The M 2 DPM Algorithm 3 Experiments 4 Summary Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

19 Datasets and Algorithms 1 Protein fold classification 27 classes, 696 instances, 21 features 2 Pakinson s disease detection 2 classes, 195 instances, 22 features 3 Algorithms Classification models: MNL, Linear-SVM, RBF-SVM Hybrid models: dpmnl (Shahbaba et al., 2009), DP+SVM, Gibbs-iSVM, M 2 DPM. Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

20 Classification performance Protein Parkinson Algo F1 (%) Times (s) Acc. (%) Times (s) MNL L-SVM RBF-SVM dpmnl DP+SVM Gibbs-iSVM M 2 DPM Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

21 Clustering performance 1 Nonparametric clustering on synthetic datasets: K 0 vs. K. n K K Clustering on potential Parkinson s disease patients. Group I II III IV V Avg. age Avg. stage (0-4) Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

22 Clustering performance 1 Nonparametric clustering on synthetic datasets: K 0 vs. K. n K K Clustering on potential Parkinson s disease patients. Group I II III IV V Avg. age Avg. stage (0-4) Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

23 Summary Infinite SVM: clustering + classification Small variance analysis M 2 DPM: fast and accurate. Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

24 Summary Infinite SVM: clustering + classification Small variance analysis M 2 DPM: fast and accurate. Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

25 Summary Infinite SVM: clustering + classification Small variance analysis M 2 DPM: fast and accurate. Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

26 Questions Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

27 The data augmentation trick Data augmentation for SVM classifiers (Polson and Scott, 2011) φ(y i z i = k, η) = exp( 2c max(0, 1 y i ηk x i)) ( 1 = exp (ω i + cζi k ) 2 2πωi 2ω i = 0 0 φ(y i, ω i z i = k, η)dω i ) dω i Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

28 Extension to multi-class classification Multi-class hinge loss: (Crammer and Singer, 2001) φ m (y i z i = k, η) = exp( 2c max(δ y,yi + η y k,y x i ηk,y i x i )) Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

29 The M 2 DPM Algorithm 1 Repeat until convergence: For each instance i: z i argmin k Q i (k). For each cluster k: µ k x k. Update {η k } K k=1 using data augmentation Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

30 The M 2 DPM Algorithm 1 Repeat until convergence: For each instance i: z i argmin k Q i (k). For each cluster k: µ k x k. Update {η k } K k=1 using data augmentation Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

31 The M 2 DPM Algorithm 1 Repeat until convergence: For each instance i: z i argmin k Q i (k). For each cluster k: µ k x k. Update {η k } K k=1 using data augmentation Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

32 The M 2 DPM Algorithm 1 Repeat until convergence: For each instance i: z i argmin k Q i (k). For each cluster k: µ k x k. Update {η k } K k=1 using data augmentation Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

33 The M 2 DPM Algorithm 1 Repeat until convergence: For each instance i: z i argmin k Q i (k). For each cluster k: µ k x k. Update {η k } K k=1 using data augmentation Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, / 25

Fast Approximate MAP Inference for Bayesian Nonparametrics

Fast Approximate MAP Inference for Bayesian Nonparametrics Y. Raykov A. Boukouvalas M.A. Little Department of Mathematics Aston University 10th Conference on Bayesian Nonparametrics, 2015 1 Iterated Conditional