Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College

Size: px

Start display at page:

Download "Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College"

Annabel Melton
5 years ago
Views:

1 Distributed Estimation, Information Loss and Exponential Families Qiang Liu Department of Computer Science Dartmouth College

2 Statistical Learning / Estimation Learning generative models from data Topic models (text), computer vision, bioinformatics 2

3 Statistical Learning / Estimation Find a model (indexed by parameter) from a distribution family (or set) that fits the data best Data (X), iid Best model p(x ) from {x 1,...,x n } P = {p(x ), 2 } Maximum likelihood: ˆ mle = arg max 2 nx log p(x i ) i Many other traditional methods 3

4 Challenge: Big Data Big data (very large n) Private data Can t be stored in a single machine! Learning on distributed data! 4

5 Challenge: Big Data Big data (very large n) Private data Can t be stored in a single machine! Learning on distributed data! Challenge: Communication is the bottleneck 5

6 Big Data = Distributed Data Assumption: randomly partition the instances into d subsets evenly (n/d instances in each subset) X = U U X 1 U X 2 X d Traditional (centralized) MLE doesn t work in general ˆ mle = arg max 2 nx log p(x i ) i Existing methods: distributed implementation of MLE Stochastic gradient descent (SGD), ADMM, etc This talk: One- shot combine local MLEs Good approximation to global MLE Much less communication 6

7 Combining local MLEs Global MLE MLE X Combine the local MLEs (this talk) " MLE MLE MLE X 1 X 2 X d ˆ 1 ˆ 2 ˆ d ˆ mle ˆ f (= f(ˆ 1,...,ˆ d )) Questions: The best combination function f( )? How well the local MLEs approximates the global MLE? Statistical loss by using local MLEs? 7

8 Linear Averaging (Naive) Take the linear averaging (Zhang et al. 13; Scott et al.13) ˆ linear = 1 X ˆ k d k 8

9 Linear Averaging (Naive) Take the linear averaging (Zhang et al. 13; Scott et al.13) ˆ linear = 1 X ˆ k d k Many disadvantages: Doesn t work for discrete, non- additive parameters Unidentifiable Parameters Latent variables models: Gaussian mixture, topic models 9

10 Linear Averaging (Naive) Take the linear averaging (Zhang et al. 13; Scott et al.13) ˆ linear = 1 X ˆ k d k Many disadvantages: Doesn t work for discrete, non- additive parameters Unidentifiable Parameters Latent variables models: Gaussian mixture, topic models µ 1 µ 2 + µ 2 µ 1 + µ 2 µ 2 1 Labels mismatched

11 Linear Averaging (Naive) Take the linear averaging (Zhang et al. 13; Scott et al.13) ˆ linear = 1 X ˆ k d k Many disadvantages: Doesn t work for discrete, non- additive parameters Parameter non- identifiability Latent variables models: Gaussian mixture, topic models µ 1 µ 2 + Different numbers of components

12 Linear Averaging (Naive) Take the linear averaging (Zhang et al. 13; Scott et al.13) ˆ linear = 1 X ˆ k d Many disadvantages: Doesn t work for discrete, non- additive parameters Parameter non- identifiability k Latent variables models: Gaussian mixture, LDA, deep nets Result changes with different parameterization For example: = 2 = =1/ (variance) (std) (precision) ( )/2 ( )/2 (1/ 1 +1/ 2 )/2

13 Linear Averaging (Naive) Take the linear averaging (Zhang et al. 13; Scott et al.13) ˆ linear = 1 X ˆ k d k Many disadvantages: Doesn t work for discrete, non- additive parameters Parameter non- identifiability Latent variables models: Gaussian mixture, LDA, deep nets Result changes with different parameterization Asymptotically: suboptimal error rate (explain later)

14 Our Algorithm: KL- Averaging We propose KL- averaging: ˆ KL = arg min 2 Where KL divergence: X KL(Pˆ k P ) k KL(P 0 P )= Z x p(x 0 ) log p(x 0 ) p(x ) dx Idea: average the models, not the parameters!! Pˆ 1 Pˆ 2 Pˆ KL Geometric mean in the model space Pˆ 3

15 Our Algorithm: KL- Averaging We propose KL- averaging: ˆ KL = arg min 2 Where KL divergence: X KL(Pˆ k P ) k KL(P 0 P )= Z x p(x 0 ) log p(x 0 ) p(x ) dx Addresses all the problems (non- additive, non- identifiabilities, reparameterization invariance ) µ 1 µ 2 + µ 2 µ 1

16 Our Algorithm: KL- Averaging We propose KL- averaging: ˆ KL = arg min 2 Where KL divergence: X KL(Pˆ k P ) k KL(P 0 P )= Z x p(x 0 ) log p(x 0 ) p(x ) dx Addresses all the problems (non- additive, non- identifiabilities, reparameterization invariance ) +

17 Parametric Bootstrap Interpretation A parametric bootstrap procedure: Resample from the local model p(x ˆ k ) X k p(x ˆ k ) Maximum likelihood based on X ˆ = arg max log p( X k ) 2 k X =[ X 1,..., X d ] Reduces to infinite ˆ KL when the resample size goes to 17

18 KL- averaging on Exponential Families (Full) exponential families p(x ) = 1 Z( ) exp[ T (x)] = { : Z( ) < 1} : natural (x): su cient Z( ): normalization parameter statistics n constant Include: Gaussian, exponential, Gamma, Dirichlet, Bernoulli Most undirected graphical models (e.g., Ising model) Not include: Latent variables models (Gaussian mixture, topic models) Hierarchical models, Gaussian process, (Parametric) Bayes nets 18

19 KL- averaging on Exponential Families KL- averaging exactly recovers the global MLE under exponential families ( ˆ KL = ˆ mle )! Linear- averaging is not exact in general " MLE MLE MLE X 1 X 2 X d ˆ 1 ˆ 2 ˆ d = MLE X ˆ ˆ ff ˆ KL ˆ mle 19

20 KL- averaging on Exponential Families KL- averaging exactly recovers the global MLE under exponential families ( ˆ KL = ˆ mle )! ˆ k Because the local MLE is a sufficient statistic of X k Can be viewed as a nonlinear averaging: ˆ KL = g 1 g(ˆ 1 )+ + g(ˆ d ) d where g : 7! µ E [ (x)] (one- to- one map) µ : The moment parameter 20

21 Non- Exponential Families Non- exponential families: ˆ KL doesn t equal in general ˆ mle Error rate = how nearly exponential family they are Statistical Curvature (Efron 75): the mathematical measure of exponential- family- ness. Geometric intuition: Exponential families = lines / linear subspaces r Non- exponential families = curves / curved surfaces Curvature: =1/r 21

22 Curved Exponential Families Curved exponential families: p(x ) = 1 Z( ( )) exp ( )T (x) Assume: dimension of is smaller than (and (x)) ( ) : lower dimensional curved surface in - space 1/ -space 22 ( )

23 Curved Exponential Families Statistical curvature of p(x ) : Geometric curvature of ( ) (Efron 75) with inner product via Fisher information matrix: h, 0 i def = T 0 where =cov[ (x)] Curved exponential families: 1 p(x ) = exp ( )T (x) Z( ( )) 1/ -space 23 ( )

24 Statistical Curvature &Information Loss Information loss of MLE (Fisher, Rao, Efron): Statistical curvature = Loss of information when summarizing the data using MLE 2 = lim n!1 I 1 1 (I n Iˆ mle) Fisher Statistical curvature Fisher information in a simple of size n: I n = ni 1 Fisher information in ˆ mle based on a sample of size n Rao Full exponential family: =0 => Zero curvature ( ) => No information loss => MLE is sufficient statistics Efron 24

simple of size n: I n = ni 1 Fisher information in ˆ mle based on a sample of

25 Statistical Curvature &Information Loss Information loss of MLE (Fisher, Rao, Efron): Statistical curvature = Loss of information when summarizing the data using MLE 2 = lim n!1 I 1 1 (I n Iˆ mle) Fisher Statistical curvature Fisher information in a simple of size n: I n = ni 1 Fisher information in ˆ mle based on a sample of size n Rao Doesn t violate the first order efficiency of MLE: lim n!1 Iˆ mle / I n =1 Second order efficiency of MLE: is the lower bound 2 Efron 25

26 Information loss in Distributed Estimation MLE X " MLE MLE MLE X 1 X 2 X d ˆ 1 ˆ 2 ˆ d ˆ mle 2 ˆ f (= f(ˆ 1,...,ˆ d )) Information loss = Information loss >= d 2 Relative Information loss >= (d 1) 2 26

27 Information Loss & Distributed Estimation Lower bound: For any ˆ mle Intuition: project into the space of spanned by (ˆ 1,...,ˆ d, we have ˆ f = f(ˆ 1,...,ˆ d ) E p I(ˆ f ˆ mle ) 2 1 ] n 2 (d 1) 2 + o(n 2 ) ˆ 1,..., ˆ d ˆ mle ( = 1 n 2 (d 1) 2

28 Information Loss & Distributed Estimation Lower bound: For any, we have ˆ f = f(ˆ 1,...,ˆ d ) E p I(ˆ f ˆ mle ) 2 1 ] n 2 (d 1) 2 + o(n 2 ) KL- averaging: achieves the lower bound E p I(ˆ KL ˆ mle ) 2 ] = 1 n 2 (d 1) 2 + o(n 2 ) Linear averaging: doesn t achieve the lower bound (even on exponential families) E p I(ˆ linear ˆ mle ) 2 ] = 1 n 2 (d 1) ( 2 + (d + 1) 2 )+o(n 2 ) ( 6= 0 in general)

29 Error to the true parameters The error w.r.t. the *true parameters* is related to the error w.r.t. the global MLE E p I(ˆ KL ) 2 ] MSE mle + 1 n 2 (d 1) 2 E p I(ˆ linear ) 2 ] MSE mle + 1 n 2 (d 1) ( ) But note that MSE mle 1 n I n 2 The different is small, theoretically and asymptotically But practically, KL- averaging is more robust in non- asymptotic or non- regular settings 29

30 Experiments 2D Gaussian distribution on an ellipse apple apple x1 a cos( ) N ( x 2 b sin( ), 2 I) a Statistical curvature = geometric curvature b [ ] MSE w.r.t. global MLE Linear averaging KL averaging Theoretical Limits Linear averaging KL averaging Theoretical Limits [ ] Total sample size (n) 30

31 Experiments 2D Gaussian distribution on an ellipse apple x1 apple a cos( ) N ( x 2 b sin( ), 2 I) a Statistical curvature = geometric curvature MSE w.r.t. true parameter (a). MSE w.r.t. global MLE b [ ] Linear averaging KL averaging Global MLE [ ] Total sample size (n) 31

32 Gaussian mixture on MNIST data: Training data partitioned into 10 groups Calculate KL divergence by Monte Carlo (i.e., the parametric bootstrap procedure) Show testing likelihood for different methods Testing log-likelihood Random partition Training Total Sample Size (n) Local MLEs Global MLE Linear Avg Matched Linear Avg KL Avg

33 Gaussian mixture on MNIST data: Training data partitioned into 10 groups Calculate KL divergence by Monte Carlo (i.e., the parametric bootstrap procedure) Show testing likelihood for different methods Testing log-likelihood Random partition Local MLEs Global MLE Linear Avg Matched Linear Avg KL Avg Training Total Sample Size (n) Label- wise partition Training Total Sample Size (n) (n)

34 More datasets YearPredictionMSD dataset from UCI repository Similar setting as before (random partition) Testing log-likelihood Local MLEs Global MLE Linear Avg Matched Linear Avg KL Avg Training Sample Size (n) 34

35 Conclusion Distributed Learning: " MLE MLE MLE X 1 X 2 X d ˆ 1 ˆ 2 ˆ d vs. MLE X ˆ f (= f(ˆ 1,...,ˆ d )) ˆ mle Combination function f()? KL- averaging vs. linear- averaging Statistical loss? Exponential / non- exponential families statistical curvature = Information loss 35

36 Thanks! 36

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate