SCALING AND GENERALIZING APPROXIMATE BAYESIAN INFERENCE. David M. Blei, Columbia University


1 SCALING AND GENERALIZING APPROXIMATE BAYESIAN INFERENCE David M. Blei Columbia University

2 This talk is about how to discover hidden patterns in large high-dimensional data sets.

3 Communities discovered in a 3.7M node network of U.S. patents [Gopalan and Blei, PNAS 2013]

4 Topics found in 1.8M articles from the New York Times [Hoffman, Blei, Wang, Paisley, JMLR 2013]
[Figure: topic word lists discovered in a corpus of 1.8 million New York Times articles, e.g., sports (Game, Season, Team, Coach), politics (Bush, Campaign, Clinton), family (Children, School, Women), real estate (Building, Street, Housing), finance (Stock, Percent, Companies, Fund), film (Film, Movie, Show, Television), religion (Church, War, Catholic), books (Book, Novel, Story), baseball (Yankees, Game, Mets), art (Art, Museum, Gallery), travel (Wine, Hotel, Restaurant), war (Government, Military, Iraq), and crime (Police, Officer, Charged). Modified from Hoffman et al. (2013).]

5 Population analysis of billions of genetic measurements [Gopalan, Hao, Blei, Storey, to appear]
[Figure: admixture proportions across Human Genome Diversity Panel populations (e.g., Japanese, Kalash, Karitiana, Makrani, Mandenka, Maya, Melanesian, Mozabite, Palestinian, Papuan, Russian, San, Sardinian, Yakut, Yoruba).]

6 Neuroscience analysis of millions of fMRI measurements [Manning et al., PLOS ONE 2014]

7 Analysis of 1.7M taxi trajectories, in Stan [Kucukelbir et al., 2016]

8 The probabilistic pipeline
[Figure: the probabilistic pipeline: knowledge & question, data → make assumptions → discover patterns → predict & explore.]
- Customized data analysis is important to many fields.
- Pipeline separates assumptions, computation, application
- Eases collaborative solutions to statistics problems

9 The probabilistic pipeline
[Figure: the probabilistic pipeline, as above.]
- Inference is the key algorithmic problem.
- Answers the question: What does this model say about this data?
- Our goal: general and scalable approaches to inference

10 The probabilistic pipeline
[Figure: the probabilistic pipeline, as above.]
- Variational methods: inference as optimization [Jordan et al., 1999]
- Scale up with stochastic variational inference (SVI) [Hoffman et al., 2013]
- Generalize with black box variational inference (BBVI) [Ranganath et al., 2014]

11 The probabilistic pipeline
[Figure: the probabilistic pipeline, as above.]
- Both approaches use stochastic optimization.
- SVI subsamples from a massive data set
- BBVI uses Monte Carlo to approximate difficult-to-compute expectations

12 STOCHASTIC VARIATIONAL INFERENCE (with Matt Hoffman, Chong Wang, John Paisley)

13 Stochastic variational inference is an algorithm that scales general Bayesian computation to massive data.

14 Motivation: Topic Modeling
1. Discover the thematic structure in a large collection of documents.
2. Annotate the documents.
3. Use the annotations to visualize, organize, summarize, ...

15 Example: Latent Dirichlet allocation (LDA) Documents exhibit multiple topics.

16 Example: Latent Dirichlet allocation (LDA)
[Figure: topics as word distributions (e.g., gene/dna/genetic; life/evolve/organism; brain/neuron/nerve; data/number/computer, each word with a probability), a document, and its topic proportions and assignments.]
- Each topic is a distribution over words
- Each document is a mixture of corpus-wide topics
- Each word is drawn from one of those topics
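To make this generative story concrete, here is a minimal Python/NumPy sketch of LDA's generative process; the dimensions and hyperparameter values are illustrative assumptions, not from the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    V, K, D, N = 1000, 10, 5, 50   # vocabulary size, topics, documents, words per doc (assumed)
    alpha, eta = 0.1, 0.01         # Dirichlet hyperparameters (assumed)

    beta = rng.dirichlet(eta * np.ones(V), size=K)   # each topic: a distribution over words
    docs = []
    for d in range(D):
        theta = rng.dirichlet(alpha * np.ones(K))    # topic proportions for document d
        z = rng.choice(K, size=N, p=theta)           # per-word topic assignments
        w = np.array([rng.choice(V, p=beta[k]) for k in z])  # draw each word from its topic
        docs.append(w)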

17 Example: Latent Dirichlet allocation (LDA)
[Figure: topics, documents, and topic proportions and assignments.]
But we only observe the documents; the other structure is hidden. We compute the posterior p(topics, proportions, assignments | documents).

18 Example: Latent Dirichlet allocation (LDA)
[Figure: topics, documents, and topic proportions and assignments.]
Many data sets contain millions of documents. This requires inference about billions of variables. SVI can scale to these data.

19 LDA as a graphical model
[Graphical model: proportions parameter α → per-document topic proportions θ_d → per-word topic assignment z_{d,n} → observed word w_{d,n} ← topics β_k ← topic parameter η; plates over words N, documents D, and topics K.]
- Encodes assumptions about data with a factorization of the joint
- Connects assumptions to algorithms for computing with data
- Defines the posterior (through the joint)

20 Posterior inference
[Graphical model: θ_d, z_{d,n}, w_{d,n}, β_k; plates N, D, K.]
The posterior of the latent variables given the documents is

    p(β, θ, z | w) = p(β, θ, z, w) / ∫_β ∫_θ Σ_z p(β, θ, z, w).

We can't compute the denominator, the marginal p(w). We use approximate inference.
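Spelled out under LDA's factorization (a standard LaTeX rendering, with α and η as on the graphical-model slide), the marginal in that denominator is

    \[
    p(w) = \int p(\beta \mid \eta) \prod_{d=1}^{D} \int p(\theta_d \mid \alpha)
           \prod_{n=1}^{N} \sum_{z_{d,n}=1}^{K} p(z_{d,n} \mid \theta_d)\,
           p(w_{d,n} \mid \beta_{z_{d,n}}) \, d\theta_d \, d\beta ,
    \]

and the coupling of β and the θ_d inside the integrals is what makes it intractable.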

21 Classical variational inference
[Graphical model: θ_d, z_{d,n}, w_{d,n}, β_k; plates N, D, K.]
Originally, we used variational methods to fit this model [Blei et al., 2003]. Classical variational inference is inefficient:
1. Do some local computation for each data point.
2. Aggregate these computations to re-estimate global structure.
3. Repeat.
This cannot handle massive data.

22 Stochastic variational inference
[Figure: MASSIVE DATA ↔ GLOBAL HIDDEN STRUCTURE.]
Subsample data → Infer local structure → Update global structure

23 Stochastic variational inference
[Figure: held-out perplexity vs. documents seen (log scale) for online LDA on 98K and 3.3M documents against batch LDA on 98K documents, with the top eight words of one topic evolving as more documents are analyzed (from "systems road made service ..." toward "business industry service companies ...").]
[Hoffman et al., 2010]

24 Topics using the HDP, found in 1.8M articles from the New York Times
[Figure: topic word lists from a corpus of 1.8 million New York Times articles. Modified from Hoffman et al. (2013).]

25 Stochastic variational inference
Subsample data → Infer local structure → Update global structure
Next parts of this talk:
1. Define a generic class of models
2. Derive classical mean-field variational inference
3. Use the classical algorithm to derive stochastic variational inference

26 A generic class of models
[Graphical model: global variables β; local variables z_i, x_i; plate over i = 1, ..., n.]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

- The observations are x = x_{1:n}.
- The local variables are z = z_{1:n}.
- The global variables are β.
- The i-th data point x_i only depends on z_i and β.
Goal: Compute p(β, z | x).

27 A generic class of models
[Graphical model: global variables β; local variables z_i, x_i; plate over n.]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

A complete conditional is the conditional of a latent variable given the observations and the other latent variables. Assume each complete conditional is in the exponential family,

    p(z_i | β, x_i) = h(z_i) exp{η_ℓ(β, x_i)^⊤ z_i − a(η_ℓ(β, x_i))}
    p(β | z, x) = h(β) exp{η_g(z, x)^⊤ β − a(η_g(z, x))}.

28 A generic class of models
[Graphical model: global variables β; local variables z_i, x_i; plate over n.]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

A complete conditional is the conditional of a latent variable given the observations and the other latent variables. The global parameter comes from conjugacy [Bernardo and Smith, 1994],

    η_g(z, x) = α + Σ_{i=1}^n t(z_i, x_i),

where α is a hyperparameter and t(·) are sufficient statistics for [z_i, x_i].
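As a concrete instance of this template (standard LDA results, not spelled out on the slide), a LaTeX sketch of LDA's complete conditionals, using the notation from the earlier slides:

    \begin{align*}
    p(z_{d,n} = k \mid \theta_d, \beta, w_{d,n}) &\propto
        \exp\{\log \theta_{d,k} + \log \beta_{k, w_{d,n}}\}, \\
    p(\theta_d \mid z_d) &= \mathrm{Dirichlet}\Big(\alpha + \sum_{n} z_{d,n}\Big), \\
    p(\beta_k \mid z, w) &= \mathrm{Dirichlet}\Big(\eta + \sum_{d,n} z_{d,n}^k\, w_{d,n}\Big),
    \end{align*}

with z_{d,n} one-hot over topics and w_{d,n} an indicator vector over the vocabulary. Each conditional is in the exponential family, and the Dirichlet parameters are the hyperparameter plus summed sufficient statistics, matching η_g(z, x) = α + Σ_i t(z_i, x_i).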

29 A generic class of models
[Graphical model and joint as above.]
Dirichlet process mixture with conjugate base measure [Sethuraman, 1994]
- Local: mixture assignments for each observation (categorical).
- Global: stick lengths (beta) and mixture components (e.g., Gaussian).
Other BNP models, e.g.,
- Beta-Bernoulli process
- Hierarchical Dirichlet process

30 A generic class of models
[Graphical model and joint as above.]
- Bayesian mixture models
- Time series models (variants of HMMs, Kalman filters)
- Factorial models
- Matrix factorization (e.g., factor analysis, PCA, CCA)
- Dirichlet process mixtures, HDPs
- Multilevel regression (linear, probit, Poisson)
- Stochastic blockmodels
- Mixed-membership models (LDA and some variants)

31 [Figure: mean-field variational inference fitting a mixture of Gaussians, shown from initialization through successive iterations, alongside the evidence lower bound and average log predictive over iterations. Image by Alp Kucukelbir.]
- Variational methods turn inference into optimization
- Idea: Fit a simple distribution to be close (in KL) to the exact posterior
- Here: A simple mixture of Gaussians

32 Mean-field variational inference
[Figure: the model (β, z_i, x_i) and the variational family (λ, φ_i, z_i) linked through the ELBO.]
Goal: Minimize the KL divergence between a family q and the posterior p.
Mean-field assumption: Set q(β, z) to be fully factored,

    q(β, z) = q(β | λ) ∏_{i=1}^n q(z_i | φ_i).

Each factor is in the same family as the model's complete conditional:

    p(β | z, x) = h(β) exp{η_g(z, x)^⊤ β − a(η_g(z, x))}
    q(β | λ) = h(β) exp{λ^⊤ β − a(λ)}.

33 Mean-field variational inference
[Figure: the model and the variational family linked through the ELBO.]
Optimize the evidence lower bound (ELBO), equivalent to optimizing the negative KL divergence,

    L(λ, φ_{1:n}) = E_q[log p(β, Z, x)] − E_q[log q(β, Z)].

Traditional VI uses coordinate ascent [Ghahramani and Beal, 2001]:

    λ* = E_φ[η_g(Z, x)],  φ_i* = E_λ[η_ℓ(β, x_i)].

Iteratively update each parameter, holding the others fixed. Notice the relationship to Gibbs sampling.
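For completeness, the usual one-line justification that L lower-bounds the evidence (standard background, not on the slide):

    \[
    \log p(x) = \mathcal{L}(\lambda, \phi_{1:n})
      + \mathrm{KL}\big(q(\beta, z)\,\|\,p(\beta, z \mid x)\big)
      \;\geq\; \mathcal{L}(\lambda, \phi_{1:n}),
    \]

so maximizing the ELBO over (λ, φ_{1:n}) is the same as minimizing the KL divergence from q to the exact posterior.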

34 Mean-field variational inference for LDA
[Graphical model: variational parameters γ_d, φ_{d,n}, λ_k alongside θ_d, z_{d,n}, w_{d,n}, β_k; plates N, D, K.]
- The local variables are the per-document variables θ_d and z_d.
- The global variables are the topics β_1, ..., β_K.
- The variational distribution is

    q(β, θ, z) = ∏_{k=1}^K q(β_k | λ_k) ∏_{d=1}^D q(θ_d | γ_d) ∏_{n=1}^N q(z_{d,n} | φ_{d,n})

35 Mean-field variational inference for LDA
[Figure: bar chart of per-topic word probabilities.]

36 Mean-field variational inference for LDA

    Genetics     Evolution      Disease       Computers
    human        evolution      disease       computer
    genome       evolutionary   host          models
    dna          species        bacteria      information
    genetic      organisms      diseases      data
    genes        life           resistance    computers
    sequence     origin         bacterial     system
    gene         biology        new           network
    molecular    groups         strains       systems
    sequencing   phylogenetic   control       model
    map          living         infectious    parallel
    information  diversity      malaria       methods
    genetics     group          parasite      networks
    mapping      new            parasites     software
    project      two            united        new
    sequences    common         tuberculosis  simulations

37 Classical variational inference
Input: data x, model p(β, z, x).
Initialize λ randomly.
repeat
    for each data point i do
        Set local parameter φ_i ← E_λ[η_ℓ(β, x_i)].
    end
    Set global parameter λ ← α + Σ_{i=1}^n E_{φ_i}[t(Z_i, x_i)].
until the ELBO has converged
This is inefficient: We analyze all data before completing one iteration.
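A minimal Python skeleton of this coordinate-ascent loop, assuming hypothetical model-specific helpers local_update (computing E_λ[η_ℓ(β, x_i)]) and suff_stats (computing E_{φ_i}[t(Z_i, x_i)]); it mirrors the pseudocode above rather than any particular implementation.

    import numpy as np

    def cavi(x, alpha, local_update, suff_stats, init_lambda, n_iters=100):
        """Classical coordinate-ascent VI for the conditionally conjugate model class."""
        lam = init_lambda
        for _ in range(n_iters):                          # until the ELBO converges
            phi = [local_update(lam, x_i) for x_i in x]   # local step: every data point
            lam = alpha + sum(suff_stats(p, x_i)          # global step: aggregate
                              for p, x_i in zip(phi, x))
        return lam, phi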

38 Stochastic optimization
- Replace the gradient with cheaper noisy estimates [Robbins and Monro, 1951]
- Guaranteed to converge to a local optimum [Bottou, 1998]
- Has enabled modern machine learning
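A toy illustration of the Robbins-Monro idea (my own sketch, not from the talk): unbiased but noisy gradients with step sizes ρ_t = (t + τ)^(−κ), which satisfy Σ ρ_t = ∞ and Σ ρ_t² < ∞, still find the optimum.

    import numpy as np

    rng = np.random.default_rng(1)
    theta, target = 0.0, 3.0              # maximize -(target - theta)^2 / 2
    for t in range(1, 10_000):
        noisy_grad = (target - theta) + rng.normal(scale=5.0)  # unbiased, high variance
        rho = (t + 10) ** -0.7            # Robbins-Monro step size, kappa in (0.5, 1]
        theta += rho * noisy_grad
    print(theta)                           # close to 3.0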

39 Natural gradients
[Image: first page of "Natural Gradient Works Efficiently in Learning," Shun-ichi Amari, Neural Computation.]
The natural gradient of the ELBO [Amari, 1998; Sato, 2001]:

    ∇̂_λ L = α + Σ_{i=1}^n E_{φ_i*}[t(Z_i, x_i)] − λ.

Computationally:
- Compute coordinate updates.
- Subtract the current variational parameters.
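The definition behind this shortcut (standard background, following Amari): premultiply the ordinary gradient by the inverse Fisher information of q,

    \[
    \hat{\nabla}_\lambda \mathcal{L} = F(\lambda)^{-1} \nabla_\lambda \mathcal{L},
    \qquad
    F(\lambda) = \mathbb{E}_{q(\beta \mid \lambda)}\big[
        \nabla_\lambda \log q(\beta \mid \lambda)\,
        \nabla_\lambda \log q(\beta \mid \lambda)^{\top}\big].
    \]

For an exponential-family q, the Fisher matrix cancels against a matching factor in the ordinary gradient, which is why the natural gradient reduces to "coordinate update minus current parameter."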

40 Stochastic variational inference
Subsample data → Infer local structure → Update global structure
Construct a noisy natural gradient:
- Sample a data point at random, j ~ Uniform(1, ..., n).
- Calculate ∇̃_λ L = α + n E_{φ_j*}[t(Z_j, x_j)] − λ.
This is a good noisy gradient:
- The expectation (with respect to j) is the natural gradient.
- Only requires the local parameters for one data point.

41 Stochastic variational inference
Input: data x, model p(β, z, x).
Initialize λ randomly. Set ρ_t appropriately.
repeat
    Sample j ~ Unif(1, ..., n).
    Set local parameter φ ← E_λ[η_ℓ(β, x_j)].
    Set intermediate global parameter λ̂ = α + n E_φ[t(Z_j, x_j)].
    Set global parameter λ = (1 − ρ_t) λ + ρ_t λ̂.
until forever
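The same loop as a Python sketch, reusing the hypothetical local_update and suff_stats helpers from the classical-VI sketch above; step sizes follow the Robbins-Monro schedule.

    import numpy as np

    def svi(x, alpha, local_update, suff_stats, init_lambda,
            n_steps=100_000, tau=1.0, kappa=0.7, seed=0):
        """Stochastic VI: one data point per step, noisy natural-gradient update."""
        rng = np.random.default_rng(seed)
        n, lam = len(x), init_lambda
        for t in range(1, n_steps + 1):
            j = rng.integers(n)                          # subsample a data point
            phi = local_update(lam, x[j])                # infer its local structure
            lam_hat = alpha + n * suff_stats(phi, x[j])  # intermediate global parameter
            rho = (t + tau) ** -kappa                    # step size
            lam = (1 - rho) * lam + rho * lam_hat        # update global structure
        return lam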

42 Stochastic variational inference
[Figure: MASSIVE DATA ↔ GLOBAL HIDDEN STRUCTURE.]
Subsample data → Infer local structure → Update global structure

43 Stochastic variational inference in LDA
[Graphical model: γ_d, φ_{d,n}, λ_k alongside θ_d, z_{d,n}, w_{d,n}, β_k; plates N, D, K.]
1. Sample a document.
2. Estimate the local variational parameters using the current topics.
3. Form intermediate topics from those local parameters.
4. Update topics as a weighted average of intermediate and current topics.

44 Stochastic variational inference in LDA
[Figure: held-out perplexity vs. documents seen (log scale) for online LDA on 98K and 3.3M documents against batch LDA on 98K documents, with the top eight words of one topic evolving as more documents are analyzed.]
[Hoffman et al., 2010]

45 HDP topic models
[Figure: log predictive probability vs. hours on the Nature, New York Times, and Wikipedia corpora, comparing stochastic HDP inference with batch HDP inference.]
- A study of large corpora with the HDP topic model [Teh et al., 2006]
- Details are in Hoffman et al., 2013.
- SVI is faster and lets us analyze more data.

46 Topics using the HDP, found in 1.8M articles from the New York Times
[Figure: topic word lists from a corpus of 1.8 million New York Times articles. Modified from Hoffman et al. (2013).]

47 Stochastic variational inference
Subsample data → Infer local structure → Update global structure
We derived an algorithm for scalable variational inference on:
- Bayesian mixture models
- Time series models (variants of HMMs, Kalman filters)
- Factorial models
- Matrix factorization (e.g., factor analysis, PCA, CCA)
- Dirichlet process mixtures, HDPs
- Multilevel regression (linear, probit, Poisson)
- Stochastic blockmodels
- Mixed-membership models (LDA and some variants)

48 [Figure: admixture analysis across Human Genome Diversity Panel populations (Adygei through Yoruba), and a graph of topics discovered in scientific articles (e.g., immune response, genetics, disease, climate, astronomy, materials).]

49 BLACK BOX VARIATIONAL INFERENCE (with Rajesh Ranganath and Sean Gerrish)

50 Black box variational inference is an algorithm that efficiently performs Bayesian computation in any model.

51 Black box variational inference
[Figure: the probabilistic pipeline: knowledge & question, data → make assumptions → discover patterns → predict & explore.]
- Approximate inference can be difficult to derive.
- Especially true for models that are not conditionally conjugate
- E.g., discrete choice models, Bayesian generalized linear models, ...

52 Black box variational inference
[Figure: massive data + any model → black box variational inference → p(β, z | x), with reusable variational families.]
- Easily use variational inference with any model
- No exponential family requirements
- No mathematical work beyond specifying the model

53 Black box variational inference
[Figure: as above.]
- Sample from q(·)
- Form noisy gradients without model-specific computation
- Use stochastic optimization

54 Black box variational inference
[Figure: the model and the variational family linked through the ELBO.]
ELBO: L(ν) = E_q[log p(β, Z, x) − log q(β, Z)], writing ν for the variational parameters.
Shorthand: L(ν) = E_q[D(β, Z)], where D(β, Z) := log p(β, Z, x) − log q(β, Z).

55 Black box variational inference
[Figure: as above.]
A noisy gradient:

    ∇_ν L ≈ (1/B) Σ_{b=1}^B ∇_ν log q(β_b, z_b) D(β_b, z_b),

where (β_b, z_b) ~ q(β, z).

56 The noisy gradient

    ∇_ν L(ν) ≈ (1/B) Σ_{b=1}^B ∇_ν log q(β_b, z_b) D(β_b, z_b)

We use these gradients in a stochastic optimization algorithm. Requirements:
- Sampling from q(β, z)
- Evaluating ∇_ν log q(β, z)
- Evaluating log p(β, z, x)
A "black box": we can reuse q(·) across models
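A minimal sketch of this score-function estimator, fitting a mean-field Gaussian q to any supplied log joint; the toy target below, the step size, and the parameterization are illustrative assumptions, not the talk's code.

    import numpy as np

    def bbvi_gaussian(log_p, dim, n_steps=5000, B=64, lr=0.05, seed=0):
        """Score-function BBVI with q(z) = N(mu, exp(2 * log_s)), mean-field."""
        rng = np.random.default_rng(seed)
        mu, log_s = np.zeros(dim), np.zeros(dim)
        for _ in range(n_steps):
            s = np.exp(log_s)
            z = mu + s * rng.standard_normal((B, dim))       # z_b ~ q
            log_q = -0.5 * np.sum(((z - mu) / s) ** 2
                                  + 2 * log_s + np.log(2 * np.pi), axis=1)
            D = np.array([log_p(zb) for zb in z]) - log_q    # D(z_b)
            g_mu = (z - mu) / s ** 2                         # grad_mu log q
            g_ls = ((z - mu) / s) ** 2 - 1.0                 # grad_log_s log q
            mu += lr * np.mean(g_mu * D[:, None], axis=0)    # noisy gradient steps
            log_s += lr * np.mean(g_ls * D[:, None], axis=0)
        return mu, np.exp(log_s)

    # Toy target: log of an unnormalized N(3, 0.5^2) density.
    mu_hat, s_hat = bbvi_gaussian(lambda z: -0.5 * np.sum(((z - 3.0) / 0.5) ** 2), dim=1)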

57 The noisy gradient

    ∇_ν L(ν) ≈ (1/B) Σ_{b=1}^B ∇_ν log q(β_b, z_b) D(β_b, z_b)

Making it work:
- Rao-Blackwellization for each component of the gradient
- Control variates, again using ∇ log q(β, z)
- AdaGrad, for setting learning rates [Duchi, Hazan, Singer, 2011]
- Stochastic variational inference, for handling massive data
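To illustrate the control-variate item (my own sketch, not the talk's code): since E_q[∇ log q] = 0, the score function itself is a valid control variate, and its optimal scaling can be estimated from the same samples.

    import numpy as np

    def cv_gradient(score, D):
        """Variance-reduced BBVI gradient; score has shape (B, dim), D has shape (B,)."""
        f = score * D[:, None]                  # raw estimator terms
        h = score                               # control variate, with E_q[h] = 0
        # Optimal per-coordinate scaling a* = Cov(f, h) / Var(h).
        a = (np.mean(f * h, axis=0) - np.mean(f, axis=0) * np.mean(h, axis=0)) \
            / (np.var(h, axis=0) + 1e-12)
        return np.mean(f - a * h, axis=0)       # unbiased, lower-variance gradient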

58 Monte Carlo gradients of the ELBO
[Figure: massive data + any model → black box variational inference → p(β, z | x), with reusable variational families.]
- MC gradient [Ji et al., 2010; Nott et al., 2012; Paisley et al., 2012; Ranganath et al., 2014]
- Autoencoders [Kingma and Welling, 2013/2014]
- Neural networks [Kingma et al., 2014; Mnih and Gregor, 2014; Rezende et al., 2014]
- A perspective from regression [Salimans and Knowles, 2013]
- Doubly stochastic VB [Titsias and Lazaro-Gredilla, 2014]
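One of these alternatives, the reparameterization gradient used by the autoencoder line of work, in a minimal sketch (illustrative, assuming a Gaussian q): write z = μ + σ·ε with ε ~ N(0, I) and differentiate through the sample path.

    import numpy as np

    rng = np.random.default_rng(0)

    def reparam_gradient(grad_log_p, mu, log_s, B=64):
        """Reparameterization estimate of grad E_q[log p(z)] for q = N(mu, exp(2*log_s))."""
        s = np.exp(log_s)
        eps = rng.standard_normal((B, mu.size))
        z = mu + s * eps                          # differentiable sample path
        g = np.array([grad_log_p(zb) for zb in z])
        g_mu = np.mean(g, axis=0)                 # chain rule: dz/dmu = 1
        g_ls = np.mean(g * eps * s, axis=0)       # chain rule: dz/dlog_s = s * eps
        return g_mu, g_ls

For a Gaussian q, the entropy term of the ELBO has a closed-form gradient, so this estimator covers the remaining expectation; it is typically much lower variance than the score-function estimator when ∇ log p is available.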

59 Neuroscience analysis of millions of fMRI measurements [Manning et al., PLOS ONE 2014]

60 Deep Exponential Families [Ranganath et al., AISTATS 2015]; Probabilistic Programming [Kucukelbir et al., 2015]
[Figure: the layered graphical model of a deep exponential family, with per-layer latent variables z_{n,ℓ,k}, weights w_{ℓ,k} down to w_{0,i}, and observations x_{n,i}; plates over K_ℓ, V, and N.]

61 Edward: A library for probabilistic modeling, inference, and criticism github.com/blei-lab/edward (led by Dustin Tran)

62 The probabilistic pipeline
[Figure: the probabilistic pipeline: knowledge & question, data → make assumptions → discover patterns → predict & explore.]
- Customized data analysis is important to many fields.
- Pipeline separates assumptions, computation, application
- Eases collaborative solutions to statistics problems

63 The probabilistic pipeline
[Figure: the probabilistic pipeline, as above.]
- Inference is the key algorithmic problem.
- Answers the question: What does this model say about this data?
- Our goal: general and scalable approaches to inference

64 Stochastic variational inference [Hoffman et al., 2013, JMLR]
[Figure: MASSIVE DATA ↔ GLOBAL HIDDEN STRUCTURE.]
Subsample data → Infer local structure → Update global structure

65 Black box variational inference [Ranganath et al., 2014, AISTATS]
[Figure: massive data + any model → black box variational inference → p(β, z | x), with reusable variational families.]

66 Recent research (on the arXiv):
- Variational Inference: A Review for Statisticians (with J. McAuliffe and A. Kucukelbir)
- Hierarchical Variational Models (with R. Ranganath and D. Tran)
- Automatic Differentiation Variational Inference (with A. Kucukelbir, D. Tran, R. Ranganath, and A. Gelman)
Open and current research:
- How can we improve the gradient estimator in BBVI?
- Can we use alternative divergence measures? What are their properties?
- What are the statistical properties of VI? What is a good framework?
- How can we efficiently go beyond the mean field?

67 [Figure: the probabilistic pipeline with a loop: knowledge & question, data → make assumptions → discover patterns → predict & explore → criticize model → revise.]

68 Should I be skeptical about variational inference?
[Figure: average log predictive vs. seconds for ADVI (M=1), ADVI (M=10), NUTS, and HMC on (a) linear regression with ARD and (b) hierarchical logistic regression. Hierarchical generalized linear models.]
- MCMC enjoys theoretical guarantees. But they usually get to the same place. [Kucukelbir et al., submitted]
- We need more theory about variational inference.

69 Should I be skeptical about variational inference?
[Figure: kernel density estimates of posterior marginals for tau and beta, comparing the truth, NUTS, ADVI-MF, and ADVI-FR.]
- Variational inference underestimates the variance of the posterior.
- Relaxing the mean-field assumption can help. Here: a Poisson GLM [Giordano et al., 2015]


More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Document and Topic Models: plsa and LDA

Document and Topic Models: plsa and LDA Document and Topic Models: plsa and LDA Andrew Levandoski and Jonathan Lobo CS 3750 Advanced Topics in Machine Learning 2 October 2018 Outline Topic Models plsa LSA Model Fitting via EM phits: link analysis

More information

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) Latent Dirichlet Allocation (LDA) A review of topic modeling and customer interactions application 3/11/2015 1 Agenda Agenda Items 1 What is topic modeling? Intro Text Mining & Pre-Processing Natural Language

More information

Restricted Boltzmann Machines for Collaborative Filtering

Restricted Boltzmann Machines for Collaborative Filtering Restricted Boltzmann Machines for Collaborative Filtering Authors: Ruslan Salakhutdinov Andriy Mnih Geoffrey Hinton Benjamin Schwehn Presentation by: Ioan Stanculescu 1 Overview The Netflix prize problem

More information

Topic Models and Applications to Short Documents

Topic Models and Applications to Short Documents Topic Models and Applications to Short Documents Dieu-Thu Le Email: dieuthu.le@unitn.it Trento University April 6, 2011 1 / 43 Outline Introduction Latent Dirichlet Allocation Gibbs Sampling Short Text

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Introduction to Bayesian inference

Introduction to Bayesian inference Introduction to Bayesian inference Thomas Alexander Brouwer University of Cambridge tab43@cam.ac.uk 17 November 2015 Probabilistic models Describe how data was generated using probability distributions

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

Black-box α-divergence Minimization

Black-box α-divergence Minimization Black-box α-divergence Minimization José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, Thang Bui, Richard Turner, Harvard University, University of Cambridge, Universidad Autónoma de Madrid.

More information

Deep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016

Deep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016 Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016 What is Variational Inference? What is Variational Inference? Want to estimate some distribution, p*(x) p*(x) What is

More information

Computational Biology Course Descriptions 12-14

Computational Biology Course Descriptions 12-14 Computational Biology Course Descriptions 12-14 Course Number and Title INTRODUCTORY COURSES BIO 311C: Introductory Biology I BIO 311D: Introductory Biology II BIO 325: Genetics CH 301: Principles of Chemistry

More information

Probabilistic Graphical Models for Image Analysis - Lecture 4

Probabilistic Graphical Models for Image Analysis - Lecture 4 Probabilistic Graphical Models for Image Analysis - Lecture 4 Stefan Bauer 12 October 2018 Max Planck ETH Center for Learning Systems Overview 1. Repetition 2. α-divergence 3. Variational Inference 4.

More information

Dynamic Probabilistic Models for Latent Feature Propagation in Social Networks

Dynamic Probabilistic Models for Latent Feature Propagation in Social Networks Dynamic Probabilistic Models for Latent Feature Propagation in Social Networks Creighton Heaukulani and Zoubin Ghahramani University of Cambridge TU Denmark, June 2013 1 A Network Dynamic network data

More information

Advances in Variational Inference

Advances in Variational Inference 1 Advances in Variational Inference Cheng Zhang Judith Bütepage Hedvig Kjellström Stephan Mandt arxiv:1711.05597v1 [cs.lg] 15 Nov 2017 Abstract Many modern unsupervised or semi-supervised machine learning

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

XVII. Science and Technology/Engineering, Grade 8

XVII. Science and Technology/Engineering, Grade 8 XVII. Science and Technology/Engineering, Grade 8 Grade 8 Science and Technology/Engineering Test The spring 2017 grade 8 Science and Technology/Engineering test was based on learning standards in the

More information

Distributed ML for DOSNs: giving power back to users

Distributed ML for DOSNs: giving power back to users Distributed ML for DOSNs: giving power back to users Amira Soliman KTH isocial Marie Curie Initial Training Networks Part1 Agenda DOSNs and Machine Learning DIVa: Decentralized Identity Validation for

More information

Generative Models for Discrete Data

Generative Models for Discrete Data Generative Models for Discrete Data ddebarr@uw.edu 2016-04-21 Agenda Bayesian Concept Learning Beta-Binomial Model Dirichlet-Multinomial Model Naïve Bayes Classifiers Bayesian Concept Learning Numbers

More information

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017 COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University TOPIC MODELING MODELS FOR TEXT DATA

More information

Content-based Recommendation

Content-based Recommendation Content-based Recommendation Suthee Chaidaroon June 13, 2016 Contents 1 Introduction 1 1.1 Matrix Factorization......................... 2 2 slda 2 2.1 Model................................. 3 3 flda 3

More information

Variational Inference with Copula Augmentation

Variational Inference with Copula Augmentation Variational Inference with Copula Augmentation Dustin Tran 1 David M. Blei 2 Edoardo M. Airoldi 1 1 Department of Statistics, Harvard University 2 Department of Statistics & Computer Science, Columbia

More information

Variational Autoencoders

Variational Autoencoders Variational Autoencoders Recap: Story so far A classification MLP actually comprises two components A feature extraction network that converts the inputs into linearly separable features Or nearly linearly

More information

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes Yee Whye Teh (1), Michael I. Jordan (1,2), Matthew J. Beal (3) and David M. Blei (1) (1) Computer Science Div., (2) Dept. of Statistics

More information

Genomes and Their Evolution

Genomes and Their Evolution Chapter 21 Genomes and Their Evolution PowerPoint Lecture Presentations for Biology Eighth Edition Neil Campbell and Jane Reece Lectures by Chris Romero, updated by Erin Barley with contributions from

More information

Proximity Variational Inference

Proximity Variational Inference Jaan Altosaar Rajesh Ranganath David M. Blei altosaar@princeton.edu rajeshr@cims.nyu.edu david.blei@columbia.edu Princeton University New York University Columbia University Abstract Variational inference

More information

Statistical Models. David M. Blei Columbia University. October 14, 2014

Statistical Models. David M. Blei Columbia University. October 14, 2014 Statistical Models David M. Blei Columbia University October 14, 2014 We have discussed graphical models. Graphical models are a formalism for representing families of probability distributions. They are

More information

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning Professor Erik Sudderth Brown University Computer Science October 4, 2016 Some figures and materials courtesy

More information

Revision Based on Chapter 19 Grade 11

Revision Based on Chapter 19 Grade 11 Revision Based on Chapter 19 Grade 11 Biology Multiple Choice Identify the choice that best completes the statement or answers the question. 1. Most fossils are found in rusty water. volcanic rock. sedimentary

More information

CS Lecture 18. Topic Models and LDA

CS Lecture 18. Topic Models and LDA CS 6347 Lecture 18 Topic Models and LDA (some slides by David Blei) Generative vs. Discriminative Models Recall that, in Bayesian networks, there could be many different, but equivalent models of the same

More information

Bayesian Paragraph Vectors

Bayesian Paragraph Vectors Bayesian Paragraph Vectors Geng Ji 1, Robert Bamler 2, Erik B. Sudderth 1, and Stephan Mandt 2 1 Department of Computer Science, UC Irvine, {gji1, sudderth}@uci.edu 2 Disney Research, firstname.lastname@disneyresearch.com

More information

Distance dependent Chinese restaurant processes

Distance dependent Chinese restaurant processes David M. Blei Department of Computer Science, Princeton University 35 Olden St., Princeton, NJ 08540 Peter Frazier Department of Operations Research and Information Engineering, Cornell University 232

More information

Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs

Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs Lawrence Livermore National Laboratory Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs Keith Henderson and Tina Eliassi-Rad keith@llnl.gov and eliassi@llnl.gov This work was performed

More information