SCALING AND GENERALIZING APPROXIMATE BAYESIAN INFERENCE. David M. Blei, Columbia University


1 SCALING AND GENERALIZING APPROXIMATE BAYESIAN INFERENCE David M. Blei Columbia University

2 This talk is about how to discover hidden patterns in large high-dimensional data sets.

3 Communities discovered in a 3.7M node network of U.S. patents [Gopalan and Blei, PNAS 2013]

4 Topics found in 1.8M articles from the New York Times [Hoffman, Blei, Wang, Paisley, JMLR 2013]
[Figure: topic word lists discovered in a corpus of 1.8 million New York Times articles, e.g., sports (Game, Season, Team, Coach), politics (Bush, Campaign, Clinton), family (Children, School, Women), real estate (Building, Street, Housing), finance (Stock, Percent, Companies, Fund), film (Film, Movie, Show, Television), religion (Church, War, Catholic), books (Book, Novel, Story), baseball (Yankees, Game, Mets), art (Art, Museum, Gallery), travel (Wine, Hotel, Restaurant), war (Government, Military, Iraq), and crime (Police, Officer, Charged). Modified from Hoffman et al. (2013).]

5 Population analysis of billions of genetic measurements [Gopalan, Hao, Blei, Storey, to appear]
[Figure: admixture proportions across Human Genome Diversity Panel populations (e.g., Japanese, Kalash, Karitiana, Makrani, Mandenka, Maya, Melanesian, Mozabite, Palestinian, Papuan, Russian, San, Sardinian, Yakut, Yoruba).]

6 Neuroscience analysis of millions of fMRI measurements [Manning et al., PLOS ONE 2014]

7 Analysis of 1.7M taxi trajectories, in Stan [Kucukelbir et al., 2016]

8 The probabilistic pipeline
[Figure: the probabilistic pipeline: knowledge & question, data → make assumptions → discover patterns → predict & explore.]
- Customized data analysis is important to many fields.
- Pipeline separates assumptions, computation, application
- Eases collaborative solutions to statistics problems

9 The probabilistic pipeline
[Figure: the probabilistic pipeline, as above.]
- Inference is the key algorithmic problem.
- Answers the question: What does this model say about this data?
- Our goal: general and scalable approaches to inference

10 The probabilistic pipeline
[Figure: the probabilistic pipeline, as above.]
- Variational methods: inference as optimization [Jordan et al., 1999]
- Scale up with stochastic variational inference (SVI) [Hoffman et al., 2013]
- Generalize with black box variational inference (BBVI) [Ranganath et al., 2014]

11 The probabilistic pipeline
[Figure: the probabilistic pipeline, as above.]
- Both approaches use stochastic optimization.
- SVI subsamples from a massive data set
- BBVI uses Monte Carlo to approximate difficult-to-compute expectations

12 STOCHASTIC VARIATIONAL INFERENCE (with Matt Hoffman, Chong Wang, John Paisley)

13 Stochastic variational inference is an algorithm that scales general Bayesian computation to massive data.

14 Motivation: Topic Modeling
1. Discover the thematic structure in a large collection of documents.
2. Annotate the documents.
3. Use the annotations to visualize, organize, summarize, ...

15 Example: Latent Dirichlet allocation (LDA) Documents exhibit multiple topics.

16 Example: Latent Dirichlet allocation (LDA)
[Figure: topics as word distributions (e.g., gene/dna/genetic; life/evolve/organism; brain/neuron/nerve; data/number/computer, each word with a probability), a document, and its topic proportions and assignments.]
- Each topic is a distribution over words
- Each document is a mixture of corpus-wide topics
- Each word is drawn from one of those topics
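To make this generative story concrete, here is a minimal Python/NumPy sketch of LDA's generative process; the dimensions and hyperparameter values are illustrative assumptions, not from the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    V, K, D, N = 1000, 10, 5, 50   # vocabulary size, topics, documents, words per doc (assumed)
    alpha, eta = 0.1, 0.01         # Dirichlet hyperparameters (assumed)

    beta = rng.dirichlet(eta * np.ones(V), size=K)   # each topic: a distribution over words
    docs = []
    for d in range(D):
        theta = rng.dirichlet(alpha * np.ones(K))    # topic proportions for document d
        z = rng.choice(K, size=N, p=theta)           # per-word topic assignments
        w = np.array([rng.choice(V, p=beta[k]) for k in z])  # draw each word from its topic
        docs.append(w)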

17 Example: Latent Dirichlet allocation (LDA)
[Figure: topics, documents, and topic proportions and assignments.]
But we only observe the documents; the other structure is hidden. We compute the posterior p(topics, proportions, assignments | documents).

18 Example: Latent Dirichlet allocation (LDA)
[Figure: topics, documents, and topic proportions and assignments.]
Many data sets contain millions of documents. This requires inference about billions of variables. SVI can scale to these data.

19 LDA as a graphical model
[Graphical model: proportions parameter α → per-document topic proportions θ_d → per-word topic assignment z_{d,n} → observed word w_{d,n} ← topics β_k ← topic parameter η; plates over words N, documents D, and topics K.]
- Encodes assumptions about data with a factorization of the joint
- Connects assumptions to algorithms for computing with data
- Defines the posterior (through the joint)

20 Posterior inference
[Graphical model: θ_d, z_{d,n}, w_{d,n}, β_k; plates N, D, K.]
The posterior of the latent variables given the documents is

    p(β, θ, z | w) = p(β, θ, z, w) / ∫_β ∫_θ Σ_z p(β, θ, z, w).

We can't compute the denominator, the marginal p(w). We use approximate inference.
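Spelled out under LDA's factorization (a standard LaTeX rendering, with α and η as on the graphical-model slide), the marginal in that denominator is

    \[
    p(w) = \int p(\beta \mid \eta) \prod_{d=1}^{D} \int p(\theta_d \mid \alpha)
           \prod_{n=1}^{N} \sum_{z_{d,n}=1}^{K} p(z_{d,n} \mid \theta_d)\,
           p(w_{d,n} \mid \beta_{z_{d,n}}) \, d\theta_d \, d\beta ,
    \]

and the coupling of β and the θ_d inside the integrals is what makes it intractable.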

21 Classical variational inference
[Graphical model: θ_d, z_{d,n}, w_{d,n}, β_k; plates N, D, K.]
Originally, we used variational methods to fit this model [Blei et al., 2003]. Classical variational inference is inefficient:
1. Do some local computation for each data point.
2. Aggregate these computations to re-estimate global structure.
3. Repeat.
This cannot handle massive data.

22 Stochastic variational inference
[Figure: MASSIVE DATA ↔ GLOBAL HIDDEN STRUCTURE.]
Subsample data → Infer local structure → Update global structure

23 Stochastic variational inference
[Figure: held-out perplexity vs. documents seen (log scale) for online LDA on 98K and 3.3M documents against batch LDA on 98K documents, with the top eight words of one topic evolving as more documents are analyzed (from "systems road made service ..." toward "business industry service companies ...").]
[Hoffman et al., 2010]

24 Topics using the HDP, found in 1.8M articles from the New York Times
[Figure: topic word lists from a corpus of 1.8 million New York Times articles. Modified from Hoffman et al. (2013).]

25 Stochastic variational inference
Subsample data → Infer local structure → Update global structure
Next parts of this talk:
1. Define a generic class of models
2. Derive classical mean-field variational inference
3. Use the classical algorithm to derive stochastic variational inference

26 A generic class of models
[Graphical model: global variables β; local variables z_i, x_i; plate over i = 1, ..., n.]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

- The observations are x = x_{1:n}.
- The local variables are z = z_{1:n}.
- The global variables are β.
- The i-th data point x_i only depends on z_i and β.
Goal: Compute p(β, z | x).

27 A generic class of models
[Graphical model: global variables β; local variables z_i, x_i; plate over n.]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

A complete conditional is the conditional of a latent variable given the observations and the other latent variables. Assume each complete conditional is in the exponential family,

    p(z_i | β, x_i) = h(z_i) exp{η_ℓ(β, x_i)^⊤ z_i − a(η_ℓ(β, x_i))}
    p(β | z, x) = h(β) exp{η_g(z, x)^⊤ β − a(η_g(z, x))}.

28 A generic class of models
[Graphical model: global variables β; local variables z_i, x_i; plate over n.]

    p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

A complete conditional is the conditional of a latent variable given the observations and the other latent variables. The global parameter comes from conjugacy [Bernardo and Smith, 1994],

    η_g(z, x) = α + Σ_{i=1}^n t(z_i, x_i),

where α is a hyperparameter and t(·) are sufficient statistics for [z_i, x_i].
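As a concrete instance of this template (standard LDA results, not spelled out on the slide), a LaTeX sketch of LDA's complete conditionals, using the notation from the earlier slides:

    \begin{align*}
    p(z_{d,n} = k \mid \theta_d, \beta, w_{d,n}) &\propto
        \exp\{\log \theta_{d,k} + \log \beta_{k, w_{d,n}}\}, \\
    p(\theta_d \mid z_d) &= \mathrm{Dirichlet}\Big(\alpha + \sum_{n} z_{d,n}\Big), \\
    p(\beta_k \mid z, w) &= \mathrm{Dirichlet}\Big(\eta + \sum_{d,n} z_{d,n}^k\, w_{d,n}\Big),
    \end{align*}

with z_{d,n} one-hot over topics and w_{d,n} an indicator vector over the vocabulary. Each conditional is in the exponential family, and the Dirichlet parameters are the hyperparameter plus summed sufficient statistics, matching η_g(z, x) = α + Σ_i t(z_i, x_i).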

29 A generic class of models
[Graphical model and joint as above.]
Dirichlet process mixture with conjugate base measure [Sethuraman, 1994]
- Local: mixture assignments for each observation (categorical).
- Global: stick lengths (beta) and mixture components (e.g., Gaussian).
Other BNP models, e.g.,
- Beta-Bernoulli process
- Hierarchical Dirichlet process

30 A generic class of models
[Graphical model and joint as above.]
- Bayesian mixture models
- Time series models (variants of HMMs, Kalman filters)
- Factorial models
- Matrix factorization (e.g., factor analysis, PCA, CCA)
- Dirichlet process mixtures, HDPs
- Multilevel regression (linear, probit, Poisson)
- Stochastic blockmodels
- Mixed-membership models (LDA and some variants)

31 [Figure: mean-field variational inference fitting a mixture of Gaussians, shown from initialization through successive iterations, alongside the evidence lower bound and average log predictive over iterations. Image by Alp Kucukelbir.]
- Variational methods turn inference into optimization
- Idea: Fit a simple distribution to be close (in KL) to the exact posterior
- Here: A simple mixture of Gaussians

32 Mean-field variational inference
[Figure: the model (β, z_i, x_i) and the variational family (λ, φ_i, z_i) linked through the ELBO.]
Goal: Minimize the KL divergence between a family q and the posterior p.
Mean-field assumption: Set q(β, z) to be fully factored,

    q(β, z) = q(β | λ) ∏_{i=1}^n q(z_i | φ_i).

Each factor is in the same family as the model's complete conditional:

    p(β | z, x) = h(β) exp{η_g(z, x)^⊤ β − a(η_g(z, x))}
    q(β | λ) = h(β) exp{λ^⊤ β − a(λ)}.

33 Mean-field variational inference
[Figure: the model and the variational family linked through the ELBO.]
Optimize the evidence lower bound (ELBO), equivalent to optimizing the negative KL divergence,

    L(λ, φ_{1:n}) = E_q[log p(β, Z, x)] − E_q[log q(β, Z)].

Traditional VI uses coordinate ascent [Ghahramani and Beal, 2001]:

    λ* = E_φ[η_g(Z, x)],  φ_i* = E_λ[η_ℓ(β, x_i)].

Iteratively update each parameter, holding the others fixed. Notice the relationship to Gibbs sampling.
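For completeness, the usual one-line justification that L lower-bounds the evidence (standard background, not on the slide):

    \[
    \log p(x) = \mathcal{L}(\lambda, \phi_{1:n})
      + \mathrm{KL}\big(q(\beta, z)\,\|\,p(\beta, z \mid x)\big)
      \;\geq\; \mathcal{L}(\lambda, \phi_{1:n}),
    \]

so maximizing the ELBO over (λ, φ_{1:n}) is the same as minimizing the KL divergence from q to the exact posterior.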

34 Mean-field variational inference for LDA
[Graphical model: variational parameters γ_d, φ_{d,n}, λ_k alongside θ_d, z_{d,n}, w_{d,n}, β_k; plates N, D, K.]
- The local variables are the per-document variables θ_d and z_d.
- The global variables are the topics β_1, ..., β_K.
- The variational distribution is

    q(β, θ, z) = ∏_{k=1}^K q(β_k | λ_k) ∏_{d=1}^D q(θ_d | γ_d) ∏_{n=1}^N q(z_{d,n} | φ_{d,n})

35 Mean-field variational inference for LDA
[Figure: bar chart of per-topic word probabilities.]

36 Mean-field variational inference for LDA

    Genetics     Evolution      Disease       Computers
    human        evolution      disease       computer
    genome       evolutionary   host          models
    dna          species        bacteria      information
    genetic      organisms      diseases      data
    genes        life           resistance    computers
    sequence     origin         bacterial     system
    gene         biology        new           network
    molecular    groups         strains       systems
    sequencing   phylogenetic   control       model
    map          living         infectious    parallel
    information  diversity      malaria       methods
    genetics     group          parasite      networks
    mapping      new            parasites     software
    project      two            united        new
    sequences    common         tuberculosis  simulations

37 Classical variational inference
Input: data x, model p(β, z, x).
Initialize λ randomly.
repeat
    for each data point i do
        Set local parameter φ_i ← E_λ[η_ℓ(β, x_i)].
    end
    Set global parameter λ ← α + Σ_{i=1}^n E_{φ_i}[t(Z_i, x_i)].
until the ELBO has converged
This is inefficient: We analyze all data before completing one iteration.
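A minimal Python skeleton of this coordinate-ascent loop, assuming hypothetical model-specific helpers local_update (computing E_λ[η_ℓ(β, x_i)]) and suff_stats (computing E_{φ_i}[t(Z_i, x_i)]); it mirrors the pseudocode above rather than any particular implementation.

    import numpy as np

    def cavi(x, alpha, local_update, suff_stats, init_lambda, n_iters=100):
        """Classical coordinate-ascent VI for the conditionally conjugate model class."""
        lam = init_lambda
        for _ in range(n_iters):                          # until the ELBO converges
            phi = [local_update(lam, x_i) for x_i in x]   # local step: every data point
            lam = alpha + sum(suff_stats(p, x_i)          # global step: aggregate
                              for p, x_i in zip(phi, x))
        return lam, phi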

38 Stochastic optimization
- Replace the gradient with cheaper noisy estimates [Robbins and Monro, 1951]
- Guaranteed to converge to a local optimum [Bottou, 1998]
- Has enabled modern machine learning
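A toy illustration of the Robbins-Monro idea (my own sketch, not from the talk): unbiased but noisy gradients with step sizes ρ_t = (t + τ)^(−κ), which satisfy Σ ρ_t = ∞ and Σ ρ_t² < ∞, still find the optimum.

    import numpy as np

    rng = np.random.default_rng(1)
    theta, target = 0.0, 3.0              # maximize -(target - theta)^2 / 2
    for t in range(1, 10_000):
        noisy_grad = (target - theta) + rng.normal(scale=5.0)  # unbiased, high variance
        rho = (t + 10) ** -0.7            # Robbins-Monro step size, kappa in (0.5, 1]
        theta += rho * noisy_grad
    print(theta)                           # close to 3.0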

39 Natural gradients
[Image: first page of "Natural Gradient Works Efficiently in Learning," Shun-ichi Amari, Neural Computation.]
The natural gradient of the ELBO [Amari, 1998; Sato, 2001]:

    ∇̂_λ L = α + Σ_{i=1}^n E_{φ_i*}[t(Z_i, x_i)] − λ.

Computationally:
- Compute coordinate updates.
- Subtract the current variational parameters.
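The definition behind this shortcut (standard background, following Amari): premultiply the ordinary gradient by the inverse Fisher information of q,

    \[
    \hat{\nabla}_\lambda \mathcal{L} = F(\lambda)^{-1} \nabla_\lambda \mathcal{L},
    \qquad
    F(\lambda) = \mathbb{E}_{q(\beta \mid \lambda)}\big[
        \nabla_\lambda \log q(\beta \mid \lambda)\,
        \nabla_\lambda \log q(\beta \mid \lambda)^{\top}\big].
    \]

For an exponential-family q, the Fisher matrix cancels against a matching factor in the ordinary gradient, which is why the natural gradient reduces to "coordinate update minus current parameter."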

40 Stochastic variational inference
Subsample data → Infer local structure → Update global structure
Construct a noisy natural gradient:
- Sample a data point at random, j ~ Uniform(1, ..., n).
- Calculate ∇̃_λ L = α + n E_{φ_j*}[t(Z_j, x_j)] − λ.
This is a good noisy gradient:
- The expectation (with respect to j) is the natural gradient.
- Only requires the local parameters for one data point.

41 Stochastic variational inference
Input: data x, model p(β, z, x).
Initialize λ randomly. Set ρ_t appropriately.
repeat
    Sample j ~ Unif(1, ..., n).
    Set local parameter φ ← E_λ[η_ℓ(β, x_j)].
    Set intermediate global parameter λ̂ = α + n E_φ[t(Z_j, x_j)].
    Set global parameter λ = (1 − ρ_t) λ + ρ_t λ̂.
until forever
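The same loop as a Python sketch, reusing the hypothetical local_update and suff_stats helpers from the classical-VI sketch above; step sizes follow the Robbins-Monro schedule.

    import numpy as np

    def svi(x, alpha, local_update, suff_stats, init_lambda,
            n_steps=100_000, tau=1.0, kappa=0.7, seed=0):
        """Stochastic VI: one data point per step, noisy natural-gradient update."""
        rng = np.random.default_rng(seed)
        n, lam = len(x), init_lambda
        for t in range(1, n_steps + 1):
            j = rng.integers(n)                          # subsample a data point
            phi = local_update(lam, x[j])                # infer its local structure
            lam_hat = alpha + n * suff_stats(phi, x[j])  # intermediate global parameter
            rho = (t + tau) ** -kappa                    # step size
            lam = (1 - rho) * lam + rho * lam_hat        # update global structure
        return lam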

42 Stochastic variational inference
[Figure: MASSIVE DATA ↔ GLOBAL HIDDEN STRUCTURE.]
Subsample data → Infer local structure → Update global structure

43 Stochastic variational inference in LDA
[Graphical model: γ_d, φ_{d,n}, λ_k alongside θ_d, z_{d,n}, w_{d,n}, β_k; plates N, D, K.]
1. Sample a document.
2. Estimate the local variational parameters using the current topics.
3. Form intermediate topics from those local parameters.
4. Update topics as a weighted average of intermediate and current topics.

44 Stochastic variational inference in LDA
[Figure: held-out perplexity vs. documents seen (log scale) for online LDA on 98K and 3.3M documents against batch LDA on 98K documents, with the top eight words of one topic evolving as more documents are analyzed.]
[Hoffman et al., 2010]

45 HDP topic models
[Figure: log predictive probability vs. hours on the Nature, New York Times, and Wikipedia corpora, comparing stochastic HDP inference with batch HDP inference.]
- A study of large corpora with the HDP topic model [Teh et al., 2006]
- Details are in Hoffman et al., 2013.
- SVI is faster and lets us analyze more data.

46 Topics using the HDP, found in 1.8M articles from the New York Times
[Figure: topic word lists from a corpus of 1.8 million New York Times articles. Modified from Hoffman et al. (2013).]

47 Stochastic variational inference
Subsample data → Infer local structure → Update global structure
We derived an algorithm for scalable variational inference on:
- Bayesian mixture models
- Time series models (variants of HMMs, Kalman filters)
- Factorial models
- Matrix factorization (e.g., factor analysis, PCA, CCA)
- Dirichlet process mixtures, HDPs
- Multilevel regression (linear, probit, Poisson)
- Stochastic blockmodels
- Mixed-membership models (LDA and some variants)

48 [Figure: admixture analysis across Human Genome Diversity Panel populations (Adygei through Yoruba), and a graph of topics discovered in scientific articles (e.g., immune response, genetics, disease, climate, astronomy, materials).]

49 BLACK BOX VARIATIONAL INFERENCE (with Rajesh Ranganath and Sean Gerrish)

50 Black box variational inference is an algorithm that efficiently performs Bayesian computation in any model.

51 Black box variational inference
[Figure: the probabilistic pipeline: knowledge & question, data → make assumptions → discover patterns → predict & explore.]
- Approximate inference can be difficult to derive.
- Especially true for models that are not conditionally conjugate
- E.g., discrete choice models, Bayesian generalized linear models, ...

52 Black box variational inference
[Figure: massive data + any model → black box variational inference → p(β, z | x), with reusable variational families.]
- Easily use variational inference with any model
- No exponential family requirements
- No mathematical work beyond specifying the model

53 Black box variational inference
[Figure: as above.]
- Sample from q(·)
- Form noisy gradients without model-specific computation
- Use stochastic optimization

54 Black box variational inference
[Figure: the model and the variational family linked through the ELBO.]
ELBO: L(ν) = E_q[log p(β, Z, x) − log q(β, Z)], writing ν for the variational parameters.
Shorthand: L(ν) = E_q[D(β, Z)], where D(β, Z) := log p(β, Z, x) − log q(β, Z).

55 Black box variational inference
[Figure: as above.]
A noisy gradient:

    ∇_ν L ≈ (1/B) Σ_{b=1}^B ∇_ν log q(β_b, z_b) D(β_b, z_b),

where (β_b, z_b) ~ q(β, z).

56 The noisy gradient

    ∇_ν L(ν) ≈ (1/B) Σ_{b=1}^B ∇_ν log q(β_b, z_b) D(β_b, z_b)

We use these gradients in a stochastic optimization algorithm. Requirements:
- Sampling from q(β, z)
- Evaluating ∇_ν log q(β, z)
- Evaluating log p(β, z, x)
A "black box": we can reuse q(·) across models
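A minimal sketch of this score-function estimator, fitting a mean-field Gaussian q to any supplied log joint; the toy target below, the step size, and the parameterization are illustrative assumptions, not the talk's code.

    import numpy as np

    def bbvi_gaussian(log_p, dim, n_steps=5000, B=64, lr=0.05, seed=0):
        """Score-function BBVI with q(z) = N(mu, exp(2 * log_s)), mean-field."""
        rng = np.random.default_rng(seed)
        mu, log_s = np.zeros(dim), np.zeros(dim)
        for _ in range(n_steps):
            s = np.exp(log_s)
            z = mu + s * rng.standard_normal((B, dim))       # z_b ~ q
            log_q = -0.5 * np.sum(((z - mu) / s) ** 2
                                  + 2 * log_s + np.log(2 * np.pi), axis=1)
            D = np.array([log_p(zb) for zb in z]) - log_q    # D(z_b)
            g_mu = (z - mu) / s ** 2                         # grad_mu log q
            g_ls = ((z - mu) / s) ** 2 - 1.0                 # grad_log_s log q
            mu += lr * np.mean(g_mu * D[:, None], axis=0)    # noisy gradient steps
            log_s += lr * np.mean(g_ls * D[:, None], axis=0)
        return mu, np.exp(log_s)

    # Toy target: log of an unnormalized N(3, 0.5^2) density.
    mu_hat, s_hat = bbvi_gaussian(lambda z: -0.5 * np.sum(((z - 3.0) / 0.5) ** 2), dim=1)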

57 The noisy gradient

    ∇_ν L(ν) ≈ (1/B) Σ_{b=1}^B ∇_ν log q(β_b, z_b) D(β_b, z_b)

Making it work:
- Rao-Blackwellization for each component of the gradient
- Control variates, again using ∇ log q(β, z)
- AdaGrad, for setting learning rates [Duchi, Hazan, Singer, 2011]
- Stochastic variational inference, for handling massive data
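To illustrate the control-variate item (my own sketch, not the talk's code): since E_q[∇ log q] = 0, the score function itself is a valid control variate, and its optimal scaling can be estimated from the same samples.

    import numpy as np

    def cv_gradient(score, D):
        """Variance-reduced BBVI gradient; score has shape (B, dim), D has shape (B,)."""
        f = score * D[:, None]                  # raw estimator terms
        h = score                               # control variate, with E_q[h] = 0
        # Optimal per-coordinate scaling a* = Cov(f, h) / Var(h).
        a = (np.mean(f * h, axis=0) - np.mean(f, axis=0) * np.mean(h, axis=0)) \
            / (np.var(h, axis=0) + 1e-12)
        return np.mean(f - a * h, axis=0)       # unbiased, lower-variance gradient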

58 Monte Carlo gradients of the ELBO
[Figure: massive data + any model → black box variational inference → p(β, z | x), with reusable variational families.]
- MC gradient [Ji et al., 2010; Nott et al., 2012; Paisley et al., 2012; Ranganath et al., 2014]
- Autoencoders [Kingma and Welling, 2013/2014]
- Neural networks [Kingma et al., 2014; Mnih and Gregor, 2014; Rezende et al., 2014]
- A perspective from regression [Salimans and Knowles, 2013]
- Doubly stochastic VB [Titsias and Lazaro-Gredilla, 2014]
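One of these alternatives, the reparameterization gradient used by the autoencoder line of work, in a minimal sketch (illustrative, assuming a Gaussian q): write z = μ + σ·ε with ε ~ N(0, I) and differentiate through the sample path.

    import numpy as np

    rng = np.random.default_rng(0)

    def reparam_gradient(grad_log_p, mu, log_s, B=64):
        """Reparameterization estimate of grad E_q[log p(z)] for q = N(mu, exp(2*log_s))."""
        s = np.exp(log_s)
        eps = rng.standard_normal((B, mu.size))
        z = mu + s * eps                          # differentiable sample path
        g = np.array([grad_log_p(zb) for zb in z])
        g_mu = np.mean(g, axis=0)                 # chain rule: dz/dmu = 1
        g_ls = np.mean(g * eps * s, axis=0)       # chain rule: dz/dlog_s = s * eps
        return g_mu, g_ls

For a Gaussian q, the entropy term of the ELBO has a closed-form gradient, so this estimator covers the remaining expectation; it is typically much lower variance than the score-function estimator when ∇ log p is available.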

59 Neuroscience analysis of millions of fMRI measurements [Manning et al., PLOS ONE 2014]

60 Deep Exponential Families [Ranganath et al., AISTATS 2015]; Probabilistic Programming [Kucukelbir et al., 2015]
[Figure: the layered graphical model of a deep exponential family, with per-layer latent variables z_{n,ℓ,k}, weights w_{ℓ,k} down to w_{0,i}, and observations x_{n,i}; plates over K_ℓ, V, and N.]

61 Edward: A library for probabilistic modeling, inference, and criticism github.com/blei-lab/edward (led by Dustin Tran)

62 The probabilistic pipeline
[Figure: the probabilistic pipeline: knowledge & question, data → make assumptions → discover patterns → predict & explore.]
- Customized data analysis is important to many fields.
- Pipeline separates assumptions, computation, application
- Eases collaborative solutions to statistics problems

63 The probabilistic pipeline
[Figure: the probabilistic pipeline, as above.]
- Inference is the key algorithmic problem.
- Answers the question: What does this model say about this data?
- Our goal: general and scalable approaches to inference

64 Stochastic variational inference [Hoffman et al., 2013, JMLR]
[Figure: MASSIVE DATA ↔ GLOBAL HIDDEN STRUCTURE.]
Subsample data → Infer local structure → Update global structure

65 Black box variational inference [Ranganath et al., 2014, AISTATS]
[Figure: massive data + any model → black box variational inference → p(β, z | x), with reusable variational families.]

66 Recent research (on the arXiv):
- Variational Inference: A Review for Statisticians (with J. McAuliffe and A. Kucukelbir)
- Hierarchical Variational Models (with R. Ranganath and D. Tran)
- Automatic Differentiation Variational Inference (with A. Kucukelbir, D. Tran, R. Ranganath, and A. Gelman)
Open and current research:
- How can we improve the gradient estimator in BBVI?
- Can we use alternative divergence measures? What are their properties?
- What are the statistical properties of VI? What is a good framework?
- How can we efficiently go beyond the mean field?

67 [Figure: the probabilistic pipeline with a loop: knowledge & question, data → make assumptions → discover patterns → predict & explore → criticize model → revise.]

68 Should I be skeptical about variational inference?
[Figure: average log predictive vs. seconds for ADVI (M=1), ADVI (M=10), NUTS, and HMC on (a) linear regression with ARD and (b) hierarchical logistic regression. Hierarchical generalized linear models.]
- MCMC enjoys theoretical guarantees. But they usually get to the same place. [Kucukelbir et al., submitted]
- We need more theory about variational inference.

69 Should I be skeptical about variational inference?
[Figure: kernel density estimates of posterior marginals for tau and beta, comparing the truth, NUTS, ADVI-MF, and ADVI-FR.]
- Variational inference underestimates the variance of the posterior.
- Relaxing the mean-field assumption can help. Here: a Poisson GLM [Giordano et al., 2015]


More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Document and Topic Models: plsa and LDA

Document and Topic Models: plsa and LDA Document and Topic Models: plsa and LDA Andrew Levandoski and Jonathan Lobo CS 3750 Advanced Topics in Machine Learning 2 October 2018 Outline Topic Models plsa LSA Model Fitting via EM phits: link analysis

More information

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) Latent Dirichlet Allocation (LDA) A review of topic modeling and customer interactions application 3/11/2015 1 Agenda Agenda Items 1 What is topic modeling? Intro Text Mining & Pre-Processing Natural Language

More information

Restricted Boltzmann Machines for Collaborative Filtering

Restricted Boltzmann Machines for Collaborative Filtering Restricted Boltzmann Machines for Collaborative Filtering Authors: Ruslan Salakhutdinov Andriy Mnih Geoffrey Hinton Benjamin Schwehn Presentation by: Ioan Stanculescu 1 Overview The Netflix prize problem

More information

Topic Models and Applications to Short Documents

Topic Models and Applications to Short Documents Topic Models and Applications to Short Documents Dieu-Thu Le Email: dieuthu.le@unitn.it Trento University April 6, 2011 1 / 43 Outline Introduction Latent Dirichlet Allocation Gibbs Sampling Short Text

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Introduction to Bayesian inference

Introduction to Bayesian inference Introduction to Bayesian inference Thomas Alexander Brouwer University of Cambridge tab43@cam.ac.uk 17 November 2015 Probabilistic models Describe how data was generated using probability distributions

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

Black-box α-divergence Minimization

Black-box α-divergence Minimization Black-box α-divergence Minimization José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, Thang Bui, Richard Turner, Harvard University, University of Cambridge, Universidad Autónoma de Madrid.

More information

Deep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016

Deep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016 Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016 What is Variational Inference? What is Variational Inference? Want to estimate some distribution, p*(x) p*(x) What is

More information

Computational Biology Course Descriptions 12-14

Computational Biology Course Descriptions 12-14 Computational Biology Course Descriptions 12-14 Course Number and Title INTRODUCTORY COURSES BIO 311C: Introductory Biology I BIO 311D: Introductory Biology II BIO 325: Genetics CH 301: Principles of Chemistry

More information

Probabilistic Graphical Models for Image Analysis - Lecture 4

Probabilistic Graphical Models for Image Analysis - Lecture 4 Probabilistic Graphical Models for Image Analysis - Lecture 4 Stefan Bauer 12 October 2018 Max Planck ETH Center for Learning Systems Overview 1. Repetition 2. α-divergence 3. Variational Inference 4.

More information

Dynamic Probabilistic Models for Latent Feature Propagation in Social Networks

Dynamic Probabilistic Models for Latent Feature Propagation in Social Networks Dynamic Probabilistic Models for Latent Feature Propagation in Social Networks Creighton Heaukulani and Zoubin Ghahramani University of Cambridge TU Denmark, June 2013 1 A Network Dynamic network data

More information

Advances in Variational Inference

Advances in Variational Inference 1 Advances in Variational Inference Cheng Zhang Judith Bütepage Hedvig Kjellström Stephan Mandt arxiv:1711.05597v1 [cs.lg] 15 Nov 2017 Abstract Many modern unsupervised or semi-supervised machine learning

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

XVII. Science and Technology/Engineering, Grade 8

XVII. Science and Technology/Engineering, Grade 8 XVII. Science and Technology/Engineering, Grade 8 Grade 8 Science and Technology/Engineering Test The spring 2017 grade 8 Science and Technology/Engineering test was based on learning standards in the

More information

Distributed ML for DOSNs: giving power back to users

Distributed ML for DOSNs: giving power back to users Distributed ML for DOSNs: giving power back to users Amira Soliman KTH isocial Marie Curie Initial Training Networks Part1 Agenda DOSNs and Machine Learning DIVa: Decentralized Identity Validation for

More information

Generative Models for Discrete Data

Generative Models for Discrete Data Generative Models for Discrete Data ddebarr@uw.edu 2016-04-21 Agenda Bayesian Concept Learning Beta-Binomial Model Dirichlet-Multinomial Model Naïve Bayes Classifiers Bayesian Concept Learning Numbers

More information

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017 COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University TOPIC MODELING MODELS FOR TEXT DATA

More information

Content-based Recommendation

Content-based Recommendation Content-based Recommendation Suthee Chaidaroon June 13, 2016 Contents 1 Introduction 1 1.1 Matrix Factorization......................... 2 2 slda 2 2.1 Model................................. 3 3 flda 3

More information

Variational Inference with Copula Augmentation

Variational Inference with Copula Augmentation Variational Inference with Copula Augmentation Dustin Tran 1 David M. Blei 2 Edoardo M. Airoldi 1 1 Department of Statistics, Harvard University 2 Department of Statistics & Computer Science, Columbia

More information

Variational Autoencoders

Variational Autoencoders Variational Autoencoders Recap: Story so far A classification MLP actually comprises two components A feature extraction network that converts the inputs into linearly separable features Or nearly linearly

More information

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes Yee Whye Teh (1), Michael I. Jordan (1,2), Matthew J. Beal (3) and David M. Blei (1) (1) Computer Science Div., (2) Dept. of Statistics

More information

Genomes and Their Evolution

Genomes and Their Evolution Chapter 21 Genomes and Their Evolution PowerPoint Lecture Presentations for Biology Eighth Edition Neil Campbell and Jane Reece Lectures by Chris Romero, updated by Erin Barley with contributions from

More information

Proximity Variational Inference

Proximity Variational Inference Jaan Altosaar Rajesh Ranganath David M. Blei altosaar@princeton.edu rajeshr@cims.nyu.edu david.blei@columbia.edu Princeton University New York University Columbia University Abstract Variational inference

More information

Statistical Models. David M. Blei Columbia University. October 14, 2014

Statistical Models. David M. Blei Columbia University. October 14, 2014 Statistical Models David M. Blei Columbia University October 14, 2014 We have discussed graphical models. Graphical models are a formalism for representing families of probability distributions. They are

More information

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning Professor Erik Sudderth Brown University Computer Science October 4, 2016 Some figures and materials courtesy

More information

Revision Based on Chapter 19 Grade 11

Revision Based on Chapter 19 Grade 11 Revision Based on Chapter 19 Grade 11 Biology Multiple Choice Identify the choice that best completes the statement or answers the question. 1. Most fossils are found in rusty water. volcanic rock. sedimentary

More information

CS Lecture 18. Topic Models and LDA

CS Lecture 18. Topic Models and LDA CS 6347 Lecture 18 Topic Models and LDA (some slides by David Blei) Generative vs. Discriminative Models Recall that, in Bayesian networks, there could be many different, but equivalent models of the same

More information

Bayesian Paragraph Vectors

Bayesian Paragraph Vectors Bayesian Paragraph Vectors Geng Ji 1, Robert Bamler 2, Erik B. Sudderth 1, and Stephan Mandt 2 1 Department of Computer Science, UC Irvine, {gji1, sudderth}@uci.edu 2 Disney Research, firstname.lastname@disneyresearch.com

More information

Distance dependent Chinese restaurant processes

Distance dependent Chinese restaurant processes David M. Blei Department of Computer Science, Princeton University 35 Olden St., Princeton, NJ 08540 Peter Frazier Department of Operations Research and Information Engineering, Cornell University 232

More information

Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs

Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs Lawrence Livermore National Laboratory Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs Keith Henderson and Tina Eliassi-Rad keith@llnl.gov and eliassi@llnl.gov This work was performed

More information