SCALING AND GENERALIZING APPROXIMATE BAYESIAN INFERENCE. David M. Blei Columbia University
1 SCALING AND GENERALIZING APPROXIMATE BAYESIAN INFERENCE David M. Blei Columbia University
2 This talk is about how to discover hidden patterns in large high-dimensional data sets.
3 Communities discovered in a multimillion-node network of U.S. patents [Gopalan and Blei, PNAS 2013]
4 Topics found in a corpus of 1.8 million articles from the New York Times [Hoffman, Blei, Wang, Paisley, JMLR 2013]. (Figure: top words of example topics, e.g. Game/Season/Team/Coach, Bush/Campaign/Clinton/Republican, Film/Movie/Show/Television, Art/Museum/Show/Gallery; modified from Hoffman et al. 2013.)
5 Population analysis of billions of genetic measurements [Gopalan, Hao, Blei, Storey, to appear]. (Figure: admixture proportions across world populations, e.g. Japanese, Kalash, Melanesian, Sardinian, Yoruba.)
6 Neuroscience analysis of millions of fMRI measurements [Manning et al., PLOS ONE 2014]
7 Analysis of millions of taxi trajectories, in Stan [Kucukelbir et al., 2015]
8 The probabilistic pipeline: knowledge & question + data → make assumptions → discover patterns → predict & explore.
- Customized data analysis is important to many fields.
- The pipeline separates assumptions, computation, and application.
- It eases collaborative solutions to statistics problems.
9 The probabilistic pipeline: knowledge & question + data → make assumptions → discover patterns → predict & explore.
- Inference is the key algorithmic problem.
- It answers the question: what does this model say about this data?
- Our goal: general and scalable approaches to inference.
10 The probabilistic pipeline: knowledge & question + data → make assumptions → discover patterns → predict & explore.
- Variational methods: inference as optimization [Jordan et al., 1999]
- Scale up with stochastic variational inference (SVI) [Hoffman et al., 2013]
- Generalize with black box variational inference (BBVI) [Ranganath et al., 2014]
11 The probabilistic pipeline: knowledge & question + data → make assumptions → discover patterns → predict & explore.
- Both approaches use stochastic optimization.
- SVI subsamples from a massive data set.
- BBVI uses Monte Carlo to approximate difficult-to-compute expectations.
12 STOCHASTIC VARIATIONAL INFERENCE (with Matt Hoffman, Chong Wang, John Paisley)
13 Stochastic variational inference is an algorithm that scales general Bayesian computation to massive data.
14 Motivation: Topic Modeling. Discover the thematic structure in a large collection of documents. Annotate the documents. Use the annotations to visualize, organize, summarize,...
15 Example: Latent Dirichlet allocation (LDA) Documents exhibit multiple topics.
16 Example: Latent Dirichlet allocation (LDA). (Figure: topics as word distributions, e.g. gene/dna/genetic, life/evolve/organism, brain/neuron/nerve, data/number/computer, beside a document with per-word topic assignments and its topic proportions.)
- Each topic is a distribution over words.
- Each document is a mixture of corpus-wide topics.
- Each word is drawn from one of those topics.
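The three bullets above can be read as a generative program. A minimal numpy sketch (not from the talk); the sizes and Dirichlet hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D, N = 3, 50, 10, 40   # topics, vocabulary size, documents, words per doc
alpha, eta = 0.5, 0.1        # Dirichlet hyperparameters (illustrative)

# Each topic is a distribution over words.
topics = rng.dirichlet(np.full(V, eta), size=K)          # shape (K, V)

docs = []
for _ in range(D):
    # Each document is a mixture of corpus-wide topics.
    theta = rng.dirichlet(np.full(K, alpha))
    # Each word is drawn from one of those topics.
    z = rng.choice(K, size=N, p=theta)                   # topic assignments
    w = np.array([rng.choice(V, p=topics[k]) for k in z])
    docs.append(w)
```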
17 Example: Latent Dirichlet allocation (LDA). But we only observe the documents; the other structure (topics, proportions, assignments) is hidden. We compute the posterior
p(topics, proportions, assignments | documents).
18 Example: Latent Dirichlet allocation (LDA). Many data sets contain millions of documents, which requires inference over billions of variables. SVI can scale to these data.
19 LDA as a graphical model. Nodes: proportions parameter α, per-document topic proportions θ_d, per-word topic assignment z_{d,n}, observed word w_{d,n}, topics β_k with topic parameter η; plates over the N words, D documents, and K topics. The graphical model:
- Encodes assumptions about data with a factorization of the joint
- Connects assumptions to algorithms for computing with data
- Defines the posterior (through the joint)
20 Posterior inference. The posterior of the latent variables given the documents is
p(β, θ, z | w) = p(β, θ, z, w) / ∫_β ∫_θ Σ_z p(β, θ, z, w).
We can't compute the denominator, the marginal p(w). We use approximate inference.
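To see why the denominator is the hard part, consider a tiny mixture with fixed parameters, where the sum over assignments can still be enumerated. This is an illustrative sketch, not LDA itself: here the z_i happen to be independent, so the 2^5 = 32-term sum collapses to a product, but with shared latent variables (as in LDA) no such factorization exists and the number of terms grows as K^n.

```python
import itertools
import numpy as np
from scipy.stats import norm

K, n = 2, 5
pi = np.array([0.4, 0.6])                  # mixture proportions (fixed)
mu = np.array([-2.0, 2.0])                 # component means (fixed)
x = np.array([-1.9, -2.2, 1.8, 2.1, 0.3])  # toy observations

# Brute force: enumerate all K**n assignment vectors z.
brute = 0.0
for z in itertools.product(range(K), repeat=n):
    z = np.array(z)
    brute += np.prod(pi[z] * norm.pdf(x, loc=mu[z], scale=1.0))

# Factorized form, valid only because the z_i are independent here:
factored = np.prod([(pi * norm.pdf(xi, loc=mu, scale=1.0)).sum() for xi in x])
```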
21 Classical variational inference. Originally, we used variational methods to fit this model [Blei et al., 2003]. Classical variational inference is inefficient:
- Do some local computation for each data point.
- Aggregate these computations to re-estimate the global structure.
- Repeat.
This cannot handle massive data.
22 Stochastic variational inference. (Diagram: MASSIVE DATA ⇄ GLOBAL HIDDEN STRUCTURE.) Subsample data → infer local structure → update global structure.
23 Stochastic variational inference. (Figure: held-out perplexity vs. documents seen for online LDA on two corpus sizes versus batch LDA; the online algorithm converges faster and to a better perplexity. Table: the top eight words of one topic as it is refined over the run.) [Hoffman et al., 2010]
24 Topics using the HDP, found in 1.8M articles from the New York Times. (Figure: top words of example topics, e.g. Game/Season/Team/Coach, Bush/Campaign/Clinton/Republican, Film/Movie/Show/Television, Art/Museum/Show/Gallery; modified from Hoffman et al. 2013.)
25 Stochastic variational inference. Subsample data → infer local structure → update global structure. Next parts of this talk:
1. Define a generic class of models
2. Derive classical mean-field variational inference
3. Use the classical algorithm to derive stochastic variational inference
26 A generic class of models. Global variables β; local variables z_i, x_i, i = 1, …, n.
p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)
- The observations are x = x_{1:n}.
- The local variables are z = z_{1:n}.
- The global variables are β.
- The ith data point x_i only depends on z_i and β.
Goal: compute p(β, z | x).
27 A generic class of models. Global variables β; local variables z_i, x_i.
p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)
A complete conditional is the conditional of a latent variable given the observations and the other latent variables. Assume each complete conditional is in the exponential family:
p(z_i | β, x_i) = h(z_i) exp{η_ℓ(β, x_i)^⊤ z_i − a(η_ℓ(β, x_i))}
p(β | z, x) = h(β) exp{η_g(z, x)^⊤ β − a(η_g(z, x))}
28 A generic class of models. Global variables β; local variables z_i, x_i.
p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)
The global parameter comes from conjugacy [Bernardo and Smith, 1994]:
η_g(z, x) = α + Σ_{i=1}^n t(z_i, x_i),
where α is a hyperparameter and t(·) are sufficient statistics of [z_i, x_i].
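A minimal worked instance of the conjugacy identity (illustrative, with observed data standing in for [z_i, x_i]): in a beta-Bernoulli model the sufficient statistics are t(x_i) = (x_i, 1 − x_i), and the posterior's parameters are the prior hyperparameters plus their sum.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])     # Bernoulli observations

a0, b0 = 2.0, 2.0                          # Beta(a0, b0) prior hyperparameters
t = np.stack([x, 1 - x], axis=1)           # sufficient statistics t(x_i)
a_post, b_post = np.array([a0, b0]) + t.sum(axis=0)   # "alpha + sum of t"

posterior_mean = a_post / (a_post + b_post)
```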
29 A generic class of models. Example: a Dirichlet process mixture with a conjugate base measure [Sethuraman, 1994].
- Local: mixture assignments for each observation (categorical).
- Global: stick lengths (beta) and mixture components (e.g., Gaussian).
Other BNP models fit too, e.g., the beta-Bernoulli process and the hierarchical Dirichlet process.
30 A generic class of models. p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)
- Bayesian mixture models
- Time series models (variants of HMMs, Kalman filters)
- Factorial models
- Matrix factorization (e.g., factor analysis, PCA, CCA)
- Dirichlet process mixtures, HDPs
- Multilevel regression (linear, probit, Poisson)
- Stochastic blockmodels
- Mixed-membership models (LDA and some variants)
31 (Figure: variational inference fitting a mixture of Gaussians, shown at initialization and at several iterations; the evidence lower bound and the average log predictive increase over the iterations. Image by Alp Kucukelbir.)
- Variational methods turn inference into optimization.
- Idea: fit a simple distribution to be close (in KL) to the exact posterior.
- Here: a simple mixture of Gaussians.
32 Mean-field variational inference. Goal: minimize the KL divergence between a family q and the posterior p. Mean-field assumption: set q(β, z) to be fully factored,
q(β, z) = q(β | λ) ∏_{i=1}^n q(z_i | φ_i).
Each factor is in the same family as the model's complete conditional:
p(β | z, x) = h(β) exp{η_g(z, x)^⊤ β − a(η_g(z, x))}
q(β | λ) = h(β) exp{λ^⊤ β − a(λ)}
33 Mean-field variational inference. Optimize the evidence lower bound, equivalent to optimizing the negative KL divergence:
L(λ, φ_{1:n}) = E_q[log p(β, Z, x)] − E_q[log q(β, Z)].
Traditional VI uses coordinate ascent [Ghahramani and Beal, 2001]:
λ* = E_φ[η_g(Z, x)];  φ_i* = E_λ[η_ℓ(β, x_i)].
Iteratively update each parameter, holding the others fixed. Notice the relationship to Gibbs sampling.
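A sketch of these coordinate updates for a toy model: a mixture of two unit-variance Gaussians with prior μ_k ~ N(0, σ₀²), as in standard CAVI expositions. Not from the talk; all constants (data, prior variance, initialization) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = np.array([-4.0, 4.0])
x = np.concatenate([rng.normal(mu, 1.0, 100) for mu in true_mu])
n, K, sigma0_sq = len(x), 2, 10.0

m = np.array([-1.0, 1.0])     # variational means (deliberately asymmetric init)
s2 = np.ones(K)               # variational variances

for _ in range(50):
    # Local coordinate update: cluster responsibilities phi_ik.
    logits = np.outer(x, m) - 0.5 * (s2 + m**2)
    logits -= logits.max(axis=1, keepdims=True)
    phi = np.exp(logits)
    phi /= phi.sum(axis=1, keepdims=True)
    # Global coordinate update: q(mu_k) = N(m_k, s2_k).
    Nk = phi.sum(axis=0)
    m = (phi * x[:, None]).sum(axis=0) / (1.0 / sigma0_sq + Nk)
    s2 = 1.0 / (1.0 / sigma0_sq + Nk)
```

Each sweep does exactly the two steps above: update every local factor q(z_i | φ_i), then the global factor q(μ | m, s²), holding the rest fixed.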
34 Mean-field variational inference for LDA. The local variables are the per-document variables θ_d and z_d. The global variables are the topics β_1, …, β_K. The variational distribution is
q(β, θ, z) = ∏_{k=1}^K q(β_k | λ_k) ∏_{d=1}^D q(θ_d | γ_d) ∏_{n=1}^N q(z_{d,n} | φ_{d,n}).
35 Mean-field variational inference for LDA. (Figure: inferred topic probabilities for an example document.)
36 Mean-field variational inference for LDA. (Figure: top words of four inferred topics.)
Genetics: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
Evolution: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
Disease: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
Computers: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations
37 Classical variational inference.
Input: data x, model p(β, z, x).
Initialize λ randomly.
repeat
  for each data point i do
    Set the local parameter φ_i = E_λ[η_ℓ(β, x_i)].
  end
  Set the global parameter λ = α + Σ_{i=1}^n E_{φ_i}[t(Z_i, x_i)].
until the ELBO has converged
This is inefficient: we analyze all the data before completing one iteration.
38 Stochastic optimization.
- Replace the gradient with cheaper noisy estimates [Robbins and Monro, 1951].
- Guaranteed to converge to a local optimum [Bottou, 1998].
- Has enabled modern machine learning.
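A minimal Robbins-Monro sketch (illustrative, not from the talk): minimize f(θ) = E[(θ − X)²]/2 from one noisy sample per step, with step sizes ρ_t = t^(−0.7), which satisfy the classical conditions Σ ρ_t = ∞ and Σ ρ_t² < ∞.

```python
import numpy as np

rng = np.random.default_rng(0)
target = 3.0        # the minimizer of f is E[X] = 3
theta = 0.0
for t in range(1, 5001):
    x = rng.normal(target, 1.0)      # one noisy observation per step
    grad = theta - x                 # unbiased estimate of f'(theta)
    rho = t ** -0.7                  # Robbins-Monro step size
    theta -= rho * grad
```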
39 Natural gradients. (Screenshot: Amari, "Natural Gradient Works Efficiently in Learning": when a parameter space has an underlying structure, the ordinary gradient of a function does not represent its steepest direction, but the natural gradient does.)
The natural gradient of the ELBO [Amari, 1998; Sato, 2001]:
∇̂_λ L = α + Σ_{i=1}^n E_{φ_i*}[t(Z_i, x_i)] − λ.
Computationally: compute the coordinate updates, then subtract the current variational parameters.
40 Stochastic variational inference. Subsample data → infer local structure → update global structure.
Construct a noisy natural gradient:
- Sample a data point j ~ Uniform(1, …, n).
- Calculate ∇̃_λ L = α + n E_{φ_j}[t(Z_j, x_j)] − λ.
This is a good noisy gradient:
- Its expectation (with respect to j) is the natural gradient.
- It only requires the local parameters for one data point.
41 Stochastic variational inference.
Input: data x, model p(β, z, x).
Initialize λ randomly. Set the step sizes ρ_t appropriately.
repeat
  Sample j ~ Unif(1, …, n).
  Set the local parameter φ = E_λ[η_ℓ(β, x_j)].
  Set the intermediate global parameter λ̂ = α + n E_φ[t(Z_j, x_j)].
  Set the global parameter λ = (1 − ρ_t) λ + ρ_t λ̂.
until forever
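The algorithm box above, sketched on the same kind of toy model used in CAVI expositions (a mixture of two unit-variance Gaussians with prior μ_k ~ N(0, σ₀²)), working directly in the natural parameters of q(μ_k). Not from the talk; the constants and the deliberately asymmetric initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu = np.array([-4.0, 4.0])
x = rng.permutation(np.concatenate([rng.normal(mu, 1.0, 500) for mu in true_mu]))
n, K, sigma0_sq = len(x), 2, 10.0

# Natural parameters of q(mu_k) = N(m_k, s2_k): (m_k/s2_k, -1/(2*s2_k)).
lam1 = np.array([-1.0, 1.0])          # deliberately asymmetric init
lam2 = np.full(K, -0.5)
prior1, prior2 = 0.0, -0.5 / sigma0_sq

for t in range(1, 3001):
    m, s2 = -0.5 * lam1 / lam2, -0.5 / lam2
    j = rng.integers(n)               # sample one data point
    # Local step: responsibilities for x_j under the current q.
    logits = x[j] * m - 0.5 * (s2 + m**2)
    phi = np.exp(logits - logits.max())
    phi /= phi.sum()
    # Intermediate global parameters: as if x_j were replicated n times.
    hat1 = prior1 + n * phi * x[j]
    hat2 = prior2 - 0.5 * n * phi
    rho = (t + 10.0) ** -0.7          # step size
    lam1 = (1 - rho) * lam1 + rho * hat1
    lam2 = (1 - rho) * lam2 + rho * hat2

m = -0.5 * lam1 / lam2                # recovered variational means
```

Each iteration touches a single data point, yet the weighted average of intermediate and current parameters drives q toward the same solution the full coordinate-ascent sweep would find.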
42 Stochastic variational inference. (Diagram: MASSIVE DATA ⇄ GLOBAL HIDDEN STRUCTURE.) Subsample data → infer local structure → update global structure.
43 Stochastic variational inference in LDA.
1. Sample a document.
2. Estimate the local variational parameters using the current topics.
3. Form intermediate topics from those local parameters.
4. Update the topics as a weighted average of the intermediate and current topics.
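A compact sketch of these four steps in the style of online LDA [Hoffman et al., 2010]. Simplified (a fixed number of local iterations, synthetic bag-of-words counts) and with illustrative sizes and hyperparameters throughout.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
D, V, K, alpha, eta = 200, 30, 2, 0.1, 0.1
# Synthetic corpus: each document is a bag-of-words count vector.
docs = rng.multinomial(50, np.ones(V) / V, size=D)

lam = rng.gamma(100.0, 0.01, size=(K, V))       # variational topic parameters

for t in range(1, 501):
    d = rng.integers(D)                         # 1. sample a document
    Elog_beta = digamma(lam) - digamma(lam.sum(1, keepdims=True))
    gamma = np.ones(K)
    for _ in range(20):                         # 2. local parameters
        Elog_theta = digamma(gamma) - digamma(gamma.sum())
        phi = np.exp(Elog_theta[:, None] + Elog_beta)   # shape (K, V)
        phi /= phi.sum(0, keepdims=True)
        gamma = alpha + (phi * docs[d]).sum(1)
    lam_hat = eta + D * phi * docs[d]           # 3. intermediate topics
    rho = (t + 10.0) ** -0.7                    # 4. weighted average
    lam = (1 - rho) * lam + rho * lam_hat
```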
44 Stochastic variational inference in LDA. (Figure: held-out perplexity vs. documents seen for online LDA on two corpus sizes versus batch LDA; the online algorithm converges faster and to a better perplexity. Table: the top eight words of one topic as it is refined over the run.) [Hoffman et al., 2010]
45 HDP topic models. (Figure: log predictive probability vs. hours on the Nature, New York Times, and Wikipedia corpora, comparing stochastic and batch HDP inference.)
- A study of large corpora with the HDP topic model [Teh et al., 2006].
- Details are in Hoffman et al., 2013.
- SVI is faster and lets us analyze more data.
46 Topics using the HDP, found in 1.8M articles from the New York Times. (Figure: top words of example topics, e.g. Game/Season/Team/Coach, Bush/Campaign/Clinton/Republican, Film/Movie/Show/Television, Art/Museum/Show/Gallery; modified from Hoffman et al. 2013.)
47 Stochastic variational inference. Subsample data → infer local structure → update global structure. We derived an algorithm for scalable variational inference on:
- Bayesian mixture models
- Time series models (variants of HMMs, Kalman filters)
- Factorial models
- Matrix factorization (e.g., factor analysis, PCA, CCA)
- Dirichlet process mixtures, HDPs
- Multilevel regression (linear, probit, Poisson)
- Stochastic blockmodels
- Mixed-membership models (LDA and some variants)
48 (Figures: topics found in a large scientific corpus, e.g. "t cells / antigens / immune response", "climate / ocean / ice / climate change", "stars / astronomers / universe / galaxies"; and admixture proportions for world populations.)
49 BLACK BOX VARIATIONAL INFERENCE (with Rajesh Ranganath and Sean Gerrish)
50 Black box variational inference is an algorithm that efficiently performs Bayesian computation in any model.
51 Black box variational inference.
- Approximate inference can be difficult to derive.
- This is especially true for models that are not conditionally conjugate.
- E.g., discrete choice models, Bayesian generalized linear models, ...
52 Black box variational inference. (Diagram: ANY MODEL + MASSIVE DATA → BLACK BOX VARIATIONAL INFERENCE → p(β, z | x), with reusable variational families.)
- Easily use variational inference with any model.
- No exponential family requirements.
- No mathematical work beyond specifying the model.
53 Black box variational inference. (Diagram: ANY MODEL + MASSIVE DATA → BLACK BOX VARIATIONAL INFERENCE → p(β, z | x), with reusable variational families.)
- Sample from q(·).
- Form noisy gradients without model-specific computation.
- Use stochastic optimization.
54 Black box variational inference. The ELBO:
L(λ) = E_q[log p(β, Z, x) − log q(β, Z)].
Shorthand: L(λ) = E_q[D(β, Z)], where D(β, Z) = log p(β, Z, x) − log q(β, Z).
55 Black box variational inference. A noisy gradient:
∇_λ L ≈ (1/B) Σ_{b=1}^B ∇_λ log q(β_b, z_b) D(β_b, z_b),
where (β_b, z_b) ~ q(β, z).
56 The noisy gradient
∇_λ L ≈ (1/B) Σ_{b=1}^B ∇_λ log q(β_b, z_b) D(β_b, z_b).
We use these gradients in a stochastic optimization algorithm. Requirements:
- Sampling from q(β, z)
- Evaluating ∇_λ log q(β, z)
- Evaluating log p(β, z, x)
A "black box": we can reuse q(·) across models.
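The three requirements are all a sketch needs. A toy instance (not from the talk): β ~ N(0, 1), x | β ~ N(β, 1), observed x = 2, so the exact posterior is N(1, 0.5) and we can sanity-check the fitted q(β) = N(m, e^{2u}). AdaGrad-style step scaling; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = 2.0

def log_p(beta):                       # requirement (iii): log joint, up to a constant
    return -0.5 * beta**2 - 0.5 * (x_obs - beta) ** 2

# q(beta) = N(m, exp(2u)); variational parameters (m, u).
m, u = 0.0, 0.0
G = np.zeros(2)                        # AdaGrad accumulators
for _ in range(3000):
    s = np.exp(u)
    beta = rng.normal(m, s, size=200)  # requirement (i): sample from q
    log_q = -0.5 * ((beta - m) / s) ** 2 - u - 0.5 * np.log(2 * np.pi)
    # requirement (ii): the score, grad of log q w.r.t. (m, u)
    score_m = (beta - m) / s**2
    score_u = ((beta - m) / s) ** 2 - 1.0
    f = log_p(beta) - log_q            # the black-box integrand D
    grad = np.array([np.mean(score_m * f), np.mean(score_u * f)])
    G += grad**2
    step = 0.1 * grad / (1e-8 + np.sqrt(G))
    m, u = m + step[0], u + step[1]
```

Nothing here used conjugacy or model-specific derivations; swapping in a different `log_p` changes the model without changing the inference code.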
57 The noisy gradient
∇_λ L ≈ (1/B) Σ_{b=1}^B ∇_λ log q(β_b, z_b) D(β_b, z_b).
Making it work:
- Rao-Blackwellization for each component of the gradient
- Control variates, again using ∇_λ log q(β, z)
- AdaGrad, for setting learning rates [Duchi, Hazan, Singer, 2011]
- Stochastic variational inference, for handling massive data
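A sketch of the control-variate idea: the score h = ∇_m log q has E_q[h] = 0, so h·f − a·h is unbiased for any a, and a = Cov(h·f, h)/Var(h) minimizes its variance. The model is a toy normal-normal example (illustrative, not from the talk), evaluated at a fixed q.

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs, m, s = 2.0, 0.0, 1.0

beta = rng.normal(m, s, size=100_000)
log_p = -0.5 * beta**2 - 0.5 * (x_obs - beta) ** 2
log_q = -0.5 * ((beta - m) / s) ** 2 - 0.5 * np.log(2 * np.pi)
h = (beta - m) / s**2                  # score function, E_q[h] = 0
f = log_p - log_q

g_plain = h * f                        # plain score-function estimator
a = np.cov(g_plain, h)[0, 1] / np.var(h)
g_cv = g_plain - a * h                 # control-variate estimator

# Both estimators target the same gradient; the second has lower variance.
```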
58 Monte Carlo gradients of the ELBO.
- MC gradients [Ji et al., 2010; Nott et al., 2012; Paisley et al., 2012; Ranganath et al., 2014]
- Autoencoders [Kingma and Welling, 2013/2014]
- Neural networks [Kingma et al., 2014; Mnih and Gregor, 2014; Rezende et al., 2014]
- A perspective from regression [Salimans and Knowles, 2013]
- Doubly stochastic VB [Titsias and Lazaro-Gredilla, 2014]
59 Neuroscience analysis of millions of fMRI measurements [Manning et al., PLOS ONE 2014]
60 (Figure: layered graphical model of a deep exponential family, with per-layer latent variables z_{n,ℓ,k} and weights w_{ℓ,k}.) Deep exponential families [Ranganath et al., AISTATS 2015]. Probabilistic programming [Kucukelbir et al., 2015].
61 Edward: a library for probabilistic modeling, inference, and criticism. github.com/blei-lab/edward (led by Dustin Tran)
62 The probabilistic pipeline: knowledge & question + data → make assumptions → discover patterns → predict & explore.
- Customized data analysis is important to many fields.
- The pipeline separates assumptions, computation, and application.
- It eases collaborative solutions to statistics problems.
63 The probabilistic pipeline (continued)
- Inference is the key algorithmic problem.
- It answers the question: what does this model say about this data?
- Our goal: general and scalable approaches to inference.
64 Massive data + global hidden structure: subsample data → infer local structure → update global structure
Stochastic variational inference [Hoffman et al., JMLR 2013]
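The subsample/infer/update loop can be sketched on a toy conjugate model. This is a simplified illustration of the idea, not the algorithm as implemented for topic models in the paper; the model and all parameter choices are invented for the example.

```python
import numpy as np

# Stochastic variational inference for a conjugate toy model:
#   theta ~ N(0, 1),  x_i | theta ~ N(theta, 1),  i = 1..N.
# q(theta) is Gaussian, tracked by its natural parameters
# lam = (mu / var, -1 / (2 var)). Each step: subsample one data
# point, form the "as if the data set were that point repeated
# N times" intermediate estimate, and take a decaying step
# toward it (a noisy natural-gradient update).

def svi(x, steps=2000, delay=1.0, kappa=0.7, seed=0):
    rng = np.random.default_rng(seed)
    N = len(x)
    prior = np.array([0.0, -0.5])          # natural params of N(0, 1)
    lam = prior.copy()
    for t in range(steps):
        i = rng.integers(N)                # subsample one data point
        suff = np.array([x[i], -0.5])      # its sufficient statistics
        lam_hat = prior + N * suff         # intermediate global estimate
        rho = (t + delay) ** -kappa        # Robbins-Monro step size
        lam = (1 - rho) * lam + rho * lam_hat
    var = -0.5 / lam[1]
    return lam[0] * var, var               # back to mean / variance

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=100)
mu, var = svi(x)
# Exact posterior: mean = sum(x) / (N + 1), variance = 1 / (N + 1)
```

Because the model is conjugate, the exact posterior is available in closed form, so the stochastic updates can be checked against it directly.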
65 Reusable variational families + massive data + any model = black box variational inference for p(β, z | x)
Black box variational inference [Ranganath et al., AISTATS 2014]
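The "black box" point — that the score-function gradient needs only evaluations of log p(x, z), never its gradient — can be shown with a minimal sketch. This is a bare-bones illustration on an invented toy model, not the Rao-Blackwellized estimator of the paper.

```python
import numpy as np

# Black box variational inference with the score-function
# (REINFORCE) gradient. The model enters only through log_joint
# *evaluations*, so the same loop works for any model.
# Toy model: z ~ N(0, 1), x | z ~ N(z, 1); q(z) = N(mu, sigma^2).

def log_joint(z, x):                        # the "black box"
    return -0.5 * z**2 - 0.5 * (x - z)**2

def bbvi(x, steps=3000, n_samples=200, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    mu, log_sigma = 0.0, 0.0
    for _ in range(steps):
        sigma = np.exp(log_sigma)
        z = rng.normal(mu, sigma, size=n_samples)
        log_q = -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma)
        f = log_joint(z, x) - log_q
        f = f - f.mean()                    # simple variance-reducing baseline
        score_mu = (z - mu) / sigma**2      # d log q / d mu
        score_ls = ((z - mu) / sigma) ** 2 - 1.0   # d log q / d log sigma
        mu += lr * np.mean(score_mu * f)
        log_sigma += lr * np.mean(score_ls * f)
    return mu, np.exp(log_sigma)

mu, sigma = bbvi(x=3.0)
# True posterior is N(1.5, 0.5), so mu and sigma should land near
# 1.5 and sqrt(0.5)
```

The price of generality is variance: this estimator needs far more samples per step than the reparameterization gradient on the same problem, which is why variance reduction (here, just a mean baseline) matters in practice.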
66 Recent research (on the arXiv):
- Variational Inference: A Review for Statisticians (with J. McAuliffe and A. Kucukelbir)
- Hierarchical Variational Models (with R. Ranganath and D. Tran)
- Automatic Differentiation Variational Inference (with A. Kucukelbir, D. Tran, R. Ranganath, and A. Gelman)
Open and current research:
- How can we improve the gradient estimator in BBVI?
- Can we use alternative divergence measures? What are their properties?
- What are the statistical properties of VI? What is a good framework?
- How can we efficiently go beyond the mean field?
67 The probabilistic pipeline, closed into a loop: knowledge & question + data → make assumptions → discover patterns → predict & explore → criticize model → revise
68 Should I be skeptical about variational inference?
[Figure: average log predictive vs. seconds for ADVI (M=1), ADVI (M=10), NUTS, and HMC on two hierarchical generalized linear models: (a) linear regression, (b) hierarchical logistic regression.]
MCMC enjoys theoretical guarantees, but the methods usually get to the same place [Kucukelbir et al., submitted].
We need more theory about variational inference.
69 Should I be skeptical about variational inference?
[Figure: kernel density estimates of two posterior marginals (tau and beta) under the truth, NUTS, mean-field ADVI, and full-rank ADVI.]
Variational inference underestimates the variance of the posterior. Relaxing the mean-field assumption can help. Here: a Poisson GLM [Giordano et al., 2015].
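The variance underestimate has a closed form in the Gaussian case, which makes for a quick check. This is a standard mean-field fact, not Giordano et al.'s correction method; the numbers are illustrative.

```python
import numpy as np

# Mean-field VI on a correlated Gaussian "posterior" p = N(0, Sigma).
# The optimal factorized q_i(z_i) are Gaussians whose variance is
# 1 / (Sigma^{-1})_{ii}, which is *smaller* than the true marginal
# variance Sigma_{ii} whenever the variables are correlated.

rho = 0.9
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])
precision = np.linalg.inv(Sigma)

true_var = np.diag(Sigma)             # true marginal variances: 1, 1
mf_var = 1.0 / np.diag(precision)     # mean-field variances: 1 - rho^2

print(true_var, mf_var)
# mean-field gives 1 - rho^2 = 0.19 instead of 1.0
```

The means are recovered exactly in this case; it is only the spread that collapses, and it collapses more the stronger the posterior correlation — exactly the behavior the KDE plots above display.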