Nonparametric Bayes tensor factorizations for big data
1 Nonparametric Bayes tensor factorizations for big data
David Dunson, Department of Statistical Science, Duke University
Funded from NIH R01-ES017240, R01-ES & DARPA N C-2082
2 Outline
- Motivation
- Conditional tensor factorizations
- Some properties, heuristic & otherwise
- Computation & applications
- Generalizations
3 Motivating setting - high-dimensional predictors
Routine to encounter massive-dimensional prediction & variable selection problems.
We have $y \in \mathcal{Y}$ & $x = (x_1, \ldots, x_p) \in \mathcal{X}$.
Unreasonable to assume linearity or additivity in motivating applications - e.g., epidemiology, genomics, neurosciences.
Goal: nonparametric approaches that accommodate large p, small n, allow interactions, & scale computationally to big p.
4 Gaussian processes with variable selection
For $\mathcal{Y} = \mathbb{R}$ & $\mathcal{X} \subset \mathbb{R}^p$, one approach lets $y_i = \mu(x_i) + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$, where $\mu : \mathcal{X} \to \mathbb{R}$ is an unknown regression function.
Following Zou et al. (2010) & others,
$\mu \sim GP(m, c)$, $c(x, x') = \phi \exp\{-\sum_{j=1}^p \alpha_j (x_j - x'_j)^2\}$,
with mixture priors placed on the $\alpha_j$'s.
Zou et al. (2010) show good empirical results; Bhattacharya, Pati & Dunson (2011) - minimax adaptive rates.
5 Issues & alternatives
Mean regression & computation challenging - difficult computationally beyond the conditionally Gaussian homoscedastic case.
Density regression interesting, as the variance & shape of the response distribution often change with x.
Initial focus: classification from many categorical predictors.
Approach generalizes directly to arbitrary $\mathcal{Y}$ and $\mathcal{X}$.
6 Classification & conditional probability tensors
Suppose $Y \in \{1, \ldots, d_0\}$ & $X_j \in \{1, \ldots, d_j\}$, $j = 1, \ldots, p$.
The classification function, or conditional probability, is $\Pr(Y = y \mid X_1 = x_1, \ldots, X_p = x_p) = P(y \mid x_1, \ldots, x_p)$.
This classification function can be structured as a $d_0 \times d_1 \times \cdots \times d_p$ tensor.
Let $\mathcal{P}_{d_1, \ldots, d_p}(d_0)$ denote the set of all possible conditional probability tensors.
$P \in \mathcal{P}_{d_1, \ldots, d_p}(d_0)$ implies $P(y \mid x_1, \ldots, x_p) \ge 0$ for all $y, x_1, \ldots, x_p$ & $\sum_{y=1}^{d_0} P(y \mid x_1, \ldots, x_p) = 1$.
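As a concrete illustration of the object above, a conditional probability tensor can be stored as a $(d_0, d_1, \ldots, d_p)$ array and checked for membership in $\mathcal{P}_{d_1, \ldots, d_p}(d_0)$ (a minimal sketch; the sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: binary response, p = 3 categorical predictors.
d0, dims = 2, (3, 2, 4)          # d0 response classes, d_j levels per predictor

# A random valid conditional probability tensor P[y, x1, ..., xp]:
# draw positive entries, then normalize over the response index y.
P = rng.gamma(1.0, size=(d0,) + dims)
P /= P.sum(axis=0, keepdims=True)

# Membership in the set of conditional probability tensors means
# nonnegativity and sum-to-one over y for every predictor combination.
assert np.all(P >= 0)
assert np.allclose(P.sum(axis=0), 1.0)

# Conditional probability lookup: Pr(Y = y | X1 = x1, X2 = x2, X3 = x3).
y, x = 1, (2, 0, 3)
prob = P[(y,) + x]
```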
7 Tensor factorizations
P = big tensor & data will be very sparse.
If P were a matrix, we might think of the SVD; we can instead consider a tensor factorization.
A common approach is PARAFAC - a sum of rank-one tensors.
Tucker factorizations express a $d_1 \times \cdots \times d_p$ tensor $A = \{a_{c_1 \cdots c_p}\}$ as
$a_{c_1 \cdots c_p} = \sum_{h_1=1}^{d_1} \cdots \sum_{h_p=1}^{d_p} g_{h_1 \cdots h_p} \prod_{j=1}^{p} u^{(j)}_{h_j c_j}$,
where $G = \{g_{h_1 \cdots h_p}\}$ is a core tensor.
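The Tucker sum above can be sketched numerically with `np.einsum` (a small p = 3 illustration with arbitrary dimensions, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Tucker factorization of a d1 x d2 x d3 tensor (p = 3 for illustration):
# a_{c1 c2 c3} = sum_{h1 h2 h3} g_{h1 h2 h3} u1[h1, c1] u2[h2, c2] u3[h3, c3].
d, k = (4, 3, 5), (2, 2, 3)                       # tensor dims, core dims
G = rng.standard_normal(k)                        # core tensor
U = [rng.standard_normal((k[j], d[j])) for j in range(3)]

# Reconstruct the full tensor from the factorization with einsum.
A = np.einsum('abc,ai,bj,ck->ijk', G, U[0], U[1], U[2])

# Sanity check a single entry against the explicit triple sum.
c = (1, 2, 4)
a_c = sum(G[h1, h2, h3] * U[0][h1, c[0]] * U[1][h2, c[1]] * U[2][h3, c[2]]
          for h1 in range(k[0]) for h2 in range(k[1]) for h3 in range(k[2]))
assert np.isclose(A[c], a_c)
```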
8 Our factorization (with Yun Yang)
Our proposed nonparametric model for the conditional probability is a Tucker factorization of P:
$P(y \mid x_1, \ldots, x_p) = \sum_{h_1=1}^{k_1} \cdots \sum_{h_p=1}^{k_p} \lambda_{h_1 h_2 \cdots h_p}(y) \prod_{j=1}^{p} \pi^{(j)}_{h_j}(x_j)$.  (1)
To be a valid conditional probability, the parameters are subject to
$\sum_{c=1}^{d_0} \lambda_{h_1 h_2 \cdots h_p}(c) = 1$ for any $(h_1, h_2, \ldots, h_p)$,
$\sum_{h=1}^{k_j} \pi^{(j)}_h(x_j) = 1$ for any possible pair $(j, x_j)$.  (2)
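A minimal numerical sketch of (1)-(2) (with p = 2, made-up dimensions): when the core slices and the $\pi^{(j)}$ columns are probability vectors, the factorization automatically yields a valid conditional probability at every x.

```python
import numpy as np

rng = np.random.default_rng(2)

d0, d, k = 2, (3, 3), (2, 3)     # response classes, predictor levels, ranks k_j

# Core tensor: lambda[h1, h2, :] is a probability vector over y (constraint 2a).
lam = rng.dirichlet(np.ones(d0), size=k)                   # shape (k1, k2, d0)

# pi[j][:, x_j] is a probability vector over h_j for each level x_j (2b).
pi = [rng.dirichlet(np.ones(k[j]), size=d[j]).T for j in range(2)]

def cond_prob(x):
    """P(y | x1, x2) via the Tucker-type factorization (1)."""
    w = np.einsum('a,b->ab', pi[0][:, x[0]], pi[1][:, x[1]])  # product weights
    return np.einsum('ab,aby->y', w, lam)

# Constraints (2) guarantee a valid conditional probability for every x.
for x1 in range(d[0]):
    for x2 in range(d[1]):
        p = cond_prob((x1, x2))
        assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)
```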
9 Comments on proposed factorization
$k_j = 1$ corresponds to exclusion of the jth feature.
By placing a prior on $k_j$, can induce variable selection & learning of the dimension of the factorization.
The representation is many-to-one and the parameters in the factorization cannot be uniquely identified.
This does not present a barrier to Bayesian inference - we don't care about the parameters in the factorization.
We want to do variable selection, prediction & inferences on predictor effects.
10 Theoretical support
The following theorem formalizes the flexibility:
Theorem. Every $d_0 \times d_1 \times d_2 \times \cdots \times d_p$ conditional probability tensor $P \in \mathcal{P}_{d_1, \ldots, d_p}(d_0)$ can be decomposed as (1), with $1 \le k_j \le d_j$ for $j = 1, \ldots, p$. Furthermore, $\lambda_{h_1 h_2 \cdots h_p}(y)$ and $\pi^{(j)}_{h_j}(x_j)$ can be chosen to be nonnegative and satisfy the constraints (2).
11 Latent variable representation
Simplify the representation by introducing p latent class indicators $z_1, \ldots, z_p$ for $X_1, \ldots, X_p$.
Conditional independence of Y and $(X_1, \ldots, X_p)$ given $(z_1, \ldots, z_p)$.
The model can be written as
$Y_i \mid z_{i1}, \ldots, z_{ip} \sim \text{Mult}(\{1, \ldots, d_0\}, \lambda_{z_{i1}, \ldots, z_{ip}})$,
$z_{ij} \mid X_{ij} = x_j \sim \text{Mult}(\{1, \ldots, k_j\}, (\pi^{(j)}_1(x_j), \ldots, \pi^{(j)}_{k_j}(x_j)))$.
Useful computationally & provides some insight into the model.
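The generative scheme above can be sketched directly (a toy p = 2 illustration with arbitrary parameter draws; note Y depends on X only through the latent classes z):

```python
import numpy as np

rng = np.random.default_rng(3)

d0, d, k = 2, (3, 3), (2, 2)

lam = rng.dirichlet(np.ones(d0), size=k)                  # (k1, k2, d0) core
pi = [rng.dirichlet(np.ones(k[j]), size=d[j]) for j in range(2)]  # (d_j, k_j)

def draw_y(x):
    """Generate Y via the latent class indicators z_1, ..., z_p."""
    # z_j | X_j = x_j  ~  Mult({1..k_j}, pi^{(j)}(x_j))
    z = tuple(rng.choice(k[j], p=pi[j][x[j]]) for j in range(2))
    # Y | z  ~  Mult({1..d_0}, lambda_z): Y independent of X given z.
    return rng.choice(d0, p=lam[z])

ys = [draw_y((0, 1)) for _ in range(1000)]
```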
12 Prior specification & hierarchical model
Conditional likelihood of the response: $(Y_i \mid z_{i1}, \ldots, z_{ip}, \Lambda) \sim \text{Multinomial}(\{1, \ldots, d_0\}, \lambda_{z_{i1}, \ldots, z_{ip}})$.
Conditional likelihood of the latent class variables: $(z_{ij} \mid X_{ij} = x_j, \pi) \sim \text{Multinomial}(\{1, \ldots, k_j\}, (\pi^{(j)}_1(x_j), \ldots, \pi^{(j)}_{k_j}(x_j)))$.
Prior on the core tensor: $\lambda_{h_1, \ldots, h_p} = (\lambda_{h_1, \ldots, h_p}(1), \ldots, \lambda_{h_1, \ldots, h_p}(d_0)) \sim \text{Diri}(1/d_0, \ldots, 1/d_0)$.
Prior on the independent rank-one components: $(\pi^{(j)}_1(x_j), \ldots, \pi^{(j)}_{k_j}(x_j)) \sim \text{Diri}(1/k_j, \ldots, 1/k_j)$.
13 Prior on predictor inclusion/tensor rank
For the jth dimension, we choose the simple prior
$P(k_j = 1) = 1 - r/p$, $P(k_j = k) = r/\{(d_j - 1)p\}$, $k = 2, \ldots, d_j$,
where $d_j$ = # levels of covariate $X_j$, r = expected # important features, $\bar{r}$ = specified maximum number of features.
The effective prior on the $k_j$'s is
$P(k_1 = l_1, \ldots, k_p = l_p) = P(k_1 = l_1) \cdots P(k_p = l_p) \, I_{\{|\{j : l_j > 1\}| \le \bar{r}\}}(l_1, \ldots, l_p)$,
where $I_A(\cdot)$ is the indicator function for set A.
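The marginal prior on $k_j$ above is simple enough to tabulate; a quick sketch (the numbers p = 600, r = 5, $d_j$ = 4 are illustrative, echoing the simulation setting later in the talk) shows how sharply it concentrates on exclusion ($k_j = 1$):

```python
import numpy as np

def prior_kj(d_j, r, p):
    """Marginal prior on k_j: P(k_j = 1) = 1 - r/p,
    P(k_j = k) = r / ((d_j - 1) p) for k = 2, ..., d_j."""
    probs = np.full(d_j, r / ((d_j - 1) * p))
    probs[0] = 1.0 - r / p
    return probs                      # index 0 holds P(k_j = 1)

# Example: p = 600 candidate predictors, r = 5 expected important ones,
# d_j = 4 levels. Mass concentrates heavily on k_j = 1 (exclusion).
probs = prior_kj(d_j=4, r=5, p=600)
assert np.isclose(probs.sum(), 1.0)
assert probs[0] > 0.99
```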
14 Properties - bias-variance tradeoff
Extreme data sparsity - the vast majority of combinations of $Y, X_1, \ldots, X_p$ are not observed.
Critical to include sparsity assumptions - even if such assumptions do not hold, they massively reduce the variance.
Discard predictors having small impact & parameters having small values.
Makes the problem tractable & may lead to good MSE.
15 Illustrative example
Binary Y & p binary covariates $X_j \in \{-1, 1\}$, $j = 1, \ldots, p$.
The true model can be expressed in the form [$\beta \in (0, 1)$]
$P(Y = 1 \mid X_1 = x_1, \ldots, X_p = x_p) = \frac{1}{2} + \frac{\beta}{2^2} x_1 + \cdots + \frac{\beta}{2^{p+1}} x_p$.
The effect of $X_j$ decreases exponentially as j increases from 1 to p.
Natural strategy: estimate $P(Y = 1 \mid X_1 = x_1, \ldots, X_p = x_p)$ by sample frequencies over the first k covariates,
$|\{i : y_i = 1, x_{1i} = x_1, \ldots, x_{ki} = x_k\}| \, / \, |\{i : x_{1i} = x_1, \ldots, x_{ki} = x_k\}|$,
& ignore the remaining p − k covariates.
Suppose we have $n = 2^l$ ($k \le l \le p$) observations with one in each cell of combinations of $X_1, \ldots, X_l$.
16 MSE analysis
The mean square error (MSE) can be expressed as
$\text{MSE} = \sum_{h_1, \ldots, h_p} E\{P(Y = 1 \mid X_1 = h_1, \ldots, X_p = h_p) - \hat{P}(Y = 1 \mid X_1 = h_1, \ldots, X_k = h_k)\}^2 = \text{Bias}^2 + \text{Var}$.
The squared bias is
$\text{Bias}^2 = \sum_{h_1, \ldots, h_p} \{P(Y = 1 \mid X_1 = h_1, \ldots, X_p = h_p) - E\hat{P}(Y = 1 \mid X_1 = h_1, \ldots, X_k = h_k)\}^2 = \frac{\beta^2}{3}(2^{p-2k-2} - 2^{-p-2})$.
17 MSE analysis (continued)
Finally we obtain the variance as
$\text{Var} = \sum_{h_1, \ldots, h_p} \text{Var}\,\hat{P}(Y = 1 \mid X_1 = h_1, \ldots, X_k = h_k)$
$= 2^{p-k+1} \, 2^{k-l} \sum_{i=1}^{2^{k-1}} \left(\frac{1}{2} + \frac{2i-1}{2^{k+1}}\beta\right)\left(\frac{1}{2} - \frac{2i-1}{2^{k+1}}\beta\right)$
$= \frac{1}{3}\{(3 - \beta^2)2^{p+k-l-2} + \beta^2 2^{p-k-l-2}\}$.
Since there are $2^p$ cells, the average MSE for each cell equals
$\frac{1}{3}\{(3 - \beta^2)2^{k-l-2} + \beta^2 2^{-k-l-2} + \beta^2 2^{-2k-2} - \beta^2 2^{-2p-2}\}$.
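The closed-form squared bias in this analysis can be checked by brute-force enumeration of the $2^p$ cells for small p and k (a verification sketch; p = 6, k = 2, $\beta$ = 0.7 are arbitrary choices):

```python
import itertools
import numpy as np

# Check the closed-form squared bias by enumerating all cells.
# True model: P(Y=1|x) = 1/2 + sum_i beta x_i / 2^{i+1}, x_i in {-1, 1};
# the estimator conditions on the first k covariates, so its expectation
# (with balanced cells) is 1/2 + sum_{i<=k} beta x_i / 2^{i+1}.
p, k, beta = 6, 2, 0.7

bias2 = 0.0
for x in itertools.product([-1, 1], repeat=p):
    truth = 0.5 + sum(beta * x[i] / 2 ** (i + 2) for i in range(p))
    expect = 0.5 + sum(beta * x[i] / 2 ** (i + 2) for i in range(k))
    bias2 += (truth - expect) ** 2

closed_form = beta ** 2 / 3 * (2 ** (p - 2 * k - 2) - 2 ** (-p - 2))
assert np.isclose(bias2, closed_form)
```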
18 Implications of MSE analysis
# predictors p has little impact on the selection of k.
$k \le l$, so the second term is small compared to the 1st & 3rd terms.
The average MSE attains its minimum at $k \approx l/3 = \log_2(n)/3$.
The true model is not sparse & all the predictors impact the conditional probability,
but the optimal # predictors only depends on the log sample size.
19 Borrowing of information
A critical feature of our model is borrowing across cells.
Letting $w_{h_1, \ldots, h_p}(x_1, \ldots, x_p) = \prod_j \pi^{(j)}_{h_j}(x_j)$, our model is
$P(Y = y \mid X_1 = x_1, \ldots, X_p = x_p) = \sum_{h_1, \ldots, h_p} w_{h_1, \ldots, h_p}(x_1, \ldots, x_p) \lambda_{h_1 \cdots h_p}(y)$,
with $\sum_{h_1, \ldots, h_p} w_{h_1, \ldots, h_p}(x_1, \ldots, x_p) = 1$.
View $\lambda_{h_1 \cdots h_p}(y)$ as the frequency of $Y = y$ in cell $X_1 = h_1, \ldots, X_p = h_p$.
We have a kernel estimate, borrowing information via a weighted average of cell frequencies.
20 Illustrative example
One covariate $X \in \{1, \ldots, m\}$ with $Y \in \{0, 1\}$ & $P_j = P(Y = 1 \mid X = j)$.
Naive estimate: $\hat{P}_j = k_j / n_j = |\{i : y_i = 1, x_i = j\}| / |\{i : x_i = j\}|$ = sample frequencies.
Alternatively, consider a kernel estimate indexed by $0 \le c \le 1/(m-1)$:
$\tilde{P}_j = \{1 - (m-1)c\}\hat{P}_j + c \sum_{k \ne j} \hat{P}_k$, $j = 1, \ldots, m$.
Use squared error loss to compare these estimators.
21 MSE for illustrative example
$E\{L(\hat{P}, P)\} = \sum_{j=1}^m E(\hat{P}_j - P_j)^2 = \sum_{j=1}^m \frac{P_j(1 - P_j)}{n_j}$.
$E\{L(\tilde{P}, P)\} = \sum_{j=1}^m E(\tilde{P}_j - P_j)^2$ is a function of c with minimum at
$c_0 = \frac{\frac{1}{m} E\{L(\hat{P}, P)\}}{E\{L(\hat{P}, P)\} + \frac{1}{m-1}\sum_{i<j}(P_i - P_j)^2} \in \left(0, \frac{1}{m-1}\right)$.
When the $P_j$'s are similar, the estimate $\tilde{P}$ can reduce the risk to as little as 1/m of the risk of estimating each $\hat{P}_j$ separately.
If the $P_j$'s are not similar, $\tilde{P}$ can still reduce the risk considerably when the cell counts $\{n_j\}$ are small.
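The exact risk of the shrinkage estimator can be computed in closed form from the bias-variance decomposition, and the numerically optimal c can be compared against $c_0$ (a sketch with made-up values of $P_j$ and $n_j$; it assumes independent binomial cells, as in the slide):

```python
import numpy as np

# Exact risk of P~_j = {1-(m-1)c} P^_j + c sum_{k!=j} P^_k as a function
# of c, using E P^_j = P_j and Var P^_j = P_j (1 - P_j) / n_j.
P = np.array([0.30, 0.32, 0.34, 0.36])       # similar cell probabilities
n = np.array([5, 5, 5, 5])                   # small cell counts
m = len(P)
var_hat = P * (1 - P) / n

def risk(c):
    total = 0.0
    for j in range(m):
        mean_j = (1 - (m - 1) * c) * P[j] + c * (P.sum() - P[j])
        var_j = ((1 - (m - 1) * c) ** 2 * var_hat[j]
                 + c ** 2 * (var_hat.sum() - var_hat[j]))
        total += (mean_j - P[j]) ** 2 + var_j
    return total

grid = np.linspace(0.0, 1.0 / (m - 1), 2001)
risks = np.array([risk(c) for c in grid])
c_opt = grid[risks.argmin()]

# Compare the grid optimum with the closed-form c_0 from the slide.
pair = sum((P[i] - P[j]) ** 2 for i in range(m) for j in range(i + 1, m))
c0 = (var_hat.sum() / m) / (var_hat.sum() + pair / (m - 1))

assert 0.0 < c_opt < 1.0 / (m - 1)     # optimum strictly inside the interval
assert risks.min() < risk(0.0)         # borrowing beats the naive estimator
assert abs(c_opt - c0) < 1e-3
```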
22 Setting & assumptions
Data $y^n$ & $X^n$ for n subjects with $p_n \gg n$ (large p, small n).
Assume $d_j = d$ for simplicity in exposition.
Putting the true model $P_0$ in our tensor form, assume
Assumption A. $\sum_{j=1}^{p_n} \max_{x_j} \sum_{h_j=2}^{d} \pi^{(j)}_{h_j}(x_j) < \infty$.
This is a near-sparsity restriction on $P_0$. Additionally assume the true conditional probabilities are strictly greater than zero:
Assumption B. $P_0(y \mid x) \ge \epsilon_0$ for any x, y, for some $\epsilon_0 > 0$.
23 Posterior convergence theorem
$x_1, \ldots, x_n$ independent from unknown $G_n$ on $\{1, \ldots, d\}^{p_n}$.
Let $\epsilon_n$ be a sequence with $\epsilon_n \to 0$, $n\epsilon_n^2 \to \infty$ and $\sum_n \exp(-n\epsilon_n^2) < \infty$. Assume the following conditions hold:
(i) $\bar{r}_n \log p_n \le n\epsilon_n^2$,
(ii) $\bar{r}_n d^{\bar{r}_n} \log(\bar{r}_n / \epsilon_n) \le n\epsilon_n^2$,
(iii) $\bar{r}_n / p_n \to 0$ as $n \to \infty$, and
(iv) there exists a sequence of models $\gamma_n$ with size $r_n$ such that $\sum_{j \notin \gamma_n} \max_{x_j} \sum_{h_j=2}^{d} \pi^{(j)}_{h_j}(x_j) \le \epsilon_n^2$.
Denote $d(P, P_0) = \int \sum_{y=1}^{d_0} |P(y \mid x_1, \ldots, x_p) - P_0(y \mid x_1, \ldots, x_p)| \, G_n(dx_1, \ldots, dx_p)$; then
$\Pi_n\{P : d(P, P_0) \ge M\epsilon_n \mid y^n, X^n\} \to 0$ a.s. $P_0^n$.
24 Implications of theorem
The posterior convergence rate can be very close to $n^{-1/2}$ for appropriate hyperparameter choices.
For any $\alpha \in (0, 1)$, $\epsilon_n = n^{-(1-\alpha)/2} \log n$ satisfies the conditions:
$r_n \le \log n$ (# important predictors scales with log n),
$p_n \le \exp(n^\alpha)$ (# candidate predictors exponential in n),
and there exists a sequence of models $\gamma_n$ with size $r_n$ such that
$\sum_{j \notin \gamma_n} \max_{x_j} \sum_{h_j=2}^{d} \pi^{(j)}_{h_j}(x_j) \le n^{\alpha - 1} \log^2 n$.
Use $r_n = \log_d(n)$, $\bar{r}_n = 2r_n$ as default values for the prior in applications.
25 Posterior computation
Conditionally on $\{k_j\}$, a simple Gibbs sampler - Dirichlet & multinomial conditionals.
# components to update is $\prod_{j=1}^p k_j$ - can blow up with p, but all but a small number of $k_j = 1$.
To update $\{k_j\}$ we could use RJMCMC, but this doesn't scale well computationally.
Two-stage algorithm: (i) SSVS to estimate $k_j$ - acceptance probabilities use approximated conditional marginal likelihoods; (ii) conditionally on $\{\hat{k}_j\}$, run Gibbs.
Scales efficiently & excellent performance in the cases we have considered.
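Conditionally on $\{k_j\}$, one Gibbs sweep only needs multinomial and Dirichlet draws. A minimal sketch of stage (ii) for p = 2 and fixed $k_1 = k_2 = 2$ on random toy data (the data and sizes are made up; the real implementation would only touch dimensions with $k_j > 1$):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: n observations, p = 2 categorical predictors, binary response.
n, d0, d, k = 200, 2, (3, 3), (2, 2)
X = rng.integers(0, 3, size=(n, 2))
y = rng.integers(0, 2, size=n)

# Initialize parameters and latent classes.
lam = np.full(k + (d0,), 1.0 / d0)                       # core tensor
pi = [np.full((k[j], d[j]), 1.0 / k[j]) for j in range(2)]
z = np.zeros((n, 2), dtype=int)

for sweep in range(100):
    # 1) Latent classes: P(z_ij = h | -) propto pi_h^{(j)}(x_ij) lam_z(y_i).
    for j in range(2):
        for i in range(n):
            zo = z[i, 1 - j]
            lam_slice = lam[:, zo, y[i]] if j == 0 else lam[zo, :, y[i]]
            w = pi[j][:, X[i, j]] * lam_slice
            z[i, j] = rng.choice(k[j], p=w / w.sum())
    # 2) Core tensor: Dirichlet posterior from response counts per (h1, h2).
    for h1 in range(k[0]):
        for h2 in range(k[1]):
            mask = (z[:, 0] == h1) & (z[:, 1] == h2)
            counts = np.bincount(y[mask], minlength=d0)
            lam[h1, h2] = rng.dirichlet(1.0 / d0 + counts)
    # 3) Rank-one components: Dirichlet posterior from latent-class counts.
    for j in range(2):
        for x in range(d[j]):
            mask = X[:, j] == x
            counts = np.bincount(z[mask, j], minlength=k[j])
            pi[j][:, x] = rng.dirichlet(1.0 / k[j] + counts)
```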
26 Simulation study
N = 2,000 instances, p = 600 covariates $X_j \in \{1, \ldots, 4\}$ & binary response Y.
True model: 3 important predictors, $X_9$, $X_{11}$ and $X_{13}$.
Generate $P(Y = 1 \mid X_9 = x_9, X_{11} = x_{11}, X_{13} = x_{13})$ independently for each combination of $(x_9, x_{11}, x_{13})$.
To obtain an optimal misclassification rate of 15%, probabilities generated as $f(U) = U^2 / \{U^2 + (1 - U)^2\}$, $U \sim \text{Unif}(0, 1)$.
n training samples & N − n testing.
Training: $n \in \{200, 400, 600, 800\}$, with 10 random training-test splits. Apply our approach to each split.
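A quick quadrature confirms that generating cell probabilities as $f(U)$ gives a Bayes (optimal) misclassification rate of roughly 15%, since that rate is $E[\min\{f(U), 1 - f(U)\}]$ (a verification sketch, not part of the original slides):

```python
import numpy as np

# Optimal misclassification rate = E[min(f(U), 1 - f(U))] with
# f(U) = U^2 / (U^2 + (1-U)^2), U ~ Unif(0, 1); trapezoid rule on a grid.
u = np.linspace(0.0, 1.0, 100001)
f = u ** 2 / (u ** 2 + (1 - u) ** 2)
g = np.minimum(f, 1 - f)
bayes_rate = np.sum((g[1:] + g[:-1]) / 2 * np.diff(u))
assert 0.14 < bayes_rate < 0.16     # roughly the 15% quoted in the talk
```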
27 Simulation results - misclassification rate
Table: Testing results for the synthetic data example. RF: random forests; TF: our tensor factorization model. Columns: training size, aMSE of TF, misclassification rate of TF, misclassification rate of RF. [Table values lost in transcription.]
$\text{aMSE} = \frac{1}{4^p} \sum_{x_1, \ldots, x_p} \{P(Y = 1 \mid x_1, \ldots, x_p) - \hat{P}(Y = 1 \mid x_1, \ldots, x_p)\}^2$.
28 Simulation results - variable selection performance
Table: Columns 2-4 = inclusion probabilities of the 9th, 11th, 13th predictors. Column 5 = maximum inclusion probability across the remaining predictors. Column 6 = average inclusion probability across the remaining predictors. Quantities are averages over 10 trials. [Table values lost in transcription.]
29 Application data sets
1. Promoter gene sequences: A, C, G, T nucleotides at p = 57 positions for N = 106 sequences & binary response (promoter or not). 5-fold CV: n = 85 training & N − n = 21 test samples in each split.
2. Splice-junction gene sequences: A, C, G, T nucleotides at p = 60 positions for N = 3,175 sequences. Response classes: exon/intron boundary (EI), intron/exon boundary (IE) or neither (N). Considered $n \in \{200, 2540\}$.
3. Single Proton Emission Computed Tomography (SPECT): cardiac patients, normal/abnormal. 267 SPECT images & 22 binary features. Previously divided: n = 80 & N − n = 187.
30 Results
Table: RF: random forests, NN: neural networks, SVM: support vector machine, BART: Bayesian additive regression trees, TF: our tensor factorization model. Misclassification rates displayed for CART, RF, NN, LASSO, SVM, BART & TF on Promoter (n=85), Splice (n=200), Splice (n=2540) & SPECT (n=80). [Table values lost in transcription.]
At worst, classification performance comparable with RF, the best of the competitors.
Particularly good relative performance as n decreases & p increases.
31 Variable selection - interpretability & parsimony
Additional advantages in terms of variable selection.
In the promoter data, selected nucleotides at the 15th, 16th, 17th, and 39th positions.
In the splice data, the 28th, 29th, 30th, 31st, 32nd and 35th positions are selected.
In the SPECT data, the 11th, 13th and 16th predictors are selected.
In each case, excellent classification performance is obtained based on a small subset of the predictors.
32 Generalization - conditional distribution modeling
Generalization: conditional distribution estimation,
$f(y \mid x) = \sum_{h=1}^{k} \sum_{h_1=1}^{k_1} \cdots \sum_{h_p=1}^{k_p} \pi_{h h_1 \cdots h_p}(x) K(y; \theta_{h h_1 \cdots h_p})$,
with $\{\pi_{h h_1 \cdots h_p}(x)\}$ = a core probability tensor of predictor-dependent weights on a multiway array of kernels.
Motivated by the above conditional tensor factorization for classification, let
$\pi_{h h_1 \cdots h_p}(x) = \pi_h \prod_{j=1}^{p} \pi^{(j)}_{h_j}(x_j)$.
33 Linear Tucker density regression
Letting $x \in \mathcal{X} = [0, 1]^p$ and $\psi_j \in [0, 1]$, choose
$\pi^{(j)}_1(x_j) = 1 - x_j \psi_j$, $\pi^{(j)}_2(x_j) = x_j \psi_j$, $k_j = 2$, $j = 1, \ldots, p$.
The model linearly interpolates but otherwise is extremely flexible.
In the simple case in which p = 1 & the kernel is Gaussian, we have
$f(y \mid x) = \sum_{h=1}^{k} \pi_h \{(1 - x\psi) N(y; \mu_{h1}, \tau_{h1}^{-1}) + x\psi N(y; \mu_{h2}, \tau_{h2}^{-1})\}$,
which induces the linear mean regression model
$E(y \mid x) = \{\sum_{h=1}^{k} \pi_h \mu_{h1}\} + \{\sum_{h=1}^{k} \pi_h \psi(\mu_{h2} - \mu_{h1})\} x = \beta_0 + \beta_1 x$.
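The induced linearity of the mean can be checked numerically: for any mixture weights and component means, $E(y \mid x)$ computed from the mixture agrees with $\beta_0 + \beta_1 x$ (a sketch with arbitrary parameter draws; k = 4 and $\psi$ = 0.8 are made up):

```python
import numpy as np

rng = np.random.default_rng(5)

# p = 1 linear Tucker density regression: a mixture whose x-weights
# interpolate linearly, so the induced mean E(y|x) is affine in x.
kk, psi = 4, 0.8
pi_h = rng.dirichlet(np.ones(kk))
mu = rng.standard_normal((kk, 2))          # mu[h, 0] = mu_h1, mu[h, 1] = mu_h2

def mean_y(x):
    """E(y|x) = sum_h pi_h {(1 - x psi) mu_h1 + x psi mu_h2}."""
    return float(np.sum(pi_h * ((1 - x * psi) * mu[:, 0] + x * psi * mu[:, 1])))

beta0 = float(np.sum(pi_h * mu[:, 0]))
beta1 = float(np.sum(pi_h * psi * (mu[:, 1] - mu[:, 0])))

for x in (0.0, 0.25, 0.5, 1.0):
    assert np.isclose(mean_y(x), beta0 + beta1 * x)
```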
34 Linear Tucker density regression - comments
Different quantiles of $f(y \mid x)$ change linearly with x, but with slopes that vary.
If k = 1 and $\tau_{h1} = \tau_{h2} = \tau_h$, obtain simple normal linear regression.
Density changes linearly, from $f(y \mid x = 0) = \sum_h \pi_h N(y; \mu_{h1}, \tau_{h1}^{-1})$ to $f(y \mid x = 1) = \sum_h \pi_h N(y; \mu_{h2}, \tau_{h2}^{-1})$ as x increases.
As p increases, still interpolate linearly but accommodate interactions.
Posterior computation: single-stage Gibbs sampler (multinomial, Dirichlet, normal-gamma, Bernoulli, beta).
35 Joint tensor factorizations
Focused in this talk on conditional Tucker factorizations.
Can also use probabilistic tensor factorizations for joint modeling - very useful for huge sparse contingency table analysis.
The same ideas provide a type of multivariate generalization of current Bayes discrete mixtures:
instead of a single cluster index, multiple dependent cluster indices underlying each type of data.
36 References
Banerjee, A., Murray, J. & Dunson, D.B. (2012). Nonparametric Bayes infinite tensor factorization priors.
Bhattacharya, A. & Dunson, D.B. (2012). Simplex factor models for multivariate unordered categorical data. JASA, 107.
Bhattacharya, A. & Dunson, D.B. (2012). Nonparametric Bayes testing of associations in high-dimensional categorical data. Almost done!
Dunson, D.B. & Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. JASA, 104.
Ghosh, A. & Dunson, D.B. (2012). Conditional distribution from high-dimensional interacting predictors. In progress.
Yang, Y. & Dunson, D.B. (2012). Bayesian conditional tensor factorizations for high-dimensional classification. arXiv.
Bayesian Additive Regression Tree (BART) with application to controlled trail data analysis Weilan Yang wyang@stat.wisc.edu May. 2015 1 / 20 Background CATE i = E(Y i (Z 1 ) Y i (Z 0 ) X i ) 2 / 20 Background
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationWhy experimenters should not randomize, and what they should do instead
Why experimenters should not randomize, and what they should do instead Maximilian Kasy Department of Economics, Harvard University Maximilian Kasy (Harvard) Experimental design 1 / 42 project STAR Introduction
More informationBayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine
Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationMarkov Chain Monte Carlo (MCMC)
Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can
More informationNonparameteric Regression:
Nonparameteric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,
More informationStudy Notes on the Latent Dirichlet Allocation
Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection
More informationProbability and Information Theory. Sargur N. Srihari
Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal
More informationNonparametric Bayes regression and classification through mixtures of product kernels
Nonparametric Bayes regression and classification through mixtures of product kernels David B. Dunson & Abhishek Bhattacharya Department of Statistical Science Box 90251, Duke University Durham, NC 27708-0251,
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationOr How to select variables Using Bayesian LASSO
Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationSmall-variance Asymptotics for Dirichlet Process Mixtures of SVMs
Small-variance Asymptotics for Dirichlet Process Mixtures of SVMs Yining Wang Jun Zhu Tsinghua University July, 2014 Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, 2014 1 / 25 Outline
More informationLinear Classification
Linear Classification Lili MOU moull12@sei.pku.edu.cn http://sei.pku.edu.cn/ moull12 23 April 2015 Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative
More informationPractical Bayesian Optimization of Machine Learning. Learning Algorithms
Practical Bayesian Optimization of Machine Learning Algorithms CS 294 University of California, Berkeley Tuesday, April 20, 2016 Motivation Machine Learning Algorithms (MLA s) have hyperparameters that
More informationA Fully Nonparametric Modeling Approach to. BNP Binary Regression
A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation
More informationBayesian Nonparametrics for Speech and Signal Processing
Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer
More informationBayesian Analysis for Natural Language Processing Lecture 2
Bayesian Analysis for Natural Language Processing Lecture 2 Shay Cohen February 4, 2013 Administrativia The class has a mailing list: coms-e6998-11@cs.columbia.edu Need two volunteers for leading a discussion
More informationBayesian Support Vector Machines for Feature Ranking and Selection
Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction
More informationLecture 5: LDA and Logistic Regression
Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant
More informationNearest Neighbor Gaussian Processes for Large Spatial Data
Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns
More informationClick Prediction and Preference Ranking of RSS Feeds
Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS
More informationProbability and Estimation. Alan Moses
Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.
More informationNonparametric Bayes Density Estimation and Regression with High Dimensional Data
Nonparametric Bayes Density Estimation and Regression with High Dimensional Data Abhishek Bhattacharya, Garritt Page Department of Statistics, Duke University Joint work with Prof. D.Dunson September 2010
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationINTRODUCTION TO BAYESIAN INFERENCE PART 2 CHRIS BISHOP
INTRODUCTION TO BAYESIAN INFERENCE PART 2 CHRIS BISHOP Personal Healthcare Revolution Electronic health records (CFH) Personal genomics (DeCode, Navigenics, 23andMe) X-prize: first $10k human genome technology
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationStat 542: Item Response Theory Modeling Using The Extended Rank Likelihood
Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More information