

STA613/CBB540: Statistical methods in computational biology

Bayesian Regression (1/31/13)

Lecturer: Barbara Engelhardt   Scribe: Amanda Lea

1 Bayesian Paradigm

Bayesian methods ask: given that I have observed data, what is the probability distribution over a latent parameter? Here, the data are fixed and the parameter value we are trying to estimate is unknown. This uncertainty is captured in a probability distribution, known as the posterior probability distribution, which describes the distribution over possible values of the estimated parameter given the observed data. In addition, beliefs about the parameter's likely value before observing any data can be incorporated into the model through a distribution known as a prior. It is possible to have no prior information about the parameter, and this situation can also be dealt with (for example, by using an uninformative prior). Formally, this relationship is represented as follows:

$$P(\theta \mid D) \propto P(\theta)\, P(D \mid \theta)$$

where $\theta$ denotes the model parameters, $D$ is the observed data, $P(\theta)$ is the prior distribution on the model parameters, $P(D \mid \theta)$ is the likelihood of the data, and $P(\theta \mid D)$ is the posterior probability. In the Bayesian setting, we are trying to maximize the posterior probability (i.e., maximize the probability of our estimated parameter given the data), instead of the likelihood:

$$\arg\max_{\theta}\, P(\theta \mid D) = \arg\max_{\theta}\, P(\theta)\, P(D \mid \theta)$$

Bayesian methods have a number of advantages with respect to traditional frequentist methods:

- the Bayesian paradigm handles small amounts of data well
- outliers that would otherwise skew results do not contribute as much to parameter estimates (if they are not likely under the prior distribution)
- beliefs about the parameters being estimated can be incorporated into the model explicitly
- the data, rather than the parameters, are thought of as fixed; the parameters are unknown random variables and are described probabilistically.
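As a quick numerical illustration of the difference between maximizing the likelihood and maximizing the posterior, the sketch below is not from the original notes: the small Gaussian sample, the known noise scale, and the $N(0, \tau^2)$ prior are assumptions chosen only to show how a prior pulls the MAP estimate away from the MLE when data are scarce.

```python
import numpy as np

# Hypothetical small sample from a Gaussian with unknown mean theta and
# known noise scale sigma; assumed prior: theta ~ N(0, tau^2).
x = np.array([2.1, 1.7, 2.4])
sigma, tau = 1.0, 1.0
n, xbar = len(x), x.mean()

# MLE of theta: the sample mean.
theta_mle = xbar

# MAP of theta: the posterior mode of the Gaussian-Gaussian model, which
# shrinks the sample mean toward the prior mean (0 here).
theta_map = (n * tau**2 * xbar) / (n * tau**2 + sigma**2)

print(f"MLE: {theta_mle:.3f}, MAP: {theta_map:.3f}")
# With only 3 observations the prior pulls the MAP noticeably toward 0;
# as n grows, the two estimates converge.
```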

2 Bayesian regression

Let's work toward a description of regression in the Bayesian paradigm:

$$Y \mid X, \beta \sim N(X\beta, \sigma^2 I)$$

$$P(Y \mid X, \beta) \propto (\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}(Y - X\beta)^T(Y - X\beta)\right\}$$

For the Bayesian approach, we need to put a prior probability on our parameters $\beta$ and $\sigma^2$:

$$P(\beta, \sigma^2) = P(\beta \mid \sigma^2)\, P(\sigma^2)$$

We can choose to set any prior distribution on our parameters, but certain choices will make the posterior probability distribution easier to work with (and will give a closed-form solution for the maximum a posteriori (MAP) estimate of the parameters). We can return to the exponential family form to help us think about appropriate priors:

$$P(x \mid \eta) = h(x)\exp\{\eta^T T(x) - A(\eta)\}$$

We can write our prior probability $F(\eta \mid \lambda)$ in terms of the exponential family. Here, $F(\eta \mid \lambda)$ is the prior distribution of the natural parameters for our data, conditioned on prior (hyper)parameters $\lambda = (\lambda_1, \lambda_2)$:

$$x \mid \eta \sim P(x \mid \eta), \qquad \eta \mid \lambda \sim F(\eta \mid \lambda)$$

$$F(\eta \mid \lambda) = h_c(\eta)\exp\{\lambda_1^T \eta + \lambda_2(-A(\eta)) - A_c(\lambda)\}$$

$$P(\eta \mid x_1, \ldots, x_n, \lambda) \propto F(\eta \mid \lambda)\prod_{i=1}^n P(x_i \mid \eta)$$

$$P(\eta \mid x_1, \ldots, x_n, \lambda) \propto h_c(\eta)\exp\left\{\left[\lambda_1 + \sum_{i=1}^n T(x_i)\right]^T \eta + (\lambda_2 + n)(-A(\eta))\right\}$$

In the posterior, the sufficient statistics are $\eta$ and $-A(\eta)$.

When the posterior distribution and the prior probability distribution are in the same family, the prior distribution is called the conjugate prior for the likelihood distribution. Every member of the exponential family has a conjugate prior. For example, the Beta distribution is the conjugate prior of the Bernoulli distribution, and the Gaussian distribution is conjugate to itself in the mean parameter. Choosing the appropriate conjugate prior ensures that your posterior has a predictable distribution (in terms of the exponential family) and that the MAP parameter estimates have a closed-form solution. For the regression model, we will choose a Gaussian distribution for the prior on $\beta$ and an inverse gamma distribution for the prior on $\sigma^2$, written out below.
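Before writing out the regression priors, here is a minimal numerical sketch of the Beta–Bernoulli conjugacy mentioned above. The prior hyperparameters and the observations are made-up values for illustration; the point is only that the posterior stays in the same family, with hyperparameters updated by the sufficient statistics.

```python
import numpy as np

# A minimal conjugate update: Beta prior, Bernoulli likelihood.
a, b = 2.0, 2.0                      # assumed Beta(a, b) prior on theta
x = np.array([1, 0, 1, 1, 1, 0, 1])  # hypothetical Bernoulli observations

# Conjugate update: the posterior is again Beta. In exponential-family
# terms this is lambda_1 <- lambda_1 + sum_i T(x_i) and
# lambda_2 <- lambda_2 + n, with T(x) = x for the Bernoulli.
a_post = a + x.sum()
b_post = b + len(x) - x.sum()

print(f"posterior: Beta({a_post:.0f}, {b_post:.0f})")
print(f"posterior mean: {a_post / (a_post + b_post):.3f}")
```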

The priors on $\beta$ and $\sigma^2$ are:

$$P(\beta \mid \sigma^2) \propto (\sigma^2)^{-k/2}\exp\left\{-\frac{1}{2\sigma^2}(\beta - \mu_0)^T \Lambda_0 (\beta - \mu_0)\right\}, \qquad \beta \mid \sigma^2 \sim N(\mu_0, \sigma^2 \Lambda_0^{-1})$$

$$P(\sigma^2 \mid a, b) \propto (\sigma^2)^{-(a+1)}\exp\left\{-\frac{b}{\sigma^2}\right\}$$

where $\mu_0$ and $\Lambda_0$ are the mean and precision parameters of the Gaussian distribution on $\beta$, and $a$ and $b$ are the shape and scale parameters of the inverse gamma distribution. We can write the posterior distribution together as:

$$P(\beta, \sigma^2 \mid Y, X) \propto P(Y \mid X, \beta, \sigma^2)\, P(\beta \mid \sigma^2)\, P(\sigma^2)$$

where $P(Y \mid X, \beta, \sigma^2)$ is the likelihood and $P(\beta \mid \sigma^2)\, P(\sigma^2)$ is the prior. In this equation, $X$ is an $n \times k$ matrix and $Y$ is an $n \times 1$ vector; here we are only working with one dimension (a single predictor column) of $X$. Now we will substitute the probability distributions above into the full equation:

$$P(\beta, \sigma^2 \mid Y, X) \propto (\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}(Y - X\beta)^T(Y - X\beta)\right\} \times (\sigma^2)^{-k/2}\exp\left\{-\frac{1}{2\sigma^2}(\beta - \mu_0)^T\Lambda_0(\beta - \mu_0)\right\} \times (\sigma^2)^{-(a+1)}\exp\left\{-\frac{b}{\sigma^2}\right\}$$

We can simplify further to obtain the conditional posterior mean of $\beta$; this form of Bayesian regression is known as ridge regression:

$$E[\beta \mid \sigma^2, Y, X] = (X^TX + \Lambda_0)^{-1}(X^TY + \Lambda_0\mu_0)$$

This is very similar to the normal equations we encountered in linear regression:

$$\hat\beta = (X^TX)^{-1}X^TY$$

However, in ridge regression we add the prior precision matrix $\Lambda_0$, which makes $X^TX + \Lambda_0$ invertible even when $X^TX$ is singular. Usually, $\mu_0 = 0$ and $\Lambda_0 = cI$, where $c$ is a constant and $I$ is the identity matrix. Note that the influence of the prior parameters (the $\Lambda_0$ and $\Lambda_0\mu_0$ terms) relative to the data terms shrinks to zero as $n$, the size of the data, grows to infinity.
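As a sanity check on the posterior-mean formula, the snippet below compares $(X^TX + \Lambda_0)^{-1}(X^TY + \Lambda_0\mu_0)$ with $\mu_0 = 0$ and $\Lambda_0 = cI$ against the ordinary least-squares solution. The simulated data, the choice $c = 5.0$, and the variable names are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 20 samples, k = 3 predictors.
n, k = 20, 3
X = rng.normal(size=(n, k))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Prior: mu_0 = 0, Lambda_0 = c * I (c is an assumed value).
c = 5.0
mu_0 = np.zeros(k)
Lambda_0 = c * np.eye(k)

# Conditional posterior mean of beta -- the ridge-regression estimate.
beta_post = np.linalg.solve(X.T @ X + Lambda_0, X.T @ Y + Lambda_0 @ mu_0)

# Ordinary least squares (the normal equations) for comparison.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

print("posterior mean (ridge):", np.round(beta_post, 3))
print("OLS:                   ", np.round(beta_ols, 3))
# The ridge estimate is shrunk toward mu_0 = 0 relative to OLS.
```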

3 Stephens and Balding 2009 - Bayesian statistical methods for genetic association studies

In this paper, the authors compare Bayesian regression and testing paradigms in the context of association studies and identifying quantitative trait loci. Bayes factors, similar to likelihood ratios, describe the tradeoff between the alternative and the null model. What are the features of Bayes factors as compared to p-values?

- p-values are affected by power and are therefore not comparable across studies with different sample sizes
- there are many implicit assumptions when using p-values; BFs make many of those assumptions explicit
- p-values cannot be combined (in an intuitive, easily worked out way)
- multiple hypothesis testing is a problem when using p-values, as compared to Bayes factors.

Let's look at an example of the steps for computing Bayes factors and testing hypotheses in a Bayesian framework. For correlating genotype with phenotype, we normally use the following null hypothesis (that a SNP genotype is uncorrelated with phenotype) and alternative hypothesis (that a SNP genotype is correlated with phenotype):

$$H_0: \beta = 0 \qquad H_a: \beta \neq 0$$

Using a Bayesian approach, we would compute the posterior probability of association, i.e., the posterior probability of the alternative hypothesis:

$$P(\beta \neq 0 \mid X, Y)$$

Then, we would choose our prior odds. For this example (and many genetic association examples), the prior odds indicate how likely it is that a given SNP will be associated with the phenotype of interest. This value may vary across SNPs, perhaps depending on the minor allele frequency. Here, we will choose $10^{-6}$:

$$\Pi = \frac{P(H_a)}{P(H_0)} = 10^{-6}$$

Now, we compute the Bayes factor. This allows us to quantify the ratio of the likelihood of the observed data under the alternative model $H_a$ versus the null model $H_0$. For a Bayes factor less than or equal to 1, we cannot reject the null hypothesis. For a Bayes factor greater than 1, we may reject the null hypothesis (depending on our threshold for significance/rejection of the null). Larger Bayes factors indicate stronger support that the data fit $H_a$ rather than $H_0$. The form of a Bayes factor looks similar to a likelihood ratio test, but in this case the likelihoods need not come from the same model (with different numbers of parameters):

$$BF = \frac{P(D \mid H_a)}{P(D \mid H_0)}$$

Next, we compute the posterior odds:

$$PO = BF \times \Pi = \frac{P(D \mid H_a)}{P(D \mid H_0)} \cdot \frac{P(H_a)}{P(H_0)}$$

And from there, we can compute the posterior probability of association (PPA), where values closer to one indicate a higher chance of association:

$$PPA = \frac{PO}{1 + PO}$$
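To make the arithmetic concrete, the short snippet below chains the three quantities together. The prior odds of $10^{-6}$ follow the example in the notes; the Bayes factor value is made up purely for illustration.

```python
# Worked example of the Bayes-factor arithmetic in this section.
prior_odds = 1e-6          # Pi = P(H_a) / P(H_0), as in the notes
bayes_factor = 5e7         # BF = P(D | H_a) / P(D | H_0) (hypothetical value)

posterior_odds = bayes_factor * prior_odds      # PO = BF * Pi
ppa = posterior_odds / (1 + posterior_odds)     # posterior probability of association

print(f"PO = {posterior_odds:.1f}, PPA = {ppa:.3f}")
# PO = 50.0, PPA = 0.980: strong support for H_a despite the small prior odds.
```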

The posterior probability of association can be interpreted directly as a probability, irrespective of power or sample size. PPAs for a list of SNPs can be ranked and used in an FDR-type framework to choose a threshold for significance.
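The notes do not spell out the FDR-type procedure, so the sketch below is only one plausible illustration of the idea, not necessarily the rule used in the paper: rank SNPs by PPA and keep the largest set whose average (1 - PPA), an estimate of the expected false discovery proportion, stays below a chosen target. The PPA values, SNP names, and the 0.1 target are all invented.

```python
import numpy as np

# Hypothetical PPAs for five SNPs (invented values for illustration).
ppa = np.array([0.98, 0.91, 0.80, 0.40, 0.05])
snps = np.array(["rs1", "rs2", "rs3", "rs4", "rs5"])

# Rank SNPs by PPA (highest first).
order = np.argsort(-ppa)
ppa_sorted, snps_sorted = ppa[order], snps[order]

# One common Bayesian FDR-type rule (assumed here): among the top j SNPs,
# the expected false discovery proportion is the mean of (1 - PPA); keep
# the largest j for which this stays below the chosen target.
target = 0.10
efdr = np.cumsum(1 - ppa_sorted) / np.arange(1, len(ppa_sorted) + 1)
keep = efdr <= target

print("declared associations:", snps_sorted[keep])
print("estimated FDR at cutoff:", efdr[keep][-1] if keep.any() else None)
```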