Bayesian Analysis for Natural Language Processing Lecture 2 Shay Cohen February 4, 2013

Administrivia The class has a mailing list: coms-e6998-11@cs.columbia.edu. We need two volunteers to lead a discussion on 2/18 (each leading the discussion on one paper). Next week: guest lecture by Bob Carpenter. I will send the two papers he will cover to the mailing list.

Refresher: last week Bayesian statistics manages uncertainty using distributions over the parameters. Start with a prior (a "half-baked idea") and update it to the posterior. The two components of the model: prior and likelihood. Inference uses Bayes' theorem: p(θ | x) = p(θ) p(x | θ) / p(x), where p(x) = ∫ p(θ) p(x | θ) dθ.

Today's class We will focus on priors, in NLP and in general. If time permits: Bayesian decision theory. If time permits: MAP inference. All your questions and comments were great; I picked the ones which seemed most relevant to everybody.

Do Bayesian models overfit? Question: There is a lot of talk about the power that Bayesian statistics can bring to NLP because of the additional degrees of freedom added by picking the prior (and potentially also introducing latent variables and hyperparameters). However, one concern that is apparent to me is that with this freedom comes the danger that the model will overfit the training data and perform poorly on test data. Of course this is a theoretical concern, and if Bayesian methods perform well on test sets in practice this is not a real issue, but I think it is at least a theoretical issue with these methods.

Do Bayesian models overfit? Bayesian models usually work in an unsupervised (transductive or non-transductive) setting. MAP inference with certain priors can actually alleviate overfitting: for example, L2 regularization corresponds to MAP inference with a Gaussian prior (sketched below).
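
To make the correspondence concrete, here is a short derivation (a generic sketch, not from the slides), assuming a zero-mean Gaussian prior N(0, σ²I) on the parameters θ:

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\; \log p(x \mid \theta) + \log p(\theta)
  = \arg\max_{\theta}\; \log p(x \mid \theta) - \frac{1}{2\sigma^2}\,\lVert\theta\rVert_2^2 + \mathrm{const.}
```

So MAP estimation is ordinary likelihood maximization with an L2 penalty whose strength 1/(2σ²) is set by the prior variance; a tighter prior gives stronger regularization.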

[Figure: non-uniform prior for a coin with 0.7 probability of heads; x-axis: prob. of heads, y-axis: density]

[Figure: posterior after 10 tosses, coin with 0.7 prob. for heads; x-axis: prob. of heads, y-axis: density]

[Figure: posterior after 100 tosses, coin with 0.7 prob. for heads; x-axis: prob. of heads, y-axis: density]

[Figure: posterior after 1000 tosses, coin with 0.7 prob. for heads; x-axis: prob. of heads, y-axis: density]
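
These plots are consistent with a conjugate Beta-Binomial setup; the following Python sketch runs the same kind of experiment, assuming a Beta(2, 2) prior (the exact prior used for the slides is not stated), and shows the posterior concentrating around the true value 0.7 as tosses accumulate.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
true_p = 0.7         # true probability of heads
a0, b0 = 2.0, 2.0    # assumed Beta prior hyperparameters (not from the slides)

for n in [10, 100, 1000]:
    heads = rng.binomial(n, true_p)                    # simulate n coin tosses
    a_post, b_post = a0 + heads, b0 + (n - heads)      # conjugate Beta posterior update
    mean = a_post / (a_post + b_post)                  # posterior mean
    lo, hi = beta.ppf([0.025, 0.975], a_post, b_post)  # 95% credible interval
    print(f"n={n:4d}  posterior mean={mean:.3f}  95% interval=({lo:.3f}, {hi:.3f})")
```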

Effect of priors Question: In practice, is there a trend that Bayesian methods achieve their best performance with fewer examples than frequentist methods need to achieve theirs? The motivation for this question is that if Bayesian methods are given good priors to start with, I get the feeling that the parameter search space would be smaller than for frequentist methods, and hence they would require fewer examples to learn a good model.

Effect of priors Priors have a great effect when the sample size is not too large. In many cases, the posterior converges to a distribution with its mass concentrated around the maximum likelihood estimate. In NLP, we usually don't have enough samples, so the prior can have a large effect.

Conjugate priors What are conjugate priors?
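
For reference, the standard definition the slide alludes to: a family of priors is conjugate to a likelihood if the posterior always belongs to the same family as the prior. The Beta-Bernoulli pair is the canonical example:

```latex
\theta \sim \mathrm{Beta}(\alpha,\beta), \qquad
x_1,\dots,x_n \mid \theta \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(\theta)
\;\Longrightarrow\;
\theta \mid x_{1:n} \sim \mathrm{Beta}\!\Big(\alpha+\sum_i x_i,\; \beta+n-\sum_i x_i\Big).
```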

Dirichlet prior Question: I was wondering if there was any work done on the theoretical justification of conjugate priors, specifically the Dirichlet prior. That is, is there any theoretical work indicating that the Dirichlet is a good prior for reasons other than computational efficiency?

Dirichlet prior The main reason for choosing the Dirichlet is computational convenience. But... Dirichlet priors can encourage sparsity. They are also interpretable as adding extra counts to the empirical counts, a technique known as additive smoothing (see the sketch below).
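
A minimal sketch of that interpretation (a generic example; the counts and the concentration value are made up): with a symmetric Dirichlet(α) prior on a multinomial, the posterior mean simply adds α pseudo-counts to each observed count.

```python
import numpy as np

counts = np.array([7, 2, 0, 1])  # hypothetical word counts over a 4-word vocabulary
alpha = 0.5                      # assumed symmetric Dirichlet concentration

mle = counts / counts.sum()                                               # maximum likelihood estimate
posterior_mean = (counts + alpha) / (counts.sum() + alpha * len(counts))  # additive smoothing
print(mle)             # the unseen word gets probability 0
print(posterior_mean)  # the unseen word gets a small, nonzero probability
```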

Conjugacy to multinomial Question: Is the Dirichlet distribution the only conjugate prior with respect to the multinomial distribution? In general, can a distribution have multiple conjugate priors associated with it?

Conjugacy to multinomial Is a family of priors consisting of a single delta function on some parameter conjugate? Is the family of all possible priors conjugate? Is a mixture of conjugate priors also conjugate?

Prior selection Question: How do we choose priors? Mostly, tractability seems to be the key for choosing priors. I am not sure about the advantages and disadvantages of each prior from an NLP point of view.

Prior selection Do we have reason to believe one parameter value is more likely than another? If not, use a non-informative prior. Can we interpret the prior? Are we looking to model some aspect of the parameters, such as sparsity or correlation? Is there an interpretation of the prior that is useful for our data? Empirically: trial and error. Sometimes selection is done automatically.

Prior design Question: Can one manually design a prior to suit one's needs? Or is it the case that one must always choose from the existing distributions over the reals? I suppose that, unlike discrete distributions (where you can just specify the probability of each item), priors are hard to hand-design, if possible at all.

Prior design Prior elicitation: not done much in NLP, if at all. Empirical Bayesian setting: learning the hyperparameters (see the sketch below).
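
As an illustration of the empirical Bayes idea (a generic sketch with made-up counts, not a recipe from the course): pick the hyperparameters of a shared Beta prior by maximizing the marginal likelihood of the observed data.

```python
import numpy as np
from scipy.special import betaln, gammaln
from scipy.optimize import minimize

# hypothetical data: heads/total counts for several coins assumed to share one Beta prior
heads = np.array([3, 7, 5, 9, 2])
total = np.array([10, 10, 10, 10, 10])

def neg_log_marginal(log_ab):
    a, b = np.exp(log_ab)  # optimize in log-space so that a, b stay positive
    log_choose = gammaln(total + 1) - gammaln(heads + 1) - gammaln(total - heads + 1)
    # Beta-Binomial marginal log-likelihood, summed over the coins
    ll = log_choose + betaln(a + heads, b + total - heads) - betaln(a, b)
    return -ll.sum()

res = minimize(neg_log_marginal, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)
print(f"learned hyperparameters: a={a_hat:.2f}, b={b_hat:.2f}")
```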

Dirichlet vs. Logistic Normal Question: You mentioned that the logistic normal distribution allows modeling dependence, as opposed to the Dirichlet distribution. I didn't quite get how so.
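
A brief sketch of the usual answer (not taken from the slides): the logistic normal draws a Gaussian vector η ~ N(μ, Σ) and maps it to the simplex with a softmax, so the off-diagonal entries of Σ let chosen components co-vary; the Dirichlet has no analogous free correlation parameters, and its pairwise correlations are always negative. The values of μ and Σ below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.8, 0.0],   # components 0 and 1 strongly correlated in log-space
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

eta = rng.multivariate_normal(mu, Sigma, size=10_000)
theta = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)  # softmax onto the simplex

print(np.corrcoef(theta[:, 0], theta[:, 1])[0, 1])  # positive: they rise and fall together
print(np.corrcoef(theta[:, 0], theta[:, 2])[0, 1])  # negative, as the simplex constraint forces
```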

When does Bayesian statistics work well? Question: It would be great if we could see an example comparing Bayesian and MLE (frequentist) estimators. Specifically, one where (empirically) the Bayesian estimator is better than the MLE, and vice versa. Also, are there any examples where Bayesian statistics leads to very bad errors? (I'm guessing that the best way to proceed is: analyze the data, pick a very good prior, and then adopt the model. But from the plots in class it seems like with enough data you wash out the prior; does this always happen? How important is the prior then?)

When does Bayesian statistics work well? If most of the signal is in the likelihood, Bayesians and frequentists get similar results. This happens when: lots of data is available; complete data is available (again, in large amounts); the prior is non-informative. Bayesian statistics enables diversity of the parameters, managing uncertainty using distributions.

Hyperparameters Question: What are hyperparameters?

Hyperparameters Priors usually come from a parametric family. Priors define a distribution over parameters, and the parameters controlling that distribution are the hyperparameters. Empirical Bayes vs. hierarchical Bayesian models (contrasted below).
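
To spell out the contrast (a standard formulation, with α denoting the hyperparameters): empirical Bayes fits α by maximizing the marginal likelihood, while a hierarchical Bayesian model places a hyperprior on α and integrates it out.

```latex
\text{Empirical Bayes:}\quad
\hat{\alpha} = \arg\max_{\alpha} p(x \mid \alpha)
             = \arg\max_{\alpha} \int p(x \mid \theta)\, p(\theta \mid \alpha)\, d\theta

\text{Hierarchical Bayes:}\quad
\alpha \sim p(\alpha), \qquad
p(\theta \mid x) = \int p(\theta \mid x, \alpha)\, p(\alpha \mid x)\, d\alpha .
```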

Do we need anything but conjugate priors? Question: Aren't we always selecting distributions to model our data because they are elegant/convenient? I have difficulty understanding why we can assume a variable is distributed as a Gaussian, for example, but should not assume that its mean is also distributed as a Gaussian. Isn't that the whole assumption of a generative model? If the mean is not Gaussian-distributed, we could always put in an uninformative prior. Could you provide an NLP-based example in which a conjugate prior is not an appropriate means of expressing the distribution of the variable in question?

Do we need anything but conjugate priors? It is a balance between efficiency and accuracy of the model. The Dirichlet is sometimes a weak choice of prior.