# Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

## Transcription

HT2015: SC4 Statistical Data Mining and Machine Learning. Dino Sejdinovic, Department of Statistics, Oxford.

### Maximum Likelihood Principle

A generative model for training data $D = \{(x_i, y_i)\}_{i=1}^n$ given a parameter vector $\theta$:

$$y_i \sim (\pi_1, \dots, \pi_K), \qquad x_i \mid y_i \sim g_{y_i}(x) = p(x \mid \phi_{y_i}).$$

- The $k$-th class-conditional density is assumed to have a parametric form $g_k(x) = p(x \mid \phi_k)$, and all parameters are collected in $\theta = (\pi_1, \dots, \pi_K; \phi_1, \dots, \phi_K)$.
- The generative process defines the likelihood function: the joint distribution $p(D \mid \theta)$ of all the observed data given a parameter vector $\theta$.
- Generative learning consists of computing the MLE $\hat\theta$ of $\theta$ based on $D$:

$$\hat\theta = \arg\max_{\theta \in \Theta} p(D \mid \theta).$$

- We then use a plug-in approach to perform classification:

$$f_{\hat\theta}(x) = \arg\max_{k \in \{1, \dots, K\}} P_{\hat\theta}(Y = k \mid X = x) = \arg\max_{k \in \{1, \dots, K\}} \frac{\hat\pi_k \, p(x \mid \hat\phi_k)}{\sum_{j=1}^K \hat\pi_j \, p(x \mid \hat\phi_j)}.$$

### The Bayesian Learning Framework

- Being Bayesian: treat the parameter vector $\theta$ as a random variable. Learning is then the computation of the posterior distribution $p(\theta \mid D)$.
- In addition to the likelihood $p(D \mid \theta)$, we need to specify a prior distribution $p(\theta)$.
- The posterior distribution is then given by Bayes' theorem:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

  - Likelihood: $p(D \mid \theta)$
  - Prior: $p(\theta)$
  - Posterior: $p(\theta \mid D)$
  - Marginal likelihood: $p(D) = \int_\Theta p(D \mid \theta)\, p(\theta)\, d\theta$

Summarizing the posterior:

- Posterior mode: $\hat\theta_{\mathrm{MAP}} = \arg\max_{\theta \in \Theta} p(\theta \mid D)$ (maximum a posteriori).
- Posterior mean: $\hat\theta_{\mathrm{mean}} = \mathbb{E}[\theta \mid D]$.
- Posterior variance: $\mathrm{Var}[\theta \mid D]$.
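The plug-in rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class-conditional densities are taken to be one-dimensional Gaussians (the slides leave the parametric family open), and the function names `fit_mle`, `log_gauss` and `predict` as well as the toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit_mle(xs, ys):
    """MLE of the class priors pi_k and of a 1-D Gaussian (mean, variance) per class.

    The Gaussian form of p(x | phi_k) is an assumption made for this sketch.
    """
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    n = len(xs)
    params = {}
    for k, vals in groups.items():
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)  # MLE divides by n_k
        params[k] = (len(vals) / n, mu, var)                # (pi_k, mu_k, var_k)
    return params

def log_gauss(x, mu, var):
    """Log density of N(mu, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(params, x):
    """Plug-in rule: argmax_k of log pi_k + log p(x | phi_k).

    The denominator sum_j pi_j p(x | phi_j) does not depend on k, so it is dropped.
    """
    return max(
        params,
        key=lambda k: math.log(params[k][0]) + log_gauss(x, *params[k][1:]),
    )

# Hypothetical toy data: two well-separated classes.
xs = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
ys = [0, 0, 0, 1, 1, 1]
params = fit_mle(xs, ys)
print(predict(params, 1.5), predict(params, 7.0))
```

Note that the sketch fails if a class has a single observation, because the ML variance is then zero. This is one small instance of the overconfidence that the Bayesian treatment is meant to address.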

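### A simple example: coin tosses

- We have a coin with probability $\phi$ of coming up heads.
- Model coin tosses as i.i.d. Bernoullis, 1 = head, 0 = tail.
- Estimate $\phi$ given a dataset $D = \{x_i\}_{i=1}^n$ of tosses.

[Figure: posterior density of $\phi$ for several small datasets.]

The likelihood is

$$p(D \mid \phi) = \phi^{n_1} (1 - \phi)^{n_0}, \qquad n_j = \sum_{i=1}^n \mathbb{1}(x_i = j).$$

- Maximum likelihood estimate: $\hat\phi_{\mathrm{ML}} = \dfrac{n_1}{n}$.
- Bayesian approach: treat the unknown parameter as a random variable $\phi$.
- Simple prior: $\phi \sim \mathrm{Uniform}[0, 1]$, i.e., $p(\phi) = 1$ for $\phi \in [0, 1]$.
- Posterior distribution:

$$p(\phi \mid D) = \frac{p(D \mid \phi)\, p(\phi)}{p(D)} = \frac{\phi^{n_1} (1 - \phi)^{n_0}}{p(D)}, \qquad p(D) = \int_0^1 \phi^{n_1} (1 - \phi)^{n_0}\, d\phi = \frac{n_1!\, n_0!}{(n + 1)!}.$$

- The posterior is a $\mathrm{Beta}(n_1 + 1, n_0 + 1)$ distribution, with mean $\hat\phi_{\mathrm{mean}} = \dfrac{n_1 + 1}{n + 2}$.

[Figure: posterior density of $\phi$ for increasingly large datasets.]

The posterior behaves like the ML estimate as the dataset grows, and is peaked at the true value $\phi^* = 0.7$.

All Bayesian reasoning is based on the posterior distribution:

- Posterior mode: $\hat\phi_{\mathrm{MAP}} = \dfrac{n_1}{n}$.
- Posterior mean: $\hat\phi_{\mathrm{mean}} = \dfrac{n_1 + 1}{n + 2}$.
- Posterior variance: $\mathrm{Var}[\phi \mid D] = \dfrac{\hat\phi_{\mathrm{mean}} (1 - \hat\phi_{\mathrm{mean}})}{n + 3}$.
- $(1 - \alpha)$-credible regions: $(l, r) \subseteq [0, 1]$ such that $\int_l^r p(\theta \mid D)\, d\theta = 1 - \alpha$.
- Consistency: assuming that the true parameter value $\phi^*$ is given a non-zero density under the prior, the posterior distribution concentrates around the true value as $n \to \infty$. Rate of convergence?

The posterior predictive distribution is the conditional distribution of $x_{n+1}$ given $D = \{x_i\}_{i=1}^n$:

$$p(x_{n+1} \mid D) = \int_0^1 p(x_{n+1} \mid \phi, D)\, p(\phi \mid D)\, d\phi = \int_0^1 p(x_{n+1} \mid \phi)\, p(\phi \mid D)\, d\phi = (\hat\phi_{\mathrm{mean}})^{x_{n+1}} (1 - \hat\phi_{\mathrm{mean}})^{1 - x_{n+1}}.$$

We predict on new data by averaging the predictive distribution over the posterior. This accounts for uncertainty about $\phi$.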
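The posterior summaries for the coin example can be checked numerically. The following is a minimal sketch assuming the uniform prior used above, so that the posterior is $\mathrm{Beta}(n_1 + 1, n_0 + 1)$. The helper names and the toss sequence are hypothetical, and the credible interval is a central interval found by simple grid integration, which is only one of several ways to pick $(l, r)$.

```python
def coin_posterior_summary(tosses):
    """Summaries of the Beta(n1 + 1, n0 + 1) posterior under a Uniform[0, 1] prior."""
    n = len(tosses)
    n1 = sum(tosses)          # number of heads (x_i = 1)
    n0 = n - n1               # number of tails (x_i = 0)
    mean = (n1 + 1) / (n + 2)
    return {
        "n1": n1,
        "n0": n0,
        "ml": n1 / n,                        # maximum likelihood estimate
        "map": n1 / n,                       # posterior mode (equals ML for a flat prior)
        "mean": mean,                        # posterior mean
        "var": mean * (1 - mean) / (n + 3),  # posterior variance
        "p_next_head": mean,                 # posterior predictive p(x_{n+1} = 1 | D)
    }

def central_credible_interval(n1, n0, alpha=0.05, grid=10000):
    """Central (1 - alpha) credible interval by midpoint-rule integration."""
    pts = [(i + 0.5) / grid for i in range(grid)]
    weights = [p ** n1 * (1 - p) ** n0 for p in pts]  # unnormalised posterior
    total = sum(weights)
    cum, lo, hi = 0.0, None, None
    for p, w in zip(pts, weights):
        cum += w / total
        if lo is None and cum >= alpha / 2:
            lo = p
        if hi is None and cum >= 1 - alpha / 2:
            hi = p
    return lo, hi

tosses = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # hypothetical data: 7 heads, 3 tails
summary = coin_posterior_summary(tosses)
interval = central_credible_interval(summary["n1"], summary["n0"])
print(summary, interval)
```

With 7 heads in 10 tosses the ML estimate is 0.7 while the posterior mean is 8/12, slightly shrunk towards 1/2 by the prior.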

### Beta Distributions

- In this example, the posterior distribution has a known analytic form and is in the same Beta family as the prior: $\mathrm{Uniform}[0, 1] \equiv \mathrm{Beta}(1, 1)$.
- This is an example of a conjugate prior.
- A Beta distribution $\mathrm{Beta}(a, b)$ with parameters $a, b > 0$ is an exponential family distribution with density

$$p(\phi \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\, \Gamma(b)}\, \phi^{a - 1} (1 - \phi)^{b - 1},$$

  where $\Gamma(t) = \int_0^\infty u^{t - 1} e^{-u}\, du$ is the gamma function.

[Figure: Beta densities for several settings of $(a, b)$.]

- If the prior is $\phi \sim \mathrm{Beta}(a, b)$, then the posterior distribution is

$$p(\phi \mid D, a, b) \propto \phi^{a + n_1 - 1} (1 - \phi)^{b + n_0 - 1},$$

  so it is $\mathrm{Beta}(a + n_1, b + n_0)$.
- The hyperparameters $a$ and $b$ are pseudo-counts: an imaginary initial sample that reflects our prior beliefs about $\phi$.

### Bayesian Inference on the Categorical Distribution: Dirichlet Distributions

Suppose we observe $D = \{y_i\}_{i=1}^n$ with $y_i \in \{1, \dots, K\}$, and model them as i.i.d. with pmf $\pi = (\pi_1, \dots, \pi_K)$:

$$p(D \mid \pi) = \prod_{i=1}^n \pi_{y_i} = \prod_{k=1}^K \pi_k^{n_k},$$

with $n_k = \sum_{i=1}^n \mathbb{1}(y_i = k)$ and $\pi_k > 0$, $\sum_{k=1}^K \pi_k = 1$.

- The conjugate prior on $\pi$ is the Dirichlet distribution $\mathrm{Dir}(\alpha_1, \dots, \alpha_K)$ with parameters $\alpha_k > 0$ and density

$$p(\pi) = \frac{\Gamma\!\left(\sum_{k=1}^K \alpha_k\right)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \pi_k^{\alpha_k - 1}$$

  on the probability simplex $\{\pi : \pi_k > 0, \ \sum_{k=1}^K \pi_k = 1\}$.
- The posterior is also Dirichlet: $\mathrm{Dir}(\alpha_1 + n_1, \dots, \alpha_K + n_K)$.
- The posterior mean is

$$\hat\pi_k^{\mathrm{mean}} = \frac{\alpha_k + n_k}{\sum_{j=1}^K (\alpha_j + n_j)}.$$

[Figure: (A) support of the Dirichlet density; (B), (C) Dirichlet densities for two different settings of $\alpha_k$.]
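The conjugate Dirichlet update amounts to adding observed counts to the pseudo-counts. A minimal sketch follows, assuming labels are coded as $0, \dots, K-1$ rather than $1, \dots, K$ for convenience. The function name `dirichlet_posterior` and the data are hypothetical.

```python
from collections import Counter

def dirichlet_posterior(labels, alpha):
    """Posterior Dir(alpha_k + n_k) parameters and the posterior mean of pi.

    labels: observed categories coded 0..K-1 (an assumption of this sketch).
    alpha:  list of K prior pseudo-counts alpha_k > 0.
    """
    counts = Counter(labels)
    post = [a + counts.get(k, 0) for k, a in enumerate(alpha)]
    total = sum(post)
    return post, [p / total for p in post]

# Hypothetical data with K = 3 categories and a flat Dir(1, 1, 1) prior.
labels = [0, 0, 1, 0, 2, 0]
post_alpha, post_mean = dirichlet_posterior(labels, [1.0, 1.0, 1.0])
print(post_alpha, post_mean)
```

Here the counts are $(4, 1, 1)$, so the posterior is $\mathrm{Dir}(5, 2, 2)$ with mean $(5/9, 2/9, 2/9)$. A category that was never observed would still receive positive posterior mean because of its pseudo-count.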

### Naïve Bayes

Return to the spam classification example with two-class naïve Bayes:

$$p(x_i \mid \phi_k) = \prod_{j=1}^p \phi_{kj}^{x_i^{(j)}} (1 - \phi_{kj})^{1 - x_i^{(j)}}.$$

Set $n_k = \sum_{i=1}^n \mathbb{1}\{y_i = k\}$ and $n_{kj} = \sum_{i=1}^n \mathbb{1}(y_i = k, x_i^{(j)} = 1)$. The MLE is

$$\hat\pi_k = \frac{n_k}{n}, \qquad \hat\phi_{kj} = \frac{\sum_{i : y_i = k} x_i^{(j)}}{n_k} = \frac{n_{kj}}{n_k}.$$

One problem: if the $l$-th word did not appear in documents labelled as class $k$, then $\hat\phi_{kl} = 0$ and

$$P(Y = k \mid X = x \text{ with } l\text{-th entry equal to } 1) \propto \hat\pi_k \prod_{j=1}^p \hat\phi_{kj}^{x^{(j)}} (1 - \hat\phi_{kj})^{1 - x^{(j)}} = 0,$$

i.e., we will never attribute a new document containing word $l$ to class $k$ (regardless of the other words in it).

### Bayesian Inference on the Naïve Bayes model

Under the naïve Bayes model, the joint distribution of labels $y_i \in \{1, \dots, K\}$ and data vectors $x_i \in \{0, 1\}^p$ is

$$\prod_{i=1}^n p(x_i, y_i) = \prod_{i=1}^n \prod_{k=1}^K \left( \pi_k \prod_{j=1}^p \phi_{kj}^{x_i^{(j)}} (1 - \phi_{kj})^{1 - x_i^{(j)}} \right)^{\mathbb{1}(y_i = k)} = \prod_{k=1}^K \pi_k^{n_k} \prod_{j=1}^p \phi_{kj}^{n_{kj}} (1 - \phi_{kj})^{n_k - n_{kj}},$$

where $n_k = \sum_{i=1}^n \mathbb{1}(y_i = k)$ and $n_{kj} = \sum_{i=1}^n \mathbb{1}(y_i = k, x_i^{(j)} = 1)$.

- For a conjugate prior, we can use $\mathrm{Dir}((\alpha_k)_{k=1}^K)$ for $\pi$, and $\mathrm{Beta}(a, b)$ for each $\phi_{kj}$ independently.
- Because the likelihood factorizes, the posterior distribution over $\pi$ and $(\phi_{kj})$ also factorizes: the posterior for $\pi$ is $\mathrm{Dir}((\alpha_k + n_k)_{k=1}^K)$, and for $\phi_{kj}$ it is $\mathrm{Beta}(a + n_{kj}, b + n_k - n_{kj})$.

Given $D = \{(x_i, y_i)\}_{i=1}^n$, we want to predict a label $\tilde y$ for a new document $\tilde x$. We can calculate

$$p(\tilde x, \tilde y = k \mid D) = p(\tilde y = k \mid D)\, p(\tilde x \mid \tilde y = k, D)$$

with

$$p(\tilde y = k \mid D) = \frac{\alpha_k + n_k}{\sum_{l=1}^K \alpha_l + n}, \qquad p(\tilde x^{(j)} = 1 \mid \tilde y = k, D) = \frac{a + n_{kj}}{a + b + n_k}.$$

The predicted class is obtained from

$$p(\tilde y = k \mid \tilde x, D) = \frac{p(\tilde y = k \mid D)\, p(\tilde x \mid \tilde y = k, D)}{p(\tilde x \mid D)}.$$

Compared to the ML plug-in estimator, pseudo-counts help to regularize probabilities away from extreme values.

### Bayesian Learning and Regularization

Consider a Bayesian approach to logistic regression: introduce a multivariate normal prior for the weight vector $w \in \mathbb{R}^p$, and a uniform (improper) prior for the offset $b \in \mathbb{R}$. The prior density is

$$p(b, w) = 1 \cdot (2\pi\sigma^2)^{-p/2} \exp\!\left( -\frac{1}{2\sigma^2} \|w\|_2^2 \right).$$

The posterior is

$$p(b, w \mid D) \propto \exp\!\left( -\frac{1}{2\sigma^2} \|w\|_2^2 - \sum_{i=1}^n \log\!\left( 1 + \exp(-y_i (b + w^\top x_i)) \right) \right).$$

- The posterior mode is equivalent to minimizing the $L_2$-regularized empirical risk.
- Regularized empirical risk minimization is (often) equivalent to having a prior and finding a MAP estimate of the parameters:
  - $L_2$ regularization: multivariate normal prior.
  - $L_1$ regularization: multivariate Laplace prior.
- From a Bayesian perspective, the MAP parameters are just one way to summarize the posterior distribution.
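The regularizing effect of pseudo-counts in the naïve Bayes predictive distribution can be seen on a toy example. The sketch below assumes binary word-presence features, classes coded $0, \dots, K-1$, and a symmetric Dirichlet prior with a single value `alpha`. The function names and the four-document corpus are hypothetical. Setting `alpha = a = b = 0` recovers the ML plug-in estimator for comparison.

```python
def fit_counts(X, y, K):
    """Sufficient statistics n_k and n_kj for Bernoulli naive Bayes."""
    p = len(X[0])
    n_k = [0] * K
    n_kj = [[0] * p for _ in range(K)]
    for xi, yi in zip(X, y):
        n_k[yi] += 1
        for j, v in enumerate(xi):
            n_kj[yi][j] += v
    return n_k, n_kj

def predictive(x_new, n_k, n_kj, alpha=1.0, a=1.0, b=1.0):
    """p(y = k | x_new, D) under Dir(alpha, ..., alpha) and Beta(a, b) priors.

    With alpha = a = b = 0 this reduces to the ML plug-in estimate
    (which requires every class to appear in the training data).
    """
    n, K = sum(n_k), len(n_k)
    joint = []
    for k in range(K):
        prob = (alpha + n_k[k]) / (K * alpha + n)        # p(y = k | D)
        for j, v in enumerate(x_new):
            phi = (a + n_kj[k][j]) / (a + b + n_k[k])    # p(x_j = 1 | y = k, D)
            prob *= phi if v == 1 else 1.0 - phi
        joint.append(prob)
    total = sum(joint)                                   # p(x_new | D)
    return [q / total for q in joint]

# Hypothetical corpus: 3 words, 2 classes; word 0 never occurs in class 1.
X = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 1]]
y = [0, 0, 1, 1]
n_k, n_kj = fit_counts(X, y, K=2)

x_new = [1, 0, 1]
ml = predictive(x_new, n_k, n_kj, alpha=0.0, a=0.0, b=0.0)
bayes = predictive(x_new, n_k, n_kj, alpha=1.0, a=1.0, b=1.0)
print(ml)     # [1.0, 0.0]: class 1 is ruled out by a single unseen word
print(bayes)  # class 1 keeps non-zero probability
```

On this corpus the Bayesian predictive assigns class 1 probability 1/3 instead of 0, because the unseen word gets predictive probability $a / (a + b + n_k) = 1/4$ rather than $0$.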

### Bayesian Model Selection

- A model $M$ with a given set of parameters $\theta_M$ consists of both the likelihood $p(D \mid \theta_M)$ and the prior distribution $p(\theta_M)$.
- One example model would consist of all Gaussian mixtures with $K$ components and equal covariance (LDA), $\theta_{\mathrm{LDA}} = (\pi_1, \dots, \pi_K; \mu_1, \dots, \mu_K; \Sigma)$, along with a prior on $\theta$; another would allow different covariances (QDA), $\theta_{\mathrm{QDA}} = (\pi_1, \dots, \pi_K; \mu_1, \dots, \mu_K; \Sigma_1, \dots, \Sigma_K)$.
- The posterior distribution is

$$p(\theta_M \mid D, M) = \frac{p(D \mid \theta_M, M)\, p(\theta_M \mid M)}{p(D \mid M)}.$$

- The marginal probability of the data under $M$ (Bayesian model evidence) is

$$p(D \mid M) = \int_\Theta p(D \mid \theta_M, M)\, p(\theta_M \mid M)\, d\theta_M.$$

### Bayesian Occam's Razor

- Occam's Razor: of two explanations adequate to explain the same set of observations, the simpler should be preferred.
- The model evidence $p(D \mid M)$ is the probability that a set of randomly selected parameter values inside the model would generate dataset $D$.
- Models that are too simple are unlikely to generate the observed dataset.
- Models that are too complex can generate many possible datasets, so again, they are unlikely to generate that particular dataset at random.
- Compare models using their Bayes factors $\dfrac{p(D \mid M)}{p(D \mid M')}$.

[Figure: Bayesian model comparison, Occam's razor at work. Panels show fits for a sequence of models $M$ of increasing order, together with a plot of the model evidence $P(Y \mid M)$ against $M$. Figures by M. Sahani.]

### Discussion

- Use probability distributions to reason about uncertainties of parameters (latent variables and parameters are treated in the same way).
- A model consists of the likelihood function and the prior distribution on parameters: this allows prior beliefs and domain knowledge to be integrated.
- Bayesian computation: most posteriors are intractable, and the posterior needs to be approximated by:
  - Monte Carlo methods (MCMC and SMC).
  - Variational methods (variational Bayes, belief propagation, etc.).
- The prior usually has hyperparameters, i.e., $p(\theta) = p(\theta \mid \psi)$. How to choose $\psi$?
  - Be Bayesian about $\psi$ as well: choose a hyperprior $p(\psi)$ and compute $p(\psi \mid D)$.
  - Maximum Likelihood II: find $\hat\psi = \arg\max_{\psi \in \Psi} p(D \mid \psi)$, where

$$p(D \mid \psi) = \int p(D \mid \theta)\, p(\theta \mid \psi)\, d\theta, \qquad p(\psi \mid D) = \frac{p(D \mid \psi)\, p(\psi)}{p(D)}.$$
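A Bayes factor can be computed in closed form for the coin example, which gives a small illustration of Occam's razor. The two models compared here are hypothetical choices for this sketch and are not taken from the slides: a "fair coin" model with no free parameter, and the uniform-prior Bernoulli model, whose evidence is $n_1!\, n_0! / (n + 1)!$ as derived earlier for the coin example.

```python
from math import factorial

def evidence_fair(n1, n0):
    """p(D | M_fair): the coin is fixed at phi = 1/2, no free parameters."""
    return 0.5 ** (n1 + n0)

def evidence_uniform(n1, n0):
    """p(D | M_unif): Bernoulli likelihood integrated against a Uniform[0, 1] prior."""
    return factorial(n1) * factorial(n0) / factorial(n1 + n0 + 1)

def bayes_factor(n1, n0):
    """Bayes factor of the flexible model against the fair-coin model."""
    return evidence_uniform(n1, n0) / evidence_fair(n1, n0)

print(bayes_factor(5, 5))  # below 1: balanced data favour the simpler model
print(bayes_factor(9, 1))  # above 1: unbalanced data favour the flexible model
```

The flexible model spreads its probability mass over all possible head counts, so it is penalized when the data look like what the simpler model predicts and rewarded only when they do not.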

### Further Reading

- Videolectures by Zoubin Ghahramani: and Graphical models.
- Gelman et al. Bayesian Data Analysis.
- Kevin Murphy. Machine Learning: a Probabilistic Perspective.
- E. T. Jaynes. Probability Theory: The Logic of Science.
