Introduction: exponential family, conjugacy, and sufficiency (9/2/13)

Size: px
Start display at page:

Download "Introduction: exponential family, conjugacy, and sufficiency (9/2/13)"

Transcription

1 STA56: Probabilistic machine learning Introduction: exponential family, conjugacy, and sufficiency 9/2/3 Lecturer: Barbara Engelhardt Scribes: Melissa Dalis, Abhinandan Nath, Abhishek Dubey, Xin Zhou Review In the previous class, we discussed the maximum likelihood estimates MLE and maximum a posteriori MAP estimates. The general ways to calculate them are as follows ĥ MAP arg max P h D h H ĥ MLE arg max P D h h H. How to calculate the MLE? Assume we have the observed data sequence D {x, x 2,..., x n }, with X i independent and identically distributed IID and X i {0, }. We also model the X i s as Bernoulli, X i Berπ, with parameter π [0, ] where π represents the probability that X i is a. Thus the log likelihood for D is lh; D log pd h n log π xi π xi [x i log π + x i log π] The reason we use the logarithm of the likelihood is to facilitate the calculation of the first derivative of the likelihood. The log likelihood is a concave function see Figure. It will first increase as π increases and reach a maximum value and then reduce as π increases. The global maximum value indicates that the corresponding π maximizes the log likelihood of the data; this maximum value will occur at the location where the first derivative of the log likelihood with respect to π is equal to 0 i.e., zero slope of the log likelihood function with respect to π. The logarithm function is monotonically increasing, so maximizing logfx also maximizes fx..2 Examples: Calculating the ĥmle for the Bernoulli Distribution In order to get the maximum of the above concave function, we take the first derivative of lh; D with respect to π, as: lh; D π n x n i x i π π

2 2 Introduction: exponential family, conjugacy, and sufficiency Figure : Example of a concave function, depicting the global maximum at the position where the slope is zero. Then, set this to 0, we get π N π N 0 0 π π N N 0 π N N N 0 N + N 0 N π ˆπ MLE N N + N 0 where we use N and N 0 as shorthand for the number of heads and the number of tails in the data set: N N 0 x i x i 2 The Exponential Family 2. Why exponential family? The exponential family is the only family of distributions with finite-sized sufficient statistics, meaning that we can compress potentially very large amounts of data into a constant-sized summary without loss of information. This is particularly useful for online learning, where the observed data may become huge e.g., your inbox, where each is a sample, and s arrive in an ongoing way.

3 Introduction: exponential family, conjugacy, and sufficiency 3 The exponential family is the only family of distributions for which conjugate priors exist, which simplifies the computation of the posterior. They are the core of generalized linear models and variational methods, which we will learn about in this course. Expectations are simple to compute, as we will see today, making our life simple. This is part of the exponential family machinery that we can exploit in our work to make the computation and mathematics simpler. The exponential family includes many of the distributions we ve seen already, including: normal, exponential, gamma, beta, Dirichlet, Bernoulli, Poisson, and many others. An important distribution that does not strictly belong to the exponential family is the uniform distribution. 2.2 Definition A distribution is in the exponential family if its pdf or pmf px η, for x {x, x 2,..., x n } R n and η H R d can be written in the following form: px η hx exp { η T T x Aη } where: η natural parameters Aη log partition function T x sufficient statistics hx base measure µ mean parameter Response function η µ Link function Figure 2: Link and Response functions The relationship between µ and η is shown in Figure 2: the link function and the response function are invertible functions that allow a mapping between the mean parameters and the natural parameters. This is a one-to-one mapping the reason it is one-to-one will become clear later. This fact is enables us to work in terms of either the the space of natural or mean parameters whichever is mathematically most convenient, since converting between them is straightforward. 2.3 Examples In this section, we will represent the Bernoulli and the Gaussian in the exponential family form. We will also derive the link and response functions.

4 4 Introduction: exponential family, conjugacy, and sufficiency 2.3. Bernoulli distribution in the exponential family We write the probability mass function for a Bernoulli random variable x Berπ in exponential form as below, where π is the mean parameter of the random variable X, e.g., the probability of a single coin flip coming up heads. The general way to derive this is to take the explogpdf of the pmf or pdf and organize the resulting variables to match the form of the exponential family distribution. Berx π π x π x exp [ logπ x π x ] exp [x log π + x log π] exp [xlog π log π + log π] [ ] π exp x log + log π π On comparing the above formula with the exponential family form, we have hx T x x π η log π Aη log π log + expη Converting between η and µ: we can use the logit function, η log π π, to compute the natural parameter η from the mean parameter π. We can use the function π +exp η, called the logistic function, which is just the inverse of the previous relation to compute the mean parameter from the natural parameter. Aη log + e η log π + e η π e η + e η π π + e η Gaussian The Gaussian probability density function can be written in exponential form as follows: N x µ, σ 2 exp[ 2πσ 2 /2 2σ 2 x µ2 ] exp[ 2πσ 2 /2 2σ 2 x2 + µ σ 2 x 2σ 2 µ2 ] exp[ logσ 2π /2 2σ 2 µ2 + µ σ 2 x 2σ 2 x2 ]

5 Introduction: exponential family, conjugacy, and sufficiency 5 On comparing the above with the exponential family form, we have: hx η T x 2π /2 µ σ 2 2σ 2 x x 2 Aη log σ + 2σ 2 µ2 η2 4η 2 2 log 2η Properties of the exponential family For the exponential family, we have hx expη T T x Aηdx as the integrand is a probability distribution which implies Aη log hx expη T T xdx We will use this property below to compute the mean and variance of a distribution in the exponential family. For this reason, the function Aη is often called the log normalizer Expected Value In the exponential family, we can take the derivatives of the log partition function in order to obtain the cumulants of the sufficient statistics. This is why Aη is often called a cumulant function. Below we will show how to calculate the first and second cumulants of a distribution, which are the mean E[ ] and variance var[ ], in this case, of the sufficient statistics. The first derivative of Aη is the expected value, as shown below. Aη log hx exp{η T T x}dx hx exp{η T T x}t xdx expaη hx exp{η T T x Aη}T xdx px ηt xdx E η [T x]

6 6 Introduction: exponential family, conjugacy, and sufficiency Variance It is also simple to calculate the variance, which is equal to the second derivative of the log partition function with respect to the natural parameter, as proved below. 2 Aη T hx exp{η T T x Aη}T x T x Aη dx px ηt xt x A ηdx px ηt 2 xdx A η px ηt xdx E[T 2 x] E 2 [T x] var[t x] Example: the Bernoulli distribution Let X be a Bernoulli random variable with probability π. Then Aη log + e η and T x x as shown in a previous section. As explained above, we can then find the expectation and variance by taking the first and second derivatives of Aη: Aη πη E[x] E[T x] + e η 2 Aη T + e η + e η π π var[t x] + e η The property of the log partition function can be generalize: The m th derivative of Aη is the m th cumulant around the mean, so that the problems of estimating moments integration are simplified to differentiation, making the computation much easier. Also, Aη is a convex function of η, since its second derivative is var[tx, which is always positive semidefinite, and when var[t x] is positive definite, under strict convexity, we are guaranteed that Aη is one-to-one, which means that µ E[T x] Aη is invertible, i.e. one-to-one mapping between the mean parameter and the natural parameter. 3 MLE for the Exponential Family A nice property of the exponential family is that exponential families are closed under sampling. The sufficient statistics T x are finite independent of the size of the data set, i.e., the size of T x does not grow as n D. To see why, consider a sequence of observations X {x, x 2,..., x n } all x i s are i.i.d.. We look at T x as n. We do this by writing the likelihood in the exponential family form. px, x 2,..., x n η n px i η n hx i exp η T n T x i naη

7 Introduction: exponential family, conjugacy, and sufficiency 7 The sufficient statistics are thus {n, n T jx}, where j {,..., T x }, which has exactly T x + components. For examples: the sufficient statistics for Bernoulli distribution are: { n lx i, n}, and the sufficient statistics for Gaussian distribution are: { n x i, n x2 i, n} 3. Computing MLE We define the log likelihood with respect to the exponential family form as: lη; D log pd η n n log hx i + η T T x i naη This is a concave function of η and hence must have a global maximum, which can be found by equating the derivative of the log likelihood with respect to natural parameter η to 0: lη; x,..., x n 0 T x i n Aη 0 E ηmle [T x] T x i n µη MLE n T x i We see that the theoretical expected sufficient statistics of the model are equal to the empirical average of the sufficient statistics. 4 Conjugacy When the prior and the posterior have the same form, we say that the prior is a conjugate prior for the corresponding likelihood. For all distributions in the exponential family, we can derive a conjugate prior for the natural parameters. Let the prior be pη τ, where τ denotes the hyper-parameters. The posterior can be written as: pη X px ηpη τ The likelihood of the exponential family is: n px η hx i exp η T n T x i naη To make the prior conjugate to the likelihood, the prior pη τ must be in the exponential family form with two terms, one term τ {τ,..., τ k } multiplying the η, the other term τ 0 multiplying Aη, as: pη τ exp { η T τ τ 0 Aη }

8 8 Introduction: exponential family, conjugacy, and sufficiency Then the posterior can be written as: pη X px ηpη τ n exp η T T x i naη exp η T τ τ 0 Aη { n } exp η T T x i + τ n + τ 0 Aη So we see the posterior has the same exponential family form as the prior, and the posterior hyper-parameters are adding the sum of the sufficient statistics to hyper-parameters of the conjugate prior. The exponential family is the only family of distributions for which the conjugate priors exist. This is a convenient property of the exponential family because conjugate priors simplify computation of the posterior.

Exponential Families

Exponential Families Exponential Families David M. Blei 1 Introduction We discuss the exponential family, a very flexible family of distributions. Most distributions that you have heard of are in the exponential family. Bernoulli,

More information

Gaussian Models (9/9/13)

Gaussian Models (9/9/13) STA561: Probabilistic machine learning Gaussian Models (9/9/13) Lecturer: Barbara Engelhardt Scribes: Xi He, Jiangwei Pan, Ali Razeen, Animesh Srivastava 1 Multivariate Normal Distribution The multivariate

More information

Generalized Linear Models and Exponential Families

Generalized Linear Models and Exponential Families Generalized Linear Models and Exponential Families David M. Blei COS424 Princeton University April 12, 2012 Generalized Linear Models x n y n β Linear regression and logistic regression are both linear

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

LECTURE 11: EXPONENTIAL FAMILY AND GENERALIZED LINEAR MODELS

LECTURE 11: EXPONENTIAL FAMILY AND GENERALIZED LINEAR MODELS LECTURE : EXPONENTIAL FAMILY AND GENERALIZED LINEAR MODELS HANI GOODARZI AND SINA JAFARPOUR. EXPONENTIAL FAMILY. Exponential family comprises a set of flexible distribution ranging both continuous and

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm 1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable

More information

Lecture 2: Conjugate priors

Lecture 2: Conjugate priors (Spring ʼ) Lecture : Conjugate priors Julia Hockenmaier juliahmr@illinois.edu Siebel Center http://www.cs.uiuc.edu/class/sp/cs98jhm The binomial distribution If p is the probability of heads, the probability

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics DS-GA 100 Lecture notes 11 Fall 016 Bayesian statistics In the frequentist paradigm we model the data as realizations from a distribution that depends on deterministic parameters. In contrast, in Bayesian

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Probabilistic Graphical Models for Image Analysis - Lecture 4

Probabilistic Graphical Models for Image Analysis - Lecture 4 Probabilistic Graphical Models for Image Analysis - Lecture 4 Stefan Bauer 12 October 2018 Max Planck ETH Center for Learning Systems Overview 1. Repetition 2. α-divergence 3. Variational Inference 4.

More information

Chapter 8.8.1: A factorization theorem

Chapter 8.8.1: A factorization theorem LECTURE 14 Chapter 8.8.1: A factorization theorem The characterization of a sufficient statistic in terms of the conditional distribution of the data given the statistic can be difficult to work with.

More information

Lecture 4 September 15

Lecture 4 September 15 IFT 6269: Probabilistic Graphical Models Fall 2017 Lecture 4 September 15 Lecturer: Simon Lacoste-Julien Scribe: Philippe Brouillard & Tristan Deleu 4.1 Maximum Likelihood principle Given a parametric

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Bayesian Regression (1/31/13)

Bayesian Regression (1/31/13) STA613/CBB540: Statistical methods in computational biology Bayesian Regression (1/31/13) Lecturer: Barbara Engelhardt Scribe: Amanda Lea 1 Bayesian Paradigm Bayesian methods ask: given that I have observed

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics October 17, 2017 CS 361: Probability & Statistics Inference Maximum likelihood: drawbacks A couple of things might trip up max likelihood estimation: 1) Finding the maximum of some functions can be quite

More information

Machine Learning CMPT 726 Simon Fraser University. Binomial Parameter Estimation

Machine Learning CMPT 726 Simon Fraser University. Binomial Parameter Estimation Machine Learning CMPT 726 Simon Fraser University Binomial Parameter Estimation Outline Maximum Likelihood Estimation Smoothed Frequencies, Laplace Correction. Bayesian Approach. Conjugate Prior. Uniform

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics March 14, 2018 CS 361: Probability & Statistics Inference The prior From Bayes rule, we know that we can express our function of interest as Likelihood Prior Posterior The right hand side contains the

More information

Lecture 2: Repetition of probability theory and statistics

Lecture 2: Repetition of probability theory and statistics Algorithms for Uncertainty Quantification SS8, IN2345 Tobias Neckel Scientific Computing in Computer Science TUM Lecture 2: Repetition of probability theory and statistics Concept of Building Block: Prerequisites:

More information

Accouncements. You should turn in a PDF and a python file(s) Figure for problem 9 should be in the PDF

Accouncements. You should turn in a PDF and a python file(s) Figure for problem 9 should be in the PDF Accouncements You should turn in a PDF and a python file(s) Figure for problem 9 should be in the PDF Please do not zip these files and submit (unless there are >5 files) 1 Bayesian Methods Machine Learning

More information

Maximum likelihood estimation

Maximum likelihood estimation Maximum likelihood estimation Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Maximum likelihood estimation 1/26 Outline 1 Statistical concepts 2 A short review of convex analysis and optimization

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

Introduction to Bayesian Learning. Machine Learning Fall 2018

Introduction to Bayesian Learning. Machine Learning Fall 2018 Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Generative Models Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1

More information

Lecture 4: Exponential family of distributions and generalized linear model (GLM) (Draft: version 0.9.2)

Lecture 4: Exponential family of distributions and generalized linear model (GLM) (Draft: version 0.9.2) Lectures on Machine Learning (Fall 2017) Hyeong In Choi Seoul National University Lecture 4: Exponential family of distributions and generalized linear model (GLM) (Draft: version 0.9.2) Topics to be covered:

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

Stat 710: Mathematical Statistics Lecture 12

Stat 710: Mathematical Statistics Lecture 12 Stat 710: Mathematical Statistics Lecture 12 Jun Shao Department of Statistics University of Wisconsin Madison, WI 53706, USA Jun Shao (UW-Madison) Stat 710, Lecture 12 Feb 18, 2009 1 / 11 Lecture 12:

More information

Introduction to Probability and Statistics (Continued)

Introduction to Probability and Statistics (Continued) Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

If there exists a threshold k 0 such that. then we can take k = k 0 γ =0 and achieve a test of size α. c 2004 by Mark R. Bell,

If there exists a threshold k 0 such that. then we can take k = k 0 γ =0 and achieve a test of size α. c 2004 by Mark R. Bell, Recall The Neyman-Pearson Lemma Neyman-Pearson Lemma: Let Θ = {θ 0, θ }, and let F θ0 (x) be the cdf of the random vector X under hypothesis and F θ (x) be its cdf under hypothesis. Assume that the cdfs

More information

Machine Learning. Lecture 3: Logistic Regression. Feng Li.

Machine Learning. Lecture 3: Logistic Regression. Feng Li. Machine Learning Lecture 3: Logistic Regression Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2016 Logistic Regression Classification

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a

More information

Probability and Estimation. Alan Moses

Probability and Estimation. Alan Moses Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.

More information

Bayesian Models in Machine Learning

Bayesian Models in Machine Learning Bayesian Models in Machine Learning Lukáš Burget Escuela de Ciencias Informáticas 2017 Buenos Aires, July 24-29 2017 Frequentist vs. Bayesian Frequentist point of view: Probability is the frequency of

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:

More information

15-388/688 - Practical Data Science: Basic probability. J. Zico Kolter Carnegie Mellon University Spring 2018

15-388/688 - Practical Data Science: Basic probability. J. Zico Kolter Carnegie Mellon University Spring 2018 15-388/688 - Practical Data Science: Basic probability J. Zico Kolter Carnegie Mellon University Spring 2018 1 Announcements Logistics of next few lectures Final project released, proposals/groups due

More information

Random variables. DS GA 1002 Probability and Statistics for Data Science.

Random variables. DS GA 1002 Probability and Statistics for Data Science. Random variables DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Motivation Random variables model numerical quantities

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Basics on Probability. Jingrui He 09/11/2007

Basics on Probability. Jingrui He 09/11/2007 Basics on Probability Jingrui He 09/11/2007 Coin Flips You flip a coin Head with probability 0.5 You flip 100 coins How many heads would you expect Coin Flips cont. You flip a coin Head with probability

More information

Point Estimation. Vibhav Gogate The University of Texas at Dallas

Point Estimation. Vibhav Gogate The University of Texas at Dallas Point Estimation Vibhav Gogate The University of Texas at Dallas Some slides courtesy of Carlos Guestrin, Chris Bishop, Dan Weld and Luke Zettlemoyer. Basics: Expectation and Variance Binary Variables

More information

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21 CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Lecture 2: Priors and Conjugacy

Lecture 2: Priors and Conjugacy Lecture 2: Priors and Conjugacy Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de May 6, 2014 Some nice courses Fred A. Hamprecht (Heidelberg U.) https://www.youtube.com/watch?v=j66rrnzzkow Michael I.

More information

Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2

Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2 Logistics CSE 446: Point Estimation Winter 2012 PS2 out shortly Dan Weld Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2 Last Time Random variables, distributions Marginal, joint & conditional

More information

CLASS NOTES Models, Algorithms and Data: Introduction to computing 2018

CLASS NOTES Models, Algorithms and Data: Introduction to computing 2018 CLASS NOTES Models, Algorithms and Data: Introduction to computing 208 Petros Koumoutsakos, Jens Honore Walther (Last update: June, 208) IMPORTANT DISCLAIMERS. REFERENCES: Much of the material (ideas,

More information

Mathematical statistics

Mathematical statistics October 4 th, 2018 Lecture 12: Information Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation Chapter

More information

Algorithms for Uncertainty Quantification

Algorithms for Uncertainty Quantification Algorithms for Uncertainty Quantification Tobias Neckel, Ionuț-Gabriel Farcaș Lehrstuhl Informatik V Summer Semester 2017 Lecture 2: Repetition of probability theory and statistics Example: coin flip Example

More information

Lecture 18: Learning probabilistic models

Lecture 18: Learning probabilistic models Lecture 8: Learning probabilistic models Roger Grosse Overview In the first half of the course, we introduced backpropagation, a technique we used to train neural nets to minimize a variety of cost functions.

More information

5.2 Fisher information and the Cramer-Rao bound

5.2 Fisher information and the Cramer-Rao bound Stat 200: Introduction to Statistical Inference Autumn 208/9 Lecture 5: Maximum likelihood theory Lecturer: Art B. Owen October 9 Disclaimer: These notes have not been subjected to the usual scrutiny reserved

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 3 September 14, Readings: Mitchell Ch Murphy Ch.

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 3 September 14, Readings: Mitchell Ch Murphy Ch. School of Computer Science 10-701 Introduction to Machine Learning aïve Bayes Readings: Mitchell Ch. 6.1 6.10 Murphy Ch. 3 Matt Gormley Lecture 3 September 14, 2016 1 Homewor 1: due 9/26/16 Project Proposal:

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 20: Expectation Maximization Algorithm EM for Mixture Models Many figures courtesy Kevin Murphy s

More information

More Spectral Clustering and an Introduction to Conjugacy

More Spectral Clustering and an Introduction to Conjugacy CS8B/Stat4B: Advanced Topics in Learning & Decision Making More Spectral Clustering and an Introduction to Conjugacy Lecturer: Michael I. Jordan Scribe: Marco Barreno Monday, April 5, 004. Back to spectral

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Parameter Estimation December 14, 2015 Overview 1 Motivation 2 3 4 What did we have so far? 1 Representations: how do we model the problem? (directed/undirected). 2 Inference: given a model and partially

More information

CS6220 Data Mining Techniques Hidden Markov Models, Exponential Families, and the Forward-backward Algorithm

CS6220 Data Mining Techniques Hidden Markov Models, Exponential Families, and the Forward-backward Algorithm CS6220 Data Mining Techniques Hidden Markov Models, Exponential Families, and the Forward-backward Algorithm Jan-Willem van de Meent, 19 November 2016 1 Hidden Markov Models A hidden Markov model (HMM)

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

MIT Spring 2016

MIT Spring 2016 Dr. Kempthorne Spring 2016 1 Outline Building 1 Building 2 Definition Building Let X be a random variable/vector with sample space X R q and probability model P θ. The class of probability models P = {P

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

PROBABILITY DISTRIBUTIONS. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception

PROBABILITY DISTRIBUTIONS. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception PROBABILITY DISTRIBUTIONS Credits 2 These slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Parametric Distributions 3 Basic building blocks: Need to determine given Representation:

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Definitions Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Two hours. To be supplied by the Examinations Office: Mathematical Formula Tables and Statistical Tables THE UNIVERSITY OF MANCHESTER.

Two hours. To be supplied by the Examinations Office: Mathematical Formula Tables and Statistical Tables THE UNIVERSITY OF MANCHESTER. Two hours MATH38181 To be supplied by the Examinations Office: Mathematical Formula Tables and Statistical Tables THE UNIVERSITY OF MANCHESTER EXTREME VALUES AND FINANCIAL RISK Examiner: Answer any FOUR

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models David Rosenberg New York University April 12, 2015 David Rosenberg (New York University) DS-GA 1003 April 12, 2015 1 / 20 Conditional Gaussian Regression Gaussian Regression Input

More information

Introduction to Probability and Statistics (Continued)

Introduction to Probability and Statistics (Continued) Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:

More information

MS&E 226: Small Data. Lecture 11: Maximum likelihood (v2) Ramesh Johari

MS&E 226: Small Data. Lecture 11: Maximum likelihood (v2) Ramesh Johari MS&E 226: Small Data Lecture 11: Maximum likelihood (v2) Ramesh Johari ramesh.johari@stanford.edu 1 / 18 The likelihood function 2 / 18 Estimating the parameter This lecture develops the methodology behind

More information

Probability reminders

Probability reminders CS246 Winter 204 Mining Massive Data Sets Probability reminders Sammy El Ghazzal selghazz@stanfordedu Disclaimer These notes may contain typos, mistakes or confusing points Please contact the author so

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 2013-14 We know that X ~ B(n,p), but we do not know p. We get a random sample

More information

Bayesian Linear Regression [DRAFT - In Progress]

Bayesian Linear Regression [DRAFT - In Progress] Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory

More information

Northwestern University Department of Electrical Engineering and Computer Science

Northwestern University Department of Electrical Engineering and Computer Science Northwestern University Department of Electrical Engineering and Computer Science EECS 454: Modeling and Analysis of Communication Networks Spring 2008 Probability Review As discussed in Lecture 1, probability

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 22, 2011 Today: MLE and MAP Bayes Classifiers Naïve Bayes Readings: Mitchell: Naïve Bayes and Logistic

More information

Naïve Bayes. Jia-Bin Huang. Virginia Tech Spring 2019 ECE-5424G / CS-5824

Naïve Bayes. Jia-Bin Huang. Virginia Tech Spring 2019 ECE-5424G / CS-5824 Naïve Bayes Jia-Bin Huang ECE-5424G / CS-5824 Virginia Tech Spring 2019 Administrative HW 1 out today. Please start early! Office hours Chen: Wed 4pm-5pm Shih-Yang: Fri 3pm-4pm Location: Whittemore 266

More information

Conjugate Priors, Uninformative Priors

Conjugate Priors, Uninformative Priors Conjugate Priors, Uninformative Priors Nasim Zolaktaf UBC Machine Learning Reading Group January 2016 Outline Exponential Families Conjugacy Conjugate priors Mixture of conjugate prior Uninformative priors

More information

y Xw 2 2 y Xw λ w 2 2

y Xw 2 2 y Xw λ w 2 2 CS 189 Introduction to Machine Learning Spring 2018 Note 4 1 MLE and MAP for Regression (Part I) So far, we ve explored two approaches of the regression framework, Ordinary Least Squares and Ridge Regression:

More information

Mathematical statistics

Mathematical statistics October 1 st, 2018 Lecture 11: Sufficient statistic Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

Generalized Linear Models (1/29/13)

Generalized Linear Models (1/29/13) STA613/CBB540: Statistical methods in computational biology Generalized Linear Models (1/29/13) Lecturer: Barbara Engelhardt Scribe: Yangxiaolu Cao When processing discrete data, two commonly used probability

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Some Probability and Statistics

Some Probability and Statistics Some Probability and Statistics David M. Blei COS424 Princeton University February 12, 2007 D. Blei ProbStat 01 1 / 42 Who wants to scribe? D. Blei ProbStat 01 2 / 42 Random variable Probability is about

More information

Bayesian Methods. David S. Rosenberg. New York University. March 20, 2018

Bayesian Methods. David S. Rosenberg. New York University. March 20, 2018 Bayesian Methods David S. Rosenberg New York University March 20, 2018 David S. Rosenberg (New York University) DS-GA 1003 / CSCI-GA 2567 March 20, 2018 1 / 38 Contents 1 Classical Statistics 2 Bayesian

More information

Aarti Singh. Lecture 2, January 13, Reading: Bishop: Chap 1,2. Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell

Aarti Singh. Lecture 2, January 13, Reading: Bishop: Chap 1,2. Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell Machine Learning 0-70/5 70/5-78, 78, Spring 00 Probability 0 Aarti Singh Lecture, January 3, 00 f(x) µ x Reading: Bishop: Chap, Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell Announcements Homework

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Lecture 4: Probabilistic Learning

Lecture 4: Probabilistic Learning DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods

More information

LECTURE 2 NOTES. 1. Minimal sufficient statistics.

LECTURE 2 NOTES. 1. Minimal sufficient statistics. LECTURE 2 NOTES 1. Minimal sufficient statistics. Definition 1.1 (minimal sufficient statistic). A statistic t := φ(x) is a minimial sufficient statistic for the parametric model F if 1. it is sufficient.

More information