LECTURE 5 NOTES

1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter $\theta \in \Theta$ is considered a fixed quantity. In the Bayesian approach, it is considered stochastic. Its distribution over $\Theta$ before observing any data, called the prior, reflects the investigator's prior beliefs about $\theta$. After obtaining a sample $x$, the investigator updates his or her beliefs by Bayes rule, hence the name.

Let $\pi(\theta)$ be the prior density on $\Theta$ and $\{f_\theta(x)\}$ be a set of densities on $\mathcal{X}$. The distribution of $\theta \mid x$ is called the posterior distribution and is given by Bayes rule:
\[ \pi(\theta \mid x) = \frac{f(x \mid \theta)\pi(\theta)}{f_\pi(x)}, \]
where $f_\pi(x) := \int_\Theta f(x \mid \theta)\pi(\theta)\,d\theta$ is the marginal density of $x$. The posterior reflects the investigator's updated beliefs about $\theta$ after observing $x$. Its expected value is often used as a point estimate of $\theta$.

Example 1.1. Let $x_i \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(p)$. A Bayesian investigator assumes a $\mathrm{beta}(a, b)$ prior on $p$. It is possible to show that the joint density of $t = \sum_{i \le n} x_i$ and $p$ is
\[ f(t, p) = \binom{n}{t} \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, p^{t + a - 1} (1 - p)^{n - t + b - 1}. \]
The marginal density of $t$ is
\[ f_{\mathrm{beta}(a,b)}(t) = \binom{n}{t} \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(t + a)\Gamma(n - t + b)}{\Gamma(n + a + b)}, \]
and the posterior density of $p$ is
\[ \pi(p \mid t) = \frac{\Gamma(n + a + b)}{\Gamma(t + a)\Gamma(n - t + b)}\, p^{t + a - 1} (1 - p)^{n - t + b - 1}, \]
which is the density of a $\mathrm{beta}(t + a, n - t + b)$ random variable. The Bayes estimator is
\[ \hat{p}_B := \mathbb{E}_\pi[p \mid t] = \frac{t + a}{a + b + n}. \]
The Bayes estimator $\hat{p}_B$ has the form
\[ \hat{p}_B = \frac{n}{a + b + n}\,\frac{t}{n} + \frac{a + b}{a + b + n}\,\frac{a}{a + b}. \tag{1.1} \]
Thus it is a convex combination of the sample mean $t/n$ and the prior mean $a/(a + b)$. As the sample size grows, $\hat{p}_B$ converges to $t/n$.
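To make Example 1.1 concrete, here is a minimal numerical sketch (Python with numpy; the sample size, true $p$, and prior parameters are illustrative choices, not values from the notes) that computes the Bayes estimator and checks that it matches the convex combination in (1.1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate Bernoulli data (n and p_true are illustrative values).
n, p_true = 50, 0.3
x = rng.binomial(1, p_true, size=n)
t = x.sum()

# beta(a, b) prior; the posterior is beta(t + a, n - t + b).
a, b = 2.0, 2.0

# Bayes estimator: the posterior mean (t + a) / (a + b + n).
p_bayes = (t + a) / (a + b + n)

# The same quantity as the convex combination (1.1) of the sample
# mean t / n and the prior mean a / (a + b).
w = n / (a + b + n)
p_convex = w * (t / n) + (1 - w) * (a / (a + b))

print(p_bayes, p_convex)  # identical up to floating point
```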

Example 1.2. Let $x \sim \mathcal{N}(\mu, 1)$. A Bayesian investigator assumes a $\mathcal{N}(a, b^2)$ prior on $\mu$. The posterior density of $\mu$ after observing $x$ is
\[ \pi(\mu \mid x) \propto \underbrace{\exp\left(-\tfrac{1}{2}(x - \mu)^2\right)}_{f(x \mid \mu)}\, \underbrace{\exp\left(-\tfrac{1}{2b^2}(\mu - a)^2\right)}_{\pi(\mu)}. \]
It is possible to show that the posterior is Gaussian:
\[ \pi(\mu \mid x) \propto \exp\left(-\frac{(\mu - \tilde{a})^2}{2\tilde{b}^2}\right), \]
where
\[ \tilde{b} = \left(1 + \tfrac{1}{b^2}\right)^{-1/2}, \quad \tilde{a} = \tilde{b}^2\left(x + \tfrac{a}{b^2}\right) = \frac{b^2 x + a}{1 + b^2}. \]
The Bayes estimator is the posterior mean, which is also a convex combination of the sample $x$ and the prior mean $a$.

In practice, it is usually not possible to derive the expected value of the posterior in closed form, and we must evaluate Bayes estimators by Monte Carlo methods.
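The Monte Carlo remark can be illustrated on Example 1.2 itself. The sketch below (numpy; the prior parameters and the simulated observation are illustrative assumptions) compares the closed-form posterior mean $\tilde{a}$ with a self-normalized importance-sampling estimate that draws from the prior and weights by the likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)

# N(a, b^2) prior on mu; one observation x ~ N(mu, 1).
# (a, b, and the true mu are illustrative values.)
a, b = 0.0, 2.0
x = rng.normal(1.5, 1.0)

# Closed-form posterior N(a_tilde, b_tilde^2) from Example 1.2.
b_tilde = (1 + 1 / b**2) ** -0.5      # posterior standard deviation
a_tilde = (b**2 * x + a) / (1 + b**2)  # posterior mean

# Monte Carlo check of the posterior mean: draw from the prior and
# weight by the likelihood (self-normalized importance sampling).
mus = rng.normal(a, b, size=200_000)
w = np.exp(-0.5 * (x - mus) ** 2)
print(a_tilde, (w * mus).sum() / w.sum())  # should roughly agree
```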

In the preceding two examples, we chose the expected value of the posterior as the point estimator. Another option is the mode of the posterior, also known as the maximum a posteriori (MAP) estimator:
\[ \hat{\theta}_{\mathrm{MAP}} \in \arg\max_{\theta \in \Theta} \pi(\theta \mid x) = \arg\max_{\theta \in \Theta} f(x \mid \theta)\pi(\theta). \tag{1.2} \]
The main advantage of MAP estimators over Bayes estimators is that they are usually easier to evaluate than Bayes estimators. For this reason, MAP estimators are sometimes known as the poor man's Bayes estimator.

Example 1.3. Consider the Bayesian linear model
\[ \beta \sim \mathcal{N}(0, \lambda^{-1} I_p), \quad y_i \mid \{x_i, \beta\} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(x_i^T \beta, 1). \]
We remark that the task is to learn the distribution of $y_i \mid x_i$; the distribution of $x_i$ is irrelevant. The posterior is
\[ \pi(\beta \mid X, y) \propto \exp\left(-\tfrac{1}{2}\sum_{i \le n}(y_i - x_i^T\beta)^2\right) \exp\left(-\tfrac{\lambda}{2}\|\beta\|_2^2\right). \]
Maximizing the posterior is equivalent to maximizing
\[ \log \pi(\beta \mid X, y) = -\tfrac{1}{2}\sum_{i \le n}(y_i - x_i^T\beta)^2 - \tfrac{\lambda}{2}\|\beta\|_2^2, \]
which is akin to ridge regression.

We remark that MAP estimators are generally interpretable as regularized MLEs: the regularizer is the log-density of the prior. Another prominent example is the lasso estimator:
\[ \hat{\beta}_{\mathrm{lasso}} \in \arg\min_{\beta \in \mathbb{R}^p} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1, \]
which is the MAP estimator under a double-exponential prior.
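As a sketch of the ridge connection in Example 1.3 (numpy; the dimensions, $\lambda$, and the simulated data are illustrative assumptions), the MAP estimate is the stationary point of the log-posterior, i.e. the solution of $(X^TX + \lambda I)\beta = X^Ty$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated design and response under the Bayesian linear model
# (n, p, and lam are illustrative choices, not from the notes).
n, p, lam = 100, 5, 1.0
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(size=n)

# MAP estimate under the N(0, lam^{-1} I) prior: the ridge solution
# beta_map = (X^T X + lam I)^{-1} X^T y.
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_map)
```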

A subtle drawback of the MAP estimator is that it is not equivariant under reparametrization. Let $\pi(\theta \mid x)$ be the posterior density of $\theta$ and $\hat{\theta}_{\mathrm{MAP}}$ be its mode. The posterior density of $\eta = g(\theta)$, where $g : \Theta \to H$ is invertible, is
\[ \pi'(\eta \mid x) = \pi(g^{-1}(\eta) \mid x)\left|\tfrac{d}{d\eta}\, g^{-1}(\eta)\right|. \]
Due to the presence of the Jacobian term, the maximizer of $\pi'(\eta \mid x)$ is generally not $g(\hat{\theta}_{\mathrm{MAP}})$.

Example 1.4. Consider a Bayesian investigation of the expected value of a $\mathrm{Ber}(p)$ random variable. The investigator assumes a uniform prior on $[0, 1]$. Before collecting any data, the posterior is just the prior, and a MAP estimator of $p$ is any point in $[0, 1]$. However, the prior of $q = \sqrt{p}$ is
\[ \pi'(q) = \pi(q^2)\, 2q = 2q\,\mathbf{1}_{[0,1]}(q^2), \]
which is maximized at $1$. The prior of $r = 1 - \sqrt{1 - p}$ is
\[ \pi'(r) = \pi\big(1 - (1 - r)^2\big)\, 2(1 - r), \]
which is maximized at $0$.

A key design choice in a Bayesian investigation is the choice of prior. If prior knowledge is available, it is often desirable to incorporate this knowledge into the choice of prior. We summarize some popular approaches to choosing a prior.

1. Empirical Bayes: estimate (the parameters of) the prior from data. The marginal distribution of $x$ usually depends on the parameters of the prior, which can be used to estimate them. We give an example of an empirical Bayes estimator later in the course.
2. Hierarchical Bayes: impose a hyperprior on the parameters of the prior. The choice of hyperprior usually has a smaller impact on the Bayes estimator than the choice of prior.
3. Robust Bayes: look for estimators that perform well under all priors in some family of priors.

Bayes estimators are less popular in practice than maximum likelihood or MoM estimators because the choice of prior is usually subjective. Two investigators given the same dataset may arrive at different estimates of a parameter due to differences in their priors. Thus, unlike the MLE or MoM estimators, Bayes estimators are inherently subjective.

2. Evaluating point estimators.

Definition 2.1. The mean squared error (MSE) of an estimator $\delta(x)$ of a parameter $\theta$ is $\mathbb{E}_\theta\big[\|\delta(x) - \theta\|_2^2\big]$.¹

¹ The subscript on $\mathbb{E}$ means the expectation is taken with respect to $F_\theta \in \mathcal{F}$.

The MSE is the average discrepancy between the estimator $\delta(x)$ and the unknown parameter $\theta$ in the $\ell_2$ norm. For any estimator, the MSE decomposes into bias and variance terms:
\[ \begin{aligned} \mathbb{E}_\theta\|\delta(x) - \theta\|_2^2 &= \mathbb{E}_\theta\big\|\delta(x) - \mathbb{E}_\theta[\delta(x)] + \mathbb{E}_\theta[\delta(x)] - \theta\big\|_2^2 \\ &= \mathbb{E}_\theta\big\|\delta(x) - \mathbb{E}_\theta[\delta(x)]\big\|_2^2 + 2\,\mathbb{E}_\theta\Big[\big(\delta(x) - \mathbb{E}_\theta[\delta(x)]\big)^T\big(\mathbb{E}_\theta[\delta(x)] - \theta\big)\Big] + \big\|\mathbb{E}_\theta[\delta(x)] - \theta\big\|_2^2 \\ &= \underbrace{\mathbb{E}_\theta\big\|\delta(x) - \mathbb{E}_\theta[\delta(x)]\big\|_2^2}_{\text{variance}} + \underbrace{\big\|\mathbb{E}_\theta[\delta(x)] - \theta\big\|_2^2}_{\text{bias}^2}, \end{aligned} \tag{2.1} \]
where the cross term vanishes because $\mathbb{E}_\theta\big[\delta(x) - \mathbb{E}_\theta[\delta(x)]\big] = 0$.

The bias term is the difference between the expected value of the estimator and the target and is a measure of the estimator's accuracy. An estimator whose bias vanishes for any $\theta \in \Theta$ is unbiased.

MSE is by no means the only error metric that practitioners consider. It is a special case of a risk function, and the study of the performance of estimators evaluated by risk functions is a branch of decision theory.
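The decomposition (2.1) is easy to verify empirically. A minimal sketch (numpy; the values of $n$, $p$, $a$, $b$ are illustrative) estimates the MSE, variance, and squared bias of the Bayes estimator from Example 1.1 by simulation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Check MSE = variance + bias^2 for the Bayes estimator of
# Example 1.1 (n, p, a, b are illustrative values).
n, p, a, b = 20, 0.3, 2.0, 2.0
reps = 200_000
t = rng.binomial(n, p, size=reps)     # sufficient statistic per replicate
delta = (t + a) / (a + b + n)         # Bayes estimate per replicate

mse = np.mean((delta - p) ** 2)
var = np.var(delta)
bias2 = (np.mean(delta) - p) ** 2
print(mse, var + bias2)  # the two should agree up to MC error
```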

Decision theory formalizes a statistical investigation as a decision problem. After observing $x \sim F \in \mathcal{F}$, the investigator makes a decision regarding the unknown parameter $\theta \in \Theta$. The set of allowed decisions is called the action space $A$. In point estimation, the decision is typically the point estimate, so $A$ is often just $\Theta$. After a decision is made, the investigator incurs a loss given by a loss function $\ell : \Theta \times A \to \mathbb{R}_+$. By convention, bad decisions incur higher losses, and the investigator seeks to minimize his or her losses. Some examples of loss functions are

1. square loss: $\ell(\theta, a) = \|\theta - a\|_2^2$,
2. absolute value loss: $\ell(\theta, a) = \|\theta - a\|_1$,
3. zero-one loss: $\ell(\theta, a) = 1 - \mathbf{1}_{\{\theta\}}(a)$,
4. log loss: $\ell(\theta, a) = \log\frac{f_\theta(x)}{f_a(x)}$.

The performance of a decision rule $\delta : \mathcal{X} \to A$ is summarized by its risk function:
\[ \mathrm{Risk}_\delta(\theta) = \mathbb{E}_\theta[\ell(\theta, \delta(x))]. \]
The MSE of a point estimator $\delta(x)$ is the risk function of the decision rule $\delta$ under the square loss function $\ell(\theta, a) = \|\theta - a\|_2^2$.

2.1. Rao-Blackwellization. As we shall see, there is an intimate connection between data reduction and good point estimators (in the decision-theoretic sense). When the loss function is convex (in $\delta$), it is always possible to reduce the risk of an estimator by Rao-Blackwellization.

Theorem 2.2 (Rao-Blackwell). Let $t$ be a sufficient statistic for the model $\mathcal{F} := \{F_\theta : \theta \in \Theta\}$. For any loss function $\ell : \Theta \times A \to \mathbb{R}_+$ that is convex in its second argument, we have
\[ \mathbb{E}_\theta\,\ell\big(\theta, \mathbb{E}[\delta(x) \mid t]\big) \le \mathbb{E}_\theta\,\ell(\theta, \delta(x)) \]
for any $\theta \in \Theta$. Further, if $\ell$ is strictly convex in its second argument, the inequality is strict unless $\delta(x) \overset{\text{a.s.}}{=} \mathbb{E}[\delta(x) \mid t]$.

Proof. By the tower property of conditional expectations,
\[ \mathbb{E}_\theta[\ell(\theta, \delta(x))] = \mathbb{E}_\theta\big[\mathbb{E}[\ell(\theta, \delta(x)) \mid t]\big], \]
which, by Jensen's inequality, is at least $\mathbb{E}_\theta\,\ell\big(\theta, \mathbb{E}[\delta(x) \mid t]\big)$. If $\ell$ is strictly convex in its second argument, Jensen's inequality is strict unless $\delta(x) \overset{\text{a.s.}}{=} \mathbb{E}[\delta(x) \mid t]$.
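A minimal simulation illustrating the theorem (numpy; the values of $n$, $p$, and the number of replicates are illustrative): for i.i.d. $\mathrm{Ber}(p)$ data, Rao-Blackwellizing the crude estimator $\delta(x) = x_1$ with respect to the sufficient statistic $t = \sum_i x_i$ gives $\mathbb{E}[x_1 \mid t] = t/n$, the sample mean, whose risk under square loss is visibly smaller:

```python
import numpy as np

rng = np.random.default_rng(4)

# Rao-Blackwellize the crude estimator delta(x) = x_1 of p for iid
# Ber(p) data: t = sum(x) is sufficient and E[x_1 | t] = t / n.
n, p, reps = 20, 0.3, 200_000
x = rng.binomial(1, p, size=(reps, n))

delta = x[:, 0]               # crude unbiased estimator
delta_rb = x.sum(axis=1) / n  # its Rao-Blackwellization

# Under square loss, the Rao-Blackwellized risk is smaller by a
# factor of about n: p(1-p) versus p(1-p)/n.
print(np.mean((delta - p) ** 2), np.mean((delta_rb - p) ** 2))
```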

The Rao-Blackwell theorem shows that it is possible to reduce the risk of any point estimator by conditioning on a sufficient statistic. The estimator $\delta_{\mathrm{RB}}(t) := \mathbb{E}[\delta(x) \mid t]$ is sometimes called a Rao-Blackwellized estimator. By modifying the proof of the Rao-Blackwell theorem, it is possible to show that if $t$ is a minimal sufficient statistic and $\tilde{t}$ is another sufficient statistic, then
\[ \mathbb{E}_\theta\,\ell\big(\theta, \mathbb{E}[\delta(x) \mid t]\big) \le \mathbb{E}_\theta\,\ell\big(\theta, \mathbb{E}[\delta(x) \mid \tilde{t}\,]\big) \]
for any $\theta \in \Theta$.

The astute reader may observe that the proof of the Rao-Blackwell theorem is valid whether $t$ is a sufficient statistic or not. Thus conditioning on any statistic reduces the risk. Although this is true, when $t$ is not sufficient, the law of $\delta(x) \mid t$ generally depends on the unknown parameter $\theta$. Thus the Rao-Blackwellized estimator is, in fact, not an estimator.

2.2. Admissibility. Recall the risk of an estimator $\delta$,
\[ R_\delta(\theta) = \mathbb{E}_\theta[\ell(\theta, \delta(x))], \]
which led us to study risk-based notions of optimality for estimators. Unfortunately, there is usually no uniformly optimal estimator.

Example 2.3. Let $x_i \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(p)$. The MLE of $p$ is $\hat{p}_{\mathrm{ML}} = \bar{x}$, and the Bayes estimator is
\[ \hat{p}_B = \mathbb{E}_\pi[p \mid t] = \frac{n\bar{x} + a}{a + b + n}. \tag{2.2} \]
Since the MLE is unbiased, its MSE is its variance, which is $\frac{p(1-p)}{n}$. The MSE of the Bayes estimator is
\[ \begin{aligned} \mathrm{MSE}_{\hat{p}_B}(p) &= \mathrm{var}_p\big[\hat{p}_B\big] + \big(\mathbb{E}_p[\hat{p}_B] - p\big)^2 \\ &= \mathrm{var}_p\left[\frac{n\bar{x} + a}{a + b + n}\right] + \left(\mathbb{E}_p\left[\frac{n\bar{x} + a}{a + b + n}\right] - p\right)^2 \\ &= \frac{np(1 - p)}{(a + b + n)^2} + \left(\frac{np + a}{a + b + n} - p\right)^2, \end{aligned} \]
which is a quadratic function of $p$. It is possible to show that by choosing $a = b = \sqrt{n/4}$, the MSE is constant in $p$:
\[ \mathrm{MSE}_{\hat{p}_B}(p) = \frac{n}{4(n + \sqrt{n})^2}. \]
We observe that the MSE of the MLE is smaller than that of $\hat{p}_B$ when $p$ is near $0$ or $1$, but larger when $p$ is near $\frac{1}{2}$. Thus neither estimator dominates the other uniformly on $p \in [0, 1]$.
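A short sketch (numpy; $n = 25$ is an illustrative choice) tabulates both risk curves from Example 2.3 and shows the crossover: the Bayes estimator's risk is flat in $p$, while the MLE's risk dips below it near the endpoints and rises above it near $\frac{1}{2}$:

```python
import numpy as np

# Risk curves from Example 2.3: MLE vs the constant-risk Bayes
# estimator with a = b = sqrt(n / 4).
n = 25
a = b = np.sqrt(n / 4)
p = np.linspace(0, 1, 11)

mse_mle = p * (1 - p) / n
# var + bias^2 for the Bayes estimator, on a common denominator.
mse_bayes = (n * p * (1 - p)
             + (n * p + a - (a + b + n) * p) ** 2) / (a + b + n) ** 2

for pi, m1, m2 in zip(p, mse_mle, mse_bayes):
    print(f"p={pi:.1f}  MLE={m1:.5f}  Bayes={m2:.5f}")
```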

Although there is usually no uniformly optimal estimator, there are uniformly suboptimal estimators.

Definition 2.4. A decision rule $\delta$ is inadmissible if there is another decision rule $\tilde{\delta}$ such that $R_{\tilde{\delta}}(\theta) \le R_\delta(\theta)$ for any $\theta \in \Theta$ and $R_{\tilde{\delta}}(\theta) < R_\delta(\theta)$ at some $\theta \in \Theta$. Otherwise, $\delta$ is admissible.

Intuitively, an inadmissible estimator is uniformly bested by another estimator, so, from a decision-theoretic perspective, there is no good reason to use inadmissible estimators. We remark that although inadmissible estimators are bad, admissible estimators are not necessarily good. For example, an estimator is admissible as soon as it has (strictly) smaller risk than any other estimator at a single $\theta \in \Theta$; its risk at any other $\theta$ may be arbitrarily bad.

3. Bayes estimators.

Definition 3.1. Let $\pi$ be a prior on the parameter space $\Theta$. The Bayes risk of a decision rule $\delta$ is
\[ \mathbb{E}_\pi[R_\delta(\theta)] = \int_\Theta R_\delta(\theta)\pi(\theta)\,d\theta. \]

It is possible to minimize the Bayes risk to derive a Bayes estimator. We begin by noticing that the Bayes risk has the form
\[ \begin{aligned} \mathbb{E}_\pi[R_\delta(\theta)] &= \int_\Theta \int_{\mathcal{X}} \ell(\theta, \delta(x)) f(x \mid \theta)\pi(\theta)\,dx\,d\theta \\ &= \int_\Theta \int_{\mathcal{X}} \ell(\theta, \delta(x))\pi(\theta \mid x) f_\pi(x)\,dx\,d\theta \\ &= \int_{\mathcal{X}} \mathrm{PostRisk}_\delta(x) f_\pi(x)\,dx, \end{aligned} \]
where
\[ \mathrm{PostRisk}_\delta(x) = \mathbb{E}_{\theta \sim \pi(\cdot \mid x)}[\ell(\theta, \delta(x))] = \int_\Theta \ell(\theta, \delta(x))\pi(\theta \mid x)\,d\theta \]
is the posterior risk of the estimator $\delta$. We observe that the posterior risk does not depend on $\theta$; it depends only on $x$. By choosing $\delta(x)$ to minimize the posterior risk, we minimize the Bayes risk. In practice, it is only necessary to minimize the posterior risk at the observed $x$. That is,
\[ \delta_B(x) := \arg\min_{a \in A}\, \mathbb{E}_{\theta \sim \pi(\cdot \mid x)}[\ell(\theta, a)]. \]
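A numerical sketch of this recipe (numpy and scipy.stats; the values of $n$, $t$, $a$, $b$ and the grid sizes are illustrative assumptions): discretize the beta posterior from Example 1.1, evaluate the posterior risk under square loss on a grid of candidate actions, and check that the minimizer matches the posterior mean:

```python
import numpy as np
from scipy.stats import beta as beta_dist

# Beta posterior of Example 1.1 (n, t, a, b are illustrative).
n, t, a, b = 20, 7, 2.0, 2.0
post = beta_dist(t + a, n - t + b)

theta = np.linspace(0.0005, 0.9995, 2000)  # integration grid
w = post.pdf(theta)
w /= w.sum()                               # discretized posterior

# PostRisk(a) = E[(theta - a)^2 | x], approximated by the grid sum.
actions = np.linspace(0.0, 1.0, 1001)
post_risk = [(w * (theta - act) ** 2).sum() for act in actions]

# The grid minimizer should match the posterior mean (t+a)/(a+b+n).
print(actions[np.argmin(post_risk)], (t + a) / (a + b + n))
```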

Example 3.2. The posterior MSE of an estimator $\delta$ is
\[ \mathrm{PostMSE}_\delta(x) = \mathbb{E}_{\theta \sim \pi(\cdot \mid x)}\|\theta - \delta(x)\|_2^2 = \mathbb{E}_{\theta \sim \pi(\cdot \mid x)}\|\theta\|_2^2 - 2\,\mathbb{E}_{\theta \sim \pi(\cdot \mid x)}[\theta]^T \delta(x) + \|\delta(x)\|_2^2, \]
which is a quadratic function of $\delta(x)$. It is minimized at $\delta_B(x) = \mathbb{E}_{\theta \sim \pi(\cdot \mid x)}[\theta]$. Thus the Bayes estimator under square loss is the expected value of the posterior. Example 2.3 is a concrete example of a Bayes estimator.

Example 3.3. Instead of the square loss function, consider the zero-one loss function
\[ \ell(\theta, \delta) = 1 - \mathbf{1}_{\{\theta\}}(\delta) = \begin{cases} 0 & \theta = \delta \\ 1 & \text{otherwise.} \end{cases} \]
If $\Theta$ is finite, the posterior risk is
\[ \mathrm{PostRisk}_\delta(x) = \sum_{\theta \in \Theta} \ell(\theta, \delta(x))\pi(\theta \mid x) = 1 - \pi(\delta(x) \mid x). \]
To minimize the posterior risk, we should choose $\delta(x)$ so that $\pi(\delta(x) \mid x)$ is as large as possible. Thus the Bayes estimator under the zero-one loss function is the MAP estimator.

When minimization of the posterior risk cannot be done analytically, it is often done numerically:
\[ \delta_B(x) \in \arg\min_{\delta \in \Theta} \mathrm{PostRisk}_\delta(x) = \arg\min_{\delta \in \Theta} \log\big(\mathrm{PostRisk}_\delta(x)\big). \]
We remark that the problem is similar to evaluating the MLE. Thus most numerical methods for evaluating the MLE are directly applicable to minimizing the posterior risk.

Before moving on to minimax estimators, we comment on the admissibility of Bayes estimators.

Theorem 3.4. A unique Bayes estimator is admissible.

Proof. Suppose there is another estimator $\tilde{\delta}$ such that $R_{\tilde{\delta}}(\theta) \le R_{\delta_B}(\theta)$ for any $\theta \in \Theta$ and $R_{\tilde{\delta}}(\theta) < R_{\delta_B}(\theta)$ at some $\theta$. Then
\[ \mathbb{E}_\pi[R_{\tilde{\delta}}(\theta)] \le \mathbb{E}_\pi[R_{\delta_B}(\theta)]. \tag{3.1} \]
Thus $\tilde{\delta}(x)$ is also Bayes. Since $\delta_B$ is unique, $\delta_B(x) = \tilde{\delta}(x)$ for all $x \in \mathcal{X}$, which contradicts $R_{\tilde{\delta}}(\theta) < R_{\delta_B}(\theta)$ at some $\theta$.
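A minimal sketch of Example 3.3 on a finite parameter space (numpy; the prior, the three candidate values of $p$, and the data are illustrative assumptions): minimizing the posterior risk under zero-one loss amounts to picking the posterior mode:

```python
import numpy as np

# Finite Theta: three candidate Ber(p) parameter values with an
# illustrative prior; observe t successes in n trials.
thetas = np.array([0.2, 0.5, 0.8])
prior = np.array([0.3, 0.4, 0.3])
n, t = 10, 7

# Posterior mass: prior times binomial likelihood, normalized.
lik = thetas**t * (1 - thetas) ** (n - t)
post = prior * lik
post /= post.sum()

# Zero-one posterior risk is 1 - post, so its argmin is the MAP.
map_est = thetas[np.argmax(post)]
print(post, map_est)
```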

Yuekai Sun
Berkeley, California
November 3, 2015
