
IFT 6269: Probabilistic Graphical Models — Fall 2016
Lecture 5 — September 19
Lecturer: Simon Lacoste-Julien          Scribe: Sébastien Lachapelle
Disclaimer: These notes have only been lightly proofread.

5.1 Statistical Decision Theory

5.1.1 The Formal Setup

Statistical decision theory is a formal setup in which to analyze decision rules in the context of uncertainty. The standard statistical problems of estimation and hypothesis testing fit in it, but we will see that the supervised learning problem from machine learning can also be cast in it (though people are less used to seeing this in machine learning).

Let $D$ be a random observation (in ML, it would be a training dataset, which is why we use $D$). Let $\mathcal{D}$ be the sample space for $D$ (the set of all possible values for $D$). We say that $D \sim P$, where $P$ is a distribution over the set $\mathcal{D}$. We suppose that $P$ belongs to a certain set of distributions $\mathcal{P}$. Sometimes the distribution is parametrized by a parameter $\theta$, in which case we write it $P_\theta$. $P$ represents the (unknown) state of the world; it is the source of uncertainty.

Let $\mathcal{A}$ be the set of all possible actions; we denote by $a$ an action in $\mathcal{A}$. Let $L : \mathcal{P} \times \mathcal{A} \to \mathbb{R}$ be the (statistical) loss function, so that $L(P, a)$ is the loss incurred by taking action $a$ when the actual distribution is $P$ (when the world is $P$). Let $\delta : \mathcal{D} \to \mathcal{A}$ be a decision rule.

Less formally: we observe something ($D$) from nature ($P$), but we do not actually know how mother nature generates observations (we don't know $P$). Even so, we suppose that $P$ belongs to a certain set of distributions $\mathcal{P}$. We are in a context where we have to choose an action $a$ among a certain set of actions $\mathcal{A}$. Given that we choose action $a$ and that the actual state of the world is $P$, we must pay a certain cost $L(P, a)$. Since we get to observe a realization of $P$, it makes sense to base our action on this observation; this decision process is described by $\delta$.

Important subtle point: often $P$ will describe an i.i.d. process, e.g. $D = (X_1, \ldots, X_n)$ where $X_i \overset{iid}{\sim} P_0$. In this case, we often just write the dependence of the loss in terms of $P_0$ instead of the full joint $P$, i.e. $L(P, a) = L(P_0, a)$. Note that the framework also works for non-i.i.d. data, but in this lecture we only consider i.i.d. data, and so when we write $L(P, a)$, we mean $P$ as a distribution on $X_i$, not the joint one.

Examples

A) Estimation: Suppose $P_\theta$ belongs to a parametrized family of distributions $\mathcal{P}$, where $\theta$ belongs to a parameter space denoted by $\Theta$. We set $\mathcal{A} = \Theta$; in this context, $\delta$ is an estimator of $\theta$. We use the famous "squared loss":
$$L(P_\theta, a) = \|\theta - a\|_2^2.$$
We sometimes write $L(\theta, a)$ instead of $L(P_\theta, a)$ where this simplification applies. Remember that $\delta(D) = a$; in this specific case, we can write $\delta(D) = \hat{\theta}$. Supposing more specifically that $D = (X_1, \ldots, X_n)$ where $X_i \overset{iid}{\sim} P_\theta$, we have:
$$L(\theta, \delta(D)) = \|\theta - \delta(D)\|_2^2.$$
As a concrete example, suppose that $P_\theta$ belongs to the Gaussian family $\mathcal{P} = \{\mathcal{N}(\cdot\,; \mu, 1) : \mu \in \mathbb{R}\}$, so that $\Theta = \mathbb{R}$. We could choose our decision function to be $\delta(D) = \frac{1}{n}\sum_i X_i$ (a code sketch follows these examples).

B) Hypothesis Testing: $\mathcal{A} = \{0, 1\}$, where 0 might mean not rejecting the null hypothesis and 1 might mean rejecting it. In this context, $\delta$ describes a statistical test.

C) Machine Learning for prediction: Let $D = ((X_1, Y_1), \ldots, (X_n, Y_n))$, with $X_i \in \mathcal{X}$ and $Y_i \in \mathcal{Y}$ for all $i$. We call $\mathcal{X}$ the input space and $\mathcal{Y}$ the output space. Let $(X_i, Y_i) \overset{iid}{\sim} P_0$; then $D \sim P$, where $P$ is the joint distribution over all the i.i.d. $(X_i, Y_i)$ couples. Here $\mathcal{A} = \mathcal{Y}^{\mathcal{X}}$ (the set of functions that map $\mathcal{X}$ to $\mathcal{Y}$). In the prediction setting, we define a prediction loss function $\ell : \mathcal{Y}^2 \to \mathbb{R}$, usually a measure of the distance between a given prediction and its associated ground truth. The statistical loss we would like to minimize is
$$L(P, f) = \mathbb{E}_{P_0}[\ell(Y, f(X))].$$
This is traditionally called the generalization error, and is also often called the risk in machine learning. Simon calls it the Vapnik risk to distinguish it from the (frequentist) risk from statistics that we will define later. In this context, the decision rule $\delta$ is actually a learning algorithm that outputs a function $\hat{f} \in \mathcal{Y}^{\mathcal{X}}$. Equivalently, we can write $\delta(D) = \hat{f}$.
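To make Example A concrete, here is a minimal Python sketch; the true mean, sample size, and random seed are illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_star = 1.5    # the unknown "state of the world" theta (illustrative value)
n = 100
D = rng.normal(loc=mu_star, scale=1.0, size=n)   # D = (X_1, ..., X_n), X_i ~ N(mu_star, 1)

def delta(D):
    """Decision rule: estimate the mean by the sample average."""
    return D.mean()

def loss(theta, a):
    """Squared loss L(theta, a) = ||theta - a||_2^2 (scalar case)."""
    return (theta - a) ** 2

a = delta(D)
print(f"action a = {a:.3f}, loss L(theta, a) = {loss(mu_star, a):.5f}")
```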

5.1.2 Procedure Analysis

Given this framework, how can we compare different rules (i.e. procedures)? Given $\delta_1$ and $\delta_2$, how can I know which is better for a given application?

(Frequentist) Risk

The first property used to analyze a procedure is called the risk (or the frequentist risk):
$$R(P, \delta) = \mathbb{E}_{D \sim P}[L(P, \delta(D))].$$

Remarks: The risk is a function of $P$, and we don't know what $P$ is in practice, so we never really know the value of this function for a given rule $\delta$. On the other hand, this property is a theoretical analysis device: we can make statements like "for $P$ in some family $\mathcal{P}$, procedure $\delta_1$ has lower risk than procedure $\delta_2$ (and is thus better in some sense)". It is also important to distinguish the (frequentist) risk from the generalization error (the Vapnik risk).

The lecture showed a graph of the risk profiles $R(P_\theta, \delta_1)$ and $R(P_\theta, \delta_2)$ as functions of $\theta$ (not reproduced here); for simplicity, $P_\theta$ is a parametrized distribution with $\theta \in \mathbb{R}$. The picture illustrates the fact that sometimes there is no clear winner when comparing two rules: $\delta_1$ is the best choice for values of $\theta$ near 0, but for values of $\theta$ far from 0, $\delta_2$ is the best choice. The problem is that we don't know the value of $\theta$, so we don't know which rule to pick. We will see later that there are, in fact, ways to "bypass" this problem. (The simulation sketch after the next definitions reproduces this phenomenon numerically.)

Domination and Admissibility

We say that a decision rule $\delta_1$ dominates another decision rule $\delta_2$ for a given loss function $L$ if
$$R(P, \delta_1) \le R(P, \delta_2) \;\;\forall P \in \mathcal{P} \quad \text{and} \quad \exists P \in \mathcal{P} \text{ s.t. } R(P, \delta_1) < R(P, \delta_2).$$
We say that a decision rule $\delta$ is admissible if $\nexists\, \delta_0$ s.t. $\delta_0$ dominates $\delta$.
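Here is a minimal Monte Carlo sketch of the crossing risk profiles described above, in the Gaussian-mean setting of Example A; the shrinkage factor 0.5, the sample size, and the grid of $\theta$ values are illustrative assumptions. Neither rule dominates the other: the shrinkage rule $\delta_1$ wins near $\theta = 0$ and loses away from it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 20, 20000                  # sample size, Monte Carlo replicates

for theta in [0.0, 0.25, 0.5, 1.0, 2.0]:
    D = rng.normal(theta, 1.0, size=(n_rep, n))   # n_rep datasets D ~ P_theta
    means = D.mean(axis=1)
    risk1 = np.mean((0.5 * means - theta) ** 2)   # delta_1(D) = shrunken mean
    risk2 = np.mean((means - theta) ** 2)         # delta_2(D) = sample mean
    print(f"theta={theta:4.2f}  R(P_theta, delta1)={risk1:.4f}  R(P_theta, delta2)={risk2:.4f}")
```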

PAC theory (aside point, for your culture)

An alternative to the (frequentist) risk approach is the PAC approach, which is common in machine learning theory. PAC stands for "probably approximately correct": instead of looking at the average loss over datasets as the frequentist risk does, it looks at a tail bound on the loss, i.e. a bound $B$ such that we know the loss will be smaller than $B$ with high probability (this is why it is called PAC). Given a certain loss function $L$, a decision function $\delta$, a distribution $P$ over all possible $D$, and a small real number $\epsilon > 0$, PAC theory seeks a bound $B$ such that:
$$P\{L(P, \delta(D)) \ge B\} \le \epsilon.$$
Note that we could write $B_\epsilon(P, \delta)$ instead of $B$ to emphasize the fact that this bound depends on $P$, $\delta$ and $\epsilon$.

The lecture showed a graph of the density of $L(P, \delta(D))$ given $P$ (not reproduced here). Remember that $L(P, \delta(D))$ is a random variable, since $D$ is a random variable. The graph allows us to compare the frequentist risk approach (a mean) with the PAC approach (a tail bound).

Comparing Decision Rules

We would like to be able to compare rules with one another to figure out which one to choose. If we could find a rule that dominates all the other rules, we would choose that rule; but often we can't find such a rule. This is why there is no universally best procedure. The frequentist approach is to analyze different properties of decision rules, so that the user can then choose the rule whose properties they prefer. We now present two standard ways in statistics to reduce a risk profile curve to a scalar, after which we can compare rules with each other and get a notion of "optimality":

A) The Minimax Criterion: Following this criterion, the optimal rule $\delta$ is given by:
$$\delta_{minimax} \in \arg\min_\delta \max_{P \in \mathcal{P}} R(P, \delta).$$
In words, the minimax optimal rule is the rule that minimizes the risk we would obtain in the worst possible scenario.
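A sketch of the minimax criterion in the same Gaussian-mean setting, comparing the two rules $\delta(D) = c \cdot \text{mean}(D)$ with $c = 0.5$ and $c = 1$; the closed-form risk $R = c^2/n + (1-c)^2\theta^2$ (variance plus squared bias, cf. the bias-variance decomposition in Section 5.1.4) and the bounded grid are illustrative assumptions.

```python
import numpy as np

n = 20
thetas = np.linspace(-2.0, 2.0, 81)       # illustrative bounded parameter grid

def risk(c, theta):
    # Risk of delta(D) = c * mean(D) under squared loss:
    # variance c^2/n plus squared bias (1 - c)^2 * theta^2.
    return c**2 / n + (1 - c)**2 * theta**2

rules = {"delta1 (c=0.5)": 0.5, "delta2 (c=1.0)": 1.0}
worst_case = {name: risk(c, thetas).max() for name, c in rules.items()}
print(worst_case)
print("minimax choice:", min(worst_case, key=worst_case.get))
# The shrinkage rule's risk grows like theta^2, while the unbiased rule has
# constant risk 1/n, so delta2 is the minimax choice on this grid.
```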

B) Weighting: This criterion requires that we define a weighting $\pi$ over $\Theta$ (which can be interpreted as a prior). Formally:
$$\delta_{weight} \in \arg\min_\delta \int_\Theta R(P_\theta, \delta)\,\pi(\theta)\,d\theta.$$
Intuitively, when considering a certain rule, we average its risk over all the possible $\theta$, putting more weight on the $\theta$'s we believe are more important for the phenomenon we are studying. We can then compare the averages with each other and pick the rule with the lowest one.

C) Bayesian Statistical Decision Theory: The last two criteria made no use of the fact that we observed data (that we observed a realization of $D$). The Bayesian optimality criterion makes use of this information. Before defining it, let's define the Bayesian posterior risk:
$$R_B(\delta \mid D) = \int_\Theta L(P_\theta, \delta(D))\, p(\theta \mid D)\, d\theta,$$
where $p(\theta \mid D)$ is the posterior for a given prior $\pi(\theta)$. The optimal rule according to the Bayesian criterion is:
$$\delta_{bayes}(D) \in \arg\min_\delta R_B(\delta \mid D).$$

As you recall, in the Bayesian philosophy we treat all uncertain quantities with probabilities. The posterior on $\theta$ summarizes all the information we need about the uncertain quantity $\theta$. As a Bayesian, the statistical loss $L(\theta, a)$ then tells us how to act optimally: we simply need to find the action that minimizes the Bayesian posterior risk (as $\theta$ is integrated out, there is no more uncertainty about $\theta$!). $\delta_{bayes}$ is thus the only procedure to use as a Bayesian! Life is quite easy for a Bayesian (no need to worry about other properties like minimax, frequentist risk, etc.). The Bayesian does not care about what could happen for other $D$'s; they only care that you are given a specific observation $D$ and want to know how to act given this $D$. But a frequentist can still decide to analyze a Bayesian procedure from the frequentist risk perspective. In particular, one can show that most Bayesian procedures are admissible (unlike the MLE!). Also, one can show that the Bayesian procedure is the optimal procedure when using the weighted risk summary with a weight function $\pi(\theta)$ that matches the prior $p(\theta)$. This can be seen as a simple application of Fubini's theorem; the lecture summarized it with a diamond-shaped diagram (not reproduced here) relating $p(\cdot \mid \theta)$, the pdf of $D$ conditional on $\theta$; the marginal $p(\cdot) = \int_\Theta p(\cdot \mid \theta)\,\pi(\theta)\,d\theta$; and the posterior $\pi(\cdot \mid D)$ on $\theta$. The Bayesian procedure has the property that it minimizes the weighted risk summary.
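Written out, the Fubini argument is as follows (a sketch; swapping the order of integration is assumed to be licensed):
$$\int_\Theta R(P_\theta, \delta)\,\pi(\theta)\,d\theta
= \int_\Theta \int_{\mathcal{D}} L(P_\theta, \delta(D))\, p(D \mid \theta)\, dD\,\pi(\theta)\, d\theta
= \int_{\mathcal{D}} \underbrace{\left[\int_\Theta L(P_\theta, \delta(D))\, \pi(\theta \mid D)\, d\theta\right]}_{R_B(\delta \mid D)} p(D)\, dD,$$
using $p(D \mid \theta)\,\pi(\theta) = \pi(\theta \mid D)\,p(D)$. Since $\delta_{bayes}(D)$ minimizes the inner integral separately for each $D$, it also minimizes the weighted risk on the left.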

Exercise: Given $\mathcal{A} = \Theta$ and $L(\theta, a) = \|\theta - a\|^2$, show that $\delta_{bayes}(D) = \mathbb{E}[\theta \mid D]$ (the posterior mean).

Examples of Estimators

1. Maximum likelihood estimator (MLE):
$$\delta_{MLE}(D) = \hat{\theta}_{MLE} \in \arg\max_\theta L_D(\theta),$$
where $L_D(\theta)$ is the likelihood function given the observation $D$.

2. Maximum a posteriori (MAP):
$$\delta_{MAP}(D) = \hat{\theta}_{MAP} \in \arg\max_\theta \pi(\theta \mid D),$$
where $\pi(\theta \mid D)$ is the posterior distribution over all the possible $\theta$.

3. Method of Moments: Suppose we have $D = (X_1, \ldots, X_n)$ with $X_i \overset{iid}{\sim} P_\theta$ being scalar random variables, where $\theta = (\theta_1, \ldots, \theta_k)$ is a vector. The idea is to find a bijective function $h$ that maps any $\theta$ to the "vector of moments" $M_k(X_1) = (\mathbb{E}[X_1], \mathbb{E}[X_1^2], \ldots, \mathbb{E}[X_1^k])$. Basically,
$$h(\theta) = M_k(X_1).$$
Since $h$ is bijective, we can invert it:
$$h^{-1}(M_k(X_1)) = \theta.$$
The intuition is that to approximate $\theta$, we can evaluate the function $h^{-1}$ at the empirical moment vector $\hat{M}_k(X_1) = (\hat{\mathbb{E}}[X_1], \hat{\mathbb{E}}[X_1^2], \ldots, \hat{\mathbb{E}}[X_1^k])$, where $\hat{\mathbb{E}}[X_1^j] = \frac{1}{n}\sum_i x_i^j$ for a given $j$. This gives the method of moments estimator:
$$\hat{\theta}_{MM} = h^{-1}(\hat{M}_k(X_1)).$$

Example: Suppose that $X_i \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2)$. We have
$$\mathbb{E}[X_1] = \mu, \qquad \mathbb{E}[X_1^2] = \sigma^2 + \mu^2.$$
This defines our function $h$. Now, we invert the relation:
$$\mu = \mathbb{E}[X_1], \qquad \sigma^2 = \mathbb{E}[X_1^2] - \mathbb{E}^2[X_1].$$
Then, we finally replace the moments by the empirical moments to get our estimator:
$$\hat{\mu}_{MM} = \hat{\mathbb{E}}[X_1], \qquad \hat{\sigma}^2_{MM} = \hat{\mathbb{E}}[X_1^2] - \hat{\mathbb{E}}^2[X_1].$$
Here, this MM estimator is the same as the ML estimator. This illustrates a property of the exponential family (we will see this later in this class).
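A small numerical check of the Gaussian example (a sketch; the true parameters, sample size, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n = 2.0, 3.0, 10000            # illustrative true parameters
x = rng.normal(mu, np.sqrt(sigma2), size=n)

m1 = x.mean()                # empirical first moment  E_hat[X_1]
m2 = (x ** 2).mean()         # empirical second moment E_hat[X_1^2]

mu_mm = m1                   # invert h: mu = E[X_1]
sigma2_mm = m2 - m1 ** 2     # invert h: sigma^2 = E[X_1^2] - E^2[X_1]

# For the Gaussian, the MLE coincides with these expressions
# (np.var uses the 1/n normalization by default):
print(mu_mm, sigma2_mm, x.mean(), x.var())
```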

Note: The method of moments is quite useful for latent variable models (e.g. mixtures of Gaussians); see spectral methods or tensor decomposition methods in the recent literature (see: Tensor Decompositions for Learning Latent Variable Models, by Anandkumar et al., JMLR 2014).

4. ERM for prediction function estimation: In this context, $\mathcal{A} = \mathcal{F} \subseteq \mathcal{Y}^{\mathcal{X}}$ and the decision rule is a function $\delta : \mathcal{D} \to \mathcal{Y}^{\mathcal{X}}$. $\mathcal{F}$ is called the hypothesis space. We are looking for a function $f^* \in \mathcal{Y}^{\mathcal{X}}$ that minimizes the generalization error. Formally,
$$f^* \in \arg\min_{f \in \mathcal{Y}^{\mathcal{X}}} \mathbb{E}_{P_0}[\ell(Y, f(X))].$$
Since we don't know what $P$ is, we can't compute $\mathbb{E}_{P_0}[\ell(Y, f(X))]$. As a replacement, we consider an estimator $\hat{f}_{ERM}$ of $f^*$:
$$\hat{f}_{ERM} \in \arg\min_{f \in \mathcal{F}} \hat{\mathbb{E}}[\ell(y, f(x))],$$
where $\hat{\mathbb{E}}[\ell(y, f(x))] = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))$. ERM stands for empirical risk minimizer (here, the risk is the Vapnik risk).
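As an illustration, here is a minimal ERM sketch with the squared prediction loss $\ell(y, y') = (y - y')^2$ and the hypothesis space $\mathcal{F}$ of linear functions, for which the empirical risk minimizer is ordinary least squares; the data-generating process below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n)   # illustrative P_0

# Empirical risk of a linear f_w: (1/n) * sum_i (y_i - x_i . w)^2.
# Ordinary least squares minimizes it over w:
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)

f_hat = lambda x: x @ w_erm               # the learned prediction function f_hat_ERM
empirical_risk = np.mean((y - f_hat(X)) ** 2)
print(w_erm, empirical_risk)
```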

5.1.3 Convergence of Random Variables

Convergence in distribution: We say that a sequence of random variables $\{X_n\}_{n=1}^\infty$ converges in distribution to a random variable $X$ if
$$\lim_{n \to \infty} F_n(x) = F(x)$$
for all $x$ at which $F$ is continuous, where $F_n$ and $F$ are the cumulative distribution functions of $X_n$ and of $X$ respectively. In such a case, we write $X_n \overset{d}{\to} X$.

Convergence in $L^k$: We say that a sequence of random variables $\{X_n\}_{n=1}^\infty$ converges in $L^k$ norm to a random variable $X$ if
$$\lim_{n \to \infty} \mathbb{E}\big[\,|X_n - X|^k\,\big] = 0.$$
In such a case, we write $X_n \overset{L^k}{\to} X$.

Convergence in probability: We say that a sequence of random variables $\{X_n\}_{n=1}^\infty$ converges in probability to a random variable $X$ if
$$\forall \epsilon > 0, \quad \lim_{n \to \infty} P\{|X_n - X| \ge \epsilon\} = 0.$$
In such a case, we write $X_n \overset{p}{\to} X$.

Note: It turns out that $L^2$ convergence implies convergence in probability. (The reverse implication isn't true.)

5.1.4 Properties of Estimators

Suppose that $D_n = (X_1, \ldots, X_n)$ and that $X_i \overset{iid}{\sim} P_\theta$. We write $\hat{\theta}_n = \delta_n(D_n)$ for an estimator of $\theta$; the subscript emphasizes the fact that the estimator's value depends on the number of observations $n$.

Bias of an estimator:
$$\mathrm{Bias}(\hat{\theta}_n) = \mathbb{E}[\hat{\theta}_n - \theta].$$
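As a concrete instance of this definition, the following sketch estimates the bias of the $1/n$ variance estimator $\hat{\sigma}^2_n = \frac{1}{n}\sum_i (X_i - \bar{X})^2$ for Gaussian data, whose expectation is $\frac{n-1}{n}\sigma^2$, i.e. a bias of $-\sigma^2/n$; the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep, sigma2 = 5, 200000, 1.0
D = rng.normal(0.0, np.sqrt(sigma2), size=(n_rep, n))

sigma2_hat = D.var(axis=1, ddof=0)    # (1/n) sum_i (x_i - xbar)^2 on each dataset
print("E_hat[sigma2_hat] :", sigma2_hat.mean())           # approx (n-1)/n = 0.8
print("empirical bias    :", sigma2_hat.mean() - sigma2)  # approx -sigma2/n = -0.2
```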

Standard Statistical Consistency: We say that an estimator $\hat{\theta}_n$ is consistent for a parameter $\theta$ if $\hat{\theta}_n$ converges in probability to $\theta$ (i.e. $\hat{\theta}_n \overset{p}{\to} \theta$).

$L^2$ Consistency: We say that an estimator $\hat{\theta}_n$ is $L^2$ consistent for a parameter $\theta$ if $\hat{\theta}_n$ converges in $L^2$ norm to $\theta$ (i.e. $\hat{\theta}_n \overset{L^2}{\to} \theta$).

Bias-Variance Decomposition

Consider $L(\theta, \hat{\theta}_n) = \|\theta - \hat{\theta}_n\|_2^2$. We can express the frequentist risk as a function of the bias and the variance of $\hat{\theta}_n$:
$$R(P_\theta, \delta_n) = \mathbb{E}_{D \sim P_\theta}\big[\|\theta - \hat{\theta}_n\|^2\big] \tag{5.1}$$
$$= \mathbb{E}\big[\|\theta - \mathbb{E}[\hat{\theta}_n] + \mathbb{E}[\hat{\theta}_n] - \hat{\theta}_n\|^2\big] \tag{5.2}$$
$$= \mathbb{E}\big[\|\theta - \mathbb{E}[\hat{\theta}_n]\|^2\big] + \mathbb{E}\big[\|\mathbb{E}[\hat{\theta}_n] - \hat{\theta}_n\|^2\big] + 2\,\mathbb{E}\big[\langle \theta - \mathbb{E}[\hat{\theta}_n],\; \mathbb{E}[\hat{\theta}_n] - \hat{\theta}_n \rangle\big] \tag{5.3}$$
$$= \|\theta - \mathbb{E}[\hat{\theta}_n]\|^2 + \mathbb{E}\big[\|\mathbb{E}[\hat{\theta}_n] - \hat{\theta}_n\|^2\big] + 2\,\langle \theta - \mathbb{E}[\hat{\theta}_n],\; \mathbb{E}[\hat{\theta}_n] - \mathbb{E}[\hat{\theta}_n] \rangle \tag{5.4}$$
$$= \|\theta - \mathbb{E}[\hat{\theta}_n]\|^2 + \mathbb{E}\big[\|\mathbb{E}[\hat{\theta}_n] - \hat{\theta}_n\|^2\big] \tag{5.5}$$
$$= \|\mathrm{Bias}(\hat{\theta}_n)\|^2 + \mathrm{Var}[\hat{\theta}_n] \tag{5.6}$$

Remark: Other loss functions would potentially lead to a different function of the bias and the variance, thus expressing a different trade-off between the bias and the variance.
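A quick Monte Carlo check of (5.6) for the shrinkage estimator $\hat{\theta}_n = 0.5\,\bar{X}$ in the Gaussian-mean model (illustrative values; the two printed numbers should agree up to simulation noise):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, n_rep = 1.0, 20, 200000
D = rng.normal(theta, 1.0, size=(n_rep, n))
theta_hat = 0.5 * D.mean(axis=1)             # shrinkage estimator on each dataset

risk = np.mean((theta - theta_hat) ** 2)     # E ||theta - theta_hat||^2
bias = theta_hat.mean() - theta              # E[theta_hat] - theta
var = theta_hat.var()                        # Var[theta_hat]
print(risk, bias ** 2 + var)                 # both approx 0.25 + 0.0125
```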

The James-Stein Estimator

The James-Stein estimator of the mean of $X_i \overset{iid}{\sim} \mathcal{N}_d(\mu, \sigma^2 I)$ dominates the MLE for the squared loss when $d \ge 3$ (thus showing that the MLE is inadmissible in this case). It turns out that the James-Stein estimator is biased, but its variance is sufficiently smaller than the MLE's variance to offset the bias.

Properties of the MLE

Assuming sufficient regularity conditions on $\Theta$ and $p(x; \theta)$, we have:

1. Consistency: $\hat{\theta}_{MLE} \overset{p}{\to} \theta$.

2. CLT: $\sqrt{n}(\hat{\theta}_{MLE} - \theta) \overset{d}{\to} \mathcal{N}(0, I(\theta)^{-1})$, where $I$ is the Fisher information matrix. (A simulation sketch at the end of these notes checks this numerically.)

3. Asymptotic optimality: Among all unbiased estimators of a scalar parameter $\theta$, $\hat{\theta}_{MLE}$ is the one with the lowest variance asymptotically. This follows from the Cramér-Rao bound, which can be stated as follows: let $\hat{\theta}$ be an unbiased estimator of a scalar parameter $\theta$; then
$$\mathrm{Var}[\hat{\theta}] \ge \frac{1}{I(\theta)}.$$
Note that this result can also be stated in the multivariate case.

4. Invariance to reparametrization: Suppose we have a bijection $f : \Theta \to \Theta'$; then
$$\widehat{f(\theta)}_{MLE} = f(\hat{\theta}_{MLE}).$$
This result can be generalized to the case where $f$ isn't a bijection. Suppose $g : \Theta \to \Lambda$ (bijective or not). We define the profile likelihood:
$$L(\eta) = \max_{\theta \,:\, \eta = g(\theta)} p(\text{data}; \theta).$$
We also define the generalized MLE in this case as:
$$\hat{\eta}_{MLE} \in \arg\max_{\eta \in g(\Theta)} L(\eta).$$
Then we have $\hat{\eta}_{MLE} = g(\hat{\theta}_{MLE})$; this is called a plug-in estimator, because you simply plug the value $\hat{\theta}_{MLE}$ into the function $g$ to get its MLE.

Examples:
(a) $\widehat{\sigma^2}_{MLE} = (\hat{\sigma}_{MLE})^2$
(b) $\widehat{\sin(\sigma^2)}_{MLE} = \sin(\widehat{\sigma^2}_{MLE})$
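Properties 1-2 are easy to check numerically for the Bernoulli model, where the MLE is the sample mean and the Fisher information is $I(\theta) = \frac{1}{\theta(1-\theta)}$; a minimal sketch (the parameter value and the simulation sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, n_rep = 0.3, 500, 20000
X = rng.binomial(1, theta, size=(n_rep, n))
theta_mle = X.mean(axis=1)                 # Bernoulli MLE = sample mean

z = np.sqrt(n) * (theta_mle - theta)       # should be approx N(0, I(theta)^{-1})
print("empirical Var[sqrt(n)(mle - theta)]:", z.var())
print("I(theta)^{-1} = theta*(1 - theta)  :", theta * (1 - theta))
```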