Statistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003


Recall From Last Time

Bayesian expected loss is
$$\rho(\pi, a) = E_\pi[L(\theta, a)] = \int L(\theta, a)\, dF_\pi(\theta)$$
Conditioned on evidence in data $X$, we average with respect to the posterior:
$$\rho(\pi, a \mid X) = E_{\pi(\cdot \mid X)}[L(\theta, a)] = \int L(\theta, a)\, p(\theta \mid X)\, d\theta$$
Classical formulation: $\delta : \mathcal{X} \to \mathcal{A}$ is a decision rule, with risk function
$$R(\theta, \delta) = E_X[L(\theta, \delta(X))] = \int_{\mathcal{X}} L(\theta, \delta(x))\, dF_X(x)$$

Bayes Risk

For a prior $\pi$, the Bayes risk of a decision function is
$$r(\pi, \delta) = E_\pi[R(\theta, \delta)] = E_\pi\big[ E_X[L(\theta, \delta(X))] \big]$$
Therefore, the classical and Bayesian approaches define different risks, by averaging:
Bayesian expected loss: averages over $\theta$
Risk function: averages over $X$
Bayes risk: averages over both $X$ and $\theta$

Example

Take $X \sim N(\theta, 1)$, and the problem of estimating $\theta$ under squared loss $L(\theta, a) = (a - \theta)^2$. Consider decision rules of the form $\delta_c(x) = cx$. A calculation gives
$$R(\theta, \delta_c) = c^2 + (1 - c)^2 \theta^2$$
Then $\delta_c$ is inadmissible for $c > 1$, and admissible for $0 \le c \le 1$.
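
A quick way to convince yourself of the risk formula is simulation. The following is a minimal sketch, not part of the original notes; the chosen values of $\theta$ and $c$ and the variable names are illustrative assumptions.

```python
import numpy as np

# Monte Carlo check of the frequentist risk R(theta, delta_c) = E[(cX - theta)^2]
# for X ~ N(theta, 1), against the closed form c^2 + (1 - c)^2 theta^2.
# (Illustrative sketch; theta and c are arbitrary choices.)
rng = np.random.default_rng(0)
theta, c, n_sims = 1.5, 0.8, 200_000

x = rng.normal(loc=theta, scale=1.0, size=n_sims)
risk_mc = np.mean((c * x - theta) ** 2)         # simulated risk
risk_closed = c**2 + (1 - c) ** 2 * theta**2    # formula from the slide

print(f"Monte Carlo risk = {risk_mc:.4f}, closed form = {risk_closed:.4f}")
```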

Example (cont.)

[Figure: Risk $R(\theta, \delta_c)$ for admissible decision functions $\delta_c(x) = cx$, $c \le 1$, as a function of $\theta$. The color corresponds to the associated minimum Bayes risk.]

Example (cont.)

Consider now $\pi = N(0, \tau^2)$. Then the Bayes risk is
$$r(\pi, \delta_c) = c^2 + (1 - c)^2 \tau^2$$
Thus, the best Bayes risk is obtained by the Bayes estimator $\delta_c$ with
$$c = \frac{\tau^2}{1 + \tau^2}$$
(This value of $c$ is also the minimum Bayes risk for $\pi$.) That is, each $\delta_c$ with $0 \le c < 1$ is Bayes for the $N(0, \tau_c^2)$ prior with
$$\tau_c = \sqrt{\frac{c}{1 - c}}$$
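
The optimal $c$ and the corresponding minimum Bayes risk can be checked symbolically. This is a small illustrative sketch, not part of the original notes.

```python
import sympy as sp

# Symbolic check (illustrative) that c = tau^2/(1 + tau^2) minimizes the
# Bayes risk r(pi, delta_c) = c^2 + (1 - c)^2 tau^2, and that the minimum
# value of the Bayes risk equals the same expression.
c, tau = sp.symbols("c tau", positive=True)
bayes_risk = c**2 + (1 - c) ** 2 * tau**2

c_star = sp.solve(sp.diff(bayes_risk, c), c)[0]
print(sp.simplify(c_star))                      # tau**2/(tau**2 + 1)
print(sp.simplify(bayes_risk.subs(c, c_star)))  # tau**2/(tau**2 + 1)
```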

Example (cont.)

[Figure] At a larger scale, it becomes clearer that the decision function with $c = 1$ is minimax. It corresponds to the (improper) prior $N(0, \tau^2)$ with $\tau \to \infty$.

Bayes Actions

$\delta^\pi(x)$ is a posterior Bayes action for $x$ if it minimizes
$$\int_\Theta L(\theta, a)\, p(\theta \mid x)\, d\theta$$
Equivalently, it minimizes
$$\int_\Theta L(\theta, a)\, p(x \mid \theta)\, \pi(\theta)\, d\theta$$
Need not be unique.

Equivalence of Bayes actions and Bayes decision rules

A decision rule $\delta^\pi$ minimizing the Bayes risk $r(\pi, \delta)$ can be found pointwise, by minimizing
$$\int_\Theta L(\theta, a)\, p(x \mid \theta)\, \pi(\theta)\, d\theta$$
for each $x$. So, the two problems are equivalent.

Classes of Loss Functions

Three distinguished classes of problems/loss functions:
Regression: squared loss and relatives
Classification: zero-one loss
Density estimation: log-loss

Special Case: Squared Loss

For $L(\theta, a) = (\theta - a)^2$, the Bayes decision rule is the posterior mean
$$\delta^\pi(x) = E[\theta \mid x]$$
For weighted squared loss, $L(\theta, a) = w(\theta)(\theta - a)^2$, the Bayes decision rule is the weighted posterior mean (when $a$ is unrestricted):
$$\delta^\pi(x) = \frac{\int_\Theta \theta\, w(\theta)\, f(x \mid \theta)\, \pi(\theta)\, d\theta}{\int_\Theta w(\theta)\, f(x \mid \theta)\, \pi(\theta)\, d\theta}$$
Note: $w$ acts like a prior here. We will see later how the $L_2$ (posterior mean) case applies to some classification problems, in particular learning with labeled/unlabeled data.
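
As a concrete illustration of the unweighted case, here is a minimal sketch, not from the notes: in the conjugate model $X \mid \theta \sim N(\theta, 1)$, $\theta \sim N(0, \tau^2)$, the Bayes estimate under squared loss is the posterior mean, which is exactly the estimator $\delta_c(x) = cx$ with $c = \tau^2/(1+\tau^2)$ from the earlier example. The numbers below are arbitrary.

```python
import numpy as np

# Posterior mean as the Bayes estimate under squared loss, for the conjugate
# model X | theta ~ N(theta, 1), theta ~ N(0, tau^2).
# (Model and numbers are assumptions for illustration, not from the notes.)
tau2 = 4.0      # prior variance
x_obs = 2.5     # observed data point

# Posterior: theta | x ~ N(c * x, c), with c = tau^2 / (1 + tau^2)
c = tau2 / (1.0 + tau2)
posterior_mean = c * x_obs

# Monte Carlo check: minimize posterior expected squared loss over a grid of actions
rng = np.random.default_rng(1)
theta_samples = rng.normal(loc=posterior_mean, scale=np.sqrt(c), size=100_000)
actions = np.linspace(0.0, 4.0, 401)
expected_loss = [np.mean((theta_samples - a) ** 2) for a in actions]
best_action = actions[int(np.argmin(expected_loss))]

print(f"posterior mean = {posterior_mean:.3f}, loss-minimizing action = {best_action:.3f}")
```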

Special Case: Linear Loss

For $L(\theta, a) = |\theta - a|$, the Bayes decision rule is a posterior median. More generally, for
$$L(\theta, a) = \begin{cases} c_0 (\theta - a) & \theta - a \ge 0 \\ c_1 (a - \theta) & \theta - a < 0 \end{cases}$$
a $\frac{c_0}{c_0 + c_1}$-fractile of the posterior $p(\theta \mid x)$ is a Bayes estimate.
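
A small numerical sketch of the asymmetric case (the standard-normal posterior and the costs $c_0$, $c_1$ are illustrative assumptions, not from the notes): minimizing the posterior expected loss over a grid of actions recovers the $c_0/(c_0+c_1)$ quantile.

```python
import numpy as np

# Under the asymmetric linear loss with costs c0 (underestimation) and c1
# (overestimation), the Bayes action is the c0/(c0+c1) quantile of the posterior.
# The posterior (standard normal) and the costs are illustrative assumptions.
rng = np.random.default_rng(2)
c0, c1 = 3.0, 1.0
theta = rng.normal(size=200_000)                 # samples from the assumed posterior

def expected_loss(a):
    over = theta - a
    return np.mean(np.where(over >= 0, c0 * over, c1 * (-over)))

actions = np.linspace(-3, 3, 601)
best = actions[int(np.argmin([expected_loss(a) for a in actions]))]
quantile = np.quantile(theta, c0 / (c0 + c1))    # the 0.75-fractile here

print(f"grid minimizer = {best:.3f}, c0/(c0+c1)-quantile = {quantile:.3f}")
```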

Generic Learning Problem

In machine learning we're often interested in prediction. Given input $X \in \mathcal{X}$, what is the output $Y \in \mathcal{Y}$? We incur loss $L(X, Y, f(X))$ for predicting $f(X)$ when the input is $X$ and the true output is $Y$.

Generic Learning Problem (cont.)

Given a training set $(X_1, Y_1), \dots, (X_n, Y_n)$ and possibly unlabeled data $(X_1', \dots, X_m')$, determine $f : \mathcal{X} \to \mathcal{Y}$ from some family $\mathcal{F}$. Thus, a learning algorithm is a mapping
$$A : \bigcup_{n \ge 0} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{F}$$
in the supervised case, and
$$A : \bigcup_{n \ge 0,\, m \ge 0} (\mathcal{X} \times \mathcal{Y})^n \times \mathcal{X}^m \to \mathcal{F}$$
in the semi-supervised case of labeled/unlabeled data.

Criterion for Success

Average loss on new data:
$$R[f] = E[L(X, Y, f(X))] = \int_{\mathcal{X} \times \mathcal{Y}} L(x, y, f(x))\, dP(x, y)$$
for some (unknown) measure $P$ on $\mathcal{X} \times \mathcal{Y}$. The generalization error of a learning algorithm $A$ is
$$R[A] - \inf_{f \in \mathcal{F}} R[f]$$
Note: we are not assuming correctness of the model; the risk may be greater than the Bayes error rate.

Empirical Risk

Since $P$ is typically not known, we can't compute the risk, and work instead with the empirical risk
$$R_{\mathrm{emp}}[f] = R_{\mathrm{emp}}[f, (x^n, y^n)] = \frac{1}{n} \sum_{i=1}^n L(x_i, y_i, f(x_i))$$
How might this be modified to take into account unlabeled data $(X_1', \dots, X_m')$?

Standard Example

Consider $\mathcal{F} = \{ f(x) = \langle x, w \rangle \}$, the set of linear functions, and squared error $L(x, y, f(x)) = (y - f(x))^2$. Minimizing the empirical risk:
$$\min_w R_{\mathrm{emp}}[f] = \min_w \frac{1}{n} \sum_{i=1}^n (y_i - \langle w, x_i \rangle)^2$$
Solution: $w = (X^\top X)^{-1} X^\top y$. Note: we can also use a set of basis functions $\phi_i(x)$.
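
A minimal sketch of this example in code; the synthetic data and the true weights are illustrative assumptions, and in practice a least-squares solver is preferable to forming $(X^\top X)^{-1}$ explicitly.

```python
import numpy as np

# Least-squares sketch for the linear-model / squared-loss example.
# The synthetic data-generating weights are illustrative assumptions.
rng = np.random.default_rng(3)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n)

# Normal-equations solution w = (X^T X)^{-1} X^T y, as on the slide ...
w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
# ... and the numerically preferable least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

emp_risk = np.mean((y - X @ w_lstsq) ** 2)   # empirical risk at the minimizer
print(w_normal_eq, w_lstsq, emp_risk)
```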

Other Loss Functions (cont.)

The most natural loss function for classification is 0-1 loss:
$$L(x, y, f(x)) = \begin{cases} 0 & \text{if } f(x) = y \\ 1 & \text{otherwise} \end{cases}$$
We can specify other off-diagonal costs, e.g. a distance function $d(y, f(x))$.

Other Loss Functions (cont.)

$L_2$ loss is strongly affected by outliers. $L_1$ loss is more forgiving, though not differentiable. Again assume a linear model:
$$\min_w R_{\mathrm{emp}}[f] = \min_w \frac{1}{n} \sum_{i=1}^n |y_i - \langle w, x_i \rangle|$$
This can be transformed to a linear program:
$$\begin{aligned} \text{minimize} \quad & \frac{1}{n} \sum_{i=1}^n (z_i + z_i^*) \\ \text{subject to} \quad & y_i - \langle x_i, w \rangle \le z_i \\ & \langle x_i, w \rangle - y_i \le z_i^* \\ & z_i,\, z_i^* \ge 0 \end{aligned}$$
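
A sketch of this linear program in code, with the variables stacked as $[w, z, z^*]$. It uses scipy's `linprog`; the synthetic data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

# L1-regression as a linear program, following the slide's formulation.
# Variables are stacked as v = [w (d), z (n), z* (n)]; data are illustrative.
rng = np.random.default_rng(4)
n, d = 60, 2
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0]) + rng.laplace(scale=0.2, size=n)

c = np.concatenate([np.zeros(d), np.full(2 * n, 1.0 / n)])  # (1/n) sum(z + z*)

# y_i - <x_i, w> <= z_i    ->   -<x_i, w> - z_i  <= -y_i
A1 = np.hstack([-X, -np.eye(n), np.zeros((n, n))])
b1 = -y
# <x_i, w> - y_i <= z*_i   ->    <x_i, w> - z*_i <=  y_i
A2 = np.hstack([X, np.zeros((n, n)), -np.eye(n)])
b2 = y

bounds = [(None, None)] * d + [(0, None)] * (2 * n)          # w free, z, z* >= 0
res = linprog(c, A_ub=np.vstack([A1, A2]), b_ub=np.concatenate([b1, b2]),
              bounds=bounds, method="highs")
w_l1 = res.x[:d]
print(w_l1, res.fun)   # L1-fit weights and the minimized empirical L1 risk
```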

Other Loss Functions (cont.)

Median property: at an optimal solution, an equal number of points will have $y_i - f(x_i) > 0$ and $y_i - f(x_i) < 0$.

Other Loss Functions (cont.)

Huber's robust loss combines $L_2$ (for small errors) with $L_1$ (for large errors):
$$L_\sigma(x, y, f(x)) = \begin{cases} \frac{1}{2\sigma} (y - f(x))^2 & \text{if } |y - f(x)| \le \sigma \\ |y - f(x)| - \frac{\sigma}{2} & \text{otherwise} \end{cases}$$
(As we'll see, regularization is generally to be preferred over changing the loss function... but many learning algorithms do both.)

Comparison of Loss Functions

[Figure: comparison of $L_2$ (green), $L_1$ (red), and Huber's robust loss (blue).]
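
The comparison figure can be reproduced with a short script along the following lines; this is a sketch, and the threshold $\sigma = 1$ and the plotting choices are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# L2, L1, and Huber's robust loss as functions of the residual r = y - f(x).
# The threshold sigma = 1 is an illustrative choice.
r = np.linspace(-3, 3, 601)
sigma = 1.0

l2 = r ** 2
l1 = np.abs(r)
huber = np.where(np.abs(r) <= sigma, r ** 2 / (2 * sigma), np.abs(r) - sigma / 2)

plt.plot(r, l2, "g", label="$L_2$")
plt.plot(r, l1, "r", label="$L_1$")
plt.plot(r, huber, "b", label="Huber")
plt.xlabel("residual $y - f(x)$")
plt.ylabel("loss")
plt.legend()
plt.show()
```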

Probabilistic Error Models

Suppose we assume $Y = f(X) + \xi$ where $\xi \sim p_\theta$. Then the conditional probability $p(Y \mid f, X)$ can be computed in terms of $p_\theta(Y - f(X))$. Note: the error distribution could depend on $X$.

Probabilistic Error Models (cont.)

Assume log-loss for the errors:
$$L(x, y, f(x)) = -\log p_\theta(y - f(x))$$
Then under the iid assumption, we have
$$R_{\mathrm{emp}}[f] = \frac{1}{n} \sum_i L(x_i, y_i, f(x_i)) = -\frac{1}{n} \sum_i \log p_\theta(y_i - f(x_i)) + \text{constant}$$
Looked at differently, the conditional distribution of $y$ is given by
$$p(y \mid x, f) \propto \exp\big( -L(x, y, f(x)) \big)$$
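
As a standard special case, worked here for illustration (not part of the original slide), Gaussian errors give back squared loss:
$$\xi \sim N(0, \sigma^2) \;\Longrightarrow\; -\log p_\theta(y - f(x)) = \frac{(y - f(x))^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2),$$
so up to an additive constant and a scale, the log-loss empirical risk is the squared-error empirical risk, and minimizing it recovers least squares.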

Consistency

For a consistent learning algorithm, we require that
$$\lim_{n \to \infty} P\big( R[A(X^n, Y^n)] - R[f^*] > \epsilon \big) = 0$$
where the probability is with respect to the choice of training sample and $f^* = \arg\min_{f \in \mathcal{F}} R[f]$ achieves the minimum risk.
Relying on $R_{\mathrm{emp}}$ alone may require a very large sample size to achieve small generalization error. It may also lead to ill-posed problems (non-unique, poorly conditioned). A small change in the training data can lead to classifiers with very different expected risks (high variance).

Consistency (cont.)

The empirical risk $R_{\mathrm{emp}}[f]$ converges to $R[f]$ for any fixed function $f$ (e.g., by McDiarmid's inequality). But minimizing the empirical risk gives a different function for each sample, so showing consistency requires uniform convergence arguments. As we'll discuss, such results and rates of convergence involve measures of complexity such as VC dimension or covering numbers (or some more recent notions...).

Regularization

Idea: we want to restrict $f \in \mathcal{F}$ to some compact subset, e.g., $\Omega[f] \le c$. This may lead to a difficult optimization problem. The regularized risk estimator is given by
$$A_{\Omega, \lambda}(X^n, Y^n) = \arg\min_{f \in \mathcal{F}} \big( R_{\mathrm{emp}}[f] + \lambda \Omega[f] \big)$$
where $\Omega : \mathcal{F} \to \mathbb{R}$ is some (typically convex) function. Then, for an appropriate choice of $\lambda \to 0$, regularization achieves the optimal risk $R[f^*]$ as $n \to \infty$.
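
For the linear model and squared loss used earlier, taking $\Omega[f] = \|w\|^2$ gives a closed-form minimizer of the regularized risk. This is a minimal sketch; the data and the value of $\lambda$ are illustrative assumptions.

```python
import numpy as np

# Ridge-regression sketch: regularized empirical risk for a linear model with
# Omega[f] = ||w||^2, minimized in closed form. Data and lambda are illustrative.
rng = np.random.default_rng(5)
n, d = 50, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=0.5, size=n)

lam = 0.1
# argmin_w (1/n)||y - Xw||^2 + lam ||w||^2   =>   w = (X^T X + n*lam*I)^{-1} X^T y
w_reg = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

reg_risk = np.mean((y - X @ w_reg) ** 2) + lam * np.sum(w_reg ** 2)
print(w_reg, reg_risk)
```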

Bayesian Connection

If we assume a prior distribution on classifiers given by
$$\pi(f) \propto \exp\big( -n \lambda \Omega[f] \big)$$
then the posterior is given by
$$P\big(f \mid (X^n, Y^n)\big) \propto \exp\Big( -\sum_{i=1}^n L(X_i, Y_i, f(X_i)) \Big)\, \exp\big( -n \lambda \Omega[f] \big) = \exp\Big( -n \big( R_{\mathrm{emp}}[f] + \lambda \Omega[f] \big) \Big)$$
so regularization corresponds to MAP estimation. We'll return to this when we discuss generative vs. discriminative models for learning and model selection.
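
As a concrete instance of this correspondence (an illustration using the linear/squared-loss example from earlier, with $\Omega[w] = \|w\|^2$): the prior $\pi(w) \propto \exp(-n\lambda\|w\|^2)$ is Gaussian, and the MAP estimate is the ridge solution from the regularization sketch:
$$\hat{w}_{\mathrm{MAP}} = \arg\max_w \exp\Big( -\sum_{i=1}^n (y_i - \langle w, x_i \rangle)^2 \Big)\, e^{-n\lambda\|w\|^2} = \arg\min_w \Big( \frac{1}{n}\sum_{i=1}^n (y_i - \langle w, x_i \rangle)^2 + \lambda\|w\|^2 \Big).$$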