Bias-Variance Trade-off in ML. Sargur Srihari


Bias-Variance Trade-off in ML
Sargur Srihari, srihari@cedar.buffalo.edu

Bias-Variance Decomposition
1. Model Complexity in Linear Regression
2. Point Estimate: Bias-Variance in Statistics
3. Bias-Variance in Regression
   - Choosing λ in maximum likelihood / least squares estimation
   - Formulation for regression
   - Example
   - Choice of optimal λ

Model Complexity in Linear Regression
- We looked at linear regression where the form of the basis functions ϕ and their number M are fixed.
- Using maximum likelihood (equivalently, least squares) leads to severe overfitting if complex models are trained with limited data.
- However, limiting M to avoid overfitting has the side effect of not capturing important trends in the data.
- Regularization can control overfitting for models with many parameters.
- But seeking to minimize the regularized error with respect to both w and λ leads to the unregularized solution λ = 0, since the penalty term is non-negative and is smallest when it is switched off.

Overfitting is a Property of Maximum Likelihood
- It does not happen when we marginalize over parameters in a Bayesian setting.
- Before considering the Bayesian view, it is instructive to consider the frequentist viewpoint of model complexity.
- This is called the bias-variance trade-off.

Bias-Variance in Regression
- A low-degree polynomial has high bias (fits poorly) but low variance across different data sets.
- A high-degree polynomial has low bias (fits well) but high variance across different data sets.

Bias-Variance in a Point Estimate
- True height of the Chinese emperor: 200 cm (about 6.5 ft).
- Poll question: how tall is the emperor? We determine how wrong people are, on average.
- If everyone answers 200, the average squared error is 0.
- Consider three data sets of answers, each with mean 180 (a bias error of -20), but with increasing spread (standard deviations of 0, 10 and 20).

Data set 1 (bias, no variance): everyone believes the height is 180 (variance = 0).
- The answer is always 180, so the error is always -20.
- Average squared error is 400; average bias error is -20.
- 400 = 400 + 0

Data set 2 (bias, some variance): beliefs are normally distributed with mean 180 and standard deviation 10 (variance 100).
- Poll two people: one says 190, the other 170.
- Errors are -10 and -30; average bias error is -20.
- Squared errors are 100 and 900; average squared error is 500.
- 500 = 400 + 100

Data set 3 (bias, more variance): beliefs are normally distributed with mean 180 and standard deviation 20 (variance 400).
- Poll two people: one says 200, the other 160.
- Errors are 0 and -40; average bias error is -20.
- Squared errors are 0 and 1600; average squared error is 800.
- 800 = 400 + 400

In each case, Average Squared Error = (Bias error)² + Variance. As the variance increases, the error increases.
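
This identity is easy to check numerically. Below is a minimal Python sketch using the two-answer polls and the 200 cm true height from this slide; the variable names and structure are illustrative, not part of the original example.

```python
# Check: average squared error = bias^2 + variance, for the emperor-poll data sets.
import numpy as np

TRUE_HEIGHT = 200.0  # cm, the quantity being estimated

datasets = {
    "Data set 1 (no variance)":   np.array([180.0, 180.0]),
    "Data set 2 (some variance)": np.array([190.0, 170.0]),
    "Data set 3 (more variance)": np.array([200.0, 160.0]),
}

for name, answers in datasets.items():
    errors = answers - TRUE_HEIGHT          # signed error of each answer
    avg_sq_error = np.mean(errors ** 2)     # average squared error
    bias = np.mean(answers) - TRUE_HEIGHT   # average (bias) error
    variance = np.var(answers)              # spread of answers around their own mean
    print(f"{name}: avg sq error = {avg_sq_error:.0f} "
          f"= bias^2 ({bias**2:.0f}) + variance ({variance:.0f})")
```

Running this prints 400 = 400 + 0, 500 = 400 + 100 and 800 = 400 + 400, matching the slide.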

Prediction in Linear Regression
- We need to apply decision theory: choose a specific estimate y(x) of the value of t for a given x.
- In doing so we incur a loss L(t, y(x)). The expected loss is
  $E[L] = \iint L(t, y(x))\, p(x,t)\, dx\, dt$
- Using the squared loss function L(t, y(x)) = {y(x) - t}²,
  $E[L] = \iint \{y(x) - t\}^2\, p(x,t)\, dx\, dt$
- Taking the derivative of E[L] with respect to y(x), using the calculus of variations,
  $\frac{\delta E[L]}{\delta y(x)} = 2 \int \{y(x) - t\}\, p(x,t)\, dt$
- Setting this equal to zero, solving for y(x), and using the sum and product rules,
  $y(x) = \frac{\int t\, p(x,t)\, dt}{p(x)} = \int t\, p(t \mid x)\, dt = E_t[t \mid x]$
- The regression function y(x) which minimizes the expected squared loss is therefore the mean of the conditional distribution p(t|x).
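
The rearrangement that the slide compresses between the last two bullets is the following, using the sum rule p(x) = ∫ p(x,t) dt and the product rule p(x,t) = p(t|x) p(x):

```latex
% Rearranging the stationarity condition \delta E[L]/\delta y(x) = 0.
\begin{aligned}
0 = 2\int \{y(x) - t\}\, p(x,t)\, dt
  \;&\Longrightarrow\;
  y(x)\int p(x,t)\, dt = \int t\, p(x,t)\, dt \\
  &\Longrightarrow\;
  y(x) = \frac{\int t\, p(x,t)\, dt}{p(x)}
       = \int t\, p(t\mid x)\, dt = \mathbb{E}_t[t\mid x]
\end{aligned}
```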

Alternative Derivation
- We can show that the optimal prediction equals the conditional mean in another way.
- First we have
  $\{y(x) - t\}^2 = \{y(x) - E[t \mid x] + E[t \mid x] - t\}^2$
- Substituting into the loss function and integrating over t, we obtain
  $E[L] = \int \{y(x) - E[t \mid x]\}^2\, p(x)\, dx + \int \mathrm{var}[t \mid x]\, p(x)\, dx$
- The function y(x) we seek to determine enters only in the first term, which is minimized when y(x) = E[t|x].
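
The intermediate expansion behind this result, omitted on the slide, is shown below: after substituting into E[L] and integrating over t, the cross term vanishes because ∫ {E[t|x] - t} p(t|x) dt = 0, and the last term becomes the conditional variance.

```latex
% Expand the square; the middle (cross) term integrates to zero over t.
\begin{aligned}
\{y(x)-t\}^2
  ={}& \{y(x)-\mathbb{E}[t\mid x]\}^2
     + 2\,\{y(x)-\mathbb{E}[t\mid x]\}\{\mathbb{E}[t\mid x]-t\}
     + \{\mathbb{E}[t\mid x]-t\}^2 \\
\mathbb{E}[L]
  ={}& \int \{y(x)-\mathbb{E}[t\mid x]\}^2\, p(x)\, dx
     + \int \operatorname{var}[t\mid x]\, p(x)\, dx
\end{aligned}
```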

Bias-Variance in Regression
- y(x; D): the regression function obtained by some method from a data set D.
- h(x): the optimal prediction under squared loss,
  $h(x) = E[t \mid x] = \int t\, p(t \mid x)\, dt$
- Assuming the squared loss L(t, y(x)) = {y(x) - t}², the expected loss, taken over the ensemble of data sets D, can be written as
  expected loss = (bias)² + variance + noise
  where
  $(\mathrm{bias})^2 = \int \{E_D[y(x;D)] - h(x)\}^2\, p(x)\, dx$
  $\mathrm{variance} = \int E_D\big[\{y(x;D) - E_D[y(x;D)]\}^2\big]\, p(x)\, dx$
  $\mathrm{noise} = \iint \{h(x) - t\}^2\, p(x,t)\, dx\, dt$
- The (bias)² term measures the difference between the expected (average) prediction over data sets and the optimal prediction h(x).
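
The reasoning behind the first two terms, again compressed on the slide, is to add and subtract E_D[y(x;D)] inside E_D[{y(x;D) - h(x)}²]; the cross term has zero expectation over D. Integrating the result against p(x) gives the (bias)² and variance expressions above.

```latex
% Pointwise decomposition; the cross term vanishes under E_D.
\begin{aligned}
\mathbb{E}_D\!\left[\{y(x;D)-h(x)\}^2\right]
  &= \mathbb{E}_D\!\left[\{y(x;D)-\mathbb{E}_D[y(x;D)]
       + \mathbb{E}_D[y(x;D)]-h(x)\}^2\right] \\
  &= \underbrace{\{\mathbb{E}_D[y(x;D)]-h(x)\}^2}_{(\mathrm{bias})^2\ \text{at}\ x}
   + \underbrace{\mathbb{E}_D\!\left[\{y(x;D)-\mathbb{E}_D[y(x;D)]\}^2\right]}_{\text{variance at}\ x}
\end{aligned}
```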

Goal: Minimize Expected Loss
- We have decomposed the expected loss into the sum of a (squared) bias, a variance and a constant noise term.
- There is a trade-off between bias and variance:
  - Very flexible models have low bias and high variance.
  - Rigid models have high bias and low variance.
- The optimal model has the best balance between the two.

Dependence of Bias-Variance on Model Complexity
- Data are generated from h(x) = sin(2πx); model complexity is governed by the regularization parameter λ.
- L = 100 data sets, each with N = 25 points.
- 24 Gaussian basis functions, giving M = 25 parameters in total (including the bias).
- Error function:
  $\frac{1}{2}\sum_{n=1}^{N} \{t_n - \mathbf{w}^T \phi(x_n)\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$
  where ϕ is the vector of basis functions.
- Figure: 20 fits are shown, each to a data set of 25 points. Red: average of the fits. Green: the sinusoid from which the data were generated.
  - High λ: low variance, high bias.
  - Low λ: high variance, low bias.
- Averaging multiple solutions obtained with a complex model gives a good fit.
- Weighted averaging of multiple solutions is at the heart of the Bayesian approach: not with respect to multiple data sets, but with respect to the posterior distribution of the parameters.

Determining the Optimal λ
- Average prediction:
  $\bar{y}(x) = \frac{1}{L}\sum_{l=1}^{L} y^{(l)}(x)$
- Squared bias:
  $(\mathrm{bias})^2 = \frac{1}{N}\sum_{n=1}^{N} \{\bar{y}(x_n) - h(x_n)\}^2$
- Variance:
  $\mathrm{variance} = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{L}\sum_{l=1}^{L} \{y^{(l)}(x_n) - \bar{y}(x_n)\}^2$
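
Below is a minimal Python sketch of these estimators, following the experiment on the previous slide: L = 100 data sets of N = 25 points drawn from sin(2πx) plus noise, each fitted by regularized least squares with 24 Gaussian basis functions and a bias term. The basis centres and width, the noise level and the evaluation grid are illustrative assumptions, not values given on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, M = 100, 25, 25                # data sets, points per set, total parameters
centres = np.linspace(0, 1, M - 1)   # 24 Gaussian basis centres (assumed)
s = 0.1                              # basis width (assumed)
noise_std = 0.3                      # observation noise (assumed)
lam = np.exp(-0.31)                  # regularization parameter

def design_matrix(x):
    """Bias column followed by 24 Gaussian basis functions."""
    gauss = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), gauss])

def fit(x, t):
    """Minimize (1/2) sum_n {t_n - w^T phi(x_n)}^2 + (lam/2) w^T w."""
    Phi = design_matrix(x)
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

x_eval = np.linspace(0, 1, N)        # evaluation points x_n
h = np.sin(2 * np.pi * x_eval)       # optimal prediction h(x)
Phi_eval = design_matrix(x_eval)

# y_preds[l, n] holds y^(l)(x_n), the l-th data set's prediction at x_n.
y_preds = np.empty((L, N))
for l in range(L):
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, N)
    y_preds[l] = Phi_eval @ fit(x, t)

y_bar = y_preds.mean(axis=0)                 # (1/L) sum_l y^(l)(x)
sq_bias = np.mean((y_bar - h) ** 2)          # average over the N evaluation points
variance = np.mean((y_preds - y_bar) ** 2)   # average over l, then over n
print(f"(bias)^2 = {sq_bias:.4f}, variance = {variance:.4f}")
```

Sweeping `lam` over a range of values and plotting the two quantities reproduces the kind of trade-off curve described on the next slide.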

Squared Bias and Variance vs. λ
- The minimum of (bias)² + variance occurs around ln λ = -0.31, close to the value of λ that minimizes the test error.
- Small values of λ allow the model to become finely tuned to the noise, leading to large variance.
- Large values of λ pull the weight parameters towards zero, leading to large bias.

Bias-Variance vs. Bayesian
- The bias-variance decomposition provides insight into the model complexity issue.
- It has limited practical value, since it is based on ensembles of data sets; in practice there is only a single observed data set.
- If we had many independent training sets, we would do better to combine them into one large training set, which would reduce over-fitting for a given model complexity.
- The Bayesian approach gives useful insights into over-fitting and is also practical.