Bias-Variance Tradeoff


What's learning, revisited / Overfitting / Generative versus Discriminative / Logistic Regression
Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University, September 19th, 2007

Bias-Variance Tradeoff
- The choice of hypothesis class introduces a learning bias.
- More complex class: less bias.
- More complex class: more variance.
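
The slides do not illustrate this tradeoff with code; the following is a minimal NumPy sketch (the synthetic sine-curve data and all names are illustrative, not from the lecture) that fits polynomials of increasing degree and compares training error with error on a large independent sample. Low degrees underfit (high bias), high degrees overfit (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n noisy samples of the target t(x) = sin(2*pi*x) on [0, 1]."""
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)

x_train, t_train = sample(15)        # small training set
x_test, t_test = sample(1000)        # large held-out sample approximates prediction error

for degree in (1, 3, 9):             # increasingly complex hypothesis classes
    w = np.polyfit(x_train, t_train, degree)             # least-squares polynomial fit
    train_mse = np.mean((t_train - np.polyval(w, x_train)) ** 2)
    test_mse = np.mean((t_test - np.polyval(w, x_test)) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.3f}, held-out MSE = {test_mse:.3f}")
```

Typically the training error keeps falling as the degree grows, while the held-out error falls and then rises again, which is exactly the overfitting behaviour discussed on the next slides.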

Training set error
- Given a dataset (training data), choose a loss function, e.g., squared error (L2) for regression.
- Training set error: for a particular set of parameters, the loss function evaluated on the training data.
- (Figure: training set error as a function of model complexity.)
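
The formula itself is not transcribed; for a linear model with basis functions h_i(x) (my notation, consistent with the squared-error choice above), the standard form is:

```latex
\mathrm{error}_{\mathrm{train}}(\mathbf{w}) \;=\;
\frac{1}{N_{\mathrm{train}}}\sum_{j=1}^{N_{\mathrm{train}}}
\Bigl(t(\mathbf{x}_j)-\sum_i w_i\,h_i(\mathbf{x}_j)\Bigr)^{2}
```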

Prediction error
- Training set error can be a poor measure of the quality of the solution.
- Prediction error: we really care about the error over all possible input points, not just the training data.
- (Figure: prediction error as a function of model complexity.)
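
The prediction (true) error is an expectation over the input distribution p(x); a standard way to write it, matching the squared-error setup above, is:

```latex
\mathrm{error}_{\mathrm{true}}(\mathbf{w}) \;=\;
\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}
\Bigl[\bigl(t(\mathbf{x})-\sum_i w_i\,h_i(\mathbf{x})\bigr)^{2}\Bigr]
\;=\;
\int \bigl(t(\mathbf{x})-\sum_i w_i\,h_i(\mathbf{x})\bigr)^{2}\,p(\mathbf{x})\,d\mathbf{x}
```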

Computing prediction error
- Computing the prediction error requires a hard integral, and we may not know t(x) for every x.
- Monte Carlo integration (sampling approximation): sample a set of i.i.d. points {x_1, ..., x_M} from p(x) and approximate the integral with the sample average.

Why doesn't the training set error approximate the prediction error?
- Sampling approximation of the prediction error vs. the training error: very similar equations!
- So why is the training set error a bad measure of the prediction error?
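
The two "very similar equations" are not transcribed; written out in the notation above they are:

```latex
\text{Monte Carlo estimate:}\quad
\mathrm{error}_{\mathrm{true}}(\mathbf{w}) \;\approx\;
\frac{1}{M}\sum_{j=1}^{M}\bigl(t(\mathbf{x}_j)-\sum_i w_i\,h_i(\mathbf{x}_j)\bigr)^{2},
\qquad \mathbf{x}_j \sim p(\mathbf{x})\ \text{i.i.d.}

\text{Training error:}\quad
\mathrm{error}_{\mathrm{train}}(\mathbf{w}) \;=\;
\frac{1}{N_{\mathrm{train}}}\sum_{j=1}^{N_{\mathrm{train}}}
\bigl(t(\mathbf{x}_j)-\sum_i w_i\,h_i(\mathbf{x}_j)\bigr)^{2}
```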

Why doesn't the training set error approximate the prediction error? Because you cheated!
- The training error is a good estimate of the prediction error for a single, fixed w.
- But you optimized w with respect to the training error and found a w that is good for this particular set of samples.
- Training error is therefore an (optimistically) biased estimate of the prediction error.

Test set error
- Given a dataset, randomly split it into two parts: training data {x_1, ..., x_Ntrain} and test data {x_1, ..., x_Ntest}.
- Use the training data to optimize the parameters w.
- Test set error: for the final solution w*, evaluate the error on the test data.
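
The test-set error formula is not transcribed; the standard form, mirroring the training error, is:

```latex
\mathrm{error}_{\mathrm{test}}(\mathbf{w}^{*}) \;=\;
\frac{1}{N_{\mathrm{test}}}\sum_{j=1}^{N_{\mathrm{test}}}
\bigl(t(\mathbf{x}_j)-\sum_i w^{*}_i\,h_i(\mathbf{x}_j)\bigr)^{2}
```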

(Figure: test set error as a function of model complexity.)

Overfitting
- Overfitting: a learning algorithm overfits the training data if it outputs a solution w when there exists another solution w' with higher training error but lower true (prediction) error.
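
Stated symbolically (the condition is not transcribed on the slide; this is the standard definition):

```latex
\mathrm{error}_{\mathrm{train}}(\mathbf{w}) < \mathrm{error}_{\mathrm{train}}(\mathbf{w}')
\quad\text{and}\quad
\mathrm{error}_{\mathrm{true}}(\mathbf{w}) > \mathrm{error}_{\mathrm{true}}(\mathbf{w}')
```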

How many points do I use for training/testing?
- Very hard question to answer!
- Too few training points: the learned w is bad. Too few test points: you never know whether you reached a good solution.
- Bounds, such as Hoeffding's inequality, can help (more on this later this semester), but it is still hard to answer.
- Typically: if you have a reasonable amount of data, pick a test set large enough for a reasonable estimate of the error and use the rest for learning; if you have little data, you need to pull out the big guns, e.g., bootstrapping.

Error estimators
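
The bound referenced on the slide is not written out; a standard form of Hoeffding's inequality for a per-example loss in [0, 1], estimated from N_test i.i.d. test points with w fixed independently of the test set, is:

```latex
P\Bigl(\bigl\lvert \mathrm{error}_{\mathrm{true}}(\mathbf{w}) - \mathrm{error}_{\mathrm{test}}(\mathbf{w}) \bigr\rvert \ge \epsilon \Bigr)
\;\le\; 2\,e^{-2 N_{\mathrm{test}}\epsilon^{2}}
```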

(Figure: error as a function of the number of training examples for a fixed model complexity, from little data to infinite data.)

Error estimators
- Be careful! The test set error is only an unbiased estimate if you never do any learning on the test data.
- For example, if you use the test set to select the degree of the polynomial, it is no longer unbiased! (We will address this problem later in the semester.)

Announcements
- First homework is out: a programming part and an analytic part.
- Remember the collaboration policy: you can discuss questions, but you need to write your own solutions and code.
- Remember you are not allowed to look at previous years' solutions, search the web for solutions, use someone else's solutions, etc.
- Due Oct. 3rd at the beginning of class. Start early!
- Recitation this week: Bayes optimal classifiers, Naïve Bayes.

What's (supervised) learning, more formally
Given:
- Dataset: instances {<x_1; t(x_1)>, ..., <x_N; t(x_N)>}, e.g., <x_i; t(x_i)> = <(GPA=3.9, IQ=120, MLscore=99); 150K>.
- Hypothesis space: H, e.g., polynomials of degree 8.
- Loss function: measures the quality of a hypothesis h ∈ H, e.g., squared error for regression.
Obtain:
- Learning algorithm: obtain the h ∈ H that minimizes the loss function, e.g., using matrix operations for regression.
- We want to minimize the prediction error, but can only minimize the error on the dataset.
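
Compactly, in symbols (my restatement of the setup above, not transcribed from the slide):

```latex
\hat{h} \;=\; \arg\min_{h \in H}\; \frac{1}{N}\sum_{j=1}^{N} \mathrm{loss}\bigl(h(\mathbf{x}_j),\, t(\mathbf{x}_j)\bigr),
\qquad\text{while what we actually care about is}\qquad
\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\bigl[\mathrm{loss}\bigl(h(\mathbf{x}),\, t(\mathbf{x})\bigr)\bigr]
```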

Types of (supervised) learning problems, revisited
- Regression, e.g., dataset: <position; temperature>. Hypothesis space: ___. Loss function: ___.
- Density estimation, e.g., dataset: grades. Hypothesis space: ___. Loss function: ___.
- Classification, e.g., dataset: <brain image; {verb v. noun}>. Hypothesis space: ___. Loss function: ___.

Learning is (simply) function approximation!
- The general (supervised) learning problem: given some data (including features), a hypothesis space, and a loss function.
- Learning is no magic! We are simply trying to find a function that fits the data, whether for regression, density estimation, or classification.
- (Not surprisingly) seemingly different problems have very similar solutions.

What is NB really optimizing?
- Naïve Bayes assumption: features are independent given the class; more generally, P(X_1, ..., X_n | Y) factorizes over the features.
- NB classifier: predict the class that maximizes P(Y) times the product of the per-feature likelihoods.

MLE for the parameters of NB
- Given a dataset, let Count(A=a, B=b) ← the number of examples where A=a and B=b.
- MLE for NB is simply counting: Prior: P(Y=y) = ___. Likelihood: P(X_i=x_i | Y=y) = ___.
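
The counting formulas are not written out in the transcription; the standard MLE estimates, together with the NB assumption and classifier they plug into, are:

```latex
P(X_1,\dots,X_n \mid Y) \;=\; \prod_{i} P(X_i \mid Y),
\qquad
\hat{y} \;=\; \arg\max_{y}\; P(Y{=}y)\prod_i P(X_i{=}x_i \mid Y{=}y)

\hat{P}(Y{=}y) \;=\; \frac{\mathrm{Count}(Y{=}y)}{N},
\qquad
\hat{P}(X_i{=}x_i \mid Y{=}y) \;=\; \frac{\mathrm{Count}(X_i{=}x_i,\, Y{=}y)}{\mathrm{Count}(Y{=}y)}
```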

What is NB really optimizing? Let's use an example.

Generative v. Discriminative classifiers: Intuition
- Want to learn h: X → Y, where X are the features and Y the target classes.
- Bayes optimal classifier: P(Y|X).
- Generative classifier, e.g., Naïve Bayes: assume some functional form for P(X|Y) and P(Y); estimate the parameters of P(X|Y) and P(Y) directly from training data; use Bayes rule to calculate P(Y|X=x). This is a generative model: the computation of P(Y|X) is indirect, through Bayes rule, but you can generate a sample of the data, since P(X) = Σ_y P(y) P(X|y).
- Discriminative classifier, e.g., Logistic Regression: assume some functional form for P(Y|X) and estimate its parameters directly from training data. This is a discriminative model: you directly learn P(Y|X), but cannot obtain a sample of the data, because P(X) is not available.
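
The Bayes-rule step used by the generative approach, written out (a standard identity, not specific to these slides):

```latex
P(Y{=}y \mid X{=}\mathbf{x}) \;=\;
\frac{P(X{=}\mathbf{x}\mid Y{=}y)\,P(Y{=}y)}
     {\sum_{y'} P(X{=}\mathbf{x}\mid Y{=}y')\,P(Y{=}y')}
```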

Logistic Regression
- Logistic function (or sigmoid).
- Learn P(Y|X) directly! Assume a particular functional form: a sigmoid applied to a linear function of the data, z.
- Features can be discrete or continuous!

Understanding the sigmoid
(Figure: three sigmoid plots, for w_0 = -2, w_1 = -1; w_0 = 0, w_1 = -1; and w_0 = 0, w_1 = -0.5.)
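
The functional form is not transcribed; one standard parameterization (the sign convention is mine) is:

```latex
P(Y{=}1 \mid \mathbf{x}, \mathbf{w}) \;=\; \frac{1}{1+\exp\!\bigl(-(w_0 + \sum_i w_i x_i)\bigr)},
\qquad
P(Y{=}0 \mid \mathbf{x}, \mathbf{w}) \;=\; 1 - P(Y{=}1 \mid \mathbf{x}, \mathbf{w})
```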

Logistic Regression is a linear classifier
(Figure: sigmoid plot.)
- Very convenient! The functional form implies a linear classification rule.
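
With the parameterization above, the chain of implications on the slide works out to the standard result:

```latex
P(Y{=}1 \mid \mathbf{x}, \mathbf{w}) \ge P(Y{=}0 \mid \mathbf{x}, \mathbf{w})
\;\Longleftrightarrow\;
\ln\frac{P(Y{=}1 \mid \mathbf{x}, \mathbf{w})}{P(Y{=}0 \mid \mathbf{x}, \mathbf{w})} \ge 0
\;\Longleftrightarrow\;
w_0 + \sum_i w_i x_i \ge 0
```

So the decision boundary is the hyperplane w_0 + Σ_i w_i x_i = 0.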

Logistic regression v. Naïve Bayes
- Consider learning f: X → Y, where X is a vector of real-valued features <X_1, ..., X_n> and Y is boolean.
- Could use a Gaussian Naïve Bayes classifier: assume all X_i are conditionally independent given Y; model P(X_i | Y = y_k) as a Gaussian N(μ_ik, σ_i); model P(Y) as Bernoulli(θ, 1-θ).
- What does that imply about the form of P(Y|X)? Cool!

Derive the form of P(Y|X) for continuous X_i

Ratio of class-conditional probabilities

Derive the form of P(Y|X) for continuous X_i (continued)
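
The derivation itself is not transcribed; sketching the standard argument under the GNB assumptions above (class-independent variances σ_i):

```latex
P(Y{=}1\mid \mathbf{x})
= \frac{1}{1+\exp\!\Bigl(\ln\frac{P(Y{=}0)}{P(Y{=}1)} + \sum_i \ln\frac{P(x_i\mid Y{=}0)}{P(x_i\mid Y{=}1)}\Bigr)},
\qquad
\ln\frac{\mathcal{N}(x_i;\mu_{i0},\sigma_i)}{\mathcal{N}(x_i;\mu_{i1},\sigma_i)}
= \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^{2}}\,x_i + \frac{\mu_{i1}^{2}-\mu_{i0}^{2}}{2\sigma_i^{2}}

\Rightarrow\;
P(Y{=}1\mid \mathbf{x}) = \frac{1}{1+\exp\!\bigl(-(w_0+\sum_i w_i x_i)\bigr)},
\quad
w_i = \frac{\mu_{i1}-\mu_{i0}}{\sigma_i^{2}},
\quad
w_0 = \ln\frac{\theta}{1-\theta} + \sum_i\frac{\mu_{i0}^{2}-\mu_{i1}^{2}}{2\sigma_i^{2}}
```

This is exactly the logistic-regression form, which is the "representation equivalence" claimed on the next slide.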

Gaussian Naïve Bayes v. Logistic Regression
- Set of Gaussian Naïve Bayes parameters (feature variance independent of class label) vs. set of Logistic Regression parameters.
- Representation equivalence, but only in a special case: GNB with class-independent variances.
- But what's the difference? LR makes no assumptions about P(X|Y) in learning. The loss function! They optimize different functions and obtain different solutions.

Logistic regression for more than 2 classes
- Logistic regression in the more general case, where Y ∈ {Y_1, ..., Y_R}: learn R-1 sets of weights.

Logistic regression more generally
- Logistic regression in the more general case, where Y ∈ {Y_1, ..., Y_R}: learn R-1 sets of weights, with one form for k < R and another for k = R (normalization, so no weights for this class).
- Features can be discrete or continuous!

Announcements
- Don't forget the recitation tomorrow.
- And start the homework early.
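
The multiclass formulas are not transcribed; the standard R-1-weight-vector form is:

```latex
P(Y{=}y_k \mid \mathbf{x}, \mathbf{w}) =
\frac{\exp\!\bigl(w_{k0}+\sum_i w_{ki}x_i\bigr)}
     {1+\sum_{j=1}^{R-1}\exp\!\bigl(w_{j0}+\sum_i w_{ji}x_i\bigr)}
\;\;\text{for } k<R,
\qquad
P(Y{=}y_R \mid \mathbf{x}, \mathbf{w}) =
\frac{1}{1+\sum_{j=1}^{R-1}\exp\!\bigl(w_{j0}+\sum_i w_{ji}x_i\bigr)}
```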

Loss functions: Likelihood v. Conditional Likelihood
- Generative (Naïve Bayes) loss function: the data likelihood.
- Discriminative models cannot compute P(x^j | w)! But the discriminative (logistic regression) loss function is the conditional data likelihood: it doesn't waste effort learning P(X) and focuses on P(Y|X), which is all that matters for classification.

Expressing the Conditional Log Likelihood
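
The likelihood expressions are not transcribed; in standard notation, with y^j ∈ {0, 1} and the shorthand z^j = w_0 + Σ_i w_i x_i^j (mine), they are:

```latex
\text{Data log likelihood:}\quad \ln P(\mathcal{D}\mid \mathbf{w}) = \sum_j \ln P(\mathbf{x}^j, y^j \mid \mathbf{w}),
\qquad
\text{Conditional log likelihood:}\quad l(\mathbf{w}) = \sum_j \ln P(y^j \mid \mathbf{x}^j, \mathbf{w})

l(\mathbf{w}) = \sum_j \Bigl[\, y^j z^j - \ln\bigl(1+\exp(z^j)\bigr) \Bigr],
\qquad z^j = w_0 + \sum_i w_i x_i^j
```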

Maximizing the Conditional Log Likelihood
- Good news: l(w) is a concave function of w, so there are no locally optimal solutions.
- Bad news: there is no closed-form solution that maximizes l(w).
- Good news: concave functions are easy to optimize.

Optimizing a concave function: Gradient ascent
- The conditional likelihood for Logistic Regression is concave! Find the optimum with gradient ascent.
- Gradient, learning rate η > 0, and update rule (see below).
- Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading).
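
The gradient and update rule are not transcribed; the generic gradient-ascent step is:

```latex
\nabla_{\mathbf{w}}\, l(\mathbf{w}) =
\Bigl[\frac{\partial l(\mathbf{w})}{\partial w_0},\dots,\frac{\partial l(\mathbf{w})}{\partial w_n}\Bigr],
\qquad
w_i^{(t+1)} \;\leftarrow\; w_i^{(t)} + \eta\,\frac{\partial l(\mathbf{w})}{\partial w_i}\Big|_{\mathbf{w}^{(t)}}
```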

Maximize the Conditional Log Likelihood: Gradient ascent

Gradient ascent for LR
- Gradient ascent algorithm: iterate until the change is < ε, updating w_0 and then, for i = 1 ... n, each w_i; repeat.
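
The per-weight updates are not transcribed; for binary LR the partial derivatives are ∂l/∂w_i = Σ_j x_i^j (y^j − P(Y=1|x^j, w)). Below is a minimal NumPy sketch of the resulting algorithm (all names, the step-size scaling by 1/N, and the stopping rule are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gradient_ascent(X, y, eta=0.5, eps=1e-6, max_iters=10000):
    """Maximize the conditional log likelihood l(w) of binary logistic regression
    by gradient ascent.  X: (N, n) feature matrix, y: (N,) array of 0/1 labels.
    Returns (w0, w): intercept and weight vector."""
    N, n = X.shape
    w0, w = 0.0, np.zeros(n)
    for _ in range(max_iters):
        p = sigmoid(w0 + X @ w)        # P(Y=1 | x^j, w) for every example j
        residual = y - p               # y^j - P(Y=1 | x^j, w)
        grad_w0 = residual.sum()       # dl/dw0  (using x_0^j = 1)
        grad_w = X.T @ residual        # dl/dw_i = sum_j x_i^j (y^j - p^j)
        step0, step = eta / N * grad_w0, eta / N * grad_w   # averaged step (a choice, not from the slides)
        w0, w = w0 + step0, w + step
        if max(abs(step0), np.max(np.abs(step))) < eps:     # "change < eps" stopping rule
            break
    return w0, w
```

For example, calling logistic_regression_gradient_ascent(X, y) on a feature matrix X of shape (N, n) and 0/1 labels y returns the intercept and weights that approximately maximize l(w).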

That's all for M(C)LE. How about MAP?
- One common approach is to define a prior on w: a Normal distribution with zero mean and identity covariance.
- This pushes the parameters towards zero and corresponds to regularization: it helps avoid very large weights and overfitting. More on this later in the semester.
- MAP estimate.

M(C)AP as Regularization
- Penalizes high weights; also applicable in linear regression.
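
The MAP objective is not written out; with a zero-mean Gaussian prior p(w) ∝ exp(−(λ/2) Σ_i w_i²) it becomes (standard form):

```latex
\mathbf{w}^{*}_{\mathrm{MCAP}}
= \arg\max_{\mathbf{w}} \Bigl[\, \ln p(\mathbf{w}) + \sum_j \ln P(y^j \mid \mathbf{x}^j, \mathbf{w}) \Bigr]
= \arg\max_{\mathbf{w}} \Bigl[\, l(\mathbf{w}) - \frac{\lambda}{2}\sum_i w_i^{2} \Bigr]
```

For a prior N(0, σ²I) this gives λ = 1/σ², so a tighter prior means stronger regularization.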

Gradient of M(C)AP

MLE vs MAP
- Maximum conditional likelihood estimate vs. maximum conditional a posteriori estimate.
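
The formulas for these two slides are not transcribed; in the notation above, the standard forms are:

```latex
\frac{\partial}{\partial w_i}\Bigl[l(\mathbf{w}) - \frac{\lambda}{2}\sum_k w_k^{2}\Bigr]
= \sum_j x_i^j\bigl(y^j - \hat{P}(Y{=}1\mid \mathbf{x}^j,\mathbf{w})\bigr) - \lambda w_i

\mathbf{w}^{*}_{\mathrm{MCLE}} = \arg\max_{\mathbf{w}} \sum_j \ln P(y^j\mid \mathbf{x}^j,\mathbf{w}),
\qquad
\mathbf{w}^{*}_{\mathrm{MCAP}} = \arg\max_{\mathbf{w}} \Bigl[\sum_j \ln P(y^j\mid \mathbf{x}^j,\mathbf{w}) - \frac{\lambda}{2}\sum_i w_i^{2}\Bigr]
```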

Naïve Bayes vs Logistic Regression
- Consider Y boolean and X_i continuous, X = <X_1, ..., X_n>.
- Number of parameters: NB: 4n + 1; LR: n + 1.
- Estimation method: NB parameter estimates are uncoupled; LR parameter estimates are coupled.

G. Naïve Bayes vs. Logistic Regression, part 1 (Generative and Discriminative classifiers) [Ng & Jordan, 2002]
- Asymptotic comparison (# training examples → infinity):
- When the model is correct, GNB and LR produce identical classifiers.
- When the model is incorrect, LR is less biased: it does not assume conditional independence, and is therefore expected to outperform GNB.

G. Naïve Bayes vs. Logistic Regression, part 2 (Generative and Discriminative classifiers)
- Non-asymptotic analysis [Ng & Jordan, 2002]: convergence rate of the parameter estimates, with n = # of attributes in X.
- Size of training data needed to get close to the infinite-data solution: GNB needs O(log n) samples; LR needs O(n) samples.
- GNB converges more quickly to its (perhaps less helpful) asymptotic estimates.
(Figures: Naïve Bayes vs. Logistic Regression, some experiments on UCI data sets.)

What you should know about Logistic Regression (LR)
- Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function.
- In general, NB and LR make different assumptions. NB: features independent given the class, an assumption on P(X|Y). LR: a functional form for P(Y|X), with no assumption on P(X|Y).
- LR is a linear classifier: the decision rule is a hyperplane.
- LR is optimized by conditional likelihood: no closed-form solution, but it is concave, so gradient ascent finds the global optimum.
- Maximum conditional a posteriori estimation corresponds to regularization.
- Convergence rates: GNB (usually) needs less data; LR (usually) gets to better solutions in the limit.