Machine Learning

Machine Learning 10-701
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
February 1, 2011

Today: Generative vs. discriminative classifiers; Linear regression; Decomposition of error into bias, variance, and unavoidable error
Readings: Mitchell, "Naïve Bayes and Logistic Regression" (see class website); Ng and Jordan paper (class website); Bishop, Ch. 9.1, 9.2

Logistic Regression
Consider learning f: X -> Y, where X is a vector of real-valued features <X_1 ... X_n> and Y is boolean.
Assume all X_i are conditionally independent given Y, model P(X_i | Y = y_k) as Gaussian N(µ_ik, σ_i), and model P(Y) as Bernoulli(π).
Then P(Y | X) takes the logistic form
  P(Y = 1 | X) = 1 / (1 + exp(w_0 + Σ_i w_i X_i))
and we can estimate W directly. Furthermore, the same holds if the X_i are boolean (try proving that to yourself).
Train by gradient ascent estimation of the w's (no assumptions!).
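As a concrete illustration (not code from the course), here is a minimal sketch of maximum-conditional-likelihood training of logistic regression by gradient ascent. It uses the common sign convention P(Y=1|x) = σ(w·x), which is a sign flip of the slide's form; the function names and step sizes are illustrative choices, not part of the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Gradient ascent on the conditional log-likelihood sum_l log P(y^l | x^l, w)."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # prepend a column of 1s for the intercept w_0
    w = np.zeros(d + 1)
    for _ in range(n_iters):
        p = sigmoid(Xb @ w)                # P(Y = 1 | x, w) for every training example
        grad = Xb.T @ (y - p)              # gradient of the conditional log-likelihood
        w += lr * grad / n                 # ascend (scaled by n for a stable step size)
    return w

# usage: w = train_logistic_regression(X, y) with labels y in {0, 1}
```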

MLE vs. MAP
Maximum conditional likelihood estimate:
  W_MCLE = arg max_W Π_l P(y^l | x^l, W)
Maximum a posteriori estimate, with prior W ~ N(0, σI):
  W_MAP = arg max_W P(W) Π_l P(y^l | x^l, W)

Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X -> Y, or P(Y | X).
Generative classifiers (e.g., Naïve Bayes):
- Assume some functional form for P(Y), P(X | Y)
- Estimate parameters of P(X | Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y = y | X = x)
Discriminative classifiers (e.g., Logistic Regression):
- Assume some functional form for P(Y | X)
- Estimate parameters of P(Y | X) directly from training data
NOTE: even though our derivation of the form of P(Y | X) made GNB-style assumptions, the training procedure for Logistic Regression does not!
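A minimal sketch of how the MAP estimate changes the gradient-ascent update in the illustrative code above: the zero-mean Gaussian prior on W contributes an extra -lam * w term to the gradient (an L2 penalty, with lam playing the role of 1/σ²). All names and values are illustrative assumptions.

```python
import numpy as np

def train_logistic_regression_map(X, y, lam=0.1, lr=0.1, n_iters=1000):
    """Gradient ascent on log P(W) + sum_l log P(y^l | x^l, W), with W ~ N(0, sigma I)."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])
    w = np.zeros(d + 1)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))   # P(Y = 1 | x, w)
        grad = Xb.T @ (y - p) - lam * w       # prior adds -lam * w (in practice w_0 is often left unpenalized)
        w += lr * grad / n
    return w
```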

Use Naïve Bayes or Logistic Regression?
Consider:
- Restrictiveness of the modeling assumptions
- Rate of convergence (in amount of training data) toward the asymptotic hypothesis, i.e., the learning curve

Naïve Bayes vs. Logistic Regression
Consider Y boolean, X_i continuous, X = <X_1 ... X_n>.
Number of parameters to estimate: NB: ?  LR: ?

Naïve Bayes vs. Logistic Regression
Consider Y boolean, X_i continuous, X = <X_1 ... X_n>.
Number of parameters: NB: 4n + 1; LR: n + 1
Estimation method: NB parameter estimates are uncoupled; LR parameter estimates are coupled.

Gaussian Naïve Bayes vs. Logistic Regression [Ng & Jordan, 2002]
Recall the two assumptions used to derive the form of LR from GNB:
1. X_i conditionally independent of X_k given Y
2. P(X_i | Y = y_k) = N(µ_ik, σ_i), not N(µ_ik, σ_ik)
Consider three learning methods: GNB (assumption 1 only), GNB2 (assumptions 1 and 2), LR.
Which method works better if we have infinite training data, and...
- Both (1) and (2) are satisfied?
- Neither (1) nor (2) is satisfied?
- (1) is satisfied, but not (2)?

Gaussian Naïve Bayes vs. Logistic Regression [Ng & Jordan, 2002]
Recall the two assumptions used to derive the form of LR from GNB:
1. X_i conditionally independent of X_k given Y
2. P(X_i | Y = y_k) = N(µ_ik, σ_i), not N(µ_ik, σ_ik)
Consider three learning methods:
- GNB (assumption 1 only): decision surface can be non-linear
- GNB2 (assumptions 1 and 2): decision surface linear
- LR: decision surface linear, trained differently
Which method works better if we have infinite training data, and...
- Both (1) and (2) are satisfied: LR = GNB2 = GNB
- Neither (1) nor (2) is satisfied: LR > GNB2, GNB > GNB2
- (1) is satisfied, but not (2): GNB > LR, LR > GNB2

Gaussian Naïve Bayes vs. Logistic Regression [Ng & Jordan, 2002]
What if we have only finite training data? The methods converge at different rates to their asymptotic (infinite-data) error.
Let ε_{A,n} refer to the expected error of learning algorithm A after n training examples, and let d be the number of features <X_1 ... X_d>.
So GNB requires n = O(log d) training examples to converge, but LR requires n = O(d).

Some experiments from UCI data sets [Ng & Jordan, 2002]

Naïve Bayes vs. Logistic Regression: the bottom line
- GNB2 and LR both use linear decision surfaces; GNB need not.
- Given infinite data, LR is better than GNB2 because its training procedure does not make assumptions 1 or 2 (though our derivation of the form of P(Y | X) did).
- But GNB2 converges more quickly to its perhaps-less-accurate asymptotic error.
- And GNB is both more biased (assumption 1) and less biased (no assumption 2) than LR, so either might beat the other.
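The Ng & Jordan learning-curve plots are not reproduced here, but a rough sense of the comparison can be sketched on synthetic data (this is an illustrative experiment of my own construction, not the paper's UCI setup; the dimensions, means, and sample sizes are arbitrary choices), using scikit-learn's GaussianNB and LogisticRegression:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic class-conditional Gaussian data, so assumptions (1) and (2) roughly hold.
d = 20
mu0, mu1 = np.zeros(d), 0.3 * np.ones(d)

def sample(n):
    y = np.tile([0, 1], n // 2)                              # balanced labels
    X = rng.normal(np.where(y[:, None] == 1, mu1, mu0), 1.0)  # unit-variance features
    return X, y

X_test, y_test = sample(5000)
for n in [10, 30, 100, 300, 1000]:
    X_train, y_train = sample(n)
    acc_nb = GaussianNB().fit(X_train, y_train).score(X_test, y_test)
    acc_lr = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
    print(f"n={n:5d}  GNB acc={acc_nb:.3f}  LR acc={acc_lr:.3f}")
```

With small n, GNB tends to be closer to its asymptotic accuracy; as n grows, LR catches up (and can surpass GNB when the GNB assumptions are violated).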

What you should know
Logistic regression:
- Functional form follows from the Naïve Bayes assumptions: for Gaussian Naïve Bayes assuming variance σ_ik = σ_i, and for discrete-valued Naïve Bayes too
- But the training procedure picks parameters without the conditional independence assumption
- MLE training: pick W to maximize P(Y | X, W)
- MAP training: pick W to maximize P(W | X, Y); regularization (e.g., P(W) ~ N(0, σ)) helps reduce overfitting
Gradient ascent/descent:
- General approach when closed-form solutions for MLE, MAP are unavailable
Generative vs. Discriminative classifiers
Bias vs. variance tradeoff

Machine Learning 10-701
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
February 1, 2011

Today: Linear regression; Decomposition of error into bias, variance, and unavoidable error
Readings: Mitchell, "Naïve Bayes and Logistic Regression" (see class website); Ng and Jordan paper (class website); Bishop, Ch. 9.1, 9.2

Regression
So far, we've been interested in learning P(Y | X) where Y has discrete values (called "classification"). What if Y is continuous (called "regression")? For example:
- predict weight from gender, height, age, ...
- predict Google stock price today from Google, Yahoo, MSFT prices yesterday
- predict each pixel intensity in a robot's current camera image, from the previous image and previous action

Regression
Wish to learn f: X -> Y, where Y is real, given {<x_1, y_1> ... <x_n, y_n>}.
Approach:
1. choose some parameterized form for P(Y | X; θ) (θ is the vector of parameters)
2. derive a learning algorithm as the MLE or MAP estimate for θ

1. Choose a parameterized form for P(Y | X; θ)
Assume Y is some deterministic function f(X), plus random noise:
  Y = f(X) + ε,  where ε ~ N(0, σ²)
Therefore Y is a random variable that follows the distribution
  P(Y | X = x) = N(f(x), σ²)
and the expected value of Y for any given x is f(x).

Consider Linear Regression
E.g., assume f(x) is a linear function of x:
  f(x) = w_0 + w_1 x_1 + ... + w_n x_n
Notation: to make our parameters explicit, let's write
  P(Y | X = x; W) = N(f_W(x), σ²),  with f_W(x) = w_0 + Σ_i w_i x_i

Training Linear Regression
How can we learn W from the training data?
Learn the Maximum Conditional Likelihood Estimate!
  W_MCLE = arg max_W Π_l P(y^l | x^l, W)
where
  P(y | x, W) = (1 / (σ √(2π))) exp( -(y - f_W(x))² / (2σ²) )

Training Linear Regression
Learn the Maximum Conditional Likelihood Estimate:
  W_MCLE = arg max_W Π_l P(y^l | x^l, W) = arg max_W Σ_l ln P(y^l | x^l, W)
where P(y | x, W) = (1 / (σ √(2π))) exp( -(y - f_W(x))² / (2σ²) ), so:
  W_MCLE = arg min_W Σ_l (y^l - f_W(x^l))²
i.e., maximizing the conditional likelihood is the same as minimizing the sum of squared prediction errors.
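A minimal sketch (not from the slides) of computing this SSE-minimizing estimate for the linear case, using NumPy's least-squares solver; the function name and the bias-column convention are illustrative assumptions.

```python
import numpy as np

def fit_linear_regression_mle(X, y):
    """MCLE under y = w_0 + w . x + Gaussian noise, i.e. the minimizer of the sum of squared errors."""
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])        # column of 1s for the intercept w_0
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # least-squares solution of Xb w ≈ y
    return w

# usage: w = fit_linear_regression_mle(X, y); predictions are np.hstack([ones, X]) @ w
```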

Training Linear Regression
Learn the Maximum Conditional Likelihood Estimate.
Can we derive a gradient descent rule for training?
How about a MAP estimate instead of the MLE estimate?
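One possible answer to both questions, as a hedged sketch: gradient descent on the (regularized) sum of squared errors, where lam = 0 gives the MLE and lam > 0 corresponds to a MAP estimate with a zero-mean Gaussian prior on W (ridge regression). The learning rate, iteration count, and names are illustrative.

```python
import numpy as np

def fit_linear_regression_gd(X, y, lam=0.0, lr=0.01, n_iters=5000):
    """Gradient descent on (1/2n) * SSE + (lam/2) * ||w||^2."""
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        err = Xb @ w - y                    # prediction errors f_W(x^l) - y^l
        grad = Xb.T @ err / n + lam * w     # gradient of the regularized objective
        w -= lr * grad                      # descend
    return w
```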

Regression: What you should know
Under the general assumption P(Y | X = x) = N(f(x), σ²):
1. MLE corresponds to minimizing the sum of squared prediction errors
2. The MAP estimate minimizes SSE plus the sum of squared weights
3. Again, learning is an optimization problem once we choose our objective function: maximize the data likelihood, or maximize the posterior probability of W
4. Again, we can use gradient descent as a general learning algorithm, as long as our objective function is differentiable with respect to W (though we might converge to local optima instead of the global optimum)
5. Almost nothing we said here required that f(x) be linear in x

Bias/Variance Decomposition of Error

Bias and Variance
Given some estimator Y for some parameter θ, we define
  bias of estimator Y = E[Y] - θ
  variance of estimator Y = E[(Y - E[Y])²]
E.g., define Y as the MLE estimator for the probability of heads, based on n independent coin flips. Biased or unbiased? Its variance decreases as 1/n (so its standard deviation decreases as √(1/n)).

Bias-Variance decomposition of error
Reading: Bishop, chapter 9.1, 9.2
Consider a simple regression problem f: X -> Y,
  y = f(x) + ε,  with f deterministic and noise ε ~ N(0, σ²)
What are the sources of prediction error in our learned estimate h(x) of f(x)?
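A quick simulation one might run to check the coin-flip example empirically (the true probability, number of flips, and repeat count below are arbitrary choices for the demo, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3          # true probability of heads (arbitrary choice)
n_flips = 50         # coin flips per estimate
n_repeats = 100_000  # number of simulated estimators

# MLE of P(heads) = (#heads) / n, computed for each repeated experiment
estimates = rng.binomial(n_flips, theta, size=n_repeats) / n_flips

print("empirical bias:     ", estimates.mean() - theta)       # ~0: the MLE is unbiased
print("empirical variance: ", estimates.var())                # ~ theta * (1 - theta) / n
print("theoretical variance", theta * (1 - theta) / n_flips)
```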

Sources of error
What if we have a perfect learner and infinite data? Then our learned h(x) satisfies h(x) = f(x), but we still have remaining, unavoidable error σ² (from the noise ε).

Sources of error
What if we have only n training examples? What is our expected error, taken over random training sets of size n drawn from the distribution D = p(x, y)?

Sources of error
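For reference, the standard bias-variance decomposition this builds toward (a sketch under the assumptions already in use: y = f(x) + ε with Var(ε) = σ², and h_D the hypothesis learned from a training set D of size n) is:

```latex
E_{D,\varepsilon}\big[(y - h_D(x))^2\big]
  = \underbrace{\sigma^2}_{\text{unavoidable error}}
  + \underbrace{\big(E_D[h_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{E_D\big[(h_D(x) - E_D[h_D(x)])^2\big]}_{\text{variance}}
```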