Discriminative v. generative

Naive Bayes

Naive Bayes

Model: $P(x, y) = \prod_i P(y_i) \prod_j P(x_{ij} \mid y_i)$

Parameters (2k + 1 of them): $P(y_i = +) = p$, $P(x_{ij} = 1 \mid y_i = -) = a_j$, $P(x_{ij} = 1 \mid y_i = +) = b_j$

MLE: $\max_{a_j, b_j, p} P(x, y)$, which gives
$p = \frac{1}{N} \sum_i [y_i = +]$
$a_j = \sum_i [(y_i = -) \wedge (x_{ij} = 1)] \,/\, \sum_i [y_i = -]$
$b_j = \sum_i [(y_i = +) \wedge (x_{ij} = 1)] \,/\, \sum_i [y_i = +]$

The resulting posterior is logistic: $P(y_i = + \mid x_i) = 1 / (1 + \exp(-z_i))$ with $z_i = w_0 + \sum_j w_j x_{ij}$.
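These count estimates are short to implement; here is a minimal sketch for binary features (the function and variable names are mine, not from the slides):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_nb(X, y):
    """MLE for the slide's parameters. X: (N, k) 0/1 array, y: (N,) array in {-1, +1}.
    Returns p = P(y=+), a[j] = P(x_j=1 | y=-), b[j] = P(x_j=1 | y=+).
    No smoothing, exactly as in the MLE above (so zero counts give 0/1 probabilities)."""
    pos, neg = (y == +1), (y == -1)
    p = pos.mean()                 # p = (1/N) sum [y_i = +]
    a = X[neg].mean(axis=0)        # a_j = #{y=-, x_j=1} / #{y=-}
    b = X[pos].mean(axis=0)        # b_j = #{y=+, x_j=1} / #{y=+}
    return p, a, b

def posterior_pos(x, p, a, b):
    """P(y=+ | x) by Bayes' rule; as the slide notes, this is a logistic
    function of x, driven by the 2k+1 underlying parameters."""
    log_pos = np.log(p) + np.sum(x * np.log(b) + (1 - x) * np.log(1 - b))
    log_neg = np.log(1 - p) + np.sum(x * np.log(a) + (1 - x) * np.log(1 - a))
    return sigmoid(log_pos - log_neg)
```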

Logistic regression

$P(y_i = + \mid x_i) = 1 / (1 + \exp(-z_i))$, where $z_i = w_0 + \sum_j w_j x_{ij}$

$\arg\max_w \prod_i P(y_i \mid x_i) \;=\; \arg\min_w \sum_i \ln(1 + \exp(-y_i z_i)) \;=\; \arg\min_w \sum_i h(y_i z_i)$

where $h(t) = \ln(1 + e^{-t})$ is the logistic loss.
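A minimal sketch of minimizing this loss by batch gradient descent (the names, step size, and iteration count are my own choices, not from the slides):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logreg(X, y, lr=0.1, n_iters=2000):
    """Minimize sum_i ln(1 + exp(-y_i z_i)) with z_i = w_0 + w . x_i.
    X: (N, k) features, y: (N,) labels in {-1, +1}. Returns w of length k+1 (bias first)."""
    N, k = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])          # column of 1s carries w_0
    w = np.zeros(k + 1)
    for _ in range(n_iters):
        z = Xb @ w
        # d/dw ln(1 + exp(-y z)) = -y * sigmoid(-y z) * x
        grad = -(Xb * (y * sigmoid(-y * z))[:, None]).sum(axis=0)
        w -= lr * grad / N
    return w
```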

Same model, different answer

Why? max P(X, Y) vs. max P(Y | X): generative vs. discriminative, i.e., MLE vs. MCLE (maximum conditional likelihood estimate).

How to pick? Typically MCLE is better with lots of data, MLE is better with less.

MCLE as MLE

MLE: $\max_\theta \prod_i P(x_i, y_i \mid \theta)$  vs.  MCLE: $\max_\theta \prod_i P(y_i \mid x_i, \theta)$

MCLE as MLE

Recipe: MCLE = MLE + extra parameters that decouple P(x) from P(y | x).

Bias-variance tradeoff: MLE places additional constraints on $\theta$ by coupling it to P(x), so MLE increases bias and decreases variance (vs. MCLE).

To interpolate between generative and discriminative models, soft-tie $\theta_x$ to $\theta_y$ with a prior.

Tom Minka. Discriminative models, not discriminative training. MSR tech report TR-2005-144, 2005.

Comparison

As #examples $\to \infty$:
if the Bayes-net (NB) model is right: NB and LR converge to the same answer
if not: LR still attains the minimum possible training error (and, asymptotically, test error) for a linear classifier
So LR does at least as well as NB, usually better.

Comparison

Finite sample: n examples with k attributes; how big must n be for excess risk $\epsilon$?

GNB needs $n = \Theta(\log k)$, as long as at least a constant fraction of the attributes are relevant (Hoeffding bound for each weight + union bound over the weights + bounding z away from 0).
LR needs $n = \Theta(k)$ (the VC dimension of a linear classifier).

So GNB converges much faster to its (perhaps less accurate) final estimates. See [Ng & Jordan, 2002].

Comparison on UCI

[Learning curves, error vs. number of training examples m: voting records (discrete) and pima (continuous). NB: solid, LR: dashed. See [Ng & Jordan, 2002].]

Comparison on UCI

[More learning curves: lymphography (discrete), optdigits (0s and 1s, continuous), breast cancer (discrete), sick (discrete), voting records (discrete). NB: solid, LR: dashed. See [Ng & Jordan, 2002].]

Decision trees

Dichotomous classifier

1. a. Insect has 1 pair of wings... Order Diptera (flies, mosquitoes)
   b. Insect has 2 pairs of wings... go to #2
2. a. Insect has extremely long prothorax (neck)... go to #3
   b. Insect has a regular length or no prothorax... go to #4
3. a. Forelegs come together in a 'praying' position... Order Mantodea (mantids)
   b. Forelegs do not come together in a 'praying' position... Order Raphidioptera (snakeflies)
4. a. Wings are armour-like with membranous hindwings underneath them... Order Coleoptera (beetles)
   b. Wings are not armour-like... go to #5
5. a. Wings twist when insect is in flight... Order Strepsiptera (twisted-wing parasites)
   b. Wings flap up and down (no twisting) when in flight... go to #6
6. a. Wings are triangular in shape... go to #7
   b. Wings are not triangular in shape... go to #8

http://www.insectidentification.org/winged-insect-key.asp

Decision tree

Problem: classification (or regression).
n training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ with $x_i \in \mathbb{R}^k$, $y_i \in \{0, 1\}$.
Well-known implementations: ID3, C4.5, J48, CART.
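For a quick sense of what these learners do in practice, here is a minimal usage sketch with scikit-learn's CART-style implementation (the iris data is just a stand-in dataset, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# entropy criterion = information-gain splitting, as discussed later
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```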

The picture

The picture: Composition II in Red, Blue, and Yellow, Piet Mondrian, 1930. (A decision tree likewise carves the input space into axis-aligned rectangles.)

Variants

Type of question at internal nodes
Type of label at leaf nodes
Labels on internal nodes or edges

Variants

Decision list
Decision diagram (DAG)

Example

[Figure: examples described by sepal length and petal length.]

Representational power

AND, OR, XOR: a decision tree can represent any Boolean function; AND and OR need only small trees, while XOR (parity) forces exponentially many leaves.

Why decision trees?

Why:
flexible hypothesis class
work pretty well
fairly interpretable
very fast at test time
closed under common operations

Why not:
learning is NP-hard
often not state-of-the-art error rate (but: see bagging, boosting)

Learning

red?   fuzzy?   Class
T      T        -
T      F        +
T      F        -
F      T        -
F      F        +

Learning

Bigger data sets with more attributes: finding the training-set MLE tree is NP-hard.

Heuristic search: build the tree greedily, from the root down:
start with all training examples in one bin
pick an impure bin
try some candidate splits (e.g., all single-attribute binary tests) and pick the best (largest increase in likelihood)
repeat until all bins are either pure or have no possible splits left (ran out of attributes to split on)

Information gain

For the table on the previous slide, let L be the log-likelihood of the bin labels under each bin's empirical class frequencies:

Initially: L = 2 log(.4) + 3 log(.6)
Split on red?: bin T: 2 log(.667) + log(.333); bin F: 2 log(.5)
Split on fuzzy?: bin T: 2 log(1) + 0 log(0) = 0; bin F: 2 log(.667) + log(.333)

In general: choose the split that maximizes the information gain $H(Y) - E_X[H(Y \mid X)]$.
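A small sketch that reproduces the log-likelihood arithmetic above for the toy table (encoding T/F as 1/0 and the classes as +1/-1; the helper name is mine):

```python
import numpy as np

def bin_loglik(labels):
    """Sum over examples of log(empirical frequency of that example's class).
    Absent classes simply contribute nothing, matching the 0 log 0 = 0 convention."""
    labels = np.asarray(labels)
    return sum((labels == c).sum() * np.log((labels == c).mean())
               for c in np.unique(labels))

red   = np.array([1, 1, 1, 0, 0])
fuzzy = np.array([1, 0, 0, 1, 0])
y     = np.array([-1, +1, -1, -1, +1])

print(bin_loglik(y))                                          # 2 log(.4) + 3 log(.6)
print(bin_loglik(y[red == 1])   + bin_loglik(y[red == 0]))    # split on red?
print(bin_loglik(y[fuzzy == 1]) + bin_loglik(y[fuzzy == 0]))  # split on fuzzy?
```

Running it confirms that splitting on fuzzy? gives the largest likelihood (about -1.91, vs. about -3.30 for red? and -3.37 with no split), so the greedy learner would split on fuzzy? first.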

Real-valued attributes

Multi-way discrete splits

SS#           Temp   Sick?
123-45-6789   36     -
010-10-1010   36.5   +
555-55-1212   41     +
314-15-9265   37     -
271-82-8183   40     +

Split on Temp yields {-, +} and {+, -, +}; split on SS# yields 5 pure leaves. A raw purity or information-gain criterion therefore prefers the useless SS# attribute, which will not generalize.

Pruning

Build the tree on the training set.
Prune on a holdout set: while removing the last split along some path improves holdout error, do so.
If a node N's children are all pruned, then N becomes eligible for pruning.
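A sketch of this holdout (reduced-error) pruning loop; the nested-dict tree representation and helper names are my own, not from the lecture, and ties are pruned too (a common variant of "improves"):

```python
def predict(node, x):
    """node is either {"leaf": label} or
    {"attr": j, "majority": label, "children": {value: subtree}}."""
    while "leaf" not in node:
        child = node["children"].get(x[node["attr"]])
        if child is None:                 # unseen attribute value: back off to majority
            return node["majority"]
        node = child
    return node["leaf"]

def error(node, X, y):
    """Number of holdout mistakes made by the subtree rooted at node."""
    return sum(predict(node, xi) != yi for xi, yi in zip(X, y))

def prune(node, X_hold, y_hold):
    """Bottom-up: prune each child on the holdout examples routed to it; once all
    children are leaves, collapse this split too if that does not hurt holdout error.
    The local check equals the global holdout-error check, since nothing else changes."""
    if "leaf" in node:
        return node
    j = node["attr"]
    for v, child in node["children"].items():
        routed = [(x, t) for x, t in zip(X_hold, y_hold) if x[j] == v]
        Xc, yc = [x for x, _ in routed], [t for _, t in routed]
        node["children"][v] = prune(child, Xc, yc)
    if all("leaf" in c for c in node["children"].values()):
        leaf = {"leaf": node["majority"]}
        if error(leaf, X_hold, y_hold) <= error(node, X_hold, y_hold):
            return leaf
    return node
```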

Prune as rules

Alternatively, convert each leaf to a rule: the conjunction of the tests along its path (test1 $\wedge$ test2 $\wedge$ test3 $\Rightarrow$ label). Then prune: while dropping a test from a rule improves performance, do so.

Bagging

Bagging = bootstrap aggregating.
Can be used with any classifier, but is particularly effective with decision trees.
Generate M bootstrap resamples and train a decision tree on each one.
Final classifier: vote all M trees; e.g., if tree 1 says p(+) = .7 and tree 2 says p(+) = .9, predict p(+) = .8.
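A minimal bagging sketch using scikit-learn trees as the base learner (the helper names and default M are my own illustrative choices; the probability-averaging vote follows the slide's example):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_trees(X, y, M=100, seed=0):
    """Train M trees on bootstrap resamples; also return the sampled indices
    (they are what the out-of-bag estimate on the next slide needs)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees, bags = [], []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)        # n draws with replacement
        trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
        bags.append(idx)
    return trees, bags

def vote(trees, X):
    """Average the trees' class-probability estimates, e.g. .7 and .9 average to .8.
    (Assumes every bootstrap sample saw every class, so the columns line up.)"""
    return np.mean([t.predict_proba(X) for t in trees], axis=0)
```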

Out-of-bag error estimates

Each bag contains about $1 - 1/e \approx 63\%$ of the distinct examples.
Use the out-of-bag examples to estimate the error of each tree.
To estimate the error of the overall vote: for each example, classify it using all of its out-of-bag trees, then average across all examples.
Conservative: for each example we are only averaging over the roughly 37% of trees that are out-of-bag for it, but if we have lots of trees the bias is small.
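Continuing the sketch above (the `bag_trees` helper and its returned bootstrap indices are my own construction), an out-of-bag estimate of the vote's error:

```python
import numpy as np

def oob_error(trees, bags, X, y):
    """For each example, vote only the trees whose bag excluded it (about 1/e of them),
    then average the 0/1 errors over all examples. Assumes labels are 0..C-1 and every
    bag contained every class, so predict_proba columns correspond to the labels."""
    errs = []
    for i in range(len(y)):
        oob = [t for t, idx in zip(trees, bags) if i not in idx]
        if not oob:                               # example landed in every single bag
            continue
        p = np.mean([t.predict_proba(X[i:i+1])[0] for t in oob], axis=0)
        errs.append(int(np.argmax(p) != y[i]))
    return float(np.mean(errs))
```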

Boosting

Voted classifiers

$f : \mathbb{R}^k \to \{-1, +1\}$

Voted classifier: predict + iff $\sum_j f_j(x) > 0$
Weighted vote: predict + iff $\sum_j \alpha_j f_j(x) > 0$ (assume wlog $\alpha_j > 0$; optionally scale so the $\alpha_j$ sum to 1)

Example: 5 halfspaces (or add the constant classifier for $|H| = 6$)
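A tiny sketch of the weighted-vote rule (the example classifiers are arbitrary thresholds of my own, just to make it runnable):

```python
def weighted_vote(classifiers, alphas, x):
    """Each f_j maps x to -1 or +1; predict the sign of the alpha-weighted sum
    (alphas assumed positive, as on the slide)."""
    s = sum(a * f(x) for f, a in zip(classifiers, alphas))
    return +1 if s > 0 else -1

# three threshold classifiers on the real line
fs = [lambda x: +1 if x > 0 else -1,
      lambda x: +1 if x > 1 else -1,
      lambda x: +1 if x > 2 else -1]
print(weighted_vote(fs, [0.5, 0.3, 0.2], 1.5))   # +1: the first two outvote the third
```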

Extending the hypothesis space

[Figure from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2nd ed.).]

Voted classifiers: the matrix

[Matrix with T distinct classifiers ($T < 2^n$) on one axis and the n training examples on the other.]

Finding the best voted classifier