Tutorial 2. CPSC 340: Machine Learning and Data Mining (Fall 2016)


Overview
1. Decision Tree: Decision Stump, Decision Tree
2. Training, Testing, and Validation Set
3. Naive Bayes Classifier

Decision Stump
Decision stump: a simple decision tree with 1 splitting rule based on 1 feature.
Binary example: assigns a label to each leaf based on the most frequent label.
Most intuitive score: classification accuracy.
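The tutorial's code is not included in the transcript; purely as an illustration, fitting a decision stump on dense 0/1 features by classification accuracy might look like the Python sketch below (the function names and structure are my own, not the course code):

```python
import numpy as np

def fit_decision_stump(X, y):
    """Fit a decision stump on a dense 0/1 feature matrix X (n x d) and integer labels y.

    For each feature j, label each side of the split "X[:, j] == 1" with its most
    frequent class and keep the split with the highest classification accuracy.
    """
    n, d = X.shape
    baseline = np.bincount(y).argmax()            # most common label (no-split prediction)
    best = {"feature": None, "label_if_1": baseline, "label_if_0": baseline,
            "accuracy": np.mean(y == baseline)}
    for j in range(d):
        mask = X[:, j] == 1
        if mask.all() or not mask.any():
            continue                               # this feature never splits the data
        label1 = np.bincount(y[mask]).argmax()     # most frequent label where feature is 1
        label0 = np.bincount(y[~mask]).argmax()    # most frequent label where feature is 0
        accuracy = np.mean(np.where(mask, label1, label0) == y)
        if accuracy > best["accuracy"]:
            best = {"feature": j, "label_if_1": label1, "label_if_0": label0,
                    "accuracy": accuracy}
    return best

def predict_stump(stump, X):
    if stump["feature"] is None:
        return np.full(X.shape[0], stump["label_if_1"])
    return np.where(X[:, stump["feature"]] == 1, stump["label_if_1"], stump["label_if_0"])
```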

The newsgroups.mat Dataset
The newsgroups.mat Matlab file contains the following objects:
1. X: a sparse binary matrix. Each row corresponds to a post, and each column corresponds to a word from the word list. A value of 1 means that the word occurred in the post.
2. y: a vector with values 1 through 4, with the value corresponding to the newsgroup that the post came from.
3. Xtest and ytest: the word lists and newsgroup labels for additional newsgroup posts.
4. groupnames: the names of the four newsgroups.
5. wordlist: a list of words that occur in posts to these newsgroups.
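The tutorial works in Matlab; if you wanted to follow along in Python instead, a hedged sketch of loading the file described above with scipy could look like this (field names are taken from the list above; the file path is assumed):

```python
from scipy.io import loadmat

data = loadmat("newsgroups.mat")    # path assumed; loadmat is part of scipy
X = data["X"]                        # sparse binary matrix: one row per post, one column per word
y = data["y"].ravel()                # newsgroup label (1..4) for each post
Xtest = data["Xtest"]
ytest = data["ytest"].ravel()
groupnames = data["groupnames"]      # Matlab cell arrays load as object arrays;
wordlist = data["wordlist"]          # the exact unpacking into Python strings may need tweaking

print(X.shape, y.shape)
```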

Example: Binary Decision Stump for newsgroups.mat

Decision Tree
Decision stumps have only 1 rule based on only 1 feature.
  Very limited class of models: usually not very accurate for most tasks.
Decision trees allow sequences of splits based on multiple features.
  Very general class of models: can get very high accuracy.
  However, it's computationally infeasible to find the best decision tree.
Most common decision tree learning algorithm in practice: greedy recursive splitting.
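Greedy recursive splitting can be sketched on top of the stump code above: fit a stump, partition the data on its feature, and recurse on each half until a maximum depth is reached. Again an illustrative sketch, not the course implementation:

```python
def fit_tree(X, y, max_depth):
    """Greedy recursive splitting: fit a stump, then recurse on each side of its split."""
    stump = fit_decision_stump(X, y)
    node = {"stump": stump, "if_0": None, "if_1": None}
    if max_depth <= 1 or stump["feature"] is None:
        return node                                    # leaf: use the stump's labels
    mask = X[:, stump["feature"]] == 1
    node["if_1"] = fit_tree(X[mask], y[mask], max_depth - 1)
    node["if_0"] = fit_tree(X[~mask], y[~mask], max_depth - 1)
    return node

def predict_tree(node, x):
    """Predict the label for a single example x (a 1-D 0/1 feature vector)."""
    stump = node["stump"]
    if stump["feature"] is None:
        return stump["label_if_1"]                     # constant prediction
    branch = "if_1" if x[stump["feature"]] == 1 else "if_0"
    if node[branch] is None:                           # leaf: use the stump's labels
        return stump["label_if_1"] if branch == "if_1" else stump["label_if_0"]
    return predict_tree(node[branch], x)
```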

Problem 1: Decision Tree for newsgroups.mat
For a maximum depth of 2: (1) draw the learned decision tree, and (2) re-write the learned function as a simple program using if/else statements.


Solution: Decision Tree for newsgroups.mat
Decision tree and if/else statement: shown as figures on the slide; a sketch of the if/else form is given below.
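The actual split words and predicted newsgroups are not in the transcript; as an illustration only, a depth-2 decision tree over binary word features turns into nested if/else statements of this shape (the words below are placeholders, not the learned splits):

```python
def predict_post(post_has_word):
    # post_has_word maps a word to True/False for a single post.
    # The split words and returned labels are placeholders; the real ones come
    # from the depth-2 tree learned on newsgroups.mat.
    if post_has_word["word_A"]:
        if post_has_word["word_B"]:
            return 1        # newsgroup label, as in y
        else:
            return 2
    else:
        if post_has_word["word_C"]:
            return 3
        else:
            return 4
```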

Training, Testing, and Validation Set
Given training data, we would like to learn a model that minimizes error on the testing data.
How do we decide the decision tree depth?
We care about test error, but we can't look at the test data. So what do we do?
One answer: use part of your training data to approximate the test error.
Split the training objects into a training set and a validation set:
  Train the model on the training set.
  Test the model on the validation set.
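As a sketch of this split-and-validate recipe (the course code is not in the transcript; sklearn's DecisionTreeClassifier is used here as a stand-in for the tutorial's tree code, and X, y are the training data loaded from newsgroups.mat):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hold out half of the training data as a validation set.
rng = np.random.default_rng(0)
perm = rng.permutation(X.shape[0])            # X, y: training data from newsgroups.mat
n_val = X.shape[0] // 2
val_idx, tr_idx = perm[:n_val], perm[n_val:]

model = DecisionTreeClassifier(max_depth=2).fit(X[tr_idx], y[tr_idx])
val_error = np.mean(model.predict(X[val_idx]) != y[val_idx])
print("validation error:", val_error)
```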

Cross-Validation
Isn't it wasteful to only use part of your data?
k-fold cross-validation: train on k-1 folds of the data, validate on the remaining fold.
Repeat this k times with different splits, and average the score.
(Figure 1, adapted from Wikipedia, illustrates the k folds.)
Note: if the examples are ordered, the split should be random.
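scikit-learn can run the whole k-fold procedure for a given model; a short sketch (again a Python stand-in, not the tutorial's Matlab code):

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# k-fold cross-validation accuracy for one candidate model (k = 5 here).
# shuffle=True makes the split random, which matters when the examples are ordered.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
model = DecisionTreeClassifier(max_depth=3)
scores = cross_val_score(model, X, y, cv=folds)   # X, y: training data from newsgroups.mat
print(scores.mean())
```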

Problem 2: 2-Fold Cross-Validation for newsgroups.mat
Modify the code below (shown on the slide) to compute the 2-fold cross-validation scores on the training data alone. Find the depth that would be chosen by cross-validation.

Solution: 2-Fold Cross-Validation for newsgroups.mat (the modified code is shown on the slide; a rough equivalent is sketched below)
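The course's solution code itself is not in the transcript; as a sketch only, 2-fold cross-validation over candidate tree depths could look like this in Python, with sklearn's DecisionTreeClassifier standing in for the tutorial's tree learner:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def two_fold_cv_error(X, y, depth, seed=0):
    """2-fold CV: train on one half, validate on the other, then swap and average."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    half = X.shape[0] // 2
    folds = [perm[:half], perm[half:]]
    errors = []
    for v in (0, 1):
        val_idx, tr_idx = folds[v], folds[1 - v]
        model = DecisionTreeClassifier(max_depth=depth).fit(X[tr_idx], y[tr_idx])
        errors.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))
    return np.mean(errors)

depths = range(1, 16)                                  # candidate depths (range is illustrative)
cv_errors = [two_fold_cv_error(X, y, d) for d in depths]
best_depth = list(depths)[int(np.argmin(cv_errors))]
print("chosen depth:", best_depth)
```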


Naive Bayes Classifier
Naive Bayes is a probabilistic classifier:
  Based on Bayes' theorem.
  Strong independence assumption between features.
In the rest of this tutorial:
  We use y_i for the label of object i (element i of y).
  We use x_i for the features of object i (row i of X).
  We use x_{ij} for feature j of object i.
  We use d for the number of features in object i.


Naive Bayes Classifier
Bayes' rule:
    p(y_i | x_i) = p(x_i | y_i) p(y_i) / p(x_i)
Since the denominator does not depend on y_i, we are only interested in the numerator:
    p(y_i | x_i) ∝ p(x_i | y_i) p(y_i)


Naive Bayes Classifier
The numerator is equal to the joint probability:
    p(x_i | y_i) p(y_i) = p(x_i, y_i) = p(x_{i1}, ..., x_{id}, y_i)
Chain rule:
    p(x_{i1}, ..., x_{id}, y_i) = p(x_{i1} | x_{i2}, ..., x_{id}, y_i) p(x_{i2}, ..., x_{id}, y_i)
                                = ...
                                = p(x_{i1} | x_{i2}, ..., x_{id}, y_i) p(x_{i2} | x_{i3}, ..., x_{id}, y_i) ... p(x_{id} | y_i) p(y_i)
Each feature in x_i is assumed to be independent of the others given y_i:
    p(x_{ij} | x_{i,j+1}, ..., x_{id}, y_i) = p(x_{ij} | y_i)
Therefore:
    p(y_i, x_i) = p(y_i) ∏_{j=1}^{d} p(x_{ij} | y_i)
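Turning the final formula into code for binary features (as in newsgroups.mat): below is a minimal Bernoulli naive Bayes sketch, not the course code; the Laplace smoothing in the counts is an addition beyond the slides, to avoid zero probabilities.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate p(y = c) and p(x_j = 1 | y = c) from a dense 0/1 matrix X and labels y.

    (If X is the sparse matrix from newsgroups.mat, convert it first, e.g. X.toarray().)
    """
    classes = np.unique(y)
    p_y = np.array([np.mean(y == c) for c in classes])
    # +1 / +2 is Laplace smoothing -- an addition beyond the slides -- so that a
    # word never seen with a class does not force the whole product to zero.
    p_x_given_y = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                            for c in classes])
    return classes, p_y, p_x_given_y

def predict_naive_bayes(classes, p_y, p_x_given_y, x):
    """Return the class c maximizing p(y = c) * prod_j p(x_j | y = c) for one example x."""
    scores = []
    for k in range(len(classes)):
        per_feature = np.where(x == 1, p_x_given_y[k], 1.0 - p_x_given_y[k])
        scores.append(p_y[k] * np.prod(per_feature))
    return classes[int(np.argmax(scores))]
```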


Problem 4: Naive Bayes Classifier
Given the small table of training examples shown on the slide, with binary features headache, runny nose, and fever and label flu, use naive Bayes to predict whether a new patient with headache = Y, runny nose = N, fever = Y has the flu.


Solution: Naive Bayes Classifier
We need:
    p(flu = N | headache = Y, runny nose = N, fever = Y)
      ∝ p(headache = Y | flu = N) p(runny nose = N | flu = N) p(fever = Y | flu = N) p(flu = N)
      = (1/3)(2/3)(1/3)(1/2) = 0.0370
    p(flu = Y | headache = Y, runny nose = N, fever = Y)
      ∝ p(headache = Y | flu = Y) p(runny nose = N | flu = Y) p(fever = Y | flu = Y) p(flu = Y)
      = (2/3)(1/3)(2/3)(1/2) = 0.0741
Since 0.0741 > 0.0370, naive Bayes predicts flu = Y.
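A quick numerical check of the arithmetic above, which also normalizes the two unnormalized scores into actual posterior probabilities:

```python
# Unnormalized posteriors from the conditional probabilities on the slide.
score_no  = (1/3) * (2/3) * (1/3) * (1/2)   # flu = N
score_yes = (2/3) * (1/3) * (2/3) * (1/2)   # flu = Y
print(round(score_no, 4), round(score_yes, 4))    # 0.037 0.0741

# Normalizing gives the actual posterior probabilities.
total = score_no + score_yes
print(round(score_yes / total, 3))                # p(flu = Y | evidence) = 0.667
```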

Bayes' Theorem
Bayes' theorem enables us to reverse probabilities:
    p(A | B) = p(B | A) p(A) / p(B)

Problem 3: Prosecutor's fallacy
A crime has been committed in a large city and footprints are found at the scene of the crime. The guilty person matches the footprints: p(F | G) = 1. Among the innocent people, 1% match the footprints by chance: p(F | ¬G) = 0.01. A person is interviewed at random and their footprints are found to match those at the crime scene. Determine the probability that the person is guilty, p(G | F), or explain why this is not possible.
Let F be the event that the footprints match, G the event that the person is guilty, and ¬G the event that the person is innocent.

Solution: Prosecutor s fallacy 21/21 p(g F ) = p(f G)p(G) p(f ) = p(f G)p(G) p(f G)p(G) + p(f G)p( G)

Solution: Prosecutor s fallacy 21/21 p(g F ) = p(f G)p(G) p(f ) = p(f G)p(G) p(f G)p(G) + p(f G)p( G) p(g) =? Impossible!