Chapter 6 Classification and Prediction (2)


Outline
Classification and Prediction
Decision Tree
Naïve Bayes Classifier
Support Vector Machines (SVM)
K-nearest Neighbors
Accuracy and Error Measures
Feature Selection Methods
Ensemble Methods
Applications
Summary

Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
Foundation: based on Bayes' theorem.
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.
Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

Bayes' Theorem: Basics
Let X be a data sample ("evidence") whose class label is unknown.
Let H be the hypothesis that X belongs to class C.
Classification is to determine P(H|X), the posterior probability: the probability that the hypothesis holds given the observed data sample X.
P(H) is the prior probability: the initial probability of the hypothesis, independent of X. E.g., the probability that a customer will buy a computer, regardless of age, income, etc.
P(X) is the probability that the sample data is observed (the evidence).
P(X|H) is the likelihood: the probability of observing the sample X given that the hypothesis holds. E.g., given that X will buy a computer, the probability that X's age is 31..40 and income is medium.

Bayes' Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H|X) = P(X|H) P(H) / P(X)

Informally: posterior = likelihood × prior / evidence.
The classifier predicts that X belongs to class Ci iff P(Ci|X) is the highest among P(Ck|X) for all k classes.
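A minimal Python sketch of the rule itself (not from the slides; the numbers below are placeholders chosen only to show the arithmetic):

    def posterior(prior, likelihood, evidence):
        # Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
        return likelihood * prior / evidence

    # hypothetical values for illustration
    print(posterior(prior=0.5, likelihood=0.2, evidence=0.25))  # 0.4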

Towards the Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn).
Suppose there are m classes C1, C2, ..., Cm.
Classification derives the maximum posterior, i.e., the class Ci with maximal P(Ci|X).
By Bayes' theorem:

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.

Derivation of the Naïve Bayes Classifier
Naïve simplifying assumption: attributes are conditionally independent given the class (i.e., there are no dependence relations between attributes), so

    P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)

This greatly reduces the computation cost: only the class distribution needs to be counted.
If attribute Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ, whose probability density function is

    g(x, μ, σ) = exp(−(x − μ)² / (2σ²)) / (√(2π) σ)

and P(xk|Ci) = g(xk, μ_Ci, σ_Ci).
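A short Python sketch of the Gaussian density used for continuous attributes, written directly from the formula above (the sample mean and standard deviation are placeholders):

    import math

    def gaussian(x, mu, sigma):
        # g(x, mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    # hypothetical example: density of age = 35 under a class with mean age 38
    # and standard deviation 10
    print(gaussian(35, mu=38, sigma=10))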

Naïve Bayesian Classifier: Training Dataset
Classes:
C1: buys_computer = yes
C2: buys_computer = no
Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age  income  student  credit_rating  buys_computer
28   high    no       fair           no
25   high    no       excellent      no
35   high    no       fair           yes
45   medium  no       fair           yes
46   low     yes      fair           yes
47   low     yes      excellent      no
35   low     yes      excellent      yes
25   medium  no       fair           no
27   low     yes      fair           yes
59   medium  yes      fair           yes
26   medium  yes      excellent      yes
32   medium  no       excellent      yes
32   high    yes      fair           yes
50   medium  no       excellent      no

Converted Dataset
Coding: Age 1 = <=30, 2 = 31..40, 3 = >40; Income 1 = low, 2 = medium, 3 = high; Student 0 = no, 1 = yes; Credit_rating 1 = fair, 2 = excellent; Buy_computer (class label) 0 = no, 1 = yes.

ID  Age  Income  Student  Credit_rating  Buy_computer
1   1    3       0        1              0
2   1    3       0        2              0
3   2    3       0        1              1
4   3    2       0        1              1
5   3    1       1        1              1
6   3    1       1        2              0
7   2    1       1        2              1
8   1    2       0        1              0
9   1    1       1        1              1
10  3    2       1        1              1
11  1    2       1        2              1
12  2    2       0        2              1
13  2    3       1        1              1
14  3    2       0        2              0
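For the hand computations on the following slides, the converted dataset can be written down as Python rows; this sketch simply transcribes the table above (the variable name is my own):

    # each row: (age, income, student, credit_rating, buys_computer),
    # using the integer codes defined for the converted dataset
    data = [
        (1, 3, 0, 1, 0), (1, 3, 0, 2, 0), (2, 3, 0, 1, 1), (3, 2, 0, 1, 1),
        (3, 1, 1, 1, 1), (3, 1, 1, 2, 0), (2, 1, 1, 2, 1), (1, 2, 0, 1, 0),
        (1, 1, 1, 1, 1), (3, 2, 1, 1, 1), (1, 2, 1, 2, 1), (2, 2, 0, 2, 1),
        (2, 3, 1, 1, 1), (3, 2, 0, 2, 0),
    ]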

Train the Naïve Bayesian Classifier
Priors P(Ci):
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
Compute the conditional probabilities P(xk|Ci) for each class (only part of the classifier is listed):
P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
P(age <= 30 | buys_computer = no) = 3/5 = 0.6
P(age = 31..40 | buys_computer = yes), P(age = 31..40 | buys_computer = no), P(age > 40 | buys_computer = yes), and P(age > 40 | buys_computer = no) are computed analogously.
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.4
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.2
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
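Continuing the sketch from the data rows above, the priors and the listed conditional probabilities can be reproduced by simple counting (an illustration, not part of the original slides):

    from collections import Counter

    n = len(data)
    labels = [row[-1] for row in data]
    prior = {c: count / n for c, count in Counter(labels).items()}
    print(prior)  # {0: 0.357, 1: 0.643} up to rounding

    def cond_prob(attr_index, value, cls):
        # P(x_k = value | C = cls): fraction of class-cls tuples with that value
        in_class = [row for row in data if row[-1] == cls]
        return sum(1 for row in in_class if row[attr_index] == value) / len(in_class)

    print(cond_prob(0, 1, 1))  # P(age <= 30 | buys_computer = yes) = 2/9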

Predict the Class Label
A new record X = (age <= 30, income = medium, student = yes, credit_rating = fair).
Compute P(X|Ci) for each class:
P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
Compute P(X|Ci) P(Ci):
P(X | buys_computer = yes) × P(buys_computer = yes) = 0.028
P(X | buys_computer = no) × P(buys_computer = no) = 0.007
Therefore, X belongs to class buys_computer = yes.

Avoiding the Zero-Probability Problem
Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability

    P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci)

will be zero.
Example: suppose a dataset with 1000 tuples has income = low (0 tuples), income = medium (990 tuples), and income = high (10 tuples).
Use the Laplacian correction (Laplacian estimator): add 1 to each count:
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The corrected probability estimates stay close to their uncorrected counterparts.
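Still continuing the same sketch, the class decision and an optional Laplacian correction can be put together as below (adding the correction to every count and scaling the denominator by the number of attribute values is one common smoothing convention and is my assumption, not taken from the slides):

    def classify(x, laplace=0):
        # x: attribute codes (age, income, student, credit_rating)
        # returns (class, score) maximizing P(C) * prod_k P(x_k | C)
        best_class, best_score = None, -1.0
        for c in prior:
            in_class = [row for row in data if row[-1] == c]
            score = prior[c]
            for k, value in enumerate(x):
                matching = sum(1 for row in in_class if row[k] == value)
                n_values = len({row[k] for row in data})  # distinct values of attribute k
                score *= (matching + laplace) / (len(in_class) + laplace * n_values)
            if score > best_score:
                best_class, best_score = c, score
        return best_class, best_score

    # X = (age <= 30, income = medium, student = yes, credit_rating = fair)
    print(classify((1, 2, 1, 1)))  # (1, ~0.028): buys_computer = yes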

Naïve Bayesian Classifier: Comments
Advantages:
Easy to implement.
Good results obtained in most cases.
Disadvantages:
The class-conditional independence assumption causes a loss of accuracy, because in practice dependencies exist among variables. E.g., in hospitals, patients have a profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.); dependencies among these cannot be modeled by the naïve Bayesian classifier.
How to deal with these dependencies? Bayesian belief networks.

Outline
Classification and Prediction
Decision Tree
Naïve Bayes Classifier
Support Vector Machines (SVM)
K-nearest Neighbors
Accuracy and Error Measures
Feature Selection Methods
Ensemble Methods
Applications
Summary

SVM: Support Vector Machines
A classification method for both linear and nonlinear data.
It uses a nonlinear mapping to transform the original training data into a higher dimension.
In the new dimension, it searches for the linear optimal separating hyperplane (i.e., a "decision boundary").
With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane.
SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors).

SVM: History and Applications
Vapnik and colleagues (1992); groundwork from Vapnik and Chervonenkis' statistical learning theory in the 1960s.
Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization).
Used both for classification and prediction.
Applications: handwritten digit recognition, object recognition, speaker identification, benchmark time-series prediction tests.
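As a hedged illustration of this idea (not part of the original slides), scikit-learn's SVC applies the nonlinear mapping implicitly through a kernel and then searches for a maximum-margin separating hyperplane in the transformed space; the toy data here are generated only for the example:

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # two classes that are not linearly separable in the original 2-D space
    X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

    # the RBF kernel plays the role of the nonlinear mapping to a higher dimension
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X, y)
    print(clf.score(X, y))            # training accuracy
    print(len(clf.support_vectors_))  # number of support vectors found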

SVM: General Philosophy
(Figure: two separating hyperplanes with their support vectors, one giving a small margin and one giving a large margin.)

SVM When Data Is Linearly Separable
Let the data D be (X1, y1), ..., (X|D|, y|D|), where each Xi is a training tuple associated with the class label yi.
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data).
SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).

SVM: Linearly Separable
A separating hyperplane can be written as

    W · X + b = 0

where W = {w1, w2, ..., wn} is a weight vector and b a scalar (bias).
For 2-D data it can be written as w0 + w1 x1 + w2 x2 = 0 (treating the bias as w0).
The hyperplanes defining the sides of the margin are:
H1: w0 + w1 x1 + w2 x2 >= 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 <= -1 for yi = -1.
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors.
Finding the MMH becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints, solved by quadratic programming (QP) with Lagrangian multipliers.

Why Is SVM Effective on High-Dimensional Data?
The complexity of the trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data.
The support vectors are the essential or critical training examples; they lie closest to the decision boundary (MMH).
If all other training examples were removed and the training repeated, the same separating hyperplane would be found.
The number of support vectors can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality.
Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high.
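A small sketch of the linearly separable case (again using scikit-learn as an assumed tool; the six points are made up purely for illustration), showing the learned W, b, and the support vectors that lie on the margin hyperplanes:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
                  [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    # a very large C approximates the hard-margin (linearly separable) formulation
    clf = SVC(kernel="linear", C=1e6)
    clf.fit(X, y)

    w, b = clf.coef_[0], clf.intercept_[0]  # hyperplane W . X + b = 0
    print(w, b)
    print(clf.support_vectors_)             # tuples lying on the margin sides H1 and H2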

SVM Related Links
SVM website: http://www.kernel-machines.org/
Representative implementations:
LIBSVM: an efficient implementation of SVM supporting multi-class classification, nu-SVM, and one-class SVM, with various interfaces (Java, Python, etc.).
SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only the C language.
SVM-torch: another implementation, also written in C.