CS6220: DATA MINING TECHNIQUES
Chapter 8&9: Classification: Part 3

Instructor: Yizhou Sun (yzsun@ccs.neu.edu)
March 12, 2013

Midterm Report

Grade distribution (number of students):
  90-100: 10
  80-89:  16
  70-79:   8
  60-69:   4
  <60:     1

Statistics: count 39, minimum 55.00, maximum 98.00, average 82.54, median 84.00, standard deviation 9.18

Announcements

Midterm solution: https://blackboard.neu.edu/bbcswebdav/pid-12532-dt-wiki-rid-8320466_1/courses/CS6220.32435.201330/mid_term.pdf

Course Project: midterm report due next week
  A draft of the final report
  Don't forget your project title
  Main purpose: check the progress and make sure the project can be finished by the deadline

Chapter 8&9. Classification: Part 3

Bayesian Learning
  Naïve Bayes
  Bayesian Belief Network
Instance-Based Learning
Summary

Bayesian Classification: Why?

Statistical: a Bayesian classifier performs probabilistic prediction, i.e., it predicts class membership probabilities.
Foundation: based on Bayes' theorem.
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers.
Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

Basic Probability Review

Suppose we have two dice, h1 and h2. The probability of rolling an i given die h1 is denoted P(i | h1); this is a conditional probability.
Pick a die at random with probability P(hj), j = 1 or 2. The probability of picking die hj and rolling an i with it is the joint probability, P(i, hj) = P(hj) P(i | hj).
For any events X and Y, P(X, Y) = P(X | Y) P(Y).
If we know P(X, Y), then the so-called marginal probability P(X) can be computed as P(X) = Σ_Y P(X, Y).
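
To make the marginalization concrete, here is a minimal Python sketch (not from the slides; the die probabilities are invented for illustration):

    # Joint distribution P(i, h) for two dice picked with equal probability.
    # Die h1 is fair; die h2 is loaded toward 6. Numbers are illustrative only.
    P_h = {"h1": 0.5, "h2": 0.5}
    P_i_given_h = {
        "h1": {i: 1/6 for i in range(1, 7)},
        "h2": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
    }

    def joint(i, h):
        """P(i, h) = P(h) * P(i | h)."""
        return P_h[h] * P_i_given_h[h][i]

    def marginal_i(i):
        """P(i) = sum over h of P(i, h)."""
        return sum(joint(i, h) for h in P_h)

    print(marginal_i(6))  # 0.5 * (1/6) + 0.5 * 0.5 ≈ 0.333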

Bayes' Theorem: Basics

Bayes' theorem: P(h | X) = P(X | h) P(h) / P(X)

Let X be a data sample ("evidence") and let h be the hypothesis that X belongs to class C.
P(h) (prior probability): the initial probability of the hypothesis. E.g., the probability that X will buy a computer, regardless of age, income, etc.
P(X | h) (likelihood): the probability of observing sample X given that the hypothesis holds. E.g., given that X will buy a computer, the probability that X is 31..40 with medium income.
P(X): the marginal probability that the sample data is observed, P(X) = Σ_h P(X | h) P(h).
P(h | X) (posterior probability): the probability that the hypothesis holds given the observed data sample X.

Classification: Choosing Hypotheses

Maximum likelihood (maximize the likelihood): h_ML = argmax_{h ∈ H} P(D | h)
Maximum a posteriori (maximize the posterior): h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h)
Useful observation: the MAP hypothesis does not depend on the denominator P(D).
D denotes the whole training data set.
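
A small Python sketch (with invented likelihoods and priors) illustrating how the ML and MAP hypotheses can differ when the priors are unequal:

    # Hypothetical likelihoods P(D | h) and priors P(h) for three candidate hypotheses.
    likelihood = {"h1": 0.20, "h2": 0.15, "h3": 0.05}
    prior      = {"h1": 0.10, "h2": 0.60, "h3": 0.30}

    h_ml  = max(likelihood, key=likelihood.get)                 # argmax_h P(D | h)
    h_map = max(prior, key=lambda h: likelihood[h] * prior[h])  # argmax_h P(D | h) P(h)

    print(h_ml, h_map)  # h1 h2: the strong prior pulls the MAP choice toward h2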

Classification by Maximum A Posteriori

Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn).
Suppose there are m classes C1, C2, ..., Cm. Classification derives the maximum posterior, i.e., the class with maximal P(Ci | X).
This follows from Bayes' theorem: P(Ci | X) = P(X | Ci) P(Ci) / P(X)
Since P(X) is constant for all classes, only P(Ci, X) = P(X | Ci) P(Ci) needs to be maximized.

Example: Cancer Diagnosis

A patient takes a lab test with two possible results (+ve, -ve), and the result comes back positive. It is known that the test returns a correct positive result in only 98% of the cases (true positive rate) and a correct negative result in only 97% of the cases (true negative rate). Furthermore, only 0.008 of the entire population has this disease.
1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the diagnosis?

Solution

P(cancer) = 0.008           P(¬cancer) = 0.992
P(+ve | cancer) = 0.98      P(-ve | cancer) = 0.02
P(+ve | ¬cancer) = 0.03     P(-ve | ¬cancer) = 0.97

Using Bayes' formula:
P(cancer | +ve) = P(+ve | cancer) P(cancer) / P(+ve) = 0.98 x 0.008 / P(+ve) = 0.00784 / P(+ve)
P(¬cancer | +ve) = P(+ve | ¬cancer) P(¬cancer) / P(+ve) = 0.03 x 0.992 / P(+ve) ≈ 0.0298 / P(+ve)

Since 0.00784 < 0.0298, the patient most likely does not have cancer.
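
A quick numeric check of this calculation in Python; this is just a sketch that normalizes the two numerators above by P(+ve):

    # Prior and test characteristics from the slide.
    p_cancer = 0.008
    p_pos_given_cancer = 0.98        # true positive rate
    p_pos_given_no_cancer = 0.03     # false positive rate = 1 - true negative rate

    # Unnormalized posteriors (numerators of Bayes' rule).
    num_cancer = p_pos_given_cancer * p_cancer                # 0.00784
    num_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)    # 0.02976

    # Normalize by P(+ve), which is the sum of the numerators.
    p_pos = num_cancer + num_no_cancer
    print(num_cancer / p_pos)     # ≈ 0.21, P(cancer | +ve)
    print(num_no_cancer / p_pos)  # ≈ 0.79, P(¬cancer | +ve)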

Chapter 8&9. Classification: Part 3

Bayesian Learning
  Naïve Bayes
  Bayesian Belief Network
Instance-Based Learning
Summary

Naïve Bayes Classifier

A simplifying assumption: attributes are conditionally independent given the class (class conditional independence):
P(X | Ci) = Π_{k=1..n} P(x_k | Ci) = P(x_1 | Ci) P(x_2 | Ci) ... P(x_n | Ci)
This greatly reduces the computation cost: only the class distribution and per-attribute counts are needed.
P(Ci) = |C_i,D| / |D|, where |C_i,D| is the number of tuples of class Ci in D.
If A_k is categorical, P(x_k | Ci) is the number of tuples in Ci having value x_k for A_k, divided by |C_i,D|.
If A_k is continuous-valued, P(x_k | Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) e^(-(x - μ)² / (2σ²))
and P(x_k | Ci) = g(x_k, μ_Ci, σ_Ci).
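
As an illustration of the continuous case, a minimal Python sketch of the Gaussian density used as P(x_k | Ci); the mean and standard deviation below are invented, not taken from the slides:

    import math

    def gaussian(x, mu, sigma):
        """g(x, mu, sigma): Gaussian density used for continuous attributes."""
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    # Hypothetical numbers: among buyers, age has mean 38 and standard deviation 8.
    print(gaussian(30, mu=38, sigma=8))  # P(age = 30 | buys_computer = yes) ≈ 0.030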

Naïve Bayes Classifier: Training Dataset

Classes: C1: buys_computer = yes; C2: buys_computer = no
Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Naïve Bayes Classifier: An Example

Using the training data above:

Priors P(Ci):
  P(buys_computer = yes) = 9/14 = 0.643
  P(buys_computer = no)  = 5/14 = 0.357

Compute P(x_k | Ci) for each class:
  P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
  P(age <= 30 | buys_computer = no)  = 3/5 = 0.6
  P(income = medium | buys_computer = yes) = 4/9 = 0.444
  P(income = medium | buys_computer = no)  = 2/5 = 0.4
  P(student = yes | buys_computer = yes) = 6/9 = 0.667
  P(student = yes | buys_computer = no)  = 1/5 = 0.2
  P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
  P(credit_rating = fair | buys_computer = no)  = 2/5 = 0.4

For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = no)  = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X | Ci) * P(Ci):
  P(X | buys_computer = yes) * P(buys_computer = yes) = 0.028
  P(X | buys_computer = no)  * P(buys_computer = no)  = 0.007

Therefore, X is assigned to class buys_computer = yes.
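
The counting above can be reproduced with a short Python sketch (not part of the slides; a straightforward tabulation of the 14 tuples):

    from collections import Counter, defaultdict

    # The 14-tuple training set: (age, income, student, credit_rating, buys_computer).
    data = [
        ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
        ("31..40", "high", "no", "fair", "yes"),       (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
        ("31..40", "low", "yes", "excellent", "yes"),  ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
        ("31..40", "high", "yes", "fair", "yes"),      (">40", "medium", "no", "excellent", "no"),
    ]

    class_counts = Counter(row[-1] for row in data)     # {'yes': 9, 'no': 5}
    # attr_counts[(attribute index, value, class)] = count
    attr_counts = defaultdict(int)
    for row in data:
        for k, value in enumerate(row[:-1]):
            attr_counts[(k, value, row[-1])] += 1

    def score(x, c):
        """Unnormalized posterior P(X | c) * P(c) under the naive independence assumption."""
        p = class_counts[c] / len(data)
        for k, value in enumerate(x):
            p *= attr_counts[(k, value, c)] / class_counts[c]
        return p

    x = ("<=30", "medium", "yes", "fair")
    for c in class_counts:
        print(c, round(score(x, c), 3))               # no ≈ 0.007, yes ≈ 0.028
    print(max(class_counts, key=lambda c: score(x, c)))  # yes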

Avoiding the Zero-Probability Problem

Naïve Bayesian prediction requires each conditional probability to be nonzero; otherwise the predicted probability P(X | Ci) = Π_{k=1..n} P(x_k | Ci) will be zero.
Use the Laplacian correction (Laplacian smoothing): add 1 to each count.
P(x_k = j | Ci) = (n_{i,j} + 1) / Σ_j (n_{i,j} + 1), where n_{i,j} is the number of tuples in Ci having value x_k = j.
Example: suppose a class has 1000 tuples, with income = low (0), income = medium (990), and income = high (10):
  Prob(income = low)    = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high)   = 11/1003
The corrected probability estimates are close to their uncorrected counterparts, but none are zero.
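
A minimal Python sketch of the Laplacian correction applied to this income example:

    def laplace(counts):
        """Laplace-smoothed estimates: add 1 to each value's count before normalizing.
        `counts` maps attribute value -> number of tuples in the class with that value."""
        total = sum(counts.values()) + len(counts)
        return {value: (c + 1) / total for value, c in counts.items()}

    print(laplace({"low": 0, "medium": 990, "high": 10}))
    # low = 1/1003 ≈ 0.001, medium = 991/1003 ≈ 0.988, high = 11/1003 ≈ 0.011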

*Notes on Parameter Learning

Why is P(x_k | Ci) estimated in this way? See:
http://www.cs.columbia.edu/~mcollins/em.pdf
http://www.cs.ubc.ca/~murphyk/teaching/cs340-Fall06/reading/NB.pdf

Naïve Bayes Classifier: Comments

Advantages:
  Easy to implement
  Good results obtained in most cases
Disadvantages:
  Assumes class conditional independence, which can cause a loss of accuracy
  In practice, dependencies exist among variables. E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.). Dependencies among these cannot be modeled by a naïve Bayes classifier.
How to deal with these dependencies? Bayesian belief networks.

Chapter 8&9. Classification: Part 3

Bayesian Learning
  Naïve Bayes
  Bayesian Belief Network
Instance-Based Learning
Summary

Bayesian Belief Networks (BNs)

A Bayesian belief network (also known as a Bayesian network or probabilistic network) allows class conditional independencies between subsets of variables.
Two components: (1) a directed acyclic graph (called the structure) and (2) a set of conditional probability tables (CPTs).
A (directed acyclic) graphical model of causal influence relationships:
  Represents dependencies among the variables
  Gives a specification of the joint probability distribution
Nodes: random variables; links: dependencies.
[Figure: example DAG over nodes X, Y, Z, P] X and Y are the parents of Z, and Y is the parent of P. There is no dependency between Z and P conditional on Y. The graph has no cycles.

A Bayesian Network and Some of Its CPTs

[Figure: a Bayesian network with nodes Fire (F), Tampering (T), Smoke (S), Alarm (A), Leaving (L), Report (R); F -> S, F -> A, T -> A, A -> L, L -> R]

CPT: conditional probability tables. A CPT gives the conditional probability of a node for each possible combination of values of its parents.

CPT for Smoke:       F       ¬F
  P(S | parent)      0.90    0.01
  P(¬S | parent)     0.10    0.99

CPT for Alarm:       F,T     F,¬T    ¬F,T    ¬F,¬T
  P(A | parents)     0.5     0.99    0.85    0.0001
  P(¬A | parents)    0.5     0.01    0.15    0.9999

Derivation of the probability of a particular combination of values of X from the CPTs (the joint probability):
P(x_1, ..., x_n) = Π_{i=1..n} P(x_i | Parents(x_i))
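
To make the factorization concrete, a small Python sketch that evaluates the joint probability of one full assignment via the chain rule; the CPT values for Smoke and Alarm are from the tables above, while the priors for Fire and Tampering and the CPTs for Leaving and Report are invented placeholders, since the slide does not show them:

    # Each entry gives P(node = True | parents) for the fire-alarm network above.
    p_fire = 0.01          # illustrative prior, not given on the slide
    p_tampering = 0.02     # illustrative prior, not given on the slide
    p_smoke_given_fire = {True: 0.90, False: 0.01}
    p_alarm_given = {(True, True): 0.5, (True, False): 0.99,
                     (False, True): 0.85, (False, False): 0.0001}
    p_leaving_given_alarm = {True: 0.88, False: 0.001}   # illustrative, not on the slide
    p_report_given_leaving = {True: 0.75, False: 0.01}   # illustrative, not on the slide

    def bern(p, value):
        """P(X = value) for a binary variable with P(X = True) = p."""
        return p if value else 1 - p

    def joint(f, t, s, a, l, r):
        """P(F, T, S, A, L, R) = product over nodes of P(node | parents)."""
        return (bern(p_fire, f) * bern(p_tampering, t)
                * bern(p_smoke_given_fire[f], s)
                * bern(p_alarm_given[(f, t)], a)
                * bern(p_leaving_given_alarm[a], l)
                * bern(p_report_given_leaving[l], r))

    print(joint(True, False, True, True, True, True))  # probability of one full assignment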

Inference in Bayesian Networks

Infer the probability of the values of some variables given observations of other variables, e.g., P(Fire = True | Report = True, Smoke = True).
Computation:
  Exact computation by enumeration
  In general, the problem is NP-hard
  Approximation algorithms are needed

Inference by Enumeration

To compute the posterior marginal P(X_i | E = e), add up all of the relevant terms (atomic event probabilities) from the full joint distribution.
If E are the evidence (observed) variables and Y are the other (unobserved) variables, then:
P(X | e) = α P(X, e) = α Σ_Y P(X, e, Y)
Each P(X, e, Y) term can be computed using the chain rule. This is computationally expensive!

Example: Enumeration

[Figure: a five-node network with A -> B, A -> C, B -> D, C -> D, C -> E]

P(d | e) = α Σ_{A,B,C} P(a, b, c, d, e) = α Σ_{A,B,C} P(a) P(b | a) P(c | a) P(d | b, c) P(e | c)

With simple iteration to compute this expression, there is going to be a lot of repetition (e.g., P(e | c) has to be recomputed every time we iterate over C = true). A solution: variable elimination.
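
A brute-force enumeration sketch in Python for this network; the structure follows the factorization above, but all CPT numbers are invented for illustration:

    from itertools import product

    # Invented CPTs, each giving P(node = True | parents), for the network A->B, A->C, B->D, C->D, C->E.
    pA = 0.3
    pB = {True: 0.8, False: 0.2}                     # P(B | A)
    pC = {True: 0.6, False: 0.1}                     # P(C | A)
    pD = {(True, True): 0.9, (True, False): 0.7,
          (False, True): 0.5, (False, False): 0.05}  # P(D | B, C)
    pE = {True: 0.75, False: 0.2}                    # P(E | C)

    def bern(p, v):
        return p if v else 1 - p

    def joint(a, b, c, d, e):
        """Chain-rule factorization P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)."""
        return (bern(pA, a) * bern(pB[a], b) * bern(pC[a], c)
                * bern(pD[(b, c)], d) * bern(pE[c], e))

    def posterior_d_given_e(e_val=True):
        """P(D | E = e) by summing the joint over the hidden variables A, B, C."""
        scores = {}
        for d_val in (True, False):
            scores[d_val] = sum(joint(a, b, c, d_val, e_val)
                                for a, b, c in product((True, False), repeat=3))
        z = sum(scores.values())            # the normalizer 1/alpha = P(E = e)
        return {d: s / z for d, s in scores.items()}

    print(posterior_d_given_e(True))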

How Are Bayesian Networks Constructed?

Subjective construction: identification of (direct) causal structure
  People are quite good at identifying direct causes from a given set of variables, and at judging whether the set contains all relevant direct causes.
  Markovian assumption: each variable becomes independent of its non-effects once its direct causes are known. E.g., in the network S <- F -> A <- T, the path between S and A is blocked once we know F.
Synthesis from other specifications
  E.g., from a formal system design: block diagrams and information flow
Learning from data
  E.g., from medical records or student admission records
  Learn the parameters given the structure, or learn both the structure and the parameters
  Maximum likelihood principle: favors Bayesian networks that maximize the probability of observing the given data set

Learning Bayesian Networks: Several Scenarios

Scenario 1: Given both the network structure and all variables observable: compute only the CPT entries (easiest case!)
Scenario 2: Network structure known, some variables hidden: gradient descent (greedy hill-climbing) method, i.e., search for a solution along the steepest descent of a criterion function
  Weights are initialized to random probability values
  At each iteration, the method moves toward what appears to be the best solution at the moment, without backtracking
  Weights are updated at each iteration and converge to a local optimum
Scenario 3: Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
Scenario 4: Unknown structure, all variables hidden: no good algorithms known for this purpose

D. Heckerman. A Tutorial on Learning with Bayesian Networks. In Learning in Graphical Models, M. Jordan, ed., MIT Press, 1999.

Chapter 8&9. Classification: Part 3

Bayesian Learning
  Naïve Bayes
  Bayesian Belief Network
Instance-Based Learning
Summary

Lazy vs. Eager Learning

Lazy learning (e.g., instance-based learning): simply stores the training data (with only minor processing) and waits until it is given a test tuple.
Eager learning (the methods discussed above): given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify.
Lazy learning spends less time in training but more time in prediction.
Accuracy:
  A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form an implicit global approximation to the target function.
  An eager method must commit to a single hypothesis that covers the entire instance space.

Lazy Learner: Instance-Based Methods

Instance-based learning: store the training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
Typical approaches:
  k-nearest neighbor: instances are represented as points in a Euclidean space
  Locally weighted regression: constructs a local approximation

The k-Nearest Neighbor Algorithm

All instances correspond to points in n-dimensional space.
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2).
The target function can be discrete- or real-valued.
For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to the query point x_q.
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples. [Figure: scatter of + and - training points around a query point x_q]
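
A minimal Python sketch of discrete-valued k-NN with majority voting (the toy 2-D dataset is invented):

    import math
    from collections import Counter

    def euclidean(x1, x2):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

    def knn_predict(train, x_q, k=3):
        """Return the most common label among the k training points nearest to x_q.
        `train` is a list of (feature_vector, label) pairs."""
        neighbors = sorted(train, key=lambda pair: euclidean(pair[0], x_q))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    # Tiny made-up 2-D dataset.
    train = [((1, 1), "+"), ((2, 1), "+"), ((1, 2), "+"),
             ((6, 6), "-"), ((7, 6), "-"), ((6, 7), "-")]
    print(knn_predict(train, (2, 2), k=3))  # "+"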

Discussion on the k-nn Algorithm k-nn for real-valued prediction for a given unknown tuple Returns the mean values of the k nearest neighbors Distance-weighted nearest neighbor algorithm Weight the contribution of each of the k neighbors according to their distance to the query x q Give greater weight to closer neighbors y q = w iy i w i, where x i s are x q s nearest neighbors Robust to noisy data by averaging k-nearest neighbors Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes w 1 d( x, x ) i 2 q To overcome it, axes stretch or elimination of the least relevant attributes 31

Chapter 8&9. Classification: Part 3

Bayesian Learning
  Naïve Bayes
  Bayesian Belief Network
Instance-Based Learning
Summary

Summary

Bayesian Learning:
  Bayes' theorem
  Naïve Bayes: class conditional independence
  Bayesian belief networks: DAG, conditional probability tables
Instance-Based Learning:
  Lazy learning vs. eager learning
  k-nearest neighbor algorithm