
Machine Learning
Yuh-Jye Lee
Lab of Data Science and Machine Intelligence, Dept. of Applied Math. at NCTU
March 1, 2017

Bayes Rule

Assume that {B_1, B_2, ..., B_k} is a partition of S such that P(B_i) > 0 for i = 1, 2, ..., k. Then

P(B_j | A) = P(A | B_j) P(B_j) / Σ_{i=1}^{k} P(A | B_i) P(B_i)

[Figure: the sample space S partitioned into B_1, B_2, B_3, with the event A intersecting each B_i.]
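As a minimal numeric sketch of the formula (the priors and likelihoods below are invented for illustration, not taken from the lecture), the posterior of each B_j given A follows directly:

priors = [0.5, 0.3, 0.2]            # hypothetical P(B_i)
likelihoods = [0.10, 0.40, 0.25]    # hypothetical P(A | B_i)

evidence = sum(p * l for p, l in zip(priors, likelihoods))              # P(A), by the partition theorem
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]    # P(B_j | A)

print(posteriors, sum(posteriors))  # the posteriors sum to 1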

Applying Bayes' Rule to Classification

Credit Card Scoring: Low-risk vs. High-risk

According to past transactions, some customers are low-risk in that they paid back their loans and the bank profited from them, while other customers are high-risk in that they defaulted. We would like to learn the class of high-risk customers.
We observe a customer's yearly income and savings, which we represent by two random variables X_1 and X_2.
The credibility of a customer is denoted by a Bernoulli random variable C, where C = 1 indicates a high-risk customer and C = 0 indicates a low-risk customer.

Applying Bayes' Rule to Classification

How do we make the decision when a new application arrives?
When a new application arrives with X_1 = x_1 and X_2 = x_2, and we know the probability of C conditioned on the observation x = [x_1, x_2], our decision will be

C = 1 if P(C = 1 | [x_1, x_2]) > 0.5, and C = 0 otherwise.

The probability of error made under this rule is

1 - max{P(C = 1 | [x_1, x_2]), P(C = 0 | [x_1, x_2])} < 0.5

Please note that P(C = 1 | [x_1, x_2]) + P(C = 0 | [x_1, x_2]) = 1.
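A minimal sketch of this decision rule in Python, assuming the posterior P(C = 1 | x) is already available (the value used below is a placeholder, not an estimate from the lecture):

def decide(posterior_high_risk):
    """Return 1 (high-risk) if P(C = 1 | x) > 0.5, else 0 (low-risk)."""
    return 1 if posterior_high_risk > 0.5 else 0

def error_probability(posterior_high_risk):
    """Probability of error of the Bayes decision: 1 - max posterior."""
    return 1.0 - max(posterior_high_risk, 1.0 - posterior_high_risk)

p = 0.7                                  # hypothetical P(C = 1 | [x1, x2])
print(decide(p), error_probability(p))   # -> 1, 0.3 (always < 0.5)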

The Posterior Probability: P(C | x) = P(C) P(x | C) / P(x)

P(C = 1) is called the prior probability that C = 1. In our example, it corresponds to the probability that a customer is high-risk, regardless of the value of x. It is called the prior probability because it is the knowledge we have before looking at the observation x.
P(x | C) is called the class likelihood and is the conditional probability that an event belonging to class C has the associated observation value x.
P(x), the evidence, is the probability that an observation x is seen, regardless of whether it is a positive or negative example.
All of the above can be extracted from past transactions (historical data).
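As a rough sketch of how these quantities might be extracted from historical data, the prior, class likelihood and evidence below are estimated as relative frequencies over discretized observations (the records and variable names are invented for illustration):

from collections import Counter

# Invented historical records: (income level, savings level, C) with C = 1 for high-risk.
records = [("low", "low", 1), ("low", "high", 0), ("high", "high", 0),
           ("low", "low", 0), ("low", "low", 1), ("high", "high", 0),
           ("high", "low", 0)]

n = len(records)
n_high = sum(c for _, _, c in records)
prior = n_high / n                                                     # P(C = 1)

evidence = Counter((inc, sav) for inc, sav, _ in records)              # counts of each observation x
likelihood = Counter((inc, sav) for inc, sav, c in records if c == 1)  # counts of x within class C = 1

x = ("low", "low")
posterior = prior * (likelihood[x] / n_high) / (evidence[x] / n)       # P(C = 1 | x) = P(C = 1) P(x | C = 1) / P(x)
print(posterior)                                                       # -> 2/3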

The Posterior Probability: P(C | x) = P(C) P(x | C) / P(x)

Because of normalization by the evidence, the posteriors sum to 1.
In our example, P(X_1, X_2) is called the joint probability of the two random variables X_1 and X_2.
Under the assumption that these two random variables X_1 and X_2 are conditionally independent given the class, we have P(X_1, X_2 | C) = P(X_1 | C) P(X_2 | C).
This is one of the key assumptions of the Naive Bayes classifier. Although it oversimplifies the problem, it is very easy to use in real applications.
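A small sketch of this factorization: under the conditional independence assumption, the class likelihood of a joint observation is just the product of the per-attribute likelihoods (the probability tables below are hypothetical):

# Hypothetical per-attribute class likelihoods for class C = 1.
p_x1_given_c1 = {"low": 0.8, "high": 0.2}      # P(X1 | C = 1), income level
p_x2_given_c1 = {"low": 0.7, "high": 0.3}      # P(X2 | C = 1), savings level

def joint_likelihood(x1, x2):
    """Naive Bayes assumption: P(X1, X2 | C = 1) = P(X1 | C = 1) * P(X2 | C = 1)."""
    return p_x1_given_c1[x1] * p_x2_given_c1[x2]

print(joint_likelihood("low", "low"))          # -> 0.56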

Extend to Multi-class Classification

We have K mutually exclusive and exhaustive classes C_i, i = 1, 2, ..., K.
For example, in optical digit recognition the input is a bitmap image and there are 10 classes.
We can think of these K classes as defining a partition of the input space. Please refer to the slides on the Partition Theorem and Bayes' Rule.
The Bayes classifier chooses the class with the highest posterior probability; that is, we choose C_i if P(C_i | x) = max_k P(C_k | x).
Question: Is it really important to have P(x), the evidence?
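A brief sketch of the argmax rule, which also hints at the answer to the question: P(x) is the same for every class, so comparing P(C_k) P(x | C_k) already identifies the maximizing class (the priors and likelihoods below are hypothetical):

# Hypothetical priors P(C_k) and class likelihoods P(x | C_k) for K = 3 classes.
priors = {"C1": 0.5, "C2": 0.3, "C3": 0.2}
likelihoods_x = {"C1": 0.02, "C2": 0.10, "C3": 0.05}         # P(x | C_k) for the observed x

scores = {k: priors[k] * likelihoods_x[k] for k in priors}   # unnormalized posteriors
best = max(scores, key=scores.get)                           # same argmax as the true posteriors
print(best, scores)                                          # -> C2; the evidence P(x) is never needed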

Naïve Bayes for Classification

Also good for multi-class classification: estimate the a posteriori probability of the class label.
Let each attribute (variable) be a random variable. What is the probability
Pr(y = 1 | x) = Pr(y = 1 | X_1 = x_1, X_2 = x_2, ..., X_n = x_n)?
Naïve Bayes makes TWO not-so-reasonable assumptions:
The importance of each attribute is equal.
All attributes are conditionally independent given the class!
Pr(y = 1 | x) = (1 / Pr(X = x)) Pr(y = 1) Π_{i=1}^{n} Pr(X_i = x_i | y = 1)
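A compact sketch of this product form as a reusable function; the conditional probability tables passed in are assumed to be estimated elsewhere, and the names and numbers below are illustrative only:

def naive_bayes_score(x, prior, cond_tables):
    """Unnormalized Pr(y | x) = Pr(y) * prod_i Pr(X_i = x_i | y).

    x           : list of attribute values
    prior       : Pr(y) for one class
    cond_tables : list of dicts, cond_tables[i][value] = Pr(X_i = value | y)
    """
    score = prior
    for value, table in zip(x, cond_tables):
        score *= table[value]
    return score

# Illustrative two-attribute example for class y = 1.
tables = [{"sunny": 0.2, "rainy": 0.8}, {"hot": 0.5, "cool": 0.5}]
print(naive_bayes_score(["sunny", "hot"], prior=0.6, cond_tables=tables))  # -> 0.06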

The Weather Data Example (Ian H. Witten & Eibe Frank, Data Mining)

Outlook   Temperature  Humidity  Windy  Play (Label)
Sunny     Hot          High      False  -1
Sunny     Hot          High      True   -1
Overcast  Hot          High      False  +1
Rainy     Mild         High      False  +1
Rainy     Cool         Normal    False  +1
Rainy     Cool         Normal    True   -1
Overcast  Cool         Normal    True   +1
Sunny     Mild         High      False  -1
Sunny     Cool         Normal    False  +1
Rainy     Mild         Normal    False  +1
Sunny     Mild         Normal    True   +1
Overcast  Mild         High      True   +1
Overcast  Hot          Normal    False  +1
Rainy     Mild         High      True   -1
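A small sketch that encodes this table and tallies the class and class-conditional counts that the Naïve Bayes estimates on the following slides are built from (the variable names are mine):

from collections import Counter, defaultdict

# (Outlook, Temperature, Humidity, Windy, Play) rows from the table above.
weather = [
    ("Sunny", "Hot", "High", False, -1), ("Sunny", "Hot", "High", True, -1),
    ("Overcast", "Hot", "High", False, +1), ("Rainy", "Mild", "High", False, +1),
    ("Rainy", "Cool", "Normal", False, +1), ("Rainy", "Cool", "Normal", True, -1),
    ("Overcast", "Cool", "Normal", True, +1), ("Sunny", "Mild", "High", False, -1),
    ("Sunny", "Cool", "Normal", False, +1), ("Rainy", "Mild", "Normal", False, +1),
    ("Sunny", "Mild", "Normal", True, +1), ("Overcast", "Mild", "High", True, +1),
    ("Overcast", "Hot", "Normal", False, +1), ("Rainy", "Mild", "High", True, -1),
]

class_counts = Counter(row[-1] for row in weather)            # {+1: 9, -1: 5}
cond_counts = defaultdict(Counter)                            # (attribute index, class) -> value counts
for row in weather:
    label = row[-1]
    for i, value in enumerate(row[:-1]):
        cond_counts[(i, label)][value] += 1

print(class_counts)
print(cond_counts[(0, +1)])   # Outlook counts given Play = +1: Sunny 2, Overcast 4, Rainy 3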

MLE for Bernoulli Distribution (play vs. not play)

Likelihood function: the probability of observing the random sample X = {x^t}_{t=1}^{N} is

Π_{t=1}^{N} p^{x^t} (1 - p)^{1 - x^t}

Why don't we choose the parameter p that maximizes the probability of observing the random sample X = {x^t}_{t=1}^{N}?
Based on MLE, we will choose the parameter

p̂ = (Σ_{t=1}^{N} x^t) / N
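As a quick sketch on the weather data, coding Play = +1 as 1 and Play = -1 as 0, the MLE of the probability of playing is just the sample mean:

# Play labels from the weather table, coded as 1 (play) / 0 (not play).
x = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

p_hat = sum(x) / len(x)     # MLE for a Bernoulli parameter: the sample mean
print(p_hat)                # -> 0.642857... = 9/14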

MLE for Multinomial Distribution (Sunny, Cloudy and Rainy)

Consider the generalization of the Bernoulli distribution where, instead of two possible outcomes, the outcome of a random event is one of k classes, each of which occurs with probability p_i and Σ_{i=1}^{k} p_i = 1.
Let x_1, x_2, ..., x_k be k indicator variables, where x_i = 1 if the outcome is class i and x_i = 0 otherwise, i.e.,

P(x_1, x_2, ..., x_k) = Π_{i=1}^{k} p_i^{x_i}

Let X = {x^t}_{t=1}^{N} be N independent random experiments. Based on MLE, we will choose the parameters

p̂_i = (Σ_{t=1}^{N} x_i^t) / N,  i = 1, 2, ..., k
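As a short sketch, the multinomial MLE is the relative frequency of each category; here it is applied to the Outlook column of the weather data:

from collections import Counter

outlook = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
           "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"]

counts = Counter(outlook)
p_hat = {value: counts[value] / len(outlook) for value in counts}
print(p_hat)   # -> Sunny 5/14, Overcast 4/14, Rainy 5/14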

Probabilities for Weather Data Using Maximum Likelihood Estimation

Based on MLE, we will choose the parameters
p̂_i = (Σ_{t=1}^{N} x_i^t) / N,  i = 1, 2, ..., k

Outlook   Yes  No    Temp.  Yes  No    Humidity  Yes  No    Windy  Yes  No    Play  Yes   No
Sunny     2/9  3/5   Hot    2/9  2/5   High      3/9  4/5   T      3/9  3/5         9/14  5/14
Overcast  4/9  0/5   Mild   4/9  3/5   Normal    6/9  1/5   F      6/9  2/5
Rainy     3/9  2/5   Cool   3/9  1/5

Likelihood of the two classes:
Pr(y = +1 | sunny, cool, high, T) ∝ 2/9 * 3/9 * 3/9 * 3/9 * 9/14
Pr(y = -1 | sunny, cool, high, T) ∝ 3/5 * 1/5 * 4/5 * 3/5 * 5/14
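A quick numeric sketch that reproduces these two scores and, going one small step beyond the slide, normalizes them into posteriors:

score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # Pr(y = +1 | sunny, cool, high, T), unnormalized
score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # Pr(y = -1 | sunny, cool, high, T), unnormalized

print(score_yes, score_no)                 # ~0.0053 vs ~0.0206
print(score_yes / (score_yes + score_no))  # normalized Pr(y = +1 | x) ~ 0.20, so predict "no play"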

Zero-frequency Problem

What if an attribute value does NOT occur with a class value? The posterior probability will be zero, no matter how likely the other attribute values are!
The Laplace estimator fixes the zero-frequency problem:

(k + λ) / (n + aλ)

where k is the observed count of the value, n is the total number of observations, a is the number of possible values, and λ is the smoothing parameter.
Question: Roll a die 8 times. The outcomes are: 2, 5, 6, 2, 1, 5, 3, 6. What is the probability of showing a 4?

Pr(X = 4) = (0 + λ) / (8 + 6λ),  Pr(X = 5) = (2 + λ) / (8 + 6λ)
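A minimal sketch of the Laplace estimator on the die example (λ = 1 below is just a common choice, not specified on the slide):

from collections import Counter

rolls = [2, 5, 6, 2, 1, 5, 3, 6]
a = 6          # number of possible outcomes of the die
lam = 1.0      # smoothing parameter lambda; lambda = 1 is the classic Laplace estimator

counts = Counter(rolls)
n = len(rolls)

def laplace(value):
    """(k + lambda) / (n + a * lambda), with k the observed count of `value`."""
    return (counts[value] + lam) / (n + a * lam)

print(laplace(4), laplace(5))   # -> 1/14 and 3/14; no outcome gets probability zero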