UVA CS / Introduction to Machine Learning and Data Mining

UVA CS 4501-001 / 6501-007 Introduction to Machine Learning and Data Mining
Lecture 13: Probability and Statistics Review (cont.) + Naïve Bayes Classifier
Yanjun Qi / Jane, PhD
University of Virginia, Department of Computer Science

Announcements: Schedule Change
The midterm is rescheduled to Oct. 30th, 3:35pm-4:35pm.
Homework 4 consists entirely of sample midterm questions.
HW3 will be out next Monday, due on Oct 25th.
HW4 will be out next Tuesday, due on Oct 29th (i.e., good preparation for the midterm; solutions will be released before the due time).
Grading of HW1 will be available to you this evening; solutions of HW1 will be available this Wednesday.
Grading of HW2 will be available to you next week; solutions of HW2 will be available next week.

Where are we? → Five major sections of this course
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models

Where are we? → Three major sections for classification
We can divide the large variety of classification approaches into roughly three major types:
1. Discriminative: directly estimate a decision rule/boundary, e.g., support vector machine, decision tree.
2. Generative: build a generative statistical model, e.g., naïve Bayes classifier, Bayesian networks.
3. Instance-based classifiers: use observations directly (no models), e.g., K nearest neighbors.

Last Lecture Recap: Probability Review
- The big picture
- Events and event spaces
- Random variables
- Joint probability distributions, marginalization, conditioning, chain rule, Bayes rule, law of total probability, etc.
- Structural properties: independence, conditional independence
- Mean and variance

The Big Picture
Probability: from a model (i.e., the data generating process) to observed data.
Estimation / learning / inference / data mining: from observed data back to a model.
But how to specify a model?

Probability as a measure of uncertainty
Probability is a measure of how certain we are that an event will take place; i.e., in the example we were measuring the chance of hitting the shaded area:
prob = #RedBoxes / #Boxes
Adapted from Prof. Nando de Freitas's review slides.

e.g. Coin Flips
You flip a coin: Head with probability 0.5.
You flip 100 coins: how many heads would you expect?

e.g. Coin Flips (cont.)
You flip a coin: Head with probability p. This is a binary random variable, a Bernoulli trial with success probability p.
You flip k coins: how many heads would you expect? The number of heads X is a discrete random variable following a Binomial distribution with parameters k and p.

Discrete Random Variables
Random variables (RVs) which may take on only a countable number of distinct values, e.g., the total number of heads X you get if you flip 100 coins.
X is an RV with arity k if it can take on exactly one value out of {x_1, ..., x_k}, e.g., the possible values that X can take on are 0, 1, 2, ..., 100.
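To make the Bernoulli/Binomial connection concrete, here is a minimal Python sketch (not part of the original slides; the function name flip_coins is purely illustrative) that simulates k coin flips and compares the simulated average number of heads with the Binomial expectation k·p.

```python
import random

# Minimal sketch (not from the slides): simulate flipping k coins, each a
# Bernoulli trial with success probability p, and compare the average number
# of heads against the theoretical expectation E[X] = k * p of Binomial(k, p).
def flip_coins(k, p):
    """Return the number of heads in k independent Bernoulli(p) flips."""
    return sum(1 for _ in range(k) if random.random() < p)

k, p = 100, 0.5
trials = 10_000
average_heads = sum(flip_coins(k, p) for _ in range(trials)) / trials
print(f"Simulated mean: {average_heads:.2f}, expected k*p = {k * p}")
```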

Probability of a Discrete RV
Probability mass function (pmf): P(X = x_i)
Easy facts about the pmf:
- Σ_i P(X = x_i) = 1
- P(X = x_i ∧ X = x_j) = 0 if i ≠ j
- P(X = x_i ∪ X = x_j) = P(X = x_i) + P(X = x_j) if i ≠ j
- P(X = x_1 ∪ X = x_2 ∪ ... ∪ X = x_k) = 1

Common Distributions
Uniform: X ~ U{1, ..., N}; X takes values 1, 2, ..., N with P(X = i) = 1/N. E.g., picking balls of different colors from a box.
Binomial: X ~ Bin(n, p); X takes values 0, 1, ..., n with P(X = i) = C(n, i) p^i (1 - p)^(n-i). E.g., coin flips.
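As a quick sanity check, the following Python sketch (not from the slides; function names and the chosen N, n, p are illustrative) evaluates the uniform and binomial pmfs above and verifies that each sums to 1, as the "easy facts" require.

```python
from math import comb

# Minimal sketch (not from the slides): evaluate the two pmfs above and check
# that each sums to 1.
def uniform_pmf(i, N):
    """P(X = i) = 1/N for the uniform distribution on {1, ..., N}."""
    return 1 / N

def binomial_pmf(i, n, p):
    """P(X = i) = C(n, i) * p**i * (1 - p)**(n - i) for X ~ Bin(n, p)."""
    return comb(n, i) * p**i * (1 - p)**(n - i)

N, n, p = 4, 10, 0.5
print(sum(uniform_pmf(i, N) for i in range(1, N + 1)))    # -> 1.0
print(sum(binomial_pmf(i, n, p) for i in range(n + 1)))   # -> 1.0
```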

Coin Flips of Two Persons
Your friend and you both flip coins, Head with probability 0.5. You flip 50 times; your friend flips 100 times. How many heads will each of you get?

Joint Distribution
Given two discrete RVs X and Y, their joint distribution is the distribution of X and Y together.
E.g., P(You get 21 heads AND your friend gets 70 heads).
E.g., Σ_{i=0}^{50} Σ_{j=0}^{100} P(You get i heads AND your friend gets j heads) = 1.
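To illustrate the joint distribution in this two-person example, here is a small Python sketch (not part of the slides; it assumes the two sets of flips are independent, which the example implies) that tabulates P(X = i, Y = j) and checks that it sums to 1.

```python
from math import comb

# Minimal sketch (not from the slides): the joint distribution of
# X = # heads in your 50 flips and Y = # heads in your friend's 100 flips.
# Since the two sets of flips are independent, P(X=i, Y=j) = P(X=i) * P(Y=j).
def binomial_pmf(i, n, p=0.5):
    return comb(n, i) * p**i * (1 - p)**(n - i)

joint = {(i, j): binomial_pmf(i, 50) * binomial_pmf(j, 100)
         for i in range(51) for j in range(101)}

print(joint[(21, 70)])        # P(you get 21 heads AND your friend gets 70)
print(sum(joint.values()))    # sums to (approximately) 1
```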

Conditional Probability
P(X = x | Y = y) is the probability of X = x, given the occurrence of Y = y. E.g., you get 0 heads, given that your friend gets 61 heads.
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)

Conditional Probability
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y), where X = x and Y = y are events.
But we normally write it this way: p(x | y) = p(x, y) / p(y)
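The next sketch (not from the slides; the toy joint table and its numbers are made up purely for illustration) computes a conditional probability from a joint table by dividing by the marginal of the conditioning event.

```python
# Minimal sketch (not from the slides): compute P(X = x | Y = y) from a small
# joint table by dividing the joint probability by the marginal P(Y = y).
joint = {
    ("rain", "cloudy"): 0.30,
    ("rain", "sunny"):  0.05,
    ("dry",  "cloudy"): 0.20,
    ("dry",  "sunny"):  0.45,
}

def conditional(x, y):
    """P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return joint[(x, y)] / p_y

print(conditional("rain", "cloudy"))   # 0.30 / 0.50 = 0.6
```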

Law of Total Probability
Given two discrete RVs X and Y, which take values in {x_1, ..., x_m} and {y_1, ..., y_n}, we have
P(X = x_i) = Σ_j P(X = x_i, Y = y_j) = Σ_j P(X = x_i | Y = y_j) P(Y = y_j)

Marginalization
[Figure: the sample space partitioned into regions B_1, ..., B_7, with an event A overlapping several of them.]
Marginal probability from joint probability: P(X = x_i) = Σ_j P(X = x_i, Y = y_j)
Marginal probability from conditional probability: P(X = x_i) = Σ_j P(X = x_i | Y = y_j) P(Y = y_j)
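The following Python sketch (not from the slides; it reuses the same made-up toy joint table) computes the same marginal P(X = x) two ways: by summing the joint over Y, and via the law of total probability.

```python
# Minimal sketch (not from the slides): marginalization over a toy joint table.
joint = {
    ("rain", "cloudy"): 0.30,
    ("rain", "sunny"):  0.05,
    ("dry",  "cloudy"): 0.20,
    ("dry",  "sunny"):  0.45,
}

# Marginal probability: P(X = x) = sum_j P(X = x, Y = y_j)
def marginal_x(x):
    return sum(p for (xx, _), p in joint.items() if xx == x)

# Law of total probability: P(X = x) = sum_j P(X = x | Y = y_j) P(Y = y_j)
def total_probability_x(x):
    ys = {yy for (_, yy) in joint}
    total = 0.0
    for y in ys:
        p_y = sum(p for (_, yy), p in joint.items() if yy == y)
        p_x_given_y = joint[(x, y)] / p_y
        total += p_x_given_y * p_y
    return total

print(marginal_x("rain"), total_probability_x("rain"))   # both ≈ 0.35
```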

Bayes Rule
X and Y are discrete RVs:
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
P(X = x_i | Y = y_j) = P(Y = y_j | X = x_i) P(X = x_i) / Σ_k P(Y = y_j | X = x_k) P(X = x_k)

Today: Naïve Bayes Classifier
- Probability review: structural properties, i.e., independence, conditional independence
- Naïve Bayes classifier: spam email classification
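Here is a minimal Python sketch of Bayes rule (not from the slides; the prior and likelihood numbers, and the "spam"/"ham" labels, are invented for illustration): the denominator is computed with the law of total probability from the previous slide.

```python
# Minimal sketch (not from the slides): Bayes rule with made-up numbers.
# Prior P(X = x_i) and likelihood P(Y = y | X = x_i) for two values of X.
prior = {"spam": 0.2, "ham": 0.8}                        # P(X = x_i)
likelihood = {"spam": 0.7, "ham": 0.1}                   # P(Y = y | X = x_i)

evidence = sum(likelihood[x] * prior[x] for x in prior)  # P(Y = y)
posterior = {x: likelihood[x] * prior[x] / evidence for x in prior}
print(posterior)   # ≈ {'spam': 0.64, 'ham': 0.36}
```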

Independent RVs
Intuition: X and Y are independent means that observing X = x neither makes it more nor less probable that Y = y.
Definition: X and Y are independent iff P(X = x, Y = y) = P(X = x) P(Y = y)

More on Independence
P(X = x, Y = y) = P(X = x) P(Y = y)
P(X = x | Y = y) = P(X = x)
P(Y = y | X = x) = P(Y = y)
E.g., no matter how many heads you get, your friend will not be affected, and vice versa.
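The definition can be checked numerically. The sketch below (not from the slides; the toy joint table is made up) computes both marginals and tests whether the joint factorizes as their product for every cell.

```python
# Minimal sketch (not from the slides): checking the independence definition
# P(X=x, Y=y) = P(X=x) P(Y=y) on a toy joint table.
joint = {
    ("rain", "cloudy"): 0.30,
    ("rain", "sunny"):  0.05,
    ("dry",  "cloudy"): 0.20,
    ("dry",  "sunny"):  0.45,
}

def marginal(joint, axis, value):
    return sum(p for key, p in joint.items() if key[axis] == value)

independent = all(
    abs(p - marginal(joint, 0, x) * marginal(joint, 1, y)) < 1e-12
    for (x, y), p in joint.items()
)
print(independent)   # False: here, knowing Y changes the probability of X
```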

More on Independence
X is independent of Y means that knowing Y does not change our belief about X:
P(X | Y = y) = P(X)
P(X = x, Y = y) = P(X = x) P(Y = y)
The above should hold for all x_i, y_j. Independence is symmetric and written as X ⊥ Y.

Conditionally Independent RVs
Intuition: X and Y are conditionally independent given Z means that once Z is known, the value of X does not add any additional information about Y.
Definition: X and Y are conditionally independent given Z iff
P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z), written X ⊥ Y | Z.

More on Conditional Independence
P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)
P(X = x | Y = y, Z = z) = P(X = x | Z = z)
P(Y = y | X = x, Z = z) = P(Y = y | Z = z)

Today: Naïve Bayes Classifier
- Probability review: structural properties, i.e., independence, conditional independence
- Naïve Bayes classifier: spam email classification
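To see conditional independence in action, the sketch below (not from the slides; the variables "flu"/"fever"/"cough" and all numbers are invented) builds a joint of the form P(Z) P(X|Z) P(Y|Z) and verifies the first identity above for every cell.

```python
import itertools

# Minimal sketch (not from the slides): construct a joint P(X, Y, Z) of the
# form P(Z) P(X|Z) P(Y|Z) and check that
# P(X=x, Y=y | Z=z) = P(X=x | Z=z) P(Y=y | Z=z) for all cells.
p_z = {"flu": 0.1, "healthy": 0.9}
p_x_given_z = {"flu": {"fever": 0.8, "no_fever": 0.2},
               "healthy": {"fever": 0.05, "no_fever": 0.95}}
p_y_given_z = {"flu": {"cough": 0.7, "no_cough": 0.3},
               "healthy": {"cough": 0.1, "no_cough": 0.9}}

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for z in p_z
         for x, y in itertools.product(p_x_given_z[z], p_y_given_z[z])}

ok = all(
    abs(joint[(x, y, z)] / p_z[z] - p_x_given_z[z][x] * p_y_given_z[z][y]) < 1e-12
    for (x, y, z) in joint
)
print(ok)   # True: X and Y are conditionally independent given Z
```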

Where are we? → Three major sections for classification
We can divide the large variety of classification approaches into roughly three major types:
1. Discriminative: directly estimate a decision rule/boundary, e.g., support vector machine, decision tree.
2. Generative: build a generative statistical model, e.g., naïve Bayes classifier, Bayesian networks.
3. Instance-based classifiers: use observations directly (no models), e.g., K nearest neighbors.

A Dataset for Classification
Output as discrete class label: C_1, C_2, ..., C_L
Data / points / instances / examples / samples / records: [rows]
Features / attributes / dimensions / independent variables / covariates / predictors / regressors: [columns, except the last]
Target / outcome / response / label / dependent variable: special column to be predicted [last column]

Bayes Classifiers
Treat each attribute and the class label as random variables. Given a sample x with attributes (x_1, x_2, ..., x_p), the goal is to predict class C. Specifically, we want to find the value of C_i that maximizes p(C_i | x_1, x_2, ..., x_p).
Can we estimate p(C_i | x) = p(C_i | x_1, x_2, ..., x_p) directly from data?

Bayes Classifiers → MAP Classification Rule
Establishing a probabilistic model for classification → MAP classification rule.
MAP: Maximum A Posteriori. Assign x to c* if
P(C = c* | X = x) > P(C = c | X = x), for all c ≠ c*, c = c_1, ..., c_L
Adapted from Prof. Ke Chen's NB slides.

Review: Bayesian Rule
Prior, conditional and marginal probability:
- Prior probability: P(C), i.e., P(C_1), P(C_2), ..., P(C_L)
- Likelihood (through a generative model): P(X | C)
- Evidence (marginal probability of the sample): P(X)
- Posterior probability: P(C | X), i.e., P(C_1 | x), P(C_2 | x), ..., P(C_L | x)
Bayesian rule: P(C | X) = P(X | C) P(C) / P(X), i.e., Posterior = Likelihood × Prior / Evidence.
Adapted from Prof. Ke Chen's NB slides.

Bayes Classification Rule
Establishing a probabilistic model for classification:
(1) Discriminative model: P(C | X), C = c_1, ..., c_L, X = (X_1, ..., X_n)
[Figure: a discriminative probabilistic classifier takes x = (x_1, x_2, ..., x_n) as input and outputs P(c_1 | x), P(c_2 | x), ..., P(c_L | x).]
Adapted from Prof. Ke Chen's NB slides.

Bayes Classification Rule
Establishing a probabilistic model for classification (cont.):
(2) Generative model: P(X | C), C = c_1, ..., c_L, X = (X_1, ..., X_p)
[Figure: one generative probabilistic model per class; the model for class i takes x = (x_1, x_2, ..., x_p) and outputs P(x | c_i), for i = 1, ..., L.]
Adapted from Prof. Ke Chen's NB slides.

Bayes Classification Rule
MAP classification rule (MAP: Maximum A Posteriori): assign x to c* if
P(C = c* | X = x) > P(C = c | X = x), for all c ≠ c*, c = c_1, ..., c_L
Generative classification with the MAP rule: apply the Bayesian rule to convert the class-conditional likelihoods into posterior probabilities,
P(C = c_i | X = x) = P(X = x | C = c_i) P(C = c_i) / P(X = x) ∝ P(X = x | C = c_i) P(C = c_i), for i = 1, 2, ..., L,
then apply the MAP rule.
Adapted from Prof. Ke Chen's NB slides.
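The following Python sketch (not from the slides) illustrates generative classification with the MAP rule; the class labels c1-c3, the priors, and the class-conditional likelihoods for one fixed observation x are all made-up placeholders.

```python
# Minimal sketch (not from the slides): generative classification with the MAP
# rule for a single fixed observation x, using made-up numbers.
prior = {"c1": 0.5, "c2": 0.3, "c3": 0.2}                 # P(C = c_i)
likelihood_of_x = {"c1": 0.02, "c2": 0.10, "c3": 0.05}    # P(X = x | C = c_i)

# P(C = c_i | X = x) is proportional to P(X = x | C = c_i) P(C = c_i),
# so the MAP label is the argmax of the unnormalized products.
scores = {c: likelihood_of_x[c] * prior[c] for c in prior}
map_label = max(scores, key=scores.get)
print(scores, map_label)   # c2 wins: 0.03 vs. 0.01 for c1 and c3
```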

Naïve Bayes
Bayes classification: P(C | X) ∝ P(X | C) P(C) = P(X_1, ..., X_p | C) P(C).
Difficulty: learning the joint probability P(X_1, ..., X_p | C).
Naïve Bayes classification: assume that all input attributes are conditionally independent given the class!
P(X_1, X_2, ..., X_p | C) = P(X_1 | X_2, ..., X_p, C) P(X_2, ..., X_p | C)
= P(X_1 | C) P(X_2, ..., X_p | C)
= P(X_1 | C) P(X_2 | C) ... P(X_p | C)
MAP classification rule: for x = (x_1, x_2, ..., x_p), assign label c* if
[P(x_1 | c*) ... P(x_p | c*)] P(c*) > [P(x_1 | c) ... P(x_p | c)] P(c), for all c ≠ c*, c = c_1, ..., c_L
Adapted from Prof. Ke Chen's NB slides.

Naïve Bayes
Naïve Bayes Algorithm (for discrete input attributes)
Learning phase: given a training set S,
- for each target value c_i (c_i = c_1, ..., c_L), estimate P̂(C = c_i) with examples in S;
- for every attribute value x_jk of each attribute X_j (j = 1, ..., p; k = 1, ..., K_j), estimate P̂(X_j = x_jk | C = c_i) with examples in S.
Output: conditional probability tables; for X_j, K_j × L elements.
Test phase: given an unknown instance x' = (a'_1, ..., a'_p), look up the tables to assign the label c* to x' if
[P̂(a'_1 | c*) ... P̂(a'_p | c*)] P̂(c*) > [P̂(a'_1 | c) ... P̂(a'_p | c)] P̂(c), for all c ≠ c*, c = c_1, ..., c_L
Adapted from Prof. Ke Chen's NB slides.
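Below is a minimal Python sketch of this algorithm (not from the slides; the function names nb_learn and nb_predict are illustrative). The learning phase estimates the priors and conditional probability tables by counting, and the test phase picks the class maximizing the product of looked-up probabilities; smoothing and log-probabilities are deliberately omitted to stay close to the slide.

```python
from collections import Counter, defaultdict

# Minimal sketch (not from the slides) of Naïve Bayes for discrete attributes.
def nb_learn(samples, labels):
    """Estimate P(C=c) and P(X_j = value | C = c) by counting over (samples, labels)."""
    n = len(labels)
    class_count = Counter(labels)
    prior = {c: cnt / n for c, cnt in class_count.items()}
    cond = defaultdict(lambda: defaultdict(Counter))   # cond[class][attr_index][value] -> count
    for x, c in zip(samples, labels):
        for j, value in enumerate(x):
            cond[c][j][value] += 1
    tables = {c: {j: {v: cnt / class_count[c] for v, cnt in counter.items()}
                  for j, counter in per_attr.items()}
              for c, per_attr in cond.items()}
    return prior, tables

def nb_predict(x, prior, tables):
    """Return the class maximizing P(a_1|c) ... P(a_p|c) P(c)."""
    def score(c):
        s = prior[c]
        for j, value in enumerate(x):
            s *= tables[c][j].get(value, 0.0)   # unseen value -> probability 0 (no smoothing)
        return s
    return max(prior, key=score)
```

Running nb_learn on the 14-instance Play Tennis training set behind the next slide should reproduce the conditional probability tables shown there, and nb_predict then implements the look-up-and-compare test phase.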

Example: Play Tennis
Learning Phase: conditional probability tables P(X_j | C) and class priors.

Outlook      | Play=Yes | Play=No
Sunny        |   2/9    |   3/5
Overcast     |   4/9    |   0/5
Rain         |   3/9    |   2/5

Temperature  | Play=Yes | Play=No
Hot          |   2/9    |   2/5
Mild         |   4/9    |   2/5
Cool         |   3/9    |   1/5

Humidity     | Play=Yes | Play=No
High         |   3/9    |   4/5
Normal       |   6/9    |   1/5

Wind         | Play=Yes | Play=No
Strong       |   3/9    |   3/5
Weak         |   6/9    |   2/5

Priors: P(Play=Yes) = 9/14, P(Play=No) = 5/14

Example: Test Phase
Given a new instance x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong), look up the tables:
P(Outlook=Sunny | Play=Yes) = 2/9        P(Outlook=Sunny | Play=No) = 3/5
P(Temperature=Cool | Play=Yes) = 3/9     P(Temperature=Cool | Play=No) = 1/5
P(Humidity=High | Play=Yes) = 3/9        P(Humidity=High | Play=No) = 4/5
P(Wind=Strong | Play=Yes) = 3/9          P(Wind=Strong | Play=No) = 3/5
P(Play=Yes) = 9/14                       P(Play=No) = 5/14
MAP rule:
P(Yes | x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No | x') ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Since P(Yes | x') < P(No | x'), we label x' as No.
Adapted from Prof. Ke Chen's NB slides.

Next: Naïve Bayes Classifier
- Probability review: structural properties, i.e., independence, conditional independence
- Naïve Bayes classifier: text article classification
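The two unnormalized scores above can be checked with a few lines of Python arithmetic (not from the slides) using exact fractions from the looked-up tables:

```python
from fractions import Fraction as F

# Minimal sketch (not from the slides): reproduce the two unnormalized MAP
# scores for x' = (Sunny, Cool, High, Strong).
score_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
score_no  = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

print(float(score_yes))   # ≈ 0.0053
print(float(score_no))    # ≈ 0.0206
print("No" if score_no > score_yes else "Yes")   # -> No
```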

References
- Prof. Andrew Moore's review tutorial
- Prof. Ke Chen's NB slides
- Prof. Carlos Guestrin's recitation slides