
Machine Learning Algorithm Heejun Kim June 12, 2018

Machine Learning Algorithms Machine Learning algorithm: a procedure for developing computer programs that improve their performance with experience. Examples of machine learning algorithms: Naïve Bayes, Linear Regression, Support Vector Machine, Decision Tree, Neural Network, Deep Learning, and so on. Learning all of these algorithms is beyond the scope of this summer boot camp. We will focus on Naïve Bayes as an example of Machine Learning.

Outline Basic Probability and Notation Bayes Law and Naive Bayes Classification Smoothing Class Prior Probabilities Naive Bayes Classification Summary

Crash Course in Basic Probability

Discrete Random Variable A is a discrete random variable if: A describes an event with a finite number of possible outcomes (discrete vs. continuous) A describes an event whose outcomes have some degree of uncertainty (random vs. predetermined)

Discrete Random Variables Examples A = the outcome of a coin-flip outcomes: heads, tails A = it will rain tomorrow outcomes: rain, no rain A = you have the flu outcomes: flu, no flu

Discrete Random Variables Examples A = the color of a ball pulled out from this bag outcomes: RED, BLUE, ORANGE

Probabilities Let P(A=X) denote the probability that the outcome of event A equals X For simplicity, we often express P(A=X) as P(X) Ex: P(RAIN), P(NO RAIN), P(FLU), P(NO FLU),...

Probability Distribution A probability distribution gives the probability of each possible outcome of a random variable P(RED) = probability of pulling out a red ball P(BLUE) = probability of pulling out a blue ball P(ORANGE) = probability of pulling out an orange ball

Probability Distribution For it to be a probability distribution, two conditions must be satisfied: the probability assigned to each possible outcome must be between 0 and 1 (inclusive) the sum of probabilities assigned to all outcomes must equal 1 0 P(RED) 1 0 P(BLUE) 1 0 P(ORANGE) 1 P(RED) + P(BLUE) + P(ORANGE) = 1

Probability Distribution Estimation Let's estimate these probabilities based on what we know about the contents of the bag P(RED) = ? P(BLUE) = ? P(ORANGE) = ?

Probability Distribution Estimation Let's estimate these probabilities based on what we know about the contents of the bag (10 red, 5 blue, and 5 orange balls, 20 in total): P(RED) = 10/20 = 0.5 P(BLUE) = 5/20 = 0.25 P(ORANGE) = 5/20 = 0.25 P(RED) + P(BLUE) + P(ORANGE) = 1.0

Probability Distribution assigning probabilities to outcomes Given a probability distribution, we can assign probabilities to different outcomes I reach into the bag and pull out an orange ball. What is the probability of that happening? P(RED) = 0.5 P(BLUE) = 0.25 P(ORANGE) = 0.25 I reach into the bag and pull out two balls: one red, one blue. What is the probability of that happening? What about three orange balls?

What can we do with a probability distribution? If we assume that each outcome is independent of previous outcomes, then the probability of a sequence of outcomes is calculated by multiplying the individual probabilities P(RED) = 0.5 P(BLUE) = 0.25 P(ORANGE) = 0.25 Note: we're assuming that when you take out a ball, you put it back in the bag before taking another

What can we do with a probability distribution? [Slides show example ball sequences drawn from the bag; the probability of each sequence is the product of the individual ball probabilities, e.g. 0.25 x 0.25 x 0.25, 0.25 x 0.5 x 0.25, and 0.25 x 0.5 x 0.25 x 0.50.] P(RED) = 0.5 P(BLUE) = 0.25 P(ORANGE) = 0.25
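To make the multiplication rule concrete, here is a minimal Python sketch (not part of the original slides) that computes the probability of a sequence of independent draws with replacement, using the bag distribution above:

from math import prod

p = {"RED": 0.5, "BLUE": 0.25, "ORANGE": 0.25}

def sequence_probability(colors):
    # Probability of drawing exactly this sequence of colors, with replacement,
    # so the draws are independent and the probabilities simply multiply.
    return prod(p[c] for c in colors)

print(sequence_probability(["ORANGE"]))                      # 0.25
print(sequence_probability(["BLUE", "RED", "BLUE"]))         # 0.25 * 0.5 * 0.25 = 0.03125
print(sequence_probability(["ORANGE", "ORANGE", "ORANGE"]))  # 0.25**3 = 0.015625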

Joint and Conditional Probability P(A, B): the probability that event A and event B both occur P(A|B): the probability of event A occurring given prior knowledge that event B occurred

Conditional Probability [Slide shows two bags: bag A with P(RED) = 0.50, P(BLUE) = 0.25, P(ORANGE) = 0.25, and bag B with P(RED) = 0.50, P(BLUE) = 0.00, P(ORANGE) = 0.50. The probability of a given draw or sequence of draws is conditioned on the bag; the examples shown as images give P(·|A) = 0.25, 0.5, 0.016 versus P(·|B) = 0.00, 0.25, 0.00.]

Chain Rule P(A, B) = P(A|B) x P(B) Example: the probability that it will rain today (B) and tomorrow (A) equals the probability that it will rain today (B) times the probability that it will rain tomorrow (A) given that it rains today (B)

Independence If A and B are independent: P(A, B) = P(A|B) x P(B) = P(A) x P(B) Example: the probability that it will rain today (B) and tomorrow (A) equals the probability that it will rain today (B) times the probability that it will rain tomorrow (A) given that it rains today (B), which under independence is just the probability that it will rain tomorrow (A)
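As a quick illustration of the chain rule and the independence test, here is a small Python sketch using a hypothetical joint distribution over rain today (B) and rain tomorrow (A); the numbers are made up for this example:

# Hypothetical joint distribution P(today, tomorrow)
p_joint = {("rain", "rain"): 0.30, ("rain", "dry"): 0.20,
           ("dry", "rain"): 0.15, ("dry", "dry"): 0.35}

p_B = p_joint[("rain", "rain")] + p_joint[("rain", "dry")]   # P(B) = P(rain today) = 0.50
p_A = p_joint[("rain", "rain")] + p_joint[("dry", "rain")]   # P(A) = P(rain tomorrow) = 0.45
p_A_given_B = p_joint[("rain", "rain")] / p_B                # P(A|B) = 0.60

# Chain rule always holds: P(A, B) = P(A|B) * P(B)
assert abs(p_joint[("rain", "rain")] - p_A_given_B * p_B) < 1e-12

# Independence would require P(A|B) = P(A); here 0.60 != 0.45,
# so in this made-up table the two events are dependent.
print(p_A_given_B, p_A)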

Independence [Slides compare P(ball|A) with P(ball): does knowing that the ball came from bag A change the probability of its color? In one example P(ball|A) > P(ball), so the color and the bag are not independent; in another P(ball|A) = P(ball), so they are independent.]

Outline Basic Probability and Notation Bayes Law and Naive Bayes Classification Smoothing Class Prior Probabilities Naive Bayes Classification Summary

Bayes Law P(A|B) = P(B|A) x P(A) / P(B)

Derivation of Bayes Law By the chain rule (always true): P(A|B) x P(B) = P(A, B) = P(B|A) x P(A) Divide both sides by P(B): P(A|B) = P(B|A) x P(A) / P(B)
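The identity can be checked numerically; below is a small Python sketch with a made-up joint distribution over two binary events A and B (not taken from the slides):

# Hypothetical joint distribution over (A, B)
p_joint = {(True, True): 0.2, (True, False): 0.3,
           (False, True): 0.1, (False, False): 0.4}

p_A = sum(p for (a, b), p in p_joint.items() if a)   # P(A)
p_B = sum(p for (a, b), p in p_joint.items() if b)   # P(B)
p_A_given_B = p_joint[(True, True)] / p_B            # P(A|B)
p_B_given_A = p_joint[(True, True)] / p_A            # P(B|A)

# Bayes law: P(A|B) = P(B|A) * P(A) / P(B)
assert abs(p_A_given_B - p_B_given_A * p_A / p_B) < 1e-12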

Naive Bayes Classification example: positive/negative movie reviews By Bayes rule, the confidence of a POS prediction given instance D is P(POS|D) = P(D|POS) x P(POS) / P(D), and the confidence of a NEG prediction given instance D is P(NEG|D) = P(D|NEG) x P(NEG) / P(D)

Naive Bayes Classification example: positive/negative movie reviews Given instance D, predict positive (POS) if P(POS|D) ≥ P(NEG|D), i.e. if P(D|POS) x P(POS) / P(D) ≥ P(D|NEG) x P(NEG) / P(D); otherwise, predict negative (NEG). Are the P(D) denominators necessary? No: P(D) appears on both sides, so the rule simplifies to: predict POS if P(D|POS) x P(POS) ≥ P(D|NEG) x P(NEG)

Naive Bayes Classification example: positive/negative movie reviews Our next goal is to estimate these parameters from the training data! P(NEG) = ?? P(POS) = ?? (Easy!) P(D|NEG) = ?? P(D|POS) = ?? (Not so easy!)

Naive Bayes Classification example: positive/negative movie reviews Our next goal is to estimate these parameters from the training data! P(NEG) = % of training set documents that are NEG P(POS) = % of training set documents that are POS P(D|NEG) = ?? P(D|POS) = ??
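Estimating the priors really is the easy part; here is a minimal Python sketch using a hypothetical list of training labels:

train_labels = ["positive", "negative", "positive", "positive", "negative"]

n = len(train_labels)
p_pos = sum(1 for y in train_labels if y == "positive") / n   # P(POS) = 0.6
p_neg = sum(1 for y in train_labels if y == "negative") / n   # P(NEG) = 0.4
print(p_pos, p_neg)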

Remember Conditional Probability? [Slide shows the two bags again: bag A with P(RED) = 0.50, P(BLUE) = 0.25, P(ORANGE) = 0.25, and bag B with P(RED) = 0.50, P(BLUE) = 0.00, P(ORANGE) = 0.50. Conditioning on the bag, the probabilities of the three colors are 0.25, 0.50, 0.25 given A and 0.00, 0.50, 0.50 given B.]

Naive Bayes Classification example: positive/negative movie reviews Training instances are split into positive training instances and negative training instances P(D|POS) = ?? P(D|NEG) = ??

Naive Bayes Classification example: positive/negative movie reviews

w_1  w_2  w_3  w_4  w_5  w_6  w_7  w_8  ...  w_n  sentiment
 1    0    1    0    1    0    0    1   ...   0   positive
 0    1    0    1    1    0    1    1   ...   0   positive
 0    1    0    1    1    0    1    0   ...   0   positive
 0    0    1    0    1    1    0    1   ...   1   positive
...
 1    1    0    1    1    0    0    1   ...   1   positive

Naive Bayes Classification example: positive/negative movie reviews We have a problem! What is it? Training instances are split into positive training instances and negative training instances P(D|POS) = ?? P(D|NEG) = ??

Naive Bayes Classification example: positive/negative movie reviews We have a problem! What is it? Assuming n binary features, the number of possible feature combinations is 2^n 2^1000 ≈ 1.071509 x 10^301 And in order to estimate the probability of each combination, we would require multiple occurrences of each combination in the training data! We could never have enough training data to reliably estimate P(D|NEG) or P(D|POS)! A document we need to classify at test time would most likely never have appeared in the training data

Naive Bayes Classification example: positive/negative movie reviews Assumption: given a particular class value (i.e., POS or NEG), the value of a particular feature is independent of the values of the other features In other words, the value of a particular feature depends only on the class value This is what "Naïve" means

Naive Bayes Classification example: positive/negative movie reviews

w_1  w_2  w_3  w_4  w_5  w_6  w_7  w_8  ...  w_n  sentiment
 1    0    1    0    1    0    0    1   ...   0   positive
 0    1    0    1    1    0    1    1   ...   0   positive
 0    1    0    1    1    0    1    0   ...   0   positive
 0    0    1    0    1    1    0    1   ...   1   positive
...
 1    1    0    1    1    0    0    1   ...   1   positive

Naive Bayes Classification example: positive/negative movie reviews Assumption: given a particular class value (i.e., POS or NEG), the value of a particular feature is independent of the value of other features Example: we have seven features and D = 1011011 P(1011011|POS) = P(w1=1|POS) x P(w2=0|POS) x P(w3=1|POS) x P(w4=1|POS) x P(w5=0|POS) x P(w6=1|POS) x P(w7=1|POS) P(1011011|NEG) = P(w1=1|NEG) x P(w2=0|NEG) x P(w3=1|NEG) x P(w4=1|NEG) x P(w5=0|NEG) x P(w6=1|NEG) x P(w7=1|NEG)
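A minimal Python sketch of this factorization, assuming hypothetical per-feature parameters P(w_i = 1 | class) (the numbers below are made up for illustration):

from math import prod

def naive_likelihood(doc_bits, p_w_given_class):
    # P(D | class) under the Naive Bayes assumption, for binary features:
    # use P(w_i = 1 | class) when the bit is 1, and 1 - P(w_i = 1 | class) when it is 0.
    return prod(p if bit == 1 else (1.0 - p)
                for bit, p in zip(doc_bits, p_w_given_class))

D = [1, 0, 1, 1, 0, 1, 1]                      # D = 1011011
p_pos = [0.6, 0.3, 0.7, 0.5, 0.2, 0.4, 0.4]    # hypothetical P(w_i = 1 | POS)
p_neg = [0.4, 0.5, 0.3, 0.5, 0.6, 0.2, 0.1]    # hypothetical P(w_i = 1 | NEG)

print(naive_likelihood(D, p_pos), naive_likelihood(D, p_neg))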

Naive Bayes Classification example: positive/negative movie reviews Question: How do we estimate P(w1=1|POS)?

Naive Bayes Classification example: positive/negative movie reviews

w_1  w_2  w_3  w_4  w_5  w_6  w_7  w_8  ...  w_n  sentiment
 1    0    1    0    1    0    0    1   ...   0   positive
 0    1    0    1    1    0    1    1   ...   0   negative
 0    1    0    1    1    0    1    0   ...   0   negative
 0    0    1    0    1    1    0    1   ...   1   positive
...
 1    1    0    1    1    0    0    1   ...   1   negative

Naive Bayes Classification example: positive/negative movie reviews Question: How do we estimate P(w1=1|POS)?

         POS   NEG
w1 = 1    a     b
w1 = 0    c     d

P(w1=1|POS) = ??

Naive Bayes Classification example: positive/negative movie reviews Question: How do we estimate P(w1=1|POS)?

         POS   NEG
w1 = 1    a     b
w1 = 0    c     d

P(w1=1|POS) = a / (a + c)

Naive Bayes Classification example: positive/negative movie reviews Question: How do we estimate P(w1=1/0 | POS/NEG)?

         POS   NEG
w1 = 1    a     b
w1 = 0    c     d

P(w1=1|POS) = a / (a + c)
P(w1=0|POS) = ??
P(w1=1|NEG) = ??
P(w1=0|NEG) = ??

Naive Bayes Classification example: positive/negative movie reviews Question: How do we estimate P(w2=1/0 | POS/NEG)?

         POS   NEG
w2 = 1    a     b
w2 = 0    c     d

P(w2=1|POS) = a / (a + c)
P(w2=0|POS) = c / (a + c)
P(w2=1|NEG) = b / (b + d)
P(w2=0|NEG) = d / (b + d)

The values of a, b, c, and d would be different for different features w1, w2, w3, w4, w5, ..., wn
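These counts are easy to compute by scanning the training data; here is a minimal Python sketch with a tiny hypothetical training set (each row is a feature vector plus a label):

train = [([1, 0, 1], "positive"),
         ([0, 1, 0], "positive"),
         ([1, 1, 0], "negative"),
         ([0, 0, 1], "negative")]

i = 0  # estimate the probabilities for feature w1
a = sum(1 for x, y in train if y == "positive" and x[i] == 1)  # w1 = 1 in POS documents
c = sum(1 for x, y in train if y == "positive" and x[i] == 0)  # w1 = 0 in POS documents
b = sum(1 for x, y in train if y == "negative" and x[i] == 1)  # w1 = 1 in NEG documents
d = sum(1 for x, y in train if y == "negative" and x[i] == 0)  # w1 = 0 in NEG documents

p_w1_1_pos = a / (a + c)   # P(w1 = 1 | POS)
p_w1_1_neg = b / (b + d)   # P(w1 = 1 | NEG)
print(p_w1_1_pos, p_w1_1_neg)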

Naive Bayes Classification example: positive/negative movie reviews Given instance D, predict positive (POS) if: P(D|POS) x P(POS) ≥ P(D|NEG) x P(NEG) Otherwise, predict negative (NEG)

Naive Bayes Classification example: positive/negative movie reviews Given instance D, predict positive (POS) if: P(1011011|POS) ≥ P(1011011|NEG) Otherwise, predict negative (NEG)

Naive Bayes Classification example: positive/negative movie reviews Given instance D = 1011011, predict positive (POS) if: P(w1=1|POS) x P(w2=0|POS) x P(w3=1|POS) x P(w4=1|POS) x P(w5=0|POS) x P(w6=1|POS) x P(w7=1|POS) x P(POS) ≥ P(w1=1|NEG) x P(w2=0|NEG) x P(w3=1|NEG) x P(w4=1|NEG) x P(w5=0|NEG) x P(w6=1|NEG) x P(w7=1|NEG) x P(NEG) Otherwise, predict negative (NEG)

Naive Bayes Classification example: positive/negative movie reviews We still have a problem! What is it?

Naive Bayes Classification example: positive/negative movie reviews Given instance D = 1011011, predict positive (POS) if: P(w1=1|POS) x P(w2=0|POS) x P(w3=1|POS) x P(w4=1|POS) x P(w5=0|POS) x P(w6=1|POS) x P(w7=1|POS) x P(POS) ≥ P(w1=1|NEG) x P(w2=0|NEG) x P(w3=1|NEG) x P(w4=1|NEG) x P(w5=0|NEG) x P(w6=1|NEG) x P(w7=1|NEG) x P(NEG) Otherwise, predict negative (NEG) What if one of these feature values never occurs in the training data?

Smoothing Probability Estimates When estimating probabilities, we tend to... Over-estimate the probability of observed outcomes Under-estimate the probability of unobserved outcomes The goal of smoothing is to... Decrease the probability of observed outcomes Increase the probability of unobserved outcomes It's usually a good idea You probably already know this concept!

Add-One Smoothing Question: How do we estimate P(w1=1/0 | POS/NEG)?

         POS   NEG
w1 = 1    a     b
w1 = 0    c     d

P(w1=1|POS) = a / (a + c)
P(w1=0|POS) = c / (a + c)
P(w1=1|NEG) = b / (b + d)
P(w1=0|NEG) = d / (b + d)

Add-One Smoothing Question: How do we estimate P(w1=1/0 | POS/NEG)?

         POS     NEG
w1 = 1   a + 1   b + 1
w1 = 0   c + 1   d + 1

P(w1=1|POS) = ??
P(w1=0|POS) = ??
P(w1=1|NEG) = ??
P(w1=0|NEG) = ??

Add-One Smoothing Question: How do we estimate P(w1=1/0 | POS/NEG)?

         POS     NEG
w1 = 1   a + 1   b + 1
w1 = 0   c + 1   d + 1

P(w1=1|POS) = (a + 1) / (a + c + 2)
P(w1=0|POS) = (c + 1) / (a + c + 2)
P(w1=1|NEG) = (b + 1) / (b + d + 2)
P(w1=0|NEG) = (d + 1) / (b + d + 2)
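In code, add-one smoothing is a one-line change to the estimates above; here is a minimal sketch with hypothetical counts (note b = 0, which would have produced a zero probability without smoothing):

def add_one(count_value, count_other):
    # Add 1 to each cell; the +2 in the denominator reflects the two possible values (0 and 1).
    return (count_value + 1) / (count_value + count_other + 2)

a, b, c, d = 3, 0, 2, 4   # hypothetical counts

p_w1_1_pos = add_one(a, c)   # (a + 1) / (a + c + 2) = 4/7
p_w1_1_neg = add_one(b, d)   # (b + 1) / (b + d + 2) = 1/6, no longer zero
print(p_w1_1_pos, p_w1_1_neg)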

Naive Bayes Classification how to emphasize important features? Given instance D = 1011011, predict positive (POS) if: P(w1=1|POS) x P(w2=0|POS) x P(w3=1|POS) x P(w4=1|POS) x P(w5=0|POS) x P(w6=1|POS) x P(w7=1|POS) x P(POS) ≥ P(w1=1|NEG) x P(w2=0|NEG) x P(w3=1|NEG) x P(w4=1|NEG) x P(w5=0|NEG) x P(w6=1|NEG) x P(w7=1|NEG) x P(NEG) Otherwise, predict negative (NEG) What if one of these words (say, w7) seems to be really important in the training data?

Naive Bayes Classification example: positive/negative movie reviews How can we emphasize important words (features)? Hint: it's very simple! Add the features redundantly!

Naive Bayes Classification how to emphasize important features? Given instance D = 1011011, predict positive (POS) if: P(w1=1|POS) x P(w2=0|POS) x P(w3=1|POS) x P(w4=1|POS) x P(w5=0|POS) x P(w6=1|POS) x P(w7=1|POS) x P(w7=1|POS) x P(w7=1|POS) x P(w7=1|POS) x P(w7=1|POS) x P(POS) ≥ P(w1=1|NEG) x P(w2=0|NEG) x P(w3=1|NEG) x P(w4=1|NEG) x P(w5=0|NEG) x P(w6=1|NEG) x P(w7=1|NEG) x P(w7=1|NEG) x P(w7=1|NEG) x P(w7=1|NEG) x P(w7=1|NEG) x P(NEG) What does this mean? The w7 factor is repeated five times.

Naive Bayes Classification how to emphasize important features? If P(w7=1|POS) = 0.4 and P(w7=1|NEG) = 0.1: P(w1=1|POS) x P(w2=0|POS) x P(w3=1|POS) x P(w4=1|POS) x P(w5=0|POS) x P(w6=1|POS) x P(w7=1|POS) x P(w7=1|POS) x P(w7=1|POS) x P(w7=1|POS) x P(w7=1|POS) x P(POS), where the five w7 factors contribute 0.4^5 = 0.01024 P(w1=1|NEG) x P(w2=0|NEG) x P(w3=1|NEG) x P(w4=1|NEG) x P(w5=0|NEG) x P(w6=1|NEG) x P(w7=1|NEG) x P(w7=1|NEG) x P(w7=1|NEG) x P(w7=1|NEG) x P(w7=1|NEG) x P(NEG), where the five w7 factors contribute 0.1^5 = 0.00001 P(w7=1|POS) used to be 4 times bigger than P(w7=1|NEG), but it becomes about one thousand times bigger (4^5 = 1024) when used five times.
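A quick arithmetic check of that claim in Python:

p_pos, p_neg = 0.4, 0.1
print(p_pos / p_neg)        # 4.0: a 4x ratio for a single occurrence of the factor
print(p_pos**5, p_neg**5)   # 0.01024 and 1e-05 (up to floating-point rounding)
print(p_pos**5 / p_neg**5)  # about 1024: repeating the factor five times gives ~1000x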

Naive Bayes Classification example: positive/negative movie reviews Given instance D, predict positive (POS) if: P(D|POS) x P(POS) ≥ P(D|NEG) x P(NEG) Otherwise, predict negative (NEG)

Naive Bayes Classification Naive Bayes Classifiers are simple, effective, robust, and very popular Simplicity-first Assumes that feature values are conditionally independent given the target class value This assumption does not hold in natural language Even so, NB classifiers are very powerful They work very well, particularly when combined with attribute (feature) selection procedures Smoothing is necessary in order to avoid zero probabilities Important terms can be emphasized by using them redundantly
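To tie the pieces together, here is a self-contained Python sketch of a tiny binary-feature Naive Bayes classifier with add-one smoothing; the data and numbers are hypothetical, and the code is only an illustration of the procedure described in these slides:

from math import prod

def train_nb(X, y):
    classes = sorted(set(y))
    n_features = len(X[0])
    # Class priors: fraction of training documents with each label
    prior = {c: sum(1 for label in y if label == c) / len(y) for c in classes}
    # cond[c][i] = P(w_i = 1 | c), with add-one smoothing
    cond = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        cond[c] = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                   for i in range(n_features)]
    return prior, cond

def predict_nb(x, prior, cond):
    def score(c):
        # P(c) times the product over features of P(w_i = x_i | c)
        return prior[c] * prod(p if bit == 1 else 1 - p
                               for bit, p in zip(x, cond[c]))
    return max(cond, key=score)

X = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 1, 0]]
y = ["positive", "positive", "positive", "negative", "negative"]
prior, cond = train_nb(X, y)
print(predict_nb([1, 0, 1], prior, cond))   # "positive" on this toy data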

Any Questions?

Next Class: Instance-Based Learning