Probability Review and Naive Bayes. Mladen Kolar and Rob McCulloch
1 Probability Review and Naive Bayes Mladen Kolar and Rob McCulloch
2 1. Probability and Graphical Models 2. Quick Probability Review 3. The Sinus Example 4. Bayesian Networks 5. Bayes Rule Classification 6. Naive Bayes Classification 7. Sentiment Analysis: Spam or Ham
3 1. Probability and Graphical Models Joint probability distributions help us understand uncertain relationships between many variables. Let X = (X1, X2, ..., Xp) be a vector of random variables. Let x = (x1, x2, ..., xp) be an outcome. If x is continuous we write P(X ∈ A) = ∫_A p(x) dx. If x is discrete we write P(X ∈ A) = Σ_{x ∈ A} p(x). 1
4 Figuring out p(x) and understanding it in high dimensions is a very difficult and very important problem. As usual, to make things manageable we look for simplifying structure. The simplest structure is independence: p(x1, x2, ..., xp) = p(x1) p(x2) ... p(xp) = ∏_i p(xi). 2
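The factorization above is easy to check numerically. Here is a small Python sketch (the joint distribution and its values are made up for illustration) that tests whether a joint table equals the product of its marginals:

```python
from itertools import product

# A hypothetical joint distribution over two binary variables,
# constructed to satisfy p(x1, x2) = p(x1) * p(x2).
p1 = {0: 0.3, 1: 0.7}
p2 = {0: 0.6, 1: 0.4}
joint = {(a, b): p1[a] * p2[b] for a, b in product(p1, p2)}

def is_independent(joint, tol=1e-12):
    """Check p(x1, x2) == p(x1) * p(x2) for every outcome."""
    # Marginalize the joint to recover p(x1) and p(x2).
    m1, m2 = {}, {}
    for (a, b), p in joint.items():
        m1[a] = m1.get(a, 0.0) + p
        m2[b] = m2.get(b, 0.0) + p
    return all(abs(joint[(a, b)] - m1[a] * m2[b]) < tol for (a, b) in joint)

print(is_independent(joint))  # True for this product-form joint
```

A joint like {(0,0): 0.5, (1,1): 0.5} fails the same check, since learning x1 pins down x2.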
5 But complete independence is completely boring. It rules out relationships amongst the x's!!!! Instead we will look for conditional independence. A pair of variables may not be independent, but they may be independent conditional on knowledge of a third!! 3
6 In ML/DS, "graphical models" refers to a nice way to think about conditional independence structures in joint distributions. Generally, graphical models capture kinds of conditional independence. We will quickly review basic probability notions, using some of the language associated with graphical models. Later in the course we may study graphical models more systematically. 4
7 2. Quick Probability Review Note: we will use notation appropriate for discrete random variables: p(x1, x2) = P(X1 = x1, X2 = x2), and so on. Two Variables. Suppose we have the joint distribution p(x1, x2). Then we can compute the marginal p(x1) (or p(x2)) and the conditional p(x2 | x1) = p(x1, x2) / p(x1) (or p(x1 | x2)). 5
8 We can always get p(x1, x2) by first getting x1 and then getting x2 conditional on x1: p(x1, x2) = p(x1) p(x2 | x1). X1 and X2 are independent if p(x2 | x1) = p(x2) for all (x1, x2), so that p(x1, x2) = p(x1) p(x2). X1 and X2 are independent if no matter what you learn about X1 your beliefs about X2 are the same as if you knew nothing about X1. 6
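The marginal and conditional computations above can be sketched in a few lines of Python. The joint table here is hypothetical (made-up numbers), just to make the formulas concrete:

```python
# A small hypothetical joint p(x1, x2) over binary variables (made-up numbers).
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def marginal_x1(joint):
    """p(x1) = sum over x2 of p(x1, x2)."""
    m = {}
    for (a, b), p in joint.items():
        m[a] = m.get(a, 0.0) + p
    return m

def conditional_x2_given_x1(joint):
    """p(x2 | x1) = p(x1, x2) / p(x1), keyed as (x2, x1)."""
    m = marginal_x1(joint)
    return {(b, a): p / m[a] for (a, b), p in joint.items()}

m = marginal_x1(joint)              # p(x1=0) = 0.4, p(x1=1) = 0.6
c = conditional_x2_given_x1(joint)  # e.g. p(x2=1 | x1=0) = 0.3 / 0.4 = 0.75
```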
9 Three Variables. Have p(x1, x2, x3) = P(X1 = x1, X2 = x2, X3 = x3). Then, p(x3 | x1, x2) = p(x1, x2, x3) / p(x1, x2) and p(x3, x2 | x1) = p(x1, x2, x3) / p(x1). 7
10 Note: p(x3 | x1, x2) ∝ p(x1, x2, x3) and p(x3, x2 | x1) ∝ p(x1, x2, x3). The conditional is proportional to the joint. We always have: p(x1, x2, x3) = p(x1) p(x2, x3 | x1) = p(x1) p(x2 | x1) p(x3 | x1, x2). 8
11 Conditional Independence. X3 is independent of X2 conditional on X1 if p(x3 | x2, x1) = p(x3 | x1) for all (x1, x2, x3). Given you know X1 = x1, X2 tells you nothing about X3. We write X2 ⊥ X3 | X1. 9
12 If X2 ⊥ X3 | X1 then, p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1, x2) = p(x1) p(x2 | x1) p(x3 | x1). It is this kind of simplification of the joint structure that graphical models are meant to capture. 10
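This simplification can be verified directly. The Python sketch below builds a joint from a hypothetical factorization p(x1) p(x2 | x1) p(x3 | x1) (all numbers made up) and then checks p(x3 | x1, x2) = p(x3 | x1) by brute force:

```python
from itertools import product

# Hypothetical CPTs; by construction X2 ⊥ X3 | X1.
p1 = {0: 0.5, 1: 0.5}
p2_given_1 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # keys: (x2, x1)
p3_given_1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.6}  # keys: (x3, x1)

joint = {(a, b, c): p1[a] * p2_given_1[(b, a)] * p3_given_1[(c, a)]
         for a, b, c in product([0, 1], repeat=3)}

def ci_given_x1(joint, tol=1e-12):
    """Check p(x3 | x1, x2) == p(x3 | x1) for every outcome."""
    for a, b, c in joint:
        p_x1x2 = sum(joint[(a, b, cc)] for cc in [0, 1])
        p_x1 = sum(joint[(a, bb, cc)] for bb in [0, 1] for cc in [0, 1])
        p_x1x3 = sum(joint[(a, bb, c)] for bb in [0, 1])
        if abs(joint[(a, b, c)] / p_x1x2 - p_x1x3 / p_x1) > tol:
            return False
    return True

print(ci_given_x1(joint))  # True by construction
```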
13 If X2 ⊥ X3 | X1 then we could have, p(x1, x2, x3) = p(x1) p(x3 | x1) p(x2 | x1). This is a directed graphical model. Each variable is a node in the graph. The arrows pointing into a node tell you which (of the previous) variables it depends on. 11
14 If X1 ⊥ X3 | X2 then we could have, p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2). 12
15 Complete independence: Complete dependence: 13
16 Our graphs are DAGs: Directed Acyclic Graphs. Directed: The connections are arrows that point from one node (variable) to another. Acyclic: You can't start at a node and follow the arrows back to the same node. 14
17 Several Variables: Our definitions extend to many variables easily by considering groups. Suppose we have (X1, X2, ..., Xp). Let G = {i1, i2, ..., iq} ⊆ {1, 2, ..., p} and then X_G = {Xi : i ∈ G}. We can say one group of variables is independent of another group conditional on a third group: X_G3 ⊥ X_G2 | X_G1. 15
18 Conditional independence can be a very powerful way to capture the assumptions in a model. For example in the model: Y = Xβ + ε, ε ~ N(0, σ²), ε ⊥ X, where X is considered a random vector, we have: Y ⊥ X | Xβ. Given you know Xβ, X has no more information about Y. 16
19 3. The Sinus Example Suppose we know the following: The flu causes sinus inflammation. Allergies cause sinus inflammation. Sinus inflammation causes a runny nose. Sinus inflammation causes headaches. 17
20 The flu causes sinus inflammation. Allergies cause sinus inflammation. Sinus inflammation causes a runny nose. Sinus inflammation causes headaches. Note: In this example we explicitly interpret the relationships as causal. More generally they may just be observational. 18
21 Let F = 1 if a person has the flu, 0 else, A = 1 if a person has allergies, 0 else, S = 1 if a person has sinus inflammation, 0 else, H = 1 if a person has a headache, 0 else, N = 1 if a person has a runny nose, 0 else. Then, p(f, a, s, h, n) = p(f) p(a) p(s | f, a) p(h | s) p(n | s). 19
22 The absence of edges contains information!!! p(f, a, s, h, n) = p(f) p(a) p(s | f, a) p(h | s) p(n | s). So, for example, (H, N) ⊥ (F, A) | S. F affects N, but only through S. 20
23 4. Bayesian Networks A Bayesian Network is a DAG: a Directed Acyclic Graphical representation of the joint distribution of X = (X1, X2, ..., Xp). Given a specific ordering of the variables, let P_i be a set of variables such that p(xi | x1, x2, ..., x(i-1)) = p(xi | P_i), where P_i are the parents of Xi, P_i ⊆ {x1, x2, ..., x(i-1)}. Then, p(x1, x2, ..., xp) = ∏_i p(xi | P_i). Local Markov Assumption: A variable X is independent of its non-descendants given its parents (only the parents). 21
24 Bayes Nets: Algorithms for fitting DAG structures to data are an important part of Machine Learning/Data Science/Artificial Intelligence. Also important are algorithms for computing conditionals and marginals. In our sinus example we might want to know p(A = 1 | N = 1). 22
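For a network this small, p(A = 1 | N = 1) can be computed by brute-force enumeration over the joint factorization p(f) p(a) p(s | f, a) p(h | s) p(n | s). The CPT numbers below are entirely made up for illustration; the slides do not give them:

```python
from itertools import product

# Hypothetical CPTs for the sinus network (all numbers invented for illustration).
pF = {1: 0.10, 0: 0.90}                                   # p(F = f)
pA = {1: 0.20, 0: 0.80}                                   # p(A = a)
pS = {(1, 1): 0.90, (1, 0): 0.60, (0, 1): 0.70, (0, 0): 0.05}  # p(S=1 | f, a)
pH = {1: 0.80, 0: 0.10}                                   # p(H=1 | s)
pN = {1: 0.70, 0: 0.05}                                   # p(N=1 | s)

def joint(f, a, s, h, n):
    """p(f, a, s, h, n) = p(f) p(a) p(s | f, a) p(h | s) p(n | s)."""
    ps = pS[(f, a)] if s == 1 else 1 - pS[(f, a)]
    ph = pH[s] if h == 1 else 1 - pH[s]
    pn = pN[s] if n == 1 else 1 - pN[s]
    return pF[f] * pA[a] * ps * ph * pn

# p(A = 1 | N = 1): sum out the unobserved variables, then normalize.
num = sum(joint(f, 1, s, h, 1) for f, s, h in product([0, 1], repeat=3))
den = sum(joint(f, a, s, h, 1) for f, a, s, h in product([0, 1], repeat=4))
print(num / den)  # posterior probability of allergies given a runny nose
```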
25 An important example of a Machine Learning technique which exploits a certain conditional independence structure is the Naive Bayes Classifier. We will study NB below. 23
26 5. Bayes Rule Classification Consider the directed data mining problem with a categorical Y. Many methods can be viewed as an attempt to estimate: p(y | x), the conditional distribution of Y given X = x. For example in logistic regression we have: Y | x ~ Bernoulli(p(x)), with logit(p(x)) = xβ. 24
27 An alternative approach is to estimate the full joint distribution of (X, Y) by estimating the marginal for Y (p(y)) and the conditional for X (p(x | y)). We then have the joint via: p(x, y) = p(y) p(x | y). And classification is then obtained from Bayes Theorem: p(y | x) = p(y) p(x | y) / p(x) ∝ p(y) p(x | y). As usual we can predict the most probable y or make a decision based on the probabilities. 25
28 In general, it seems more straightforward to estimate p(y | x) directly, but some well known approaches take the Bayes-rule path. For example, for x numeric, classic discriminant analysis assumes p(x | y) ~ N(μ_y, Σ). Remember, Y is discrete and we have to estimate p(x | y) for each possible y. 26
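A one-dimensional Python sketch of this Bayes-rule path, with Gaussian class-conditionals and made-up parameters (class names and numbers are purely illustrative):

```python
import math

# Assume p(x | y) = N(mu_y, sigma^2) with shared variance (hypothetical values).
params = {"A": (0.0, 1.0), "B": (2.0, 1.0)}   # class -> (mu, sigma)
prior = {"A": 0.5, "B": 0.5}                  # p(y)

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def classify(x):
    # argmax_y p(y) p(x | y): Bayes rule up to the normalizing constant p(x).
    return max(prior, key=lambda y: prior[y] * normal_pdf(x, *params[y]))

print(classify(-0.5), classify(2.4))  # A B  (each x lands nearer one class mean)
```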
29 6. Naive Bayes Classification Naive Bayes classification simplifies the problem by assuming that the elements of X are conditionally independent given Y: p(x, y) = p(y) ∏_i p(xi | y), so that p(y | x) ∝ p(y) ∏_i p(xi | y). Each coordinate xi of x gets to multiply in its own contribution of evidence about y depending on how likely xi would be if Y = y. Given x, a simple approach to classification just guesses the y having the largest p(y | x). 27
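The whole method fits in a short Python sketch. The tiny labeled dataset below is made up, and the Laplace smoothing shown here anticipates the laplace=1 option used later in the slides:

```python
from collections import Counter, defaultdict

# Toy made-up data: binary feature vectors x and class labels y.
X = [(1, 0, 1), (1, 1, 1), (0, 0, 0), (0, 1, 0), (1, 0, 0)]
y = ["spam", "spam", "ham", "ham", "ham"]

def fit(X, y, laplace=1.0):
    """Estimate p(y) and each p(x_i = 1 | y) with Laplace smoothing."""
    prior = {c: n / len(y) for c, n in Counter(y).items()}
    cond = defaultdict(dict)  # cond[c][i] = p(x_i = 1 | y = c)
    for c in prior:
        rows = [x for x, lab in zip(X, y) if lab == c]
        for i in range(len(X[0])):
            ones = sum(r[i] for r in rows)
            cond[c][i] = (ones + laplace) / (len(rows) + 2 * laplace)
    return prior, cond

def predict(x, prior, cond):
    """argmax_y p(y) * prod_i p(x_i | y)."""
    def score(c):
        p = prior[c]
        for i, v in enumerate(x):
            p *= cond[c][i] if v == 1 else 1 - cond[c][i]
        return p
    return max(prior, key=score)

prior, cond = fit(X, y)
print(predict((1, 0, 1), prior, cond))  # "spam" with this toy data
```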
31 NB has some key advantages: We only have to estimate the low-dimensional p(xi | y) instead of the high-dimensional p(x | y)!!!! Many small bits of information from each xi can be combined. It is simple. The main disadvantage is that the conditional independence assumption often seems inappropriate. However, this does not seem to keep NB from working very well in practice!!! According to Mladen, NB is the single most used classifier out there. NB often performs well, even when the assumption is violated. 29
32 7. Sentiment Analysis: Spam or Ham Sentiment analysis tries to classify text documents. A popular approach is to combine bag of words with NB. Each word in the document provides an additional independent piece of evidence about the kind of document it is. A simple example is trying to classify the document as spam or not: spam or ham. 30
33 Bag of words means just that, we ignore the order of the words. The document: When the lecture is over, remember to wake up the person sitting next to you in the lecture room. is the same as the document, in is lecture lecture next over person remember room sitting the the the to to up wake when you 31
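The "same document" claim above is easy to check: a bag of words is just a multiset of tokens. A small Python sketch (the tokenizer here is a simple illustrative one, not the tm pipeline from the slides):

```python
from collections import Counter

def bag_of_words(doc):
    """Lowercase, replace punctuation with spaces, and count tokens, ignoring order."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in doc.lower())
    return Counter(cleaned.split())

d1 = ("When the lecture is over, remember to wake up the person "
      "sitting next to you in the lecture room.")
d2 = ("in is lecture lecture next over person remember room sitting "
      "the the the to to up wake when you")

print(bag_of_words(d1) == bag_of_words(d2))  # True: same bag, different order
```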
34 Sentiment Analysis has been used very creatively: Classify a bunch of discussion posts on the economy at time t and then predict whether the stock market will go up at time t + 1 given how many are classified as positive. Classify a bunch of discussion posts as favorable to candidate A or candidate B and then predict how they will do in the next poll or the election. 32
35 SMS Spam Data: Note: this follows Chapter 4 of Machine Learning with R, by Brett Lantz. Note: sms: short message service. Have 5,559 sms text message documents. Each one is labelled as spam or ham. Here is the first (ham) and fourth (spam) observation:
> smsraw[1,]
  type text
1 ham  Hope you are having a good week. Just checking in
> smsraw[4,]
  type
4 spam 4 complimentary 4 STAR Ibiza Holiday or 10,000 cash needs your URGENT collection NOW from Landline not to lose out! Box434SK38WP150PPM18+
33
36 Work flow:
- clean: tolower, kill numbers, punctuation, stopwords
- stem: (help, helped, helping, helps) becomes (help, help, help, help)
- tokenization: split a document up into single words (or "tokens" or "terms")
- document term matrix (DTM): rows indicate documents, columns are counts for terms
- train/test split
- throw away low count terms
- convert DTM to indicators: "Yes" if the word (term) is in the document, "No" else
- do Naive Bayes!!
34
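The clean/tokenize/DTM steps above can be sketched in plain Python (the slides use the tm package in R; the documents, stoplist, and variable names below are made up for illustration):

```python
# Two toy documents: the first is the ham message from the slides,
# the second is an invented spam-like message.
docs = ["Hope you are having a good week. Just checking in",
        "WINNER!! Claim your FREE prize now"]

stopwords = {"you", "are", "a", "in", "your"}   # tiny illustrative stoplist

def clean_tokens(doc):
    """Lowercase, drop non-letters, split into tokens, remove stopwords."""
    text = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in doc.lower())
    return [t for t in text.split() if t not in stopwords]

# Document-term matrix: one row per document, one count column per term.
tokenized = [clean_tokens(d) for d in docs]
terms = sorted(set(t for toks in tokenized for t in toks))
dtm = [[toks.count(t) for t in terms] for toks in tokenized]
print(len(terms))  # vocabulary size for this toy corpus
```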
37 Note: Most of the work is processing the data!!!!! This is typically the case in real world applications. Getting the data into a form that allows you to analyze it is time consuming and very important. Garbage in, garbage out!!! In class we will typically focus on getting a basic understanding of the methods and not emphasize the data wrangling. 35
38 Clean and Stem. Here are the first two documents:
> smsraw$text[1]
[1] "Hope you are having a good week. Just checking in"
> smsraw$text[2]
[1] "K..give back my thanks."
Here are the first 2 docs after cleaning. smscc is the cleaned data in the Corpus data structure from the tm R package. smscc is for sms data as a Cleaned Corpus.
> smscc[[1]][1]
$content
[1] "hope good week just check"
> smscc[[2]][1]
$content
[1] "kgive back thank"
36
39 Tokenize and get DTM. Tokenization gives us 6518 words (or terms) from all the sms documents. The i-th row of the DTM gives us the count for each term in document i.
> print(dim(smsdtm))
[1] 5559 6518
> library(slam) # for col_sums
> summary(col_sums(smsdtm)) # summarize total times a term is used
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> terms = smsdtm$dimnames$Terms
> nterm = length(terms)
> set.seed(14)
> ii = sample(1:nterm, 20)
> terms[ii]
 [1] "effect"       "pinku"        "wikipediacom" "mundh"        "wwwsmsconet"
 [6] "marsm"        "voic"         "itz"          "logo"         "hip"
[11] "transfr"      "colin"        "leo"          "technolog"    "text"
[16] "scratch"      "graze"        "prolli"       "tech"         "ofsi"
37
40 Word frequencies for ham: Word frequencies for spam: 38
41 Train/Test
# creating training and test datasets
smstrain = smsdtm[1:4169, ]
smstest = smsdtm[4170:5559, ]
# also save the labels
smstrainy = smsraw[1:4169, ]$type
smstesty = smsraw[4170:5559, ]$type
> prop.table(table(smstrainy))
smstrainy
 ham spam
> prop.table(table(smstesty))
smstesty
 ham spam
42 Throw Away Terms with Low Frequency
# save frequently-appearing terms to a character vector
smsfreqwords = findFreqTerms(smstrain, 5)
> str(smsfreqwords)
 chr [1:1136] "abiola" "abl" "abt" "accept" "access" "account" ...
> length(smsfreqwords)
[1] 1136
# create DTMs with only the frequent terms
smsfreqtrain = smstrain[, smsfreqwords]
smsfreqtest = smstest[, smsfreqwords]
> dim(smsfreqtrain)
[1] 4169 1136
> dim(smstest)
[1] 1390 6518
40
43 Convert Counts to Indicators
# convert counts to if(count>0) (yes, no)
convertcounts <- function(x) {
  x <- ifelse(x > 0, "Yes", "No")
}
# apply() convertcounts() to columns of train/test data
# these are just matrices
smstrain = apply(smsfreqtrain, MARGIN = 2, convertcounts)
smstest <- apply(smsfreqtest, MARGIN = 2, convertcounts)
> dim(smstrain)
[1] 4169 1136
> smstrain[1:3,1:5]
    Terms
Docs abiola abl  abt  accept access
   1 "No"   "No" "No" "No"   "No"
   2 "No"   "No" "No" "No"   "No"
   3 "No"   "No" "No" "No"   "No"
41
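The same count-to-indicator conversion is a one-liner in Python (a sketch of what the R convertcounts() above does, on a made-up toy matrix):

```python
def convert_counts(row):
    """Map each term count to "Yes" (present) or "No" (absent)."""
    return ["Yes" if c > 0 else "No" for c in row]

# Toy document-term matrix: two documents, three terms.
dtm = [[0, 2, 1], [1, 0, 0]]
print([convert_counts(r) for r in dtm])
# [['No', 'Yes', 'Yes'], ['Yes', 'No', 'No']]
```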
44 We are ready for NB!!!
library(e1071)
smsnb = naiveBayes(smstrain, smstrainy)
> smsnb$tables[1:3]
$abiola
         abiola
smstrainy No Yes
     ham
     spam
$abl
         abl
smstrainy No Yes
     ham
     spam
$abt
         abt
smstrainy No Yes
     ham
     spam
The tables are our p(xi | y) terms!! y is ham or spam and the xi are the words (terms): abiola, abl, abt, ... That is, p(abiola = Yes | y = ham) is the "Yes" entry in the ham row of the abiola table. 42
45
$age
         age
smstrainy No Yes
     ham
     spam
$adult
         adult
smstrainy No Yes
     ham
     spam
What is p(y = spam | age = Yes)? (The prob the sms is spam given the word age is in it). Let's use p(y = spam) = .14, the training data proportion.
p(y = spam | age = Yes) = p(y=spam) p(age=Yes | y=spam) / [ p(y=spam) p(age=Yes | y=spam) + p(y=ham) p(age=Yes | y=ham) ]
> .14* /(.14* * )
[1]
43
46
$age
         age
smstrainy No Yes
     ham
     spam
$adult
         adult
smstrainy No Yes
     ham
     spam
What is p(y = spam | age = Yes, adult = Yes)? (The prob the sms is spam given the word age is in it and the word adult is in it).
p(y = spam | age = Yes, adult = Yes) = p(spam) p(age=Y | spam) p(adult=Y | spam) / [ p(spam) p(age=Y | spam) p(adult=Y | spam) + p(ham) p(age=Y | ham) p(adult=Y | ham) ]
> .14* * /(.14* * * * )
[1]
44
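The two-word posterior above can be written as a small Python function. The fitted conditional probabilities are left as named inputs, since the slide's actual numbers come from the estimated tables; the values passed below are made up purely to exercise the formula:

```python
def posterior_spam(p_spam, p_age_spam, p_adult_spam, p_age_ham, p_adult_ham):
    """p(spam | age=Yes, adult=Yes) under the Naive Bayes factorization."""
    num = p_spam * p_age_spam * p_adult_spam
    den = num + (1 - p_spam) * p_age_ham * p_adult_ham
    return num / den

# Illustrative made-up values: both words far more likely in spam than ham,
# so the posterior is pushed close to 1 despite the .14 prior.
print(posterior_spam(0.14, 0.05, 0.04, 0.001, 0.0005))
```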
47 Out of Sample Confusion Matrix
yhat = predict(smsnb, smstest)
library(gmodels)
CrossTable(yhat, smstesty,
  prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
  dnn = c("predicted", "actual"))
             actual
predicted    ham  spam  Row Total
  ham
  spam
Column Total
Misclassification rate: 36/1390 ≈ .026. % spam detected: .836. Not bad!! 45
48 Try it again with laplace=1: add 1 to each count when estimating the 2x2 tables.
> smsnb2 = naiveBayes(smstrain, smstrainy, laplace=1)
> smsnb2$tables[1:3]
$abiola
         abiola
smstrainy No Yes
     ham
     spam
$abl
         abl
smstrainy No Yes
     ham
     spam
$abt
         abt
smstrainy No Yes
     ham
     spam
Got rid of the 0/1 conditional probabilities, which are extreme. 46
49
> yhat2 = predict(smsnb2, smstest)
> CrossTable(yhat2, smstesty,
+   prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
+   dnn = c("predicted", "actual"))
             actual
predicted    ham  spam  Row Total
  ham
  spam
Column Total
Not too different. 47
More informationMaxent Models and Discriminative Estimation
Maxent Models and Discriminative Estimation Generative vs. Discriminative models (Reading: J+M Ch6) Introduction So far we ve looked at generative models Language models, Naive Bayes But there is now much
More informationMachine Learning. Hal Daumé III. Computer Science University of Maryland CS 421: Introduction to Artificial Intelligence 8 May 2012
Machine Learning Hal Daumé III Computer Science University of Maryland me@hal3.name CS 421 Introduction to Artificial Intelligence 8 May 2012 g 1 Many slides courtesy of Dan Klein, Stuart Russell, or Andrew
More informationIntroduction to Machine Learning
Uncertainty Introduction to Machine Learning CS4375 --- Fall 2018 a Bayesian Learning Reading: Sections 13.1-13.6, 20.1-20.2, R&N Sections 6.1-6.3, 6.7, 6.9, Mitchell Most real-world problems deal with
More informationGenerative Classifiers: Part 1. CSC411/2515: Machine Learning and Data Mining, Winter 2018 Michael Guerzhoy and Lisa Zhang
Generative Classifiers: Part 1 CSC411/2515: Machine Learning and Data Mining, Winter 2018 Michael Guerzhoy and Lisa Zhang 1 This Week Discriminative vs Generative Models Simple Model: Does the patient
More information1. what conditional independencies are implied by the graph. 2. whether these independecies correspond to the probability distribution
NETWORK ANALYSIS Lourens Waldorp PROBABILITY AND GRAPHS The objective is to obtain a correspondence between the intuitive pictures (graphs) of variables of interest and the probability distributions of
More informationPattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes
Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lesson 1 5 October 2016 Learning and Evaluation of Pattern Recognition Processes Outline Notation...2 1. The
More informationMachine learning: lecture 20. Tommi S. Jaakkola MIT CSAIL
Machine learning: lecture 20 ommi. Jaakkola MI CAI tommi@csail.mit.edu opics Representation and graphical models examples Bayesian networks examples, specification graphs and independence associated distribution
More informationMachine Learning, Fall 2009: Midterm
10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all
More informationBayesian Networks. CompSci 270 Duke University Ron Parr. Why Joint Distributions are Important
Bayesian Networks CompSci 270 Duke University Ron Parr Why Joint Distributions are Important Joint distributions gives P(X 1 X n ) Classification/Diagnosis Suppose X1=disease X2 Xn = symptoms Co-occurrence
More informationProbabilistic Graphical Models
2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector
More information1 Probabilities. 1.1 Basics 1 PROBABILITIES
1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability
More informationMore Smoothing, Tuning, and Evaluation
More Smoothing, Tuning, and Evaluation Nathan Schneider (slides adapted from Henry Thompson, Alex Lascarides, Chris Dyer, Noah Smith, et al.) ENLP 21 September 2016 1 Review: 2 Naïve Bayes Classifier w
More informationIntroduction to AI Learning Bayesian networks. Vibhav Gogate
Introduction to AI Learning Bayesian networks Vibhav Gogate Inductive Learning in a nutshell Given: Data Examples of a function (X, F(X)) Predict function F(X) for new examples X Discrete F(X): Classification
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationDimensionality reduction
Dimensionality Reduction PCA continued Machine Learning CSE446 Carlos Guestrin University of Washington May 22, 2013 Carlos Guestrin 2005-2013 1 Dimensionality reduction n Input data may have thousands
More informationClassification and Regression Trees
Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationIntelligent Systems: Reasoning and Recognition. Reasoning with Bayesian Networks
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2016/2017 Lesson 13 24 march 2017 Reasoning with Bayesian Networks Naïve Bayesian Systems...2 Example
More informationClassification: Naïve Bayes. Nathan Schneider (slides adapted from Chris Dyer, Noah Smith, et al.) ENLP 19 September 2016
Classification: Naïve Bayes Nathan Schneider (slides adapted from Chris Dyer, Noah Smith, et al.) ENLP 19 September 2016 1 Sentiment Analysis Recall the task: Filled with horrific dialogue, laughable characters,
More informationLearning P-maps Param. Learning
Readings: K&F: 3.3, 3.4, 16.1, 16.2, 16.3, 16.4 Learning P-maps Param. Learning Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 24 th, 2008 10-708 Carlos Guestrin 2006-2008
More informationCS 188: Artificial Intelligence Fall 2008
CS 188: Artificial Intelligence Fall 2008 Lecture 23: Perceptrons 11/20/2008 Dan Klein UC Berkeley 1 General Naïve Bayes A general naive Bayes model: C E 1 E 2 E n We only specify how each feature depends
More informationGeneral Naïve Bayes. CS 188: Artificial Intelligence Fall Example: Overfitting. Example: OCR. Example: Spam Filtering. Example: Spam Filtering
CS 188: Artificial Intelligence Fall 2008 General Naïve Bayes A general naive Bayes model: C Lecture 23: Perceptrons 11/20/2008 E 1 E 2 E n Dan Klein UC Berkeley We only specify how each feature depends
More informationCS446: Machine Learning Fall Final Exam. December 6 th, 2016
CS446: Machine Learning Fall 2016 Final Exam December 6 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains
More informationCHAPTER-17. Decision Tree Induction
CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released
More informationCS Lecture 3. More Bayesian Networks
CS 6347 Lecture 3 More Bayesian Networks Recap Last time: Complexity challenges Representing distributions Computing probabilities/doing inference Introduction to Bayesian networks Today: D-separation,
More informationCOS 424: Interacting with Data. Lecturer: Dave Blei Lecture #11 Scribe: Andrew Ferguson March 13, 2007
COS 424: Interacting with ata Lecturer: ave Blei Lecture #11 Scribe: Andrew Ferguson March 13, 2007 1 Graphical Models Wrap-up We began the lecture with some final words on graphical models. Choosing a
More informationLecture 6: Graphical Models
Lecture 6: Graphical Models Kai-Wei Chang CS @ Uniersity of Virginia kw@kwchang.net Some slides are adapted from Viek Skirmar s course on Structured Prediction 1 So far We discussed sequence labeling tasks:
More informationBayesian Networks. Semantics of Bayes Nets. Example (Binary valued Variables) CSC384: Intro to Artificial Intelligence Reasoning under Uncertainty-III
CSC384: Intro to Artificial Intelligence Reasoning under Uncertainty-III Bayesian Networks Announcements: Drop deadline is this Sunday Nov 5 th. All lecture notes needed for T3 posted (L13,,L17). T3 sample
More informationData Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur
Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Lecture 21 K - Nearest Neighbor V In this lecture we discuss; how do we evaluate the
More information