Probability Based Learning

1 Probability Based Learning Lecture 7, DD2431 Machine Learning J. Sullivan, A. Maki September 2013

2 Advantages of Probability Based Methods. They work with sparse training data: more powerful than deterministic methods such as decision trees when training data is sparse. Results are interpretable: more transparent and mathematically rigorous than methods such as ANNs or evolutionary methods. They are a tool for interpreting other methods: a framework for formalizing other methods, e.g. concept learning and least squares.

3 Outline Probability Theory Basics Bayes rule MAP and ML estimation Minimum Description Length principle Naïve Bayes Classifier EM Algorithm

4 Probability Theory Basics

5 Random Variables. A random variable x denotes a quantity that is uncertain, e.g. the result of flipping a coin or the result of measuring the temperature. The probability distribution P(x) of a random variable (r.v.) captures the fact that the r.v. will take different values when observed, and that some values occur more often than others.

6 Random Variables. A discrete random variable takes values from a predefined set. For a Boolean discrete random variable this predefined set has two members, e.g. {0, 1} or {yes, no}. A continuous random variable takes values that are real numbers. Note that a probability density can exceed one, but the area under the curve must always be one. (Figures omitted: a bar graph and a Hinton diagram for discrete probabilities, and a probability density function for a continuous variable. Figures taken from Computer Vision: models, learning and inference by Simon Prince.)

7 Joint Probabilities. Consider two random variables x and y. If we observe multiple paired instances of x and y, some paired outcomes will occur more frequently than others. This information is encoded in the joint probability distribution P(x, y). The notation P(x) also denotes the joint probability of a multi-dimensional variable x = (x_1, ..., x_K). (Figure of a discrete joint pdf from Computer Vision: models, learning and inference by Simon Prince.)

8 Joint Probabilities (cont.) (Figure panels a)-f) showing example joint distributions, from Computer Vision: models, learning and inference by Simon Prince.)

9 Marginalization. The probability distribution of any single variable can be recovered from a joint distribution by summing in the discrete case, P(x) = \sum_y P(x, y), and integrating in the continuous case, P(x) = \int P(x, y) \, dy.
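For the discrete case, marginalization is just summing a joint table along one axis. Below is a minimal NumPy sketch with a made-up 3x2 joint distribution P(x, y) (the numbers are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical joint distribution P(x, y) for discrete x (3 values) and y (2 values).
# Rows index x, columns index y; all entries sum to one.
P_xy = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.05, 0.20]])

P_x = P_xy.sum(axis=1)   # marginalize out y: P(x) = sum_y P(x, y)
P_y = P_xy.sum(axis=0)   # marginalize out x: P(y) = sum_x P(x, y)

print(P_x)   # -> 0.30, 0.45, 0.25
print(P_y)   # -> 0.45, 0.55
```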

10 Marginalization (cont.) (Figure panels a)-c) illustrating marginalization, from Computer Vision: models, learning and inference by Simon Prince.)

11 Conditional Probability. The conditional probability of x given that y takes the value y* describes the values of the r.v. x we will observe given that y is fixed to y*. It can be recovered from the joint distribution P(x, y): P(x | y = y*) = P(x, y = y*) / P(y = y*) = P(x, y = y*) / \int P(x, y = y*) \, dx. In other words: extract the appropriate slice of the joint, then normalize it. (Figure 2.5 from Computer Vision: models, learning and inference by Simon Prince: the joint pdf of x and y and two conditional distributions P(x | y = y1) and P(x | y = y2).)

12 Bayes Rule. P(y | x) = P(x | y) P(y) / P(x) = P(x | y) P(y) / \sum_y P(x | y) P(y). Each term in Bayes rule has a name: P(y | x) is the posterior (what we know about y given x); P(y) is the prior (what we know about y before we consider x); P(x | y) is the likelihood (the propensity for observing a certain value of x given a certain value of y); P(x) is the evidence (a constant that ensures the left-hand side is a valid distribution).

13 Bayes Rule. In many of our applications y is a discrete variable and x is a multi-dimensional data vector extracted from the world. Then P(y | x) = P(x | y) P(y) / P(x), where the likelihood P(x | y) represents the probability of observing data x given the hypothesis y, the prior P(y) represents the background knowledge of hypothesis y being correct, and the posterior P(y | x) represents the probability that hypothesis y is true after data x has been observed.

14 Learning and Inference. Bayesian inference: the process of calculating the posterior probability distribution P(y | x) for given data x. Bayesian learning: the process of learning the likelihood distribution P(x | y) and the prior distribution P(y) from a set of training points {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}.

15 Example: Which Gender? Task: determine the gender of a person given their measured hair length. Notation: let g ∈ {f, m} be a r.v. denoting the gender of a person, and let x be the measured length of the hair. Information given: the hair length observation was made at a boys' school, thus P(g = m) = 0.95 and P(g = f) = 0.05, and we have knowledge of the likelihood distributions P(x | g = f) and P(x | g = m).


17 Example: Which Gender? (cont.) The figure shows the likelihood distributions over hair length for each class, P(x | g = f) and P(x | g = m).

18 Example: Which Gender? Task: determine the gender of a person given their measured hair length, i.e. calculate P(g | x). Solution: apply Bayes rule to get P(g = m | x) = P(x | g = m) P(g = m) / P(x) = P(x | g = m) P(g = m) / [P(x | g = f) P(g = f) + P(x | g = m) P(g = m)]. We can then calculate P(g = f | x) = 1 - P(g = m | x).

19 Selecting the most probable hypothesis. Maximum a posteriori (MAP) estimate: the hypothesis with the highest probability given the observed data, y_MAP = arg max_{y ∈ Y} P(y | x) = arg max_{y ∈ Y} P(x | y) P(y) / P(x) = arg max_{y ∈ Y} P(x | y) P(y). Maximum likelihood estimate (MLE): the hypothesis with the highest likelihood of generating the observed data, y_MLE = arg max_{y ∈ Y} P(x | y). Useful if we do not know the prior distribution or if it is uniform.


21 Example: Cancer or Not? Scenario: a patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have cancer. Scenario in probabilities: priors P(disease) = 0.008 and P(not disease) = 0.992; likelihoods P(+ | disease) = 0.98, P(- | disease) = 0.02, P(+ | not disease) = 0.03, P(- | not disease) = 0.97.

22 Example: Cancer or Not? Find the MAP estimate when the test returned a positive result: y_MAP = arg max_{y ∈ {disease, not disease}} P(y | +) = arg max_{y ∈ {disease, not disease}} P(+ | y) P(y). Substituting in the correct values gives P(+ | disease) P(disease) = 0.98 × 0.008 = 0.0078 and P(+ | not disease) P(not disease) = 0.03 × 0.992 = 0.0298. Therefore y_MAP = "not disease". The posterior probabilities: P(disease | +) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21 and P(not disease | +) = 0.0298 / (0.0078 + 0.0298) ≈ 0.79.
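The arithmetic on this slide is easy to verify programmatically. Here is a short Python sketch using the priors and likelihoods given above; the numbers come from the slide, while the variable names are our own:

```python
# Priors and likelihoods from the slide.
p_disease, p_no_disease = 0.008, 0.992
p_pos_given_disease, p_pos_given_no_disease = 0.98, 0.03

# Unnormalized posteriors P(+ | y) P(y) for a positive test result.
score_disease = p_pos_given_disease * p_disease           # 0.98 * 0.008 = 0.00784
score_no_disease = p_pos_given_no_disease * p_no_disease  # 0.03 * 0.992 = 0.02976

y_map = "disease" if score_disease > score_no_disease else "not disease"
print(y_map)                                               # "not disease"

# Normalize by the evidence to obtain the posterior probabilities.
evidence = score_disease + score_no_disease
print(score_disease / evidence, score_no_disease / evidence)  # approx. 0.21, 0.79
```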


25 Relation to Occam's Razor. Occam's razor: choose the simplest explanation for the observed data. Information theoretic perspective: Occam's razor corresponds to choosing the explanation requiring the fewest bits to represent. The optimal representation requires -log_2 P(y | x) bits to store (remember the Shannon information content). Minimum description length principle: choose the hypothesis y_MDL = arg min_{y ∈ Y} [-log_2 P(y | x)] = arg min_{y ∈ Y} [-log_2 P(x | y) - log_2 P(y)]. The MDL estimate is equal to the MAP estimate y_MAP = arg max_{y ∈ Y} [log_2 P(x | y) + log_2 P(y)].
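Since -log_2 is monotonically decreasing, minimizing the description length and maximizing the posterior select the same hypothesis. A small Python check of this equivalence, reusing the cancer-test numbers from the earlier example; the candidate table and function name are our own:

```python
import math

# (likelihood P(+ | y), prior P(y)) for each hypothesis, from the cancer example.
candidates = {"disease": (0.98, 0.008), "not disease": (0.03, 0.992)}

def description_length(likelihood, prior):
    # Bits needed to encode the hypothesis and the data given the hypothesis.
    return -math.log2(likelihood) - math.log2(prior)

y_mdl = min(candidates, key=lambda y: description_length(*candidates[y]))
y_map = max(candidates, key=lambda y: candidates[y][0] * candidates[y][1])
print(y_mdl, y_map)   # both select "not disease"
```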


27 Naïve Bayes Classifier

28 Feature Space. Sources of noise: sensors give measurements which can be converted to features. Ideally a feature value is identical for all samples in one class. (Figure: samples mapped into feature space.)

29 Feature Space. Sources of noise: sensors give measurements which can be converted to features. However, in the real world feature values vary within a class because of measurement noise, intra-class variation, and a poor choice of features. (Figure: samples mapped into feature space.)

30 Feature Space. End result: a K-dimensional space, in which each dimension is a feature, containing n labelled samples (objects).

31 Problem: Large Feature Space. The size of the feature space is exponential in the number of features. More features give the potential for a better description of the objects, but... more features also make it more difficult to model P(x | y). Extreme solution: the Naïve Bayes classifier. All features (dimensions) are regarded as independent, so we model K one-dimensional distributions instead of one K-dimensional distribution.


33 Naïve Bayes Classifier. One of the most common learning methods. When to use: a moderate or large training set is available, and the features x_i of a data instance x are conditionally independent given the classification (or at least reasonably independent; it still works with a little dependence). Successful applications: medical diagnosis (symptoms treated as independent) and classification of text documents (words treated as independent).

34 Naïve Bayes Classifier. x is a vector (x_1, ..., x_K) of attribute or feature values. Let Y = {1, 2, ..., |Y|} be the set of possible classes. The MAP estimate of y is y_MAP = arg max_{y ∈ Y} P(y | x_1, ..., x_K) = arg max_{y ∈ Y} P(x_1, ..., x_K | y) P(y) / P(x_1, ..., x_K) = arg max_{y ∈ Y} P(x_1, ..., x_K | y) P(y). Naïve Bayes assumption: P(x_1, ..., x_K | y) = \prod_{k=1}^{K} P(x_k | y). This gives the Naïve Bayes classifier: y_MAP = arg max_{y ∈ Y} P(y) \prod_{k=1}^{K} P(x_k | y).


36 Example: Play Tennis? Question: will I go and play tennis given the forecast? My measurements: forecast ∈ {sunny, overcast, rainy}, temperature ∈ {hot, mild, cool}, humidity ∈ {high, normal}, windy ∈ {false, true}. Possible decisions: y ∈ {yes, no}.

37 Example: Play Tennis? What I did in the past: the training data, a table of previously observed days with their attribute values and whether I played tennis.

38 Example: Play Tennis? Counts of the days when I played tennis (did not play):
Outlook: sunny 2 (3), overcast 4 (0), rain 3 (2). Temperature: hot 2 (2), mild 4 (2), cool 3 (1). Humidity: high 3 (4), normal 6 (1). Windy: false 6 (2), true 3 (3).
Prior of whether I played tennis or not. Counts: yes 9, no 5. Prior probabilities: P(yes) = 9/14, P(no) = 5/14.
Likelihood of each attribute value when tennis was played (not played), P(x_i | y = yes) (P(x_i | y = no)):
Outlook: sunny 2/9 (3/5), overcast 4/9 (0/5), rain 3/9 (2/5). Temperature: hot 2/9 (2/5), mild 4/9 (2/5), cool 3/9 (1/5). Humidity: high 3/9 (4/5), normal 6/9 (1/5). Windy: false 6/9 (2/5), true 3/9 (3/5).


41 Example: Play Tennis? Inference: use the learnt model to classify a new instance. New instance: x = (sunny, cool, high, true). Apply the Naïve Bayes classifier: y_MAP = arg max_{y ∈ {yes, no}} P(y) \prod_{i=1}^{4} P(x_i | y). P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(true | yes) = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 ≈ 0.005. P(no) P(sunny | no) P(cool | no) P(high | no) P(true | no) = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 ≈ 0.021. Hence y_MAP = no.
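The same computation can be written in a few lines of Python. This sketch hard-codes the relevant priors and likelihoods read off the count tables above; the dictionary layout and names are our own, not part of the lecture:

```python
import math

# Likelihoods P(x_i | y) and priors from the count tables on the previous slides.
likelihood = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5},
}
prior = {"yes": 9/14, "no": 5/14}

x = ["sunny", "cool", "high", "true"]   # the new instance to classify

# Unnormalized Naive Bayes scores P(y) * prod_i P(x_i | y).
scores = {y: prior[y] * math.prod(likelihood[y][v] for v in x) for y in prior}
print(scores)                           # yes: ~0.0053, no: ~0.0206
print(max(scores, key=scores.get))      # 'no' -> do not play tennis
```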

42 Naïve Bayes: Independence Violation. The conditional independence assumption P(x_1, x_2, ..., x_K | y) = \prod_{k=1}^{K} P(x_k | y) is often violated, but the classifier works surprisingly well anyway! Note: we do not need the posterior probabilities P(y | x) to be correct, only the decision y_MAP. Since dependencies are ignored, Naïve Bayes posteriors are often unrealistically close to 0 or 1: different attributes "say the same thing" to a higher degree than we expect, because they are correlated in reality.


45 Naïve Bayes: Estimating Probabilities. Problem: what if none of the training instances with target value y have attribute value x_i? Then P(x_i | y) = 0, which forces P(y) \prod_{i=1}^{K} P(x_i | y) = 0. Solution: add as prior knowledge that P(x_i | y) must be larger than 0, using the estimate P(x_i | y) = (n_y + m p) / (n + m), where n is the number of training samples with label y, n_y is the number of training samples with label y and value x_i, p is a prior estimate of P(x_i | y), and m is the weight given to the prior estimate (in relation to the data).
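Below is a minimal sketch of this smoothed estimate as defined above. The function name and the illustrative numbers (the zero count of "overcast" among the five "no" days in the tennis data, a uniform prior p = 1/3 over the three outlook values, and m = 3) are our choices:

```python
def m_estimate(n_y, n, p, m):
    """Smoothed estimate of P(x_i | y): (n_y + m*p) / (n + m).

    n_y -- training samples with label y and attribute value x_i
    n   -- training samples with label y
    p   -- prior estimate of P(x_i | y)
    m   -- weight (equivalent sample size) given to the prior
    """
    return (n_y + m * p) / (n + m)

# Example: 'overcast' never occurs among the 5 'no' days in the tennis data.
# A uniform prior over the 3 outlook values with m = 3 avoids a zero probability.
print(m_estimate(0, 5, 1/3, 3))   # 0.125 instead of 0.0
```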


47 Example: Spam detection. Aim: build a classifier to identify spam e-mails. How: Training. Create a dictionary of words and tokens W = {w_1, ..., w_L}. These words should be those which are specific to spam or non-spam e-mails. An e-mail is a concatenation, in order, of its words and tokens: e = (e_1, e_2, ..., e_K) with e_i ∈ W. We must model and learn P(e_1, e_2, ..., e_K | spam) and P(e_1, e_2, ..., e_K | not spam).


49 Example: Spam detection (cont.) An example e-mail E: "Dear customer, A fully licensed Online Pharmacy is offering pharmaceuticals: - brought to you directly from abroad - produced by the same multinational corporations selling through the major US pharmacies - priced up to 5 times cheaper as compared to major US pharmacies. Enjoy the US dollar purchasing power on ..." Concatenate the words from the e-mail, in order, into a vector e = ('dear', 'customer', ',', 'a', 'fully', 'licensed', ..., '/').

50 Example: Spam detection (cont.) Inference: given an e-mail E, compute e = (e_1, ..., e_K) and use Bayes rule to compute P(spam | e_1, ..., e_K) ∝ P(e_1, ..., e_K | spam) P(spam).

51 Example: Spam detection. How is the joint probability distribution P(e_1, ..., e_K | spam) modelled? Remember that K will be very large and will vary from e-mail to e-mail. Make the conditional independence assumption P(e_1, ..., e_K | spam) = \prod_{k=1}^{K} P(e_k | spam), and similarly P(e_1, ..., e_K | not spam) = \prod_{k=1}^{K} P(e_k | not spam). We have assumed that the position of a word is not important.


54 Example: Spam detection. Learning: assume one has n training e-mails and their labels (spam / non-spam), S = {(e_1, y_1), ..., (e_n, y_n)}. Note: e_i = (e_{i1}, ..., e_{iK_i}).

55 Example: Spam detection (cont.) Create the dictionary: take the union of all the distinctive words and tokens in e_1, ..., e_n to create W = {w_1, ..., w_L}. (Note: words such as "and", "the", ... are omitted.)

56 Example: Spam detection (cont.) Learn the probabilities. For y ∈ {spam, not spam}: 1. Set P(y) = \frac{1}{n} \sum_{i=1}^{n} Ind(y_i = y), the proportion of e-mails from class y. 2. Compute n_y = \sum_{i=1}^{n} K_i \, Ind(y_i = y), the total number of words in the class-y e-mails. 3. For each word w_l compute n_{yl} = \sum_{i=1}^{n} Ind(y_i = y) \sum_{k=1}^{K_i} Ind(e_{ik} = w_l), the number of occurrences of word w_l in the class-y e-mails. 4. Set P(w_l | y) = \frac{n_{yl} + 1}{n_y + |W|}, i.e. assume a prior value of P(w_l | y) of 1/|W|.

57 Example: Spam detection. Inference: classify a new e-mail e = (e_1, ..., e_K) as y* = arg max_{y ∈ {spam, not spam}} P(y) \prod_{k=1}^{K} P(e_k | y).
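Putting the learning steps of slides 54-56 and this decision rule together, here is a hedged end-to-end sketch in Python. The toy e-mails, the function names and the use of log-probabilities are our additions, not the lecture's reference implementation; the smoothed estimate P(w_l | y) = (n_yl + 1) / (n_y + |W|) follows the slide:

```python
import math
from collections import Counter

def train(emails, labels):
    """Learn P(y) and smoothed P(w | y) from tokenized e-mails (slides 54-56)."""
    classes = set(labels)
    vocab = {w for e in emails for w in e}
    prior = {y: sum(l == y for l in labels) / len(labels) for y in classes}
    word_counts = {y: Counter() for y in classes}
    for e, y in zip(emails, labels):
        word_counts[y].update(e)
    n_words = {y: sum(word_counts[y].values()) for y in classes}
    likelihood = {y: {w: (word_counts[y][w] + 1) / (n_words[y] + len(vocab))
                      for w in vocab} for y in classes}
    return prior, likelihood, vocab

def classify(email, prior, likelihood, vocab):
    """MAP decision, using log-probabilities to avoid numerical underflow."""
    def log_score(y):
        return math.log(prior[y]) + sum(math.log(likelihood[y][w])
                                        for w in email if w in vocab)
    return max(prior, key=log_score)

# Hypothetical toy training set of tokenized e-mails.
emails = [["cheap", "pharmacy", "offer"], ["meeting", "tomorrow"],
          ["cheap", "offer", "now"], ["lecture", "notes", "tomorrow"]]
labels = ["spam", "not spam", "spam", "not spam"]

prior, likelihood, vocab = train(emails, labels)
print(classify(["cheap", "pharmacy"], prior, likelihood, vocab))   # 'spam'
```

Working in log space is the standard trick when multiplying many small probabilities; it does not change the arg max.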

58 Summary so far Bayesian theory: Combines prior knowledge and observed data to find the most probable hypothesis. Naïve Bayes Classifier: All variables considered independent.

59 Expectation-Maximization (EM) Algorithm

60 Mixture of Gaussians. This distribution is a weighted sum of K Gaussian distributions, P(x) = \sum_{k=1}^{K} \pi_k N(x; \mu_k, \sigma_k^2), where \pi_1 + ... + \pi_K = 1 and \pi_k > 0 for k = 1, ..., K. This model can describe complex multi-modal probability distributions by combining simpler distributions. (Figure 7.6 from Computer Vision: models, learning and inference by Simon Prince: a mixture of Gaussians model in 1D, where a complex multimodal density is created by taking a weighted sum of several constituent normal distributions with different means and variances; to ensure the final distribution is a valid density, the weights must be positive and sum to one.)
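To make the definition concrete, here is a short NumPy sketch that evaluates a mixture density on a grid of points; the three components and their weights are made up for illustration and are not from the lecture:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # Univariate Gaussian density N(x; mu, sigma^2).
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, weights, means, sigmas):
    """P(x) = sum_k pi_k N(x; mu_k, sigma_k^2), with positive weights summing to one."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))

# Hypothetical 3-component mixture.
weights, means, sigmas = [0.5, 0.3, 0.2], [-2.0, 0.0, 3.0], [0.5, 1.0, 0.8]
xs = np.linspace(-5, 6, 1000)
density = mixture_pdf(xs, weights, means, sigmas)

# Riemann-sum approximation of the integral: close to 1, so it is a valid density.
print((density * (xs[1] - xs[0])).sum())
```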

61 Mixture of Gaussians. P(x) = \sum_{k=1}^{K} \pi_k N(x; \mu_k, \sigma_k^2). Learning the parameters of this model from training data x_1, ..., x_n is not trivial using the usual straightforward maximum likelihood approach. Instead, learn the parameters using the Expectation-Maximization (EM) algorithm.

62 Mixture of Gaussians as a marginalization. We can interpret the mixture of Gaussians model by introducing a discrete hidden/latent variable h and a joint distribution P(x, h): P(x) = \sum_{k=1}^{K} P(x, h = k) = \sum_{k=1}^{K} P(x | h = k) P(h = k) = \sum_{k=1}^{K} \pi_k N(x; \mu_k, \sigma_k^2). The hidden variable has a straightforward interpretation: it is the index of the constituent normal distribution. (Figure 7.7 from Computer Vision: models, learning and inference by Simon Prince: the mixture of Gaussians as a marginalization of the joint distribution P(x, h) over h.)

63 EM for two Gaussians. Assume: we know that the pdf of x has the form P(x) = \pi_1 N(x; \mu_1, \sigma_1^2) + \pi_2 N(x; \mu_2, \sigma_2^2), where \pi_1 + \pi_2 = 1 and \pi_k > 0 for components k = 1, 2. Unknown: the values of the parameters (many!) \Theta = (\pi_1, \mu_1, \sigma_1, \mu_2, \sigma_2). Have: observed n samples x_1, ..., x_n drawn from P(x). Want to: estimate \Theta from x_1, ..., x_n. How would it be possible to get them all?

64 EM for two Gaussians. For each sample x_i introduce a hidden variable h_i: h_i = 1 if sample x_i was drawn from N(x; \mu_1, \sigma_1^2), and h_i = 2 if sample x_i was drawn from N(x; \mu_2, \sigma_2^2). Come up with initial values for each of the parameters, \Theta^{(0)} = (\pi_1^{(0)}, \mu_1^{(0)}, \sigma_1^{(0)}, \mu_2^{(0)}, \sigma_2^{(0)}). EM is an iterative algorithm which updates \Theta^{(t)} using the following two steps...

65 EM for two Gaussians: E-step. The responsibility of the k-th Gaussian for each sample x. (Figure 7.8 from Computer Vision: models, learning and inference by Simon Prince: for each data point x_i we calculate the posterior distribution P(h_i | x_i) over the hidden variable h_i; the posterior probability P(h_i = k | x_i) can be understood as the responsibility of normal distribution k for data point x_i, indicated in the joint-distribution plot by the size of the projected data point.)

66 EM for two Gaussians: E-step (cont.) E-step: compute the posterior probability (responsibility) that x_i was generated by component k, given the current estimate of the parameters \Theta^{(t)}. For i = 1, ..., n and k = 1, 2: \gamma_{ik}^{(t)} = P(h_i = k | x_i, \Theta^{(t)}) = \frac{\pi_k^{(t)} N(x_i; \mu_k^{(t)}, \sigma_k^{(t)})}{\pi_1^{(t)} N(x_i; \mu_1^{(t)}, \sigma_1^{(t)}) + \pi_2^{(t)} N(x_i; \mu_2^{(t)}, \sigma_2^{(t)})}. Note: \gamma_{i1}^{(t)} + \gamma_{i2}^{(t)} = 1 and \pi_1 + \pi_2 = 1.

67 EM for two Gaussians: M-step. Fit the Gaussian model for each k-th constituent. Sample x_i contributes according to the responsibility \gamma_{ik}. (Figure 7.9 from Computer Vision: models, learning and inference by Simon Prince: in the M-step the parameters of the k-th constituent Gaussian are updated, with each data point x_i contributing according to the responsibility assigned in the E-step; dashed and solid lines show the fit before and after the update.)

68 EM for two Gaussians: M-step (cont.) M-step: compute the maximum likelihood estimates of the parameters of the mixture model given our data's membership distribution, the \gamma_i^{(t)}'s. For k = 1, 2: \mu_k^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma_{ik}^{(t)} x_i}{\sum_{i=1}^{n} \gamma_{ik}^{(t)}}, \quad (\sigma_k^{(t+1)})^2 = \frac{\sum_{i=1}^{n} \gamma_{ik}^{(t)} (x_i - \mu_k^{(t+1)})^2}{\sum_{i=1}^{n} \gamma_{ik}^{(t)}}, \quad \pi_k^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma_{ik}^{(t)}}{n}.
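The E- and M-step formulas above translate almost line for line into NumPy. This is a sketch, not the lecture's reference implementation: the synthetic data, the initial guesses and the fixed iteration count are our choices, and a real implementation would monitor the log-likelihood for convergence:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (the true parameters are unknown to EM).
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Initial guesses Theta^(0).
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for t in range(100):
    # E-step: responsibilities gamma_ik = P(h_i = k | x_i, Theta^(t)), shape (n, 2).
    weighted = np.stack([pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
    gamma = weighted / weighted.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the responsibilities.
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk          # uses the new mu below
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(x)

print(pi, mu, sigma)   # should approach the generating weights, means and std devs
```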

69 EM in practice. (Figure from Computer Vision: models, learning and inference by Simon Prince, chapter on modeling complex data densities.)

70 Summary. Bayesian theory: combines prior knowledge and observed data to find the most probable hypothesis. Naïve Bayes classifier: all variables considered independent. EM algorithm: learn a probability distribution (model parameters) in the presence of hidden variables. If you are interested in learning more take a look at: C. M. Bishop, Pattern Recognition and Machine Learning, Springer Verlag 2006.


More information

Bayes Networks. CS540 Bryan R Gibson University of Wisconsin-Madison. Slides adapted from those used by Prof. Jerry Zhu, CS540-1

Bayes Networks. CS540 Bryan R Gibson University of Wisconsin-Madison. Slides adapted from those used by Prof. Jerry Zhu, CS540-1 Bayes Networks CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 59 Outline Joint Probability: great for inference, terrible to obtain

More information

MLE/MAP + Naïve Bayes

MLE/MAP + Naïve Bayes 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes MLE / MAP Readings: Estimating Probabilities (Mitchell, 2016)

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

COM336: Neural Computing

COM336: Neural Computing COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

More information

Be able to define the following terms and answer basic questions about them:

Be able to define the following terms and answer basic questions about them: CS440/ECE448 Section Q Fall 2017 Final Review Be able to define the following terms and answer basic questions about them: Probability o Random variables, axioms of probability o Joint, marginal, conditional

More information

Naïve Bayes Classifiers and Logistic Regression. Doug Downey Northwestern EECS 349 Winter 2014

Naïve Bayes Classifiers and Logistic Regression. Doug Downey Northwestern EECS 349 Winter 2014 Naïve Bayes Classifiers and Logistic Regression Doug Downey Northwestern EECS 349 Winter 2014 Naïve Bayes Classifiers Combines all ideas we ve covered Conditional Independence Bayes Rule Statistical Estimation

More information

CLASSIFICATION NAIVE BAYES. NIKOLA MILIKIĆ UROŠ KRČADINAC

CLASSIFICATION NAIVE BAYES. NIKOLA MILIKIĆ UROŠ KRČADINAC CLASSIFICATION NAIVE BAYES NIKOLA MILIKIĆ nikola.milikic@fon.bg.ac.rs UROŠ KRČADINAC uros@krcadinac.com WHAT IS CLASSIFICATION? A supervised learning task of determining the class of an instance; it is

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 2013-14 We know that X ~ B(n,p), but we do not know p. We get a random sample

More information

Human-Oriented Robotics. Probability Refresher. Kai Arras Social Robotics Lab, University of Freiburg Winter term 2014/2015

Human-Oriented Robotics. Probability Refresher. Kai Arras Social Robotics Lab, University of Freiburg Winter term 2014/2015 Probability Refresher Kai Arras, University of Freiburg Winter term 2014/2015 Probability Refresher Introduction to Probability Random variables Joint distribution Marginalization Conditional probability

More information

Machine Learning Recitation 8 Oct 21, Oznur Tastan

Machine Learning Recitation 8 Oct 21, Oznur Tastan Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan Outline Tree representation Brief information theory Learning decision trees Bagging Random forests Decision trees Non linear classifier Easy

More information

Machine Learning

Machine Learning Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 13, 2011 Today: The Big Picture Overfitting Review: probability Readings: Decision trees, overfiting

More information