Probability Based Learning


Probability Based Learning Lecture 7, DD2431 Machine Learning J. Sullivan, A. Maki September 2013

Advantages of Probability-Based Methods
Work with sparse training data: more powerful than deterministic methods, such as decision trees, when training data is sparse.
Results are interpretable: more transparent and mathematically rigorous than methods such as ANNs or evolutionary methods.
Tool for interpreting other methods: a framework for formalizing other methods, e.g. concept learning and least squares.

Outline Probability Theory Basics Bayes rule MAP and ML estimation Minimum Description Length principle Naïve Bayes Classifier EM Algorithm

Probability Theory Basics

Random Variables
A random variable x denotes a quantity that is uncertain, e.g. the result of flipping a coin or the result of measuring the temperature. The probability distribution P(x) of a random variable (r.v.) captures the fact that the r.v. will take different values when observed, and that some values occur more often than others.

Random Variables (cont.)
A discrete random variable takes values from a predefined set. For a Boolean discrete random variable this predefined set has two members, e.g. {0, 1} or {yes, no}. A continuous random variable takes values that are real numbers. Note that a probability density can exceed one, but the area under the curve must always be one (for a discrete distribution, the bar heights must sum to one).
(Figures: a bar graph of the probabilities that a biased 6-sided die lands on each face, a Hinton diagram of the probability of observing different weather types in England, and a continuous pdf for the time taken to complete a test, from Computer Vision: models, learning and inference by Simon Prince.)

Joint Probabilities
Consider two random variables x and y, and observe multiple paired instances of them. Some paired outcomes will occur more frequently than others; this information is encoded in the joint probability distribution P(x, y). For a vector x = (x_1, ..., x_K), P(x) denotes the joint probability of its components.
(Figure: a discrete joint pdf, from Computer Vision: models, learning and inference by Simon Prince.)

Joint Probabilities (cont.)
(Figure: examples a)-f) of joint distributions over discrete and continuous variables, from Computer Vision: models, learning and inference by Simon Prince.)

Marginalization
The probability distribution of any single variable can be recovered from a joint distribution by summing in the discrete case,

    P(x) = Σ_y P(x, y)

and by integrating in the continuous case,

    P(x) = ∫ P(x, y) dy
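For the discrete case, the sum above can be sketched directly on a small joint table; the joint probability values below are made up for illustration.

```python
# Marginalization for the discrete case: recover P(x) from a joint table
# P(x, y) by summing over y. The joint values here are illustrative.
joint = {                      # P(x, y) over x in {0, 1}, y in {0, 1}
    (0, 0): 0.1, (0, 1): 0.3,
    (1, 0): 0.4, (1, 1): 0.2,
}

p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p   # P(x) = sum over y of P(x, y)

print(p_x)  # P(x): close to {0: 0.4, 1: 0.6}, up to float rounding
```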

Marginalization (cont.)
(Figure: examples a)-c) of marginal distributions recovered from joint distributions, from Computer Vision: models, learning and inference by Simon Prince.)

Conditional Probability
The conditional probability of x given that y takes the value y* describes the distribution of the r.v. x when y is fixed to y*. It can be recovered from the joint distribution P(x, y):

    P(x | y = y*) = P(x, y = y*) / P(y = y*),  where P(y = y*) = ∫ P(x, y = y*) dx

i.e. extract the appropriate slice of the joint distribution and then normalize it.
(Figure: a joint pdf of x and y and two conditional distributions P(x | y = y1) and P(x | y = y2), from Computer Vision: models, learning and inference by Simon Prince.)

Bayes Rule

    P(y | x) = P(x | y) P(y) / P(x) = P(x | y) P(y) / Σ_y' P(x | y') P(y')

Each term in Bayes rule has a name:
P(y | x) — posterior (what we know about y given x)
P(y) — prior (what we know about y before we consider x)
P(x | y) — likelihood (propensity for observing a certain value of x given a certain value of y)
P(x) — evidence (a constant that ensures the left-hand side is a valid distribution)
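As a minimal sketch of the rule, the following Python function normalizes likelihood times prior by the evidence; the prior and likelihood numbers below are made up purely for illustration.

```python
# Minimal sketch of Bayes' rule for a discrete hypothesis y and an
# observation x. The numbers below are illustrative, not from any dataset.

def posterior(likelihood, prior):
    """Return P(y | x) for every y, given P(x | y) and P(y)."""
    # Evidence P(x) = sum over y of P(x | y) P(y); it normalizes the result.
    evidence = sum(likelihood[y] * prior[y] for y in prior)
    return {y: likelihood[y] * prior[y] / evidence for y in prior}

prior = {"A": 0.7, "B": 0.3}          # P(y)
likelihood = {"A": 0.2, "B": 0.6}     # P(x | y) for the observed x

post = posterior(likelihood, prior)
print(post)  # the posteriors sum to one
```

Note how a hypothesis with a small prior ("B") can still win the posterior if its likelihood is large enough.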

Bayes Rule
In many of our applications y is a discrete variable and x is a multi-dimensional data vector extracted from the world. Then

    P(y | x) = P(x | y) P(y) / P(x)

where the likelihood P(x | y) represents the probability of observing data x given the hypothesis y, the prior P(y) represents the background knowledge of hypothesis y being correct, and the posterior P(y | x) represents the probability that hypothesis y is true after data x has been observed.

Learning and Inference
Bayesian inference: the process of calculating the posterior probability distribution P(y | x) for certain data x.
Bayesian learning: the process of learning the likelihood distribution P(x | y) and the prior probability distribution P(y) from a set of training points {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}.

Example: Which Gender?
Task: determine the gender of a person given their measured hair length.
Notation: let g ∈ {f, m} be a r.v. denoting the gender of a person, and let x be the measured length of the hair.
Information given: the hair length observation was made at a boys' school, thus P(g = m) = .95 and P(g = f) = .05, and we have knowledge of the likelihood distributions P(x | g = f) and P(x | g = m).


Example: Which Gender?
Task: determine the gender of a person given their measured hair length, i.e. calculate P(g | x).
Solution: apply Bayes rule to get

    P(g = m | x) = P(x | g = m) P(g = m) / P(x)
                 = P(x | g = m) P(g = m) / [ P(x | g = f) P(g = f) + P(x | g = m) P(g = m) ]

and then P(g = f | x) = 1 − P(g = m | x).

Selecting the most probable hypothesis
Maximum a posteriori (MAP) estimate: the hypothesis with the highest probability given the observed data,

    y_MAP = argmax_{y ∈ Y} P(y | x)
          = argmax_{y ∈ Y} P(x | y) P(y) / P(x)
          = argmax_{y ∈ Y} P(x | y) P(y)

Maximum likelihood estimate (MLE): the hypothesis with the highest likelihood of generating the observed data,

    y_MLE = argmax_{y ∈ Y} P(x | y)

useful if we do not know the prior distribution or if it is uniform.


Example: Cancer or Not?
Scenario: a patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have cancer.
Scenario in probabilities:
Priors: P(disease) = .008, P(not disease) = .992
Likelihoods: P(+ | disease) = .98, P(− | disease) = .02, P(+ | not disease) = .03, P(− | not disease) = .97

Example: Cancer or Not?
Find the MAP estimate when the test has returned a positive result:

    y_MAP = argmax_{y ∈ {disease, not disease}} P(y | +) = argmax_{y ∈ {disease, not disease}} P(+ | y) P(y)

Substituting in the correct values:

    P(+ | disease) P(disease) = .98 × .008 = .0078
    P(+ | not disease) P(not disease) = .03 × .992 = .0298

Therefore y_MAP = "not disease". The posterior probabilities:

    P(disease | +) = .0078 / (.0078 + .0298) = .21
    P(not disease | +) = .0298 / (.0078 + .0298) = .79

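The same computation can be checked numerically; the probabilities below are exactly the ones given in the scenario above.

```python
# The cancer example computed numerically: MAP estimate and posteriors
# after a positive test result.
prior = {"disease": 0.008, "not disease": 0.992}          # P(y)
likelihood_pos = {"disease": 0.98, "not disease": 0.03}   # P(+ | y)

# Unnormalized posteriors P(+ | y) P(y); the MAP winner needs no evidence.
unnorm = {y: likelihood_pos[y] * prior[y] for y in prior}
y_map = max(unnorm, key=unnorm.get)

# Normalize by the evidence P(+) to get the actual posteriors.
evidence = sum(unnorm.values())
posterior = {y: unnorm[y] / evidence for y in prior}

print(round(unnorm["disease"], 4))      # 0.0078
print(round(unnorm["not disease"], 4))  # 0.0298
print(y_map)                            # not disease
print(round(posterior["disease"], 2))   # 0.21
```

Despite the positive test, the MAP hypothesis is "not disease", because the disease prior is so small.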

Relation to Occam's Razor
Occam's razor: choose the simplest explanation for the observed data.
Information-theoretic perspective: Occam's razor corresponds to choosing the explanation requiring the fewest bits to represent. The optimal representation requires −log₂ P(y | x) bits to store (remember the Shannon information content).
Minimum description length (MDL) principle: choose the hypothesis

    y_MDL = argmin_{y ∈ Y} −log₂ P(y | x) = argmin_{y ∈ Y} [ −log₂ P(x | y) − log₂ P(y) ]

The MDL estimate is equal to the MAP estimate:

    y_MAP = argmax_{y ∈ Y} [ log₂ P(x | y) + log₂ P(y) ]


Naïve Bayes Classifier

Feature Space: Sources of Noise
Sensors give measurements which can be converted to features. Ideally a feature value is identical for all samples in one class.
(Figure: samples mapped into feature space.)

Feature Space: Sources of Noise
However, in the real world feature values vary within a class because of measurement noise, intra-class variation, and a poor choice of features.
(Figure: samples mapped into feature space.)

Feature Space
End result: a K-dimensional space, in which each dimension is a feature, containing n labelled samples (objects).

Problem: Large Feature Space
The size of the feature space is exponential in the number of features. More features mean a potentially better description of the objects, but also make it more difficult to model P(x | y).
Extreme solution: the Naïve Bayes classifier. All features (dimensions) are regarded as independent, so we model K one-dimensional distributions instead of one K-dimensional distribution.


Naïve Bayes Classifier
One of the most common learning methods. When to use: a moderate or large training set is available, and the features x_i of a data instance x are conditionally independent given the classification (or at least reasonably independent; it still works with a little dependence).
Successful applications: medical diagnosis (symptoms independent), classification of text documents (words independent).

Naïve Bayes Classifier
x is a vector (x_1, ..., x_K) of attribute or feature values. Let Y = {1, 2, ..., |Y|} be the set of possible classes. The MAP estimate of y is

    y_MAP = argmax_{y ∈ Y} P(y | x_1, ..., x_K)
          = argmax_{y ∈ Y} P(x_1, ..., x_K | y) P(y) / P(x_1, ..., x_K)
          = argmax_{y ∈ Y} P(x_1, ..., x_K | y) P(y)

Naïve Bayes assumption:

    P(x_1, ..., x_K | y) = ∏_{k=1}^{K} P(x_k | y)

This gives the Naïve Bayes classifier:

    y_MAP = argmax_{y ∈ Y} P(y) ∏_{k=1}^{K} P(x_k | y)


Example: Play Tennis?
Question: will I go and play tennis given the forecast?
My measurements: forecast ∈ {sunny, overcast, rainy}, temperature ∈ {hot, mild, cool}, humidity ∈ {high, normal}, windy ∈ {false, true}.
Possible decisions: y ∈ {yes, no}

Example: Play Tennis?
What I did in the past: (table of training data — past days with their weather attributes and whether tennis was played).

Example: Play Tennis?
Counts of when I played tennis (did not play):

    Outlook:     sunny 2 (3), overcast 4 (0), rain 3 (2)
    Temperature: hot 2 (2), mild 4 (2), cool 3 (1)
    Humidity:    high 3 (4), normal 6 (1)
    Windy:       false 6 (2), true 3 (3)

Prior of whether I played tennis or not: counts yes 9, no 5, giving prior probabilities P(yes) = 9/14 and P(no) = 5/14.

Likelihood of each attribute value when tennis was played (was not played), P(x_i | y = yes) (P(x_i | y = no)):

    Outlook:     sunny 2/9 (3/5), overcast 4/9 (0/5), rain 3/9 (2/5)
    Temperature: hot 2/9 (2/5), mild 4/9 (2/5), cool 3/9 (1/5)
    Humidity:    high 3/9 (4/5), normal 6/9 (1/5)
    Windy:       false 6/9 (2/5), true 3/9 (3/5)


Example: Play Tennis?
Inference: use the learnt model to classify a new instance x = (sunny, cool, high, true). Apply the Naïve Bayes classifier:

    y_MAP = argmax_{y ∈ {yes, no}} P(y) ∏_{i=1}^{4} P(x_i | y)

    P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(true | yes) = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 ≈ .005
    P(no) P(sunny | no) P(cool | no) P(high | no) P(true | no) = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 ≈ .021

    ⇒ y_MAP = no
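The tennis inference above can be reproduced in a few lines with exact fractions; the priors and likelihoods are the ones tabulated on the previous slide.

```python
from fractions import Fraction as F

# Naive Bayes inference for the play-tennis example, using the counts
# from the slides.
prior = {"yes": F(9, 14), "no": F(5, 14)}
likelihood = {  # P(x_i | y) for the attribute values of the new instance
    "yes": {"sunny": F(2, 9), "cool": F(3, 9), "high": F(3, 9), "true": F(3, 9)},
    "no":  {"sunny": F(3, 5), "cool": F(1, 5), "high": F(4, 5), "true": F(3, 5)},
}

x = ["sunny", "cool", "high", "true"]  # new instance
score = {}
for y in prior:
    s = prior[y]
    for v in x:
        s *= likelihood[y][v]          # product of P(x_i | y)
    score[y] = s

y_map = max(score, key=score.get)
print(float(score["yes"]))  # about 0.005
print(float(score["no"]))   # about 0.021
print(y_map)                # no
```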

Naïve Bayes: Independence Violation
The conditional independence assumption

    P(x_1, x_2, ..., x_K | y) = ∏_{k=1}^{K} P(x_k | y)

is often violated, but the classifier works surprisingly well anyway. Note that we do not need the posterior probabilities P(y | x) to be correct; we only need y_MAP to be correct. Since dependencies are ignored, naïve Bayes posteriors are often unrealistically close to 0 or 1: different attributes "say the same thing" to a higher degree than we expect, because they are correlated in reality.


Naïve Bayes: Estimating Probabilities
Problem: what if none of the training instances with target value y have attribute value x_i? Then

    P(x_i | y) = 0  ⇒  P(y) ∏_{i=1}^{K} P(x_i | y) = 0

Solution: add as prior knowledge that P(x_i | y) must be larger than 0,

    P(x_i | y) = (n_c + m p) / (n + m)

where n is the number of training samples with label y, n_c is the number of training samples with label y and value x_i, p is a prior estimate of P(x_i | y), and m is the weight given to the prior estimate (relative to the data).

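A tiny sketch of this m-estimate follows. The particular numbers are an illustrative assumption: the zero-count cell overcast/no from the tennis example (0 of 5 class-no samples), a uniform prior p = 1/3 over the three outlook values, and an equivalent sample size m = 3.

```python
# m-estimate of a conditional probability: (n_c + m*p) / (n + m).
# It smooths zero counts toward the prior estimate p.

def m_estimate(n_c, n, p, m):
    """Smoothed estimate of P(x_i | y); never zero when m*p > 0."""
    return (n_c + m * p) / (n + m)

# Outlook = overcast given play = no: 0 of 5 training samples.
# Assumed prior p = 1/3 (three outlook values), weight m = 3.
print(m_estimate(0, 5, 1 / 3, 3))  # 0.125 instead of 0
```

With m = 0 the estimate falls back to the raw frequency n_c / n; larger m pulls it harder toward p.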

Example: Spam detection
Aim: build a classifier to identify spam e-mails.
How (training): create a dictionary of words and tokens W = {w_1, ..., w_L}; these words should be those which are specific to spam or non-spam e-mails. An e-mail is a concatenation, in order, of its words and tokens: e = (e_1, e_2, ..., e_K) with e_i ∈ W. We must model and learn P(e_1, e_2, ..., e_K | spam) and P(e_1, e_2, ..., e_K | not spam).


Example: Spam detection
E-mail E: "Dear customer, A fully licensed Online Pharmacy is offering pharmaceuticals: - brought to you directly from abroad - produced by the same multinational corporations selling through the major US pharmacies - priced up to 5 times cheaper as compared to major US pharmacies. Enjoy the US dollar purchasing power on http://pharmacy-buyonline.com.ua/"
Vector e: ('dear', 'customer', ',', 'a', 'fully', 'licensed', ..., '/')
Concatenate the words from the e-mail into a vector.

Example: Spam detection
Inference: given an e-mail E, compute e = (e_1, ..., e_K) and use Bayes rule to compute

    P(spam | e_1, ..., e_K) ∝ P(e_1, ..., e_K | spam) P(spam)

Example: Spam detection
How is the joint probability distribution P(e_1, ..., e_K | spam) modelled? Remember that K will be very large and will vary from e-mail to e-mail. Make the conditional independence assumption

    P(e_1, ..., e_K | spam) = ∏_{k=1}^{K} P(e_k | spam)

and similarly

    P(e_1, ..., e_K | not spam) = ∏_{k=1}^{K} P(e_k | not spam)

Note that we have assumed the position of a word is not important.


Example: Spam detection
Learning: assume one has n training e-mails and their labels (spam / non-spam),

    S = {(e_1, y_1), ..., (e_n, y_n)}

Note: e_i = (e_i1, ..., e_iK_i).

Example: Spam detection
Create the dictionary: take the union of all the distinctive words and tokens in e_1, ..., e_n to create W = {w_1, ..., w_L}. (Note: words such as "and", "the", ... are omitted.)

Example: Spam detection
Learn the probabilities. For y ∈ {spam, not spam}:

    1. Set P(y) = Σ_{i=1}^{n} Ind(y_i = y) / n, the proportion of e-mails from class y.
    2. Compute n_y = Σ_{i=1}^{n} K_i Ind(y_i = y), the total number of words in the class-y e-mails.
    3. For each word w_l compute n_yl = Σ_{i=1}^{n} Ind(y_i = y) Σ_{k=1}^{K_i} Ind(e_ik = w_l), the number of occurrences of w_l in the class-y e-mails.
    4. Set P(w_l | y) = (n_yl + 1) / (n_y + |W|), which assumes a prior value of 1/|W| for P(w_l | y).

Example: Spam detection
Inference: classify a new e-mail e = (e_1, ..., e_K) with

    y* = argmax_{y ∈ {spam, not spam}} P(y) ∏_{k=1}^{K} P(e_k | y)
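The learning and inference steps above can be sketched end to end. The three-e-mail training set and its words below are invented for illustration; the estimation follows the slides' recipe, including the (n_yl + 1)/(n_y + |W|) smoothing.

```python
from collections import Counter

# Sketch of the slides' spam learner. The tiny training set is made up.
train = [
    (["cheap", "pharmacy", "offer"], "spam"),
    (["meeting", "agenda", "monday"], "not spam"),
    (["cheap", "offer", "now"], "spam"),
]

classes = ["spam", "not spam"]
vocab = {w for words, _ in train for w in words}   # dictionary W
prior = {y: sum(lbl == y for _, lbl in train) / len(train) for y in classes}
counts = {y: Counter() for y in classes}           # n_yl per word
totals = {y: 0 for y in classes}                   # n_y per class
for words, y in train:
    counts[y].update(words)
    totals[y] += len(words)

def p_word(w, y):
    # P(w | y) = (n_yl + 1) / (n_y + |W|), Laplace-smoothed
    return (counts[y][w] + 1) / (totals[y] + len(vocab))

def classify(words):
    score = {y: prior[y] for y in classes}
    for y in classes:
        for w in words:
            score[y] *= p_word(w, y)
    return max(score, key=score.get)

print(classify(["cheap", "pharmacy", "now"]))  # spam
```

Thanks to the smoothing, a word never seen in one class (e.g. "now" for not-spam) does not zero out that class's score.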

Summary so far
Bayesian theory: combines prior knowledge and observed data to find the most probable hypothesis.
Naïve Bayes classifier: all variables considered independent.

Expectation-Maximization (EM) Algorithm

Mixture of Gaussians
This distribution is a weighted sum of K Gaussian distributions,

    P(x) = Σ_{k=1}^{K} π_k N(x; μ_k, σ_k²)

where π_1 + ... + π_K = 1 and π_k > 0 (k = 1, ..., K). This model can describe complex multi-modal probability distributions by combining simpler distributions. To ensure that the final distribution is a valid density, the weights must be positive and sum to one.
(Figure: a complex multimodal pdf created as a weighted sum of several constituent normal distributions with different means and variances, from Computer Vision: models, learning and inference by Simon Prince.)

Mixture of Gaussians

    P(x) = Σ_{k=1}^{K} π_k N(x; μ_k, σ_k²)

Learning the parameters of this model from training data x_1, ..., x_n is not trivial using the usual straightforward maximum likelihood approach. Instead, we learn the parameters using the Expectation-Maximization (EM) algorithm.

Mixture of Gaussians as a marginalization
We can interpret the mixture of Gaussians model through the introduction of a discrete hidden/latent variable h and a joint distribution P(x, h):

    P(x) = Σ_{k=1}^{K} P(x, h = k) = Σ_{k=1}^{K} P(x | h = k) P(h = k) = Σ_{k=1}^{K} π_k N(x; μ_k, σ_k²)

The hidden variable has a straightforward interpretation: it is the index of the constituent normal distribution.
(Figure: the mixture of Gaussians as a joint distribution P(x, h) marginalized over h, from Computer Vision: models, learning and inference by Simon Prince.)

EM for two Gaussians
Assume: we know the pdf of x has the form

    P(x) = π_1 N(x; μ_1, σ_1²) + π_2 N(x; μ_2, σ_2²)

where π_1 + π_2 = 1 and π_k > 0 for components k = 1, 2.
Unknown: the values of the (many!) parameters Θ = (π_1, μ_1, σ_1, μ_2, σ_2).
Have: n observed samples x_1, ..., x_n drawn from P(x).
Want to: estimate Θ from x_1, ..., x_n. How would it be possible to get them all?

EM for two Gaussians
For each sample x_i introduce a hidden variable h_i:

    h_i = 1 if sample x_i was drawn from N(x; μ_1, σ_1²)
    h_i = 2 if sample x_i was drawn from N(x; μ_2, σ_2²)

and come up with initial values for each of the parameters,

    Θ^(0) = (π_1^(0), μ_1^(0), σ_1^(0), μ_2^(0), σ_2^(0))

EM is an iterative algorithm which updates Θ^(t) using the following two steps...

EM for two Gaussians: E-step
In the E-step we look at each sample x_i along the hidden variable h: for each data point we calculate the posterior distribution P(h_i | x_i) over the hidden variable. The posterior probability P(h_i = k | x_i) that h_i takes value k can be understood as the responsibility of the k-th Gaussian for sample x_i.
(Figure: E-step for fitting the mixture of Gaussians model; the responsibility of each component for a sample is indicated by the size of the projected data point, from Computer Vision: models, learning and inference by Simon Prince.)

EM for two Gaussians: E-step (cont.)
E-step: compute the posterior probability (responsibility) that x_i was generated by component k, given the current estimate of the parameters Θ^(t). For i = 1, ..., n and k = 1, 2:

    γ_ik^(t) = P(h_i = k | x_i, Θ^(t)) = π_k^(t) N(x_i; μ_k^(t), σ_k^(t)) / [ π_1^(t) N(x_i; μ_1^(t), σ_1^(t)) + π_2^(t) N(x_i; μ_2^(t), σ_2^(t)) ]

Note: γ_i1^(t) + γ_i2^(t) = 1 and π_1 + π_2 = 1.

EM for two Gaussians: M-step
In the M-step we look along the samples x for each value of h, fitting the Gaussian model for the k-th constituent. Sample x_i contributes according to its responsibility γ_ik: data points more associated with the k-th component have more effect on its parameters.
(Figure: M-step for fitting the mixture of Gaussians model; dashed and solid lines show the fit before and after the update, from Computer Vision: models, learning and inference by Simon Prince.)

EM for two Gaussians: M-step (cont.)
M-step: compute the maximum-likelihood estimates of the parameters of the mixture model given our data's membership distribution, the γ_ik^(t)'s. For k = 1, 2:

    μ_k^(t+1) = Σ_{i=1}^{n} γ_ik^(t) x_i / Σ_{i=1}^{n} γ_ik^(t)

    (σ_k^(t+1))² = Σ_{i=1}^{n} γ_ik^(t) (x_i − μ_k^(t+1))² / Σ_{i=1}^{n} γ_ik^(t)

    π_k^(t+1) = Σ_{i=1}^{n} γ_ik^(t) / n
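The E- and M-steps above can be combined into a complete fitting loop. This is a minimal sketch: the synthetic data (two well-separated Gaussian clusters), the initial parameter values, and the fixed iteration count are all illustrative assumptions.

```python
import math
import random

# EM for a two-component 1D Gaussian mixture, following the E- and
# M-steps on the slides. Data and initialization are illustrative.

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)
data = [random.gauss(-2.0, 0.5) for _ in range(200)] + \
       [random.gauss(3.0, 1.0) for _ in range(200)]

pi = [0.5, 0.5]        # mixing weights pi_k
mu = [-1.0, 1.0]       # initial means
sigma = [1.0, 1.0]     # initial standard deviations

for _ in range(50):
    # E-step: responsibilities gamma_ik = P(h_i = k | x_i, theta)
    gamma = []
    for x in data:
        w = [pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in (0, 1)]
        s = w[0] + w[1]
        gamma.append([w[0] / s, w[1] / s])
    # M-step: responsibility-weighted re-estimation of the parameters
    for k in (0, 1):
        nk = sum(g[k] for g in gamma)
        mu[k] = sum(g[k] * x for g, x in zip(gamma, data)) / nk
        sigma[k] = math.sqrt(sum(g[k] * (x - mu[k]) ** 2
                                 for g, x in zip(gamma, data)) / nk)
        pi[k] = nk / len(data)

print(mu)  # close to the true means -2.0 and 3.0
```

Each iteration provably does not decrease the data likelihood, and with well-separated clusters the estimates settle near the generating parameters after a few dozen iterations.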

EM in practice
(Figure from Computer Vision: models, learning and inference by Simon Prince.)

Summary
Bayesian theory: combines prior knowledge and observed data to find the most probable hypothesis.
Naïve Bayes classifier: all variables considered independent.
EM algorithm: learns a probability distribution (model parameters) in the presence of hidden variables.
If you are interested in learning more, take a look at: C. M. Bishop, Pattern Recognition and Machine Learning, Springer Verlag, 2006.