Learning with Probabilities
CS194-10 Fall 2011, Lecture 15
Outline
- Bayesian learning: eliminates arbitrary loss functions and regularizers; facilitates incorporation of prior knowledge; quantifies hypothesis and prediction uncertainty; gives optimal predictions
- Maximum a posteriori and maximum likelihood learning
- Maximum likelihood parameter learning
Full Bayesian learning

View learning as Bayesian updating of a probability distribution over the hypothesis space:
- H is the hypothesis variable, with values h_1, h_2, ... and prior P(H)
- The ith observation x_i gives the outcome of the random variable X_i; the training data are X = x_1, ..., x_N

Given the data so far, each hypothesis has a posterior probability:
  P(h_k | X) = α P(X | h_k) P(h_k)
where P(X | h_k) is called the likelihood.

Predictions use a likelihood-weighted average over the hypotheses:
  P(X_{N+1} | X) = Σ_k P(X_{N+1} | X, h_k) P(h_k | X) = Σ_k P(X_{N+1} | h_k) P(h_k | X)

No need to pick one best-guess hypothesis!
Example

Suppose there are five kinds of bags of candies:
- 10% are h_1: 100% cherry candies
- 20% are h_2: 75% cherry candies + 25% lime candies
- 40% are h_3: 50% cherry candies + 50% lime candies
- 20% are h_4: 25% cherry candies + 75% lime candies
- 10% are h_5: 100% lime candies

Then we observe candies drawn from some bag. What kind of bag is it? What flavour will the next candy be?
Posterior probability of hypotheses

  P(h_k | X) = α P(X | h_k) P(h_k)

  P(h_1 | 5 limes) = α P(5 limes | h_1) P(h_1) = α × 0.0^5 × 0.1 = 0
  P(h_2 | 5 limes) = α P(5 limes | h_2) P(h_2) = α × 0.25^5 × 0.2 = 0.000195α
  P(h_3 | 5 limes) = α P(5 limes | h_3) P(h_3) = α × 0.5^5 × 0.4 = 0.0125α
  P(h_4 | 5 limes) = α P(5 limes | h_4) P(h_4) = α × 0.75^5 × 0.2 = 0.0475α
  P(h_5 | 5 limes) = α P(5 limes | h_5) P(h_5) = α × 1.0^5 × 0.1 = 0.1α

  α = 1/(0 + 0.000195 + 0.0125 + 0.0475 + 0.1) = 6.2424

  P(h_1 | 5 limes) = 0
  P(h_2 | 5 limes) = 0.00122
  P(h_3 | 5 limes) = 0.07803
  P(h_4 | 5 limes) = 0.29650
  P(h_5 | 5 limes) = 0.62424
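As a sanity check, here is a minimal Python sketch of this update (the priors and per-hypothesis lime probabilities are taken from the example above; variable names are illustrative):

```python
# Candy-bag example: posterior over hypotheses after observing 5 limes.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]           # P(h_k)
lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]      # P(lime | h_k)

n_limes = 5
unnormalized = [p * (q ** n_limes) for p, q in zip(priors, lime_prob)]
alpha = 1.0 / sum(unnormalized)              # normalization constant
posteriors = [alpha * u for u in unnormalized]

print(posteriors)   # approximately [0, 0.0012, 0.078, 0.296, 0.624]
```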
[Figure: posterior probability of each hypothesis, P(h_1 | d) through P(h_5 | d), as a function of the number of samples in d (0 to 10)]
Prediction probability

  P(X_{N+1} | X) = Σ_k P(X_{N+1} | X, h_k) P(h_k | X) = Σ_k P(X_{N+1} | h_k) P(h_k | X)

  P(lime on 6 | 5 limes)
    = P(lime on 6 | h_1) P(h_1 | 5 limes) + P(lime on 6 | h_2) P(h_2 | 5 limes)
      + P(lime on 6 | h_3) P(h_3 | 5 limes) + P(lime on 6 | h_4) P(h_4 | 5 limes)
      + P(lime on 6 | h_5) P(h_5 | 5 limes)
    = 0 × 0 + 0.25 × 0.00122 + 0.5 × 0.07803 + 0.75 × 0.29650 + 1.0 × 0.62424
    ≈ 0.886
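The same prediction in a short Python sketch, plugging in the rounded posteriors from the previous slide:

```python
# Predictive probability of lime on draw 6, weighting each hypothesis
# by its posterior after observing 5 limes (values from the slide above).
lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]                 # P(lime | h_k)
posteriors = [0.0, 0.00122, 0.07803, 0.29650, 0.62424]  # P(h_k | 5 limes)

p_lime_next = sum(p * q for p, q in zip(lime_prob, posteriors))
print(p_lime_next)   # approximately 0.886
```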
Prediction probability 1 P(next candy is lime d) 0.9 0.8 0.7 0.6 0.5 0.4 0 2 4 6 8 10 Number of samples in d CS194-10 Fall 2011 Lecture 15 8
Learning from positive examples only

Example from Tenenbaum, via Murphy, Ch. 3: given examples of some unknown class (a predefined subset of {1, ..., 100}), output a hypothesis as to what the class is. E.g., {16, 8, 2, 64}.

Viewed as a Boolean classification problem, the simplest consistent solution is "everything". [This is the basis for Chomsky's Poverty of the Stimulus argument, purporting to prove that humans must have innate grammatical knowledge.]
Bayesian counterargument

Assuming numbers are sampled uniformly from the class:
  P({16, 8, 2, 64} | powers of 2) = (1/7)^4 ≈ 4.2 × 10^-4
  P({16, 8, 2, 64} | everything) = (1/100)^4 = 10^-8

This difference far outweighs any reasonable simplicity-based prior.
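A quick Python sketch of this comparison (assuming, as above, that the four numbers are sampled uniformly and independently from the hypothesized class; the count of 7 includes 1 = 2^0):

```python
# Likelihood of the observed set under two hypotheses, assuming examples
# are drawn uniformly (with replacement) from the hypothesized class.
data = [16, 8, 2, 64]

p_powers_of_2 = (1 / 7) ** len(data)     # 7 powers of two in {1,...,100}
p_everything  = (1 / 100) ** len(data)   # the whole set {1,...,100}

print(p_powers_of_2)                 # approximately 4.2e-4
print(p_everything)                  # 1e-8
print(p_powers_of_2 / p_everything)  # likelihood ratio, roughly 4.2e4
```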
Bayes vs. Humans
MAP approximation

Summing over the hypothesis space is often intractable (e.g., there are 18,446,744,073,709,551,616 Boolean functions of 6 attributes).

Maximum a posteriori (MAP) learning: choose h_MAP maximizing P(h_k | X)
  I.e., maximize P(X | h_k) P(h_k), or minimize −log P(X | h_k) − log P(h_k)
  ... or, in information theory terms, minimize bits to encode data given hypothesis + bits to encode hypothesis
  This is the basic idea of minimum description length (MDL) learning

In science experiments, inputs are fixed and a deterministic h predicts the outputs; P(X | h_k) is 1 if h_k is consistent with the data and 0 otherwise, so MAP = simplest consistent hypothesis.
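To make the MDL reading concrete, here is a small illustrative sketch (my own, not from the slides) that measures both terms in bits for two of the candy hypotheses after observing 5 limes:

```python
import math

# Description length in bits: bits to encode the data given the hypothesis
# plus bits to encode the hypothesis (here, -log2 of its prior).
def description_length(likelihood, prior):
    return -math.log2(likelihood) - math.log2(prior)

# Candy example, 5 limes observed.
print(description_length(0.5 ** 5, 0.4))   # h3: about 5.0 + 1.32 = 6.32 bits
print(description_length(1.0 ** 5, 0.1))   # h5: about 0.0 + 3.32 = 3.32 bits
```

h_5 has the smaller total description length, matching the MAP hypothesis found in the earlier posterior calculation.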
ML approximation

For large data sets, the prior becomes irrelevant.

Maximum likelihood (ML) learning: choose h_ML maximizing P(X | h_k)
  I.e., simply get the best fit to the data; identical to MAP for a uniform prior (which is reasonable if all hypotheses are of the same complexity)

ML is the standard (non-Bayesian) statistical learning method.
A simple generative model: Bernoulli

A generative model is a probability model from which the probability of any observable data set can be derived. [Usually contrasted with a discriminative or conditional model, which gives only the probability for the output given the observable inputs.]

E.g., the Bernoulli[θ] model: P(X_i = 1) = θ and P(X_i = 0) = 1 − θ, or equivalently P(X_i = x_i) = θ^{x_i} (1 − θ)^{1 − x_i}.

Suppose we get a bag of candy from a new manufacturer with fraction θ of cherry candies. Any θ is possible: a continuum of hypotheses h_θ.
ML estimation of the Bernoulli model

Suppose we unwrap N candies: c cherries and ℓ = N − c limes. These are i.i.d. (independent, identically distributed) observations, so

  P(X | h_θ) = Π_{i=1}^N P(x_i | h_θ) = θ^{Σ_i x_i} (1 − θ)^{N − Σ_i x_i} = θ^c (1 − θ)^ℓ

Maximize this w.r.t. θ, which is easier for the log-likelihood:

  L(X | h_θ) = log P(X | h_θ) = Σ_{i=1}^N log P(x_i | h_θ) = c log θ + ℓ log(1 − θ)

  dL(X | h_θ)/dθ = c/θ − ℓ/(1 − θ) = 0   ⟹   θ = c/(c + ℓ) = c/N

Seems sensible, but causes problems with 0 counts!
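A minimal sketch of the resulting estimator; the add-one (Laplace) smoothing variant shown alongside is a standard fix for zero counts, not something derived on this slide:

```python
# ML estimate of the Bernoulli parameter theta from 0/1 observations.
def ml_theta(xs):
    return sum(xs) / len(xs)          # theta = c / N

# Add-one (Laplace) smoothing avoids theta = 0 or 1 when a count is zero.
def smoothed_theta(xs):
    return (sum(xs) + 1) / (len(xs) + 2)

candies = [1, 1, 0, 1, 0, 0, 0, 0]    # 1 = cherry, 0 = lime (made-up data)
print(ml_theta(candies))              # 0.375
print(smoothed_theta(candies))        # 0.4
```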
Naive Bayes models

A generative model for discrete (often Boolean) classification problems:
- Each example has a discrete class variable Y_i
- Each example has discrete or continuous attributes X_ij, j = 1, ..., D
- Attributes are conditionally independent given the class value

[Figure: naive Bayes network with class node Y_i and children X_{i,1}, ..., X_{i,D}; CPTs P(Y_i = 1) = θ and P(X_ij = 1 | Y_i = y) = θ_{y,j} for y ∈ {0, 1}]

  P(y_i, x_{i,1}, ..., x_{i,D}) = P(y_i) Π_{j=1}^D P(x_ij | y_i)
                                = θ^{y_i} (1 − θ)^{1 − y_i} Π_{j=1}^D θ_{y_i,j}^{x_ij} (1 − θ_{y_i,j})^{1 − x_ij}
ML estimation of Naive Bayes models

The likelihood is

  P(X | h_θ) = Π_{i=1}^N θ^{y_i} (1 − θ)^{1 − y_i} Π_{j=1}^D θ_{y_i,j}^{x_ij} (1 − θ_{y_i,j})^{1 − x_ij}

The log likelihood is

  L = log P(X | h_θ) = Σ_{i=1}^N [ y_i log θ + (1 − y_i) log(1 − θ) + Σ_{j=1}^D ( x_ij log θ_{y_i,j} + (1 − x_ij) log(1 − θ_{y_i,j}) ) ]

This has the parameters in separate terms, so the derivatives are decoupled:

  ∂L/∂θ = Σ_{i=1}^N [ y_i/θ − (1 − y_i)/(1 − θ) ] = N_1/θ − (N − N_1)/(1 − θ)

  ∂L/∂θ_{yj} = Σ_{i: y_i = y} [ x_ij/θ_{yj} − (1 − x_ij)/(1 − θ_{yj}) ] = N_{yj}/θ_{yj} − (N_y − N_{yj})/(1 − θ_{yj})

where N_y is the number of examples with class label y (so N_1 is the number with label 1) and N_{yj} is the number of examples with class label y and value 1 for X_ij.
ML estimation contd.

Setting the derivatives to zero:
  θ = N_1/N, as before
  θ_{yj} = N_{yj}/N_y

I.e., count the fraction of each class with the jth attribute set to 1. Training the model takes O(ND) time.

Example: 1000 cherry and lime candies, wrapped in red or green wrappers by the Surprise Candy Company:
- 400 cherry, of which 300 have red wrappers and 100 green wrappers
- 600 lime, of which 120 have red wrappers and 480 green wrappers

  θ = P(Flavor = cherry) = 400/1000 = 0.40
  θ_{11} = P(Wrapper = red | Flavor = cherry) = 300/400 = 0.75
  θ_{01} = P(Wrapper = red | Flavor = lime) = 120/600 = 0.20
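The entire training procedure is just counting, as in this Python sketch of the Surprise Candy example (data layout and names are mine; attributes are 0-indexed, so theta_yj[(1, 0)] is the slide's θ_{11}):

```python
# Train a naive Bayes model by counting: examples are (flavor, wrapper)
# pairs with flavor 1 = cherry, 0 = lime and wrapper 1 = red, 0 = green.
def train_naive_bayes(examples, num_attrs=1):
    n = len(examples)
    n1 = sum(y for y, _ in examples)                 # N_1
    theta = n1 / n                                   # P(Y = 1)
    theta_yj = {}
    for y in (0, 1):
        n_y = sum(1 for yy, _ in examples if yy == y)         # N_y
        for j in range(num_attrs):
            n_yj = sum(x[j] for yy, x in examples if yy == y)  # N_yj
            theta_yj[(y, j)] = n_yj / n_y            # P(X_j = 1 | Y = y)
    return theta, theta_yj

# 400 cherry (300 red, 100 green) and 600 lime (120 red, 480 green).
examples = ([(1, (1,))] * 300 + [(1, (0,))] * 100 +
            [(0, (1,))] * 120 + [(0, (0,))] * 480)
theta, theta_yj = train_naive_bayes(examples)
print(theta)             # 0.4
print(theta_yj[(1, 0)])  # 0.75  (red | cherry)
print(theta_yj[(0, 0)])  # 0.2   (red | lime)
```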
Classifying a new example

  P(Y = 1 | x_1, ..., x_D) = α P(x_1, ..., x_D | Y = 1) P(Y = 1)
                           = α θ Π_{j=1}^D θ_{1j}^{x_j} (1 − θ_{1j})^{1 − x_j}

  log P(Y = 1 | x_1, ..., x_D) = log α + log θ + Σ_{j=1}^D [ x_j log θ_{1j} + (1 − x_j) log(1 − θ_{1j}) ]
                               = ( log α + log θ + Σ_{j=1}^D log(1 − θ_{1j}) ) + Σ_{j=1}^D x_j log(θ_{1j}/(1 − θ_{1j}))

The set of points where P(Y = 1 | x_1, ..., x_D) = P(Y = 0 | x_1, ..., x_D) = 0.5 is a linear separator! (But its location is sensitive to the class prior.)
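A sketch of classification with the trained candy parameters, comparing log P(Y = 1, x) with log P(Y = 0, x); this is equivalent to thresholding the linear log-odds expression above at the point set by the class prior:

```python
import math

# Parameters from the candy example: theta = P(Y = 1),
# theta_yj[y][j] = P(X_j = 1 | Y = y).
theta = 0.4
theta_yj = {1: [0.75], 0: [0.20]}

def log_joint(y, x):
    """log P(Y = y, x_1, ..., x_D) under the naive Bayes model."""
    lp = math.log(theta if y == 1 else 1 - theta)
    for j, xj in enumerate(x):
        p = theta_yj[y][j]
        lp += math.log(p) if xj == 1 else math.log(1 - p)
    return lp

def classify(x):
    return 1 if log_joint(1, x) > log_joint(0, x) else 0

print(classify((1,)))   # red wrapper   -> 1 (cherry)
print(classify((0,)))   # green wrapper -> 0 (lime)
```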
Summary

- Full Bayesian learning gives the best possible predictions but is intractable
- MAP learning balances complexity with accuracy on the training data
- Maximum likelihood assumes a uniform prior; OK for large data sets

The ML recipe:
1. Choose a parameterized family of models to describe the data (requires substantial insight and sometimes new models)
2. Write down the likelihood of the data as a function of the parameters (may require summing over hidden variables, i.e., inference)
3. Write down the derivative of the log likelihood w.r.t. each parameter
4. Find the parameter values such that the derivatives are zero (may be hard/impossible; modern optimization techniques help)

Naive Bayes is a simple generative model with a very fast training method that finds a linear separator in input feature space and provides probabilistic predictions.