Probabilistic Learning


Artificial Intelligence: Representation and Problem Solving
April 17, 2007
Probabilistic Learning

Reminder: no class on Thursday (spring carnival).

Recall the basic algorithm for learning decision trees:
1. start with the whole training data
2. select the attribute or value along a dimension that gives the best split, using information gain or another criterion
3. create child nodes based on the split
4. recurse on each child using that child's data, until a stopping criterion is reached: all examples have the same class, the amount of data is too small, or the tree has grown too large
Does this capture probabilistic relationships?

Quantifying the certainty of decisions. Suppose that instead of a yes or no answer we want some estimate of how strongly we believe a loan applicant is a credit risk. This might be useful if we want some flexibility in adjusting our decision criteria.
- E.g., suppose we're willing to take more risk if times are good.
- Or, we may want to examine cases we believe are higher risk more carefully.
(Figure: the "Predicting credit risk" table, with columns "<2 years at current job?", "missed payments?", and "defaulted?", each entry Y or N.)
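As a concrete reference for step 2 of the decision-tree recap above, here is a minimal sketch of split selection by information gain. It is a generic illustration under simplified assumptions, not the lecture's code; the attribute names and rows are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy from splitting on one discrete attribute."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - remainder

# Hypothetical mini-dataset: pick the attribute with the largest gain.
rows = [{"job<2yrs": "Y", "missed": "Y"}, {"job<2yrs": "N", "missed": "Y"},
        {"job<2yrs": "Y", "missed": "N"}, {"job<2yrs": "N", "missed": "N"}]
labels = ["default", "default", "no-default", "no-default"]
best = max(["job<2yrs", "missed"], key=lambda a: information_gain(rows, labels, a))
print(best)   # "missed" perfectly separates the two classes here
```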

The mushroom data. Or suppose we wanted to know how likely a mushroom is to be safe to eat. Do decision trees give us that information?

Mushroom data:
     EDIBLE?    CAP-SHAPE  CAP-SURFACE
 1   edible     flat       fibrous
 2   poisonous  convex     smooth
 3   edible     flat       fibrous
 4   edible     convex     scaly
 5   poisonous  convex     smooth
 6   edible     convex     fibrous
 7   poisonous  flat       scaly
 8   poisonous  flat       scaly
 9   poisonous  convex     fibrous
10   poisonous  convex     fibrous
11   poisonous  flat       smooth
12   edible     convex     smooth
13   poisonous  knobbed    scaly
14   poisonous  flat       smooth
15   poisonous  flat       fibrous

(Figure: Fisher's iris data, a scatter plot of petal width (cm) versus petal length (cm) for Iris setosa, Iris versicolor, and Iris virginica.)

In which example would you be more confident about the class? Decision trees provide a classification, but not uncertainty.

The general classification problem.
- Desired output: y = {y_1, ..., y_K}, a binary classification vector with y_i = 1 if x ∈ C_i (class i) and y_i = 0 otherwise.
- Model: θ = {θ_1, ..., θ_M}; the model (e.g., a decision tree) is defined by M parameters.
- Data: D = {x_1, ..., x_T}, with x_i = {x_1, ..., x_N}_i; the input is a set of T observations, each an N-dimensional vector (binary, discrete, or continuous).
Given data, we want to learn a model that can correctly classify novel observations. How do we approach this probabilistically?

The answer to all questions of uncertainty. Let's apply Bayes rule to infer the most probable class given the observation:

  p(C_k | x) = p(x | C_k) p(C_k) / p(x) = p(x | C_k) p(C_k) / Σ_k p(x | C_k) p(C_k)

This is the answer, but what does it mean? How do we specify the terms?
- p(C_k) is the prior probability of the different classes.
- p(x | C_k) is the data likelihood, i.e., the probability of x given class C_k. How should we define this?
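To make the Bayes-rule computation concrete, here is a minimal sketch that turns class priors and class-conditional likelihoods into posterior class probabilities. The numbers are made up for illustration; they are not taken from the lecture's data.

```python
def posterior(priors, likelihoods):
    """p(C_k | x) from the priors p(C_k) and the likelihoods p(x | C_k)."""
    unnormalized = [p * lik for p, lik in zip(priors, likelihoods)]
    z = sum(unnormalized)                  # p(x) = sum_k p(x | C_k) p(C_k)
    return [u / z for u in unnormalized]

priors = [0.3, 0.7]                        # e.g. p(defaulted), p(did not default)
likelihoods = [0.40, 0.10]                 # hypothetical p(x | C_1), p(x | C_2)
print(posterior(priors, likelihoods))      # [0.631..., 0.368...]
```

Note that the normalizing constant p(x) drops out if we only need to identify the most probable class.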

What classifier would give optimal performance? Consider the iris data again. How would we minimize the number of future misclassifications? We would need to know the true distribution of the classes. Assume the classes follow Gaussian distributions. The number of samples in each class is the same (50), so assume p(C_k) is equal for all classes. Because p(x) is the same for all classes, we have

  p(C_k | x) = p(x | C_k) p(C_k) / p(x) ∝ p(x | C_k) p(C_k)

(Figure: the class-conditional densities p(petal length | C_2) and p(petal length | C_3) plotted against petal length.)

Where do we put the boundary?

Where do we put the boundary? The decision boundary splits the axis into two regions: in R_32, C_3 is misclassified as C_2; in R_23, C_2 is misclassified as C_3. Shifting the boundary trades off the two errors.
(Figure: the densities p(petal length | C_2) and p(petal length | C_3), with the decision boundary and the regions R_32 and R_23 marked.)

Where do we put the boundary? The misclassification error is defined by

  p(error) = ∫_{R_32} p(C_3 | x) dx + ∫_{R_23} p(C_2 | x) dx

which in our case is proportional to the data likelihood. Note that if the boundary is shifted too far, part of the region we are still classifying as C_2 would actually yield p(C_3 | x) > p(C_2 | x).
(Figure: the densities p(petal length | C_2) and p(petal length | C_3), with R_32 (C_3 misclassified as C_2) and R_23 (C_2 misclassified as C_3) shaded for two different boundary placements.)

The optimal decision boundary. The misclassification error is minimal at the point where p(C_3 | x) = p(C_2 | x). At the optimal decision boundary,

  p(x | C_3) p(C_3) / p(x) = p(x | C_2) p(C_2) / p(x)

and therefore p(x | C_3) = p(x | C_2), since the priors are equal. Note: this assumes we have only two classes.
(Figure: the posteriors p(C_2 | petal length) and p(C_3 | petal length) together with the class-conditional densities; the optimal boundary sits at their crossing point.)

Bayesian classification for more complex models. Recall the posterior class probability:

  p(C_k | x) = p(x | C_k) p(C_k) / p(x) = p(x | C_k) p(C_k) / Σ_k p(x | C_k) p(C_k)

How do we define the data likelihood p(x | C_k), i.e., the probability of x given class C_k?
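As a numerical companion to the boundary argument above, here is a minimal sketch that locates the crossing point of two Gaussian class-conditional densities under equal priors. The means and standard deviations are assumed values for illustration, not parameters fitted to the actual iris measurements.

```python
import numpy as np
from scipy.stats import norm

mu2, sigma2 = 4.3, 0.5        # assumed petal-length Gaussian for C_2
mu3, sigma3 = 5.6, 0.6        # assumed petal-length Gaussian for C_3

x = np.linspace(3.0, 7.0, 4001)
p2 = norm.pdf(x, mu2, sigma2)
p3 = norm.pdf(x, mu3, sigma3)

# With equal priors, the optimal boundary is where the densities cross.
between = (x > mu2) & (x < mu3)
boundary = x[between][np.argmin(np.abs(p2[between] - p3[between]))]
print(f"optimal boundary near petal length = {boundary:.2f} cm")
```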

Defining a probabilistic classification model. How would we define the credit risk problem?
- Class: C_1 = defaulted, C_2 = didn't default
- Data: x = { <2 years at current job, missed payments }
- Prior (from the data): p(C_1) = 3/10, p(C_2) = 7/10
- Likelihood: p(x_1, x_2 | C_1) = ?, p(x_1, x_2 | C_2) = ?
- How would we determine these?
(Figure: the "Predicting credit risk" table with columns "<2 years at current job?", "missed payments?", and "defaulted?".)

Defining a probabilistic model by counting. The prior is obtained by counting the number of examples of each class in the data:

  p(C_k = k) = Count(C_k = k) / #records

The likelihood is obtained the same way:

  p(x = v | C_k = k) = Count(x = v, C_k = k) / Count(C_k = k)
  p(x_1 = v_1, ..., x_N = v_N | C_k = k) = Count(x_1 = v_1, ..., x_N = v_N, C_k = k) / Count(C_k = k)

This is the maximum likelihood estimate (MLE) of the probabilities.
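Here is a minimal sketch of the counting (MLE) estimates above, run on a small hypothetical dataset with the same two binary attributes; it is not the exact table from the slides.

```python
from collections import Counter

# (x1 = "<2 years at current job?", x2 = "missed payments?", class)
records = [("N", "N", "C1"), ("N", "Y", "C1"), ("Y", "N", "C1"),
           ("N", "Y", "C2"), ("Y", "N", "C2"), ("Y", "N", "C2"), ("Y", "N", "C2")]

class_counts = Counter(c for _, _, c in records)
joint_counts = Counter(((x1, x2), c) for x1, x2, c in records)

prior = {c: n / len(records) for c, n in class_counts.items()}
likelihood = {((x1, x2), c): joint_counts[((x1, x2), c)] / class_counts[c]
              for x1 in "NY" for x2 in "NY" for c in class_counts}

print(prior)                               # p(C_k) = Count(C_k) / #records
print(likelihood[(("N", "N"), "C1")])      # p(x1=N, x2=N | C1) = 1/3
print(likelihood[(("Y", "Y"), "C1")])      # 0.0: a combination never seen with C1
```

The last line illustrates the zero-count problem that the next slides run into.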

Defining a probabilistic classification model. Determining the likelihoods p(x_1, x_2 | C_1) and p(x_1, x_2 | C_2): a simple approach is to look at the counts in the data, filling in the table one (x_1, x_2) combination at a time:

  x_1 (<2 years at current job?)   x_2 (missed payments?)   C_1 (did default)   C_2 (did not default)
  N                                N                        3/3                 0/0
  N                                Y                        2/3                 1/3
  Y                                N                        1/4                 3/4
  Y                                Y                        0/0                 0/0

Some combinations never occur in the data, leaving 0/0 entries. What do we do about these?

Being (proper) Bayesians: recall our coin-flipping example. In Bernoulli trials, each sample is either 1 (e.g., heads) with probability θ, or 0 (tails) with probability 1 − θ. The binomial distribution specifies the probability of the total number of heads, y, out of n trials:

  p(y | θ, n) = C(n, y) θ^y (1 − θ)^(n−y)

(Figure: the distribution p(y | θ = 0.25, n = 10) plotted against y.)

Applying Bayes rule. Given n trials with y heads, what do we know about θ? We can apply Bayes rule to see how our knowledge changes as we acquire new observations:

  p(θ | y, n) = p(y | θ, n) p(θ | n) / p(y | n)    (posterior = likelihood × prior / normalizing constant)

where the normalizing constant is p(y | n) = ∫ p(y | θ, n) p(θ | n) dθ.
- We know the likelihood; what about the prior? Uniform on [0, 1] is a reasonable assumption, i.e., we don't know anything.
- What is the form of the posterior? In this case, the posterior is just proportional to the likelihood:

  p(θ | y, n) ∝ C(n, y) θ^y (1 − θ)^(n−y)
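A minimal numerical sketch of the argument above: with a uniform prior on θ, the posterior over a grid of θ values is just the (grid-normalized) binomial likelihood. The particular counts y and n are arbitrary.

```python
import numpy as np
from scipy.stats import binom

y, n = 1, 3                                  # e.g. 1 head in 3 trials
theta = np.linspace(0.0, 1.0, 501)
likelihood = binom.pmf(y, n, theta)          # C(n, y) theta^y (1 - theta)^(n - y)

posterior = likelihood / np.sum(likelihood)  # normalize over the grid points
print(theta[np.argmax(posterior)])           # grid MAP estimate, close to y/n = 1/3
```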

Evaluating the posterior. What do we know initially, before observing any trials?
(Figure: the flat posterior p(θ | y = 0, n = 0) plotted against θ.)

Coin tossing. What is our belief about θ after observing one tail?
(Figure: the posterior p(θ | y = 0, n = 1) plotted against θ.)

Coin tossing. Now after two trials we have observed 1 head and 1 tail.
(Figure: the posterior p(θ | y = 1, n = 2) plotted against θ.)

Coin tossing. 3 trials: 1 head and 2 tails.
(Figure: the posterior p(θ | y = 1, n = 3) plotted against θ.)

Coin tossing. 4 trials: 1 head and 3 tails.
(Figure: the posterior p(θ | y = 1, n = 4) plotted against θ.)

Coin tossing. 5 trials: 1 head and 4 tails.
(Figure: the posterior p(θ | y = 1, n = 5) plotted against θ.)

Evaluating the normalizing constant. To get proper probability density functions, we need to evaluate p(y | n):

  p(θ | y, n) = p(y | θ, n) p(θ | n) / p(y | n)

Bayes, in his original paper in 1763, showed that

  p(y | n) = ∫_0^1 p(y | θ, n) p(θ | n) dθ = 1 / (n + 1)

so that

  p(θ | y, n) = (n + 1) C(n, y) θ^y (1 − θ)^(n−y)

The ratio estimate. What about after just one trial: 0 heads and 1 tail? The MAP estimate and the ratio estimate y/n would both say θ = 0. Does this make sense? What would a better estimate be?
(Figure: the posterior p(θ | y = 0, n = 1), with the MAP estimate and the ratio estimate y/n = 0 marked.)
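A minimal check of the closed-form result above: with a uniform prior, p(θ | y, n) = (n + 1) C(n, y) θ^y (1 − θ)^(n−y) integrates to one, and for y = 0, n = 1 both the MAP estimate and the ratio estimate y/n sit at zero.

```python
from math import comb
from scipy.integrate import quad

def posterior_pdf(theta, y, n):
    """Exact posterior density under a uniform prior on theta."""
    return (n + 1) * comb(n, y) * theta**y * (1 - theta)**(n - y)

y, n = 0, 1
total, _ = quad(posterior_pdf, 0.0, 1.0, args=(y, n))
print(total)      # ~1.0: a proper density
print(y / n)      # ratio estimate = 0, which is also where this posterior peaks
```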

The expected value estimate. The expected value of the posterior is

  E(θ | y, n) = ∫_0^1 θ p(θ | y, n) dθ = (y + 1) / (n + 2)

This is called smoothing or regularization. For example, E(θ | y = 0, n = 1) = 1/3. What happens for zero trials?
(Figure: the posterior p(θ | y = 0, n = 1), with the expected value estimate marked.)

On to the mushrooms!

     EDIBLE?    CAP-SHAPE  CAP-SURFACE  CAP-COLOR  ODOR   STALK-SHAPE  POPULATION  HABITAT
 1   edible     flat       fibrous      red        none   tapering     several     woods
 2   poisonous  convex     smooth       red        foul   tapering     several     paths
 3   edible     flat       fibrous      brown      none   tapering     abundant    grasses
 4   edible     convex     scaly        gray       none   tapering     several     woods
 5   poisonous  convex     smooth       red        foul   tapering     several     woods
 6   edible     convex     fibrous      gray       none   tapering     several     woods
 7   poisonous  flat       scaly        brown      fishy  tapering     several     leaves
 8   poisonous  flat       scaly        brown      spicy  tapering     several     leaves
 9   poisonous  convex     fibrous      yellow     foul   enlarging    several     paths
10   poisonous  convex     fibrous      yellow     foul   enlarging    several     woods
11   poisonous  flat       smooth       brown      spicy  tapering     several     woods
12   edible     convex     smooth       yellow     anise  tapering     several     woods
13   poisonous  knobbed    scaly        red        foul   tapering     several     leaves
14   poisonous  flat       smooth       brown      foul   tapering     several     leaves
15   poisonous  flat       fibrous      gray       foul   enlarging    several     woods
16   edible     sunken     fibrous      brown      none   enlarging    solitary    urban
17   poisonous  flat       smooth       brown      foul   tapering     several     woods
18   poisonous  convex     smooth       white      foul   tapering     scattered   urban
19   poisonous  flat       scaly        yellow     foul   enlarging    solitary    paths
20   edible     convex     fibrous      gray       none   tapering     several     woods

The scaling problem. Recall the counting estimates:

  p(x = v | C_k = k) = Count(x = v, C_k = k) / Count(C_k = k)
  p(x_1 = v_1, ..., x_N = v_N | C_k = k) = Count(x_1 = v_1, ..., x_N = v_N, C_k = k) / Count(C_k = k)

The prior is easy enough. But for the likelihood, the table is huge!

Mushroom attributes and their values:
EDIBLE: edible, poisonous
CAP-SHAPE: bell, conical, convex, flat, knobbed, sunken
CAP-SURFACE: fibrous, grooves, scaly, smooth
CAP-COLOR: brown, buff, cinnamon, gray, green, pink, purple, red, white, yellow
BRUISES: bruises, no
ODOR: almond, anise, creosote, fishy, foul, musty, none, pungent, spicy
GILL-ATTACHMENT: attached, free
GILL-SPACING: close, crowded
GILL-SIZE: broad, narrow
GILL-COLOR: black, brown, buff, chocolate, gray, green, orange, pink, purple, red, white, yellow
STALK-SHAPE: enlarging, tapering
STALK-ROOT: bulbous, club, equal, rooted
STALK-SURFACE-ABOVE-RING: fibrous, scaly, silky, smooth
STALK-SURFACE-BELOW-RING: fibrous, scaly, silky, smooth
STALK-COLOR-ABOVE-RING: brown, buff, cinnamon, gray, orange, pink, red, white, yellow
STALK-COLOR-BELOW-RING: brown, buff, cinnamon, gray, orange, pink, red, white, yellow
VEIL-TYPE: partial, universal
VEIL-COLOR: brown, orange, white, yellow
RING-NUMBER: none, one, two
RING-TYPE: evanescent, flaring, large, none, pendant
SPORE-PRINT-COLOR: black, brown, buff, chocolate, green, orange, purple, white, yellow
POPULATION: abundant, clustered, numerous, scattered, several, solitary
HABITAT: grasses, leaves, meadows, paths, urban, waste, woods

22 attributes with an average of 5 values each!

Simplifying with Naïve Bayes. What if we assume the features are conditionally independent given the class?

  p(x | C_k) = p(x_1, ..., x_N | C_k) = ∏_{n=1}^{N} p(x_n | C_k)

We know that's not precisely true, but it might make a good approximation. Now we only need to specify N different likelihoods per class:

  p(x_i = v_i | C_k = k) = Count(x_i = v_i, C_k = k) / Count(C_k = k)

a huge savings in the number of parameters.

Inference with Naïve Bayes. Inference is just like before, but with the independence approximation:

  p(C_k | x) = p(x | C_k) p(C_k) / p(x)
             = p(C_k) ∏_n p(x_n | C_k) / p(x)
             = p(C_k) ∏_n p(x_n | C_k) / Σ_k p(C_k) ∏_n p(x_n | C_k)

Classification performance is often surprisingly good, and it is easy to implement.
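Below is a minimal Naïve Bayes sketch over discrete attributes, using a tiny hypothetical subset of mushroom-like attributes rather than the full dataset. The add-one smoothing in the likelihood estimate is an assumption in the spirit of the (y + 1)/(n + 2) estimate discussed earlier, not necessarily the estimator used in the lecture.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    classes = Counter(labels)
    counts = defaultdict(Counter)             # counts[(attribute, class)][value]
    for row, c in zip(rows, labels):
        for attr, value in row.items():
            counts[(attr, c)][value] += 1
    return classes, counts

def predict(row, classes, counts):
    n = sum(classes.values())
    best, best_score = None, -1.0
    for c, nc in classes.items():
        score = nc / n                         # prior p(C_k)
        for attr, value in row.items():
            num_values = len(counts[(attr, c)]) + 1      # crude add-one smoothing
            score *= (counts[(attr, c)][value] + 1) / (nc + num_values)
        if score > best_score:
            best, best_score = c, score
    return best

rows = [{"odor": "none", "cap": "flat"}, {"odor": "foul", "cap": "convex"},
        {"odor": "none", "cap": "convex"}, {"odor": "foul", "cap": "flat"}]
labels = ["edible", "poisonous", "edible", "poisonous"]
classes, counts = train(rows, labels)
print(predict({"odor": "none", "cap": "flat"}, classes, counts))   # edible
```

With 22 attributes, the product of per-attribute probabilities can underflow, which is exactly the implementation issue addressed next.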

Implementation issues. If you implement Naïve Bayes naïvely, you'll run into trouble. Why?

  p(C_k | x) = p(C_k) ∏_n p(x_n | C_k) / Σ_k p(C_k) ∏_n p(x_n | C_k)

It's never good to compute products of a long list of numbers: they'll quickly go to zero with machine precision, even using doubles (64-bit). Strategy: compute log probabilities:

  log p(C_k | x) = log p(C_k) + Σ_n log p(x_n | C_k) − log Σ_k p(C_k) ∏_n p(x_n | C_k)
                 = log p(C_k) + Σ_n log p(x_n | C_k) − constant

What about that constant? It still contains a product.

Converting back to probabilities. The only requirement on the denominator is that it normalize the numerator to yield a valid probability distribution. We used a log transformation, g_i = log p_i + constant, and the form of the probability is the same for any constant c:

  p_i / Σ_i p_i = e^{g_i} / Σ_i e^{g_i} = e^c e^{g_i} / Σ_i e^c e^{g_i} = e^{g_i + c} / Σ_i e^{g_i + c}

A common choice: choose c so that the log probabilities are shifted to zero at their maximum, i.e., c = −max_i g_i.
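A minimal sketch of the log-space normalization described above: shift the log scores by their maximum before exponentiating, so the largest term becomes exp(0) = 1 and nothing underflows.

```python
import math

def normalize_log_scores(g):
    """Convert unnormalized log probabilities g_i into probabilities p_i."""
    c = -max(g)                                  # shift so the largest is zero
    shifted = [math.exp(gi + c) for gi in g]
    z = sum(shifted)
    return [s / z for s in shifted]

# Scores this negative would underflow to 0.0 if exponentiated directly.
g = [-1000.0, -1001.0, -1003.0]
print(normalize_log_scores(g))   # [0.705..., 0.259..., 0.035...]
```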

Recall another simplifying assumption: noisy-OR. We assume each cause C_j can produce effect E_i with probability f_ij. The noisy-OR model assumes the parent causes of effect E_i contribute independently. The probability that none of them caused effect E_i is simply the product of the probabilities that each one did not cause E_i. The probability that any of them caused E_i is one minus the above, i.e.,

  P(E_i | par(E_i)) = P(E_i | C_1, ..., C_n) = 1 − ∏_j (1 − P(E_i | C_j)) = 1 − ∏_j (1 − f_ij)

(Figure: a small belief network for catching a cold (C), with parents "hit by viral droplet" (D), "touch contaminated object" (O), and "eat contaminated food" (F), and link probabilities f_CD, f_CO, f_CF.)

  P(C | D, O, F) = 1 − (1 − f_CD)(1 − f_CO)(1 − f_CF)

A general one-layer causal network. Such a network can either model causes and effects or, equivalently, stochastic binary features. Each input x_i encodes the probability that the ith binary input feature is present. The set of features represented by the jth cause is defined by weights f_ij, which encode the probability that feature i is an instance of cause j.
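A minimal sketch of the noisy-OR combination for the cold example above; the link probabilities f_CD, f_CO, f_CF are hypothetical numbers.

```python
def noisy_or(link_probs, parent_states):
    """P(effect = 1 | parents): active parents contribute independently."""
    p_none = 1.0
    for f, active in zip(link_probs, parent_states):
        if active:
            p_none *= (1.0 - f)   # probability this parent did NOT cause the effect
    return 1.0 - p_none

f_CD, f_CO, f_CF = 0.5, 0.2, 0.1
print(noisy_or([f_CD, f_CO, f_CF], [1, 1, 1]))   # 1 - 0.5*0.8*0.9 = 0.64
print(noisy_or([f_CD, f_CO, f_CF], [0, 1, 0]))   # only O active: 0.2
```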

The data: a set of stochastic binary patterns.
(Figure, panel a: the data; each column is a distinct eight-dimensional binary pattern. There are five underlying causal feature patterns. What are they? Panel b: the true hidden causes of the data. Panel c: the inferred causes of the data.)

Hierarchical statistical models. In a Bayesian belief network, the joint probability of the binary states is

  P(S | W) = ∏_i P(S_i | pa(S_i), W)

The probability of S_i depends only on its parents pa(S_i):

  P(S_i | pa(S_i), W) = h(Σ_j S_j w_ji)      if S_i = 1
                      = 1 − h(Σ_j S_j w_ji)  if S_i = 0

where the function h specifies how causes are combined: h(u) = 1 − exp(−u), u > 0.
Main points:
- the hierarchical structure allows the model to form higher-order representations
- upper states are priors for lower states
- weights encode higher-order features

Gibbs sampling (back to the example from last lecture). The model represents stochastic binary features. Each input x_i encodes the probability that the ith binary input feature is present. The set of features represented by the jth cause is defined by weights f_ij, which encode the probability that feature i is an instance of cause j. Trick: it's easier to adapt the weights in an unbounded space, so use the transformation f_ij = 1 − exp(−w_ij) and optimize in w-space.


Learning objective. Adapt W to find the most probable explanation of the input patterns. The probability of the data is

  P(D_{1:N} | W) = ∏_n P(D_n | W)

where P(D_n | W) is computed by marginalizing over the hidden states:

  P(D_n | W) = Σ_k P(D_n | S^k, W) P(S^k | W)

Computing this sum exactly is intractable, but we can still make accurate approximations.

Approximating P(D_n | W). Good representations should have just one or a few possible explanations for most patterns.
- In this case, most P(D_n | S^k, W) will be zero.
- The following approximation will be very accurate once the weights have adapted:

  P(D_n | W) ≈ P(D_n | Ŝ, W) P(Ŝ | W)

- The approximation becomes increasingly accurate as learning proceeds.

Using EM to adapt the weights (the network parameters). The complexity of the model is controlled by placing a prior on the weights:
- assume the prior to be a product of gamma distributions
- the objective function becomes L = P(D_{1:N} | W) P(W | α, β)
A simple and efficient EM formula for adapting the weights can be derived using the transformations f_ij = 1 − exp(−w_ij) and g_i = 1 − exp(−u_i):

  f_ij = [α − 1 + 2 f_ij + Σ_n S_i^(n) S_j^(n) f_ij / g_j^(n)] / [α + β + Σ_n S_i^(n)]

- f_ij can be interpreted as the frequency of state S_j given cause S_i
- f_ij is a weighted average of the number of times S_j was active given S_i
- the ratio f_ij / g_j inversely weights each term by the number of causes for S_j

Inferring the best representation of the observed variables. Given the input D, there is no simple way to determine which states are the input's most likely causes.
- Computing the most probable network state is an inference process: we want to find the explanation of the data with the highest probability.
- This can be done efficiently with Gibbs sampling.
Gibbs sampling is another example of an MCMC method. Key idea: the samples are guaranteed to converge to the true posterior probability distribution.

Gibbs sampling. Gibbs sampling is a way to select an ensemble of states that is representative of the posterior distribution P(S | D, W). Each state of the network is updated iteratively according to the probability of S_i given the remaining states. This conditional probability can be computed using (Neal, 1992):

  P(S_i = a | S_j : j ≠ i, W) ∝ P(S_i = a | pa(S_i), W) ∏_{j ∈ ch(S_i)} P(S_j | pa(S_j), S_i = a, W)

- The limiting ensemble of states will be typical samples from P(S | D, W).
- This also works if any subset of states is fixed and the rest are sampled.

The network Gibbs sampling equations (derivation omitted). The probability of S_i changing state given the remaining states is

  P(S_i → 1 − S_i | S_j : j ≠ i, W) = 1 / (1 + exp(−x_i))

where x_i indicates how much changing the state S_i changes the probability of the whole network state:

  x_i = log h(u_i; 1 − S_i) − log h(u_i; S_i) + Σ_{j ∈ ch(S_i)} [log h(u_j + δ_ij; S_j) − log h(u_j; S_j)]

with h(u; 1) = h(u) and h(u; 0) = 1 − h(u).
- u_i is the causal input to S_i: u_i = Σ_k S_k w_ki
- δ_ij specifies the change in u_j for a change in S_i: δ_ij = +w_ij if S_i = 0, or −w_ij if S_i = 1
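To make the sampling procedure concrete, here is a minimal Gibbs-sampling sketch for a tiny two-layer network: two hidden binary causes with noisy-OR links to three observed effects. It is a generic illustration of resampling each S_i from P(S_i | the rest), not the lecture's exact network; the priors, weights, and observations are all hypothetical, and for this tiny model the conditional is computed from the full unnormalized joint, which is proportional to the local parent/children formula above.

```python
import random

prior = [0.3, 0.3]                  # P(S_j = 1) for the two hidden causes
f = [[0.9, 0.8, 0.0],               # f[j][i]: P(effect i on | cause j alone)
     [0.0, 0.7, 0.9]]
observed = [1, 1, 0]                # clamped states of the three effects

def p_effect(i, s):
    """Noisy-OR probability that effect i is on, given hidden states s."""
    p_off = 1.0
    for j, sj in enumerate(s):
        if sj:
            p_off *= (1.0 - f[j][i])
    return 1.0 - p_off

def unnormalized(s):
    """P(S = s) * P(E = observed | S = s) for this tiny network."""
    p = 1.0
    for j, sj in enumerate(s):
        p *= prior[j] if sj else (1.0 - prior[j])
    for i, e in enumerate(observed):
        pe = p_effect(i, s)
        p *= pe if e else (1.0 - pe)
    return p

def gibbs(sweeps=5000, seed=0):
    random.seed(seed)
    s = [1, 0]                       # start from a hidden state with nonzero probability
    samples = []
    for _ in range(sweeps):
        for j in range(len(s)):      # resample each hidden S_j given the rest
            s_on, s_off = s[:], s[:]
            s_on[j], s_off[j] = 1, 0
            w_on, w_off = unnormalized(s_on), unnormalized(s_off)
            s[j] = 1 if random.random() < w_on / (w_on + w_off) else 0
        samples.append(tuple(s))
    return samples

samples = gibbs()
print(sum(s == (1, 0) for s in samples) / len(samples))   # ~0.95 posterior mass on S = (1, 0)
```

Clamping the observed effects and resampling only the hidden causes corresponds to the note above that Gibbs sampling also works when a subset of states is fixed.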

Interpretation of the Gibbs sampling equation. The Gibbs equation can be interpreted as feedback + feedforward:
- feedback: how consistent is S_i with the current causes?
- feedforward: how likely is S_i a cause of its children?
Feedback allows the lower-level units to use information only computable at higher levels; it determines (disambiguates) the state when the feedforward input is ambiguous.

The higher-order lines problem.
(Figure: the true generative model and patterns sampled from the model.)
Can we infer the structure of the network given only the patterns?

Weights in a belief network after learning. The first layer of weights learns that patterns are combinations of lines; the second layer learns combinations of the first-layer features.

The shifter problem.
(Figure: shift patterns A and B, and the weights of a network after learning.)

Gibbs sampling: feedback disambiguates lower-level states.
(Figure: an example of the network state during Gibbs updating.)
Once the structure is learned, the Gibbs updating converges in two sweeps.

Next time (which is next Tuesday): classifying with other models. Don't forget: Spring Carnival this week, so no class on Thursday.
