Aarti Singh. Lecture 2, January 13, Reading: Bishop: Chap 1,2. Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell
1 Machine Learning 10-701/15-781, Spring 2010 Probability Aarti Singh Lecture 2, January 13, 2010 Reading: Bishop: Chap 1, 2. Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell
2 Announcements Homework 1 is out! Due: Wednesday, Jan 20, 2010 (beginning of class). 1st Recitation: Jan 14, 2010, 5:00-6:30 pm, NSH 1305. Probability
3 Probability in Machine Learning Machine Learning tasks involve reasoning under uncertainty. Sources of uncertainty/randomness: Noise - variability in sensor measurements, partial observability, incorrect labels. Finite sample size - training and test data are randomly drawn instances. Example: hand-written digit recognition. Probability quantifies uncertainty!
4 Basic Probability Concepts A conceptual or physical, repeatable experiment with a random outcome at any trial: roll of a die; nucleotide present at a DNA site; time-space position of an aircraft on a radar screen. Sample space S - set of all possible outcomes (can be finite or infinite): S = {1, 2, 3, 4, 5, 6}; S = {A, T, C, G}; S = [0, Rmax] x [0, 360) x [0, +∞). Event A - any subset of S: see 2, 4 or 6 in a roll; observe a "G" at a site; UA007 in angular location 45-60 degrees.
5 Definition Classical: The probability of an event A is its relative frequency - the limiting ratio of the number of occurrences of event A to the total number of trials: P(A) = lim_{N→∞} N_A / N. E.g. P({1}) = 1/6, P({2,4,6}) = 1/2. Sample space S: its area is 1, so P(S) = 1; P(A) = area of the oval A; the complement of A is ¬A.
6 Definition Axiomatic (Kolmogorov): The probability of an event A is a number assigned to this event such that 0 ≤ P(A) ≤ 1 - all probabilities are between 0 and 1. (The area of A can't be smaller than 0 or larger than 1.)
7 Definition Axiomatic (Kolmogorov): The probability of an event A is a number assigned to this event such that 0 ≤ P(A) ≤ 1 - all probabilities are between 0 and 1. P(∅) = 0 - the probability of no outcome occurring is 0. (The area of A can't be smaller than 0.)
8 Definition Axiomatic (Kolmogorov): The probability of an event A is a number assigned to this event such that 0 ≤ P(A) ≤ 1 - all probabilities are between 0 and 1. P(∅) = 0 - the probability of no outcome occurring is 0. P(S) = 1 - the probability of some outcome occurring is 1. (The area of A can't be smaller than 0 or larger than 1.)
9 Definition Axiomatic (Kolmogorov): The probability of an event A is a number assigned to this event such that 0 ≤ P(A) ≤ 1 - all probabilities are between 0 and 1. P(∅) = 0 - the empty event has probability 0. P(S) = 1 - some outcome is bound to occur. P(A ∪ B) = P(A) + P(B) − P(A ∩ B) - probability of the union of two events. (Area of A ∪ B = Area of A + Area of B − Area of A ∩ B.)
10 Definition Axiomatic (Kolmogorov): The probability of an event A is a number assigned to this event such that 0 ≤ P(A) ≤ 1; P(∅) = 0; P(S) = 1; P(A ∪ B) = P(A) + P(B) − P(A ∩ B). A probability space is a sample space equipped with an assignment P(A) to every event A ⊆ S such that P satisfies the Kolmogorov axioms.
11 Theorems from the Axioms Axioms: 0 ≤ P(A) ≤ 1; P(∅) = 0; P(S) = 1; P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Theorem: P(¬A) = 1 − P(A). Proof: P(A ∪ ¬A) = P(S) = 1 and P(A ∩ ¬A) = P(∅) = 0, so 1 = P(A) + P(¬A) − 0, hence P(¬A) = 1 − P(A).
12 Theorems from the Axioms Axioms: 0 ≤ P(A) ≤ 1; P(∅) = 0; P(S) = 1; P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Theorem: P(A) = P(A ∩ B) + P(A ∩ ¬B). Proof: P(A) = P(A ∩ S) = P(A ∩ (B ∪ ¬B)) = P((A ∩ B) ∪ (A ∩ ¬B)) = P(A ∩ B) + P(A ∩ ¬B) − P((A ∩ B) ∩ (A ∩ ¬B)) = P(A ∩ B) + P(A ∩ ¬B) − P(∅) = P(A ∩ B) + P(A ∩ ¬B).
13 Why use probability? There have been many other approaches to handle uncertainty: Fuzzy logic; Qualitative reasoning (Qualitative physics). "Probability theory is nothing but common sense reduced to calculation" - Pierre Laplace, 1812. Any scheme for combining uncertain information really should obey these axioms. De Finetti, 1931 - if you gamble based on uncertain beliefs that satisfy these axioms, then you can't be exploited by an opponent.
14 Random Variable A random variable is a function that associates a unique numerical value X(ω) with every outcome ω ∈ S of an experiment. (The value of the r.v. will vary from trial to trial as the experiment is repeated.) E.g. P(X < 1) = P({ω : X(ω) < 1}). Discrete r.v.: the outcome of a coin toss, H → 1, T → 0 (binary); the outcome of a die roll, 1-6. Continuous r.v.: the location of an aircraft. Univariate r.v.: the outcome of a die roll, 1-6. Multivariate r.v.: the time-space position of an aircraft on a radar screen, X = (R, Θ, t).
15 Discrete Probability Distribution In the discrete case, a probability distribution P on S (and hence on the domain of X) is an assignment of a non-negative real number P(s) to each s ∈ S (or each valid value x) such that 0 ≤ P(X = x) ≤ 1 and Σ_x P(X = x) = 1. (X = random variable, x = value it takes.) E.g. Bernoulli distribution with parameter θ: P(x) = 1 − θ for x = 0 and θ for x = 1; compactly, P(x) = θ^x (1 − θ)^(1−x).
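The Bernoulli pmf above can be sketched in a few lines; θ = 0.3 below is an arbitrary illustrative parameter, not a value from the slides.

```python
# Bernoulli pmf: P(x) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}.
def bernoulli_pmf(x, theta):
    return theta**x * (1 - theta)**(1 - x)

theta = 0.3                       # illustrative parameter
p0 = bernoulli_pmf(0, theta)      # P(X = 0) = 1 - theta
p1 = bernoulli_pmf(1, theta)      # P(X = 1) = theta
total = p0 + p1                   # a valid pmf must sum to 1
```

Note that the compact form θ^x (1 − θ)^(1−x) reduces to the two cases of the piecewise definition, which is what makes likelihoods easy to write later.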
16 Discrete Probability Distribution In the discrete case, a probability distribution P on S (and hence on the domain of X) is an assignment of a non-negative real number P(s) to each s ∈ S (or each valid value x) such that 0 ≤ P(X = x) ≤ 1 and Σ_x P(X = x) = 1. E.g. Multinomial distribution with parameters θ_1, ..., θ_K: P(x_1, ..., x_K) = [n! / (x_1! x_2! ... x_K!)] θ_1^{x_1} θ_2^{x_2} ... θ_K^{x_K}, where Σ_j x_j = n.
17 Continuous Prob. Distribution A continuous random variable X can assume any value in an interval on the real line or in a region in a high-dimensional space. X usually corresponds to a real-valued measurement of some property, e.g., length, position, ... It is not possible to talk about the probability of the random variable assuming a particular value: P(X = x) = 0. Instead, we talk about the probability of the random variable assuming a value within a given interval or half-interval: P(X ∈ [x_1, x_2]); P(X < x) = P(X ∈ (−∞, x)).
18 Continuous Prob. Distribution The probability of the random variable assuming a value within some given interval from x_1 to x_2 is defined to be the area under the graph of the probability density function between x_1 and x_2. Probability mass: P(X ∈ [x_1, x_2]) = ∫_{x_1}^{x_2} p(x) dx; note that ∫_{−∞}^{+∞} p(x) dx = 1. Cumulative distribution function (CDF): F(x) = P(X ≤ x) = ∫_{−∞}^{x} p(x') dx'. Probability density function (PDF): p(x) = dF(x)/dx, with ∫_{−∞}^{+∞} p(x) dx = 1 and p(x) ≥ 0 for all x. [Figure: car flow on Liberty Bridge (cooked up!)]
19 What is the intuitive meaning of p(x)? If p(x_1) = a and p(x_2) = b, then when a value X is sampled from the distribution with density p(x), you are a/b times as likely to find X very close to x_1 as very close to x_2. That is: lim_{h→0} [P(x_1 − h < X < x_1 + h) / P(x_2 − h < X < x_2 + h)] = lim_{h→0} [∫_{x_1−h}^{x_1+h} p(x) dx / ∫_{x_2−h}^{x_2+h} p(x) dx] = (p(x_1) · 2h) / (p(x_2) · 2h) = a/b.
20 Continuous Distributions Uniform probability density function: p(x) = 1/(b − a) for a ≤ x ≤ b, 0 elsewhere. Normal (Gaussian) probability density function: p(x) = (1/√(2π)σ) e^{−(x−µ)²/2σ²}. The distribution is symmetric and is often illustrated as a bell-shaped curve. Two parameters, µ (mean) and σ (standard deviation), determine the location and shape of the distribution. Exponential probability distribution: density p(x) = (1/µ) e^{−x/µ} for x ≥ 0; CDF: P(X ≤ x) = 1 − e^{−x/µ}. [Figure: time between successive arrivals (mins.)]
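The CDF/PDF relationship for the exponential distribution can be checked numerically: integrating the density from 0 to x should reproduce 1 − e^{−x/µ}. The parameter µ = 2.0 and the integration scheme below are illustrative choices, not from the slides.

```python
import math

# Exponential density p(x) = (1/mu) * exp(-x/mu) for x >= 0,
# with closed-form CDF P(X <= x) = 1 - exp(-x/mu).
mu = 2.0  # illustrative mean

def exp_pdf(x):
    return math.exp(-x / mu) / mu if x >= 0 else 0.0

def exp_cdf(x):
    return 1 - math.exp(-x / mu) if x >= 0 else 0.0

def integrate(f, a, b, n=100_000):
    """Midpoint-rule numerical integration of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

approx = integrate(exp_pdf, 0.0, 3.0)  # area under the pdf on [0, 3]
exact = exp_cdf(3.0)                   # closed-form CDF value
```

The area under the density over an interval is exactly the probability mass on that interval, which is the defining property from the previous slide.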
21 Statistical Characterizations Expectation - the centre of mass, mean value, first moment: E(X) = Σ_x x P(x) (discrete) or ∫ x p(x) dx (continuous). Variance - the spread: Var(X) = Σ_x [x − E(X)]² P(x) (discrete) or ∫ [x − E(X)]² p(x) dx (continuous).
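The discrete formulas above can be applied directly to a fair six-sided die (an example consistent with the earlier slides, though the computation itself is added here):

```python
# E(X) = sum_x x P(x) and Var(X) = sum_x (x - E(X))^2 P(x)
# for a fair six-sided die: P(x) = 1/6 for x = 1..6.
pmf = {x: 1 / 6 for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())            # first moment
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # spread
```

For the fair die this gives E(X) = 3.5 and Var(X) = 35/12 ≈ 2.92.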
22 Gaussian (Normal) density in 1D If X ~ N(µ, σ²), the probability density function (pdf) of X is defined as p(x) = (1/√(2π)σ) e^{−(x−µ)²/2σ²}, with E(X) = µ and Var(X) = σ². Here is how we plot the pdf in matlab: xs = -3:0.01:3; plot(xs, normpdf(xs, mu, sigma)). [Figures: zero mean, large variance; zero mean, small variance.] Note that a density evaluated at a point can be bigger than 1!
23 Gaussian CDF If Z ~ N(0, 1), the cumulative distribution function is defined as Φ(x) = ∫_{−∞}^{x} p(z) dz = ∫_{−∞}^{x} (1/√(2π)) e^{−z²/2} dz. This has no closed-form expression, but is built in to most software packages (e.g. normcdf in the matlab stats toolbox). [Figures: pdf, cdf.]
24 Central limit theorem If X_1, X_2, ..., X_n are i.i.d. (independent and identically distributed - to be covered next) random variables, then define the sample mean X̄_n = (1/n) Σ_{i=1}^{n} X_i. As n → ∞, the distribution of X̄_n approaches a Gaussian with mean E[X_i] and variance Var[X_i]/n. [Figure: sampling distribution of X̄_n for increasing n.] Somewhat of a justification for assuming a Gaussian distribution.
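A small simulation, not from the slides, makes the CLT claim concrete: averages of n = 100 fair die rolls should concentrate around E[X] = 3.5 with variance Var[X]/n = (35/12)/100. The sample sizes and seed are arbitrary choices.

```python
import random

# CLT sketch: distribution of the mean of n i.i.d. die rolls.
random.seed(0)  # fixed seed for repeatability
n, trials = 100, 2000

sample_means = [
    sum(random.randint(1, 6) for _ in range(n)) / n
    for _ in range(trials)
]

# Empirical mean and variance of the sample means; CLT predicts
# mean near 3.5 and variance near (35/12)/n = 0.0292.
grand_mean = sum(sample_means) / trials
grand_var = sum((m - grand_mean) ** 2 for m in sample_means) / trials
```

Plotting a histogram of `sample_means` would show the familiar bell shape even though each individual roll is uniform.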
25 Independence Training and test samples are typically assumed to be i.i.d. (independent and identically distributed). A and B are independent events if P(A ∩ B) = P(A) · P(B): the outcome of A has no effect on the outcome of B (and vice versa). E.g. roll of two dice: P({1}, {3}) = 1/6 · 1/6 = 1/36.
26 Independence A, B and C are pairwise independent events if P(A ∩ B) = P(A) · P(B), P(A ∩ C) = P(A) · P(C), and P(B ∩ C) = P(B) · P(C). A, B and C are mutually independent events if, in addition to pairwise independence, P(A ∩ B ∩ C) = P(A) · P(B) · P(C).
27 Conditional Probability P(A|B) = probability of event A conditioned on event B having occurred. If P(B) > 0, then P(A|B) = P(A ∩ B) / P(B). E.g. H = "having a headache", F = "coming down with flu": P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2 - the fraction of people with flu that have a headache. Corollary - the chain rule: P(A ∩ B) = P(A|B) P(B). If A and B are independent, P(A|B) = P(A).
28 Conditional Independence A and B are independent if P(A ∩ B) = P(A) · P(B), i.e. P(A|B) = P(A): the outcome of B has no effect on the outcome of A (and vice versa). A and B are conditionally independent given C if P(A ∩ B|C) = P(A|C) · P(B|C), i.e. P(A|B, C) = P(A|C): the outcome of B has no effect on the outcome of A (and vice versa) if C is true.
29 Prior and Posterior Distribution Suppose that our propositions have a "causal flow", e.g., F → H ← B. Prior or unconditional probabilities of propositions, e.g., P(Flu) = 0.05 and P(DrinkBeer) = 0.2, correspond to belief prior to the arrival of any (new) evidence. Posterior or conditional probabilities of propositions, e.g., P(Headache|Flu) = 0.5 and P(Headache|Flu, DrinkBeer) = 0.7, correspond to updated belief after the arrival of new evidence. Conditioning on irrelevant evidence is not useful: P(Headache|Flu, Steelers win) = 0.5.
30 Probabilistic Inference H = "having a headache", F = "coming down with flu"; P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. One day you wake up with a headache, and you reason as follows: "Since 50% of flus are associated with headaches, I must have a 50-50 chance of coming down with flu." Is this reasoning correct?
31 Probabilistic Inference H = "having a headache", F = "coming down with flu"; P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. The problem: P(F|H) = ? [Venn diagram: F, H]
32 Probabilistic Inference H = "having a headache", F = "coming down with flu"; P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. The problem: P(F|H) = P(F ∩ H) / P(H) = P(H|F) P(F) / P(H) = (1/2 · 1/40) / (1/10) = 1/8.
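The computation above is a one-liner; spelling it out in code makes the structure of Bayes' rule explicit, using exactly the slide's numbers:

```python
# Bayes' rule on the flu/headache example:
# P(F|H) = P(H|F) * P(F) / P(H).
p_h = 1 / 10          # P(H): prior probability of a headache
p_f = 1 / 40          # P(F): prior probability of flu
p_h_given_f = 1 / 2   # P(H|F): headaches given flu

p_f_given_h = p_h_given_f * p_f / p_h  # posterior = 1/8
```

So the correct posterior is 1/8 = 12.5%, not 50%: the low prior P(F) = 1/40 drags the answer well below P(H|F).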
33 The Bayes Rule What we just did leads to the following general expression: P(A|B) = P(B|A) P(A) / P(B). This is Bayes' rule.
34 Quiz P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2, P(F|H) = 1/8. Which of the following statements is true? (a) P(F|H) = P(F|¬H); (b) P(¬F|H) = 1 − P(F|H); (c) P(F|¬H) = P(¬H|F) P(F) / P(¬H) = (1 − P(H|F)) P(F) / (1 − P(H)).
35 More General Forms of Bayes Rule P(A|B) = P(B|A) P(A) / P(B). Law of total probability: P(B) = P(B ∩ A) + P(B ∩ ¬A) = P(B|A) P(A) + P(B|¬A) P(¬A). Hence P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|¬A) P(¬A)].
36 More General Forms of Bayes Rule For a multi-valued Y: P(Y = y_i|X) = P(X|Y = y_i) P(Y = y_i) / Σ_j P(X|Y = y_j) P(Y = y_j). Conditioned on additional evidence Z: P(Y|X, Z) = P(X|Y, Z) P(Y|Z) / P(X|Z) = P(X|Y, Z) P(Y|Z) / [P(X|Y, Z) P(Y|Z) + P(X|¬Y, Z) P(¬Y|Z)]. E.g. P(Flu|Headache ∩ DrankBeer). [Venn diagrams: F, B, H]
37 Joint and Marginal Probabilities A joint probability distribution for a set of RVs (say X_1, X_2, X_3) gives the probability of every atomic event P(X_1, X_2, X_3). P(Flu, DrinkBeer) is a 2 x 2 matrix of values; what about P(Flu, DrinkBeer, Headache)? Every question about a domain can be answered by the joint distribution, as we will see later. A marginal probability distribution is the probability of every value that a single RV can take: P(X_1); P(Flu)?
38 Inference by enumeration Start with a joint distribution. Building a joint distribution of M = 3 variables: Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). For each combination of values, say how probable it is. [Table: F, B, H, Prob.] Normalized, i.e., sums to 1.
39 Inference with the Joint Once you have the joint distribution you can ask for the probability of any atomic event consistent with your query: P(E) = Σ_{rows i ∈ E} P(row i). [Table: the eight rows of the (F, B, H) truth table with their probabilities.] E.g. E = {(¬F, ¬B, H), (¬F, B, H)}.
40 Inference with the Joint Compute marginals: P(Flu ∧ Headache) = P(F ∧ H ∧ B) + P(F ∧ H ∧ ¬B). [Table as before.] Recall: the law of total probability.
41 Inference with the Joint Compute marginals: P(Headache) = P(H ∧ F) + P(H ∧ ¬F) = P(H ∧ F ∧ B) + P(H ∧ F ∧ ¬B) + P(H ∧ ¬F ∧ B) + P(H ∧ ¬F ∧ ¬B). [Table as before.]
42 Inference with the Joint Compute conditionals: P(E_1|E_2) = P(E_1 ∧ E_2) / P(E_2) = Σ_{rows i ∈ E_1 ∧ E_2} P(row i) / Σ_{rows i ∈ E_2} P(row i). [Table as before.]
43 Inference with the Joint Compute conditionals: P(Flu|Headache) = P(Flu ∧ Headache) / P(Headache). General idea: compute the distribution on the query variable by fixing the evidence variables and summing over the hidden variables. [Table as before.]
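Inference by enumeration is easy to implement once the joint table exists. The probabilities below are illustrative placeholders that sum to 1 - the slide's actual table values did not survive transcription - but the enumeration logic matches the slides.

```python
# Joint distribution over (F, B, H) as a dict: tuple of 0/1 -> probability.
# Values are made up for illustration (normalized); NOT the slides' numbers.
joint = {
    (1, 1, 1): 0.02, (1, 1, 0): 0.01, (1, 0, 1): 0.01, (1, 0, 0): 0.01,
    (0, 1, 1): 0.05, (0, 1, 0): 0.20, (0, 0, 1): 0.05, (0, 0, 0): 0.65,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9  # normalized

def prob(event):
    """P(E) = sum of P(row) over rows consistent with the event."""
    return sum(p for row, p in joint.items() if event(row))

# Marginal P(H): sum over all rows with H = 1 (F and B are hidden).
p_h = prob(lambda r: r[2] == 1)

# Conditional P(F|H): fix the evidence H = 1, sum out B.
p_f_and_h = prob(lambda r: r[0] == 1 and r[2] == 1)
p_f_given_h = p_f_and_h / p_h
```

This is exactly the "fix evidence variables, sum over hidden variables" recipe, just expressed with a predicate over table rows.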
44 Where do probability distributions come from? Idea One: human, domain experts. Idea Two: simpler probability facts and some algebra, e.g., P(F), P(B), P(H|F, B), P(H|F, ¬B), P(H|¬F, B), P(H|¬F, ¬B) - use the chain rule and independence assumptions to compute the joint distribution.
45 Where do probability distributions come from? Idea Three: Learn them from data! A good chunk of this course is essentially about various ways of learning various forms of them! 45
46 Density Estimation A density estimator learns a mapping from a set of attributes to a probability. Often known as parameter estimation if the distribution form is specified: Binomial, Gaussian, ... Some important issues: Nature of the data (iid, correlated, ...); Objective function (MLE, MAP, ...); Algorithm (simple algebra, gradient methods, EM, ...); Evaluation scheme (likelihood on test data, predictability, consistency, ...).
47 Parameter Learning from iid Data Goal: estimate distribution parameters θ from a dataset of N independent, identically distributed (iid), fully observed training cases D = {x_1, ..., x_N}. Maximum likelihood estimation (MLE): 1. One of the most common estimators. 2. With the iid and full-observability assumptions, write L(θ) as the likelihood of the data: L(θ) = P(D; θ) = P(x_1, x_2, ..., x_N; θ) = P(x_1; θ) P(x_2; θ) ... P(x_N; θ) = Π_{i=1}^{N} P(x_i; θ). 3. Pick the setting of parameters most likely to have generated the data we saw: θ_MLE = argmax_θ L(θ) = argmax_θ log L(θ).
48 Example 1: Bernoulli model Data: We observed N iid coin tosses: D = {1, 0, 1, ..., 0}. Model: P(x) = 1 − θ for x = 0, θ for x = 1; compactly, P(x) = θ^x (1 − θ)^{1−x}. How do we write the likelihood of a single observation x_i? P(x_i) = θ^{x_i} (1 − θ)^{1−x_i}. The likelihood of the dataset D = {x_1, ..., x_N}: L(θ) = P(x_1, x_2, ..., x_N; θ) = Π_{i=1}^{N} P(x_i; θ) = Π_i θ^{x_i} (1 − θ)^{1−x_i} = θ^{Σ_i x_i} (1 − θ)^{Σ_i (1−x_i)} = θ^{#heads} (1 − θ)^{#tails}.
49 MLE Objective function: l(θ) = log L(θ) = log [θ^{n_h} (1 − θ)^{n_t}] = n_h log θ + (N − n_h) log(1 − θ). We need to maximize this w.r.t. θ. Take the derivative w.r.t. θ: ∂l/∂θ = n_h/θ − (N − n_h)/(1 − θ) = 0 ⇒ θ_MLE = n_h/N, i.e. θ_MLE = (1/N) Σ_i x_i - frequency as sample mean. Sufficient statistics: the counts n_h = Σ_i x_i are sufficient statistics of data D.
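The closed-form result above says the Bernoulli MLE is just the fraction of heads. A minimal sketch (the five-toss dataset is an assumed reconstruction of the slide's toy example):

```python
# Bernoulli MLE: theta_hat = n_h / N, i.e. the sample mean of the data.
D = [1, 0, 1, 1, 0]   # assumed toy coin-toss dataset (3 heads, 2 tails)

n_h = sum(D)          # sufficient statistic: number of heads
N = len(D)
theta_mle = n_h / N   # = 0.6 for this dataset
```

Note that the entire dataset enters the estimate only through the count n_h, which is what "sufficient statistic" means here.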
50 Example 2: univariate normal Data: We observed N iid real samples: D = {-0.1, 0, 1, -5.2, ..., 3}. Model: p(x) = (2πσ²)^{−1/2} exp{−(x − µ)²/2σ²}, θ = (µ, σ²). Log likelihood: l(θ) = log L(θ) = log Π_{i=1}^{N} p(x_i) = −(N/2) log(2πσ²) − (1/2σ²) Σ_{i=1}^{N} (x_i − µ)². MLE: take derivatives and set to zero: ∂l/∂µ = (1/σ²) Σ_n (x_n − µ) = 0; ∂l/∂σ² = −N/(2σ²) + (1/2σ⁴) Σ_n (x_n − µ)² = 0 ⇒ µ_MLE = (1/N) Σ_n x_n, σ²_MLE = (1/N) Σ_n (x_n − µ_MLE)².
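The Gaussian MLE is again a closed form: sample mean and (biased) sample variance. The sketch below also checks the mean against a brute-force grid search over the log-likelihood; the five data points are an illustrative completion of the slide's truncated dataset.

```python
import math

# Gaussian MLE: mu_hat = sample mean, sigma2_hat = (1/N) sum (x - mu_hat)^2.
data = [-0.1, 0.0, 1.0, -5.2, 3.0]   # illustrative data
N = len(data)
mu_mle = sum(data) / N
sigma2_mle = sum((x - mu_mle) ** 2 for x in data) / N

def loglik(mu, s2):
    """Gaussian log-likelihood of the data at parameters (mu, s2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * s2) - (x - mu) ** 2 / (2 * s2)
        for x in data
    )

# The log-likelihood is concave in mu, so a grid search around mu_mle
# should pick mu_mle itself as the maximizer.
best_mu = max((mu_mle + d / 100 for d in range(-200, 201)),
              key=lambda m: loglik(m, sigma2_mle))
```

The grid check is redundant given the calculus, but it is a useful sanity test when no closed form exists.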
51 Overfitting Recall that for the Bernoulli distribution, we have θ_MLE^head = n_head / (n_head + n_tail). What if we tossed too few times, so that we saw zero heads? We would have θ_MLE^head = 0, and we would predict that the probability of seeing a head next is zero!!! The rescue - "smoothing": θ^head = (n_head + n') / (n_head + n' + n_tail + n'), where n' is known as the pseudo- (imaginary) count. But can we make this more formal?
52 Bayesian Learning The Bayes rule: P(θ|D) = P(D|θ) P(θ) / P(D); or equivalently, posterior ∝ likelihood × prior. MAP estimate: θ_MAP = argmax_θ P(θ|D). [Figure: belief about coin-toss probability.] If the prior is uniform, MLE = MAP.
53 Bayesian estimation for Bernoulli Beta(α, β) prior: P(θ) = [Γ(α + β) / (Γ(α) Γ(β))] θ^{α−1} (1 − θ)^{β−1} = [1/B(α, β)] θ^{α−1} (1 − θ)^{β−1}. Posterior distribution of θ: P(θ|x_1, ..., x_N) = p(x_1, ..., x_N|θ) p(θ) / p(x_1, ..., x_N) ∝ θ^{n_h} (1 − θ)^{n_t} · θ^{α−1} (1 − θ)^{β−1} = θ^{n_h+α−1} (1 − θ)^{n_t+β−1} ⇒ Beta(α + n_h, β + n_t). Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior. α and β are hyperparameters (parameters of the prior) and correspond to the number of virtual heads/tails (pseudo counts).
54 MAP Posterior distribution of θ: P(θ|x_1, ..., x_N) = p(x_1, ..., x_N|θ) p(θ) / p(x_1, ..., x_N) ∝ θ^{n_h+α−1} (1 − θ)^{n_t+β−1}. Maximum a posteriori (MAP) estimation: θ_MAP = argmax_θ log P(θ|x_1, ..., x_N). Posterior mean estimation: θ_mean = (n_h + α) / (N + α + β). With enough data, the prior is forgotten. Beta parameters can be understood as pseudo-counts.
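The conjugate Beta-Bernoulli update makes MAP and posterior-mean estimates one-liners. The prior Beta(2, 2) and the counts below are illustrative choices; the mode formula (n_h + α − 1)/(N + α + β − 2) is the standard Beta mode, stated here as an added detail since the slide only names the argmax.

```python
# Beta-Bernoulli: prior Beta(alpha, beta) + data (n_h heads, n_t tails)
# gives posterior Beta(alpha + n_h, beta + n_t).
alpha, beta = 2, 2     # illustrative prior: 1 virtual head, 1 virtual tail
n_h, n_t = 7, 3        # illustrative observed counts
N = n_h + n_t

theta_mle = n_h / N                                      # ignores the prior
theta_map = (n_h + alpha - 1) / (N + alpha + beta - 2)   # posterior mode
theta_mean = (n_h + alpha) / (N + alpha + beta)          # posterior mean
```

Both Bayesian estimates are pulled from the MLE (0.7) toward the prior mean (0.5) - exactly the pseudo-count smoothing from the overfitting slide, now derived rather than ad hoc.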
55 Dirichlet distribution [Slide: the Dirichlet distribution, the multi-valued generalization of the Beta prior.]
56 Estimating the parameters of a distribution Maximum likelihood estimation (MLE): choose the value that maximizes the probability of the observed data, θ_MLE = argmax_θ P(D|θ). Maximum a posteriori (MAP) estimation: choose the value that is most probable given the observed data and prior belief, θ_MAP = argmax_θ P(θ|D) = argmax_θ P(D|θ) P(θ).
57 MLE vs MAP (Frequentist vs Bayesian) Frequentist/MLE approach: θ is an unknown constant; estimate it from data. Bayesian/MAP approach: θ is a random variable; assume a probability distribution over it. Drawbacks: MLE overfits if the dataset is too small; with MAP, two people with different priors will end up with different estimates.
58 Bayesian estimation for normal distribution Normal prior: P(µ) = (2πτ²)^{−1/2} exp{−(µ − µ_0)²/2τ²}. Joint probability: P(x, µ) = (2πσ²)^{−N/2} exp{−(1/2σ²) Σ_{n=1}^{N} (x_n − µ)²} × (2πτ²)^{−1/2} exp{−(µ − µ_0)²/2τ²}. Posterior: P(µ|x) = (2πσ̃²)^{−1/2} exp{−(µ − µ̃)²/2σ̃²}, where µ̃ = [(N/σ²) / (N/σ² + 1/τ²)] x̄ + [(1/τ²) / (N/σ² + 1/τ²)] µ_0 and σ̃² = (N/σ² + 1/τ²)^{−1}, with x̄ the sample mean.
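The posterior mean µ̃ above is a precision-weighted average of the sample mean and the prior mean. A numeric sketch (all values illustrative, assuming the likelihood variance σ² is known):

```python
# Conjugate update for a normal mean: prior mu ~ N(mu0, tau^2),
# likelihood x_i ~ N(mu, sigma^2) with sigma^2 known.
mu0, tau2 = 0.0, 1.0          # illustrative prior mean and variance
sigma2 = 4.0                  # illustrative known likelihood variance
data = [1.0, 2.0, 3.0, 2.0]   # illustrative observations
N = len(data)
xbar = sum(data) / N

# Posterior precision is the sum of data and prior precisions.
precision = N / sigma2 + 1 / tau2
mu_tilde = (N / sigma2 * xbar + mu0 / tau2) / precision
sigma2_tilde = 1 / precision
```

Here the data precision N/σ² = 1 equals the prior precision 1/τ² = 1, so µ̃ lands exactly halfway between x̄ = 2 and µ_0 = 0; with more data, the sample mean dominates and the prior is forgotten.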
59 Probability Review What you should know:
More informationProbability Theory Review
Probability Theory Review Brendan O Connor 10-601 Recitation Sept 11 & 12, 2012 1 Mathematical Tools for Machine Learning Probability Theory Linear Algebra Calculus Wikipedia is great reference 2 Probability
More informationLecture 10: Introduction to reasoning under uncertainty. Uncertainty
Lecture 10: Introduction to reasoning under uncertainty Introduction to reasoning under uncertainty Review of probability Axioms and inference Conditional probability Probability distributions COMP-424,
More informationLecture 9: Naive Bayes, SVM, Kernels. Saravanan Thirumuruganathan
Lecture 9: Naive Bayes, SVM, Kernels Instructor: Outline 1 Probability basics 2 Probabilistic Interpretation of Classification 3 Bayesian Classifiers, Naive Bayes 4 Support Vector Machines Probability
More informationIntroduction to Bayesian Learning. Machine Learning Fall 2018
Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationProbability and Information Theory. Sargur N. Srihari
Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal
More informationSingle Maths B: Introduction to Probability
Single Maths B: Introduction to Probability Overview Lecturer Email Office Homework Webpage Dr Jonathan Cumming j.a.cumming@durham.ac.uk CM233 None! http://maths.dur.ac.uk/stats/people/jac/singleb/ 1 Introduction
More informationLearning Bayesian network : Given structure and completely observed data
Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution
More informationPMR Learning as Inference
Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationLecture 1: Basics of Probability
Lecture 1: Basics of Probability (Luise-Vitetta, Chapter 8) Why probability in data science? Data acquisition is noisy Sampling/quantization external factors: If you record your voice saying machine learning
More informationProbability theory basics
Probability theory basics Michael Franke Basics of probability theory: axiomatic definition, interpretation, joint distributions, marginalization, conditional probability & Bayes rule. Random variables:
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 2. MLE, MAP, Bayes classification Barnabás Póczos & Aarti Singh 2014 Spring Administration http://www.cs.cmu.edu/~aarti/class/10701_spring14/index.html Blackboard
More informationBayesian RL Seminar. Chris Mansley September 9, 2008
Bayesian RL Seminar Chris Mansley September 9, 2008 Bayes Basic Probability One of the basic principles of probability theory, the chain rule, will allow us to derive most of the background material in
More informationDept. of Linguistics, Indiana University Fall 2015
L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 34 To start out the course, we need to know something about statistics and This is only an introduction; for a fuller understanding, you would
More information3. Review of Probability and Statistics
3. Review of Probability and Statistics ECE 830, Spring 2014 Probabilistic models will be used throughout the course to represent noise, errors, and uncertainty in signal processing problems. This lecture
More informationIntroduction to Machine Learning. Lecture 2
Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for
More informationUncertain Knowledge and Bayes Rule. George Konidaris
Uncertain Knowledge and Bayes Rule George Konidaris gdk@cs.brown.edu Fall 2018 Knowledge Logic Logical representations are based on: Facts about the world. Either true or false. We may not know which.
More informationLecture 2: Review of Basic Probability Theory
ECE 830 Fall 2010 Statistical Signal Processing instructor: R. Nowak, scribe: R. Nowak Lecture 2: Review of Basic Probability Theory Probabilistic models will be used throughout the course to represent
More informationSome Probability and Statistics
Some Probability and Statistics David M. Blei COS424 Princeton University February 12, 2007 D. Blei ProbStat 01 1 / 42 Who wants to scribe? D. Blei ProbStat 01 2 / 42 Random variable Probability is about
More informationAlgorithms for Uncertainty Quantification
Algorithms for Uncertainty Quantification Tobias Neckel, Ionuț-Gabriel Farcaș Lehrstuhl Informatik V Summer Semester 2017 Lecture 2: Repetition of probability theory and statistics Example: coin flip Example
More informationData Modeling & Analysis Techniques. Probability & Statistics. Manfred Huber
Data Modeling & Analysis Techniques Probability & Statistics Manfred Huber 2017 1 Probability and Statistics Probability and statistics are often used interchangeably but are different, related fields
More informationSome slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2
Logistics CSE 446: Point Estimation Winter 2012 PS2 out shortly Dan Weld Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2 Last Time Random variables, distributions Marginal, joint & conditional
More informationIntroduction to Machine Learning
What does this mean? Outline Contents Introduction to Machine Learning Introduction to Probabilistic Methods Varun Chandola December 26, 2017 1 Introduction to Probability 1 2 Random Variables 3 3 Bayes
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 143 Part IV
More information6.867 Machine Learning
6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.
More informationGaussians Linear Regression Bias-Variance Tradeoff
Readings listed in class website Gaussians Linear Regression Bias-Variance Tradeoff Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 22 nd, 2007 Maximum Likelihood Estimation
More informationLEARNING WITH BAYESIAN NETWORKS
LEARNING WITH BAYESIAN NETWORKS Author: David Heckerman Presented by: Dilan Kiley Adapted from slides by: Yan Zhang - 2006, Jeremy Gould 2013, Chip Galusha -2014 Jeremy Gould 2013Chip Galus May 6th, 2016
More informationCS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas
CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1
More informationLecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions
DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 26, 2015 Today: Bayes Classifiers Conditional Independence Naïve Bayes Readings: Mitchell: Naïve Bayes
More informationMachine Learning CSE546 Carlos Guestrin University of Washington. September 30, What about continuous variables?
Linear Regression Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2014 1 What about continuous variables? n Billionaire says: If I am measuring a continuous variable, what
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationCOMP 328: Machine Learning
COMP 328: Machine Learning Lecture 2: Naive Bayes Classifiers Nevin L. Zhang Department of Computer Science and Engineering The Hong Kong University of Science and Technology Spring 2010 Nevin L. Zhang
More informationMLE/MAP + Naïve Bayes
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes MLE / MAP Readings: Estimating Probabilities (Mitchell, 2016)
More informationProbability theory for Networks (Part 1) CS 249B: Science of Networks Week 02: Monday, 02/04/08 Daniel Bilar Wellesley College Spring 2008
Probability theory for Networks (Part 1) CS 249B: Science of Networks Week 02: Monday, 02/04/08 Daniel Bilar Wellesley College Spring 2008 1 Review We saw some basic metrics that helped us characterize
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 2. Statistical Schools Adapted from slides by Ben Rubinstein Statistical Schools of Thought Remainder of lecture is to provide
More informationNaïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 3 September 14, Readings: Mitchell Ch Murphy Ch.
School of Computer Science 10-701 Introduction to Machine Learning aïve Bayes Readings: Mitchell Ch. 6.1 6.10 Murphy Ch. 3 Matt Gormley Lecture 3 September 14, 2016 1 Homewor 1: due 9/26/16 Project Proposal:
More information1 INFO Sep 05
Events A 1,...A n are said to be mutually independent if for all subsets S {1,..., n}, p( i S A i ) = p(a i ). (For example, flip a coin N times, then the events {A i = i th flip is heads} are mutually
More informationClass 26: review for final exam 18.05, Spring 2014
Probability Class 26: review for final eam 8.05, Spring 204 Counting Sets Inclusion-eclusion principle Rule of product (multiplication rule) Permutation and combinations Basics Outcome, sample space, event
More informationHierarchical Models & Bayesian Model Selection
Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Contact information Please report any typos or
More informationReview of probability
Review of probability Computer Sciences 760 Spring 2014 http://pages.cs.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts definition of probability random variables
More informationMachine Learning. Probability Basics. Marc Toussaint University of Stuttgart Summer 2014
Machine Learning Probability Basics Basic definitions: Random variables, joint, conditional, marginal distribution, Bayes theorem & examples; Probability distributions: Binomial, Beta, Multinomial, Dirichlet,
More information