Naïve Bayes. Jia-Bin Huang. Virginia Tech Spring 2019 ECE-5424G / CS-5824

1 Naïve Bayes Jia-Bin Huang ECE-5424G / CS-5824 Virginia Tech Spring 2019

2 Administrative HW 1 out today. Please start early! Office hours Chen: Wed 4pm-5pm Shih-Yang: Fri 3pm-4pm Location: Whittemore 266

3 Linear Regression
Model representation: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^\top x$
Cost function: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Gradient descent for linear regression: repeat until convergence $\{\theta_j \leftarrow \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}\}$
Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial)
Normal equation: $\theta = (X^\top X)^{-1} X^\top y$
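A minimal NumPy sketch of the batch gradient-descent update above; the design matrix is assumed to already contain the $x_0 = 1$ bias column, and the learning rate and iteration count are illustrative choices rather than values from the slides.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix whose first column is all ones (x_0 = 1).
    y: (m,) target vector.
    Returns the learned parameter vector theta of shape (n+1,).
    """
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = X @ theta - y               # h_theta(x^(i)) - y^(i) for every example
        theta -= alpha * (X.T @ errors) / m  # simultaneous update of every theta_j
    return theta
```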

4 Example (housing data): features $x_0$ (= 1), $x_1$ = size in feet², $x_2$ = number of bedrooms, $x_3$ = number of floors, $x_4$ = age of home (years); target $y$ = price ($) in 1000s. The fit is $\theta = (X^\top X)^{-1} X^\top y$. Slide credit: Andrew Ng

5 Least squares solution
$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=1}^{m}\left(\theta^\top x^{(i)} - y^{(i)}\right)^2 = \frac{1}{2m}\left\lVert X\theta - y\right\rVert_2^2$
Setting $\nabla_\theta J(\theta) = 0$ gives $\theta = (X^\top X)^{-1} X^\top y$
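For comparison, a sketch of the closed-form solution; solving the linear system $(X^\top X)\theta = X^\top y$ with np.linalg.solve instead of forming the inverse explicitly is a standard numerical choice, not something the slide prescribes.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y, i.e., theta = (X^T X)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```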

6 Justification/interpretation 1: geometric interpretation
$X = \begin{bmatrix} 1 & x^{(1)\top} \\ 1 & x^{(2)\top} \\ \vdots & \vdots \\ 1 & x^{(m)\top} \end{bmatrix} = \begin{bmatrix} z_0 & z_1 & z_2 & \dots & z_n \end{bmatrix}$
$X\theta$ lies in the column space of $X$, i.e., $\mathrm{span}(\{z_0, z_1, \dots, z_n\})$
The residual $X\theta - y$ is orthogonal to the column space of $X$: $X^\top(X\theta - y) = 0$, i.e., $(X^\top X)\theta = X^\top y$

7 Justification/interpretation 2: probabilistic model
Assume a linear model with Gaussian errors: $p_\theta(y^{(i)} \mid x^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}\left(y^{(i)} - \theta^\top x^{(i)}\right)^2\right)$
Maximum likelihood: $\arg\max_\theta \prod_{i=1}^{m} p_\theta(y^{(i)} \mid x^{(i)}) = \arg\min_\theta -\sum_{i=1}^{m}\log p_\theta(y^{(i)} \mid x^{(i)}) = \arg\min_\theta \frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(\theta^\top x^{(i)} - y^{(i)}\right)^2$
Image credit: CS 446@UIUC

8 Justification/interpretation 3: loss minimization
$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{1}{m}\sum_{i=1}^{m} L_{\mathrm{ls}}\left(h_\theta(x^{(i)}), y^{(i)}\right)$
where $L_{\mathrm{ls}}(\hat{y}, y) = \frac{1}{2}\lVert \hat{y} - y \rVert_2^2$ is the least squares loss
Empirical Risk Minimization (ERM): minimize $\frac{1}{m}\sum_{i=1}^{m} L\left(\hat{y}^{(i)}, y^{(i)}\right)$

9 Gradient descent vs. normal equation ($m$ training examples, $n$ features)
Gradient descent: need to choose $\alpha$; needs many iterations; works well even when $n$ is large
Normal equation: no need to choose $\alpha$; no iterations; needs to compute $(X^\top X)^{-1}$; slow if $n$ is very large
Slide credit: Andrew Ng

10 Things to remember
Model representation: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^\top x$
Cost function: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Gradient descent for linear regression: repeat until convergence $\{\theta_j \leftarrow \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}\}$
Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial)
Normal equation: $\theta = (X^\top X)^{-1} X^\top y$

11 Today's plan Probability basics Estimating parameters from data Maximum likelihood (ML) Maximum a posteriori estimation (MAP) Naïve Bayes

12 Today's plan Probability basics Estimating parameters from data Maximum likelihood (ML) Maximum a posteriori estimation (MAP) Naive Bayes

13 Random variables
Outcome space S: the space of possible outcomes
Random variables: functions that map outcomes to real numbers
Event E: a subset of S

14 Visualizing probability P(A): the sample space has total area 1; A is true inside the blue circle and false outside it; P(A) = area of the blue circle

15 Visualizing probability: the regions where A is true and where A is false partition the sample space, so P(A) + P(~A) = 1

16 Visualizing probability: the region where A is true splits into A∧B and A∧~B, so P(A) = P(A, B) + P(A, ~B)

17 Visualizing conditional probability: P(A|B) = P(A, B) / P(B). Corollary (the chain rule): P(A, B) = P(A|B) P(B)

18 Bayes rule (Thomas Bayes). From the chain rule, P(A, B) = P(A|B) P(B) = P(B|A) P(A), so P(A|B) = P(A, B) / P(B) = P(B|A) P(A) / P(B)

19 Other forms of Bayes rule
P(A|B) = P(B|A) P(A) / P(B)
P(A|B, X) = P(B|A, X) P(A|X) / P(B|X)
P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|~A) P(~A))

20 Applying Bayes rule
A = you have the flu, B = you just coughed
Assume: P(A) = 0.05, P(B|A) = 0.8, P(B|~A) = 0.2
What is P(flu|cough) = P(A|B)?
P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|~A) P(~A)) ≈ 0.17
Slide credit: Tom Mitchell
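A quick numeric check of the flu/cough example; the helper name posterior is just for illustration.

```python
def posterior(p_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) by Bayes rule, expanding P(B) with the law of total probability."""
    numerator = p_b_given_a * p_a
    denominator = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return numerator / denominator

# P(flu) = 0.05, P(cough|flu) = 0.8, P(cough|no flu) = 0.2
print(posterior(0.05, 0.8, 0.2))  # ~0.174
```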

21 Why are we learning this? Given an input x, we want a hypothesis h that predicts y; that is, we want to learn P(Y|X).

22 Joint distribution
Making a joint distribution of M variables:
1. Make a truth table listing all combinations of values (e.g., columns A, B, C, Prob)
2. For each combination of values, say how probable it is
3. The probabilities must sum to 1
Slide credit: Tom Mitchell

23 Using the joint distribution
Can ask for any logical expression involving these variables:
$P(E) = \sum_{\text{rows matching } E} P(\text{row})$
$P(E_1 \mid E_2) = \dfrac{\sum_{\text{rows matching } E_1 \text{ and } E_2} P(\text{row})}{\sum_{\text{rows matching } E_2} P(\text{row})}$
Slide credit: Tom Mitchell
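A minimal sketch of these two queries against a small joint table stored as a Python dictionary; the three-variable toy distribution and its numbers are made up for illustration.

```python
# Toy joint distribution over Boolean variables A, B, C (probabilities sum to 1).
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.20,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E): sum P(row) over rows matching the event (a predicate on (a, b, c))."""
    return sum(p for row, p in joint.items() if event(*row))

def cond_prob(event1, event2):
    """P(E1|E2): sum over rows matching both, divided by the sum over rows matching E2."""
    both = prob(lambda a, b, c: event1(a, b, c) and event2(a, b, c))
    return both / prob(event2)

print(prob(lambda a, b, c: a == 1))                               # P(A)
print(cond_prob(lambda a, b, c: a == 1, lambda a, b, c: b == 1))  # P(A|B)
```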

24 So is that the solution to learning P(Y|X)? Main problem: learning P(Y|X) may require more data than we have. Say we learn a joint distribution over 100 Boolean attributes: the table needs $2^{100}$ rows, while the number of people on Earth is only about $10^9$. Slide credit: Tom Mitchell

25 What should we do?
1. Be smart about how we estimate probabilities from sparse data: maximum likelihood estimates (ML), maximum a posteriori estimates (MAP)
2. Be smart about how to represent joint distributions: Bayes networks, graphical models (more on this later)
Slide credit: Tom Mitchell

26 Today's plan Probability basics Estimating parameters from data Maximum likelihood (ML) Maximum a posteriori (MAP) Naive Bayes

27 Estimating the probability
Flip a coin repeatedly, observing: it turns up heads (X = 1) $\alpha_1$ times and tails (X = 0) $\alpha_0$ times. Your estimate for P(X = 1) is?
Case A: 100 flips: 51 heads (X = 1), 49 tails (X = 0). P(X = 1) = ?
Case B: 3 flips: 2 heads (X = 1), 1 tail (X = 0). P(X = 1) = ?
Slide credit: Tom Mitchell

28 Two principles for estimating parameters
Maximum likelihood estimation (MLE): choose $\theta$ that maximizes the probability of the observed data: $\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta P(\mathrm{Data} \mid \theta)$
Maximum a posteriori estimation (MAP): choose $\theta$ that is most probable given the prior probability and the data: $\hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta P(\theta \mid \mathrm{Data}) = \arg\max_\theta \dfrac{P(\mathrm{Data} \mid \theta)\,P(\theta)}{P(\mathrm{Data})}$
Slide credit: Tom Mitchell

29 Two principles for estimating parameters
Maximum likelihood estimation (MLE): choose $\theta$ that maximizes $P(\mathrm{Data} \mid \theta)$: $\hat{\theta}_{\mathrm{MLE}} = \dfrac{\alpha_1}{\alpha_1 + \alpha_0}$
Maximum a posteriori estimation (MAP): choose $\theta$ that maximizes $P(\theta \mid \mathrm{Data})$: $\hat{\theta}_{\mathrm{MAP}} = \dfrac{\alpha_1 + \#\text{hallucinated 1s}}{(\alpha_1 + \#\text{hallucinated 1s}) + (\alpha_0 + \#\text{hallucinated 0s})}$
Slide credit: Tom Mitchell

30 Maximum likelihood estimate
Each flip yields a Boolean value for X, with X ~ Bernoulli: $P(X) = \theta^X (1-\theta)^{1-X}$, so $P(X = 1) = \theta$ and $P(X = 0) = 1 - \theta$
A data set D of independent, identically distributed (iid) flips produces $\alpha_1$ ones and $\alpha_0$ zeros: $P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1}(1-\theta)^{\alpha_0}$
$\hat{\theta} = \arg\max_\theta P(D \mid \theta) = \dfrac{\alpha_1}{\alpha_1 + \alpha_0}$
Slide credit: Tom Mitchell
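The closed form follows from maximizing the log-likelihood, a standard step not spelled out on the slide: $\log P(D \mid \theta) = \alpha_1 \log\theta + \alpha_0 \log(1-\theta)$, and setting $\frac{d}{d\theta}\log P(D \mid \theta) = \frac{\alpha_1}{\theta} - \frac{\alpha_0}{1-\theta} = 0$ gives $\hat{\theta} = \frac{\alpha_1}{\alpha_1 + \alpha_0}$.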

31 Beta prior distribution: $P(\theta) = \mathrm{Beta}(\beta_1, \beta_0) = \dfrac{1}{B(\beta_1, \beta_0)}\,\theta^{\beta_1 - 1}(1-\theta)^{\beta_0 - 1}$. Slide credit: Tom Mitchell

32 Maximum a posteriori (MAP) estimate
A data set D of iid flips produces $\alpha_1$ ones and $\alpha_0$ zeros: $P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1}(1-\theta)^{\alpha_0}$
Assume a Beta prior (conjugate prior: closed-form representation of the posterior): $P(\theta) = \mathrm{Beta}(\beta_1, \beta_0) = \dfrac{1}{B(\beta_1, \beta_0)}\,\theta^{\beta_1 - 1}(1-\theta)^{\beta_0 - 1}$
$\hat{\theta} = \arg\max_\theta P(D \mid \theta)\,P(\theta) = \dfrac{\alpha_1 + \beta_1 - 1}{(\alpha_1 + \beta_1 - 1) + (\alpha_0 + \beta_0 - 1)}$
Slide credit: Tom Mitchell
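A small sketch of both estimators for the coin-flip model; the Beta hyperparameters passed in below are illustrative, not values from the slides.

```python
def mle_estimate(alpha1, alpha0):
    """MLE for a Bernoulli parameter from alpha1 heads and alpha0 tails."""
    return alpha1 / (alpha1 + alpha0)

def map_estimate(alpha1, alpha0, beta1, beta0):
    """MAP estimate (posterior mode) under a Beta(beta1, beta0) prior."""
    return (alpha1 + beta1 - 1) / ((alpha1 + beta1 - 1) + (alpha0 + beta0 - 1))

print(mle_estimate(2, 1))        # Case B above: 2 heads, 1 tail -> 0.667
print(map_estimate(2, 1, 5, 5))  # same data, pulled toward 0.5 by the prior -> ~0.545
```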

33 Some terminology
Likelihood function: P(Data|θ). Prior: P(θ). Posterior: P(θ|Data).
Conjugate prior: P(θ) is the conjugate prior for a likelihood function P(Data|θ) if the prior P(θ) and the posterior P(θ|Data) have the same form.
Example (coin flip problem): prior P(θ): $\mathrm{Beta}(\beta_1, \beta_0)$; likelihood P(Data|θ): binomial, $\theta^{\alpha_1}(1-\theta)^{\alpha_0}$; posterior P(θ|Data): $\mathrm{Beta}(\alpha_1 + \beta_1, \alpha_0 + \beta_0)$
Slide credit: Tom Mitchell

34 How many parameters? Suppose X = [X_1, …, X_n], where the X_i and Y are Boolean random variables. To estimate P(Y|X_1, …, X_n): how many parameters when n = 2 (Gender, Hours-worked)? When n = 30? Slide credit: Tom Mitchell

35 Can we reduce the number of parameters using Bayes rule? P(Y|X) = P(X|Y) P(Y) / P(X). How many parameters for P(X_1, …, X_n|Y)? $2(2^n - 1)$. How many parameters for P(Y)? 1. Slide credit: Tom Mitchell

36 Today's plan Probability basics Estimating parameters from data Maximum likelihood (ML) Maximum a posteriori estimation (MAP) Naive Bayes

37 Naïve Bayes assumption: $P(X_1, \dots, X_n \mid Y) = \prod_{j=1}^{n} P(X_j \mid Y)$, i.e., $X_i$ and $X_j$ are conditionally independent given Y for $i \neq j$. Slide credit: Tom Mitchell

38 Conditional independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z: $\forall i, j, k:\; P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$
Shorthand: P(X|Y, Z) = P(X|Z)
Example: P(Thunder|Rain, Lightning) = P(Thunder|Lightning)
Slide credit: Tom Mitchell

39 Applying conditional independence
Naïve Bayes assumes the $X_i$ are conditionally independent given Y, e.g., P(X_1|X_2, Y) = P(X_1|Y)
So P(X_1, X_2|Y) = P(X_1|X_2, Y) P(X_2|Y) = P(X_1|Y) P(X_2|Y)
General form: $P(X_1, \dots, X_n \mid Y) = \prod_{j=1}^{n} P(X_j \mid Y)$
How many parameters to describe P(X_1, …, X_n|Y) and P(Y)? Without the conditional independence assumption? With it? (A worked count follows below.)
Slide credit: Tom Mitchell
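To make the counting question concrete (for Boolean Y and n Boolean features, a count implied by the slide rather than printed on it): without the conditional independence assumption, $P(X_1, \dots, X_n \mid Y)$ needs $2(2^n - 1)$ parameters (a distribution over $2^n$ joint values for each of the two classes), plus 1 for P(Y); with the Naïve Bayes assumption it needs only $2n$ parameters (one value $P(X_i = 1 \mid Y = y_k)$ per feature per class), plus 1 for P(Y). For n = 30 that is roughly $2 \times 10^9$ parameters versus 60.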

40 Naïve Bayes classifier
Bayes rule: $P(Y = y_k \mid X_1, \dots, X_n) = \dfrac{P(Y = y_k)\, P(X_1, \dots, X_n \mid Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1, \dots, X_n \mid Y = y_j)}$
Assuming conditional independence among the $X_i$: $P(Y = y_k \mid X_1, \dots, X_n) = \dfrac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}$
Pick the most probable Y: $Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$
Slide credit: Tom Mitchell

41 Naïve Bayes algorithm (discrete $X_i$)
For each value $y_k$: estimate $\pi_k = P(Y = y_k)$
For each value $x_{ij}$ of each attribute $X_i$: estimate $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$
Classify $X^{\mathrm{test}}$: $Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{\mathrm{test}} \mid Y = y_k) = \arg\max_{y_k} \pi_k \prod_i \theta_{ijk}$
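A minimal NumPy sketch of this algorithm for Boolean features and a Boolean label, using plain MLE counts (no smoothing yet); the shapes and function names are illustrative.

```python
import numpy as np

def train_nb(X, y):
    """X: (m, n) array of 0/1 features; y: (m,) array of 0/1 labels.
    Returns pi[k] = P(Y = k) and theta[i, k] = P(X_i = 1 | Y = k)."""
    n = X.shape[1]
    pi = np.array([np.mean(y == k) for k in (0, 1)])
    theta = np.array([[X[y == k, i].mean() for k in (0, 1)] for i in range(n)])
    return pi, theta

def classify_nb(x, pi, theta):
    """Return argmax_k P(Y = k) * prod_i P(X_i = x_i | Y = k) for a 0/1 vector x."""
    scores = []
    for k in (0, 1):
        p_x_given_y = np.where(x == 1, theta[:, k], 1 - theta[:, k])
        scores.append(pi[k] * np.prod(p_x_given_y))
    return int(np.argmax(scores))
```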

42 Estimating parameters: discrete Y, $X_i$
Maximum likelihood estimates (MLE):
$\hat{\pi}_k = \hat{P}(Y = y_k) = \dfrac{\#D\{Y = y_k\}}{|D|}$
$\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \dfrac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}$
Slide credit: Tom Mitchell

43 F = 1 iff you live in Fox Ridge; S = 1 iff you watched the Super Bowl last night; D = 1 iff you drive to VT; G = 1 iff you went to the gym in the last month
Estimate: P(F = 1) = ?, P(F = 0) = ?
P(S = 1|F = 1) = ?, P(S = 0|F = 1) = ?, P(S = 1|F = 0) = ?, P(S = 0|F = 0) = ?
P(D = 1|F = 1) = ?, P(D = 0|F = 1) = ?, P(D = 1|F = 0) = ?, P(D = 0|F = 0) = ?
P(G = 1|F = 1) = ?, P(G = 0|F = 1) = ?, P(G = 1|F = 0) = ?, P(G = 0|F = 0) = ?
Then classify with $P(F \mid S, D, G) \propto P(F)\,P(S \mid F)\,P(D \mid F)\,P(G \mid F)$

44 Naïve Bayes: Subtlety #1
Often the $X_i$ are not really conditionally independent. Naïve Bayes often works pretty well anyway: it often gives the right classification, even when not the right probability [Domingos & Pazzani, 1996].
What is the effect on the estimated P(Y|X)? What if we have two copies of a feature, $X_i = X_k$?
$P(Y = y_k \mid X_1, \dots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$
Slide credit: Tom Mitchell

45 Naïve Bayes: Subtlety #2
The MLE estimate for $P(X_i \mid Y = y_k)$ might be zero (for example, $X_i$ = birthdate and the value $X_i$ = Feb_4_1995 never appears in the training data).
Why worry about just one parameter out of many? Because $P(Y = y_k \mid X_1, \dots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$, so a single zero factor forces the whole product to zero for that class.
What can we do to address this? MAP estimates (adding imaginary examples).
Slide credit: Tom Mitchell

46 Estimating parameters: discrete Y, $X_i$
Maximum likelihood estimates (MLE):
$\hat{\pi}_k = \hat{P}(Y = y_k) = \dfrac{\#D\{Y = y_k\}}{|D|}$
$\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \dfrac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}$
MAP estimates (Dirichlet priors):
$\hat{\pi}_k = \hat{P}(Y = y_k) = \dfrac{\#D\{Y = y_k\} + (\beta_k - 1)}{|D| + \sum_m (\beta_m - 1)}$
$\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \dfrac{\#D\{X_i = x_{ij} \wedge Y = y_k\} + (\beta_k - 1)}{\#D\{Y = y_k\} + \sum_m (\beta_m - 1)}$
Slide credit: Tom Mitchell
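A sketch of the MAP counting rule as pseudocounts, written for a single attribute value; with all $\beta_m$ equal, $\sum_m(\beta_m - 1)$ becomes (number of values) × (β − 1), and β = 2 corresponds to Laplace add-one smoothing. The helper name is illustrative.

```python
def smoothed_estimate(count_match, count_class, num_values, beta=2):
    """MAP estimate of P(X_i = x_ij | Y = y_k) with beta - 1 hallucinated examples per value.

    count_match: #D{X_i = x_ij and Y = y_k}
    count_class: #D{Y = y_k}
    num_values:  number of distinct values X_i can take.
    """
    return (count_match + (beta - 1)) / (count_class + num_values * (beta - 1))

# A value never seen with class y_k still gets a non-zero probability:
print(smoothed_estimate(0, 50, 2))  # 1/52 instead of 0
```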

47 What if we have continuous $X_i$?
Gaussian Naïve Bayes (GNB): assume $P(X_i = x \mid Y = y_k) = \dfrac{1}{\sqrt{2\pi}\,\sigma_{ik}}\exp\left(-\dfrac{(x - \mu_{ik})^2}{2\sigma_{ik}^2}\right)$
Additional assumptions sometimes made on $\sigma_{ik}$: it is independent of Y ($\sigma_i$), independent of $X_i$ ($\sigma_k$), or independent of both $X_i$ and Y ($\sigma$)
Slide credit: Tom Mitchell

48 Naïve Bayes algorithm (continuous $X_i$)
For each value $y_k$: estimate $\pi_k = P(Y = y_k)$
For each attribute $X_i$: estimate the class-conditional mean $\mu_{ik}$ and variance $\sigma_{ik}$
Classify $X^{\mathrm{test}}$: $Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{\mathrm{test}} \mid Y = y_k) = \arg\max_{y_k} \pi_k \prod_i \mathrm{Normal}(X_i^{\mathrm{test}}; \mu_{ik}, \sigma_{ik})$
Slide credit: Tom Mitchell
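A minimal NumPy sketch of Gaussian Naïve Bayes keeping a separate variance per feature and class (no extra independence assumption on $\sigma_{ik}$); the small variance floor and the function names are illustrative choices.

```python
import numpy as np

def train_gnb(X, y, classes=(0, 1), var_floor=1e-6):
    """X: (m, n) real-valued features; y: (m,) labels.
    Returns priors pi[k], class-conditional means mu[k, i] and variances var[k, i]."""
    pi = np.array([np.mean(y == k) for k in classes])
    mu = np.array([X[y == k].mean(axis=0) for k in classes])
    var = np.array([X[y == k].var(axis=0) + var_floor for k in classes])
    return pi, mu, var

def classify_gnb(x, pi, mu, var):
    """argmax_k log P(Y = k) + sum_i log Normal(x_i; mu_ki, sigma_ki); logs avoid underflow."""
    log_likelihood = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
    return int(np.argmax(np.log(pi) + log_likelihood))
```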

49 Things to remember
Probability basics
Estimating parameters from data: maximum likelihood (ML) maximizes P(Data|θ); maximum a posteriori estimation (MAP) maximizes P(θ|Data)
Naive Bayes: $P(Y = y_k \mid X_1, \dots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$

50 Next class Logistic regression
