LDA, QDA, Naive Bayes

1 LDA, QDA, Naive Bayes: Generative Classification Models. Marek Petrik, 2/16/2017

2 Last Class
Logistic regression
Maximum likelihood principle

3 Logistic Regression
Predict the probability of a class: $p(X)$
Example: $p(\text{balance})$ is the probability of default for a person with a given balance
Linear regression: $p(X) = \beta_0 + \beta_1 X$
Logistic regression: $p(X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$
This is the same as: $\log\left(\dfrac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$
Odds: $\dfrac{p(X)}{1 - p(X)}$

4 Logistic Function
$y = \dfrac{e^{x}}{1 + e^{x}}$
$p(X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$
[Figure: the logistic function plotted against x]

5 Logit Function
$\log\left(\dfrac{p(X)}{1 - p(X)}\right)$
$\log\left(\dfrac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$
[Figure: the logit plotted against p(X)]
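The logistic and logit functions are inverses of each other, which a few lines of Python can confirm. This is a minimal sketch: the coefficient values `beta0` and `beta1` and the balance are made up for illustration and do not come from the slides.

```python
import math


def logistic(z: float) -> float:
    """Logistic function e^z / (1 + e^z), arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(p: float) -> float:
    """Log-odds log(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients and balance, chosen only for illustration.
beta0, beta1 = -10.0, 0.005
balance = 2000.0

p = logistic(beta0 + beta1 * balance)  # probability of default under this toy model
log_odds = logit(p)                    # recovers beta0 + beta1 * balance
```

Applying `logit` to the output of `logistic` returns the linear predictor, which is why logistic regression is linear on the log-odds scale.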

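6 Logistic Regression
$\Pr[\text{default} = \text{yes} \mid \text{balance}] = \dfrac{e^{\beta_0 + \beta_1 \cdot \text{balance}}}{1 + e^{\beta_0 + \beta_1 \cdot \text{balance}}}$
[Figure: two panels, "Linear regression" and "Logistic regression"; x-axis: Balance, y-axis: Probability of Default]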

7 Estimating Coefficients: Maximum Likelihood
Likelihood: the probability of the observed data under a given model
$\ell(\text{model}) = \Pr[\text{data} \mid \text{model}]$
Find the most likely model: $\max_{\text{model}} \ell(\text{model}) = \max_{\text{model}} \Pr[\text{data} \mid \text{model}]$
The likelihood function is difficult to maximize directly
Transform it using the logarithm, which is strictly increasing: $\max_{\text{model}} \log \ell(\text{model})$
A strictly increasing transformation does not change which model attains the maximum
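The claim that the log transform leaves the maximizer unchanged can be checked numerically. The sketch below assumes a Bernoulli model with one parameter `theta` and a made-up data set; it uses a plain grid search rather than any particular optimizer.

```python
import math

# Made-up binary outcomes (7 ones out of 10), for illustration only.
data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]


def likelihood(theta: float, xs: list) -> float:
    """Pr[data | theta] for independent Bernoulli(theta) observations."""
    out = 1.0
    for x in xs:
        out *= theta if x == 1 else (1.0 - theta)
    return out


def log_likelihood(theta: float, xs: list) -> float:
    """Logarithm of the likelihood, computed as a sum of log terms."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in xs)


# Candidate parameter values 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]

best_by_likelihood = max(grid, key=lambda t: likelihood(t, data))
best_by_log = max(grid, key=lambda t: log_likelihood(t, data))
```

Both searches pick the same `theta`, the sample frequency of ones, although the maximal values of the two objectives differ.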

8 Today
1. Classification methods continued
2. Discriminative vs. generative ML models
3. Generative classification models: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Naive Bayes classification

9 Discriminative vs Generative Models
Discriminative models
- Estimate the conditional model $\Pr[Y \mid X]$
- Linear regression
- Logistic regression
Generative models
- Estimate the joint probability $\Pr[Y, X] = \Pr[Y \mid X] \Pr[X]$
- Estimate not only the probability of the labels but also of the features
- Once the model is fit, it can be used to generate data
- LDA, QDA, Naive Bayes
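To make "can be used to generate data" concrete, here is a minimal sketch of sampling from a fitted generative model: draw a label from the class probabilities, then draw a feature from that class's normal distribution. The priors, means, and standard deviation are hypothetical values, not estimates from the lecture's data.

```python
import random


def sample_generative(n, priors, means, sigma, seed=0):
    """Draw n (x, y) pairs: y from the class priors, then x from N(means[y], sigma^2)."""
    rng = random.Random(seed)
    labels = list(priors)
    weights = [priors[k] for k in labels]
    data = []
    for _ in range(n):
        y = rng.choices(labels, weights=weights)[0]  # sample the label
        x = rng.gauss(means[y], sigma)               # sample the feature given the label
        data.append((x, y))
    return data


# Hypothetical parameters, chosen only for illustration.
priors = {"no": 0.9, "yes": 0.1}
means = {"no": 800.0, "yes": 1750.0}
sigma = 450.0

samples = sample_generative(1000, priors, means, sigma)
```

A discriminative model such as logistic regression has no counterpart to this procedure, because it never models the distribution of the features.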

10 Generative Models
+ Can be used to generate data ($\Pr[X]$)
+ Offer more insight into the data
- Often work worse, particularly when the assumptions are violated

12 Normal Distribution
Density function: $p(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \exp\left(-\dfrac{(x - \mu)^2}{2\sigma^2}\right)$
[Figure: estimated density of a normal sample]
Central limit theorem: $Z = \dfrac{1}{n} \sum_{i=1}^{n} X_i$ for i.i.d. $X_i$ with finite variance is approximately normal as $n \to \infty$

13 Logistic Regression
$Y = \begin{cases} 1 & \text{if default} \\ 0 & \text{otherwise} \end{cases}$
[Figure: two panels, "Linear regression" and "Logistic regression"; x-axis: Balance, y-axis: Probability of Default]
Predict: $\Pr[\text{default} = \text{yes} \mid \text{balance}]$

17 LDA: Linear Discriminant Analysis
Generative model: capture the probability of the predictors for each label
Predict:
1. $\Pr[\text{balance} \mid \text{default} = \text{yes}]$ and $\Pr[\text{default} = \text{yes}]$
2. $\Pr[\text{balance} \mid \text{default} = \text{no}]$ and $\Pr[\text{default} = \text{no}]$
Classes are normal: $\Pr[\text{balance} \mid \text{default} = \text{yes}]$ is a normal density

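18 LDA vs Logistic Regression
Logistic regression: $\Pr[\text{default} = \text{yes} \mid \text{balance}]$
Linear discriminant analysis:
- $\Pr[\text{balance} \mid \text{default} = \text{yes}]$ and $\Pr[\text{default} = \text{yes}]$
- $\Pr[\text{balance} \mid \text{default} = \text{no}]$ and $\Pr[\text{default} = \text{no}]$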

19 LDA with 1 Feature
Classes are normal and class probabilities $\pi_k$ are scalars
$f_k(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \exp\left(-\dfrac{1}{2\sigma^2}(x - \mu_k)^2\right)$
Key assumption: the class variances $\sigma_k^2$ are all the same, $\sigma_k^2 = \sigma^2$

22 Bayes Theorem
Classification from label distributions:
$\Pr[Y = k \mid X = x] = \dfrac{\Pr[X = x \mid Y = k] \Pr[Y = k]}{\Pr[X = x]}$
Example:
$\Pr[\text{default} = \text{yes} \mid \text{balance} = \$100] = \dfrac{\Pr[\text{balance} = \$100 \mid \text{default} = \text{yes}] \Pr[\text{default} = \text{yes}]}{\Pr[\text{balance} = \$100]}$
Notation:
$\Pr[Y = k \mid X = x] = \dfrac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$
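The posterior formula in the notation above can be evaluated directly once the priors and class densities are known. A minimal sketch, assuming normal class densities with a shared standard deviation; the function names and all numeric values are illustrative.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))


def posterior(x: float, priors: list, mus: list, sigma: float) -> list:
    """Pr[Y = k | X = x] = pi_k f_k(x) / sum_l pi_l f_l(x) for every class k."""
    weighted = [pi_k * normal_pdf(x, mu_k, sigma) for pi_k, mu_k in zip(priors, mus)]
    total = sum(weighted)  # this is Pr[X = x], the denominator in Bayes theorem
    return [w / total for w in weighted]


# Two hypothetical classes with equal priors and means 0 and 2.
post_mid = posterior(1.0, [0.5, 0.5], [0.0, 2.0], 1.0)   # halfway between the means
post_left = posterior(0.0, [0.5, 0.5], [0.0, 2.0], 1.0)  # at the first class mean
```

Halfway between the two means the posterior is split evenly; moving toward one mean shifts the probability to that class.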

28 Classification With LDA
Probability in class $k_1$ > probability in class $k_2$:
$\Pr[Y = k_1 \mid X = x] > \Pr[Y = k_2 \mid X = x]$
$\dfrac{\pi_{k_1} f_{k_1}(x)}{\sum_{l=1}^{K} \pi_l f_l(x)} > \dfrac{\pi_{k_2} f_{k_2}(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$
$\pi_{k_1} f_{k_1}(x) > \pi_{k_2} f_{k_2}(x)$
$\log\left(\pi_{k_1} f_{k_1}(x)\right) > \log\left(\pi_{k_2} f_{k_2}(x)\right)$
$\hat\delta_{k_1}(x) > \hat\delta_{k_2}(x)$
Discriminant function (derive at home):
$\hat\delta_k(x) = x \dfrac{\hat\mu_k}{\hat\sigma^2} - \dfrac{\hat\mu_k^2}{2\hat\sigma^2} + \log(\hat\pi_k)$
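For readers who want to check the "derive at home" step, one way to obtain the discriminant function is to expand the logarithm of $\pi_k f_k(x)$ with the normal density from the one-feature slide and drop every term that does not depend on $k$. The derivation below is a sketch written without hats; the estimated version replaces each parameter by its estimate.

```latex
\begin{align*}
\log\bigl(\pi_k f_k(x)\bigr)
  &= \log \pi_k - \log\bigl(\sigma\sqrt{2\pi}\bigr) - \frac{(x - \mu_k)^2}{2\sigma^2} \\
  &= \log \pi_k - \log\bigl(\sigma\sqrt{2\pi}\bigr)
     - \frac{x^2}{2\sigma^2} + x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} \\
  &= \underbrace{x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k}_{\delta_k(x)}
     \;+\; \underbrace{\Bigl(-\log\bigl(\sigma\sqrt{2\pi}\bigr) - \frac{x^2}{2\sigma^2}\Bigr)}_{\text{same for every class } k}
\end{align*}
```

Because the second group of terms is shared by all classes, comparing $\log(\pi_k f_k(x))$ across classes is the same as comparing $\delta_k(x)$, which is linear in $x$. The shared variance assumption is what makes the $x^2$ term cancel.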

32 Estimating LDA Parameters
Maximum log likelihood!
$$\max_{\mu,\sigma} \log \ell(\mu, \sigma) = \max_{\mu,\sigma} \sum_{i=1}^{N} \log\left(f_{y_i}(x_i)\right) = \max_{\mu,\sigma} \sum_{i=1}^{N} \log\left(\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}(x_i - \mu_{y_i})^2\right)\right) = \max_{\mu,\sigma} \sum_{i=1}^{N} \left(-\log \sigma - \frac{1}{2\sigma^2}(x_i - \mu_{y_i})^2 + \text{consts}\right)$$
Concave in $\mu$ and $1/\sigma^2$; consider a single class with mean $\mu$
Derivatives of $\log \ell$, set to zero:
$$\frac{\partial}{\partial \mu} \log \ell(\mu, \sigma) = \frac{1}{\sigma^2} \sum_{i=1}^{N} (x_i - \mu) = 0$$
$$\frac{\partial}{\partial \sigma} \log \ell(\mu, \sigma) = -\frac{N}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{N} (x_i - \mu)^2 = 0$$
Therefore:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
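A quick numerical check of the single-class result: plug the closed-form estimates into the two derivatives and confirm that both vanish. The data values are made up, and `statistics.pvariance` is used only as an independent check of the 1/N variance.

```python
import statistics

# Made-up observations from a single class, for illustration only.
xs = [2.0, 4.0, 4.0, 5.0, 7.0]
n = len(xs)

# Closed-form maximum likelihood estimates.
mu_hat = sum(xs) / n
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
sigma_hat = var_hat**0.5

# Derivatives of the log likelihood evaluated at the estimates.
d_mu = sum(x - mu_hat for x in xs) / var_hat
d_sigma = -n / sigma_hat + sum((x - mu_hat) ** 2 for x in xs) / sigma_hat**3

# Population variance from the standard library uses the same 1/N formula.
var_check = statistics.pvariance(xs)
```

Both `d_mu` and `d_sigma` come out as zero up to floating-point rounding, so the estimates are a stationary point of the log likelihood.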

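33 Better Parameter Estimates
The maximum likelihood variance $\sigma^2$ is biased:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
Unbiased estimate:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad \sigma^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \mu)^2$$
See ISL for the precise formula for more than a single class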
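Putting the estimates and the discriminant function together gives a complete one-feature LDA classifier. The sketch below assumes the pooled variance with an N - K denominator (K classes), which is the multi-class formula given in ISL; the function names and the toy data are illustrative and not from the lecture.

```python
import math


def fit_lda_1d(xs, ys):
    """Estimate class priors, class means, and one pooled variance (N - K denominator)."""
    classes = sorted(set(ys))
    n = len(xs)
    priors, means = {}, {}
    sum_sq = 0.0
    for k in classes:
        xk = [x for x, y in zip(xs, ys) if y == k]
        priors[k] = len(xk) / n
        means[k] = sum(xk) / len(xk)
        sum_sq += sum((x - means[k]) ** 2 for x in xk)
    var = sum_sq / (n - len(classes))  # assumption: pooled estimate as in ISL
    return priors, means, var


def discriminant(x, k, priors, means, var):
    """delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)."""
    return x * means[k] / var - means[k] ** 2 / (2 * var) + math.log(priors[k])


def predict(x, priors, means, var):
    """Return the class with the largest discriminant value."""
    return max(priors, key=lambda k: discriminant(x, k, priors, means, var))


# Toy data: two well-separated classes.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = ["no", "no", "no", "yes", "yes", "yes"]
priors, means, var = fit_lda_1d(xs, ys)
```

With equal priors the decision boundary sits halfway between the two class means, here at 4.5, so `predict(4.0, priors, means, var)` returns `"no"` and `predict(5.0, priors, means, var)` returns `"yes"`.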

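34 LDA with Multiple Features
Multivariate normal distributions:
[Figure: two multivariate normal densities over features $x_1$ and $x_2$]
Multivariate normal distribution density (mean vector $\mu$, covariance matrix $\Sigma$):
$$p(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$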

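35 Multivariate Maximum Likelihood
Consider a single class:
$$\max_{\mu,\Sigma} \log \ell(\mu, \Sigma) = \max_{\mu,\Sigma} \sum_{i=1}^{N} \log\left(f_k(x_i)\right) = \max_{\mu,\Sigma} \sum_{i=1}^{N} \log\left(\frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)\right)\right)$$
$$= \max_{\mu,\Sigma} \; -\frac{N}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^{N} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) = \max_{\mu,\Sigma} \; -\frac{N}{2} \log |\Sigma| - \frac{1}{2} \operatorname{Trace}\left(\Sigma^{-1} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^\top\right)$$
(additive constants dropped)
Use $\dfrac{\partial}{\partial \Sigma} \log |\Sigma| = \Sigma^{-1}$ (for symmetric $\Sigma$) and $\dfrac{\partial}{\partial A} \operatorname{Trace}(AB) = B^\top$:
$$\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^\top$$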
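The resulting covariance estimate is an average of outer products of the centered observations. A minimal pure-Python sketch on made-up two-dimensional points follows; the helper name `mle_mean_cov` is hypothetical.

```python
def mle_mean_cov(points):
    """Maximum likelihood mean vector and covariance matrix (1/N normalization)."""
    n = len(points)
    p = len(points[0])
    mu = [sum(pt[j] for pt in points) / n for j in range(p)]
    # Entry (a, b) averages (x_a - mu_a) * (x_b - mu_b) over all points,
    # which is the (a, b) entry of the averaged outer product.
    cov = [
        [sum((pt[a] - mu[a]) * (pt[b] - mu[b]) for pt in points) / n for b in range(p)]
        for a in range(p)
    ]
    return mu, cov


# Made-up observations from a single class.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0), (4.0, 7.0)]
mu, cov = mle_mean_cov(points)
```

The matrix is symmetric by construction, and its diagonal holds the per-feature maximum likelihood variances.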

36 Multivariate Classification Using LDA
Linear: decision boundaries are linear
[Figure: LDA decision boundaries; axes: $X_1$ and $X_2$]

38 Confusion Matrix: Predict default

                 True Yes   True No   Total
Predicted Yes    a          b         a + b
Predicted No     c          d         c + d
Total            a + c      b + d     N

Result of LDA classification: predict default if $\Pr[\text{default} = \text{yes} \mid \text{balance}] > 1/2$
[Table: predicted vs. true default counts; the cell values did not survive extraction]
Most people who default are predicted as "no default"

40 Increasing LDA Sensitivity
Result of LDA classification: predict default if $\Pr[\text{default} = \text{yes} \mid \text{balance}] > 1/2$
[Tables: two predicted vs. true default tables; the cell values did not survive extraction]
Lowering the threshold below 1/2 flags more people as defaulting, which raises sensitivity at the cost of more false positives
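One common way to increase sensitivity is to lower the probability threshold used for predicting default. The sketch below uses made-up posterior probabilities and an illustrative alternative threshold of 0.2; neither the data nor the value 0.2 is taken from the slides.

```python
def sensitivity(probs, labels, threshold):
    """TP / (TP + FN) when predicting positive whenever prob > threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 1)
    fn = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 1)
    return tp / (tp + fn)


def false_positives(probs, labels, threshold):
    """Number of true negatives that are wrongly predicted positive."""
    return sum(1 for p, y in zip(probs, labels) if p > threshold and y == 0)


# Made-up posterior probabilities of default and true labels (1 = default).
probs = [0.9, 0.6, 0.4, 0.3, 0.25, 0.1, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0]

sens_half = sensitivity(probs, labels, 0.5)
sens_low = sensitivity(probs, labels, 0.2)
fp_half = false_positives(probs, labels, 0.5)
fp_low = false_positives(probs, labels, 0.2)
```

On this toy data the lower threshold catches every defaulter but also produces a false positive that the 1/2 threshold avoided, which is the trade-off the threshold controls.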

41 True Positives, etc

                      Reality Positive   Reality Negative
Predicted Positive    True Positive      False Positive
Predicted Negative    False Negative     True Negative

Recall/sensitivity = TP/(TP+FN)
Precision = TP/(TP+FP)
Specificity = TN/(TN+FP)
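The three ratios translate directly into code. This is a minimal sketch with hypothetical counts; the function name `classification_metrics` is made up for the example.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Recall, precision, and specificity from the four confusion matrix counts."""
    return {
        "recall": tp / (tp + fn),        # share of real positives that are found
        "precision": tp / (tp + fp),     # share of predicted positives that are real
        "specificity": tn / (tn + fp),   # share of real negatives that are rejected
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
```

Note that recall and specificity each use one column of the table (one true class), while precision uses one row (one predicted class).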

42 ROC Curve
[Figure: ROC curve; x-axis: false positive rate, y-axis: true positive rate]

43 Area Under ROC Curve
[Figure: ROC curve; x-axis: false positive rate, y-axis: true positive rate]
Larger area is better
There are many other ways to measure classifier performance, such as $F_1$
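An ROC curve can be traced by sweeping the decision threshold over the observed scores, and the area under it approximated with the trapezoid rule. The sketch below is one simple way to do this in pure Python, assuming both classes are present; the scores and labels are made up.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs for thresholds from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up classifier scores and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
area = auc(curve)
```

The area equals the fraction of positive/negative pairs in which the positive example receives the higher score, here 8 of 9 pairs.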

46 QDA: Quadratic Discriminant Analysis
Generalizes LDA
LDA: the class covariance matrices are the same, $\Sigma_k = \Sigma$
QDA: the class covariance matrices $\Sigma_k$ can differ
Which has the smaller training error on the same data, LDA or QDA?
What about the test error?
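The name "quadratic" comes from the form of the discriminant function once each class has its own covariance matrix. The block below is a sketch that follows the same steps as for LDA, taking the logarithm of $\pi_k f_k(x)$ with the multivariate normal density and dropping the constant $-\tfrac{p}{2}\log(2\pi)$.

```latex
\begin{align*}
\delta_k^{\mathrm{QDA}}(x)
  &= -\frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)
     - \frac{1}{2}\log|\Sigma_k| + \log \pi_k \\[4pt]
\delta_k^{\mathrm{LDA}}(x)
  &= x^\top \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k
  \qquad \text{(when } \Sigma_k = \Sigma \text{ for all } k\text{)}
\end{align*}
```

With a shared $\Sigma$, the terms $-\tfrac{1}{2} x^\top \Sigma^{-1} x$ and $-\tfrac{1}{2}\log|\Sigma|$ are the same for every class and cancel in the comparison, leaving a function that is linear in $x$. With class-specific $\Sigma_k$ the quadratic term stays, so the decision boundaries are quadratic.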

47 QDA: Quadratic Discriminant Analysis
[Figure: QDA decision boundaries; axes: $X_1$ and $X_2$]

48 Naive Bayes
Simple Bayes net classification
With a normal distribution over the features $X_1, \ldots, X_k$, it is a special case of QDA with diagonal $\Sigma$
Generalizes to non-normal distributions and discrete variables
More on it later...
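With normal features, Naive Bayes keeps a separate mean and variance for each feature in each class and multiplies the per-feature densities, which corresponds to a diagonal covariance matrix per class. The sketch below is a minimal illustration using maximum likelihood variances; the function names and the toy data are hypothetical.

```python
import math


def fit_gnb(X, y):
    """Per class: prior, per-feature means, per-feature variances (1/N normalization)."""
    model = {}
    n = len(X)
    for k in set(y):
        rows = [x for x, label in zip(X, y) if label == k]
        p = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        variances = [sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) for j in range(p)]
        model[k] = (len(rows) / n, means, variances)
    return model


def log_score(x, prior, means, variances):
    """log(pi_k) plus the sum of per-feature normal log densities."""
    s = math.log(prior)
    for xj, m, v in zip(x, means, variances):
        s += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
    return s


def predict_gnb(model, x):
    """Return the class with the highest log score."""
    return max(model, key=lambda k: log_score(x, *model[k]))


# Toy data: two classes in two dimensions.
X = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (6.0, 5.0), (7.0, 6.5), (6.5, 6.0)]
y = [0, 0, 0, 1, 1, 1]
model = fit_gnb(X, y)
```

Summing per-feature log densities is exactly what the independence assumption allows; replacing the normal density with another per-feature distribution is how the method extends to non-normal and discrete variables.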
