COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification


COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification. Instructor: Joelle Pineau (jpineau@cs.mcgill.ca). Class web page: www.cs.mcgill.ca/~jpineau/comp551. Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor's written permission.

Today's quiz. Q1. What is a linear classifier? (In contrast to a non-linear classifier.) Q2. Describe the difference between discriminative and generative classifiers. Q3. Consider the following data set. If you use logistic regression to compute a decision boundary, what is the prediction for x6? [Data set: examples x1 to x6, each with columns Feature 1, Feature 2, Feature 3 and Output; the output for x6 is unknown.]

Quick recap. Two approaches for linear classification: Discriminative learning: directly estimate P(y|x). E.g. logistic regression: P(y|x) = σ(w^T x) = 1 / (1 + e^{-w^T x}). Generative learning: separately model P(x|y) and P(y); use these, through Bayes rule, to estimate P(y|x).
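To make the recap concrete, here is a minimal numpy sketch of the discriminative prediction rule; the weights and inputs below are made up purely for illustration, not taken from the course material.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps a real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    # P(y=1 | x) = sigmoid(w^T x) for each row of X (bias folded into w via a column of ones).
    return sigmoid(X @ w)

# Toy usage with made-up weights and inputs.
w = np.array([0.5, -1.0, 2.0])           # [bias, w1, w2]
X = np.array([[1.0, 0.2, 0.7],           # each row: [1, x1, x2]
              [1.0, 1.5, -0.3]])
print(predict_proba(w, X))                       # probabilities of class 1
print((predict_proba(w, X) >= 0.5).astype(int))  # predicted labels
```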

Linear discriminant analysis (LDA). Return to Bayes rule: P(y|x) = P(x|y)P(y) / P(x). LDA makes explicit assumptions about P(x|y): P(x|y) = (1 / ((2π)^{m/2} |Σ|^{1/2})) exp( -(1/2) (x - μ_y)^T Σ^{-1} (x - μ_y) ), a multivariate Gaussian with mean μ_y and covariance matrix Σ. Notation: here x is a single instance, represented as an m×1 vector. Key assumption of LDA: both classes have the same covariance matrix, Σ. Now consider the log-odds ratio (again, P(x) doesn't matter for the decision): ln [ P(x|y=1)P(y=1) / (P(x|y=0)P(y=0)) ] = ln [ P(y=1)/P(y=0) ] - (1/2) μ_1^T Σ^{-1} μ_1 + (1/2) μ_0^T Σ^{-1} μ_0 + x^T Σ^{-1} (μ_1 - μ_0). This is a linear decision boundary: w_0 + x^T w.
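As a rough numpy sketch of the reduction above, the x-independent terms can be collected into w_0 and the x-dependent term into w; the class means, shared covariance, and priors here are made up for illustration.

```python
import numpy as np

# Illustrative (made-up) LDA parameters for two classes sharing covariance Sigma.
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
p1, p0 = 0.4, 0.6                      # class priors P(y=1), P(y=0)

Sigma_inv = np.linalg.inv(Sigma)

# Log-odds = w0 + x^T w, with w and w0 read off the expression on the slide.
w = Sigma_inv @ (mu1 - mu0)
w0 = (np.log(p1 / p0)
      - 0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu0 @ Sigma_inv @ mu0)

x = np.array([1.0, 0.5])
log_odds = w0 + x @ w
print("predict class 1" if log_odds > 0 else "predict class 0")
```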

Learning in LDA: 2-class case. Estimate P(y), μ, Σ from the training data, then apply the log-odds ratio. Class priors: P(y=0) = N_0 / (N_0 + N_1) and P(y=1) = N_1 / (N_0 + N_1), where N_1 and N_0 are the numbers of training samples from classes 1 and 0, respectively. Class means: μ_0 = Σ_{i=1:n} I(y_i=0) x_i / N_0 and μ_1 = Σ_{i=1:n} I(y_i=1) x_i / N_1, where I(x) is the indicator function: I(x)=0 if x=0, I(x)=1 if x=1. Shared covariance: Σ = Σ_{k=0:1} Σ_{i=1:n} I(y_i=k) (x_i - μ_k)(x_i - μ_k)^T / (N_0 + N_1 - 2). Decision boundary: given an input x, classify it as class 1 if the log-odds ratio is > 0, and classify it as class 0 otherwise.
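A minimal sketch of these estimators on synthetic data, assuming the 2-class setting above; this is an illustration of the slide's formulas, not the course's official code.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate LDA parameters (priors, class means, shared covariance) from data."""
    n, m = X.shape
    X0, X1 = X[y == 0], X[y == 1]
    N0, N1 = len(X0), len(X1)
    p0, p1 = N0 / n, N1 / n
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled (shared) covariance estimate, as on the slide.
    Sigma = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (N0 + N1 - 2)
    return p0, p1, mu0, mu1, Sigma

def predict_lda(params, X):
    p0, p1, mu0, mu1, Sigma = params
    S_inv = np.linalg.inv(Sigma)
    w = S_inv @ (mu1 - mu0)
    w0 = np.log(p1 / p0) - 0.5 * mu1 @ S_inv @ mu1 + 0.5 * mu0 @ S_inv @ mu0
    return (w0 + X @ w > 0).astype(int)   # class 1 if the log-odds ratio is > 0

# Tiny synthetic data set for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([2, 2], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_lda(X, y)
print("training accuracy:", (predict_lda(params, X) == y).mean())
```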

More complex decision boundaries. Want to accommodate more than 2 classes? Consider the per-class linear discriminant function: δ_y(x) = x^T Σ^{-1} μ_y - (1/2) μ_y^T Σ^{-1} μ_y + log P(y). Use the following decision rule: Output = argmax_y δ_y(x) = argmax_y [ x^T Σ^{-1} μ_y - (1/2) μ_y^T Σ^{-1} μ_y + log P(y) ]. Want more flexible (non-linear) decision boundaries? Recall the trick from linear regression of re-coding the variables.
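A small numpy sketch of the argmax decision rule above, with made-up class means, shared covariance, and priors for a 3-class problem in 2D.

```python
import numpy as np

def lda_discriminants(X, means, Sigma, priors):
    # delta_y(x) = x^T Sigma^{-1} mu_y - 0.5 mu_y^T Sigma^{-1} mu_y + log P(y), one column per class.
    S_inv = np.linalg.inv(Sigma)
    scores = []
    for mu_y, p_y in zip(means, priors):
        scores.append(X @ S_inv @ mu_y - 0.5 * mu_y @ S_inv @ mu_y + np.log(p_y))
    return np.column_stack(scores)

# Made-up parameters for three classes.
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigma = np.eye(2)
priors = [1 / 3, 1 / 3, 1 / 3]

X = np.array([[0.2, 0.1], [2.5, 0.4], [0.3, 2.8]])
print(np.argmax(lda_discriminants(X, means, Sigma, priors), axis=1))  # predicted class per row
```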

Using higher-order features. FIGURE 4.1: The left plot shows some data from three classes, with linear decision boundaries found by linear discriminant analysis. The right plot shows quadratic decision boundaries. These were obtained by finding linear boundaries in the five-dimensional space X1, X2, X1X2, X1^2, X2^2. Linear inequalities in this space are quadratic inequalities in the original space.

Quadratic discriminant analysis. LDA assumes all classes have the same covariance matrix, Σ. QDA allows a different covariance matrix, Σ_y, for each class y. Linear discriminant function: δ_y(x) = x^T Σ^{-1} μ_y - (1/2) μ_y^T Σ^{-1} μ_y + log P(y). Quadratic discriminant function: δ_y(x) = -(1/2) log |Σ_y| - (1/2) (x - μ_y)^T Σ_y^{-1} (x - μ_y) + log P(y). QDA has more parameters to estimate, but greater flexibility to estimate the target function. Bias-variance trade-off.
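A minimal sketch of the quadratic discriminant function, evaluated with made-up per-class parameters; unlike LDA, each class carries its own covariance matrix.

```python
import numpy as np

def qda_discriminant(x, mu_y, Sigma_y, prior_y):
    # delta_y(x) = -0.5 log|Sigma_y| - 0.5 (x - mu_y)^T Sigma_y^{-1} (x - mu_y) + log P(y)
    diff = x - mu_y
    sign, logdet = np.linalg.slogdet(Sigma_y)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.inv(Sigma_y) @ diff + np.log(prior_y)

# Made-up per-class parameters (mean, covariance, prior).
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 0.5]]), 0.5),
]

x = np.array([1.5, 1.0])
scores = [qda_discriminant(x, mu, S, p) for mu, S, p in params]
print("predicted class:", int(np.argmax(scores)))
```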

LDA vs QDA. FIGURE 4.6: Two methods for fitting quadratic boundaries. The left plot shows the quadratic decision boundaries for the data in Figure 4.1 (obtained using LDA in the five-dimensional space X1, X2, X1X2, X1^2, X2^2). The right plot shows the quadratic decision boundaries found by QDA. The differences are small, as is usually the case.

Generative learning: scaling up. Consider applying generative learning (e.g. LDA) to the project data. E.g. task: given the first few words of a document, determine its language (without a dictionary). What are the features? How should we encode the data? How do we handle a high-dimensional feature space? Should we assume the same covariance matrix for both classes? (Maybe that is ok.) What kinds of assumptions can we make to simplify the problem?

Naïve Bayes assumption. Generative learning: estimate P(x|y) and P(y), then calculate P(y|x). Naïve Bayes: assume the x_j are conditionally independent given y. In other words: P(x_j | y) = P(x_j | y, x_k), for all j, k. Generative model structure: P(x|y) = P(x_1, x_2, ..., x_m | y) = P(x_1|y) P(x_2|y, x_1) P(x_3|y, x_1, x_2) ... P(x_m|y, x_1, x_2, ..., x_{m-1}) (from the general rules of probability) = P(x_1|y) P(x_2|y) P(x_3|y) ... P(x_m|y) (from the Naïve Bayes assumption above).

Naïve Bayes graphical model: y is the parent of x_1, x_2, x_3, ..., x_m. How many parameters do we need to estimate? Assume m binary features. Without the Naïve Bayes assumption: O(2^m) numbers to describe the model. With the Naïve Bayes assumption: O(m) numbers to describe the model. Useful when the number of features is high.

Training a Naïve Bayes classifier. Assume the x_j and y are binary variables. Estimate the parameters P(x|y) and P(y) from data. Define: Θ_1 = Pr(y=1), Θ_{j,1} = Pr(x_j=1 | y=1), Θ_{j,0} = Pr(x_j=1 | y=0). Evaluation criterion: find the parameters that maximize the log-likelihood function. Likelihood: Pr(y|x) ∝ Pr(y) Pr(x|y) = Π_{i=1:n} ( P(y_i) Π_{j=1:m} P(x_{i,j} | y_i) ). Samples i are independent, so we take the product over n. Input features are independent (conditioned on y), so we take the product over m.

Training a Naïve Bayes classifier. Likelihood for a binary output variable: L(Θ_1 | y) = Θ_1^y (1-Θ_1)^{1-y}. Log-likelihood for all parameters: log L(Θ_1, Θ_{j,1}, Θ_{j,0} | D) = Σ_{i=1:n} [ log P(y_i) + Σ_{j=1:m} log P(x_{i,j} | y_i) ] = Σ_{i=1:n} [ y_i log Θ_1 + (1-y_i) log(1-Θ_1) + Σ_{j=1:m} y_i ( x_{i,j} log Θ_{j,1} + (1-x_{i,j}) log(1-Θ_{j,1}) ) + Σ_{j=1:m} (1-y_i) ( x_{i,j} log Θ_{j,0} + (1-x_{i,j}) log(1-Θ_{j,0}) ) ]. (This will take another form if the parameters P(x|y) have another form, e.g. Gaussian.) To estimate Θ_1, take the derivative of log L with respect to Θ_1 and set it to 0: ∂L/∂Θ_1 = Σ_{i=1:n} ( y_i/Θ_1 - (1-y_i)/(1-Θ_1) ) = 0.

Training a Naïve Bayes classifier. Solving for Θ_1 we get: Θ_1 = (1/n) Σ_{i=1:n} y_i = (number of examples where y=1) / (number of examples). Similarly, we get: Θ_{j,1} = (number of examples where x_j=1 and y=1) / (number of examples where y=1), and Θ_{j,0} = (number of examples where x_j=1 and y=0) / (number of examples where y=0).
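The maximum likelihood estimates above are just counts, so they are easy to sketch in numpy; the tiny binary data set below is made up for illustration and no smoothing is applied yet.

```python
import numpy as np

def train_naive_bayes(X, y):
    """Maximum-likelihood estimates for binary features and a binary label (no smoothing)."""
    theta1 = y.mean()                      # Theta_1 = fraction of examples with y=1
    theta_j1 = X[y == 1].mean(axis=0)      # Theta_{j,1} = P(x_j=1 | y=1)
    theta_j0 = X[y == 0].mean(axis=0)      # Theta_{j,0} = P(x_j=1 | y=0)
    return theta1, theta_j1, theta_j0

# Tiny made-up binary data set: 5 examples, 3 features.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0])

theta1, theta_j1, theta_j0 = train_naive_bayes(X, y)
print(theta1)     # 3/5 of the examples have y=1
print(theta_j1)   # counts of x_j=1 among the y=1 examples, divided by 3
print(theta_j0)   # counts of x_j=1 among the y=0 examples, divided by 2
```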

Naïve Bayes decision boundary. Consider again the log-odds ratio: log [ Pr(y=1|x) / Pr(y=0|x) ] = log [ Pr(x|y=1)P(y=1) / (Pr(x|y=0)P(y=0)) ] = log [ P(y=1)/P(y=0) ] + log Π_{j=1:m} [ P(x_j|y=1) / P(x_j|y=0) ] = log [ P(y=1)/P(y=0) ] + Σ_{j=1:m} log [ P(x_j|y=1) / P(x_j|y=0) ].

Naïve Bayes decision boundary. Consider the case where the features are binary: x_j ∈ {0, 1}. Define: w_{j,0} = log [ P(x_j=0|y=1) / P(x_j=0|y=0) ] and w_{j,1} = log [ P(x_j=1|y=1) / P(x_j=1|y=0) ]. Now we have: log [ Pr(y=1|x) / Pr(y=0|x) ] = log [ P(y=1)/P(y=0) ] + Σ_{j=1:m} log [ P(x_j|y=1) / P(x_j|y=0) ] = log [ P(y=1)/P(y=0) ] + Σ_{j=1:m} ( w_{j,0} (1 - x_j) + w_{j,1} x_j ) = log [ P(y=1)/P(y=0) ] + Σ_{j=1:m} w_{j,0} + Σ_{j=1:m} (w_{j,1} - w_{j,0}) x_j. This is a linear decision boundary: w_0 + x^T w.
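A minimal sketch of turning Bernoulli Naïve Bayes parameters into the linear rule w_0 + x^T w derived above; the parameter values are made up (and kept away from 0 and 1 so the logs are finite).

```python
import numpy as np

def nb_linear_boundary(theta1, theta_j1, theta_j0):
    """Turn Bernoulli Naive Bayes parameters into a linear rule: log-odds = w0 + x^T w."""
    # Per-feature log probability ratios for x_j = 0 and x_j = 1, as defined on the slide.
    w_j0 = np.log((1 - theta_j1) / (1 - theta_j0))
    w_j1 = np.log(theta_j1 / theta_j0)
    w0 = np.log(theta1 / (1 - theta1)) + w_j0.sum()
    w = w_j1 - w_j0
    return w0, w

# Made-up (already smoothed) parameters for three binary features.
theta1 = 0.6
theta_j1 = np.array([0.8, 0.5, 0.9])
theta_j0 = np.array([0.2, 0.5, 0.4])

w0, w = nb_linear_boundary(theta1, theta_j1, theta_j0)
x = np.array([1, 0, 1])
print("predict class 1" if w0 + x @ w > 0 else "predict class 0")
```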

Text classification example. Using Naïve Bayes, we can compute probabilities for all the words that appear in the document collection: P(y=c) is the probability of class c; P(x_j | y=c) is the probability of seeing word j in documents of class c. Graphical model: class c is the parent of word_1, word_2, word_3, ..., word_m. The set of classes depends on the application, e.g. topic modeling: each class corresponds to documents on a given topic, e.g. {Politics, Finance, Sports, Arts}. What happens when a word is not observed in the training data?

Laplace smoothing. Replace the maximum likelihood estimator: Pr(x_j=1 | y=1) = (number of instances with x_j=1 and y=1) / (number of examples with y=1), with the following: Pr(x_j=1 | y=1) = ( (number of instances with x_j=1 and y=1) + 1 ) / ( (number of examples with y=1) + 2 ). If there is no example from that class, the estimate reduces to a prior probability of Pr = 1/2. If all examples have x_j=1, then Pr(x_j=0 | y) = 1 / (#examples + 2). If a word appears frequently, the new estimate is only slightly biased. This is an example of a Bayesian prior for Naïve Bayes.
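A sketch of the smoothed estimator on the same made-up binary data set as before, to show that an event never observed in a class still gets non-zero probability.

```python
import numpy as np

def train_naive_bayes_smoothed(X, y):
    """Bernoulli Naive Bayes with Laplace (add-one) smoothing, as on the slide."""
    n1 = (y == 1).sum()
    n0 = (y == 0).sum()
    theta1 = n1 / (n1 + n0)
    # (count of x_j=1 in the class + 1) / (class size + 2)
    theta_j1 = (X[y == 1].sum(axis=0) + 1) / (n1 + 2)
    theta_j0 = (X[y == 0].sum(axis=0) + 1) / (n0 + 2)
    return theta1, theta_j1, theta_j0

# Same tiny data set as before; note feature index 2 is never 0 when y=1,
# yet its smoothed estimate P(x_2=0 | y=1) = 1/(3+2) stays strictly positive.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0])
print(train_naive_bayes_smoothed(X, y))
```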

Example: 20 newsgroups. Given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.space, sci.crypt, sci.electronics, sci.med, talk.politics.guns. Naïve Bayes: 89% classification accuracy (comparable to other state-of-the-art methods).

Multi-class classification. Generally two options: 1. Learn a single classifier that can produce 20 distinct output values. 2. Learn 20 different 1-vs-all binary classifiers. Option 1 assumes you have a multi-class version of the classifier. For Naïve Bayes, compute P(y|x) for each class and select the class with the highest probability. Option 2 applies to any binary classifier, so it is more flexible, but it is often slower (we need to learn many classifiers) and it creates a class imbalance problem (the target class has relatively fewer data points compared to the aggregation of the other classes).
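A minimal sketch of the 1-vs-all strategy (option 2). The binary "classifier" used here is a deliberately trivial nearest-class-mean scorer, just to keep the example self-contained; any binary learner from this lecture (logistic regression, Naïve Bayes, LDA) could be plugged in instead.

```python
import numpy as np

def train_one_vs_all(X, y, n_classes, train_binary):
    # Train one binary classifier per class: class k vs. the rest.
    return [train_binary(X, (y == k).astype(int)) for k in range(n_classes)]

def predict_one_vs_all(models, X, score_binary):
    # Score every example under every binary model and take the most confident class.
    scores = np.column_stack([score_binary(m, X) for m in models])
    return np.argmax(scores, axis=1)

# Toy binary learner: remember the mean of the positive class; score by closeness to it.
def train_binary(X, y01):
    return X[y01 == 1].mean(axis=0)

def score_binary(mean, X):
    return -np.linalg.norm(X - mean, axis=1)

X = np.array([[0.1, 0.0], [2.9, 0.1], [0.2, 3.1], [3.0, 0.2]])
y = np.array([0, 1, 2, 1])
models = train_one_vs_all(X, y, n_classes=3, train_binary=train_binary)
print(predict_one_vs_all(models, X, score_binary))
```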

Best features extracted. [Figure] From the MSc thesis by Jason Rennie: http://qwone.com/~jason/papers/sm-thesis.pdf

Gaussian Naïve Bayes. Extending Naïve Bayes to continuous inputs: P(y) is still assumed to be a binomial distribution; P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ ∈ R^n and covariance matrix Σ ∈ R^{n×n}. If we assume the same Σ for all classes: linear discriminant analysis. If Σ is distinct between classes: quadratic discriminant analysis. If Σ is diagonal (i.e. the features are independent given the class): Gaussian Naïve Bayes. How do we estimate the parameters? Derive the maximum likelihood estimators for μ and Σ.
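A minimal sketch of Gaussian Naïve Bayes under the diagonal-covariance assumption: the maximum likelihood estimates are the per-class means and per-feature variances, and prediction sums per-feature Gaussian log-densities. The synthetic data below is made up for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """MLE for Gaussian Naive Bayes: per-class prior, mean, and diagonal variances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),     # prior P(y=c)
                     Xc.mean(axis=0),       # mean vector mu_c
                     Xc.var(axis=0))        # per-feature variances (diagonal of Sigma_c)
    return params

def predict_gaussian_nb(params, X):
    log_posts = []
    for c, (prior, mu, var) in params.items():
        # log P(y=c) + sum_j log N(x_j; mu_j, var_j), using independence of features given y.
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
        log_posts.append(np.log(prior) + ll)
    classes = np.array(list(params.keys()))
    return classes[np.argmax(np.column_stack(log_posts), axis=1)]

# Synthetic continuous data, two classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], [1, 2], (40, 2)), rng.normal([3, 1], [1, 1], (40, 2))])
y = np.array([0] * 40 + [1] * 40)
params = fit_gaussian_nb(X, y)
print("training accuracy:", (predict_gaussian_nb(params, X) == y).mean())
```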

More generally: Bayesian networks. If not all of the conditional independence relations hold, we can introduce additional arcs to show which dependencies are present. At each node, we store the conditional probability distribution of the corresponding variable given its parents. This more general type of graph, annotated with conditional distributions, is called a Bayesian network. More on this in COMP-652.

What you should know: the Naïve Bayes assumption; the log-odds ratio decision boundary; how to estimate parameters for Naïve Bayes; Laplace smoothing; the relation between Naïve Bayes, LDA, QDA, and Gaussian Naïve Bayes.