Bayesian Classification Methods

Suchit Mehrotra
North Carolina State University
smehrot@ncsu.edu
October 24, 2014

How do you define probability?

There is a fundamental difference between how Frequentists and Bayesians view probability:

Frequentist: Probability is a long-run relative frequency: think of choosing a random point in a Venn diagram and determining the percentage of times the point falls in a certain set as you repeat this process an infinite number of times. Seeks to be objective.

Bayesian: The probability assigned to an event varies from person to person. It reflects the odds at which a person would be willing to bet their own money on the event. Reflects a state of belief.

Conditional Probability and Bayes Rule

The fundamental mechanism through which a Bayesian updates their beliefs about an event is Bayes' rule. Let A and B be events with P(B) > 0.

Conditional probability:

P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \Longrightarrow \quad P(A \cap B) = P(B \mid A)\, P(A)

Bayes' rule:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
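
A minimal numeric check of Bayes' rule in R, with illustrative (hypothetical) probabilities for a rare event A and an informative signal B:

p_A <- 0.01                     # prior probability of the event A (hypothetical)
p_B_given_A <- 0.95             # P(B | A), hypothetical
p_B_given_notA <- 0.05          # P(B | not A), hypothetical
p_B <- p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # law of total probability
p_A_given_B <- p_B_given_A * p_A / p_B                  # Bayes' rule
p_A_given_B                     # roughly 0.16: well above the 0.01 prior, but far from certain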

Posterior Distributions

Let θ be a parameter of interest and y be observed data. The Frequentist approach considers θ to be fixed and the data to be random, whereas Bayesians view θ as a random variable and y as fixed. The posterior distribution for θ is an application of Bayes' rule:

\pi(\theta \mid y) = \frac{f_{Y \mid \theta}(y \mid \theta)\, f_\Theta(\theta)}{f_Y(y)}

f_\Theta(\theta) encapsulates our prior belief about the parameter. Our prior beliefs are then updated by the data we observe. Since θ is treated as a random variable, we can now make probability statements about it.
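
As a concrete illustration of a posterior (an assumed textbook example, not from the slides), the Beta prior is conjugate to the Binomial likelihood, so the update can be done in a few lines of R:

a <- 2; b <- 2                          # Beta(2, 2) prior on a success probability theta (illustrative)
y <- 7; n <- 10                         # observed: 7 successes in 10 trials (illustrative)
a_post <- a + y                         # conjugacy: posterior is Beta(a + y, b + n - y)
b_post <- b + n - y
qbeta(c(0.025, 0.5, 0.975), a_post, b_post)    # posterior median and 95% credible interval
curve(dbeta(x, a_post, b_post), 0, 1)          # posterior density; compare with dbeta(x, a, b)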

Primary Goals

Frequentist: Calculate point estimates of the parameters and use standard errors to construct confidence intervals.

Bayesian: Derive the posterior distribution of the parameter and summarize it (mean, median, mode, quantiles, variance, etc.). Use credible intervals to report ranges that contain the parameter with high probability.

Bayesian Learning

The Bayesian framework provides us with a mathematically rigorous way to update our beliefs as we gather new information. There are no restrictions on what can constitute prior information, so we can use the posterior distribution from time 1 as our prior for time 2:

\pi_1(\theta \mid y_1) \propto f_{Y \mid \theta}(y_1 \mid \theta)\, f_\Theta(\theta)
\pi_2(\theta \mid y_1, y_2) \propto f_{Y \mid \theta}(y_2 \mid \theta)\, \pi_1(\theta \mid y_1)
\vdots
\pi_n(\theta \mid y_1, \ldots, y_n) \propto f_{Y \mid \theta}(y_n \mid \theta)\, \pi_{n-1}(\theta \mid y_1, \ldots, y_{n-1})
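
A quick R sketch (simulated, illustrative data) of this sequential updating in the conjugate Beta-Bernoulli case: updating one observation at a time ends in exactly the same posterior as a single batch update with all the data.

set.seed(1)
y <- rbinom(20, size = 1, prob = 0.3)        # simulated 0/1 observations
a <- 1; b <- 1                               # flat Beta(1, 1) prior
for (yi in y) {                              # the posterior at step t is the prior at step t + 1
  a <- a + yi
  b <- b + (1 - yi)
}
c(a, b)                                      # sequentially updated parameters
c(1 + sum(y), 1 + length(y) - sum(y))        # batch update: identical result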

Overview of Differences [Gill (2008)]

Interpretation of Probability
Frequentist: Observed result from an infinite series of trials performed or imagined under identical conditions. The probabilistic quantity of interest is p(data | H_0).
Bayesian: Probability is the researcher's/observer's degree of belief before or after the data are observed. The probabilistic quantity of interest is p(θ | data).

What is Fixed and Variable
Frequentist: Data are an iid random sample from a continuous stream. Parameters are fixed by nature.
Bayesian: Data are observed and so fixed by the sample generated. Parameters are unknown and described distributionally.

How are Results Summarized?
Frequentist: Point estimates and standard errors; 95% confidence intervals indicating that 19/20 times the interval covers the true parameter value, on average.
Bayesian: Descriptions of posteriors such as means and quantiles; highest posterior density intervals indicating the region of highest probability.

Linear Regression (Frequentist)

Consider the linear model y = Xβ + e, where X is an n × k matrix with rank k, β is a k × 1 vector of coefficients, y is an n × 1 vector of responses, and e is a vector of errors, iid N(0, σ²I). Frequentist regression seeks point estimates by maximizing the likelihood function with respect to β and σ²:

L(\beta, \sigma^2 \mid X, y) = (2\pi\sigma^2)^{-n/2} \exp\left[ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) \right]

The unbiased OLS estimates are:

\hat{\beta} = (X^T X)^{-1} X^T y
\hat{\sigma}^2 = \frac{(y - X\hat{\beta})^T (y - X\hat{\beta})}{n - k}
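
The closed-form estimates above are easy to verify in base R. A minimal sketch on simulated data (names and numbers are illustrative, not from the slides):

set.seed(42)
n <- 100; k <- 3
X <- cbind(1, matrix(rnorm(n * (k - 1)), n, k - 1))   # design matrix with an intercept column
beta_true <- c(1, 2, -1)
y <- drop(X %*% beta_true + rnorm(n, sd = 2))
beta_hat <- solve(t(X) %*% X, t(X) %*% y)             # (X^T X)^{-1} X^T y
sigma2_hat <- sum((y - X %*% beta_hat)^2) / (n - k)   # unbiased estimate of sigma^2
cbind(beta_hat, coef(lm(y ~ X - 1)))                  # matches lm()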

Bayesian Regression

Bayesian regression, on the other hand, finds the posterior distribution of the parameters β and σ²:

\pi(\beta, \sigma^2 \mid X, y) \propto L(\beta, \sigma^2 \mid X, y)\, f_{\beta, \sigma^2}(\beta, \sigma^2)
                               \propto L(\beta, \sigma^2 \mid X, y)\, f_{\beta \mid \sigma^2}(\beta \mid \sigma^2)\, f_{\sigma^2}(\sigma^2)

Priors and posteriors [Gill (2008)]:

Conjugate setup
  Prior: \beta \mid \sigma^2 \sim N(b, \sigma^2), \quad \sigma^2 \sim IG(a, b)
  Posterior: \beta \mid X \sim t(n + a - k), \quad \sigma^2 \mid X \sim IG\!\left(n + a - k, \tfrac{1}{2}\hat{\sigma}^2 (n + a - k)\right)

Uninformative setup
  Prior: f(\beta) \propto c over (-\infty, \infty), \quad f(\sigma^2) \propto 1/\sigma over (0, \infty)
  Posterior: \beta \mid X \sim t(n - k), \quad \sigma^2 \mid X \sim IG\!\left(\tfrac{1}{2}(n - k - 1), \tfrac{1}{2}\hat{\sigma}^2 (n - k)\right)

Additionally, the posterior distribution of β conditioned on σ² in the non-informative case is:

\beta \mid \sigma^2, y \sim N(\hat{\beta}, \sigma^2 V_\beta), \quad V_\beta = (X^T X)^{-1}
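
The last line gives a direct way to simulate from the joint posterior: draw σ² and then β | σ², y from the normal above. A minimal sketch, reusing X, y, n, k, beta_hat and sigma2_hat from the simulated OLS example above, and assuming the common reference prior p(β, σ²) ∝ 1/σ² (which differs slightly in degrees of freedom from the 1/σ prior in the table):

S <- 5000
V_beta <- solve(t(X) %*% X)
sigma2_draws <- (n - k) * sigma2_hat / rchisq(S, df = n - k)   # sigma^2 | y: scaled inverse chi-square
beta_draws <- t(sapply(sigma2_draws, function(s2)              # beta | sigma^2, y ~ N(beta_hat, s2 * V_beta)
  beta_hat + t(chol(s2 * V_beta)) %*% rnorm(k)))
apply(beta_draws, 2, quantile, probs = c(0.025, 0.975))        # 95% credible intervals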

Example: Median Home Prices

From the MASS library: data on 506 housing values in suburbs of Boston. Fit a model with medv (median value of owner-occupied homes) as the response and four predictors. Implemented with the blinreg function in the LearnBayes package.

boston.fit <- lm(medv ~ crim + indus + chas + rm, data = Boston, y = TRUE, x = TRUE)
theta.sample <- blinreg(boston.fit$y, boston.fit$x, 100000)
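
Assuming, as the LearnBayes documentation describes, that blinreg returns a list of posterior draws with components beta (a matrix, one column per coefficient) and sigma (a vector), the summaries shown on the next two slides can be read off directly:

apply(theta.sample$beta, 2, quantile, probs = c(0.025, 0.5, 0.975))   # credible intervals for the coefficients
quantile(theta.sample$sigma, probs = c(0.025, 0.5, 0.975))            # credible interval for sigma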

Example: Posterior Samples of Parameters

[Figure: histograms of the posterior samples for β1 (crime), β2 (industry), β3 (Charles River), β4 (rooms), and the error standard deviation σ.]

Example: Confidence Intervals for Parameters

Bayesian regression with non-informative priors leads to estimates similar to OLS.

OLS confidence intervals:
                2.5 %   97.5 %
(Intercept)    -26.43   -15.29
crim            -0.26    -0.12
indus           -0.35    -0.18
chas             2.48     6.65
rm               6.62     8.25

Bayesian credible intervals:
         Intercept    crim   indus    chas      rm
2.5%        -26.42   -0.26   -0.35    2.48    6.62
97.5%       -15.30   -0.12   -0.18    6.65    8.25

Logistic Regression

Like linear regression, logistic regression is a likelihood maximization problem in the frequentist setup. With two parameters, the log-likelihood function is:

l(\beta_0, \beta_1 \mid y) = \beta_0 \sum_{i=1}^{n} y_i + \beta_1 \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} \log\!\left(1 + e^{\beta_0 + \beta_1 x_i}\right)

In the Bayesian setting, we incorporate prior information and find the posterior distribution of the parameters:

\pi(\beta \mid X, y) \propto L(\beta \mid X, y)\, f_\beta(\beta)

The standard weakly informative prior used in the arm package is the Student-t distribution with 1 degree of freedom (the Cauchy distribution).
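
A minimal R sketch of this two-parameter log-likelihood, using simulated (illustrative) data and maximizing it numerically; the result matches glm:

set.seed(7)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * x))   # simulated binary responses
loglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  sum(y * eta - log(1 + exp(eta)))   # equals beta0*sum(y) + beta1*sum(x*y) - sum(log(1 + e^eta))
}
optim(c(0, 0), function(b) -loglik(b))$par   # numerical MLE
coef(glm(y ~ x, family = binomial))          # essentially the same estimates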

Logistic Regression: Example with Cancer Data Set

Goal: run logistic regression on the cancer data set from the University of Wisconsin using the bayesglm function in the arm package.

cancer.bayesglm <- bayesglm(malignant ~ ., data = cancer, family = binomial(link = "logit"))

Using this function, you can get point estimates (posterior modes) for your parameters and their standard errors. Without substantive prior information, these estimates will match the output of glm in most cases.

Some estimates from bayesglm:
                 Estimate   Std. Error
(Intercept)       -8.2409       1.2842
id                 0.0000       0.0000
thickness          0.4522       0.1270
unif.size          0.0063       0.1676
unif.shape         0.3642       0.1876
adhesion           0.2002       0.1067
cell.size          0.1460       0.1458
bare.nuclei1      -1.1499       0.8526

Logistic Regression: Example with Cancer Data Set

[Figure: malignant (vertical axis, 0 to 1) plotted against thickness (horizontal axis, roughly 2.5 to 10).]

Logistic Regression: Separation

Separation occurs when a particular linear combination of predictors is always associated with a particular response. It can lead to maximum likelihood estimates of ±∞ for some parameters! A common fix is to drop predictors from the model, which can lead to the best predictors being dropped [Zorn (2005)].

A 2008 paper by Gelman, Jakulin, Pittau & Su proposes a Bayesian fix: use the t-distribution as a weakly informative prior. They created bayesglm in the arm (Applied Regression and Multilevel Modeling) package with the t-distribution as the default prior. Using bayesglm instead of glm can solve this problem!
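
The next three slides illustrate this. As a hedged, self-contained sketch (with constructed, perfectly separated data rather than the data used in the slides), glm produces enormous, unstable estimates while bayesglm with its default Cauchy prior returns finite ones:

library(arm)
x <- c(seq(-2, 0, length.out = 20), seq(2, 6, length.out = 20))
y <- rep(c(0, 1), each = 20)                       # y is 1 exactly when x > 1: complete separation
coef(summary(glm(y ~ x, family = binomial)))       # huge estimates and standard errors
coef(summary(bayesglm(y ~ x, family = binomial)))  # finite, regularized estimates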

Logistic Regression: Separation Example

[Figure: binary response y (0/1) plotted against x over roughly −2 to 6, showing complete separation.]

Logistic Regression: Separation Example

This problem is fixed using bayesglm.

[Figure: the same y versus x plot, refit with bayesglm.]

Logistic Regression: Separation Example

glm fit:
               Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)   -164.0925   69589.0320     -0.00     0.9981
x               75.6372   32263.2265      0.00     0.9981

bayesglm fit:
               Estimate   Std. Error
(Intercept)     -4.0144       0.9054
x                1.6192       0.3867

Naive Bayes: Overview

Like Linear Discriminant Analysis, Naive Bayes uses Bayes' rule to determine the probability that a particular outcome is in class C_l:

P(Y = C_l \mid X) = \frac{P(X \mid Y = C_l)\, P(Y = C_l)}{P(X)}

P(Y = C_l | X) is the posterior probability of the class given a set of predictors.
P(Y = C_l) is the prior probability of the outcome.
P(X) is the probability of the predictors.
P(X | Y = C_l) is the probability of observing a set of predictors given that Y belongs to class l.

Naive Bayes: Independence Assumption

P(X | Y = C_l) is extremely difficult to calculate without a strong set of assumptions. The Naive Bayes model simplifies this calculation with the assumption that all predictors are mutually independent given Y:

P(Y = C_l \mid X_1, \ldots, X_n) = \frac{P(Y = C_l)\, P(X_1, \ldots, X_n \mid Y = C_l)}{P(X_1, \ldots, X_n)}
  \propto P(Y = C_l)\, P(X_1, \ldots, X_n \mid Y = C_l)
  = P(Y = C_l)\, P(X_1 \mid Y = C_l) \cdots P(X_n \mid Y = C_l)
  = P(Y = C_l) \prod_{i=1}^{n} P(X_i \mid Y = C_l)

Naive Bayes: Class Probabilities and Assignment

The classifier assigns the class with the highest posterior probability:

\arg\max_{C_l} \; P(Y = C_l) \prod_{i=1}^{n} P(X_i = x_i \mid Y = C_l)

We need to assume prior probabilities for the classes. There are two basic options:

Equiprobable classes: P(C_l) = 1 / (number of classes)
Calculation from the training set: P(C_l) = (number of samples in class l) / (total number of samples)

A third option is to use information based on prior beliefs about the probability that Y = C_l.

Naive Bayes: Distributional Assumptions

Even though the independence assumption simplifies the classifier, we still need to make assumptions about the predictors. For continuous predictors, the class probabilities are calculated using a Normal distribution by default. The classifier estimates µ_l and σ_l for each predictor:

P(X = x \mid Y = C_l) = \frac{1}{\sigma_l \sqrt{2\pi}} \exp\left[ -\frac{1}{2\sigma_l^2} (x - \mu_l)^2 \right]

Other choices include:
Multinomial
Bernoulli
Semi-supervised methods
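
To make the Gaussian case concrete, here is a hedged, hand-rolled sketch on the built-in iris data (purely illustrative; the klaR package used on the next slides does this, and more, for you). Each class gets a per-predictor mean and standard deviation, and classification picks the largest log prior plus summed log densities:

X <- iris[, 1:4]; y <- iris$Species
prior <- table(y) / length(y)                    # class priors estimated from the training data
means <- aggregate(X, list(class = y), mean)     # per-class, per-predictor means
sds   <- aggregate(X, list(class = y), sd)       # per-class, per-predictor standard deviations
classify <- function(xnew) {
  scores <- sapply(levels(y), function(cl) {
    m <- unlist(means[means$class == cl, -1])
    s <- unlist(sds[sds$class == cl, -1])
    as.numeric(log(prior[cl])) + sum(dnorm(as.numeric(xnew), m, s, log = TRUE))
  })
  names(which.max(scores))                       # class with the largest posterior score
}
predicted <- apply(X, 1, classify)
table(predicted, y)                              # training-set confusion matrix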

Naive Bayes: Simple Example

Classify the iris data set using the NaiveBayes function in the klaR package.

[Figure: pairwise scatterplot matrix of the iris data (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width).]
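
The call behind this example looks roughly like the following sketch, assuming klaR's formula interface and that predict returns a list with a class component:

library(klaR)
iris.nb <- NaiveBayes(Species ~ ., data = iris)   # Gaussian class-conditional densities by default
iris.pred <- predict(iris.nb, iris[, -5])
table(iris.pred$class, iris$Species)              # confusion matrix, as on the next slide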

Naive Bayes: Simple Example

[Figure: estimated class-conditional densities of Petal.Length and Petal.Width for the three species.]

Confusion matrix:
             setosa   versicolor   virginica
setosa           50            0           0
versicolor        0           47           3
virginica         0            3          47

Naive Bayes: Non-Parametric Estimation

cancer.nb <- NaiveBayes(malignant ~ ., data = cancer, usekernel = TRUE)

usekernel = TRUE (97% correct):
        0     1
0     445     9
1      12   232

usekernel = FALSE (96% correct):
        0     1
0     434     6
1      23   235

[Figure: kernel density estimate of thickness.]

Naive Bayes: Priors

Bayesian approaches allow for the inclusion of prior information into the model. In this context, priors are applied to the probability of belonging to a group.

P(0) = 0.1, P(1) = 0.9:
        0     1
0     442     3
1      15   238

P(0) = 0.25, P(1) = 0.75:
        0     1
0     443     4
1      14   237

P(0) = 0.5, P(1) = 0.5:
        0     1
0     444     7
1      13   234

P(0) = 0.9, P(1) = 0.1:
        0     1
0     445    18
1      12   223
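
Fits like these can be produced by passing class priors to the classifier. A sketch, assuming klaR's NaiveBayes accepts a prior argument ordered by factor level and reusing the cancer data frame from the earlier slides; treat the argument handling as an assumption, not a documented certainty:

cancer.nb.90 <- NaiveBayes(malignant ~ ., data = cancer,
                           usekernel = TRUE, prior = c(0.1, 0.9))   # P(0) = 0.1, P(1) = 0.9
pred.90 <- predict(cancer.nb.90, cancer[, setdiff(names(cancer), "malignant")])
table(pred.90$class, cancer$malignant)                              # confusion matrix under this prior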

Naive Bayes: Sepsis Example

Fit a Naive Bayes model to data on whether or not a child develops sepsis, based on three predictors. Data from research done at Kaiser Permanente (courtesy of Dr. David Draper at UCSC).

The response is whether or not a child developed sepsis (66,846 observations). This is an extremely bad outcome: affected children either die or go into a permanent coma.
Two predictors, ANC and I2T, are derived from a blood test called the Complete Blood Count.
The third predictor is the baby's age in hours at which the blood test results were obtained.

In this scenario, the goal is to predict accurately which children will get sepsis; the overall classification rate is likely not as important.

Naive Bayes: Sepsis Example

[Figure: scatterplots of I2T versus ANC in three panels by age at test: Age <= 1, 1 < Age <= 4, and 4 < Age.]

Naive Bayes: Sepsis Example

As the confusion matrices below show, the overall classification rate of around 98.6% does not make up for the fact that we correctly classify only about 11% of the sepsis cases.

usekernel = FALSE:
          0       1
0     65887     216
1       715      28

usekernel = TRUE:
          0       1
0     66589     239
1        13       5

Naive Bayes: Pros and Cons

Advantages:
Due to the independence assumption, there is no curse of dimensionality; we can fit models with p > n without any issues.
The independence assumption simplifies the math and speeds up the model fitting process.
We can include prior information to increase accuracy.
We have the option of not making distributional assumptions about each predictor.

Disadvantages:
If the independence assumption does not hold, this model can perform poorly.
The non-parametric approach can overfit the data.
Our prior information could be wrong.

Logistic Regression with Sepsis Data

How does logistic regression fare with the Sepsis data?

Cutoff: 0.1, % correct: 99.51%
          0       1
0     66510     233
1        92      11

Cutoff: 0.05, % correct: 99.14%
          0       1
0     66245     217
1       357      27

Cutoff: 0.025, % correct: 98.20%
          0       1
0     65582     180
1      1020      64

Cutoff: 0.01, % correct: 94.91%
          0       1
0     63339     135
1      3263     109
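
These matrices come from thresholding predicted probabilities at different cutoffs. A sketch of how they could be produced, assuming a fitted logistic model named sepsis.glm and a 0/1 response column sepsis$sepsis; both names are hypothetical, since the data are not public:

p_hat <- predict(sepsis.glm, type = "response")        # fitted probabilities of sepsis
for (cutoff in c(0.1, 0.05, 0.025, 0.01)) {
  pred <- as.numeric(p_hat > cutoff)
  cat("Cutoff:", cutoff, " % correct:", round(100 * mean(pred == sepsis$sepsis), 2), "\n")
  print(table(predicted = pred, observed = sepsis$sepsis))
}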

The End

Questions?