Parametric Density Estimation: Bayesian Estimation. Naïve Bayes Classifier


Bayesian Parameter Estimation

Suppose we have some idea of the range where the parameters θ should be. Shouldn't we formalize such prior knowledge, in hopes that it will lead to better parameter estimation? Let θ be a random variable with prior distribution P(θ). This is the key difference between ML and Bayesian parameter estimation. This key assumption allows us to fully exploit the information provided by the data.

Bayesian Parameter Estimation

θ is a random variable with prior p(θ). Unlike the MLE case, p(x|θ) is a conditional density. The training data D allow us to convert p(θ) into a posterior probability density p(θ|D): after we observe the data D, we can compute the posterior p(θ|D) using Bayes rule. But θ is not our final goal; our final goal is the unknown p(x). Therefore a better thing to do is to compute p(x|D), which is as close as we can come to the unknown p(x)!

Bayesian Estimation: Formula for p(x|D)

From the definition of the joint distribution:

    p(x|D) = ∫ p(x, θ|D) dθ

Using the definition of conditional probability:

    p(x|D) = ∫ p(x|θ, D) p(θ|D) dθ

But p(x|θ, D) = p(x|θ), since p(x|θ) is completely specified by θ. Using Bayes formula,

    p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ,   with   p(D|θ) = ∏_{i=1..n} p(x_i|θ)

So the unknown p(x|D) is expressed through the known, fully specified p(x|θ):

    p(x|D) = ∫ p(x|θ) p(θ|D) dθ

Bayesian Estimation vs. MLE

So in principle p(x|D) can be computed. In practice, the integration may be hard to do analytically, and we may have to resort to numerical methods:

    p(x|D) = ∫ p(x|θ) ∏_{i=1..n} p(x_i|θ) p(θ) dθ / ∫ ∏_{i=1..n} p(x_i|θ) p(θ) dθ

Contrast this with the MLE solution, which requires only differentiation of the likelihood to get p(x|θ̂). Differentiation is easy and can always be done analytically.
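
Since the integral usually has no closed form, one common workaround is numerical integration over a grid of θ values. Below is a minimal sketch of this idea (not from the lecture; the function and variable names are illustrative), assuming a one-dimensional θ:

    import numpy as np

    def predictive_density(x, data, theta_grid, prior_pdf, likelihood_pdf):
        # Unnormalized log-posterior log p(D|theta) + log p(theta) on the grid.
        log_post = np.array([np.sum(np.log(likelihood_pdf(data, t)))
                             for t in theta_grid]) + np.log(prior_pdf(theta_grid))
        post = np.exp(log_post - log_post.max())      # subtract max to avoid underflow
        post /= np.trapz(post, theta_grid)            # normalize p(theta|D)
        # p(x|D) = integral of p(x|theta) p(theta|D) dtheta, done numerically.
        return np.trapz(likelihood_pdf(x, theta_grid) * post, theta_grid)

For the Gaussian case on the following slides, likelihood_pdf(x, t) could be, e.g., scipy.stats.norm.pdf(x, t, sigma); the grid just needs to cover the region where the posterior has appreciable mass.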

Bayesian Estimation vs. MLE

    p(x|D) = ∫ p(x|θ) p(θ|D) dθ

Here p(x|θ) is the proposed model with a certain θ, and p(θ|D) is the support θ receives from the data. The above equation implies that if we are less certain about the exact value of θ, we should consider a weighted average of p(x|θ) over the possible values of θ. Contrast this with the MLE solution, which always gives us a single model: p(x|θ̂).

Bayesian Estimation for Gaussian with unknown μ

Let p(x|μ) be N(μ, σ²); that is, σ² is known, but μ is unknown and needs to be estimated, so θ = μ. Assume a prior over μ:

    p(μ) ~ N(μ₀, σ₀²)

Here μ₀ encodes some prior knowledge about the true mean, while σ₀² measures our prior uncertainty.

Bayesian Estimation for Gaussian with unknown μ

The posterior distribution is:

    p(μ|D) ∝ p(D|μ) p(μ)
           = α′ exp[ −(1/2) ( Σ_{i=1..n} ((x_i − μ)/σ)² + ((μ − μ₀)/σ₀)² ) ]
           = α″ exp[ −(1/2) ( (n/σ² + 1/σ₀²) μ² − 2 ( (1/σ²) Σ_i x_i + μ₀/σ₀² ) μ ) ]

where factors that do not depend on μ have been absorbed into the constants α′ and α″. p(μ|D) is the exponential of a quadratic function of μ, i.e. it is a normal density, and it remains normal for any number n of training samples. If we write

    p(μ|D) = (1/(√(2π) σ_n)) exp[ −(1/2) ((μ − μ_n)/σ_n)² ]

then, identifying the coefficients, we get

    1/σ_n² = n/σ² + 1/σ₀²   and   μ_n/σ_n² = (n/σ²) μ̂_n + μ₀/σ₀²

where μ̂_n = (1/n) Σ_i x_i is the sample mean.

Bayesian Estimation for Gaussian with unknown μ

Solving explicitly for μ_n and σ_n², we obtain:

    μ_n = ( nσ₀² / (nσ₀² + σ²) ) μ̂_n + ( σ² / (nσ₀² + σ²) ) μ₀

our best guess for μ after observing n samples, and

    σ_n² = σ₀² σ² / (nσ₀² + σ²)

the uncertainty about the guess, which decreases monotonically with n.
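
As a quick illustration (a sketch, not part of the lecture), these closed-form updates are one line each in code:

    import numpy as np

    def gaussian_mean_posterior(x, sigma2, mu0, sigma02):
        # Posterior over the unknown mean is N(mu_n, sigma_n2), given known
        # noise variance sigma2 and prior N(mu0, sigma02).
        n, xbar = len(x), np.mean(x)          # xbar is the sample mean (the MLE)
        mu_n = (n * sigma02 * xbar + sigma2 * mu0) / (n * sigma02 + sigma2)
        sigma_n2 = (sigma02 * sigma2) / (n * sigma02 + sigma2)
        return mu_n, sigma_n2

Note that as n grows, sigma_n2 → 0 and mu_n → xbar, matching the behavior described on the next slide.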

Bayesian Estimation for Gaussian with unknown μ

Each additional observation decreases our uncertainty about the true value of μ. As n increases, p(μ|D) becomes more and more sharply peaked, approaching a Dirac delta function as n approaches infinity. This behavior is known as Bayesian learning.

Bayesian Estimation for Gaussian with unknown μ

    μ_n = ( nσ₀² / (nσ₀² + σ²) ) μ̂_n + ( σ² / (nσ₀² + σ²) ) μ₀

In general, μ_n is a linear combination of the sample mean μ̂_n and the prior mean μ₀, with coefficients that are non-negative and sum to 1. Thus μ_n lies somewhere between μ̂_n and μ₀.
If σ₀ ≠ 0, then μ_n → μ̂_n as n → ∞.
If σ₀ = 0, our a priori certainty that μ = μ₀ is so strong that no number of observations can change our opinion.
If the a priori guess is very uncertain (σ₀ is large), we take μ_n ≈ μ̂_n.

Bayesian Estimation: Example for U[0,θ]

Let X be U[0,θ]. Recall that p(x|θ) = 1/θ inside [0,θ], and 0 elsewhere. Suppose we assume a U[0,10] prior on θ:

    p(θ) = 1/10 for 0 ≤ θ ≤ 10

This is a good prior to use if we just know the range of θ but don't know anything else.

Bayesian Estimation: Example for U[0,θ]

We need to compute

    p(x|D) = ∫ p(x|θ) p(θ|D) dθ

using

    p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ   and   p(D|θ) = ∏_{i=1..n} p(x_i|θ)

When computing the MLE of θ, we had

    p(D|θ) = 1/θⁿ for θ ≥ max{x₁,…,x_n}, and 0 otherwise.

Thus

    p(θ|D) = c/θⁿ for max{x₁,…,x_n} ≤ θ ≤ 10, and 0 otherwise,

where c is the normalizing constant, i.e. c = 1 / ∫ from max{x₁,…,x_n} to 10 of (1/θⁿ) dθ.
[Figure: p(θ) = 1/10 is flat on [0,10]; p(θ|D) ∝ 1/θⁿ is concentrated just above max{x₁,…,x_n}.]

Bayesian Estimation: Example for U[0,θ]

We need to compute p(x|D) = ∫ p(x|θ) p(θ|D) dθ, with p(x|θ) = 1/θ on [0,θ] and p(θ|D) = c/θⁿ on [max{x₁,…,x_n}, 10]. We have 2 cases:

1. Case x ≤ max{x₁, x₂,…,x_n}:
    p(x|D) = ∫ from max{x₁,…,x_n} to 10 of c/θⁿ⁺¹ dθ = a constant independent of x.
2. Case x > max{x₁, x₂,…,x_n} (and x ≤ 10):
    p(x|D) = ∫ from x to 10 of c/θⁿ⁺¹ dθ = (c/n)(1/xⁿ − 1/10ⁿ), which decreases as x grows.
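
A small sketch of the two cases in code (assuming n > 1 so the normalizer below has this closed form; names are illustrative):

    def uniform_predictive(x, data, prior_upper=10.0):
        n, M = len(data), max(data)
        # Normalizer of the posterior p(theta|D) = c / theta^n on [M, prior_upper].
        c = (n - 1) / (M**(1 - n) - prior_upper**(1 - n))
        if x < 0 or x > prior_upper:
            return 0.0
        lo = max(x, M)      # case 1: x <= M gives a constant; case 2: decays with x
        return (c / n) * (lo**(-n) - prior_upper**(-n))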

Bayesian Estimation: Example for U[0,θ]

[Figure: the ML density p(x|θ̂) is uniform on [0, max{x₁,…,x_n}], while the Bayes density p(x|D) is flat up to max{x₁,…,x_n} and then decays toward 10.]

Note that even for x > max{x₁, x₂,…,x_n}, the Bayes density is not zero, which makes sense. A curious fact: the Bayes density is not uniform, i.e. it does not have the functional form that we have assumed!

ML vs. Bayesian Estimation with Broad Prior

Suppose p(θ) is flat and broad (close to a uniform prior). p(θ|D) tends to sharpen if there is a lot of data:

    p(θ|D) ∝ p(D|θ) p(θ)

Thus p(D|θ) p(θ) will have the same sharp peak as p(θ|D). But by definition, the peak of p(D|θ) is the ML estimate θ̂. The integral is then dominated by the peak:

    p(x|D) = ∫ p(x|θ) p(θ|D) dθ ≈ ∫ p(x|θ̂) p(θ|D) dθ = p(x|θ̂)

Thus as n goes to infinity, the Bayesian estimate will approach the density corresponding to the MLE!

ML vs. Bayesian Estimation

Number of training data: the two methods are equivalent assuming an infinite number of training data (and prior distributions that do not exclude the true solution). For small training data sets, they give different results in most cases.
Computational complexity: ML uses differential calculus or gradient search for maximizing the likelihood; Bayesian estimation requires complex multidimensional integration techniques.

ML vs. Bayesian Estimation

Solution complexity: ML solutions are easier to interpret (they must be of the assumed parametric form). A Bayesian estimation solution might not be of the parametric form assumed; it is harder to interpret, since it returns a weighted average of models.
Prior distribution: if the prior distribution p(θ) is uniform, Bayesian estimation solutions are equivalent to ML solutions.

Naïve Bayes Classifier

Unbiased Learning of Bayes Classifiers is Impractical

Learn a Bayes classifier by estimating P(X|Y) and P(Y). Assume Y is boolean and X is a vector of n boolean attributes. In this case, we need to estimate a set of parameters

    θ_ij = P(X = x_i | Y = y_j)

where X takes on 2ⁿ possible values and Y takes on 2 possible values. How many parameters? For a particular value y_j, and the 2ⁿ possible values of x_i, we need to compute 2ⁿ − 1 independent parameters. Given the two possible values for Y, we must estimate a total of 2(2ⁿ − 1) such parameters. A complex model means high variance with limited data!!!

Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z:

    ∀ i, j, k   P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning). Note that in general Thunder is not independent of Rain, but it is given Lightning.
Equivalently: P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z).

Derivation of the Naive Bayes Algorithm

The Naive Bayes algorithm assumes that the attributes X₁,…,X_n are all conditionally independent of one another given Y. This dramatically simplifies the representation of P(X|Y) and the estimation of P(X|Y) from the training data. Consider X = (X₁, X₂):

    P(X|Y) = P(X₁, X₂ | Y) = P(X₁ | X₂, Y) P(X₂ | Y) = P(X₁ | Y) P(X₂ | Y)

For X containing n attributes:

    P(X₁,…,X_n | Y) = ∏_{i=1..n} P(X_i | Y)

Given boolean X_i and Y, we now need only 2n parameters to define P(X|Y), a dramatic reduction compared to the 2(2ⁿ − 1) parameters needed if we make no conditional independence assumption.
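
To get a feel for the size of this reduction, here is a quick arithmetic check (n = 30 is an illustrative choice, not from the slides):

    n = 30                    # number of boolean attributes (illustrative)
    print(2 * (2**n - 1))     # 2147483646 parameters without the NB assumption
    print(2 * n)              # 60 parameters with it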

The Naïve Bayes Classifier

Given: the prior P(Y); n conditionally independent features X₁,…,X_n given the class Y; and for each X_i, the likelihood P(X_i | Y).
The probability that Y takes on its k-th possible value is

    P(Y = y_k | X₁,…,X_n) ∝ P(Y = y_k) ∏_i P(X_i | Y = y_k)

The decision rule:

    y* = argmax over y_k of P(Y = y_k) ∏_i P(X_i | Y = y_k)

If the assumption holds, NB is the optimal classifier!
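
The decision rule is only a few lines of code. This is a minimal sketch (the data structures are assumptions, not from the slides): priors maps each class to P(y), and likelihoods[y][i] maps the value of attribute X_i to P(X_i = value | y). Logs are used so that a product of many small probabilities does not underflow.

    import math

    def nb_classify(x, priors, likelihoods):
        # Score each class with log P(y) + sum_i log P(x_i | y); return the argmax.
        scores = {y: math.log(p_y) + sum(math.log(likelihoods[y][i][xi])
                                         for i, xi in enumerate(x))
                  for y, p_y in priors.items()}
        return max(scores, key=scores.get)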

Naïve Bayes for Discrete Inputs

Given n attributes X_i, each taking on J possible discrete values, and a discrete variable Y taking on K possible values.
MLE for the likelihood P(X_i = x_ij | Y = y_k), given a set of training examples D:

    P̂(X_i = x_ij | Y = y_k) = #D{X_i = x_ij ∧ Y = y_k} / #D{Y = y_k}

where the #D{x} operator returns the number of elements in the set D that satisfy property x.
MLE for the prior:

    P̂(Y = y_k) = #D{Y = y_k} / |D|

where |D| is the number of elements in the training set D.

NB Example

Given the training data (a table of 14 labeled PlayTennis examples; the table itself did not survive transcription), classify the following novel instance:

    (Outlook = sunny, Temp = cool, Humidity = high, Wind = strong)

NB Example

    y_NB = argmax over y ∈ {yes, no} of P(y) P(Outlook=sunny|y) P(Temp=cool|y) P(Humidity=high|y) P(Wind=strong|y)

Priors:

    P(PlayTennis = yes) = 9/14 = 0.64,   P(PlayTennis = no) = 5/14 = 0.36

Conditional probabilities, e.g. for Wind = strong:

    P(Wind=strong | PlayTennis=yes) = 3/9 = 0.33,   P(Wind=strong | PlayTennis=no) = 3/5 = 0.6

Products:

    P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
    P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.02

So the classifier answers PlayTennis = no.
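
A numeric check of this example: the priors and the Wind = strong conditionals are quoted above; the remaining conditionals are the standard values from the PlayTennis table, filled in here as an assumption since the table did not survive transcription.

    p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)   # ~ 0.0053
    p_no  = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)   # ~ 0.0206
    print("yes" if p_yes > p_no else "no")           # -> no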

Subtleties of the NB Classifier (1): Violating the NB Assumption

Usually, features are not conditionally independent. Nonetheless, NB often performs well even when the assumption is violated. [Domingos & Pazzani '96] discuss some conditions for good performance.

Subtleties of the NB Classifier (2): Insufficient Training Data

What if you never see a training instance where X₁ = a when Y = b? Then P̂(X₁ = a | Y = b) = 0, and thus, no matter what values X₂,…,X_n take:

    P̂(Y = b | X₁ = a, X₂,…,X_n) = 0

Solution?

Subtleties of the NB Classifier (2): Insufficient Training Data

To avoid this, use a smoothed estimate, which effectively adds a number of additional "hallucinated" examples and assumes these hallucinated examples are spread evenly over the possible values of X_i. The smoothed estimates are given by

    P̂(X_i = x_ij | Y = y_k) = ( #D{X_i = x_ij ∧ Y = y_k} + l ) / ( #D{Y = y_k} + lJ )
    P̂(Y = y_k) = ( #D{Y = y_k} + l ) / ( |D| + lK )

where l, the number of hallucinated examples, determines the strength of the smoothing. If l = 1, this is called Laplace smoothing.
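
A sketch of the smoothed estimate in code (assuming the training data is a list of (x, y) pairs with x a tuple of discrete attribute values; setting l = 0 recovers the plain MLE):

    from collections import Counter

    def smoothed_likelihood(data, i, J, l=1):
        # P(X_i = x | Y = y) with l hallucinated examples per attribute value.
        joint = Counter((x[i], y) for x, y in data)   # #D{X_i = x and Y = y}
        marg = Counter(y for _, y in data)            # #D{Y = y}
        return lambda x, y: (joint[(x, y)] + l) / (marg[y] + l * J)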

Naive Bayes for Continuous Inputs

When the X_i are continuous, we must choose some other way to represent the distributions P(X_i | Y). One common approach is to assume that for each possible discrete value y_k of Y, the distribution of each continuous X_i is Gaussian. In order to train such a Naïve Bayes classifier, we must estimate the mean and standard deviation of each of these Gaussians.

Naive Bayes for Continuous Inputs

MLE for the means:

    μ̂_ik = ( 1 / Σ_j δ(Yʲ = y_k) ) Σ_j X_iʲ δ(Yʲ = y_k)

where j refers to the j-th training example, and δ(Y = y_k) is 1 if Y = y_k and 0 otherwise. Note that the role of δ is to select only those training examples for which Y = y_k.
MLE for the standard deviation:

    σ̂_ik² = ( 1 / Σ_j δ(Yʲ = y_k) ) Σ_j ( X_iʲ − μ̂_ik )² δ(Yʲ = y_k)
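
These estimators are simple per-class averages, as the sketch below shows (numpy conventions assumed; X is an examples-by-features array, y the label vector):

    import numpy as np

    def gaussian_nb_fit(X, y):
        # For each class k, the boolean mask plays the role of delta(Y^j = y_k).
        params = {}
        for k in np.unique(y):
            mask = (y == k)
            params[k] = (X[mask].mean(axis=0),   # MLE mean per attribute
                         X[mask].std(axis=0))    # MLE std (1/N normalizer)
        return params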

Learning to Classify Text

Applications: learn which news articles are of interest; learn to classify web pages by topic. Naïve Bayes is among the most effective algorithms for this task.
Target concept Interesting?: Document → {+, −}
1. Represent each document by a vector of words: one attribute per word position in the document.
2. Learning: use the training examples to estimate P(+), P(−), P(doc|+), and P(doc|−).

Text Classification Example

Text: "Text Classification, or the task of automatically assigning semantic categories to natural language text, has become one of the key methods for organizing online information. Since hand-coding classification rules is costly or even impractical, most modern approaches employ machine learning techniques to automatically learn text classifiers from examples." The text contains 48 words.
Text representation: (a₁ = 'text', a₂ = 'classification', …, a₄₈ = 'examples'). The representation contains 48 attributes.
Note: the text size may vary, but this will not cause a problem.

NB Conditional Independence Assumption

    P(doc | y) = ∏_{i=1..length(doc)} P(a_i = w_k | y)

where P(a_i = w_k | y) is the probability that the word in position i is w_k, the k-th word in the English vocabulary, given class y.
The NB assumption is that the word probabilities for one text position are independent of the words in other positions, given the document classification. This is clearly not true: the probability of the word 'learning' may be greater if the preceding word is 'machine'. The assumption is necessary, however; without it the number of probability terms is prohibitive. NB performs remarkably well despite the incorrectness of the assumption.

Text Classification Example (continued)

With the text above represented as (a₁ = 'text', a₂ = 'classification', …, a₄₈ = 'examples'), classification is:

    y* = argmax over y ∈ {+, −} of P(y) ∏_i P(a_i | y)
       = argmax over y ∈ {+, −} of P(y) P(a₁ = 'text' | y) ⋯ P(a₄₈ = 'examples' | y)

Estimating the Likelihood

Estimating P(a_i = w_k | y) is problematic because we need to estimate it for each combination of text position, English word, and target value: 48 × 50,000 × 2 ≈ 5 million such terms.
An assumption that reduces the number of terms is the Bag of Words Model: the probability of encountering a specific word w_k is independent of the specific word position,

    P(a_i = w_k | y) = P(a_m = w_k | y)   ∀ i, m

Instead of estimating P(a₁ = w_k | y), P(a₂ = w_k | y), …, we estimate a single term P(w_k | y). Now we have 50,000 × 2 distinct terms.

Estimating the Likelihood

The estimate for the likelihood is

    P(w_k | y) = ( n_k + 1 ) / ( n + |Vocabulary| )

where
    n — the total number of word positions in all training examples whose target value is y,
    n_k — the number of times the word w_k is found among these n word positions,
    |Vocabulary| — the total number of distinct words found within the training data.

Learn_Naive_Bayes_Text(Examples, V)

1. Collect all words and other tokens that occur in Examples:
   Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required P(y) and P(w_k | y) terms. For each target value y in V do:
   - docs_y ← the subset of Examples for which the target value is y
   - P(y) ← |docs_y| / |Examples|
   - Text_y ← a single document created by concatenating all members of docs_y
   - n ← the total number of words in Text_y (counting duplicate words multiple times)
   - For each word w_k in Vocabulary:
     * n_k ← the number of times the word w_k occurs in Text_y
     * P(w_k | y) ← ( n_k + 1 ) / ( n + |Vocabulary| )
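
A direct Python transcription of this pseudocode (a sketch; documents are assumed to be pre-tokenized lists of words):

    from collections import Counter

    def learn_naive_bayes_text(examples):
        # examples: list of (tokens, label) pairs.
        vocabulary = {w for tokens, _ in examples for w in tokens}
        priors, cond = {}, {}
        for y in {label for _, label in examples}:
            docs_y = [tokens for tokens, label in examples if label == y]
            priors[y] = len(docs_y) / len(examples)
            text_y = [w for tokens in docs_y for w in tokens]   # concatenation
            n, counts = len(text_y), Counter(text_y)
            cond[y] = {w: (counts[w] + 1) / (n + len(vocabulary))
                       for w in vocabulary}
        return vocabulary, priors, cond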

Classify_Naive_Bayes_Text(Doc)

- positions ← all word positions in Doc that contain tokens found in Vocabulary
- Return

    y* = argmax over y ∈ {+, −} of P(y) ∏ over i in positions of P(a_i | y)
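
And the matching classification step as a sketch, again using logs so that the product over many word positions does not underflow:

    import math

    def classify_naive_bayes_text(doc, vocabulary, priors, cond):
        positions = [w for w in doc if w in vocabulary]   # keep known tokens only
        scores = {y: math.log(priors[y]) +
                     sum(math.log(cond[y][w]) for w in positions)
                  for y in priors}
        return max(scores, key=scores.get)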