Parametric Density Estimation: Bayesian Estimation. Naïve Bayes Classifier


Bayesian Parameter Estimation
Suppose we have some idea of the range where the parameters $\theta$ should be. Shouldn't we formalize such prior knowledge in the hope that it will lead to better parameter estimation?
Let $\theta$ be a random variable with prior distribution $p(\theta)$. This is the key difference between ML and Bayesian parameter estimation. This key assumption allows us to fully exploit the information provided by the data.

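Bayesian Parameter Estimation
$\theta$ is a random variable with prior $p(\theta)$. Unlike in the MLE case, $p(x \mid \theta)$ is a conditional density.
The training data $D$ allow us to convert $p(\theta)$ into a posterior probability density $p(\theta \mid D)$: after we observe the data $D$, we can compute the posterior $p(\theta \mid D)$ using Bayes rule.
But $\theta$ is not our final goal; our final goal is the unknown $p(x)$. Therefore a better thing to do is to estimate $p(x \mid D)$, which is as close as we can come to the unknown $p(x)$.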

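Bayesian Estimation: Formula for $p(x \mid D)$
From the definition of the joint distribution:
$$p(x \mid D) = \int p(x, \theta \mid D)\, d\theta$$
Using the definition of conditional probability:
$$p(x \mid D) = \int p(x \mid \theta, D)\, p(\theta \mid D)\, d\theta$$
But $p(x \mid \theta, D) = p(x \mid \theta)$, since $p(x \mid \theta)$ is completely specified by $\theta$:
$$p(x \mid D) = \int \underbrace{p(x \mid \theta)}_{\text{known}}\; \underbrace{p(\theta \mid D)}_{\text{unknown}}\, d\theta$$
Using Bayes formula:
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}, \qquad p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$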

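Bayesian Estimation vs. MLE
So in principle $p(x \mid D)$ can be computed. In practice, it may be hard to do the integration analytically, and we may have to resort to numerical methods:
$$p(x \mid D) = \int p(x \mid \theta)\, \frac{\prod_{k=1}^{n} p(x_k \mid \theta)\, p(\theta)}{\int \prod_{k=1}^{n} p(x_k \mid \theta)\, p(\theta)\, d\theta}\, d\theta$$
Contrast this with the MLE solution, which requires differentiation of the likelihood to get $p(x \mid \hat{\theta})$. Differentiation is easy and can always be done analytically.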

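Bayesian Estimation vs. MLE
$$p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$$
Here $p(x \mid \theta)$ is the proposed model with a certain $\theta$, and $p(\theta \mid D)$ is the support that $\theta$ receives from the data.
The above equation implies that if we are less certain about the exact value of $\theta$, we should consider a weighted average of $p(x \mid \theta)$ over the possible values of $\theta$.
Contrast this with the MLE solution, which always gives us a single model: $p(x \mid \hat{\theta})$.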

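Bayesian Estimation for Gaussian with unknown $\mu$
Let $p(x \mid \mu)$ be $N(\mu, \sigma^2)$; that is, $\sigma^2$ is known, but $\mu$ is unknown and needs to be estimated, so $\theta = \mu$.
Assume a prior over $\mu$: $p(\mu) \sim N(\mu_0, \sigma_0^2)$. Here $\mu_0$ encodes some prior knowledge about the true mean $\mu$, while $\sigma_0^2$ measures our prior uncertainty.
The posterior distribution is:
$$p(\mu \mid D) \propto p(D \mid \mu)\, p(\mu) = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)$$
$$= \alpha' \exp\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{\mu - x_k}{\sigma}\right)^2 + \left(\frac{\mu - \mu_0}{\sigma_0}\right)^2\right)\right]$$
$$= \alpha'' \exp\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$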

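Bayesian Estimation for Gaussian with unknown $\mu$
Factors that do not depend on $\mu$ have been absorbed into the constants $\alpha'$ and $\alpha''$.
$p(\mu \mid D)$ is the exponential of a quadratic function of $\mu$, i.e. it is a normal density. $p(\mu \mid D)$ remains normal for any number of training samples.
If we write
$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]$$
and compare it with
$$p(\mu \mid D) = \alpha'' \exp\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$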

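Bayesian Estimation for Gaussian with unknown $\mu$
then, identifying the coefficients, we get
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}, \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}$$
where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean.
Solving explicitly for $\mu_n$ and $\sigma_n^2$ we obtain:
$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0$$
which is our best guess after observing $n$ samples, and
$$\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$
which is the uncertainty about the guess; it decreases monotonically with $n$.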

Bayesian Estimation for Gaussian with unknown $\mu$
Each additional observation decreases our uncertainty about the true value of $\mu$. As $n$ increases, $p(\mu \mid D)$ becomes more and more sharply peaked, approaching a Dirac delta function as $n$ approaches infinity. This behavior is known as Bayesian Learning.

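Bayesian Estimation for Gaussian with unknown $\mu$
$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0$$
In general, $\mu_n$ is a linear combination of the sample mean $\hat{\mu}_n$ and the prior mean $\mu_0$, with coefficients that are non-negative and sum to 1. Thus $\mu_n$ lies somewhere between $\hat{\mu}_n$ and $\mu_0$.
If $\sigma_0 \neq 0$, then $\mu_n \to \hat{\mu}_n$ as $n \to \infty$.
If $\sigma_0 = 0$, our a priori certainty that $\mu = \mu_0$ is so strong that no number of observations can change our opinion.
If the a priori guess is very uncertain ($\sigma_0$ is large), we take $\mu_n \approx \hat{\mu}_n$.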
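The closed-form update for $\mu_n$ and $\sigma_n^2$ can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `gaussian_posterior` and the sample values are hypothetical, and the known variance $\sigma^2$ and the prior parameters are passed in as plain numbers.

```python
def gaussian_posterior(samples, sigma2, mu0, sigma0_2):
    """Posterior N(mu_n, sigma_n2) over the mean of N(mu, sigma2),
    given a N(mu0, sigma0_2) prior on mu (sigma2 is assumed known)."""
    n = len(samples)
    if n == 0:
        return mu0, sigma0_2          # no data: the posterior is the prior
    sample_mean = sum(samples) / n
    denom = n * sigma0_2 + sigma2
    # Weighted combination of sample mean and prior mean; weights sum to 1.
    mu_n = (n * sigma0_2 / denom) * sample_mean + (sigma2 / denom) * mu0
    sigma_n2 = sigma0_2 * sigma2 / denom
    return mu_n, sigma_n2

# Hypothetical data: four samples with sample mean 5.
data = [4.0, 6.0, 5.0, 5.0]
mu_n, sigma_n2 = gaussian_posterior(data, sigma2=1.0, mu0=0.0, sigma0_2=1.0)
print(mu_n, sigma_n2)   # mu_n lies between the prior mean 0 and the sample mean 5
```

With these illustrative numbers the weights are 4/5 and 1/5, so the posterior mean is pulled most of the way from the prior mean toward the sample mean, and the posterior variance is smaller than both the prior variance and $\sigma^2$.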

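Bayesian Estimation for Gaussian with unknown $\mu$
We still should compute
$$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu, \qquad p(x \mid \mu) \sim N(\mu, \sigma^2), \quad p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$$
$$p(x \mid D) = \int \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right] d\mu$$
The result is
$$p(x \mid D) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2)$$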
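As a sanity check on the result $p(x \mid D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$, the integral can be approximated numerically and compared with the closed form. The sketch below uses a simple midpoint rule; the helper names, the integration range of ten posterior standard deviations, and the numeric values are all illustrative assumptions.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def predictive_numeric(x, mu_n, sigma_n2, sigma2, steps=20000, width=10.0):
    """Midpoint-rule approximation of the integral of p(x|mu) p(mu|D) over mu."""
    sd = math.sqrt(sigma_n2)
    lo, hi = mu_n - width * sd, mu_n + width * sd
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        mu = lo + (i + 0.5) * h
        total += normal_pdf(x, mu, sigma2) * normal_pdf(mu, mu_n, sigma_n2)
    return total * h

# Hypothetical posterior parameters and query point.
mu_n, sigma_n2, sigma2 = 4.0, 0.2, 1.0
x = 5.0
numeric = predictive_numeric(x, mu_n, sigma_n2, sigma2)
closed_form = normal_pdf(x, mu_n, sigma2 + sigma_n2)
print(numeric, closed_form)   # the two values should agree closely
```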

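Bayesian Estimation: Example for $U[0, \theta]$
Let $X$ be $U[0, \theta]$. Recall that $p(x \mid \theta) = 1/\theta$ inside $[0, \theta]$, else 0.
Suppose we assume a $U[0, 10]$ prior on $\theta$. This is a good prior to use if we just know the range of $\theta$ but don't know anything else.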

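Bayesian Estimation: Example for $U[0, \theta]$
We need to compute $p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$ using
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta} \quad\text{and}\quad p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$
When computing the MLE of $\theta$, we had
$$p(D \mid \theta) = \begin{cases} \dfrac{1}{\theta^n} & \text{for } \theta \ge \max\{x_1, \dots, x_n\} \\ 0 & \text{otherwise} \end{cases}$$
Thus
$$p(\theta \mid D) = \begin{cases} \dfrac{c}{\theta^n} & \text{for } \max\{x_1, \dots, x_n\} \le \theta \le 10 \\ 0 & \text{otherwise} \end{cases}$$
where $c$ is the normalizing constant, i.e.
$$c = \frac{1}{\int_{\max\{x_1, \dots, x_n\}}^{10} \frac{d\theta}{\theta^n}}$$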

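Bayesian Estimation: Example for $U[0, \theta]$
We need to compute $p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$, with
$$p(\theta \mid D) = \begin{cases} \dfrac{c}{\theta^n} & \text{for } \max\{x_1, \dots, x_n\} \le \theta \le 10 \\ 0 & \text{otherwise} \end{cases}$$
We have 2 cases:
1. Case $x < \max\{x_1, x_2, \dots, x_n\}$:
$$p(x \mid D) = \int_{\max\{x_1, \dots, x_n\}}^{10} \frac{c}{\theta^{n+1}}\, d\theta = \alpha$$
a constant independent of $x$.
2. Case $x > \max\{x_1, x_2, \dots, x_n\}$:
$$p(x \mid D) = \int_{x}^{10} \frac{c}{\theta^{n+1}}\, d\theta = \frac{c}{n x^n} - \frac{c}{n\, 10^n}$$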
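The two cases can be evaluated directly. The following Python sketch assumes the $U[0, 10]$ prior from the example, samples that are strictly positive with a maximum below 10, and hypothetical data values; the function names are illustrative and not from the slides. The ML density $1/\hat{\theta}$ on $[0, \hat{\theta}]$, with $\hat{\theta} = \max\{x_1, \dots, x_n\}$, is included for comparison.

```python
import math

def uniform_bayes_predictive(x, data, theta_max=10.0):
    """p(x|D) for a U[0, theta] model with a U[0, theta_max] prior on theta.
    Assumes 0 < max(data) < theta_max."""
    n, m = len(data), max(data)
    # c = 1 / integral of theta^(-n) from m to theta_max
    if n == 1:
        integral = math.log(theta_max / m)
    else:
        integral = (m ** (1 - n) - theta_max ** (1 - n)) / (n - 1)
    c = 1.0 / integral
    if x < 0 or x > theta_max:
        return 0.0
    lower = max(x, m)   # theta must cover both x and every training sample
    # integral of c / theta^(n+1) from lower to theta_max
    return c * (lower ** (-n) - theta_max ** (-n)) / n

def uniform_mle_density(x, data):
    """ML plug-in density: uniform on [0, max(data)]."""
    m = max(data)
    return 1.0 / m if 0 <= x <= m else 0.0

data = [1.0, 4.0, 2.0]   # hypothetical samples
for x in (1.0, 3.5, 6.0, 9.0):
    print(x, uniform_bayes_predictive(x, data), uniform_mle_density(x, data))
```

Below the largest sample the Bayes density is constant; above it the density decays but stays positive up to 10, whereas the ML density drops to zero immediately.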

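Bayesian Estimation: Example for $U[0, \theta]$
[Figure: the ML density $p(x \mid \hat{\theta})$ and the Bayes density $p(x \mid D)$ plotted over $x$, with the samples and 10 marked on the axis.]
Note that even for $x > \max\{x_1, x_2, \dots, x_n\}$, the Bayes density is not zero, which makes sense.
Curious fact: the Bayes density is not uniform, i.e. it does not have the functional form that we have assumed!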

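ML vs. Bayesian Estimation with Broad Prior
Suppose $p(\theta)$ is flat and broad (close to a uniform prior).
$p(\theta \mid D)$ tends to sharpen if there is a lot of data.
Since $p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$, the likelihood $p(D \mid \theta) \propto p(\theta \mid D) / p(\theta)$ will have the same sharp peak as $p(\theta \mid D)$.
But by definition, the peak of $p(D \mid \theta)$ is the ML estimate $\hat{\theta}$.
The integral is dominated by the peak:
$$p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta \approx p(x \mid \hat{\theta}) \int p(\theta \mid D)\, d\theta = p(x \mid \hat{\theta})$$
Thus as $n$ goes to infinity, the Bayesian estimate will approach the density corresponding to the MLE!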

ML vs. Bayesian Estimation
Number of training data: The two methods are equivalent assuming an infinite number of training data (and prior distributions that do not exclude the true solution). For small training data sets, they give different results in most cases.
Computational complexity: ML uses differential calculus or gradient search for maximizing the likelihood. Bayesian estimation requires complex multidimensional integration techniques.

ML vs. Bayesian Estimation
Solution complexity: ML solutions are easier to interpret (i.e., they must be of the assumed parametric form). A Bayesian estimation solution might not be of the parametric form assumed. It is hard to interpret, since it returns a weighted average of models.
Broad or asymmetric $p(\theta \mid D)$: In this case, the two methods will give different solutions. Bayesian methods will explicitly exploit such information.

ML vs. Bayesian Estimation
General comments: There are strong theoretical and methodological arguments supporting Bayesian estimation. In practice, ML estimation is simpler and can lead to comparable performance.

Naïve Bayes Classifier

Unbiased Learning of Bayes Classifiers is Impractical
Learn a Bayes classifier by estimating $P(X \mid Y)$ and $P(Y)$.
Assume $Y$ is boolean and $X$ is a vector of $n$ boolean attributes. In this case, we need to estimate a set of parameters
$$\theta_{ij} \equiv P(X = x_i \mid Y = y_j)$$
where $i$ takes on $2^n$ possible values and $j$ takes on 2 possible values.
How many parameters? For a particular value $y_j$ and the $2^n$ possible values of $x_i$, we need to compute $2^n - 1$ independent parameters. Given the two possible values for $Y$, we must estimate a total of $2(2^n - 1)$ such parameters.
Complex model: high variance with limited data!

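Conditional Independence
Definition: $X$ is conditionally independent of $Y$ given $Z$ if the probability distribution governing $X$ is independent of the value of $Y$, given the value of $Z$:
$$(\forall i, j, k)\quad P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$$
Example:
$$P(\text{Thunder} \mid \text{Rain}, \text{Lightning}) = P(\text{Thunder} \mid \text{Lightning})$$
Note that in general Thunder is not independent of Rain, but it is given Lightning.
Equivalent to:
$$P(X, Y \mid Z) = P(X \mid Y, Z)\, P(Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$$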

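Derivation of Naive Bayes Algorithm
The Naive Bayes algorithm assumes that the attributes $X_1, \dots, X_n$ are all conditionally independent of one another, given $Y$. This dramatically simplifies the representation of $P(X \mid Y)$ and the problem of estimating $P(X \mid Y)$ from the training data.
Consider $X = (X_1, X_2)$:
$$P(X \mid Y) = P(X_1, X_2 \mid Y) = P(X_1 \mid X_2, Y)\, P(X_2 \mid Y) = P(X_1 \mid Y)\, P(X_2 \mid Y)$$
For $X$ containing $n$ attributes:
$$P(X \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$$
Given boolean $X$ and $Y$, we now need only $2n$ parameters to define $P(X \mid Y)$, which is a dramatic reduction compared to the $2(2^n - 1)$ parameters needed if we make no conditional independence assumption.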
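To make the size of this reduction concrete, the two parameter counts can be tabulated for a few values of $n$. The sketch below is only arithmetic on the formulas $2(2^n - 1)$ and $2n$; the function names are illustrative.

```python
def full_model_params(n):
    """Parameters for P(X|Y) with n boolean attributes, boolean Y, no independence assumption."""
    return 2 * (2 ** n - 1)

def naive_bayes_params(n):
    """Parameters for P(X|Y) under the Naive Bayes conditional independence assumption."""
    return 2 * n

for n in (2, 10, 30):
    print(n, full_model_params(n), naive_bayes_params(n))
```

For $n = 30$ boolean attributes the unrestricted model needs over two billion parameters, while Naive Bayes needs 60.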

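The Naïve Bayes Classifier
Given:
the prior $P(Y)$;
$n$ conditionally independent features $X_1, \dots, X_n$, given the class $Y$;
for each $X_i$, the likelihood $P(X_i \mid Y)$.
The probability that $Y$ will take on its $k$th possible value is
$$P(Y = y_k \mid X_1, \dots, X_n) = \frac{P(Y = y_k) \prod_{i} P(X_i \mid Y = y_k)}{\sum_{j} P(Y = y_j) \prod_{i} P(X_i \mid Y = y_j)}$$
The decision rule:
$$y^* = \arg\max_{y_k} P(Y = y_k) \prod_{i} P(X_i \mid Y = y_k)$$
If the assumption holds, NB is the optimal classifier!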

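Naïve Bayes for discrete inputs
Given $n$ attributes $X_i$, each taking on $J$ possible discrete values, and $Y$ a discrete variable taking on $K$ possible values.
MLE for the likelihood $P(X_i = x_{ij} \mid Y = y_k)$, given a set of training examples $D$:
$$\hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}$$
where the $\#D\{x\}$ operator returns the number of elements in the set $D$ that satisfy property $x$.
MLE for the prior:
$$\hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|}$$
where $|D|$ is the number of elements in the training set $D$.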

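NB Example
Given training data $(X, Y)$ [training data table not preserved in this transcription].
Classify the following novel instance: (Outlook = sunny, Temp = cool, Humidity = high, Wind = strong)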

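NB Example
$$y_{NB} = \arg\max_{y \in \{yes,\, no\}} P(y)\, P(\text{Outlook} = sunny \mid y)\, P(\text{Temp} = cool \mid y)\, P(\text{Humidity} = high \mid y)\, P(\text{Wind} = strong \mid y)$$
Priors:
$P(\text{PlayTennis} = yes) = 9/14 = 0.64$
$P(\text{PlayTennis} = no) = 5/14 = 0.36$
Conditional probabilities, e.g. for Wind = strong:
$P(\text{Wind} = strong \mid \text{PlayTennis} = yes) = 3/9 = 0.33$
$P(\text{Wind} = strong \mid \text{PlayTennis} = no) = 3/5 = 0.60$
...
$P(yes)\, P(sunny \mid yes)\, P(cool \mid yes)\, P(high \mid yes)\, P(strong \mid yes) = 0.0053$
$P(no)\, P(sunny \mid no)\, P(cool \mid no)\, P(high \mid no)\, P(strong \mid no) = 0.0206$
The larger score belongs to $no$, so $y_{NB} = no$.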
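The calculation can be reproduced by counting. The sketch below assumes that the training data is the standard 14-row PlayTennis table from Mitchell's textbook, which is consistent with the priors 9/14 and 5/14 and the Wind probabilities 3/9 and 3/5 quoted on the slide; the table itself is not preserved in this transcription, so treat it as an assumption. The function name `nb_scores` is illustrative.

```python
from collections import Counter

# Assumed training data: (Outlook, Temp, Humidity, Wind, PlayTennis).
ROWS = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]

def nb_scores(rows, instance):
    """Unnormalized NB scores P(y) * prod_i P(x_i | y), using MLE counts."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)                      # prior P(y)
        for i, value in enumerate(instance):
            matches = sum(1 for row in rows
                          if row[-1] == label and row[i] == value)
            score *= matches / n_label                   # likelihood P(x_i | y)
        scores[label] = score
    return scores

scores = nb_scores(ROWS, ("sunny", "cool", "high", "strong"))
prediction = max(scores, key=scores.get)
posterior_no = scores["no"] / sum(scores.values())       # normalized posterior
print(scores, prediction, posterior_no)
```

Under this assumed table the scores come out at about 0.0053 for yes and 0.0206 for no, and normalizing them gives a posterior of roughly 0.8 for PlayTennis = no.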

Subtleties of NB classifier: violating the NB assumption
Usually, features are not conditionally independent. Nonetheless, NB often performs well, even when the assumption is violated. [Domingos & Pazzani 96] discuss some conditions for good performance.

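Subtleties of NB classifier: insufficient training data
What if you never see a training instance where $X_1 = a$ when $Y = b$? Then
$$\hat{P}(X_1 = a \mid Y = b) = 0$$
Thus, no matter what values $X_2, \dots, X_n$ take:
$$P(Y = b \mid X_1 = a, X_2, \dots, X_n) = 0$$
Solution?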

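Subtleties of NB classifier: insufficient training data
To avoid this, use a smoothed estimate, which
effectively adds in a number of additional hallucinated examples;
assumes these hallucinated examples are spread evenly over the possible values of $X_i$.
This smoothed estimate is given by
$$\hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\} + l}{\#D\{Y = y_k\} + lJ}$$
$$\hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\} + l}{|D| + lK}$$
Here $l$, the number of hallucinated examples, determines the strength of the smoothing. If $l = 1$, this is called Laplace smoothing.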
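A minimal Python sketch of the two smoothed estimates follows. The function names and the example counts are hypothetical; `l`, `J` and `K` have the same meaning as in the formulas above.

```python
def smoothed_likelihood(count_xy, count_y, J, l=1.0):
    """Smoothed estimate of P(X_i = x_ij | Y = y_k).
    count_xy: #D{X_i = x_ij and Y = y_k}; count_y: #D{Y = y_k};
    J: number of values X_i can take; l: smoothing strength."""
    return (count_xy + l) / (count_y + l * J)

def smoothed_prior(count_y, total, K, l=1.0):
    """Smoothed estimate of P(Y = y_k); K is the number of classes."""
    return (count_y + l) / (total + l * K)

# Hypothetical counts: a value never seen with this class (0 of 9 examples),
# for an attribute with J = 3 possible values.
print(smoothed_likelihood(0, 9, J=3, l=0.0))   # plain MLE: 0.0
print(smoothed_likelihood(0, 9, J=3, l=1.0))   # Laplace smoothing: 1/12
print(smoothed_prior(9, 14, K=2, l=1.0))       # 10/16
```

With $l = 0$ the estimates reduce to the unsmoothed MLE, so a single unseen attribute value still zeroes out the whole product; any $l > 0$ keeps every estimate strictly positive.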

Naive Bayes for Continuous Inputs
When the $X_i$ are continuous, we must choose some other way to represent the distributions $P(X_i \mid Y)$.
One common approach is to assume that for each possible discrete value $y_k$ of $Y$, the distribution of each continuous $X_i$ is Gaussian.
In order to train such a Naïve Bayes classifier, we must estimate the mean and standard deviation of each of these Gaussians.

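Naive Bayes for Continuous Inputs
MLE for means:
$$\hat{\mu}_{ik} = \frac{1}{\sum_{j} \delta(Y^j = y_k)} \sum_{j} X_i^j\, \delta(Y^j = y_k)$$
where the superscript $j$ refers to the $j$th training example, and where $\delta(Y = y_k)$ is 1 if $Y = y_k$ and 0 otherwise. Note that the role of $\delta$ is to select only those training examples for which $Y = y_k$.
MLE for standard deviation (stated through the variance):
$$\hat{\sigma}_{ik}^2 = \frac{1}{\sum_{j} \delta(Y^j = y_k)} \sum_{j} \left(X_i^j - \hat{\mu}_{ik}\right)^2 \delta(Y^j = y_k)$$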
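These per-class estimates, together with the NB decision rule, can be sketched as follows. This is an illustrative implementation under stated assumptions, not the slides' own code: the function names and toy data are hypothetical, scores are accumulated in log space to avoid underflow, and a small variance floor (`var_floor`) is added so that a feature that is constant within a class does not cause a division by zero.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Per class: prior, per-feature MLE means and (biased) MLE variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n_k = len(rows)
        cols = list(zip(*rows))                      # one tuple per feature
        means = [sum(col) / n_k for col in cols]
        variances = [max(sum((v - m) ** 2 for v in col) / n_k, var_floor)
                     for col, m in zip(cols, means)]
        model[label] = (n_k / len(y), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class maximizing log P(y) + sum_i log N(x_i; mu_ik, var_ik)."""
    def log_score(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (v - m) ** 2 / var
        return score
    return max(model, key=log_score)

# Hypothetical two-feature toy data with two classes.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 4.0], [3.2, 3.8], [2.8, 4.2]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.1]), predict_gaussian_nb(model, [3.1, 3.9]))
```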

Learning to Classify Text
Applications: learn which news articles are of interest; learn to classify web pages by topic.
Naïve Bayes is among the most effective algorithms.
Target concept Interesting?: Document → {+, −}
Represent each document by a vector of words: one attribute per word position in the document.
Learning: use training examples to estimate $P(+)$, $P(-)$, $P(doc \mid +)$, $P(doc \mid -)$.

Text Classification-Example
Text: "Text Classification, or the task of automatically assigning semantic categories to natural language text, has become one of the key methods for organizing online information. Since hand-coding classification rules is costly or even impractical, most modern approaches employ machine learning techniques to automatically learn text classifiers from examples."
The text contains 48 words.
Text Representation: $a_1$ = 'text', $a_2$ = 'classification', ..., $a_{48}$ = 'examples'
The representation contains 48 attributes.
Note: Text size may vary, but it will not cause a problem.

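NB conditional independence Assumption
$$P(doc \mid v_j) = \prod_{i=1}^{length(doc)} P(a_i = w_k \mid v_j)$$
The NB assumption is that the word probabilities for one text position are independent of the words in other positions, given the document classification $v_j$.
$w_k$ indicates the $k$th word in the English vocabulary; $P(a_i = w_k \mid v_j)$ is the probability that the word in position $i$ is $w_k$, given $v_j$.
Clearly not true: the probability of the word "learning" may be greater if the preceding word is "machine".
Necessary: without it, the number of probability terms is prohibitive.
Performs remarkably well despite the incorrectness of the assumption.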

Text Classification-Example
For the 48-word example text, with representation $a_1$ = 'text', $a_2$ = 'classification', ..., $a_{48}$ = 'examples' (48 attributes), the classification is:
$$v^* = \arg\max_{v_j \in \{+,-\}} P(v_j) \prod_{i} P(a_i = w_k \mid v_j)$$
$$= \arg\max_{v_j \in \{+,-\}} P(v_j)\, P(a_1 = \text{'text'} \mid v_j) \cdots P(a_{48} = \text{'examples'} \mid v_j)$$

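Estimating Likelihood
Estimating $P(a_i = w_k \mid v_j)$ is problematic because we need to estimate it for each combination of text position, English word, and target value: 48 × 50,000 × 2 ≈ 5 million such terms.
Assumption that reduces the number of terms: the Bag of Words Model. The probability of encountering a specific word $w_k$ is independent of the specific word position:
$$P(a_i = w_k \mid v_j) = P(a_m = w_k \mid v_j) \quad \forall i, m$$
Instead of estimating $P(a_1 = w_k \mid v_j), P(a_2 = w_k \mid v_j), \dots$ we estimate a single term $P(w_k \mid v_j)$.
Now we have 50,000 × 2 distinct terms.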

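Estimating Likelihood
The estimate for the likelihood is
$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |Vocabulary|}$$
$n$: the total number of word positions in all training examples whose target value is $v_j$.
$n_k$: the number of times word $w_k$ is found among these $n$ word positions.
$|Vocabulary|$: the total number of distinct words found within the training data.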

Classify_Naive_Bayes_Text(Doc)
positions ← all word positions in Doc that contain tokens found in Vocabulary
Return
$$v^* = \arg\max_{v_j \in \{+,-\}} P(v_j) \prod_{i \in positions} P(a_i \mid v_j)$$
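The bag-of-words estimate $(n_k + 1)/(n + |Vocabulary|)$ and the classification step can be sketched together in Python. This is a hedged illustration rather than the slides' implementation: the function names, the whitespace tokenization, and the four toy documents are hypothetical, and the product in the decision rule is replaced by a sum of logarithms, which picks the same class while avoiding floating-point underflow on long documents.

```python
import math
from collections import Counter

def train_text_nb(docs):
    """docs: list of (tokens, label). Returns vocabulary, priors, word likelihoods."""
    vocabulary = {w for tokens, _ in docs for w in tokens}
    labels = {label for _, label in docs}
    priors, word_probs = {}, {}
    for v in labels:
        class_docs = [tokens for tokens, label in docs if label == v]
        priors[v] = len(class_docs) / len(docs)
        counts = Counter(w for tokens in class_docs for w in tokens)
        n = sum(counts.values())                 # word positions with target value v
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary))
                         for w in vocabulary}
    return vocabulary, priors, word_probs

def classify_text_nb(tokens, vocabulary, priors, word_probs):
    """Keep only positions whose token is in the vocabulary, then take the argmax."""
    positions = [w for w in tokens if w in vocabulary]
    def log_score(v):
        return math.log(priors[v]) + sum(math.log(word_probs[v][w]) for w in positions)
    return max(priors, key=log_score)

# Hypothetical training documents.
docs = [
    ("machine learning text".split(), "+"),
    ("learning classifiers examples".split(), "+"),
    ("football match score".split(), "-"),
    ("match ticket price".split(), "-"),
]
vocabulary, priors, word_probs = train_text_nb(docs)
print(classify_text_nb("machine learning rocks".split(), vocabulary, priors, word_probs))
print(classify_text_nb("match price".split(), vocabulary, priors, word_probs))
```

The token "rocks" does not occur in the training vocabulary, so it is skipped at classification time, which mirrors the "positions" step of Classify_Naive_Bayes_Text.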