Bayesian classification CISC 5800 Professor Daniel Leeds


Introduction to classifiers. Bayesian classification, CISC 5800, Professor Daniel Leeds.

Goal: learn a function C that maximizes correct labels (Y) based on features (X). Example: documents described by word-count features such as lion, wolf, monkey, broker, analyst, dividend. A document with high counts for lion, wolf, and monkey maps to C(x) = "jungle"; a document with high counts for broker, analyst, and dividend maps to C(x) = "wallstreet".

Giraffe detector. Feature X: height. Class Y: True or False ("is giraffe" or "is not giraffe"). Learn the optimal classification parameter(s). Parameter: x_thresh. Example function: C(x) = True if x > x_thresh, False otherwise.

Learning our classifier parameter(s). Adjust the parameter(s) based on observed data. Training set: contains features and corresponding labels — in the slide's table, a column of heights X with labels Y, where the taller examples are labeled True (giraffe) and the shorter ones False.

The testing set. Does the classifier correctly label new data? The testing set should be distinct from the training set!

Be careful with your training set. What if we train with only baby giraffes and ants? What if we train with only T. rexes and adult giraffes? [Plot: heights of baby giraffe, cat, lion, T. rex, and adult giraffe along the height axis.] Example of good performance: 90% correct labels.
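The threshold classifier above is simple enough to implement directly. Below is a minimal Python sketch: the heights and labels are made-up stand-ins for the slide's training table, and the brute-force threshold search is just one reasonable way to "learn" x_thresh (the slides do not prescribe a specific procedure).

```python
# Toy giraffe detector: learn x_thresh by brute-force search over candidate
# thresholds, then evaluate on a held-out testing set.
# All heights/labels here are illustrative, not values from the slides.
train_X = [2.5, 2.1, 1.8, 1.2, 0.9]          # feature: height
train_Y = [True, True, True, False, False]   # label: is giraffe?

def classify(x, x_thresh):
    # C(x) = True if x > x_thresh, False otherwise
    return x > x_thresh

def accuracy(X, Y, x_thresh):
    correct = sum(classify(x, x_thresh) == y for x, y in zip(X, Y))
    return correct / len(X)

# Candidate thresholds: midpoints between consecutive sorted training heights.
xs = sorted(train_X)
candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
x_thresh = max(candidates, key=lambda t: accuracy(train_X, train_Y, t))

# The testing set must be distinct from the training set.
test_X = [2.3, 1.0, 1.7]
test_Y = [True, False, True]
print("learned x_thresh:", x_thresh)
print("test accuracy:", accuracy(test_X, test_Y, x_thresh))
```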

Training vs. testing. Training: learn parameters from a set of data in each class. Testing: measure how often the classifier correctly identifies new data. More training reduces classifier error ε, but fitting the training data too closely causes worse testing error: overfitting. [Plot: error versus size of the training set.]

Quick probability review. From a joint probability table P(G, H), with G ∈ {A, B, C, D} and H ∈ {False, True}, compute P(G=C | H=True): this requires P(G=C, H=True) and P(H=True). Similarly compute P(H=True | G=C).

Bayes rule. Typically written P(A|B) = P(B|A) P(A) / P(B). For learning: P(θ|D) = P(D|θ) P(θ) / P(D), where D is the observed data and θ are the parameters that describe that data. Our job is to find the most likely parameters θ for the given data. Posterior probability: the probability of parameters θ for data D, P(θ|D). Likelihood: the probability of data D given that it comes from parameters θ, P(D|θ). Prior: the probability of observing parameters θ, P(θ). Parameters may be treated as analogous to a class.

Typical classification approaches. MAP, Maximum A Posteriori: determine the parameters/class with maximum posterior probability, argmax_θ P(θ|D). MLE, Maximum Likelihood Estimation: determine the parameters/class that maximize the probability of the data, argmax_θ P(D|θ).

Likelihood: P(D|θ). Each parameter value has its own distribution of possible data; the distribution is described by the parameter(s). Example: classes {Horse, Dog}, feature RunningSpeed, modeled for each class as a Gaussian with fixed σ and class-specific means μ_horse and μ_dog.

The prior: P(θ). Certain parameters/classes are more common than others. Classes {Horse, Dog} with P(Horse) = 0.05, P(Dog) = 0.95. High likelihood may not mean high posterior: which is higher, P(Horse | D=9) or P(Dog | D=9)? Compare the likelihoods P(D|θ) with the posteriors P(θ|D) ∝ P(D|θ) P(θ).
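To make the likelihood-versus-posterior distinction concrete, here is a small sketch under assumed numbers: the class means, shared σ, and observed speed are invented for illustration (the slide's own values are not legible in the transcription); only the priors P(Horse) = 0.05 and P(Dog) = 0.95 come from the slide.

```python
import math

# Gaussian likelihood with fixed sigma, as in the RunningSpeed model.
def gaussian(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu = {"Horse": 9.0, "Dog": 5.0}   # illustrative means (assumed, not from the slides)
sigma = 2.0                       # illustrative shared sigma (assumed)
prior = {"Horse": 0.05, "Dog": 0.95}   # priors from the slide

x = 9.0  # an observed running speed (illustrative)

likelihood = {c: gaussian(x, mu[c], sigma) for c in mu}
unnorm_post = {c: likelihood[c] * prior[c] for c in mu}
Z = sum(unnorm_post.values())
posterior = {c: unnorm_post[c] / Z for c in mu}

print("MLE class (max likelihood):", max(likelihood, key=likelihood.get))  # Horse
print("MAP class (max posterior): ", max(posterior, key=posterior.get))    # Dog
```

With these numbers the likelihood favors Horse, but the strong prior on Dog flips the posterior: high likelihood does not guarantee high posterior.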

Review. Classify by finding the class with the maximum posterior or the maximum likelihood: argmax_θ P(θ|D) ∝ P(D|θ) P(θ), i.e., Posterior ∝ Likelihood × Prior ("∝" means proportional). We ignore the P(D) denominator because D stays the same while comparing different classes (θ).

Learning probabilities. We have a coin biased to favor one side. How can we calculate the bias? Data (D): {HHTH, TTHH, TTTT, HTTT}. Bias (θ): the probability of H. P(D|θ) = θ^H (1−θ)^T, where H is the number of heads and T the number of tails.

The properties of logarithms. e^a = b means log b = a. a < b means log a < log b. log(ab) = log a + log b. log(a^n) = n log a. Logarithms are convenient when dealing with small probabilities: a product of very small probabilities becomes a sum of (negative) logs, avoiding numerical underflow.

Optimization: finding the maximum likelihood. argmax_θ P(D|θ) = argmax_θ θ^H (1−θ)^T. Equivalently, maximize the log-likelihood: argmax_θ [H log θ + T log(1−θ)], where θ is the probability of heads.

Optimization: finding zero slope. The location of the maximum has slope 0: d/dθ [H log θ + T log(1−θ)] = H/θ − T/(1−θ) = 0.
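The coin example can be checked numerically. This short sketch computes the MLE θ = H/(H+T) from the slide's data and confirms that it maximizes the log-likelihood H log θ + T log(1−θ) over a grid (the grid search is only a sanity check, not part of the slides).

```python
import math

data = ["HHTH", "TTHH", "TTTT", "HTTT"]   # data D from the slide
flips = "".join(data)
H = flips.count("H")
T = flips.count("T")

theta_mle = H / (H + T)
print("H =", H, "T =", T, "MLE theta =", theta_mle)   # 6, 10, 0.375

def log_likelihood(theta):
    return H * math.log(theta) + T * math.log(1 - theta)

# Sanity check: the MLE should beat every other theta on a fine grid.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=log_likelihood)
print("grid-search argmax ~", best)   # 0.375
```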

Intuition of the MLE result: θ = H / (H + T). The probability of getting heads is the number of heads divided by the total number of flips.

Finding the maximum a posteriori. P(θ|D) ∝ P(D|θ) P(θ). Incorporate a Beta prior: P(θ) = θ^α (1−θ)^β / B(α, β). Then argmax_θ P(D|θ) P(θ) = argmax_θ [log P(D|θ) + log P(θ)].

MAP: estimating θ. argmax_θ [log P(D|θ) + log P(θ)] = argmax_θ [H log θ + T log(1−θ) + α log θ + β log(1−θ) − log B(α, β)]. Set the derivative to 0: H/θ − T/(1−θ) + α/θ − β/(1−θ) = 0.

Intuition of the MAP result: θ = (H + α) / (H + α + T + β). The prior has strong influence when H and T are small, and weak influence when H and T are large.

Multiple features. From Dr. Lyon's lecture: position coordinates x, y, angle; pictures: pixels, sonar. Sometimes multiple features provide new information — in robot localization, e.g., the coordinate pair (1,4) is different from (1,1) and from (4,4). Sometimes multiple features are redundant — a super-hero fan: Watch Batman? Watch Superman?

Assuming independence. Is there a storm? P(storm | lightning, wind), i.e., P(S | L, W) = P(L, W | S) P(S) / P(L, W) ∝ P(L, W | S) P(S). Let's assume L and W are independent given S. Then P(L, W | S) = ?
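Here is a small sketch of the MAP estimate under the slide's convention for the Beta prior (exponents α and β, so θ_MAP = (H + α)/(H + α + T + β)); the values of α and β below are arbitrary illustrative pseudo-counts.

```python
# MAP estimate of coin bias with the Beta-style prior from the slide:
# P(theta) ∝ theta^alpha * (1 - theta)^beta, giving
# theta_MAP = (H + alpha) / (H + alpha + T + beta).
H, T = 6, 10          # counts from the earlier data {HHTH, TTHH, TTTT, HTTT}
alpha, beta = 5, 5    # illustrative prior pseudo-counts (not from the slides)

theta_mle = H / (H + T)
theta_map = (H + alpha) / (H + alpha + T + beta)
print("MLE:", theta_mle)   # 0.375
print("MAP:", theta_map)   # 11/26 ~ 0.423, pulled toward the prior's preferred 0.5

# With much more data, the prior's influence fades:
H, T = 600, 1000
print("MAP with large counts:", (H + alpha) / (H + alpha + T + beta))  # ~ 0.376
```

The two print-outs mirror the slide's intuition: the prior matters when H and T are small and barely matters when they are large.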

Estimating P(Lightning | Storm): MLE by counting data points. Is there lightning? Yes or no — a binary variable, like heads or tails. P(L=yes | S=yes) is the probability of lightning given there is a storm; P(L=no | S=yes) = 1 − P(L=yes | S=yes). What is the MLE of P(L=yes | S=yes)? What is the MLE of P(L=yes | S=no)? In general, counting over the data set D:

P(A = a_i | C = c_j) = #D{A = a_i and C = c_j} / #D{C = c_j}
P(A = a_i, B = b_k | C = c_j) = #D{A = a_i and B = b_k and C = c_j} / #D{C = c_j}

Note: both A and C can take on multiple values (binary and beyond).

P(L, W | S) versus P(A_1, ..., A_n | C) — in this slide, all variables are binary.

Non-independent: estimate P(L=yes, W=yes | S=yes), P(L=yes, W=no | S=yes), P(L=no, W=yes | S=yes), and deduce P(L=no, W=no | S=yes) as 1 minus the sum of the others; repeat for S=no. Number of parameters to estimate: 2^n − 1 for each class, 2(2^n − 1) in total.

Independent, P(L, W | S) = P(L | S) P(W | S): estimate P(L=yes | S=yes) and deduce P(L=no | S=yes) = 1 − P(L=yes | S=yes); estimate P(W=yes | S=yes) and deduce P(W=no | S=yes) = 1 − P(W=yes | S=yes); repeat for S=no. Number of parameters to estimate: n for each class, 2n in total.

Naïve Bayes: classification + learning. We want P(Y | X_1, X_2, ..., X_n), so we compute P(X_1, X_2, ..., X_n | Y) and P(Y), with P(X_1, X_2, ..., X_n | Y) = Π_i P(X_i | Y). Learning: estimate each P(X_i | Y) through MLE,

P(X_i = x_k | Y = y_j) = #D(X_i = x_k and Y = y_j) / #D(Y = y_j)

and estimate P(Y) as

P(Y = y_j) = #D(Y = y_j) / |D|.

Note: both X and Y can take on multiple values (binary and beyond).

Shortcoming of MLE. P(X_i = x_k | Y = y_j) = #D(X_i = x_k and Y = y_j) / #D(Y = y_j). What if X_i = x_k together with Y = y_j is very rare, but possible? Example — classify articles, where X_i asks "does word i appear in the article?" and Y = {jungle, wallstreet}. The word "broker" is very unlikely in jungle articles, so the MLE gives P(X_i = broker | Y = jungle) = 0, and then P(X_1 = x_1, ..., X_n = x_n | Y = y_j) = Π_i P(X_i = x_i | Y = y_j) = 0.
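The counting estimates translate directly into code. Below is a minimal sketch of Naïve Bayes learning by MLE on a tiny made-up dataset (the storm/lightning/wind examples are invented for illustration).

```python
from collections import Counter

# Each example: (feature dict, class label). Invented illustrative data.
data = [
    ({"L": "yes", "W": "yes"}, "storm"),
    ({"L": "yes", "W": "no"},  "storm"),
    ({"L": "no",  "W": "yes"}, "storm"),
    ({"L": "no",  "W": "no"},  "no_storm"),
    ({"L": "no",  "W": "yes"}, "no_storm"),
    ({"L": "no",  "W": "no"},  "no_storm"),
]

# MLE of P(Y = y_j): fraction of examples in each class.
class_counts = Counter(y for _, y in data)
p_y = {y: c / len(data) for y, c in class_counts.items()}

# MLE of P(X_i = x_k | Y = y_j): counts within each class.
def p_feature_given_class(feature, value, y):
    num = sum(1 for x, yy in data if yy == y and x[feature] == value)
    return num / class_counts[y]

print("P(storm) =", p_y["storm"])                                   # 0.5
print("P(L=yes | storm) =", p_feature_given_class("L", "yes", "storm"))      # 2/3
print("P(L=yes | no_storm) =", p_feature_given_class("L", "yes", "no_storm"))  # 0.0
```

Note that P(L=yes | no_storm) comes out exactly 0 here, which is precisely the MLE shortcoming the next slides address with a prior.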

Estimate each P(X_i | Y) through MAP. Incorporate a prior β_j for each class:

P(X_i = x_k | Y = y_j) = (#D(X_i = x_k and Y = y_j) + β_j) / (#D(Y = y_j) + Σ_m β_m)
P(Y = y_j) = (#D(Y = y_j) + β_j) / (|D| + Σ_m β_m)

Extra note: β_j acts as a prior frequency for class j, and Σ_m β_m as the prior frequencies of all classes. Both X and Y can take on multiple values (binary and beyond).

Benefits of Naïve Bayes. Very fast learning and classifying: 2n+1 parameters rather than 2(2^n − 1)+1 parameters. It often works even if the features are NOT independent.

Classification strategy: generative vs. discriminative. Generative (e.g., Bayes / Naïve Bayes): identify a probability distribution for each class, then determine the class with maximum probability for a data example. Discriminative (e.g., logistic regression): identify the boundary between classes, then determine which side of the boundary a new data example lies on.

Linear algebra: data features. A vector is a list of numbers, each describing one data feature. A matrix is a list of lists of numbers: the features for each data point. Example: a word-count matrix whose rows are words (wolf, lion, monkey, broker, analyst, dividend) and whose columns are documents (Document 1, Document 2, Document 3); each entry is the number of occurrences of that word in that document.

Feature space. Each data feature defines a dimension in space; each document is a point whose coordinates are its word counts (e.g., its wolf count along one axis, its lion count along another).

The dot product. The dot product compares two vectors:

a · b = Σ_{i=1..n} a_i b_i = a^T b
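Putting the MAP-smoothed counts together with the Naïve Bayes product rule gives a complete classifier. A sketch under assumptions: the jungle/wallstreet documents below are invented, and a single pseudo-count β is used for every class (the slides allow a different β_j per class).

```python
from collections import Counter

# Tiny invented corpus: word-presence features per document.
docs = [
    ({"lion", "wolf", "monkey"}, "jungle"),
    ({"lion", "monkey"},         "jungle"),
    ({"broker", "analyst"},      "wallstreet"),
    ({"broker", "dividend"},     "wallstreet"),
]
vocab = ["lion", "wolf", "monkey", "broker", "analyst", "dividend"]
classes = ["jungle", "wallstreet"]
beta = 1  # pseudo-count for MAP smoothing (same beta for every class here)

n_class = Counter(y for _, y in docs)

def p_word_given_class(word, y):
    # Smoothed estimate: (count + beta) / (class count + sum of betas)
    num = sum(1 for words, yy in docs if yy == y and word in words)
    return (num + beta) / (n_class[y] + beta * len(classes))

def p_class(y):
    return (n_class[y] + beta) / (len(docs) + beta * len(classes))

def classify(words):
    scores = {}
    for y in classes:
        score = p_class(y)
        for w in vocab:  # feature X_w: is word w present in the document?
            p = p_word_given_class(w, y)
            score *= p if w in words else (1 - p)
        scores[y] = score
    return max(scores, key=scores.get)

print(classify({"broker", "lion"}))  # no class ever scores exactly 0, thanks to smoothing
```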

The dot product, continued. The magnitude of a vector is the square root of the sum of the squares of its elements: ‖a‖ = sqrt(Σ_i a_i²). If a has unit magnitude, a · b is the projection of b onto a, where a · b = Σ_{i=1..n} a_i b_i.

Separating boundary, defined by w. A separating hyperplane splits class 0 and class 1; the plane is defined by the vector w perpendicular to it. Is a data point x in class 0 or class 1? If w^T x > 0, class 1; if w^T x < 0, class 0.

From real-number projection to 0/1 label. Binary classification: 0 is class A, 1 is class B. The sigmoid function stands in for p(x | y):

g(h) = 1 / (1 + e^{−h}),  with h = w^T x = Σ_j w_j x_j
p(x | y=1; w) = g(w^T x) = 1 / (1 + e^{−w^T x})
p(x | y=0; w) = 1 − g(w^T x) = e^{−w^T x} / (1 + e^{−w^T x})

[Plot: the sigmoid g(h) rising from 0 to 1 over h ∈ [−5, 5], passing through 0.5 at h = 0.]

Learning parameters for classification. Similar to MLE for the Bayes classifier: write the likelihood of the data points y_1, ..., y_n (really framed as the posterior of y given x). If y_i is in class A (y_i = 0), multiply in (1 − g(x_i; w)); if y_i is in class B (y_i = 1), multiply in g(x_i; w):

L(y | x; w) = Π_i g(x_i; w)^{y_i} (1 − g(x_i; w))^{(1 − y_i)}

Taking the log,

log L = Σ_i [ y_i log g(x_i; w) + (1 − y_i) log(1 − g(x_i; w)) ]

and substituting g(h) = 1 / (1 + e^{−h}) and simplifying,

log L = Σ_i [ y_i w^T x_i − log(1 + e^{w^T x_i}) ].

Differentiating with respect to w_j (where x_i^j is feature j of data point x_i) gives

∂/∂w_j log L = Σ_i x_i^j ( y_i − g(w^T x_i) ),

which leads to the weight update w_j ← w_j + ε x_i^j ( y_i − g(w^T x_i) ).
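The log-likelihood and its gradient can be verified numerically. A small sketch (the data points and weights are invented; the finite-difference comparison is only there to confirm the gradient formula Σ_i x_i^j (y_i − g(w^T x_i))):

```python
import math

def g(h):
    # sigmoid: g(h) = 1 / (1 + e^{-h})
    return 1.0 / (1.0 + math.exp(-h))

def wtx(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

# Invented toy data: two features per point, binary labels.
X = [[1.0, 2.0], [2.0, 0.5], [0.5, 0.2], [0.1, 1.5]]
Y = [1, 1, 0, 0]
w = [0.3, -0.2]

def log_likelihood(w):
    # sum_i [ y_i * w^T x_i - log(1 + e^{w^T x_i}) ]
    return sum(y * wtx(w, x) - math.log(1 + math.exp(wtx(w, x)))
               for x, y in zip(X, Y))

def gradient(w, j):
    # d/dw_j log L = sum_i x_i^j * (y_i - g(w^T x_i))
    return sum(x[j] * (y - g(wtx(w, x))) for x, y in zip(X, Y))

# Finite-difference check of the gradient for w_0.
eps = 1e-6
numeric = (log_likelihood([w[0] + eps, w[1]]) -
           log_likelihood([w[0] - eps, w[1]])) / (2 * eps)
print(gradient(w, 0), "~", numeric)   # the two values should agree closely
```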

Iterative gradient descent. Here y_i is the true data label and g(w^T x_i) is the computed data label. Begin with initial guessed weights w. For each data point (y_i, x_i), update each weight w_j:

w_j ← w_j + ε x_i^j ( y_i − g(w^T x_i) )

Choose ε so that the change is neither too big nor too small.

Intuition for x_i^j (y_i − g(w^T x_i)): if y_i = 1 and g(w^T x_i) = 0 and x_i^j > 0, make w_j larger and push w^T x_i to be larger; if y_i = 0 and g(w^T x_i) = 1 and x_i^j > 0, make w_j smaller and push w^T x_i to be smaller.

MAP for the discriminative classifier. MLE: P(x | y=1; w) ~ g(w^T x). MAP: P(y=1 | x) = P(x | y=1; w) P(w) ~ g(w^T x) × ??? — what prior P(w) should we use? P(w) priors: L2 regularization shrinks all weights; L1 regularization minimizes the number of non-zero weights.

MAP with L2 regularization. P(y=1 | x, w) ∝ P(x | y=1; w) P(w), with a Gaussian-style prior P(w_j) ∝ e^{−w_j² / λ}:

L(y | x; w) = [ Π_j e^{−w_j² / λ} ] Π_i g(x_i; w)^{y_i} (1 − g(x_i; w))^{(1 − y_i)}

log L = Σ_i [ y_i w^T x_i − log(1 + e^{w^T x_i}) ] − Σ_j w_j² / λ

and the weight update becomes

w_j ← w_j + ε [ x_i^j ( y_i − g(w^T x_i) ) − 2 w_j / λ ].
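Finally, a sketch of the full training loop with the L2 (Gaussian-prior) penalty folded into the update rule above; the data, ε, λ, and iteration count are invented for illustration.

```python
import math

def g(h):
    return 1.0 / (1.0 + math.exp(-h))

def wtx(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

# Invented toy data: class 1 tends to have larger feature values.
X = [[1.0, 2.0], [2.0, 1.5], [0.2, 0.5], [0.3, 0.1]]
Y = [1, 1, 0, 0]

w = [0.0, 0.0]    # initial guessed weights
eps = 0.1         # learning rate: not too big, not too small
lam = 10.0        # L2 strength; larger lambda means a weaker penalty here,
                  # since the prior is exp(-w_j^2 / lambda)

for _ in range(200):
    for x, y in zip(X, Y):
        pred = g(wtx(w, x))
        for j in range(len(w)):
            # w_j <- w_j + eps * ( x_i^j (y_i - g(w^T x_i)) - 2 w_j / lambda )
            w[j] += eps * (x[j] * (y - pred) - 2 * w[j] / lam)

print("learned weights:", w)
print("predictions:", [round(g(wtx(w, x)), 2) for x in X])
```

Setting lam very large effectively removes the penalty and recovers plain MLE training; smaller lam pulls the weights toward zero, matching the L2-regularization intuition on the slide.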