Lecture 19 of 42. MAP and MLE continued, Minimum Description Length (MDL)


Lecture 19 of 42: MAP and MLE continued, Minimum Description Length (MDL)
Wednesday, 28 February 2007
William H. Hsu, KSU
http://www.kddresearch.org
Readings for next class: Chapter 5, Mitchell

Lecture Outline
- Read Sections 6.1-6.5, Mitchell
- Overview of Bayesian Learning
  - Framework: using probabilistic criteria to generate hypotheses of all kinds
  - Probability: foundations
- Bayes's Theorem
  - Definition of conditional (posterior) probability
  - Ramifications of Bayes's Theorem
    - Answering probabilistic queries
    - MAP hypotheses
- Generating Maximum A Posteriori (MAP) Hypotheses
- Generating Maximum Likelihood Hypotheses
- Next Week: Sections 6.6-6.13, Mitchell; Roth; Pearl and Verma
  - More Bayesian learning: MDL, BOC, Gibbs, Simple (Naïve) Bayes
  - Learning over text

Bayes's Theorem: MAP and ML Hypotheses

Choosing Hypotheses
- Generally want the most probable hypothesis given the training data
- Define: arg max_{x ∈ Ω} f(x), the value of x in the sample space Ω with the highest f(x)
- Maximum a posteriori hypothesis, h_MAP:
  h_MAP = arg max_{h ∈ H} P(h | D)
        = arg max_{h ∈ H} P(D | h) P(h) / P(D)
        = arg max_{h ∈ H} P(D | h) P(h)

ML Hypothesis
- Assume P(h_i) = P(h_j) for all pairs i, j (uniform priors, i.e., P(H) ~ Uniform)
- Can then further simplify and choose the maximum likelihood hypothesis, h_ML:
  h_ML = arg max_{h_i ∈ H} P(D | h_i)

Bayes's Theorem: Query Answering (QA)

Answering User Queries
- Suppose we want to perform intelligent inferences over a database DB
  - Scenario 1: DB contains records (instances), some labeled with answers
  - Scenario 2: DB contains probabilities (annotations) over propositions
- QA: an application of probabilistic inference

QA Using Prior and Conditional Probabilities: Example
- Query: Does the patient have cancer or not?
- Suppose: the patient takes a lab test and the result comes back positive
  - Correct + result in only 98% of the cases in which the disease is actually present
  - Correct - result in only 97% of the cases in which the disease is not present
  - Only 0.008 of the entire population has this cancer
- α (false negative for H_0 ≡ Cancer) = 0.02 (NB: for 1-point sample)
- β (false positive for H_0 ≡ Cancer) = 0.03 (NB: for 1-point sample)
- P(Cancer) = 0.008, P(¬Cancer) = 0.992
- P(+ | Cancer) = 0.98, P(- | Cancer) = 0.02
- P(+ | ¬Cancer) = 0.03, P(- | ¬Cancer) = 0.97
- P(+ | H_0) P(H_0) = 0.0078; P(+ | H_A) P(H_A) ≈ 0.0298
- h_MAP = H_A ≡ ¬Cancer
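A quick numeric check of the query above (a minimal sketch, not part of the original slides): it forms the two unnormalized products from the slide and then normalizes by P(+) to recover the full posteriors.

```python
# Hypothetical sketch: Bayes's Theorem on the cancer lab-test query above.
# Values are taken from the slide; variable names are illustrative only.
p_cancer = 0.008                 # P(Cancer), the prior
p_pos_given_cancer = 0.98        # P(+ | Cancer)
p_pos_given_no_cancer = 0.03     # P(+ | ~Cancer)

# Unnormalized posteriors P(+ | h) P(h) for each hypothesis
joint_cancer = p_pos_given_cancer * p_cancer                  # 0.0078
joint_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)      # ~0.0298

# Normalize by P(+) (theorem of total probability) to get P(h | +)
p_pos = joint_cancer + joint_no_cancer
print(joint_cancer / p_pos)      # P(Cancer | +)   ~0.21
print(joint_no_cancer / p_pos)   # P(~Cancer | +)  ~0.79  -> h_MAP = ~Cancer
```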

Basic Formulas for Probabilities

Product Rule (Alternative Statement of Bayes's Theorem)
- P(A ∧ B) = P(A | B) P(B)
- Proof: requires axiomatic set theory, as does Bayes's Theorem

Sum Rule
- P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
- Sketch of proof (immediate from axiomatic set theory)
  - Draw a Venn diagram of two sets denoting events A and B
  - Let A ∧ B denote the event corresponding to A ∩ B

Theorem of Total Probability
- Suppose events A_1, A_2, ..., A_n are mutually exclusive and exhaustive
  - Mutually exclusive: i ≠ j ⇒ A_i ∩ A_j = ∅
  - Exhaustive: Σ_i P(A_i) = 1
- Then P(B) = Σ_{i=1}^{n} P(B | A_i) P(A_i)
- Proof: follows from the product rule and the 3rd Kolmogorov axiom

MAP and ML Hypotheses: A Pattern Recognition Framework

Pattern Recognition Framework
- Automated speech recognition (ASR), automated image recognition
- Diagnosis

Forward Problem: One Step in ML Estimation
- Given: model h, observations (data) D
- Estimate: P(D | h), the probability that the model generated the data

Backward Problem: Pattern Recognition / Prediction Step
- Given: model h, observations D
- Maximize: P(h(X) = x | h, D) for a new X (i.e., find the best x)

Forward-Backward (Learning) Problem
- Given: model space H, data D
- Find: h ∈ H such that P(h | D) is maximized (i.e., the MAP hypothesis)

More Info
- http://www.cs.brown.edu/research/ai/dynamics/tutorial/documents/HiddenMarkovModels.html
- Emphasis on a particular H (the space of hidden Markov models)
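The three identities above can be sanity-checked numerically on any small joint distribution; the sketch below (not from the slides, with made-up probabilities) does exactly that for two binary events.

```python
# Hypothetical check of the product rule, sum rule, and theorem of total
# probability on a toy joint distribution over two binary events A and B.
# The P(A, B) entries are arbitrary illustrative numbers that sum to 1.
p_joint = {(True, True): 0.20, (True, False): 0.30,
           (False, True): 0.10, (False, False): 0.40}

p_A = sum(p for (a, _), p in p_joint.items() if a)          # P(A)
p_B = sum(p for (_, b), p in p_joint.items() if b)          # P(B)
p_A_and_B = p_joint[(True, True)]                           # P(A ^ B)
p_A_given_B = p_A_and_B / p_B                               # P(A | B)

assert abs(p_A_and_B - p_A_given_B * p_B) < 1e-12           # product rule
p_A_or_B = p_A + p_B - p_A_and_B                            # sum rule

# Total probability: P(B) = P(B | A) P(A) + P(B | ~A) P(~A)
p_B_total = (p_joint[(True, True)] / p_A) * p_A + \
            (p_joint[(False, True)] / (1 - p_A)) * (1 - p_A)
assert abs(p_B - p_B_total) < 1e-12

print(p_A, p_B, p_A_or_B)   # 0.5 0.3 0.6
```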

Bayesian Learning Example: Unbiased Coin [1]

Coin Flip
- Sample space: Ω = {Head, Tail}
- Scenario: the given coin is either fair or has a 60% bias in favor of Head
  - h_1 ≡ fair coin: P(Head) = 0.5
  - h_2 ≡ 60% bias towards Head: P(Head) = 0.6
- Objective: to decide between the default (null) and alternative hypotheses

A Priori (aka Prior) Distribution on H
- P(h_1) = 0.75, P(h_2) = 0.25
- Reflects the learning agent's prior beliefs regarding H
- Learning is revision of the agent's beliefs

Collection of Evidence
- First piece of evidence: d ≡ a single coin toss, which comes up Head
- Q: What does the agent believe now?
- A: Compute P(d) = P(d | h_1) P(h_1) + P(d | h_2) P(h_2)

Bayesian Learning Example: Unbiased Coin [2]

Bayesian Inference: Compute P(d) = P(d | h_1) P(h_1) + P(d | h_2) P(h_2)
- P(Head) = 0.5 · 0.75 + 0.6 · 0.25 = 0.375 + 0.15 = 0.525
- This is the probability of the observation d = Head

Bayesian Learning
- Now apply Bayes's Theorem
  - P(h_1 | d) = P(d | h_1) P(h_1) / P(d) = 0.375 / 0.525 = 0.714
  - P(h_2 | d) = P(d | h_2) P(h_2) / P(d) = 0.15 / 0.525 = 0.286
  - Belief has been revised downwards for h_1, upwards for h_2
  - The agent still thinks that the fair coin is the more likely hypothesis
- Suppose we were to use the ML approach (i.e., assume equal priors)
  - Belief in h_2 is then revised upwards from 0.5, so the data supports the biased coin better
- More Evidence: sequence D of 100 tosses with 70 heads and 30 tails
  - P(D) = (0.5)^70 (0.5)^30 · 0.75 + (0.6)^70 (0.4)^30 · 0.25
  - Now P(h_1 | D) << P(h_2 | D) (see the sketch below)
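The whole posterior update above, including the 70-heads/30-tails sequence, fits in a few lines. This is a sketch reproducing the slide's numbers; log-likelihoods are used for the 100-toss case to avoid underflow, and the variable names are illustrative.

```python
import math

# Priors over the two hypotheses from the slide
prior = {"h1_fair": 0.75, "h2_biased": 0.25}
p_head = {"h1_fair": 0.5, "h2_biased": 0.6}

# Single toss comes up Head: P(d) = sum_h P(d | h) P(h)
p_d = sum(p_head[h] * prior[h] for h in prior)              # 0.525
post = {h: p_head[h] * prior[h] / p_d for h in prior}
print(post)   # {'h1_fair': ~0.714, 'h2_biased': ~0.286}

# Sequence D: 100 tosses, 70 heads and 30 tails (work in log space)
def log_lik(h, heads=70, tails=30):
    return heads * math.log(p_head[h]) + tails * math.log(1 - p_head[h])

log_joint = {h: log_lik(h) + math.log(prior[h]) for h in prior}
m = max(log_joint.values())
norm = sum(math.exp(v - m) for v in log_joint.values())
post_D = {h: math.exp(log_joint[h] - m) / norm for h in log_joint}
print(post_D)  # P(h1 | D) << P(h2 | D)
```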

Brute Force MAP Hypothesis Learner

Intuitive Idea: Produce the Most Likely h Given Observed D

Algorithm Find-MAP-Hypothesis (D)
1. FOR each hypothesis h ∈ H, calculate the conditional (i.e., posterior) probability:
   P(h | D) = P(D | h) P(h) / P(D)
2. RETURN the hypothesis h_MAP with the highest conditional probability:
   h_MAP = arg max_{h ∈ H} P(h | D)

Relation to Concept Learning
- Usual Concept Learning Task
  - Instance space X
  - Hypothesis space H
  - Training examples D
- Consider the Find-S Algorithm
  - Given: D
  - Return: the most specific h in the version space VS_{H,D}

MAP and Concept Learning
- Bayes's Rule: an application of Bayes's Theorem
- What would Bayes's Rule produce as the MAP hypothesis?
- Does Find-S output a MAP hypothesis?
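A direct translation of Find-MAP-Hypothesis for a small finite H (a sketch under the slide's assumptions; the hypothesis space and likelihood functions here are illustrative placeholders, not a specific model from the lecture).

```python
# Minimal sketch of the brute-force MAP learner for a finite hypothesis space.
# `hypotheses` maps a hypothesis name to (prior P(h), likelihood function P(D | h)).
def find_map_hypothesis(data, hypotheses):
    posteriors = {}
    for name, (prior, likelihood) in hypotheses.items():
        # Unnormalized posterior: P(h | D) is proportional to P(D | h) P(h);
        # P(D) is the same for every h, so it can be ignored for arg max.
        posteriors[name] = likelihood(data) * prior
    return max(posteriors, key=posteriors.get), posteriors

# Example: reuse the fair/biased-coin hypotheses on a short toss sequence.
def bernoulli_likelihood(p):
    return lambda d: p ** d.count("H") * (1 - p) ** d.count("T")

H = {"fair": (0.75, bernoulli_likelihood(0.5)),
     "biased": (0.25, bernoulli_likelihood(0.6))}
print(find_map_hypothesis("HHTHHHT", H))
```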

Bayesian Concept Learning and Version Spaces

Assumptions
- Fixed set of instances <x_1, x_2, ..., x_m>
- Let D denote the set of classifications: D = <c(x_1), c(x_2), ..., c(x_m)>
- Choose P(D | h):
  - P(D | h) = 1 if h is consistent with D (i.e., ∀ x_i . h(x_i) = c(x_i))
  - P(D | h) = 0 otherwise
- Choose P(h) ~ Uniform
  - Uniform distribution: P(h) = 1 / |H|
  - Uniform priors correspond to no background knowledge about h
  - Recall: maximum entropy

MAP Hypothesis
- P(h | D) = 1 / |VS_{H,D}| if h is consistent with D, 0 otherwise

Evolution of Posterior Probabilities
- Start with Uniform Priors
  - Equal probabilities assigned to each hypothesis
  - Maximum uncertainty (entropy), minimum prior information
- [Figure: bar plots of P(h), P(h | D_1), and P(h | D_1, D_2) over the hypotheses]
- Evidential Inference
  - Introduce data (evidence) D_1: belief revision occurs
  - The learning agent revises the conditional probability of inconsistent hypotheses to 0
  - Posterior probabilities for the remaining h ∈ VS_{H,D} are revised upward
  - Add more data (evidence) D_2: further belief revision

Characterizing Learning Algorithms by Equivalent MAP Learners

[Figure: an Inductive System (Training Examples D and Hypothesis Space H fed into the Candidate Elimination Algorithm, producing output hypotheses) beside an Equivalent Bayesian Inference System (Training Examples D, Hypothesis Space H, P(h) ~ Uniform, P(D | h) = δ(h(x_i), c(x_i)) fed into the Brute Force MAP Learner, producing output hypotheses); prior knowledge made explicit]

Maximum Likelihood: Learning a Real-Valued Function [1]

[Figure: noisy data points y = f(x) + e scattered about the target function f, with the fitted curve h_ML]

Problem Definition
- Target function: any real-valued function f
- Training examples <x_i, y_i>, where y_i is a noisy training value
  - y_i = f(x_i) + e_i
  - e_i is a random variable (noise), i.i.d. ~ Normal(0, σ²), aka Gaussian noise
- Objective: approximate f as closely as possible

Solution
- Maximum likelihood hypothesis h_ML
- Minimizes the sum of squared errors (SSE):
  h_ML = arg min_{h ∈ H} Σ_{i=1}^{m} (d_i - h(x_i))²
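Since h_ML under Gaussian noise is just the least-squares fit, the result can be illustrated with a straight-line hypothesis class. This is a sketch with synthetic data; numpy's lstsq is only one of many ways to solve the minimization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y_i = f(x_i) + e_i with f(x) = 2x + 1 and Gaussian noise e_i
x = np.linspace(0, 1, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, size=x.shape)

# h_ML for the linear hypothesis class = arg min sum_i (y_i - h(x_i))^2,
# solved here in closed form via ordinary least squares.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
sse = np.sum((y - (w * x + b)) ** 2)
print(w, b, sse)   # slope and intercept close to 2 and 1, small SSE
```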

Maximum Likelihood: Learning a Real-Valued Function [2]

Derivation of Least Squares Solution
- Assume the noise is Gaussian (prior knowledge)
- Maximum likelihood solution:
  h_ML = arg max_{h ∈ H} p(D | h)
       = arg max_{h ∈ H} Π_{i=1}^{m} (1 / √(2πσ²)) exp(-(d_i - h(x_i))² / (2σ²))
- Problem: computing exponents and comparing reals is expensive!
- Solution: maximize the log probability
  h_ML = arg max_{h ∈ H} Σ_{i=1}^{m} [ln (1 / √(2πσ²)) - (d_i - h(x_i))² / (2σ²)]
       = arg max_{h ∈ H} Σ_{i=1}^{m} -(d_i - h(x_i))² / (2σ²)
       = arg max_{h ∈ H} Σ_{i=1}^{m} -(d_i - h(x_i))²
       = arg min_{h ∈ H} Σ_{i=1}^{m} (d_i - h(x_i))²

Learning to Predict Probabilities

Application: Predicting Survival Probability from Patient Data

Problem Definition
- Given training examples <x_i, d_i>, where d_i ∈ {0, 1}
- Want to train a neural network to output a probability given x_i (not a 0 or 1)

Maximum Likelihood Estimator (MLE)
- In this case one can show:
  h_ML = arg max_{h ∈ H} Σ_{i=1}^{m} [d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))]
- Weight update rule for a sigmoid unit
  [Figure: sigmoid unit with inputs x_1, ..., x_n, weights w_1, ..., w_n, and bias input x_0 = 1 with weight w_0; net = Σ_{i=0}^{n} w_i x_i = w · x; output o = σ(net) = σ(w · x)]
  Δw_{start layer, end layer} = r Σ_{i=1}^{m} (d_i - h(x_i)) x_{i, start layer, end layer}
  w_{start layer, end layer} ← w_{start layer, end layer} + Δw_{start layer, end layer}
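The MLE weight update for the sigmoid unit above is the familiar cross-entropy gradient step. Here is a minimal sketch (single unit, batch gradient ascent); the data, the learning rate r, and the iteration count are made up for illustration.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

# Toy training set: each row of X is an instance x_i, d holds the 0/1 labels.
X = np.array([[0.2, 1.0], [0.9, 0.1], [0.4, 0.8], [0.8, 0.3]])
d = np.array([1, 0, 1, 0])

X1 = np.column_stack([np.ones(len(X)), X])   # prepend x_0 = 1 for the bias weight w_0
w = np.zeros(X1.shape[1])
r = 0.5                                      # learning rate

for _ in range(1000):
    h = sigmoid(X1 @ w)                      # o = sigma(w . x) for every example
    # Gradient ascent on sum_i [d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))]:
    # delta w = r * sum_i (d_i - h(x_i)) x_i, exactly the slide's update rule.
    w += r * X1.T @ (d - h)

print(w, sigmoid(X1 @ w))   # unit outputs approach the target probabilities
```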

Most Probable Classification of New Instances

MAP and MLE: Limitations
- Problem so far: find the most likely hypothesis given the data
- Sometimes we just want the best classification of a new instance x, given D

A Solution Method
- Find the best (MAP) h, use it to classify
- This may not be optimal, though!
- Analogy
  - Estimating a distribution using the mode versus the integral
  - One finds the maximum, the other the area

Refined Objective
- Want to determine the most probable classification
- Need to combine the predictions of all hypotheses
- Predictions must be weighted by their conditional probabilities
- Result: Bayes Optimal Classifier (next time...)

Minimum Description Length (MDL) Principle: Occam's Razor

Occam's Razor
- Recall: prefer the shortest hypothesis (an inductive bias)
- Questions
  - Why short hypotheses as opposed to an arbitrary class of rare hypotheses?
  - What is special about minimum description length?
- Answers
  - MDL approximates an optimal coding strategy for hypotheses
  - In certain cases, this coding strategy maximizes conditional probability
- Issues
  - How exactly is minimum length being achieved (length of what)?
  - When and why can we use MDL learning for MAP hypothesis learning?
  - What does MDL learning really entail (what does the principle buy us)?

MDL Principle
- Prefer the h that minimizes the coding length of the model plus the coding length of the exceptions
  - Model: encode h using a coding scheme C_1
  - Exceptions: encode the conditioned data D | h using a coding scheme C_2

MDL Hypothesis

MDL and Optimal Coding: Bayesian Information Criterion (BIC)
- h_MDL = arg min_{h ∈ H} [L_C1(h) + L_C2(D | h)]
  - e.g., H ≡ decision trees, D = labeled training data
  - L_C1(h) ≡ number of bits required to describe tree h under encoding C_1
  - L_C2(D | h) ≡ number of bits required to describe D given h under encoding C_2
  - NB: L_C2(D | h) = 0 if all x_i are classified perfectly by h (need only describe exceptions)
  - Hence h_MDL trades off tree size against training errors
- Bayesian Information Criterion
  - BIC(h) = lg P(D | h) + lg P(h)
  - h_MAP = arg max_{h ∈ H} [P(D | h) P(h)]
          = arg max_{h ∈ H} [lg P(D | h) + lg P(h)]
          = arg max_{h ∈ H} BIC(h)
          = arg min_{h ∈ H} [-lg P(D | h) - lg P(h)]
  - Interesting fact from information theory: the optimal (shortest expected code length) code for an event with probability p is -lg(p) bits
  - Interpret h_MAP as the total length of h and of D given h under the optimal code
  - BIC = -MDL (i.e., the arg max of BIC is the arg min of the MDL criterion)
  - Prefer the hypothesis that minimizes length(h) + length(misclassifications)

Concluding Remarks on MDL

What Can We Conclude?
- Q: Does this prove once and for all that short hypotheses are best?
- A: Not necessarily
  - Only shows: if we find log-optimal representations for P(h) and P(D | h), then h_MAP = h_MDL
  - No reason to believe that h_MDL is preferable for arbitrary codings C_1, C_2
- Case in point: practical probabilistic knowledge bases
  - Elicitation of a full description of P(h) and P(D | h) is hard
  - A human implementor might prefer to specify relative probabilities

Information Theoretic Learning: Ideas
- Learning as compression
  - Abu-Mostafa: complexity of learning problems (in terms of minimal codings)
  - Wolff: computing (especially search) as compression
- (Bayesian) model selection: searching H using probabilistic criteria
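To make the bit-counting concrete, the sketch below scores two hypothetical hypotheses by -lg P(h) - lg P(D | h). The probabilities are invented purely for illustration; the point is only that the minimum description length corresponds to the MAP choice, as argued above.

```python
import math

def description_length_bits(p_h, p_D_given_h):
    # MDL/BIC view: optimal code lengths are -lg P(h) and -lg P(D | h),
    # so the total description length is L(h) + L(D | h) in bits.
    return -math.log2(p_h) - math.log2(p_D_given_h)

# Two hypothetical hypotheses: a short (a priori more probable) tree that
# misclassifies some examples, and a long tree that fits D more closely.
candidates = {
    "short_tree": (0.100, 1e-4),   # (P(h), P(D | h)), illustrative numbers only
    "long_tree": (0.001, 1e-3),
}
scores = {h: description_length_bits(*p) for h, p in candidates.items()}
h_mdl = min(scores, key=scores.get)
print(scores, h_mdl)   # arg min of the MDL score = arg max of P(D | h) P(h)
```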

Bayesian Classification

Framework
- Find the most probable classification (as opposed to the MAP hypothesis)
- f: X → V (domain ≡ instance space, range ≡ finite set of values)
- Instances x ∈ X can be described as a collection of features x ≡ (x_1, x_2, ..., x_n)
- Performance element: Bayesian classifier
  - Given: an example (e.g., Boolean-valued instances: x ∈ H)
  - Output: the most probable value v_j ∈ V (NB: priors for x are constant with respect to v_MAP)
  v_MAP = arg max_{v_j ∈ V} P(v_j | x)
        = arg max_{v_j ∈ V} P(v_j | x_1, x_2, ..., x_n)
        = arg max_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j)

Parameter Estimation Issues
- Estimating P(v_j) is easy: for each value v_j, count its frequency in D = {<x, f(x)>}
- However, it is infeasible to estimate P(x_1, x_2, ..., x_n | v_j): too many 0 values
- In practice, we need to make assumptions that allow us to estimate P(x | d)

Bayes Optimal Classifier (BOC)

Intuitive Idea
- h_MAP(x) is not necessarily the most probable classification!
- Example
  - Three possible hypotheses: P(h_1 | D) = 0.4, P(h_2 | D) = 0.3, P(h_3 | D) = 0.3
  - Suppose that for a new instance x, h_1(x) = +, h_2(x) = -, h_3(x) = -
  - What is the most probable classification of x?

Bayes Optimal Classification (BOC)
- v* = v_BOC = arg max_{v_j ∈ V} Σ_{h ∈ H} P(v_j | h) P(h | D)
- Example
  - P(h_1 | D) = 0.4, P(- | h_1) = 0, P(+ | h_1) = 1
  - P(h_2 | D) = 0.3, P(- | h_2) = 1, P(+ | h_2) = 0
  - P(h_3 | D) = 0.3, P(- | h_3) = 1, P(+ | h_3) = 0
  - Σ_{h ∈ H} P(+ | h) P(h | D) = 0.4
  - Σ_{h ∈ H} P(- | h) P(h | D) = 0.6
  - Result: v* = v_BOC = arg max_{v_j ∈ V} Σ_{h ∈ H} P(v_j | h) P(h | D) = -
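The slide's three-hypothesis example translates directly into code. This sketch computes both the MAP classification and the Bayes optimal classification so the disagreement between them is visible; the dictionaries simply encode the numbers from the slide.

```python
# Posterior weights and per-hypothesis predictive distributions from the slide.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_label_given_h = {"h1": {"+": 1.0, "-": 0.0},
                   "h2": {"+": 0.0, "-": 1.0},
                   "h3": {"+": 0.0, "-": 1.0}}

# MAP classification: use only the single most probable hypothesis.
h_map = max(posterior, key=posterior.get)
v_map = max(p_label_given_h[h_map], key=p_label_given_h[h_map].get)

# Bayes optimal classification: weight every hypothesis by P(h | D).
v_boc = max(
    ("+", "-"),
    key=lambda v: sum(p_label_given_h[h][v] * posterior[h] for h in posterior),
)
print(v_map, v_boc)   # '+' from h_MAP, '-' from the Bayes optimal classifier
```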

Terminology

Introduction to Bayesian Learning
- Probability foundations
  - Definitions: subjectivist, frequentist, logicist
  - (3) Kolmogorov axioms
- Bayes's Theorem
  - Prior probability of an event
  - Joint probability of an event
  - Conditional (posterior) probability of an event
- Maximum A Posteriori (MAP) and Maximum Likelihood (ML) Hypotheses
  - MAP hypothesis: highest conditional probability given the observations (data)
  - ML: highest likelihood of generating the observed data
  - ML estimation (MLE): estimating parameters to find the ML hypothesis
- Bayesian Inference: computing conditional probabilities (CPs) in a model
- Bayesian Learning: searching the model (hypothesis) space using CPs

Summary Points
- Introduction to Bayesian Learning
  - Framework: using probabilistic criteria to search H
  - Probability foundations
    - Definitions: subjectivist, objectivist; Bayesian, frequentist, logicist
    - Kolmogorov axioms
- Bayes's Theorem
  - Definition of conditional (posterior) probability
  - Product rule
- Maximum A Posteriori (MAP) and Maximum Likelihood (ML) Hypotheses
  - Bayes's Rule and MAP
  - Uniform priors: allow use of MLE to generate MAP hypotheses
  - Relation to version spaces, candidate elimination
- Next Week: 6.6-6.10, Mitchell; Chapters 14-15, Russell and Norvig; Roth
  - More Bayesian learning: MDL, BOC, Gibbs, Simple (Naïve) Bayes
  - Learning over text