CS 2750 Machine Learning. Lecture 5: Density estimation. Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square.

Announcements. Homework: due on Wednesday before the class. Reports: hand in before the class. Programs: submit electronically. Collaborations on homeworks: you may discuss the material with your fellow students, but the reports and programs should be written individually.

Outline. Density estimation: maximum likelihood (ML), maximum a posteriori (MAP), Bayesian estimation. Bernoulli distribution. Binomial distribution. Multinomial distribution. Normal distribution.

Density estimation. Data: D = {D1, D2, ..., Dn}, where Di = xi is a vector of attribute values. Attributes are modeled by random variables X = {X1, X2, ..., Xd} with continuous or discrete values, e.g. blood pressure with numerical values, or chest pain with discrete values [no-pain, mild, moderate, strong]. Underlying true probability distribution: p(X).

Density estimation. Data: D = {D1, D2, ..., Dn}, where Di = xi is a vector of attribute values. Objective: estimate the underlying true probability distribution over the variables X, p(X), using the examples in D: the true distribution p(X) generates the n samples D = {D1, D2, ..., Dn}, from which we compute the estimate p̂(X). Standard (i.i.d.) assumptions: the samples are independent of each other and come from the same, identical distribution (a fixed p(X)).

Density estimation. Types of density estimation: Parametric: the distribution is modeled using a set of parameters Θ, p(X | Θ); example: the mean and covariances of a multivariate normal. Estimation: find the parameters Θ describing the data D. Non-parametric: the model of the distribution utilizes all examples in D, as if all examples were parameters of the distribution; example: nearest-neighbor. Semi-parametric.

Learning via parameter estimation. In this lecture we consider parametric density estimation. Basic settings: a set of random variables X = {X1, X2, ..., Xd}; a model of the distribution over the variables in X with parameters Θ: p̂(X | Θ); data D = {D1, D2, ..., Dn}. Objective: find the parameters Θ̂ that describe p(X | Θ) the best.

Parameter estimation. Maximum likelihood (ML): maximize p(D | Θ, ξ). Yields one set of parameters Θ_ML; the target distribution is approximated as p̂(X) = p(X | Θ_ML). Bayesian parameter estimation: uses the posterior distribution over possible parameters, p(Θ | D, ξ) = p(D | Θ, ξ) p(Θ | ξ) / p(D | ξ). Yields all possible settings of Θ and their weights; the target distribution is approximated as p̂(X) = p(X | D) = ∫ p(X | Θ) p(Θ | D, ξ) dΘ.

Parameter estimation. Other possible criteria: Maximum a posteriori probability (MAP): maximize p(Θ | D, ξ), the mode of the posterior. Yields one set of parameters Θ_MAP; approximation: p̂(X) = p(X | Θ_MAP). Expected value of the parameter: Θ̂ = E[Θ], the mean of the posterior, with the expectation taken with regard to the posterior p(Θ | D, ξ). Yields one set of parameters; approximation: p̂(X) = p(X | Θ̂).

Parameter estimation. Coin example: we have a coin that can be biased. Outcomes: two possible values, head or tail. Data: D, a sequence of outcomes xi such that xi = 1 for a head and xi = 0 for a tail. Model: θ is the probability of a head, (1 − θ) the probability of a tail. Objective: we would like to estimate the probability of a head, θ̂, from the data.

Parameter estimation. Example. Assume an unknown and possibly biased coin, with probability of a head θ. Data: H H T T H H T H T H T T T H T H H H H T H H H H T. Heads: 15. Tails: 10. What would be your estimate of the probability of a head, θ̃?

Solution: use the frequencies of occurrences to make the estimate: θ̃ = 15/25 = 0.6. This is the maximum likelihood estimate of the parameter θ.

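A minimal sketch of this frequency-based estimate in Python (an illustration added here, not part of the original slides; the string encoding and variable names are just for demonstration):

```python
# ML estimate of the head probability for a possibly biased coin,
# computed as the frequency of heads in the observed sequence.
data = "HHTTHHTHTHTTTHTHHHHTHHHHT"  # the 25-flip sequence from the slide

n1 = data.count("H")       # number of heads: 15
n2 = data.count("T")       # number of tails: 10
theta_ml = n1 / (n1 + n2)  # frequency estimate
print(theta_ml)            # 0.6
```
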
Probability of an outcome. Data: D, a sequence of outcomes xi such that xi = 1 for a head and xi = 0 for a tail. Model: θ probability of a head, (1 − θ) probability of a tail. Assume we know the probability θ. The probability of an outcome x of a single coin flip is then P(x | θ) = θ^x (1 − θ)^(1 − x), the Bernoulli distribution. It combines the probabilities of a head and a tail so that the exponent x picks the correct one: it gives θ for x = 1 and (1 − θ) for x = 0.

Probability of a sequence of outcomes. Assume a sequence of independent coin flips D = H H T H T H, encoded as D = 110101. What is the probability of observing the data sequence D, P(D | θ)?

Probability of a sequence of outcomes. Because the flips are independent, P(D | θ) = θ · θ · (1 − θ) · θ · (1 − θ) · θ. This is the likelihood of the data.

Probability of a sequence of outcomes. The product equals P(D | θ) = θ^4 (1 − θ)^2, and it can be rewritten using the Bernoulli distribution: P(D | θ) = ∏_{i=1}^{6} θ^{xi} (1 − θ)^{1 − xi}.

The goodness of fit to the data. Learning: we do not know the value of the parameter θ. Our learning goal: find the parameter θ that fits the data D the best. One solution to "the best": maximize the likelihood P(D | θ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1 − xi}. Intuition: the more likely the data are given the model, the better the fit. Note: instead of an error function that measures how badly the data fit the model, we have a measure that tells us how well the data fit: Error(D, θ) = − P(D | θ).

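The likelihood is straightforward to mirror in code. A minimal sketch under the 0/1 encoding above (an added illustration; the function name is hypothetical):

```python
# Likelihood of an observed coin-flip sequence under a Bernoulli model:
# P(D | theta) = prod_i theta^x_i * (1 - theta)^(1 - x_i).
def likelihood(theta, xs):
    p = 1.0
    for x in xs:
        p *= theta if x == 1 else (1.0 - theta)
    return p

D = [1, 1, 0, 1, 0, 1]       # H H T H T H encoded as 110101
print(likelihood(0.5, D))    # 0.5**6 = 0.015625
print(likelihood(2/3, D))    # ~0.0219, higher: 2/3 is the ML estimate here
```
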
Example: Bernoulli distribution. Coin example: we have a coin that can be biased. Outcomes: two possible values, head or tail. Data: D, a sequence of outcomes xi such that xi = 1 for a head and xi = 0 for a tail. Model: θ probability of a head, (1 − θ) probability of a tail. Objective: estimate the probability of a head, θ̂. Probability of an outcome: P(x | θ) = θ^x (1 − θ)^(1 − x), the Bernoulli distribution.

Maximum likelihood (ML) estimate. Likelihood of the data: P(D | θ, ξ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1 − xi} = θ^{N1} (1 − θ)^{N2}, where N1 is the number of heads seen and N2 the number of tails seen. Maximum likelihood estimate: θ_ML = argmax_θ P(D | θ, ξ). Optimizing the log-likelihood is the same as maximizing the likelihood: l(D, θ) = log P(D | θ, ξ) = Σ_{i=1}^{n} [xi log θ + (1 − xi) log(1 − θ)] = N1 log θ + N2 log(1 − θ).

Maximum likelihood (ML) estimate. Optimize the log-likelihood l(D, θ) = N1 log θ + N2 log(1 − θ). Set the derivative to zero: ∂l(D, θ)/∂θ = N1/θ − N2/(1 − θ) = 0. Solving for θ gives the solution θ_ML = N1 / (N1 + N2).

Maximum likelihood estimate. Example. Assume an unknown and possibly biased coin, with probability of a head θ. Data: H H T T H H T H T H T T T H T H H H H T H H H H T. Heads: 15. Tails: 10. What is the ML estimate of the probability of a head and a tail?

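As a sanity check (an addition, not part of the lecture), a coarse grid search confirms that the closed-form solution maximizes the log-likelihood for the running example:

```python
# Numerically confirm that theta_ML = N1 / (N1 + N2) maximizes
# l(D, theta) = N1 log(theta) + N2 log(1 - theta).
import math

n1, n2 = 15, 10  # heads and tails from the running example

def log_likelihood(theta):
    return n1 * math.log(theta) + n2 * math.log(1.0 - theta)

grid = [i / 1000 for i in range(1, 1000)]  # theta values in (0, 1)
print(max(grid, key=log_likelihood))       # 0.6 = n1 / (n1 + n2)
```
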
Maximum likelihood estimate. Example. With 15 heads and 10 tails, the ML estimates are: head: θ_ML = 15/25 = 0.6; tail: 1 − θ_ML = 10/25 = 0.4.

Maximum a posteriori estimate. Selects the mode of the posterior distribution: θ_MAP = argmax_θ p(θ | D, ξ). How to choose the prior probability? Via Bayes' rule, p(θ | D, ξ) = P(D | θ, ξ) p(θ | ξ) / P(D | ξ), where P(D | θ, ξ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1 − xi} = θ^{N1} (1 − θ)^{N2} is the likelihood of the data, p(θ | ξ) is the prior probability on θ, and P(D | ξ) is the normalizing factor.

Prior distribution. Choice of prior: the Beta distribution, p(θ | ξ) = Beta(θ | α, β) = [Γ(α + β) / (Γ(α) Γ(β))] θ^(α−1) (1 − θ)^(β−1), where Γ(x) is the Gamma function; for integer x, Γ(x) = (x − 1)!. The posterior distribution is then again a Beta distribution: p(θ | D, ξ) = P(D | θ, ξ) p(θ | ξ) / P(D | ξ) = Beta(θ | α + N1, β + N2). Why use the Beta distribution? The Beta distribution fits Bernoulli trials; they are conjugate choices. [Figure: Beta distribution densities over θ ∈ (0, 1) for several (α, β) settings, including α = 0.5, β = 0.5.]

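A minimal sketch of the conjugate Beta-Bernoulli update (an added illustration; variable names are ours): the posterior parameters are obtained by simply adding the observed counts to the prior counts, with no integration required:

```python
# Conjugate update: Beta(alpha, beta) prior + (N1 heads, N2 tails)
# -> Beta(alpha + N1, beta + N2) posterior.
alpha, beta = 5, 5  # prior "counts" from the slide example
n1, n2 = 15, 10     # observed heads and tails

alpha_post, beta_post = alpha + n1, beta + n2
print(alpha_post, beta_post)  # Beta(20, 15) posterior
```
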
Maximum a posteriori probability. The MAP estimate selects the mode of the posterior distribution p(θ | D, ξ) = P(D | θ, ξ) p(θ | ξ) / P(D | ξ) = Beta(θ | α + N1, β + N2). Notice that the parameters of the prior act like counts of heads and tails (they are sometimes also referred to as prior counts). Solution (the mode of the Beta posterior): θ_MAP = (N1 + α − 1) / (N1 + N2 + α + β − 2).

MAP estimate example. Assume an unknown and possibly biased coin, with probability of a head θ. Data: H H T T H H T H T H T T T H T H H H H T H H H H T. Heads: 15. Tails: 10. Assume the prior p(θ | ξ) = Beta(θ | 5, 5). What is the MAP estimate?

MAP estimate example. With 15 heads, 10 tails, and the prior Beta(θ | 5, 5): θ_MAP = (15 + 5 − 1) / (15 + 10 + 5 + 5 − 2) = 19/33.

MAP estimate example. Note that the prior and the data fit (data likelihood) are combined. The MAP estimate can be biased by large prior counts, and it is hard to overturn them with a smaller sample size. Data: H H T T H H T H T H T T T H T H H H H T H H H H T (heads: 15, tails: 10). With p(θ | ξ) = Beta(θ | 5, 5): θ_MAP = 19/33 ≈ 0.58. With p(θ | ξ) = Beta(θ | 5, 20): θ_MAP = 19/48 ≈ 0.40.

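A minimal sketch computing both MAP estimates from the example (an added illustration; the helper theta_map is hypothetical):

```python
# MAP estimate: the mode of the Beta(alpha + N1, beta + N2) posterior.
def theta_map(n1, n2, alpha, beta):
    return (n1 + alpha - 1) / (n1 + n2 + alpha + beta - 2)

print(theta_map(15, 10, 5, 5))   # 19/33 ~= 0.576 (weak, symmetric prior)
print(theta_map(15, 10, 5, 20))  # 19/48 ~= 0.396 (large tail prior counts bias the estimate)
```
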
Bayesian framework. Both the ML and the MAP estimates pick a single value of the parameter. Assume there are two different parameter settings that are close in terms of their probability values: using only one of them may introduce a strong bias if we use it, for example, for predictions. The Bayesian parameter estimate remedies this limitation of a single choice: it uses all possible parameter values, with the posterior p(θ | D, ξ) = Beta(θ | α + N1, β + N2). The posterior can be used to define p̂(X): p̂(X) = p(X | D) = ∫ p(X | Θ) p(Θ | D, ξ) dΘ.

Bayesian framework. Predictive probability of the outcome x = 1 in the next trial: P(x = 1 | D, ξ) = ∫₀¹ P(x = 1 | θ, ξ) p(θ | D, ξ) dθ = ∫₀¹ θ p(θ | D, ξ) dθ = E[θ], where p(θ | D, ξ) = Beta(θ | α + N1, β + N2) is the posterior density. The predictive probability is therefore the expected value of the parameter, with the expectation taken with regard to the posterior distribution.

Expected value of the parameter. How do we obtain the expected value? For θ with density Beta(θ | α, β): E[θ] = ∫₀¹ θ Beta(θ | α, β) dθ = [Γ(α + β) / (Γ(α) Γ(β))] ∫₀¹ θ^α (1 − θ)^(β−1) dθ = [Γ(α + β) / (Γ(α) Γ(β))] · [Γ(α + 1) Γ(β) / Γ(α + β + 1)] = α / (α + β). Note: Γ(x + 1) = x Γ(x); for integer x, Γ(x + 1) = x!.

Expected value of the parameter. Substituting the posterior p(θ | D, ξ) = Beta(θ | α + N1, β + N2), we get θ̂ = E[θ] = (α + N1) / (α + β + N1 + N2). Note that the mean of the posterior is yet another reasonable parameter choice.
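
A minimal sketch of this predictive probability (an added illustration; the helper predictive_heads is hypothetical):

```python
# Bayesian predictive probability of a head in the next trial:
# the posterior mean E[theta] = (alpha + N1) / (alpha + beta + N1 + N2).
def predictive_heads(n1, n2, alpha, beta):
    return (alpha + n1) / (alpha + beta + n1 + n2)

print(predictive_heads(15, 10, 5, 5))  # 20/35 ~= 0.571
```

For the running example (15 heads, 10 tails, Beta(5, 5) prior) this gives 20/35 ≈ 0.571, close to the MAP estimate 19/33 ≈ 0.576 and the ML estimate 0.6, as expected for a weak symmetric prior.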