Expected Values and Probability Density Functions

Intelligent Systems: Reasoning and Recognition
James L. Crowley
ENSIMAG / MoSIG, Second Semester 2010/2011
Lesson 5, 8 April 2011

Expected Values and Probability Density Functions

Contents:
Notation
Bayesian Classification (Reminder)
Expected Values and Moments
    The average value is the first moment of the samples
Probability Density Functions
Expected Values for PDFs
The Normal (Gaussian) Density Function
Multivariate Normal Density Functions

Bibliography:
Pattern Recognition and Machine Learning, C. M. Bishop, Springer Verlag, 2006.
Pattern Recognition and Scene Analysis, R. E. Duda and P. E. Hart, Wiley, 1973.

Notation

x            A variable.
X            A random variable (unpredictable value).
N            The number of possible values for X (can be infinite).
$\vec{x}$    A vector of D variables.
$\vec{X}$    A vector of D random variables.
D            The number of dimensions for the vector $\vec{x}$ or $\vec{X}$.
E            An observation; an event.
$C_k$        The class k.
k            Class index.
K            Total number of classes.
$\omega_k$   The statement (assertion) that $E \in T_k$.
$p(\omega_k) = p(E \in T_k)$   Probability that the observation E is a member of the class k.
             Note that $p(\omega_k)$ is lower case.
$M_k$        Number of examples for the class k (think M = Mass).
M            Total number of examples, $M = \sum_{k=1}^{K} M_k$.
$\{X_m^k\}$  A set of $M_k$ examples for the class k; $\{X_m\} = \bigcup_{k=1,K}\{X_m^k\}$.
$P(X)$       Probability density function for X.
$P(\vec{X})$           Probability density function for $\vec{X}$.
$P(\vec{X} \mid \omega_k)$   Probability density for $\vec{X}$ given the class k, where $\omega_k \equiv E \in C_k$.
$h(n)$       A histogram of random values for the feature n.
$h_k(n)$     A histogram of random values for the feature n for the class k.
             $h(x) = \sum_{k=1}^{K} h_k(x)$
Q            Number of cells in h(n); $Q = N^D$.

$$E\{X\} = \frac{1}{M}\sum_{m=1}^{M} X_m$$

$$P(\vec{X} \mid \omega_k) = \frac{1}{(2\pi)^{D/2}\det(C_k)^{1/2}}\, e^{-\frac{1}{2}(\vec{X}-\vec{\mu}_k)^T C_k^{-1}(\vec{X}-\vec{\mu}_k)}$$

Bayesian Classification (Reminder)

Our problem is to build a box that maps a set of features $\vec{X}$ from an observation, E, into a class $C_k$ from a set of K possible classes:

$$(x_1, x_2, \ldots, x_d) \;\rightarrow\; \hat{\omega}_k = \text{Class}(x_1, x_2, \ldots, x_d)$$

Let $\omega_k$ be the proposition that the event belongs to class k: $\omega_k = E \in T_k$.

In order to minimize the number of mistakes, we will maximize the probability that the event E belongs to the class k:

$$\hat{\omega}_k = \arg\max_{\omega_k} \Pr(\omega_k \mid \vec{X})$$

We will rely on two tools for this:

Bayes' rule:
$$p(\omega_k \mid \vec{X}) = \frac{P(\vec{X} \mid \omega_k)\, p(\omega_k)}{P(\vec{X})}$$

Normal density functions:
$$P(\vec{X} \mid \omega_k) = \frac{1}{(2\pi)^{D/2}\det(C_k)^{1/2}}\, e^{-\frac{1}{2}(\vec{X}-\vec{\mu}_k)^T C_k^{-1}(\vec{X}-\vec{\mu}_k)}$$

Today we concentrate on Normal density functions.
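
As an illustration of this decision rule, here is a minimal sketch in Python (using numpy; the class parameters and priors are invented for the example and do not come from the lecture) that selects the class maximizing $p(x \mid \omega_k)\,p(\omega_k)$ for a one-dimensional feature. The evidence $p(x)$ is omitted because it is the same for every class and does not change the arg max.

    import numpy as np

    def normal_pdf(x, mu, sigma):
        # N(x; mu, sigma): univariate Gaussian density
        return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

    # Hypothetical class parameters (mu_k, sigma_k) and priors p(omega_k)
    params = {1: (1.0, 0.5), 2: (3.0, 1.0)}
    priors = {1: 0.4, 2: 0.6}

    def classify(x):
        # omega_hat = argmax_k p(x | omega_k) * p(omega_k)
        scores = {k: normal_pdf(x, mu, s) * priors[k] for k, (mu, s) in params.items()}
        return max(scores, key=scores.get)

    print(classify(1.2))   # -> 1
    print(classify(2.8))   # -> 2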

Expected Values and Moments

The average value is the first moment of the samples.

For a numerical feature value in $\{X_m\}$, the expected value E{X} is defined as the average, or the mean:

$$\mu_x = E\{X\} = \frac{1}{M}\sum_{m=1}^{M} X_m$$

E{X} is the first moment (or center of gravity) of the values of $\{X_m\}$. This can be seen from the histogram h(x).

The mass of the histogram is the zeroth moment, $M = \sum_{n=1}^{N} h(n)$, which is also the number of samples used to compute h(n).

The expected value of X is the average µ:

$$\mu = E\{X_m\} = \frac{1}{M}\sum_{m=1}^{M} X_m$$

This is also the expected value over the histogram cells n:

$$\mu = \frac{1}{M}\sum_{n=1}^{N} h(n)\cdot n$$

Thus the center of gravity of the histogram is the expected value of the random variable:

$$\mu = E\{X_m\} = \frac{1}{M}\sum_{m=1}^{M} X_m = \frac{1}{M}\sum_{n=1}^{N} h(n)\cdot n$$

The second moment is the expected deviation from the first moment:

$$\sigma^2 = E\{(X - E\{X\})^2\} = \frac{1}{M}\sum_{m=1}^{M} (X_m - \mu)^2 = \frac{1}{M}\sum_{n=1}^{N} h(n)\cdot (n - \mu)^2$$
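
These relations can be checked numerically. The following sketch (Python with numpy; the sample values are arbitrary) computes the first and second moments both directly from the samples and from the histogram h(n), using the same 1/M convention as above:

    import numpy as np

    # Arbitrary integer-valued samples X_m with values in 1..N
    X = np.array([2, 3, 3, 4, 4, 4, 5, 6])
    M = len(X)
    N = 6

    # Histogram h(n) over the N possible values
    n = np.arange(1, N + 1)
    h = np.array([np.sum(X == v) for v in n])

    mu_samples = X.sum() / M                          # first moment from the samples
    mu_hist = (h * n).sum() / M                       # first moment from the histogram
    var_samples = ((X - mu_samples) ** 2).sum() / M   # second central moment from the samples
    var_hist = (h * (n - mu_hist) ** 2).sum() / M     # second central moment from the histogram

    print(mu_samples, mu_hist)     # identical: 3.875 3.875
    print(var_samples, var_hist)   # identical: 1.359375 1.359375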

Probability Density Functions

In many cases, the number of possible feature values, N, or the number of features, D, makes a histogram-based approach infeasible. In such cases we can replace h(x) with a probability density function (pdf). A probability density function of a continuous random variable is a function that describes the relative likelihood for this random variable to occur at a given point in the observation space.

Note: likelihood is not probability. Definition: likelihood is a relative measure of belief or certainty. We will use the likelihood to determine the parameters for parametric models of probability density functions. To do this, we first need to define probability density functions.

A probability density function, $p(\vec{X})$, is a function of a continuous variable or vector, $\vec{X} \in R^D$, of random variables such that:

1) $\vec{X}$ is a vector of D real-valued random variables with values in $[-\infty, \infty]$
2) $\int_{-\infty}^{\infty} p(\vec{X})\, d\vec{X} = 1$

In this case we replace the histogram h(n) with the density $p(\vec{X})$. For the Bayesian conditional density (where $\omega_k = E \in T_k$), we replace $h_k(n)$ with $p(\vec{X} \mid \omega_k)$.

Thus:

$$p(\omega_k \mid \vec{X}) = \frac{p(\vec{X} \mid \omega_k)}{p(\vec{X})}\, p(\omega_k)$$

Note that the ratio of two pdfs gives a probability value. This equation can be interpreted as: posterior probability = likelihood × prior probability.
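
To make the link between h(n) and p(x) concrete, this sketch (Python with numpy; the samples are synthetic draws, purely illustrative) normalizes a histogram by the number of samples M and by the cell width so that, like a pdf, it integrates to 1:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(loc=0.0, scale=1.0, size=10000)   # synthetic samples, purely illustrative
    M = len(X)

    bins = np.linspace(-4.0, 4.0, 41)            # 40 cells covering the observed range
    h, edges = np.histogram(X, bins=bins)        # h(n): number of samples per cell
    width = edges[1] - edges[0]

    p = h / (M * width)          # normalized histogram: an approximation of p(x)
    print(np.sum(p * width))     # ~ 1.0 (up to samples falling outside [-4, 4])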

The probability for a random variable to fall within a given set is given by the integral of its density:

$$p(X \in [A, B]) = \int_{A}^{B} p(x)\, dx$$

Thus some authors work with cumulative distribution functions:

$$P(x) = \int_{x'=-\infty}^{x} p(x')\, dx'$$

Probability density functions are a primary tool for designing recognition machines. There is one more tool we need: Gaussian (Normal) density functions.
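
Before moving on, the relation between the density integral and the cumulative distribution can be checked numerically. A minimal sketch in Python (µ, σ, A and B are arbitrary choices; the closed-form cdf of the Gaussian is used for comparison):

    import numpy as np
    from math import erf, sqrt

    mu, sigma = 0.0, 1.0
    A, B = -1.0, 1.0

    def pdf(x):
        # Gaussian density N(x; mu, sigma)
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

    def cdf(x):
        # P(x) = integral of p from -infinity to x (closed form for the Gaussian)
        return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

    x = np.linspace(A, B, 2001)
    prob_integral = np.trapz(pdf(x), x)   # numerical integral of p(x) over [A, B]
    prob_cdf = cdf(B) - cdf(A)
    print(prob_integral, prob_cdf)        # both ~ 0.6827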

Expected Values for PDFs

Just as with histograms, the expected value is the first moment of a pdf. Remember that for a pdf the mass is, by definition, 1:

$$\int_{-\infty}^{\infty} p(x)\, dx = 1$$

For the continuous random variables $\{X_m\}$ and the pdf p(x):

$$E\{X\} = \frac{1}{M}\sum_{m=1}^{M} X_m = \int_{-\infty}^{\infty} p(x)\cdot x\, dx$$

The second moment is:

$$\sigma^2 = E\{(X-\mu)^2\} = \frac{1}{M}\sum_{m=1}^{M} (X_m - \mu)^2 = \int_{-\infty}^{\infty} p(x)\cdot (x-\mu)^2\, dx$$

Note that

$$\sigma^2 = E\{(X-\mu)^2\} = E\{X^2\} - \mu^2 = E\{X^2\} - E\{X\}^2$$

Note that this is a biased variance. The unbiased variance would be:

$$\sigma^2 = \frac{1}{M-1}\sum_{m=1}^{M} (X_m - \mu)^2$$

If we draw a random sample $\{X_m\}$ of M random variables from a Normal density with parameters $(\mu, \sigma)$,

$$\{X_m\} \sim N(x; \mu, \sigma)$$

and then compute the moments, we obtain

$$\hat{\mu} = E\{X_m\} = \frac{1}{M}\sum_{m=1}^{M} X_m$$

and

$$\hat{\sigma}^2 = \frac{1}{M}\sum_{m=1}^{M} (X_m - \hat{\mu})^2, \qquad \text{where } \tilde{\sigma}^2 = \frac{M}{M-1}\,\hat{\sigma}^2.$$

Note the notation: ~ means true, ^ means estimated. The expectation underestimates the variance by 1/M. Later we will see that the expected RMS error for estimating p(X) from M samples is related to this bias. But first, we need to examine the Gaussian (or Normal) density function.
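
A quick simulation illustrates the bias (a Python sketch; the true parameters µ = 0, σ = 1 and the sample size M = 10 are arbitrary): averaged over many draws, the 1/M estimator comes out near $\frac{M-1}{M}\sigma^2$, while the 1/(M−1) estimator comes out near $\sigma^2$.

    import numpy as np

    rng = np.random.default_rng(1)
    M, trials = 10, 100000
    biased, unbiased = [], []
    for _ in range(trials):
        X = rng.normal(0.0, 1.0, size=M)    # M samples from N(x; 0, 1)
        mu_hat = X.mean()
        biased.append(((X - mu_hat) ** 2).sum() / M)          # 1/M estimator
        unbiased.append(((X - mu_hat) ** 2).sum() / (M - 1))  # 1/(M-1) estimator

    print(np.mean(biased))     # ~ 0.9 = (M-1)/M * sigma^2
    print(np.mean(unbiased))   # ~ 1.0 = sigma^2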

The Normal (Gaussian) Density Function

Whenever a random variable is determined by a sequence of independent random events, the outcome will be a Normal or Gaussian density function. This is demonstrated by the Central Limit Theorem. The essence of the derivation is that repeated convolution of any finite density function will tend asymptotically to a Gaussian (or Normal) function:

$$p(x) = N(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

[Figure: the univariate density $N(x; \mu, \sigma)$ plotted against x, with markers at $\mu-\sigma$, $\mu$ and $\mu+\sigma$.]

The parameters of $N(x; \mu, \sigma)$ are the first and second moments. This is sometimes expressed as a conditional: $N(X \mid \mu, \sigma)$.

In most cases, for any density p(x), repeated convolution gives

$$p(x) * p(x) * \cdots * p(x) \;\rightarrow\; N(x; \mu, \sigma)$$

This is the Central Limit Theorem. An exception is the Dirac delta, $p(x) = \delta(x)$.
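
The Central Limit Theorem can be observed empirically. The sketch below (Python with numpy; the choice of a uniform density and of 12 terms is arbitrary) sums independent uniform random variables, which amounts to repeatedly convolving their density, and checks the moments of the sum against those of the limiting Gaussian:

    import numpy as np

    rng = np.random.default_rng(2)
    K, trials = 12, 100000
    # Sum of K independent uniforms on [0, 1]; its density is the K-fold
    # convolution of the uniform density and is already close to a Gaussian.
    S = rng.uniform(0.0, 1.0, size=(trials, K)).sum(axis=1)

    print(S.mean())   # ~ K * 0.5 = 6.0  (first moment of the limiting Gaussian)
    print(S.var())    # ~ K / 12 = 1.0   (second moment of the limiting Gaussian)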

Multivariate Normal Density Functions

In most practical cases, an observation is described by D features. In this case a training set $\{\vec{X}_m\}$ can be used to calculate an average feature vector $\vec{\mu}$:

$$\vec{\mu} = E\{\vec{X}\} = \frac{1}{M}\sum_{m=1}^{M} \vec{X}_m = \begin{pmatrix} E\{X_1\} \\ E\{X_2\} \\ \vdots \\ E\{X_D\} \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_D \end{pmatrix}$$

If the features are mapped onto integers from [1, N], $\{\vec{X}_m\} \rightarrow \{\vec{n}_m\}$, we can build a multi-dimensional histogram using a D-dimensional table:

for m = 1 to M: $h(\vec{n}_m) \leftarrow h(\vec{n}_m) + 1$

As before, the average feature vector $\vec{\mu}$ is the center of gravity (first moment) of the histogram:

$$\mu_d = E\{n_d\} = \frac{1}{M}\sum_{m=1}^{M} n_{dm} = \frac{1}{M}\sum_{n_1=1}^{N}\sum_{n_2=1}^{N}\cdots\sum_{n_D=1}^{N} h(n_1, n_2, \ldots, n_D)\cdot n_d$$

$$\vec{\mu} = E\{\vec{n}\} = \frac{1}{M}\sum_{\vec{n}} h(\vec{n})\cdot\vec{n} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_D \end{pmatrix}$$

For real-valued $\vec{X}$:

$$\mu_d = E\{X_d\} = \frac{1}{M}\sum_{m=1}^{M} X_{dm} = \int\!\!\int\!\cdots\!\int p(x_1, x_2, \ldots, x_D)\cdot x_d\; dx_1\, dx_2 \cdots dx_D$$

In any case:

$$\vec{\mu} = E\{\vec{X}\} = \begin{pmatrix} E\{X_1\} \\ E\{X_2\} \\ \vdots \\ E\{X_D\} \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_D \end{pmatrix}$$
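
A small sketch (Python with numpy; the integer-valued 2-D samples are invented, with N = 4 and D = 2) accumulates the D-dimensional histogram exactly as in the loop above and checks that its center of gravity equals the mean vector computed directly from the samples:

    import numpy as np

    # Invented integer-valued feature vectors n_m with components in 1..N, D = 2
    samples = np.array([[1, 2], [2, 2], [2, 3], [3, 4], [4, 4]])
    M, D = samples.shape
    N = 4

    h = np.zeros((N, N))            # D-dimensional histogram table h(n1, n2)
    for n in samples:
        h[n[0] - 1, n[1] - 1] += 1  # h(n_m) <- h(n_m) + 1

    # Center of gravity of the histogram, component by component
    grid = np.indices((N, N)) + 1   # grid[d] holds the value of n_(d+1) in each cell
    mu_hist = np.array([(h * grid[d]).sum() / M for d in range(D)])

    mu_direct = samples.mean(axis=0)
    print(mu_hist, mu_direct)       # identical: [2.4 3.] [2.4 3.]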

For D dimensions, the second moment is a covariance matrix composed of $D^2$ terms:

$$\sigma_{ij}^2 = E\{(X_i - \mu_i)(X_j - \mu_j)\} = \frac{1}{M}\sum_{m=1}^{M} (X_{im} - \mu_i)(X_{jm} - \mu_j)$$

This is often written

$$\Sigma = E\{(\vec{X} - E\{\vec{X}\})(\vec{X} - E\{\vec{X}\})^T\}$$

and gives

$$\Sigma = \begin{pmatrix} \sigma_{11}^2 & \sigma_{12}^2 & \cdots & \sigma_{1D}^2 \\ \sigma_{21}^2 & \sigma_{22}^2 & \cdots & \sigma_{2D}^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{D1}^2 & \sigma_{D2}^2 & \cdots & \sigma_{DD}^2 \end{pmatrix}$$

This provides the parameters for

$$p(\vec{X}) = N(\vec{X}; \vec{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2}\det(\Sigma)^{1/2}}\, e^{-\frac{1}{2}(\vec{X}-\vec{\mu})^T \Sigma^{-1}(\vec{X}-\vec{\mu})}$$

The term in the exponent is a positive quadratic (2nd order) form. This value is known as the Mahalanobis distance:

$$d(\vec{X}; \vec{\mu}, \Sigma) = (\vec{X}-\vec{\mu})^T \Sigma^{-1}(\vec{X}-\vec{\mu})$$

This is a distance normalized by the covariance. In this case, the covariance is said to provide the distance metric. This is very useful when the components of $\vec{X}$ have different units. The result can be visualized by looking at the equi-probability contours.

[Figure: equi-probability contours of $p(\vec{x} \mid \vec{\mu}, \Sigma)$ in the $(x_1, x_2)$ plane.]
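
The following sketch (Python with numpy; the 2-D sample is synthetic) estimates $\vec{\mu}$ and $\Sigma$ from data, evaluates the multivariate Normal density at a test point, and reports the Mahalanobis distance of that point:

    import numpy as np

    rng = np.random.default_rng(3)
    # Synthetic 2-D samples with correlated components (parameters are arbitrary)
    X = rng.multivariate_normal([1.0, 2.0], [[1.0, 0.6], [0.6, 2.0]], size=1000)
    M, D = X.shape

    mu = X.mean(axis=0)             # first moment: mean vector
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / M         # second moment: covariance matrix (1/M convention)

    def multivariate_normal_pdf(x):
        d = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)   # Mahalanobis distance
        norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * d) / norm, d

    p, d = multivariate_normal_pdf(np.array([1.5, 2.5]))
    print(mu, Sigma)
    print(p, d)   # density at the test point and its Mahalanobis distance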

If $x_i$ and $x_j$ are statistically independent, then $\sigma_{ij}^2 = 0$. For positive values of $\sigma_{ij}^2$, $x_i$ and $x_j$ vary together. For negative values of $\sigma_{ij}^2$, $x_i$ and $x_j$ vary in opposite directions. For example, consider the features $x_1$ = height (m) and $x_2$ = weight (kg). In most people height and weight vary together, and so $\sigma_{12}^2$ would be positive.