Curve Fitting Re-visited, Bishop 1.2.5

Maximum Likelihood (Bishop 1.2.5): model, likelihood, differentiation.

Maximum Likelihood. The likelihood function is

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\bigl(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\bigr). \quad (1.61)$$

As we did in the case of the simple Gaussian distribution earlier, it is convenient to maximize the logarithm of the likelihood function. Substituting for the form of the Gaussian distribution, given by (1.46), we obtain the log likelihood function in the form

$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi). \quad (1.62)$$

Consider first the determination of the maximum likelihood solution for the polynomial coefficients $\mathbf{w}_{\mathrm{ML}}$: maximizing (1.62) with respect to $\mathbf{w}$ is equivalent to minimizing the sum-of-squares error. Maximizing with respect to $\beta$ gives

$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n\}^2. \quad (1.63)$$

Given the estimates $\mathbf{w}_{\mathrm{ML}}$ and $\beta_{\mathrm{ML}}$, we can make predictions for new values of $x$ using the predictive distribution

$$p(t \mid x, \mathbf{w}_{\mathrm{ML}}, \beta_{\mathrm{ML}}) = \mathcal{N}\bigl(t \mid y(x, \mathbf{w}_{\mathrm{ML}}), \beta_{\mathrm{ML}}^{-1}\bigr). \quad (1.64)$$

We can now take a step towards a more Bayesian approach and introduce a prior distribution over the polynomial coefficients $\mathbf{w}$.
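As a concrete illustration of (1.61)-(1.64), here is a small NumPy sketch (my own, not from the lecture; the degree M = 3 and the toy data are arbitrary choices) that fits w_ML by least squares and estimates β_ML from the residuals:

```python
import numpy as np

# Hypothetical illustration of Bishop Eqs. (1.61)-(1.64): maximum likelihood
# fitting of a degree-M polynomial y(x, w) with Gaussian noise of precision beta.
# Function and variable names are my own, not from the lecture.

def fit_ml_polynomial(x, t, M=3):
    """Return w_ML and beta_ML for a polynomial of degree M."""
    # Design matrix with columns 1, x, x^2, ..., x^M
    Phi = np.vander(x, M + 1, increasing=True)
    # w_ML minimises the sum-of-squares error (least squares)
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    # Eq. (1.63): 1/beta_ML is the mean squared residual
    residuals = Phi @ w_ml - t
    beta_ml = 1.0 / np.mean(residuals ** 2)
    return w_ml, beta_ml

# Toy data: noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 10)

w_ml, beta_ml = fit_ml_polynomial(x, t, M=3)
# Predictive distribution (1.64): mean y(x, w_ML), variance 1/beta_ML
x_new = 0.5
mean = np.vander(np.array([x_new]), 4, increasing=True) @ w_ml
print(mean, 1.0 / beta_ml)
```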

Predictive Distribution

MAP: A Step towards Bayes (1.2.5). Determine $\mathbf{w}_{\mathrm{MAP}}$ by minimizing the regularized sum-of-squares error $\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n,\mathbf{w}) - t_n\}^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$.

Entropy (1.6): an important quantity in coding theory, statistical physics, and machine learning.

Differential Entropy. Put bins of width $\Delta$ along the real line and take the limit $\Delta \to 0$. For fixed variance, the differential entropy is maximized when $p(x)$ is Gaussian, in which case $H[x] = \frac{1}{2}\{1 + \ln(2\pi\sigma^2)\}$.

The Kullback-Leibler Divergence. $p$ is the true distribution and $q$ is the approximating distribution: $\mathrm{KL}(p \,\|\, q) = -\int p(x) \ln\{q(x)/p(x)\}\, dx \geq 0$, with equality if and only if $p = q$.
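A tiny numeric check of the discrete form of the divergence (my own script, not from the slides), showing that it is non-negative and asymmetric:

```python
import numpy as np

# Discrete KL divergence KL(p || q) = sum_x p(x) ln( p(x) / q(x) ).
# The two toy distributions below are made up for illustration.
def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])   # "true" distribution
q = np.array([0.4, 0.4, 0.2])   # approximating distribution
print(kl_divergence(p, q), kl_divergence(q, p))   # both >= 0, and not equal
```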

Decision Theory. Inference step: determine either the posterior probabilities $p(C_k \mid x)$ or the joint distribution $p(x, C_k)$. Decision step: for given $x$, determine the optimal $t$.

Minimum Misclassification Rate

Mixtures of Gaussians (Bishop 2.3.9) Old Faithful geyser: The time between eruptions has a bimodal distribution, with the mean interval being either 65 or 91 minutes, and is dependent on the length of the prior eruption. Within a margin of error of ±10 minutes, Old Faithful will erupt either 65 minutes after an eruption lasting less than 2½ minutes, or 91 minutes after an eruption lasting more than 2½ minutes. Figure: a single Gaussian versus a mixture of two Gaussians fitted to the data.

Mixtures of Gaussians (Bishop 2.3.9) Combine simple models into a complex model: $p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$, where each $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a component and $\pi_k$ is a mixing coefficient; example with K = 3.

Mixtures of Gaussians (Bishop 2.3.9)

Mixtures of Gaussians (Bishop 2.3.9) Determining the parameters $\pi$, $\mu$, and $\Sigma$ using maximum log likelihood: the log likelihood contains the log of a sum, so there is no closed-form maximum. Solution: use standard, iterative, numeric optimization methods or the expectation maximization (EM) algorithm (Chapter 9).
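A minimal EM sketch for a K-component Gaussian mixture in the spirit of Chapter 9; the function name, initialisation, fixed iteration count, and the small diagonal regulariser are my own choices, not from the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K=2, n_iter=100, seed=0):
    """Fit a K-component Gaussian mixture to X (shape N x D) by EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialise: random data points as means, shared covariance, equal weights
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk) proportional to pi_k N(x_n | mu_k, Sigma_k)
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M step: re-estimate pi, mu, Sigma from the weighted data
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma
```

On two-dimensional data such as the Old Faithful measurements above, calling `em_gmm(X, K=2)` should recover the two visible clusters.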

Homework

Parametric Distributions. Basic building blocks: need to determine $p(x)$ given a finite set of observations $x_1, \ldots, x_N$. Representation: parametric or nonparametric? Recall curve fitting. We focus on Gaussians!

The Gaussian Distribution

Central Limit Theorem The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows. Example: N uniform [0,1] random variables.
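A quick script of my own (not part of the slides) that reproduces the flavour of this example by histogramming the mean of N uniform [0,1] variables for N = 1, 2, 10:

```python
import numpy as np
import matplotlib.pyplot as plt

# The mean of N uniform [0,1] variables looks increasingly Gaussian as N grows.
rng = np.random.default_rng(0)
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, N in zip(axes, [1, 2, 10]):
    means = rng.uniform(0, 1, size=(100_000, N)).mean(axis=1)
    ax.hist(means, bins=50, density=True)
    ax.set_title(f"N = {N}")
plt.tight_layout()
plt.show()
```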

Geometry of the Multivariate Gaussian

Moments of the Multivariate Gaussian (2) A general covariance matrix has $D(D-1)/2 + D = D(D+1)/2$ independent parameters (plus $D$ for the mean). Often we use a diagonal or an isotropic covariance instead, giving $D + D$ or just $D + 1$ parameters in total.

Partitioned Conditionals and Marginals, page 89

Maximum Likelihood for the Gaussian (1) Given i.i.d. data $\mathbf{X} = (x_1, \ldots, x_N)^T$, the log likelihood function is given by

$$\ln p(\mathbf{X} \mid \mu, \Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^T \Sigma^{-1}(x_n - \mu).$$

To maximize it we use the matrix identities (C.21), (C.24), (C.26) and (C.28) from Bishop's Appendix C, such as $\frac{\partial}{\partial A}\operatorname{Tr}(AB) = B^T$ and $\frac{\partial}{\partial A}\ln|A| = (A^{-1})^T$.

Maximum Likelihood for the Gaussian (2) Set the derivative of the log likelihood function with respect to $\mu$ to zero, and solve to obtain $\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n$. Similarly, $\Sigma_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^T$.
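These closed forms are easy to check numerically; a short sketch of my own with made-up parameters:

```python
import numpy as np

# Quick check of the ML estimates for a multivariate Gaussian:
# mu_ML = sample mean, Sigma_ML = biased (1/N) sample covariance.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=500)

mu_ml = X.mean(axis=0)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / len(X)   # note the 1/N (not 1/(N-1)) normalisation

print(mu_ml)
print(Sigma_ml)
```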

Bayes' Theorem for Gaussian Variables. Given $p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1})$ and $p(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1})$, we have $p(y) = \mathcal{N}(y \mid A\mu + b, L^{-1} + A\Lambda^{-1}A^T)$ and $p(x \mid y) = \mathcal{N}\bigl(x \mid \Sigma\{A^T L(y - b) + \Lambda\mu\}, \Sigma\bigr)$, where $\Sigma = (\Lambda + A^T L A)^{-1}$.

Sequential Estimation. Contribution of the $N$-th data point $x_N$: $\mu_{\mathrm{ML}}^{(N)} = \mu_{\mathrm{ML}}^{(N-1)} + \frac{1}{N}\bigl(x_N - \mu_{\mathrm{ML}}^{(N-1)}\bigr)$, i.e. old estimate plus correction weight times the correction given $x_N$.
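A minimal sketch of this update on a stream of data (my own illustration, not from the slides):

```python
import numpy as np

# Sequential mean update mu_N = mu_{N-1} + (1/N)(x_N - mu_{N-1}).
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)

mu = 0.0
for n, x_n in enumerate(data, start=1):
    mu += (x_n - mu) / n          # old estimate + (1/N) * correction

print(mu, data.mean())            # the two agree up to floating-point rounding
```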

Bayesian Inference for the Gaussian (1) Assume $\sigma^2$ is known. Given i.i.d. data $\mathbf{x} = \{x_1, \ldots, x_N\}$, the likelihood function for $\mu$ is given by $p(\mathbf{x} \mid \mu) = \prod_{n=1}^{N}\mathcal{N}(x_n \mid \mu, \sigma^2)$. This has a Gaussian shape as a function of $\mu$ (but it is not a distribution over $\mu$).

Bayesian Inference for the Gaussian (2) Combined with a Gaussian prior over $\mu$, $p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2)$, this gives the posterior $p(\mu \mid \mathbf{x}) \propto p(\mathbf{x} \mid \mu)\,p(\mu)$. Completing the square over $\mu$, we see that $p(\mu \mid \mathbf{x}) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2)$ with $\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\mu_{\mathrm{ML}}$ and $\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$.
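A small helper of my own (hypothetical names, toy numbers) that computes $\mu_N$ and $\sigma_N^2$ directly from these formulas:

```python
import numpy as np

# Posterior over the mean of a Gaussian with known variance sigma2,
# given a N(mu0, sigma02) prior over mu.
def posterior_over_mean(x, sigma2, mu0, sigma02):
    N = len(x)
    sigmaN2 = 1.0 / (1.0 / sigma02 + N / sigma2)
    muN = sigmaN2 * (mu0 / sigma02 + x.sum() / sigma2)
    return muN, sigmaN2

rng = np.random.default_rng(0)
x = rng.normal(loc=0.8, scale=0.1, size=10)        # data with known sigma = 0.1
print(posterior_over_mean(x, sigma2=0.01, mu0=0.0, sigma02=0.1))
```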

Bayesian Inference for the Gaussian (4) Example: the posterior for N = 0, 1, 2 and 10 data points (N = 0 is the prior).

Bayesian Inference for the Gaussian (5) Sequential Estimation The posterior obtained after observing N − 1 data points becomes the prior when we observe the N-th data point.

NON PARAMETRIC

Nonparametric Methods (1) Parametric distribution models are restricted to specific forms, which may not always be suitable; for example, consider modelling a multimodal distribution with a single, unimodal model. Nonparametric approaches make few assumptions about the overall shape of the distribution being modelled, at the cost of many more effective parameters (say, 1000 parameters versus 10).

Nonparametric Methods (2) Histogram methods partition the data space into distinct bins with widths $\Delta_i$ and count the number of observations, $n_i$, in each bin, giving the density estimate $p_i = \frac{n_i}{N\Delta_i}$. Often, the same width is used for all bins, $\Delta_i = \Delta$. $\Delta$ acts as a smoothing parameter. In a D-dimensional space, using M bins in each dimension will require $M^D$ bins!
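A short sketch of my own of the histogram estimate $p_i = n_i/(N\Delta)$ on [0, 1); the data and bin count are arbitrary choices:

```python
import numpy as np

# Histogram density estimate with equal bin width Delta = 1/M on [0, 1);
# Delta acts as the smoothing parameter.
rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=2000)                 # samples lie in [0, 1)

M = 20                                        # number of bins
delta = 1.0 / M
counts, edges = np.histogram(x, bins=M, range=(0.0, 1.0))
p_hat = counts / (len(x) * delta)             # piecewise-constant density estimate
print(p_hat.sum() * delta)                    # integrates to 1
```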

Nonparametric Methods (3) Assume observations are drawn from a density $p(x)$ and consider a small region $R$ containing $x$, with probability mass $P = \int_R p(x)\,dx$. The probability that $K$ out of $N$ observations lie inside $R$ is $\mathrm{Bin}(K \mid N, P)$, so for large $N$, $K \simeq NP$. If the volume $V$ of $R$ is sufficiently small, $p(x)$ is approximately constant over $R$ and $P \simeq p(x)V$. Thus $p(x) \simeq \frac{K}{NV}$, which requires $V$ small yet $K > 0$, and therefore $N$ large.

Nonparametric Methods (4) Kernel Density Estimation: fix $V$, estimate $K$ from the data. Let $R$ be a hypercube centred on $x$ and define the kernel function (Parzen window) $k(\mathbf{u}) = 1$ if $|u_i| \leq 1/2$ for all $i = 1, \ldots, D$, and $0$ otherwise. It follows that $K = \sum_{n=1}^{N} k\bigl(\frac{x - x_n}{h}\bigr)$ and hence $p(x) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{h^D}\, k\bigl(\frac{x - x_n}{h}\bigr)$.

Nonparametric Methods (5) To avoid discontinuities in $p(x)$, use a smooth kernel, e.g. a Gaussian: $p(x) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{(2\pi h^2)^{D/2}}\exp\Bigl\{-\frac{\|x - x_n\|^2}{2h^2}\Bigr\}$. Any kernel such that $k(u) \geq 0$ and $\int k(u)\,du = 1$ will work. $h$ acts as a smoothing parameter.
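A minimal 1-D Gaussian kernel density estimator of my own (the function name and toy data are assumptions, not from the lecture):

```python
import numpy as np

def gaussian_kde(x_query, data, h):
    """Estimate p(x) at each query point using a Gaussian kernel of bandwidth h."""
    # Pairwise squared distances between query points and data points
    sq = (x_query[:, None] - data[None, :]) ** 2
    kernels = np.exp(-sq / (2 * h ** 2)) / np.sqrt(2 * np.pi * h ** 2)
    return kernels.mean(axis=1)          # average over the N data points

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(1, 1.0, 300)])
xs = np.linspace(-4, 4, 9)
print(gaussian_kde(xs, data, h=0.3))     # smaller h -> noisier, larger h -> smoother
```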

Nonparametric Methods (6) Nearest Neighbour Density Estimation: fix $K$, estimate $V$ from the data. Consider a hypersphere centred on $x$ and let it grow to a volume $V^*$ that includes $K$ of the given $N$ data points. Then $p(x) \simeq \frac{K}{N V^*}$. $K$ acts as a smoothing parameter.
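A sketch of the idea in 1-D, where the "volume" is just the length of the interval reaching the K-th nearest neighbour (my own illustration):

```python
import numpy as np

def knn_density(x_query, data, K):
    """K-nearest-neighbour density estimate p(x) ~ K / (N * V*) in 1-D."""
    N = len(data)
    dists = np.sort(np.abs(data[None, :] - x_query[:, None]), axis=1)
    radius = dists[:, K - 1]              # distance to the K-th nearest neighbour
    volume = 2 * radius                   # length of the 1-D "ball" of that radius
    return K / (N * volume)

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)
print(knn_density(np.array([0.0, 1.0, 3.0]), data, K=30))
```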

Nonparametric Methods (7) Nonparametric models (other than histograms) require storing and computing with the entire data set. Parametric models, once fitted, are much more efficient in terms of storage and computation.

K-Nearest-Neighbours for Classification (1) Given a data set with $N_k$ data points from class $C_k$ (so that $\sum_k N_k = N$), draw a sphere around $x$ containing $K$ points, $K_k$ of them from class $C_k$. We have $p(x \mid C_k) = \frac{K_k}{N_k V}$ and correspondingly $p(x) = \frac{K}{NV}$. Since $p(C_k) = \frac{N_k}{N}$, Bayes' theorem gives $p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{p(x)} = \frac{K_k}{K}$.
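A bare-bones K-nearest-neighbour classifier of my own that assigns $x$ to the class with the largest $K_k/K$ among its K nearest training points (names and toy data are made up):

```python
import numpy as np

def knn_classify(x_query, X_train, y_train, K=3):
    """Return the majority class among the K nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = y_train[np.argsort(dists)[:K]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

rng = np.random.default_rng(0)
X0 = rng.normal([-1, -1], 0.7, size=(50, 2))   # class 0
X1 = rng.normal([+1, +1], 0.7, size=(50, 2))   # class 1
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 50 + [1] * 50)

print(knn_classify(np.array([0.5, 0.8]), X_train, y_train, K=3))
```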

K-Nearest-Neighbours for Classification (2) Examples with K = 3 and K = 1.

K-Nearest-Neighbours for Classification (3) $K$ acts as a smoothing parameter. For $N \to \infty$, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error (obtained from the true conditional class distributions).

OLD

Bayesian Inference for the Gaussian (6) Now assume $\mu$ is known. The likelihood function for $\lambda = 1/\sigma^2$ is given by $p(\mathbf{x} \mid \lambda) = \prod_{n=1}^{N}\mathcal{N}(x_n \mid \mu, \lambda^{-1}) \propto \lambda^{N/2}\exp\Bigl\{-\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\Bigr\}$. This has a Gamma shape as a function of $\lambda$. The Gamma distribution: $\mathrm{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)} b^a \lambda^{a-1} e^{-b\lambda}$.

Bayesian Inference for the Gaussian (8) Now we combine a Gamma prior, $\mathrm{Gam}(\lambda \mid a_0, b_0)$, with the likelihood function for $\lambda$ to obtain $p(\lambda \mid \mathbf{x}) \propto \lambda^{a_0 - 1}\lambda^{N/2}\exp\Bigl\{-b_0\lambda - \frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\Bigr\}$, which we recognize as $\mathrm{Gam}(\lambda \mid a_N, b_N)$ with $a_N = a_0 + \frac{N}{2}$ and $b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2 = b_0 + \frac{N}{2}\sigma_{\mathrm{ML}}^2$.
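A tiny numeric sketch of my own of this conjugate update; the prior hyperparameters and data are arbitrary:

```python
import numpy as np

# Conjugate Gamma update for the precision of a Gaussian with known mean:
# a_N = a_0 + N/2, b_N = b_0 + 0.5 * sum (x_n - mu)^2.
rng = np.random.default_rng(0)
mu = 0.0
x = rng.normal(mu, 0.5, size=50)              # true precision lambda = 4

a0, b0 = 1.0, 1.0                             # Gamma prior over lambda
aN = a0 + len(x) / 2
bN = b0 + 0.5 * np.sum((x - mu) ** 2)
print(aN / bN)                                # posterior mean of lambda, near 4
```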

Bayesian Inference for the Gaussian (9) If both $\mu$ and $\lambda$ are unknown, the joint likelihood function is given by $p(\mathbf{x} \mid \mu, \lambda) = \prod_{n=1}^{N}\Bigl(\frac{\lambda}{2\pi}\Bigr)^{1/2}\exp\Bigl\{-\frac{\lambda}{2}(x_n - \mu)^2\Bigr\}$. We need a prior with the same functional dependence on $\mu$ and $\lambda$.

Bayesian Inference for the Gaussian (10) The Gaussian-gamma distribution $p(\mu, \lambda) = \mathcal{N}\bigl(\mu \mid \mu_0, (\beta\lambda)^{-1}\bigr)\,\mathrm{Gam}(\lambda \mid a, b)$: quadratic in $\mu$ and linear in $\lambda$ in the exponent; a Gamma distribution over $\lambda$, independent of $\mu$. Example plot: $\mu_0 = 0$, $\beta = 2$, $a = 5$, $b = 6$.

Bayesian Inference for the Gaussian (12) Multivariate conjugate priors: $\mu$ unknown, $\Lambda$ known: $p(\mu)$ Gaussian. $\Lambda$ unknown, $\mu$ known: $p(\Lambda)$ Wishart. $\Lambda$ and $\mu$ unknown: $p(\mu, \Lambda)$ Gaussian-Wishart.

Partitioned Gaussian Distributions

Maximum Likelihood for the Gaussian (3) Under the true distribution, $\mathbb{E}[\mu_{\mathrm{ML}}] = \mu$ but $\mathbb{E}[\Sigma_{\mathrm{ML}}] = \frac{N-1}{N}\Sigma$, so $\Sigma_{\mathrm{ML}}$ is biased. Hence define the unbiased estimator $\widetilde{\Sigma} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^T$.

Moments of the Multivariate Gaussian (1) $\mathbb{E}[x] = \mu$ (the term linear in $z = x - \mu$ vanishes thanks to the anti-symmetry of $z$), and $\mathbb{E}[x x^T] = \mu\mu^T + \Sigma$.