Machine Learning. Probabilistic KNN.


Mark Girolami, girolami@dcs.gla.ac.uk, Department of Computing Science, University of Glasgow. June 21, 2007.

KNN is a remarkably simple algorithm with proven error rates. One drawback is that it is not built on any probabilistic framework: there are no posterior probabilities of class membership, and no way to infer the number of neighbours or the metric parameters probabilistically. Let us try to get around this problem.

The first thing which is needed is a likelihood.

Consider a finite data sample $\{(t_1, \mathbf{x}_1), \ldots, (t_N, \mathbf{x}_N)\}$, where each $t_n \in \{1, \ldots, C\}$ denotes the class label and $\mathbf{x}_n \in \mathbb{R}^D$ is the $D$-dimensional feature vector. The feature space $\mathbb{R}^D$ has an associated metric with parameters $\theta$, denoted $\mathcal{M}_\theta$.

A likelihood can be formed as

$$p(\mathbf{t} \mid \mathbf{X}, \beta, k, \theta, \mathcal{M}) \approx \prod_{n=1}^{N} \frac{\exp\!\big(\tfrac{\beta}{k}\sum_{j \sim_k n}^{\mathcal{M}_\theta} \delta_{t_n t_j}\big)}{\sum_{c=1}^{C} \exp\!\big(\tfrac{\beta}{k}\sum_{j \sim_k n}^{\mathcal{M}_\theta} \delta_{c t_j}\big)}$$

The number of nearest neighbours is $k$ and $\beta$ is a scaling variable. The sum $\sum_{j \sim_k n}^{\mathcal{M}_\theta} \delta_{t_n t_j}$ counts how many of the $k$ nearest neighbours of $\mathbf{x}_n$, measured under the metric $\mathcal{M}_\theta$ within the $N-1$ samples $\mathbf{X}_{-n}$ that remain when $\mathbf{x}_n$ is removed, have the class label value $t_n$, whilst each term in the summation of the denominator counts how many of the $k$ neighbours of $\mathbf{x}_n$ have class label equal to $c$.
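
As an illustration, here is a minimal Python sketch of this approximate likelihood (on the log scale). It assumes a plain Euclidean metric in place of the parameterised metric $\mathcal{M}_\theta$, and the function name and arguments are illustrative rather than taken from the original Matlab/C implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def pknn_log_likelihood(X, t, beta, k, n_classes):
    """Approximate leave-one-out PKNN log-likelihood (sketch).

    X is an (N, D) array of feature vectors, t an integer array of labels
    in {0, ..., n_classes - 1}; the metric is taken to be plain Euclidean.
    """
    dist = cdist(X, X)                      # pairwise distances under the (fixed) metric
    np.fill_diagonal(dist, np.inf)          # exclude x_n itself: leave-one-out
    log_lik = 0.0
    for n in range(X.shape[0]):
        nbrs = np.argsort(dist[n])[:k]      # indices of the k nearest neighbours of x_n
        counts = np.bincount(t[nbrs], minlength=n_classes)   # neighbours in each class
        scores = (beta / k) * counts        # (beta/k) * sum_j delta_{c t_j} for each class c
        # log of exp(scores[t_n]) / sum_c exp(scores[c])
        log_lik += scores[t[n]] - np.log(np.exp(scores).sum())
    return log_lik
```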

The likelihood is formed by a product of terms $p(t_n \mid \mathbf{x}_n, \mathbf{X}_{-n}, \mathbf{t}_{-n}, \beta, k, \theta, \mathcal{M})$. This is a Leave-One-Out (LOO) predictive likelihood, where $\mathbf{t}_{-n}$ denotes the vector $\mathbf{t}$ with the $n$th element removed. The approximate joint likelihood therefore provides an overall measure of the LOO predictive likelihood, and it should exhibit some resilience to overfitting due to its LOO nature.

Posterior inference follows by obtaining the parameter posterior distribution $p(\beta, k, \theta \mid \mathbf{t}, \mathbf{X}, \mathcal{M})$.

Predictions of the target class label $t$ of a new datum $\mathbf{x}$ are made by posterior averaging, such that

$$p(t \mid \mathbf{x}, \mathbf{t}, \mathbf{X}, \mathcal{M}) = \sum_{k} \int p(t \mid \mathbf{x}, \mathbf{t}, \mathbf{X}, \beta, k, \theta, \mathcal{M}) \, p(\beta, k, \theta \mid \mathbf{t}, \mathbf{X}, \mathcal{M}) \, \mathrm{d}\beta \, \mathrm{d}\theta$$

The posterior takes an intractable form, so an MCMC procedure is proposed and the following Monte Carlo estimate is employed:

$$\hat{p}(t \mid \mathbf{x}, \mathbf{t}, \mathbf{X}, \mathcal{M}) = \frac{1}{N_s} \sum_{s=1}^{N_s} p(t \mid \mathbf{x}, \mathbf{t}, \mathbf{X}, \beta^{(s)}, k^{(s)}, \theta^{(s)}, \mathcal{M})$$
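
A small sketch of how this Monte Carlo predictive estimate can be computed from posterior draws, again assuming a Euclidean metric (so $\theta$ is omitted) and reusing the neighbour-counting logic above; `samples` and the function name are hypothetical, not part of the original implementation.

```python
import numpy as np

def pknn_predictive(x_new, X, t, samples, n_classes):
    """Monte Carlo estimate of the PKNN predictive distribution (sketch).

    `samples` is a list of (beta, k) posterior draws; the metric is again
    taken to be plain Euclidean.
    """
    d = np.linalg.norm(X - x_new, axis=1)   # distances from x_new to all training points
    order = np.argsort(d)                   # training points sorted by proximity
    probs = np.zeros(n_classes)
    for beta, k in samples:
        counts = np.bincount(t[order[:k]], minlength=n_classes)  # class counts among k neighbours
        scores = (beta / k) * counts
        p = np.exp(scores - scores.max())   # unnormalised softmax over class scores
        probs += p / p.sum()                # p(t | x, beta^(s), k^(s), ...)
    return probs / len(samples)             # average over the N_s posterior draws
```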

The posterior sampling algorithm is a simple Metropolis algorithm. The priors on $k$ and $\beta$ are assumed uniform over all possible values (integer and real, respectively). The proposal distribution for $\beta_{\mathrm{new}}$ is Gaussian, i.e. $\mathcal{N}(\beta^{(i)}, h)$. The proposal distribution for $k$ is uniform between minimum and maximum values: draw $\mathrm{index} \sim U(0, k_{\mathrm{step}} + 1)$ and set $k_{\mathrm{new}} = k_{\mathrm{old}} + k_{\mathrm{inc}}(\mathrm{index})$.

The new move is accepted using the Metropolis ratio

$$\min\left\{ 1, \; \frac{p(\mathbf{t} \mid \mathbf{X}, \beta_{\mathrm{new}}, k_{\mathrm{new}}, \theta_{\mathrm{new}}, \mathcal{M})}{p(\mathbf{t} \mid \mathbf{X}, \beta, k, \theta, \mathcal{M})} \right\}$$

This builds up a Markov chain whose stationary distribution is $p(\beta, k, \theta \mid \mathbf{t}, \mathbf{X}, \mathcal{M})$. It is a very simple algorithm to implement; Matlab and C implementations are available.
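
A minimal sketch of such a Metropolis sampler over $(\beta, k)$, assuming the `pknn_log_likelihood` function sketched earlier, flat priors, and a symmetric $\pm 1$ step for $k$ in place of the $k_{\mathrm{inc}}$ lookup described above; the step size `h` and the bounds `k_min`, `k_max` are illustrative choices, not values from the slides.

```python
import numpy as np

def metropolis_pknn(X, t, n_classes, n_iter=5000, h=0.5, k_min=1, k_max=50, seed=0):
    """Simple Metropolis sampler for (beta, k) under flat priors (sketch).

    The proposal for beta is N(beta, h); the proposal for k is a symmetric
    +/-1 step, a simplification of the k_inc lookup described in the slides.
    """
    rng = np.random.default_rng(seed)
    beta, k = 1.0, 5                                  # arbitrary starting values
    log_p = pknn_log_likelihood(X, t, beta, k, n_classes)
    samples = []
    for _ in range(n_iter):
        beta_new = rng.normal(beta, h)                # Gaussian random-walk proposal
        k_new = k + rng.choice((-1, 1))               # symmetric discrete proposal
        if k_min <= k_new <= k_max:                   # flat prior support on k
            log_p_new = pknn_log_likelihood(X, t, beta_new, k_new, n_classes)
            # Metropolis ratio min{1, p_new / p_old}, evaluated on the log scale
            if np.log(rng.uniform()) < log_p_new - log_p:
                beta, k, log_p = beta_new, k_new, log_p_new
        samples.append((beta, k))
    return samples
```

Under these simplifying assumptions, feeding the returned draws to the `pknn_predictive` sketch above reproduces the posterior-averaging scheme of the previous slide.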

[Figure: trace plots of the Metropolis sampler for β and k over 5 × 10^4 iterations.]

Figure 1: The top graph shows a histogram of the marginal posterior for K on the synthetic Ripley dataset, and the bottom shows the 10-fold CV error against the value of K.

Figure 2: The percentage test error obtained with training sets of varying size from 25 to 250 data points. For each sub-sample size, 50 random subsets were sampled, and each was used to obtain a KNN and a PKNN classifier, which were then used to make predictions on the 1000 independent test points. The mean percentage error and associated standard error obtained for each training-set size are shown for each classifier.

Data       KNN             PKNN            P-Value
Glass      29.91 ± 9.22    26.67 ± 8.81    0.517
Iris        5.33 ± 5.25     4.00 ± 5.62    0.537
Crabs      15.00 ± 8.82    19.50 ± 6.85    0.240
Pima       27.00 ± 8.88    24.00 ± 14.68   0.645
Soybean    14.50 ± 16.74    4.50 ± 9.56    0.155
Wine        3.922 ± 3.77    3.37 ± 2.89    0.805
Balance    11.52 ± 2.99    10.23 ± 3.02    0.324
Heart      15.18 ± 5.91    15.18 ± 4.43    1.000
Liver      33.60 ± 6.98    36.26 ± 12.93   0.705
Diabetes   25.91 ± 7.15    25.25 ± 8.11    0.970
Vehicle    36.28 ± 5.16    37.22 ± 4.53    0.732

Data         KNN       PKNN
Glass       39.55     243.52
Iris         7.58      91.8
Crabs       21.99     156.30
Pima        24.10     103.60
Soybean      1.16      38.38
Wine        27.9      144.90
Balance    609.86     555.72
Heart       96.11     145.22
Liver      116.71     189.73
Diabetes  1643.09     567.03
Vehicle   4226.69    1063.13

PKNN is a fully Bayesian method for KNN classification. It requires MCMC and is therefore slow. It is possible to learn the metric, though this is computationally demanding. Predictive probabilities are more useful in certain applications, e.g. clinical prediction. On 0-1 loss there is no statistically significant difference between PKNN and KNN with cross-validation.