Machine Learning using Bayesian Approaches

Machine Learning using Bayesian Approaches
Sargur N. Srihari
University at Buffalo, State University of New York

Outline
1. Progress in ML and PR
2. Fully Bayesian Approach
   1. Probability theory: Bayes theorem, conjugate distributions
   2. Making predictions using marginalization
   3. Role of sampling methods
   4. Generalization of neural networks, SVMs
3. Biometric/Forensic Application
4. Concluding Remarks

Example of Machine Learning: Handwritten Digit Recognition
- Wide variability of the same numeral
- Handcrafted rules will result in a large number of rules and exceptions
- Better to have a machine that learns from a large training set
- Provides a probabilistic prediction

Progress in PRML Theory
- A unified theory involving regression and classification: linear models, neural networks, SVMs
- Basis vectors: polynomials, non-linear functions of weighted inputs, functions centered around samples, etc.
- The fully Bayesian approach
- Models from statistical physics

The Fully Bayesian Approach
- Bayes rule
- Bayesian probabilities
- Concept of conjugacy
- Monte Carlo sampling

Rules of Probability
Given random variables X and Y:
- Sum rule (gives the marginal probability): p(X) = Σ_Y p(X, Y)
- Product rule (joint probability in terms of conditional and marginal): p(X, Y) = p(Y|X) p(X)
- Combining the two gives Bayes rule: p(Y|X) = p(X|Y) p(Y) / p(X), where p(X) = Σ_Y p(X|Y) p(Y)
- Viewed as: posterior ∝ likelihood × prior
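As a concrete check of these rules, the sketch below applies the sum rule, product rule, and Bayes rule to a small discrete joint distribution; the 2x3 table of joint probabilities is made up purely for illustration.

```python
import numpy as np

# Made-up joint distribution p(X, Y) over X in {0, 1} and Y in {0, 1, 2}
p_xy = np.array([[0.10, 0.20, 0.10],    # row X = 0
                 [0.25, 0.05, 0.30]])   # row X = 1

# Sum rule: marginals are obtained by summing the joint over the other variable
p_x = p_xy.sum(axis=1)                  # p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=0)                  # p(Y) = sum_X p(X, Y)

# Product rule: p(X, Y) = p(Y | X) p(X)
p_y_given_x = p_xy / p_x[:, None]
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

# Bayes rule: p(X | Y) = p(Y | X) p(X) / p(Y), i.e. posterior ∝ likelihood × prior
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]
assert np.allclose(p_x_given_y.sum(axis=0), 1.0)    # each posterior sums to 1
print(p_x_given_y)
```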

Bayesian Probabilities
- Classical or frequentist view: probability is the frequency of a random, repeatable event
- Bayesian view: probability is a quantification of uncertainty
  - Whether Shakespeare's plays were written by Francis Bacon
  - Whether the moon was once in its own orbit around the sun
- The use of probability to represent uncertainty is not an ad-hoc choice
- Cox's theorem: if numerical values represent degrees of belief, then a simple set of axioms for manipulating degrees of belief leads to the sum and product rules of probability

Role of the Likelihood Function
- Uncertainty in w is expressed as p(w|D) = p(D|w) p(w) / p(D), where p(D) = ∫ p(D|w) p(w) dw by the sum rule
- The denominator is a normalization factor and involves marginalization over w
- Bayes theorem in words: posterior ∝ likelihood × prior
- The likelihood function is used in both the Bayesian and frequentist paradigms
  - Frequentist: w is a fixed parameter determined by an estimator; error bars on the estimate come from the possible data sets D
  - Bayesian: there is a single data set D, and uncertainty is expressed as a probability distribution over w

Bayesian versus Frequentist Approach
- Inclusion of prior knowledge arises naturally
- Coin toss example: suppose we need to learn from coin tosses so that we can predict whether a toss will land heads
  - A fair-looking coin is tossed three times and lands heads each time
  - The classical m.l.e. of the probability of landing heads is 1, implying that all future tosses will land heads
  - A Bayesian approach with a reasonable prior leads to a less extreme conclusion (see the sketch below)
  - The Beta distribution has the property of conjugacy: the prior p(μ) and the posterior p(μ|D) have the same functional form
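A minimal numerical sketch of this contrast, assuming three heads in three tosses and a Beta(2, 2) prior (the prior used in the next slide's illustration):

```python
# Three heads in three tosses: the m.l.e. of p(heads) is 1, while a mild
# Beta(2, 2) prior pulls the Bayesian estimate toward a less extreme value.
N, m = 3, 3                          # tosses and observed heads
mle = m / N                          # classical maximum-likelihood estimate: 1.0

a, b = 2.0, 2.0                      # Beta prior hyperparameters (assumed)
post_a, post_b = a + m, b + (N - m)  # conjugate update
posterior_mean = post_a / (post_a + post_b)

print(f"MLE: {mle:.2f}  Bayesian posterior mean: {posterior_mean:.2f}")  # 1.00 vs 0.71
```

The posterior mean stays below 1 because the prior contributes fictitious prior observations of both outcomes.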

Bayesian Inference with the Beta Distribution
- The posterior is obtained by multiplying the Beta prior by the binomial likelihood (N choose m) μ^m (1-μ)^(N-m), yielding p(μ|m, l, a, b) ∝ μ^(m+a-1) (1-μ)^(l+b-1), where m is the number of heads and l = N - m is the number of tails
- This is another Beta distribution: the observations effectively increase the value of a by m and of b by l
- As the number of observations increases, the distribution becomes more peaked
- [Figure: one step of the process: prior Beta with a=2, b=2; a single observation N=m=1 (x=1) with likelihood μ^1(1-μ)^0; posterior Beta with a=3, b=2]
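The same update can be run sequentially, one Bernoulli observation at a time; in the sketch below the Beta(2, 2) prior matches the slide's illustration, while the toss sequence is made up.

```python
from scipy.stats import beta

a, b = 2.0, 2.0                      # prior Beta(a=2, b=2), as in the illustration
observations = [1, 0, 1, 1, 0, 1]    # made-up coin tosses (1 = head, 0 = tail)

for x in observations:
    a += x                           # each head effectively increases a by 1
    b += 1 - x                       # each tail effectively increases b by 1
    print(f"Beta({a:.0f}, {b:.0f}): mean = {beta.mean(a, b):.3f}, "
          f"var = {beta.var(a, b):.4f}")
# The posterior variance shrinks (on average) as observations accumulate,
# i.e. the distribution becomes more peaked.
```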

Predicting the Next Trial Outcome
- We need the predictive distribution of x given the observed data D
- From the sum and product rules:
  p(x=1|D) = ∫_0^1 p(x=1, μ|D) dμ = ∫_0^1 p(x=1|μ) p(μ|D) dμ = ∫_0^1 μ p(μ|D) dμ = E[μ|D]
- The expected value of the posterior can be shown to be p(x=1|D) = (m+a)/(m+a+l+b), which is the fraction of observations (both fictitious and real) that correspond to x=1
- Maximum likelihood and Bayesian results agree in the limit of infinitely many observations
- On average, the uncertainty (variance) decreases as data are observed
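A one-line check of the predictive formula, reusing the coin example above (3 heads in 3 tosses and an assumed Beta(2, 2) prior):

```python
# Predictive probability that the next toss lands heads,
# p(x=1 | D) = E[mu | D] = (m + a) / (m + a + l + b).
m, l = 3, 0            # observed heads and tails
a, b = 2.0, 2.0        # assumed prior hyperparameters
print((m + a) / (m + a + l + b))    # 0.714..., versus the m.l.e. of 1.0
```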

Bayesian Regression (Curve Fitting)
- Estimating the polynomial weight vector w
- In the fully Bayesian approach we consistently apply the sum and product rules and integrate over all values of w
- Given training data x and t and a new test point x, the goal is to predict the value of t, i.e., to evaluate the predictive distribution p(t|x, x, t)
- Applying the sum and product rules of probability:
  p(t|x, x, t) = ∫ p(t, w|x, x, t) dw                (sum rule: marginalizing over w)
               = ∫ p(t|x, w, x, t) p(w|x, x, t) dw   (product rule)
               = ∫ p(t|x, w) p(w|x, t) dw            (eliminating unnecessary conditioning variables)
  where p(t|x, w) = N(t | y(x, w), β^-1)
- The posterior distribution over the parameters, p(w|x, t), is also Gaussian
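A minimal sketch of this predictive distribution for polynomial curve fitting with a Gaussian prior over w, using the standard closed-form posterior N(m_N, S_N) (Bishop, Sec. 3.3); the prior precision alpha, noise precision beta, polynomial degree, and toy sinusoidal data are all assumptions made for illustration.

```python
import numpy as np

def poly_features(x, degree):
    return np.vander(x, degree + 1, increasing=True)    # columns 1, x, x^2, ...

alpha, beta_, degree = 2.0, 25.0, 3        # assumed prior and noise precisions
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)   # toy data

Phi = poly_features(x_train, degree)
S_N = np.linalg.inv(alpha * np.eye(degree + 1) + beta_ * Phi.T @ Phi)  # posterior covariance
m_N = beta_ * S_N @ Phi.T @ t_train                                    # posterior mean of w

# Predictive distribution p(t | x, x_train, t_train) = N(t | m_N^T phi(x), sigma_N^2(x))
x_new = np.array([0.5])
phi = poly_features(x_new, degree)
pred_mean = phi @ m_N
pred_var = 1.0 / beta_ + np.sum(phi @ S_N * phi, axis=1)   # noise + parameter uncertainty
print(pred_mean, np.sqrt(pred_var))
```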

Conjugate Distributions
- Discrete, binary:
  - Bernoulli: single binary variable
  - Binomial: N samples of a Bernoulli (N=1 gives the Bernoulli; approaches a Gaussian for large N)
  - Conjugate prior: Beta, a continuous variable on [0, 1]
- Discrete, multi-valued:
  - Multinomial: one of K values, represented as a K-dimensional binary vector
  - Conjugate prior: Dirichlet, K random variables on [0, 1] (K=2 gives the Beta)
- Continuous:
  - Gaussian
  - Student's-t: generalization of the Gaussian, robust to outliers; an infinite mixture of Gaussians
  - Gamma: conjugate prior of the univariate Gaussian precision
  - Gaussian-Gamma: conjugate prior of the univariate Gaussian with unknown mean and precision
  - Wishart: conjugate prior of the multivariate Gaussian precision matrix
  - Gaussian-Wishart: conjugate prior of the multivariate Gaussian with unknown mean and precision matrix
  - Exponential: special case of the Gamma
- Angular: von Mises
- Uniform

Bayesian Approaches in Classical Pattern Recognition
- Bayesian neural networks
  - Classical neural networks use maximum likelihood to determine the network parameters (weights and biases)
  - The Bayesian version marginalizes over the distribution of parameters in order to make predictions
- Bayesian support vector machines

Practicality of the Bayesian Approach
- Marginalization over the whole parameter space is required to make predictions or to compare models
- Factors making it practical:
  - Sampling methods such as Markov chain Monte Carlo
  - Increased speed and memory of computers
  - Deterministic approximation schemes: variational Bayes is an alternative to sampling

Markov Chain Monte Carlo (MCMC)
- Rejection sampling and importance sampling for evaluating expectations of functions suffer from severe limitations, particularly in high dimensions
- MCMC is a very general and powerful framework
  - 'Markov' refers to the sequence of samples, not to the model being Markovian
  - Allows sampling from a large class of distributions
  - Scales well with dimensionality
- MCMC originated in statistical physics (Metropolis, 1949)
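As a concrete illustration, here is a minimal random-walk Metropolis sampler (the simplest MCMC scheme) targeting a one-dimensional standard normal; the target, proposal width, and chain length are assumptions for illustration, not from the talk.

```python
import numpy as np

def log_target(z):
    return -0.5 * z ** 2                 # unnormalized log density of N(0, 1)

rng = np.random.default_rng(0)
z, samples = 0.0, []
for _ in range(10_000):
    z_prop = z + rng.normal(0.0, 1.0)    # symmetric random-walk proposal
    # Metropolis acceptance: accept with probability min(1, p(z')/p(z))
    if np.log(rng.uniform()) < log_target(z_prop) - log_target(z):
        z = z_prop
    samples.append(z)                    # on rejection the current state is repeated

print(np.mean(samples), np.var(samples))  # roughly 0 and 1
```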

Gibbs Sampling
- A simple and widely applicable Markov chain Monte Carlo algorithm; a special case of the Metropolis-Hastings algorithm
- Consider a distribution p(z) = p(z_1, ..., z_M) from which we wish to sample, and choose an initial state for the Markov chain
- Each step replaces the value of one variable z_i by a value drawn from p(z_i | z_{\i}), where z_{\i} denotes z_1, ..., z_M with z_i omitted
- The procedure is repeated by cycling through the variables in some particular order
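A minimal Gibbs-sampling sketch for a two-variable case: the target is a standard bivariate Gaussian with correlation rho = 0.8 (an assumed example), and each step draws one variable from its conditional given the other, exactly as described above.

```python
import numpy as np

rho = 0.8                                 # assumed correlation of the target
cond_std = np.sqrt(1 - rho ** 2)          # std of z_i | z_{\i} for a unit-variance Gaussian
rng = np.random.default_rng(0)

z1, z2, samples = 0.0, 0.0, []            # initial state of the Markov chain
for _ in range(20_000):
    z1 = rng.normal(rho * z2, cond_std)   # replace z1 with a draw from p(z1 | z2)
    z2 = rng.normal(rho * z1, cond_std)   # replace z2 with a draw from p(z2 | z1)
    samples.append((z1, z2))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])       # should be close to rho
```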

Machine Learning in Forensics
- Questioned-versus-known (Q-K) comparison involves significant uncertainty in some modalities, e.g., handwriting
- Learn intra-class variations so as to determine whether a questioned sample is genuine

Genuine and Questioned Distributions
[Figure: known signatures labeled a-d and 1-6 are compared with each other to produce the genuine distribution of similarity scores (e.g., .4, .1, .2, .3, .2, .1); the questioned signature is compared with the knowns to produce the questioned distribution (e.g., 0, .4, .1, .2); the two distributions are then compared]

Methods for Comparing Distributions
- Kolmogorov-Smirnov test
- Kullback-Leibler divergence
- Bayesian formulation

Kolmogorov-Smirnov Test
- Used to determine whether two data sets were drawn from the same distribution
- Compute a statistic D: the maximum absolute difference between the two empirical c.d.f.s
- K-S statistic for two c.d.f.s S_N1(x) and S_N2(x): D = max over -∞ < x < ∞ of |S_N1(x) - S_N2(x)|
- D is mapped to a probability P according to P_KS = Q_KS([√N_e + 0.12 + 0.11/√N_e] D)
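A small sketch that computes the two-sample K-S statistic D directly from the empirical c.d.f.s and cross-checks it against scipy.stats.ks_2samp, which also returns a p-value; the two score samples are made up for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
s1 = rng.normal(0.0, 1.0, 200)       # made-up similarity scores, set 1
s2 = rng.normal(0.3, 1.0, 150)       # made-up similarity scores, set 2

# Empirical c.d.f.s evaluated on the pooled sample
grid = np.sort(np.concatenate([s1, s2]))
cdf1 = np.searchsorted(np.sort(s1), grid, side="right") / len(s1)
cdf2 = np.searchsorted(np.sort(s2), grid, side="right") / len(s2)
D_manual = np.max(np.abs(cdf1 - cdf2))   # K-S statistic D

D_scipy, p_value = ks_2samp(s1, s2)
print(D_manual, D_scipy, p_value)        # the two D values agree
```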

Kullback-Leibler (KL) Divergence
- If an unknown distribution p(x) is modeled by an approximating distribution q(x), the KL divergence is the average additional information required to specify the value of x as a result of using q(x) instead of p(x):
  KL(p||q) = -∫ p(x) ln q(x) dx - (-∫ p(x) ln p(x) dx) = ∫ p(x) ln [p(x)/q(x)] dx

K-L Properties
- A measure of the dissimilarity of p(x) and q(x)
- Non-symmetric: KL(p||q) ≠ KL(q||p)
- KL(p||q) ≥ 0, with equality iff p(x) = q(x) (proof uses convex functions)
- Not a metric
- Convert to a probability using the discretized divergence D_KL = Σ_{b=1..B} P_b log(P_b / Q_b) and P_KL = e^(-ς D_KL)
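A short sketch of the discretized divergence and its conversion to a score; the bin probabilities are made up, and gamma stands in for the slide's constant ς (an assumed value).

```python
import numpy as np

P = np.array([0.4, 0.1, 0.2, 0.3])   # binned genuine-score distribution (made up)
Q = np.array([0.1, 0.4, 0.3, 0.2])   # binned questioned-score distribution (made up)
eps = 1e-12                          # guards against log(0) for empty bins

D_KL = np.sum(P * np.log((P + eps) / (Q + eps)))   # sum_b P_b log(P_b / Q_b)
gamma = 1.0                                        # assumed scaling constant
P_KL = np.exp(-gamma * D_KL)                       # map divergence to a (0, 1] score
print(D_KL, P_KL)
```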

Bayesian Comparison of Distributions
- S_1: samples from one distribution; S_2: samples from another distribution
- Task: determine the statistical similarity between S_1 and S_2
- D_1, D_2: the theoretical distributions from which S_1 and S_2 came
- Determine p(D_1 = D_2) given that we observe only samples from them
- Method: determine p(D_1 = D_2 = a particular distribution f) and marginalize over all f (non-informative prior)
- If f is Gaussian, we obtain a closed-form solution for the marginalization over the family of Gaussians F

Bayesian Comparison of Distributions (derivation)
P(D_1 = D_2 | S_1, S_2)
  = ∫_F P(D_1 = D_2 = f | S_1, S_2) df
        (marginalizing over all f, members of the family F)
  = ∫_F p(S_1, S_2 | D_1 = D_2 = f) p(D_1 = D_2 = f) / p(S_1, S_2) df
        (Bayes rule)
  = ∫_F p(S_1, S_2 | D_1 = D_2 = f) df / ∫_F ∫_F p(S_1 | D_1 = f_1) p(S_2 | D_2 = f_2) df_1 df_2
        (assuming all distributions equally likely, i.e., uniform p(f); marginalizing the denominator over all f; S_1 and S_2 are conditionally independent given f)
  = Q(S_1 ∪ S_2) / [Q(S_1) Q(S_2)],   where Q(S) = ∫_F p(S | f) df

Bayesian Comparison of Distributions (F = Gaussian Family)
- p(D_1 = D_2 | S_1, S_2) = Q(S_1 ∪ S_2) / [Q(S_1) Q(S_2)], where Q(S) = ∫_F p(S | f) df
- Evaluation of Q for the Gaussian family: Q(S) = ∫_{-∞}^{∞} ∫_0^∞ p(S | μ, σ) dσ dμ
- For Gaussians a closed form for Q is obtainable:
  Q(S) = 2^(-3/2) π^((1-n)/2) n^(-1/2) C^(1-n/2) Γ(n/2 - 1),
  where n is the cardinality of S and C = Σ_{x∈S} x² - (Σ_{x∈S} x)²/n
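The sketch below evaluates Q(S) both with the closed form as reconstructed above (its exact constants should be treated as an assumption) and by direct numerical integration of p(S | mu, sigma) over a flat, improper prior, then forms the score Q(S1 ∪ S2) / [Q(S1) Q(S2)]; the integration limits are assumptions, and the sample values reuse the illustrative scores from the earlier slide on genuine and questioned distributions.

```python
import numpy as np
from scipy import integrate
from scipy.special import gamma
from scipy.stats import norm

def Q_closed(S):
    # Closed form (reconstructed): 2^(-3/2) pi^((1-n)/2) n^(-1/2) C^(1-n/2) Gamma(n/2 - 1)
    n = len(S)
    C = np.sum(S ** 2) - np.sum(S) ** 2 / n
    return (2 ** -1.5 * np.pi ** ((1 - n) / 2) * n ** -0.5
            * C ** (1 - n / 2) * gamma(n / 2 - 1))

def Q_numeric(S):
    # Direct integration of the Gaussian likelihood over mu and sigma
    # (flat prior; finite assumed limits wide enough to capture the mass)
    def integrand(sigma, mu):
        return np.prod(norm.pdf(S, loc=mu, scale=sigma))
    val, _ = integrate.dblquad(integrand, S.min() - 2.0, S.max() + 2.0,
                               lambda mu: 1e-6, lambda mu: 5.0)
    return val

S1 = np.array([0.4, 0.1, 0.2, 0.3, 0.2, 0.1])   # genuine scores (illustrative, from the earlier slide)
S2 = np.array([0.0, 0.4, 0.1, 0.2])             # questioned scores (illustrative)

print(Q_closed(S1), Q_numeric(S1))              # the two evaluations should roughly agree
score = Q_closed(np.concatenate([S1, S2])) / (Q_closed(S1) * Q_closed(S2))
print(score)    # larger when S1 and S2 plausibly share one underlying Gaussian
```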

Signature Data Set
- 55 individuals
- [Figure: sample genuine signatures and forgeries for one person]

Feature Computation
- [Figure: feature vector and similarity; correlation-similarity measure]

Bayesian Comparison
- [Figure: score distributions for genuines and forgeries show a clear separation]

Information Theoretic (KS+KL)

Graph Matching

Comparison of KS and KL (* threshold methods; ** information-theoretic)

  Knowns    KS        KL        KS & KL
  4         76.38%    75.10%    79.50%
  7         80.10%    79.50%    81.0%
  10        83.53%    84.27%    86.78%

* S. N. Srihari, A. Xu and M. K. Kalera, "Learning strategies and classification methods for off-line signature verification," Proc. of the 7th Int. Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 161-166, 2004.
** Harish Srinivasan, Sargur N. Srihari and Matthew J. Beal, "Machine learning for person identification and verification," in Machine Learning in Document Analysis and Recognition (S. Marinai and H. Fujisawa, eds.), Springer, 2008.

Error Rates of Three Methods (CEDAR Data Set)
[Figure: percentage error (roughly 5-30%) versus number of training samples (5-20) for the K-L, K-S, KS+KL and Bayesian methods]

Why Does the Bayesian Formalism Do Much Better?
- Few samples are available for determining the distributions, so significant uncertainty exists
- The Bayesian approach models this uncertainty better than the information-theoretic methods, which assume that the distributions are fixed

Summary
- ML is a systematic approach, rather than ad-hoc methods, for designing information processing systems
- There has been much progress in ML and PR toward a unified theory of different machine learning approaches, e.g., neural networks, SVMs
- The fully Bayesian approach involves making predictions with complex distributions, made feasible by sampling methods
- The Bayesian approach leads to higher-performing systems with small sample sizes

Lecture Slides
PRML lecture slides are at http://www.cedar.buffalo.edu/~srihari/cse574/index.html