Pattern Recognition and Machine Learning

Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer

Contents

Preface
Mathematical notation
Contents

1 Introduction
  Example: Polynomial Curve Fitting
  Probability Theory
    Probability densities
    Expectations and covariances
    Bayesian probabilities
    The Gaussian distribution
    Curve fitting re-visited
    Bayesian curve fitting
  Model Selection
  The Curse of Dimensionality
  Decision Theory
    Minimizing the misclassification rate
    Minimizing the expected loss
    The reject option
    Inference and decision
    Loss functions for regression
  Information Theory
    Relative entropy and mutual information
  Exercises

2 Probability Distributions
  Binary Variables
    The beta distribution
  Multinomial Variables
    The Dirichlet distribution
  The Gaussian Distribution
    Conditional Gaussian distributions
    Marginal Gaussian distributions
    Bayes' theorem for Gaussian variables
    Maximum likelihood for the Gaussian
    Sequential estimation
    Bayesian inference for the Gaussian
    Student's t-distribution
    Periodic variables
    Mixtures of Gaussians
  The Exponential Family
    Maximum likelihood and sufficient statistics
    Conjugate priors
    Noninformative priors
  Nonparametric Methods
    Kernel density estimators
    Nearest-neighbour methods
  Exercises

3 Linear Models for Regression
  Linear Basis Function Models
    Maximum likelihood and least squares
    Geometry of least squares
    Sequential learning
    Regularized least squares
    Multiple outputs
  The Bias-Variance Decomposition
  Bayesian Linear Regression
    Parameter distribution
    Predictive distribution
    Equivalent kernel
  Bayesian Model Comparison
  The Evidence Approximation
    Evaluation of the evidence function
    Maximizing the evidence function
    Effective number of parameters
  Limitations of Fixed Basis Functions
  Exercises

4 Linear Models for Classification
  Discriminant Functions
    Two classes
    Multiple classes
    Least squares for classification
    Fisher's linear discriminant
    Relation to least squares
    Fisher's discriminant for multiple classes
    The perceptron algorithm
  Probabilistic Generative Models
    Continuous inputs
    Maximum likelihood solution
    Discrete features
    Exponential family
  Probabilistic Discriminative Models
    Fixed basis functions
    Logistic regression
    Iterative reweighted least squares
    Multiclass logistic regression
    Probit regression
    Canonical link functions
  The Laplace Approximation
    Model comparison and BIC
  Bayesian Logistic Regression
    Laplace approximation
    Predictive distribution
  Exercises

5 Neural Networks
  Feed-forward Network Functions
    Weight-space symmetries
  Network Training
    Parameter optimization
    Local quadratic approximation
    Use of gradient information
    Gradient descent optimization
  Error Backpropagation
    Evaluation of error-function derivatives
    A simple example
    Efficiency of backpropagation
    The Jacobian matrix
  The Hessian Matrix
    Diagonal approximation
    Outer product approximation
    Inverse Hessian
    Finite differences
    Exact evaluation of the Hessian
    Fast multiplication by the Hessian
  Regularization in Neural Networks
    Consistent Gaussian priors
    Early stopping
    Invariances
    Tangent propagation
    Training with transformed data
    Convolutional networks
    Soft weight sharing
  Mixture Density Networks
  Bayesian Neural Networks
    Posterior parameter distribution
    Hyperparameter optimization
    Bayesian neural networks for classification
  Exercises

6 Kernel Methods
  Dual Representations
  Constructing Kernels
  Radial Basis Function Networks
    Nadaraya-Watson model
  Gaussian Processes
    Linear regression revisited
    Gaussian processes for regression
    Learning the hyperparameters
    Automatic relevance determination
    Gaussian processes for classification
    Laplace approximation
    Connection to neural networks
  Exercises

7 Sparse Kernel Machines
  Maximum Margin Classifiers
    Overlapping class distributions
    Relation to logistic regression
    Multiclass SVMs
    SVMs for regression
    Computational learning theory
  Relevance Vector Machines
    RVM for regression
    Analysis of sparsity
    RVM for classification
  Exercises

8 Graphical Models
  Bayesian Networks
    Example: Polynomial regression
    Generative models
    Discrete variables
    Linear-Gaussian models
  Conditional Independence
    Three example graphs
    D-separation
  Markov Random Fields
    Conditional independence properties
    Factorization properties
    Illustration: Image de-noising
    Relation to directed graphs
  Inference in Graphical Models
    Inference on a chain
    Trees
    Factor graphs
    The sum-product algorithm
    The max-sum algorithm
    Exact inference in general graphs
    Loopy belief propagation
    Learning the graph structure
  Exercises

9 Mixture Models and EM
  K-means Clustering
    Image segmentation and compression
  Mixtures of Gaussians
    Maximum likelihood
    EM for Gaussian mixtures
  An Alternative View of EM
    Gaussian mixtures revisited
    Relation to K-means
    Mixtures of Bernoulli distributions
    EM for Bayesian linear regression
  The EM Algorithm in General
  Exercises

10 Approximate Inference
  Variational Inference
    Factorized distributions
    Properties of factorized approximations
    Example: The univariate Gaussian
    Model comparison
  Illustration: Variational Mixture of Gaussians
    Variational distribution
    Variational lower bound
    Predictive density
    Determining the number of components
    Induced factorizations
  Variational Linear Regression
    Variational distribution
    Predictive distribution
    Lower bound
  Exponential Family Distributions
    Variational message passing
  Local Variational Methods
  Variational Logistic Regression
    Variational posterior distribution
    Optimizing the variational parameters
    Inference of hyperparameters
  Expectation Propagation
    Example: The clutter problem
    Expectation propagation on graphs
  Exercises

11 Sampling Methods
  Basic Sampling Algorithms
    Standard distributions
    Rejection sampling
    Adaptive rejection sampling
    Importance sampling
    Sampling-importance-resampling
    Sampling and the EM algorithm
  Markov Chain Monte Carlo
    Markov chains
    The Metropolis-Hastings algorithm
  Gibbs Sampling
  Slice Sampling
  The Hybrid Monte Carlo Algorithm
    Dynamical systems
    Hybrid Monte Carlo
  Estimating the Partition Function
  Exercises

12 Continuous Latent Variables
  Principal Component Analysis
    Maximum variance formulation
    Minimum-error formulation
    Applications of PCA
    PCA for high-dimensional data
  12.2 Probabilistic PCA
    Maximum likelihood PCA
    EM algorithm for PCA
    Bayesian PCA
    Factor analysis
  Kernel PCA
  Nonlinear Latent Variable Models
    Independent component analysis
    Autoassociative neural networks
    Modelling nonlinear manifolds
  Exercises

13 Sequential Data
  Markov Models
  Hidden Markov Models
    Maximum likelihood for the HMM
    The forward-backward algorithm
    The sum-product algorithm for the HMM
    Scaling factors
    The Viterbi algorithm
    Extensions of the hidden Markov model
  Linear Dynamical Systems
    Inference in LDS
    Learning in LDS
    Extensions of LDS
    Particle filters
  Exercises

14 Combining Models
  Bayesian Model Averaging
  Committees
  Boosting
    Minimizing exponential error
    Error functions for boosting
  Tree-based Models
  Conditional Mixture Models
    Mixtures of linear regression models
    Mixtures of logistic models
    Mixtures of experts
  Exercises

Appendix A Data Sets
Appendix B Probability Distributions
Appendix C Properties of Matrices
Appendix D Calculus of Variations
Appendix E Lagrange Multipliers
References
Index