An Adaptive Bayesian Network for Low-Level Image Processing
S P Luttrell
Defence Research Agency, Malvern, Worcs, WR14 3PS, UK.
Electronic address: luttrell%uk.mod.hermes@relay.mod.uk

This appeared in Proceedings of the 3rd International Conference on Artificial Neural Networks, Brighton, 61-65, (c) British Crown Copyright 1993/DRA. Published with the permission of the Controller of Her Britannic Majesty's Stationery Office.

I. INTRODUCTION

Probability calculus, based on the axioms of inference, Cox [1], is the only consistent scheme for performing inference; this is also known as Bayesian inference. The objects which this approach manipulates, namely probability density functions (PDFs), may be created in a variety of ways, but the focus of this paper is on the use of adaptive PDF networks. Adaptive mixture distribution (MD) networks are already widely used, Luttrell [2]. In this paper an extension of the standard MD approach is presented; it is called a partitioned mixture distribution (PMD). PMD networks are designed specifically to scale sensibly to high-dimensional problems, such as image processing. Several numerical simulations are performed which demonstrate that the emergent properties of PMD networks are similar to those of biological low-level vision processing systems.

II. THEORY

In this section the use of PDFs as a vehicle for solving inference problems is discussed, and the standard theory of MDs is summarised. The new theory of PMDs is then presented.

A. The Bayesian Approach

The axioms of inference, Cox [1], lead to the use of probabilities as the unique scheme for performing consistent inference. This approach has three essential stages: (i) choose a state space x, (ii) assign a joint PDF Q(x), and (iii) draw inferences by computing conditional PDFs. This is generally called the Bayesian approach. Usually, x is split into two or more subspaces in order to distinguish between externally accessible components (e.g. data) and inaccessible components (e.g. model parameters).

To illustrate the use of the Bayesian approach, the joint PDF Q(x_+, x_-, s) (where x_- is the past data, x_+ is the future data, and s are the model parameters) allows predictions to be made by computing the conditional PDF of the future given the past.

    Q(x_+ | x_-) = \int ds \, Q(x_+ | s) \, Q(s | x_-)    (2.1)

If Q(s | x_-) has a single sharp peak (as a function of s) then the integration can be approximated as follows.

    Q(x_+ | x_-) \approx Q(x_+ | s(x_-)),  where
    s(x_-) = \arg\max_s Q(x_- | s)   (maximum likelihood), or
    s(x_-) = \arg\max_s Q(s | x_-)   (maximum posterior)    (2.2)

This approximation is accurate only in situations where the integration over model parameters is dominated by contributions in the vicinity of s = s(x_-).

The above approximations can be interpreted in neural network language as follows. The parameters s = s(x_-) correspond to the neural network weights (or whatever) that emerge from the training process. Subsequent computation of Q(x | s) corresponds to testing the neural network in the field. This analogy holds for both supervised and unsupervised networks, although the details of the training algorithm and the correspondence between (x, s) and (network inputs/outputs, network weights) are different in each case. This correspondence between PDF models and neural networks will be tacitly assumed throughout this paper.

B. The Mixture Distribution

An MD can be written as (omitting s, for clarity)

    Q(x) = \sum_{c=1}^{n} Q(x, c) = \sum_{c=1}^{n} Q(x | c) \, Q(c)    (2.3)

An additional variable c is introduced, and the required PDF Q(x) is recovered by averaging the joint PDF Q(x, c) over the possible states of c.
c can be interpreted as a class label, in which case the Q(x | c) are class likelihoods, and the Q(c) are class prior probabilities. In Figure 1 an MD is drawn as if it were a neural network. It is a type of radial basis function network in which the input layer is transformed through a set of nonlinear functions Q(x | c) (which do not necessarily have a purely radial dependence), and then transformed through a linear summation operation.

Figure 1: Mixture distribution network.
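To make the mixture-distribution network of Figure 1 concrete, the following minimal sketch (an illustration of mine, not code from the paper) evaluates Q(x) = \sum_c Q(x|c) Q(c) for isotropic Gaussian class likelihoods; the component means, the shared variance and the priors are placeholder values.

```python
import numpy as np

def gaussian_likelihood(x, mean, var):
    """Isotropic Gaussian class likelihood Q(x|c)."""
    d = x.size
    diff = x - mean
    return (2.0 * np.pi * var) ** (-0.5 * d) * np.exp(-0.5 * diff @ diff / var)

def mixture_density(x, means, priors, var=0.5):
    """Mixture distribution Q(x) = sum_c Q(x|c) Q(c), cf. Eq. (2.3)."""
    likelihoods = np.array([gaussian_likelihood(x, m, var) for m in means])
    return priors @ likelihoods

# Illustrative example: three components in a 2-dimensional state space.
means = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
priors = np.array([0.5, 0.3, 0.2])          # class priors Q(c), summing to 1
x = np.array([0.2, 0.1])
print(mixture_density(x, means, priors))    # Q(x)
```

Read as a network, the first stage computes the nonlinear functions Q(x|c) and the second stage is the linear summation weighted by the Q(c).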
C. The Partitioned Mixture Distribution

A useful extension of the standard MD is a set of coupled MDs, called a partitioned mixture distribution (PMD), Luttrell [3]. The basic idea is to retain the additional variable c, and to construct a large number of MD models, each of which uses only a subset of the states that are available to c, and each of which sees only a subspace of the input data x. For simplicity, the theory of a 1-dimensional PMD is presented; the extension to higher dimensions or alternative topologies is obvious. Assume that x is N-dimensional, c has N possible states, and each MD uses only n of these states (n <= N, n is odd-valued). Each MD is then defined as (assuming circular boundary conditions, so that the class label c is modulo N)

    Q_c(x) = \frac{\sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(x | c') \, Q(c')}{\sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c')},   c = 1, 2, ..., N    (2.4)

where \lfloor n/2 \rfloor denotes the integer part of n/2. The set of all N MDs is the PMD. The summations used in this definition of Q_c(x) can be generalised to any symmetric weighted summation.

Figure 2: Partitioned mixture distribution network.

The network representation of a PMD is shown schematically in Figure 2, where multiple overlapping MDs are computed. The phrase partitioned mixture distribution derives from the fact that the state space of the class label c is partitioned into overlapping sets of n states. The network in Figure 2 can be extended laterally without encountering any scaling problems because its connectivity is local. However, it is not able to capture long-range statistical properties of the input vector x (e.g. the correlation between pixels on opposite sides of an image) because no partition simultaneously sees information from widely separated sources. A multilayer PMD network based on the Adaptive Cluster Expansion approach, Luttrell [4, 5], is needed to remove this limitation.

D. Optimising the Network Parameters

The PMD network is optimised by maximising the geometric mean of the likelihoods of the N MD models.

    s(x_-) = \arg\max_s \prod_{c=1}^{N} Q_c(x_- | s)    (2.5)

In the limit of a large training set size this is equivalent to maximising the average logarithmic likelihood.

    s(x_-) = \arg\max_s \int dx \, P(x) \sum_{c=1}^{N} \log Q_c(x | s)    (2.6)

The average over training data has been replaced by a shorthand notation in which the PDF P(x) is used to represent the density of training vectors, as follows

    \frac{1}{T} \sum_{t=1}^{T} (\cdots)\big|_{x=x_t}  \longrightarrow  \int dx \, P(x) \, (\cdots)    (2.7)

The posterior probability for model c is (omitting the s variables)

    Q_c(c' | x) = \frac{Q(x, c')}{\sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(x, c'')}   for |c' - c| \le \lfloor n/2 \rfloor,
    Q_c(c' | x) = 0                                                                                    for |c' - c| > \lfloor n/2 \rfloor    (2.8)

which satisfies the normalisation condition \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x) = 1. The sum over model index of this posterior probability is

    \tilde{Q}(c | x) \equiv \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_{c'}(c | x) = \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} \frac{Q(x, c)}{\sum_{c''=c'-\lfloor n/2 \rfloor}^{c'+\lfloor n/2 \rfloor} Q(x, c'')}    (2.9)

and a similar expression for the sum of the prior probabilities is

    \tilde{Q}(c) \equiv \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} \frac{Q(c)}{\sum_{c''=c'-\lfloor n/2 \rfloor}^{c'+\lfloor n/2 \rfloor} Q(c'')}    (2.10)

These definitions satisfy the normalisation conditions \sum_{c=1}^{N} \tilde{Q}(c | x) = \sum_{c=1}^{N} \tilde{Q}(c) = N. Normalisation to N, rather than 1, is used to ensure that \tilde{Q}(c | x) and \tilde{Q}(c) do not depend on the dimensionality N of the input data. Note that this tilde-transform can be used heuristically to normalise any array of numbers.
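The 1-dimensional PMD of Eq. (2.4) and the summed posterior ("tilde-transform") of Eqs. (2.8)-(2.9) can be sketched numerically as follows. This is my own illustration under the assumptions of isotropic Gaussian components and circular boundary conditions (class labels taken modulo N); the parameter values are placeholders, not values from the paper.

```python
import numpy as np

def joint(x, means, priors, var=0.5):
    """Q(x, c) = Q(x|c) Q(c) for isotropic Gaussian components."""
    d = x.size
    diff = x[None, :] - means                    # shape (N, d)
    lik = (2 * np.pi * var) ** (-0.5 * d) * np.exp(-0.5 * (diff ** 2).sum(1) / var)
    return lik * priors

def pmd_densities(x, means, priors, n, var=0.5):
    """Partition densities Q_c(x), c = 1..N (Eq. 2.4); n odd, window taken modulo N."""
    N = len(priors)
    q = joint(x, means, priors, var)
    half = n // 2
    Qc = np.empty(N)
    for c in range(N):
        idx = [(c + k) % N for k in range(-half, half + 1)]   # circular window of n states
        Qc[c] = q[idx].sum() / priors[idx].sum()
    return Qc

def tilde_posterior(x, means, priors, n, var=0.5):
    """Summed posterior tilde-Q(c|x) of Eq. (2.9); its N entries sum to N."""
    N = len(priors)
    q = joint(x, means, priors, var)
    half = n // 2
    post = np.zeros(N)
    for c in range(N):                                        # loop over the N MD models
        idx = [(c + k) % N for k in range(-half, half + 1)]
        post[idx] += q[idx] / q[idx].sum()                    # Q_c(c'|x) for c' in the window
    return post

rng = np.random.default_rng(0)
N, n, d = 8, 3, 2
means, priors = rng.normal(size=(N, d)), np.full(N, 1.0 / N)
x = rng.normal(size=d)
print(pmd_densities(x, means, priors, n))
print(tilde_posterior(x, means, priors, n).sum())             # approximately N
```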
For clarity, two alternative prescriptions for optimising s are now presented without proof. The detailed derivations of these two prescriptions can be found in Appendix A.

1. Re-estimation prescription:

    \hat{s} = \arg\max_{s'} \int dx \, P(x) \sum_{c=1}^{N} \left[ \tilde{Q}(c | x, s) \log Q(x, c | s') - \log \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c' | s') \right]    (2.11)

2. Gradient ascent prescription:

    \Delta s \propto \int dx \, P(x) \sum_{c=1}^{N} \tilde{Q}(c | x, s) \frac{\partial \log Q(x, c | s)}{\partial s} - \sum_{c=1}^{N} \tilde{Q}(c | s) \frac{\partial \log Q(c | s)}{\partial s}    (2.12)

In both of these prescriptions, the second term derives from the denominator of the PMD expression, and in the case n = N the standard MD prescriptions emerge.

E. The Gaussian Partitioned Mixture Distribution

A popular model to use for the Q(x | c) is a set of Gaussian PDFs with means m(c) and covariances A(c). The prior probabilities Q(c) are their own parameters (i.e. non-parametric). Define some average moments as

    M_0(c) \equiv \int dx \, P(x) \, \tilde{Q}(c | x),
    M_1(c) \equiv \int dx \, P(x) \, \tilde{Q}(c | x) \, x,
    M_2(c) \equiv \int dx \, P(x) \, \tilde{Q}(c | x) \, x x^T    (2.13)

The re-estimation equations become

    \hat{Q}(c) = M_0(c),   \hat{m}(c) = \frac{M_1(c)}{M_0(c)},   \hat{A}(c) = \frac{M_2(c)}{M_0(c)} - \hat{m}(c) \, \hat{m}(c)^T    (2.14)

The re-estimation equation for Q(c) is the same whether or not a Gaussian model is assumed, and it must be solved iteratively. The re-estimation equations for m(c) and \hat{A}(c) are straightforward to solve.

The gradient ascent equations become

    \Delta Q(c) \propto \int dx \, P(x) \, \frac{\tilde{Q}(c | x) - \tilde{Q}(c)}{Q(c)},
    \Delta m(c) \propto \int dx \, P(x) \, \tilde{Q}(c | x) \, (x - m(c)),
    \Delta A^{-1}(c) \propto \int dx \, P(x) \, \tilde{Q}(c | x) \left( A(c) - (x - m(c))(x - m(c))^T \right)    (2.15)

These results for a Gaussian PMD reduce to the standard results for the corresponding Gaussian MD when n = N.
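As a concrete reading of Eqs. (2.13)-(2.14), the sketch below performs one batch re-estimation pass for a Gaussian PMD with a fixed isotropic variance (0.5, as in the simulations of Section III): the responsibilities are the summed posteriors \tilde{Q}(c | x_t), the moments M_0 and M_1 are accumulated as sample averages (Eq. 2.7), and the means are re-estimated. This is my own illustration, not the paper's code; in particular the prior update shown is a simple renormalisation, a placeholder for the iterative Q(c) re-estimation mentioned above, and the covariance update is omitted.

```python
import numpy as np

def tilde_responsibilities(X, means, priors, n, var):
    """tilde-Q(c|x_t) of Eq. (2.9) for every training vector x_t (rows of X)."""
    T, d = X.shape
    N = len(priors)
    diff = X[:, None, :] - means[None, :, :]                  # (T, N, d)
    q = priors * (2 * np.pi * var) ** (-0.5 * d) * np.exp(-0.5 * (diff ** 2).sum(-1) / var)
    half = n // 2
    resp = np.zeros((T, N))
    for c in range(N):                                        # one MD model per class label
        idx = [(c + k) % N for k in range(-half, half + 1)]
        resp[:, idx] += q[:, idx] / q[:, idx].sum(1, keepdims=True)
    return resp

def reestimate(X, means, priors, n, var=0.5):
    """One re-estimation pass: moments of Eq. (2.13), updates of Eq. (2.14),
    with the integral over P(x) replaced by a training-set average (Eq. 2.7)."""
    resp = tilde_responsibilities(X, means, priors, n, var)
    M0 = resp.mean(axis=0)                                    # M_0(c)
    M1 = resp.T @ X / len(X)                                  # M_1(c)
    new_means = M1 / M0[:, None]                              # m(c) = M_1(c) / M_0(c)
    new_priors = M0 / M0.sum()                                # placeholder for the iterative Q(c) update
    return new_means, new_priors

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                                 # toy training vectors
means, priors = rng.normal(size=(8, 2)), np.full(8, 1.0 / 8)
for _ in range(10):
    means, priors = reestimate(X, means, priors, n=3)
```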
III. NUMERICAL SIMULATIONS

Numerical simulations were performed, using images as the input data, in order to demonstrate some of the emergent properties of a PMD network. The network architecture that was used in these simulations is shown in Figure 3 (which should be compared with Figure 2).

Figure 3: Partitioned mixture distribution network for image processing.

A. Completeness of Partitioned Mixture Distribution

This simulation was conducted to demonstrate that each partition of the PMD has sufficient resources to construct a full MD within its input window. This is a type of completeness property. The network was a toroidal array, each input window size was 9 x 9, and each mixture window size was 5 x 5. The training data was normalised by applying the tilde-transform discussed earlier, the variance of the Gaussians was fixed at 0.5, and (for convenience only) the Gaussian means became topographically ordered by including Kohonen-style neighbourhood updates.

Figure 4: (a) Example of a training image. The superimposed square indicates the size of an input window. (b) Montage of Gaussian means after training. The superimposed square indicates the size of a mixture window.

Note how a complete repertoire of Gaussian means can be found within each 5 x 5 mixture window, wherever the window is located. The topographic ordering ensures that the Gaussian mean varies smoothly across the montage. Topographic ordering is not necessary for a PMD to function correctly; it is introduced to make the montage easier to interpret visually. However, in a multilayer PMD network (not studied here) topographic ordering is actually needed for non-cosmetic reasons. In more sophisticated simulations (e.g. pairs of input images) the striations that occur in the montage can be directly related to the dominance columns that are observed in the visual cortex.

B. Translation Invariant Partitions

This simulation was conducted to demonstrate that a PMD network can compute MD models in an approximately translation invariant fashion, despite being non-translation invariant at the level of its class probabilities. A toroidal network was used.

Figure 5: Response patterns that occur as the input image is moved upwards past the network. (a) Class probabilities. (b) Mixture probabilities.

In Figure 5a note that the class probabilities fluctuate dramatically as the input data is moved past the network, but in Figure 5b the mixture probabilities move with the data and fluctuate only slightly. This invariance property is important in image processing applications, where all parts of the image are (initially) treated on an equal footing. For obvious reasons, it is convenient to refer to the images in Figure 5 as probability images.
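The window geometry used in these simulations can be made concrete with a short sketch. The code below is my own illustration (the array sizes and the normalisation step are assumptions, not taken from the paper): it extracts a 9 x 9 input window centred on every pixel of a training image with wrap-around (toroidal) boundary conditions, and rescales each window with the n = N special case of the tilde-transform, in the spirit of the normalisation described in Section III A.

```python
import numpy as np

def toroidal_windows(image, size=9):
    """Extract a size x size window centred on every pixel, with wrap-around
    (toroidal) boundary conditions.  Returns an (H, W, size*size) array."""
    H, W = image.shape
    half = size // 2
    padded = np.pad(image, half, mode="wrap")        # circular boundary conditions
    windows = np.empty((H, W, size * size))
    for i in range(H):
        for j in range(W):
            windows[i, j] = padded[i:i + size, j:j + size].ravel()
    return windows

image = np.random.default_rng(2).random((32, 32))    # toy stand-in for a training image
windows = toroidal_windows(image, size=9)            # one 81-dimensional input vector per site
# n = N special case of the tilde-transform: rescale each window so its entries sum to their count.
windows *= windows.shape[-1] / windows.sum(-1, keepdims=True)
print(windows.shape)                                 # (32, 32, 81)
```

Each such 81-dimensional vector would be the input x seen by one partition of the PMD, whose mixture window covers the 5 x 5 neighbourhood of class labels around that site.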
C. Normalised Partitioned Mixture Distribution Inputs

This simulation was conducted to demonstrate what happens when the input to each Gaussian is normalised. This forces the PMD to concentrate on the variation of the input data (rather than its absolute value) within each input window. A toroidal network was used.

Figure 6: (a) Montage of Gaussian means after training with normalised input vectors. (b) Vector field of displacements after training with normalised input vectors.

In Figure 6a the PMD network develops Gaussian means that resemble small blob-like or bar-like features with various displacements and orientations. For convenience, the displacement of the centre of mass of each Gaussian mean (with respect to its central pixel) is shown in Figure 6b. This resembles an orientation map, as observed in the visual cortex. Each partition of the PMD contains the full repertoire of Gaussian means that is needed to construct an MD using normalised inputs.

IV. CONCLUSIONS

Partitioned mixture distribution networks are a scalable generalisation of standard mixture distributions. Ultimately, as in all Bayesian networks, the power of these networks derives from the use of probabilities as the basic computational objects. The results of the numerical simulations demonstrate that PMD networks have many desirable properties in common with the visual cortex. Multilayer versions of PMD networks, Luttrell [4, 5], have the potential to extend this correspondence. PMD networks have a structure that is amenable to hardware implementation. This opens up the possibility of constructing a fast low-level vision engine based entirely on rigorous Bayesian principles.

Appendix A

This appendix contains the derivations of two prescriptions for maximising the average logarithmic likelihood that the set of models contained in a PMD fits the training data. The last line of each of these derivations appears in the main text without proof.
1. Re-estimation Prescription

    L(s'; s) \equiv L(s') - L(s)
             = \int dx \, P(x) \sum_{c=1}^{N} \log \frac{Q_c(x | s')}{Q_c(x | s)}
             = \int dx \, P(x) \sum_{c=1}^{N} \log \left[ \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x, s) \, \frac{Q(x, c' | s')}{Q_c(c' | x, s) \, Q_c(x | s) \sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c'' | s')} \right]
             \ge \int dx \, P(x) \sum_{c=1}^{N} \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x, s) \log \left[ \frac{Q(x, c' | s')}{Q_c(c' | x, s) \, Q_c(x | s) \sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c'' | s')} \right]
             = \int dx \, P(x) \sum_{c=1}^{N} \left[ \tilde{Q}(c | x, s) \log Q(x, c | s') - \log \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c' | s') \right] + \text{a term that is independent of } s'    (A1)

The penultimate step was obtained by using Jensen's inequality for the concave logarithm, and the last step was obtained by using the result \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x) = 1 and the result

    \sum_{c=1}^{N} \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x, s) \, (\cdots)_{c'} = \sum_{c'=1}^{N} \left[ \sum_{c=c'-\lfloor n/2 \rfloor}^{c'+\lfloor n/2 \rfloor} Q_c(c' | x, s) \right] (\cdots)_{c'} = \sum_{c'=1}^{N} \tilde{Q}(c' | x, s) \, (\cdots)_{c'}    (A2)

The maximisation of L(s') can now be replaced by the maximisation of its lower bound L(s'; s) with respect to s', which immediately yields the re-estimation equation.

2. Gradient Ascent Prescription

The gradient ascent prescription can be similarly derived by directly differentiating the logarithmic likelihood with respect to the parameter vector.

    \frac{\partial L(s)}{\partial s} = \frac{\partial}{\partial s} \int dx \, P(x) \sum_{c=1}^{N} \log Q_c(x | s)
             = \frac{\partial}{\partial s} \int dx \, P(x) \sum_{c=1}^{N} \log \frac{\sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(x, c' | s)}{\sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c' | s)}
             = \int dx \, P(x) \sum_{c=1}^{N} \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} \left[ Q_c(c' | x, s) \frac{\partial \log Q(x, c' | s)}{\partial s} - \frac{Q(c' | s)}{\sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c'' | s)} \frac{\partial \log Q(c' | s)}{\partial s} \right]
             = \int dx \, P(x) \sum_{c=1}^{N} \tilde{Q}(c | x, s) \frac{\partial \log Q(x, c | s)}{\partial s} - \sum_{c=1}^{N} \tilde{Q}(c | s) \frac{\partial \log Q(c | s)}{\partial s}    (A3)

As in the re-estimation prescription, the c and c' summations are interchanged to obtain the final result.
[1] R. T. Cox (1946). Probability, frequency and reasonable expectation. Am. J. Phys., 14(1).
[2] S. P. Luttrell (1992). Adaptive Bayesian networks. In Proc. SPIE Conf. on Adaptive Signal Processing (SPIE, Orlando).
[3] S. P. Luttrell (1992). Partitioned mixture distributions: an introduction. DRA, Malvern. Technical Report.
[4] S. P. Luttrell (1990). A trainable texture anomaly detector using the adaptive cluster expansion (ACE) method. RSRE, Malvern. Technical Report.
[5] S. P. Luttrell (1991). A hierarchical network for clutter and texture modelling. In Proc. SPIE Conf. on Adaptive Signal Processing (SPIE, San Diego).