DIFFUSION NETWORKS, PRODUCT OF EXPERTS, AND FACTOR ANALYSIS

Tim K. Marks & Javier R. Movellan
University of California San Diego
9500 Gilman Drive, La Jolla, CA

ABSTRACT

Hinton (in press) recently proposed a learning algorithm called contrastive divergence learning for a class of probabilistic models called product of experts (PoE). Whereas in standard mixture models the beliefs of individual experts are averaged, in PoEs the beliefs are multiplied together and then renormalized. One advantage of this approach is that the combined beliefs can be much sharper than the individual beliefs of each expert. It has been shown that a restricted version of the Boltzmann machine, in which there are no lateral connections between hidden units or between observation units, is a PoE. In this paper we generalize these results to diffusion networks, a continuous-time, continuous-state version of the Boltzmann machine. We show that when the unit activation functions are linear, this PoE architecture is equivalent to a factor analyzer. This result suggests novel non-linear generalizations of factor analysis and independent component analysis that could be implemented using interactive neural circuitry.

1 INTRODUCTION

Hinton (in press) recently proposed a fast learning algorithm known as contrastive divergence learning for a class of models called product of experts (PoE). In standard mixture models the beliefs of individual experts are averaged, but in PoEs the beliefs are multiplied together and then renormalized, i.e.,

$p(o) = \frac{\prod_k p_k(o)}{\int \prod_k p_k(o')\,do'}$   (1)

where $p_k(o)$ is the probability of the observed vector $o$ under expert $k$, and $p(o)$ is the probability assigned to $o$ when the opinions of all the experts are combined. One advantage of this approach is that the combined beliefs are sharper than the individual beliefs of each expert, potentially avoiding the problems of standard mixture models when applied to high-dimensional problems. It can be shown that a restricted version of the Boltzmann machine, in which hidden units and observable units form a bipartite graph, is in fact a PoE (Hinton, in press). Indeed, Hinton (in press) recently trained these restricted Boltzmann machines (RBMs) using contrastive divergence learning with very good results. One problem with Boltzmann machines is that they use binary-state units, a representation which may not be optimal for continuous data such as images and sounds.

This paper concerns the diffusion network, a continuous-time, continuous-state version of the Boltzmann machine. Like Boltzmann machines, diffusion networks have bidirectional connections, which in this paper we assume to be symmetric. If a diffusion network is given the same connectivity structure as an RBM, the result will also be a PoE model. The main result presented here is that when the activation functions are linear, these restricted diffusion networks model the same class of observable distributions as factor analysis. This is somewhat surprising given the fact that diffusion networks are feedback models while factor analyzers are feedforward models. Most importantly, the result suggests novel non-linear generalizations of factor analysis and independent component analysis that could be implemented using interactive circuitry. We show results of simulations in which we trained restricted diffusion networks using contrastive divergence.
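As a concrete illustration of Equation (1), not part of the original paper, the following minimal sketch combines two one-dimensional Gaussian experts on a discrete grid; all names and parameter values are our own arbitrary choices. The product-and-renormalize step yields a belief that is sharper (lower variance) than either expert alone.

```python
import numpy as np

# Two Gaussian "experts" evaluated on a grid; their pointwise product,
# once renormalized, is the PoE belief of Equation (1).
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

expert1 = gaussian(x, -0.5, 1.5)
expert2 = gaussian(x, +0.5, 1.5)

product = expert1 * expert2
poe = product / (product.sum() * dx)   # renormalize (discretized integral)

# The combined belief is sharper than either expert's.
mean = (x * poe).sum() * dx
var = ((x - mean) ** 2 * poe).sum() * dx
print(f"PoE variance {var:.3f} < expert variance {1.5**2:.3f}")
```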
2 DIFFUSION NETWORKS

A diffusion network is a neural network specified by a stochastic differential equation

$dX(t) = \mu(X(t))\,dt + \sigma\,dB(t)$   (2)

where $X(t)$ is a random vector describing the state of the system at time $t$, and $B$ is a standard Brownian motion process. The term $\sigma$ is a constant, known as the dispersion, that controls the amount of noise in the system. The function $\mu$, known as the drift, represents the deterministic kernel of the system. If the drift is the gradient of an energy function, $\mu(x) = -\nabla H(x)$, and the energy function satisfies certain conditions (Gidas, 1986), then it can be shown that $X(t)$ has a limit probability density

$p(x) \propto e^{-2H(x)/\sigma^2}$

which is independent of the initial conditions.
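As an illustration (ours, not from the paper), the sketch below integrates Equation (2) with the Euler-Maruyama scheme for a gradient drift and checks the long-run samples against the limit density $e^{-2H(x)/\sigma^2}$. The quadratic energy, step size, and burn-in length are arbitrary demo choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic energy H(x) = x^2 / 2, so the drift is mu(x) = -H'(x) = -x.
drift = lambda x: -x

sigma, dt, n_steps = np.sqrt(2.0), 1e-2, 100_000
x, samples = 0.0, []
for t in range(n_steps):
    # Euler-Maruyama update for Equation (2): dX = mu(X) dt + sigma dB
    x += drift(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    if t > 10_000:          # discard burn-in before equilibrium
        samples.append(x)

samples = np.asarray(samples)
# With sigma = sqrt(2), the limit density exp(-2H/sigma^2) = exp(-H)
# is the standard normal; the empirical moments should match.
print(samples.mean(), samples.var())   # close to 0 and 1
```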

2.1 Linear diffusions

If we let $\sigma = \sqrt{2}$ and the energy is a quadratic function, $H(x) = \frac{1}{2}x^T A x$, where $A$ is a symmetric positive definite matrix, then the limit distribution is

$p(x) \propto e^{-\frac{1}{2}x^T A x}$   (3)

This distribution is Gaussian with zero mean and covariance matrix $A^{-1}$. Taking the gradient of the energy function, we get

$\mu(x) = -\nabla H(x) = -Ax$   (4)

and thus the activation dynamics are linear:

$dX(t) = -AX(t)\,dt + \sigma\,dB(t)$   (5)

These dynamics have the following neural network interpretation. Let $A = \mathcal{K} - W$, where $W$ represents a matrix of synaptic conductances and $\mathcal{K}$ is a diagonal matrix of transmembrane conductances $\kappa_j$ (current leakage terms). We then get the following form:

$dX_j(t) = \big(I_j(t) - \kappa_j X_j(t)\big)\,dt + \sigma\,dB_j(t)$   (6)

where

$I_j(t) = \sum_i w_{ji} X_i(t)$   (7)

is interpreted as the net input current to neuron $j$, and $X_j(t)$ as the activation of that neuron.

3 FACTOR ANALYSIS

Factor analysis is a probabilistic model of the form

$O = \Lambda Y + Z$   (8)

where $O$ is an $n_o$-dimensional random vector representing observable data; $Y$ is an $n_h$-dimensional Gaussian vector representing hidden independent sources, with $E(Y) = 0$ and $\mathrm{Cov}(Y) = I$, the identity matrix; and $Z$ is a Gaussian vector, independent of $Y$, representing noise in the observation generation process, with $E(Z) = 0$ and $\mathrm{Cov}(Z) = \Psi$, a diagonal matrix. If $Y$ is logistic instead of Gaussian, then in the limit as $\Psi \to 0$, Equation (8) is the standard model underlying independent component analysis (ICA) (Bell & Sejnowski, 1995; Attias, 1999). Note that in factor analysis, as well as in ICA, the observations are generated in a feedforward manner from the hidden sources and the noise process. It follows from Equation (8) that $O$ and $Y$ have joint covariance

$c = \mathrm{Cov}\begin{pmatrix} O \\ Y \end{pmatrix} = \begin{pmatrix} \Lambda\Lambda^T + \Psi & \Lambda \\ \Lambda^T & I \end{pmatrix}$   (9)

Taking the inverse of this block matrix, we get

$c^{-1} = \begin{pmatrix} \Psi^{-1} & -\Psi^{-1}\Lambda \\ -\Lambda^T\Psi^{-1} & I + \Lambda^T\Psi^{-1}\Lambda \end{pmatrix}$   (10)
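As a quick sanity check (ours, with arbitrary small dimensions and random parameters), the block matrix of Equation (9) and its claimed inverse in Equation (10) can be compared numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n_o, n_h = 4, 2
Lam = rng.standard_normal((n_o, n_h))          # factor loadings (Equation 8)
Psi = np.diag(rng.uniform(0.5, 2.0, n_o))      # diagonal noise covariance

# Joint covariance of (O, Y) from Equation (9).
c = np.block([[Lam @ Lam.T + Psi, Lam],
              [Lam.T,             np.eye(n_h)]])

# Claimed inverse from Equation (10).
Psi_inv = np.linalg.inv(Psi)
c_inv = np.block([[Psi_inv,          -Psi_inv @ Lam],
                  [-Lam.T @ Psi_inv, np.eye(n_h) + Lam.T @ Psi_inv @ Lam]])

assert np.allclose(c @ c_inv, np.eye(n_o + n_h))   # Equations (9) and (10) agree
```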
4 RESTRICTED DIFFUSION NETWORKS

In this section we discuss a subclass of diffusion networks that have a PoE architecture. To begin with, we divide the state vector of a diffusion network into observable and hidden units, $X = \binom{O}{H}$. It is easy to show that if the connectivity matrix is restricted so that there are no lateral connections from hidden units to hidden units and no lateral connections from observable units to observable units, i.e., if

$A = \mathcal{K} - W = \begin{pmatrix} A_o & -W_{oh} \\ -W_{oh}^T & A_h \end{pmatrix}$   (11)

where $A_o$ and $A_h$ are diagonal, then the limit distribution of $O$ is a PoE. Hereafter we refer to a diffusion network with no hidden-hidden or observable-observable connections as a restricted diffusion network (RDN).

Suppose we are given an arbitrary linear RDN. The limit distribution $p(o,h)$ has Gaussian marginal distributions $p(o)$ and $p(h)$ whose covariances can be found by inverting the block matrix in Equation (11):

$\Sigma_h = \big(A_h - W_{oh}^T A_o^{-1} W_{oh}\big)^{-1}$   (12)

$\Sigma_o = \big(A_o - W_{oh} A_h^{-1} W_{oh}^T\big)^{-1}$   (13)

Let $\mathcal{F}_{n_h}$ represent the set containing every distribution on $\mathbb{R}^{n_o}$ that is the limit distribution of the observable units of some linear RDN with $n_h$ hidden units. In some cases we can think of the hidden units of a network as the network's internal representation of the world. Some authors, such as Barlow (1994), have postulated that one of the organizing principles of early processing in the brain is for these internal representations to be as independent as possible. Thus it is of interest to study the set of linear RDNs whose hidden units are independent, i.e., for which $\Sigma_h$ is diagonal. One might wonder, for example, whether restricting $\Sigma_h$ to be the identity matrix imposes any constraints on $p(o)$. Let $\mathcal{F}^i_{n_h}$ represent the set containing every distribution on $\mathbb{R}^{n_o}$ that is the limit distribution of the observable units of some linear RDN with $n_h$ hidden units that are independent, $\Sigma_h = I$. We now show that forcing the hidden units to be independent does not constrain the class of observable distributions that can be represented by a linear RDN.

Theorem 1: $\mathcal{F}^i_{n_h} = \mathcal{F}_{n_h}$

Proof: Since $\mathcal{F}^i_{n_h}$ is a subset of $\mathcal{F}_{n_h}$, we just need to show that any distribution in $\mathcal{F}_{n_h}$ is also in $\mathcal{F}^i_{n_h}$. Let $p(o)$ be the limit distribution of an arbitrary linear RDN with parameters $W_{oh}$, $A_o$, and $A_h$. We will show that there exists a new linear RDN, with limit distribution $\hat p(o,h)$ and parameters $\hat W_{oh}$, $\hat A_h$, and $A_o$, such that $\hat\Sigma_h = I$ and $\hat p(o) = p(o)$. First we take the eigenvalue decomposition

$I - A_h^{-1/2} W_{oh}^T A_o^{-1} W_{oh} A_h^{-1/2} = U D U^T$   (14)

where $U$ is a unitary matrix and $D$ is diagonal. (The matrix on the left equals $A_h^{-1/2}\Sigma_h^{-1}A_h^{-1/2}$, so it is positive definite and $D^{-1/2}$ exists.) Equation (14) can be rewritten as $A_h - W_{oh}^T A_o^{-1} W_{oh} = A_h^{1/2} U D U^T A_h^{1/2}$. We define

$\hat A_h = D^{-1}$   (15)

$\hat W_{oh} = W_{oh} A_h^{-1/2} U D^{-1/2}$   (16)

Note that $\hat A_h$ is diagonal and $A_o$ is unchanged, so the new network is still an RDN. By Equation (12) (and after some derivations) we find that

$\hat\Sigma_h = \big(\hat A_h - \hat W_{oh}^T A_o^{-1} \hat W_{oh}\big)^{-1} = I$   (17)

By Equation (13) (and after some derivations) we find that

$\hat\Sigma_o = \big(A_o - \hat W_{oh} \hat A_h^{-1} \hat W_{oh}^T\big)^{-1} = \big(A_o - W_{oh} A_h^{-1} W_{oh}^T\big)^{-1} = \Sigma_o$   (18)

so $\hat p(o) = p(o)$. ∎
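The construction in Equations (14)-(18) can be checked numerically. The sketch below (ours; dimensions, seed, and weight scale are arbitrary, with small weights to keep $A$ positive definite) builds a random linear RDN, applies the transformation, and confirms that the new hidden covariance is the identity while the observable covariance is unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
n_o, n_h = 5, 3
A_o = np.diag(rng.uniform(1.0, 3.0, n_o))      # diagonal, as in Equation (11)
A_h = np.diag(rng.uniform(1.0, 3.0, n_h))
W = 0.2 * rng.standard_normal((n_o, n_h))      # small weights keep A pos. def.
A_o_inv = np.linalg.inv(A_o)

Sigma_o = np.linalg.inv(A_o - W @ np.linalg.inv(A_h) @ W.T)   # Equation (13)

# Eigendecomposition of Equation (14).
A_h_isqrt = np.diag(np.diag(A_h) ** -0.5)
D_vals, U = np.linalg.eigh(np.eye(n_h) - A_h_isqrt @ W.T @ A_o_inv @ W @ A_h_isqrt)

A_h_new = np.diag(1.0 / D_vals)                               # Equation (15)
W_new = W @ A_h_isqrt @ U @ np.diag(D_vals ** -0.5)           # Equation (16)

Sigma_h_new = np.linalg.inv(A_h_new - W_new.T @ A_o_inv @ W_new)
Sigma_o_new = np.linalg.inv(A_o - W_new @ np.linalg.inv(A_h_new) @ W_new.T)

assert np.allclose(Sigma_h_new, np.eye(n_h))   # Equation (17): independent hidden units
assert np.allclose(Sigma_o_new, Sigma_o)       # Equation (18): observable marginal unchanged
```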

Theorem 1 will help to elucidate the relationship between factor analysis and linear RDNs. We will show that the class of distributions over observable variables that can be generated by factor analysis models with $n_h$ hidden sources is the same as the class of distributions over observable units that can be generated by linear RDNs with $n_h$ hidden units.

5 FACTOR ANALYSIS AND DIFFUSION NETS

In this section we examine the relationship between feedforward factor analysis models and feedback diffusion networks. Let $\mathcal{G}_{n_h}$ represent the class of probability distributions over observable units that can be generated by factor analysis models with $n_h$ hidden units. In Theorem 2 we will show that this class of distributions is equivalent to $\mathcal{F}_{n_h}$, the class of distributions generated by linear RDNs. To do so, we first need to prove a lemma regarding the subclass of factor analysis models for which the lower-right block of the matrix in Equation (10), $I + \Lambda^T\Psi^{-1}\Lambda$, is diagonal. We define $\mathcal{G}^i_{n_h}$ as the class of probability distributions on $\mathbb{R}^{n_o}$ that can be generated by such restricted factor analysis models.

Lemma: $\mathcal{G}^i_{n_h} = \mathcal{G}_{n_h}$

Proof: Since $\mathcal{G}^i_{n_h}$ is a subset of $\mathcal{G}_{n_h}$, we just need to show that any distribution in $\mathcal{G}_{n_h}$ is also in $\mathcal{G}^i_{n_h}$. Let $p(o)$ be the distribution generated by an arbitrary factor analysis model with parameters $\Lambda$ and $\Psi$. We will show that there exists a new factor analysis model, with joint distribution $\hat p(o,y)$ and parameters $\hat\Lambda$ and $\Psi$, such that $\hat p(o) = p(o)$ and such that $I + \hat\Lambda^T\Psi^{-1}\hat\Lambda$ is diagonal. First we take the eigenvalue decomposition

$\Lambda^T\Psi^{-1}\Lambda = S E S^T$   (19)

where $S$ is a unitary matrix and the eigenvalue matrix $E$ is diagonal. Now define a rotated loading matrix

$\hat\Lambda = \Lambda S$   (20)

Then

$\hat\Lambda^T\Psi^{-1}\hat\Lambda = S^T\Lambda^T\Psi^{-1}\Lambda S = E$   (21)

which is diagonal. Thus $I + \hat\Lambda^T\Psi^{-1}\hat\Lambda$ is diagonal. Furthermore, from the equation $O = \hat\Lambda Y + Z$ we can derive

$\hat{\mathrm{Cov}}(O) = \hat\Lambda\hat\Lambda^T + \Psi = \Lambda S S^T \Lambda^T + \Psi = \Lambda\Lambda^T + \Psi = \mathrm{Cov}(O)$

so $\hat p(o) = p(o)$. ∎
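The rotation in Equations (19)-(21) is easy to verify numerically; this sketch is ours, with arbitrary dimensions and seed:

```python
import numpy as np

rng = np.random.default_rng(3)
n_o, n_h = 4, 2
Lam = rng.standard_normal((n_o, n_h))
Psi_inv = np.diag(rng.uniform(0.5, 2.0, n_o))   # inverse noise covariance

# Equation (19): eigendecomposition of the symmetric matrix Lam^T Psi^{-1} Lam.
E_vals, S = np.linalg.eigh(Lam.T @ Psi_inv @ Lam)

Lam_hat = Lam @ S                               # Equation (20): rotated loadings

M = Lam_hat.T @ Psi_inv @ Lam_hat               # Equation (21): now diagonal
assert np.allclose(M, np.diag(E_vals))
# The observable covariance is unchanged, since S S^T = I.
assert np.allclose(Lam_hat @ Lam_hat.T, Lam @ Lam.T)
```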
Now we are ready to prove the main result of this paper: the fact that factor analysis models are equivalent to linear RDNs.

Theorem 2: $\mathcal{F}_{n_h} = \mathcal{G}_{n_h}$

Proof: ($\mathcal{G}_{n_h} \subseteq \mathcal{F}_{n_h}$) Let $p(o)$ be the distribution over observable units generated by any factor analysis model. By the Lemma, there exists a factor analysis model with parameters $\Lambda$ and $\Psi$ such that its marginal distribution over the observable units equals the given distribution $p(o)$, and for which $I + \Lambda^T\Psi^{-1}\Lambda$ is diagonal. Let $q$ be the limit distribution of the linear diffusion network with parameters $W_{oh}$, $A_o$, and $A_h$ given by setting the connection matrix of Equation (11) equal to the inverse covariance matrix of Equation (10):

$A = c^{-1}$   (22)

Then $A_o = \Psi^{-1}$ and $A_h = I + \Lambda^T\Psi^{-1}\Lambda$ are both diagonal, so this diffusion network is an RDN. By Equation (3), the limit distribution of the network is Gaussian with covariance $A^{-1} = c$, from which it follows that the joint distribution matches that of the factor analysis model. In particular, $q(o) = p(o)$. Thus $p(o)$ is the limit distribution of a linear RDN.

($\mathcal{F}_{n_h} \subseteq \mathcal{G}_{n_h}$) Let $p(o)$ be the limit distribution over observable units of any linear RDN. By Theorem 1, there exists a linear RDN, with distribution $\hat p$ and parameters $W_{oh}$, $A_o$, and $A_h$, such that its marginal distribution over the observable units equals the given distribution $p(o)$ and for which $\hat\Sigma_h = I$. Consider the factor analysis model with parameters

$\Psi = A_o^{-1}, \qquad \Lambda = A_o^{-1} W_{oh}$

Using reasoning similar to that used in the first part of this proof, one can verify that this factor analysis model generates a distribution over observable units that is equal to $p(o)$. ∎
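The correspondence in Theorem 2 can also be checked numerically. The sketch below (ours) starts from a random factor analysis model, applies the Lemma's rotation so that $I + \Lambda^T\Psi^{-1}\Lambda$ is diagonal, reads off the RDN parameters via Equation (22), and confirms that the RDN's observable covariance from Equation (13) equals the factor analysis covariance $\Lambda\Lambda^T + \Psi$.

```python
import numpy as np

rng = np.random.default_rng(4)
n_o, n_h = 5, 2
Lam = rng.standard_normal((n_o, n_h))
Psi = np.diag(rng.uniform(0.5, 2.0, n_o))
Psi_inv = np.linalg.inv(Psi)

# Lemma: rotate the loadings so that I + Lam^T Psi^{-1} Lam is diagonal.
_, S = np.linalg.eigh(Lam.T @ Psi_inv @ Lam)
Lam = Lam @ S

# Equation (22): read off RDN parameters from c^{-1} (Equation 10).
A_o = Psi_inv                                   # diagonal
A_h = np.eye(n_h) + Lam.T @ Psi_inv @ Lam       # diagonal after the rotation
W = Psi_inv @ Lam                               # -W is the off-diagonal block

assert np.allclose(A_h, np.diag(np.diag(A_h)))  # A_h is indeed diagonal

# Equation (13): the RDN's observable covariance...
Sigma_o = np.linalg.inv(A_o - W @ np.linalg.inv(A_h) @ W.T)
# ...matches the factor analysis covariance Lam Lam^T + Psi (Equation 9).
assert np.allclose(Sigma_o, Lam @ Lam.T + Psi)
```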

6 SIMULATIONS

RDNs have continuous states rather than binary states, so they can represent pixel intensities in a more natural manner than restricted Boltzmann machines (RBMs) can. We used contrastive divergence learning rules to train a linear RDN with 10 hidden units in an unsupervised manner on a database of 48 face images (16 people in three different lighting conditions) from the MIT Face Database (Turk & Pentland, 1991). The linear RDN had 3600 observable units (60 x 60 pixels). Each expert in the product of experts consists of one hidden unit and the learned connection weights between that hidden unit and all of the observable units. These learned connection weights can be thought of as the receptive field of that hidden unit. Figure 1 shows the receptive fields of the 10 hidden units after training.

[Figure 1: Receptive fields of the 10 hidden units of a linear RDN that was trained on a database of 48 face images by minimizing contrastive divergence.]

Because of the correspondence between linear RDNs and factor analysis that was proven in previous sections, the receptive field of a hidden unit in the linear RDN should correspond to a column of the factor loading matrix $\Lambda$ in Equation (8) (after the column is normalized by premultiplying by $\Psi^{-1}$, the inverse of the variance of the noise in each observable dimension). Because of the relationship between factor analysis and principal component analysis, one would expect the receptive fields of the hidden units of the linear RDN to resemble the eigenfaces (Turk & Pentland, 1991), which are the eigenvectors of the pixelwise covariance matrix of the same database. We calculated the eigenfaces of the database; the eigenfaces corresponding to the 10 largest eigenvalues are shown in Figure 2. The qualitative similarity between the linear RDN receptive fields in Figure 1 and the eigenfaces in Figure 2 is readily apparent.

[Figure 2: Eigenfaces calculated from the same database of 48 face images.]

Partially occluded images correspond to splitting the observable units into those with known values, $O_k$, and those with unknown values, $O_u$. Given an image with known values $o_k$, reconstructing the values of the occluded units involves finding the posterior distribution $p(o_u \mid O_k = o_k)$. The feedback architecture of diffusion networks provides a natural way to find such a posterior distribution: clamp the unoccluded observable units to the values $o_k$ and let the network settle to equilibrium. The equilibrium distribution of the network will then be the posterior distribution $p(o_u, h \mid O_k = o_k)$. Using the linear RDN that was trained on face images, we reconstructed occluded versions of the images in the training set by taking the mean of this posterior distribution. Figure 3 shows the results of reconstructing two occluded face images.

[Figure 3: Reconstruction of two occluded images. Left: image from the training set for the linear RDN. Center: the observable units corresponding to the visible pixels were clamped, while the observable units corresponding to the occluded pixels were allowed to run free. Right: reconstructed image, shown as the mean of the equilibrium distribution.]
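For a linear RDN the clamped equilibrium mean can also be computed in closed form, since the limit distribution is Gaussian with precision matrix $A$. The sketch below (ours, on a toy precision matrix; the unknown block may include both occluded observables and hidden units) computes the conditional mean that the network's settling process converges to on average.

```python
import numpy as np

def reconstruct(A, x_known, known_idx, unknown_idx):
    """Posterior mean of the unclamped units of a zero-mean Gaussian
    with precision matrix A, given clamped values x_known.
    For a Gaussian, E[x_u | x_k] = -A_uu^{-1} A_uk x_k."""
    A_uu = A[np.ix_(unknown_idx, unknown_idx)]
    A_uk = A[np.ix_(unknown_idx, known_idx)]
    return -np.linalg.solve(A_uu, A_uk @ x_known)

# Toy example: a symmetric positive definite 4-unit precision matrix.
rng = np.random.default_rng(5)
B = 0.2 * rng.standard_normal((4, 4))
A = 2.0 * np.eye(4) + (B + B.T) / 2

x_mean = reconstruct(A, x_known=np.array([1.0, -0.5]),
                     known_idx=[0, 1], unknown_idx=[2, 3])
print(x_mean)   # completion of the two "occluded" units
```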
relatively fast A main advan- 484

A main advantage of feedback networks over feedforward models is that one can easily solve inference problems, such as pattern completion for occluded images, using the same architecture and algorithm that are used for generation. Most importantly, diffusion networks can also be defined with nonlinear activation functions (Movellan, Mineiro, & Williams, in press), and we have begun to explore nonlinear extensions of the linear case outlined here. Extensions of this work to nonlinear diffusion networks open the door to new ICA-like algorithms based on feedback rather than feedforward architectures.

REFERENCES

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4).

Barlow, H. (1994). What is the computational goal of the neocortex? In C. Koch (Ed.), Large scale neuronal theories of the brain (pp. 1-22). Cambridge, MA: MIT Press.

Bell, A., & Sejnowski, T. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Computation.

Gidas, B. (1986). Metropolis-type Monte Carlo simulation algorithms and simulated annealing. In J. L. Snell (Ed.), Topics in contemporary probability and its applications. Boca Raton: CRC Press.

Hinton, G. E. (in press). Training products of experts by minimizing contrastive divergence. Neural Computation.

Movellan, J. R., Mineiro, P., & Williams, R. J. (in press). Partially observable SDE models for image sequence recognition tasks. In T. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press.

Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1).
