Unsupervised Learning of Hierarchical Models, in collaboration with Josh Susskind and Vlad Mnih
1 Unsupervised Learning of Hierarchical Models Marc'Aurelio Ranzato Geoff Hinton in collaboration with Josh Susskind and Vlad Mnih Advanced Machine Learning, 9 March 2011
2 Example: facial expression recognition Unlabeled face images Few labeled face images happiness neutral anger fear sadness What is the expression? (unseen occluded images)
3 Example: facial expression recognition 1) training on unlabeled data: maximum likelihood layer 2 layer 1 Input image Ranzato, Susskind, Mnih, Hinton On deep generative models.. CVPR11
4 Example: facial expression recognition 1) training on unlabeled data: maximum likelihood 2) use the generative model to impute missing values Ranzato, Susskind, Mnih, Hinton On deep generative models.. CVPR11
5 Example: facial expression recognition 1) training on unlabeled data: maximum likelihood 2) use the generative model to impute missing values 3) use the latent representation as features 4) train a classifier on the labeled features Ranzato, Susskind, Mnih, Hinton On deep generative models.. CVPR11
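The four-step pipeline above can be sketched in a few lines. This is an illustrative stand-in, not the paper's code: PCA plays the role of the deep generative model, and reconstruction from the latent code plays the role of generative imputation; all function names here are hypothetical.

```python
import numpy as np

def fit_pca(X, k):
    # step 1 stand-in: "train" an unsupervised model on unlabeled data
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]                      # mean and k principal directions

def features(X, mu, W):
    # step 3: use the latent representation as features
    return (X - mu) @ W.T

def impute(x, mask, mu, W):
    # step 2: fill missing entries (mask == False) with the model's reconstruction
    h = (np.where(mask, x, mu) - mu) @ W.T
    recon = mu + h @ W
    return np.where(mask, x, recon)

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(100, 8))
mu, W = fit_pca(X_unlabeled, 3)
x = X_unlabeled[0].copy()
mask = np.ones(8, dtype=bool)
mask[2] = False                            # one "occluded" dimension
x_filled = impute(x, mask, mu, W)
feats = features(X_unlabeled, mu, W)       # step 4 would train a classifier on these
```

Observed entries pass through unchanged; only the occluded entry is replaced by the model's guess.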
6 The power of hierarchical representations MNIST digits 2nd layer sparse code 1st layer sparse code input image Receptive field of 2nd layer units Ranzato et al. Sparse feature learning for Deep Belief Nets NIPS07
8 The power of hierarchical representations MNIST digits 2nd layer sparse code 1st layer sparse code input image Receptive fields of 2nd layer units Hierarchical representations can discover complex dependencies! Ranzato et al. Sparse feature learning for Deep Belief Nets NIPS07
9 Unsupervised Learning - we will learn hierarchical representations in a greedy fashion using unlabeled data (little if any assumption on the input) - we can focus on single-layer unsupervised algorithms - how can we interpret unsupervised algorithms? - how can they learn representations? - how can we compare them?
10 Two Approaches to Unsupervised Learning 1st strategy: constrain latent representation & optimize score only at training samples - K-Means - sparse coding Ranzato et al. NIPS 06, Ranzato et al. CVPR 07, Ranzato et al. NIPS 07, Kavukcuoglu et al. CVPR 09 - use lower dimensional representations Ranzato et al. ICML 08
11 Two Approaches to Unsupervised Learning 1st strategy: constrain latent representation & optimize score only at training samples - K-Means - sparse coding - use lower dimensional representations 2nd strategy: optimize score for training samples while normalizing the score over the whole space (maximum likelihood) Ranzato et al. AISTATS 10, Ranzato et al. CVPR 10, Ranzato et al. NIPS 10, Ranzato et al. CVPR 11 p(x) x
12 Two Approaches to Unsupervised Learning. 1st strategy (constrain code): Pros - efficient learning; defined through an objective function (easy to monitor training). Cons - choice of functions; not robust to missing inputs; not good for generation. 2nd strategy (maximum likelihood): Pros - intuitive design; it can generate; compositionality. Cons - hard to train (variational inference, MCMC, etc.)
13 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
14 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
15 Means and Covariances: PPCA. The marginal $p(x)$ specifies the covariance across all data points. [diagram: input space vs. latent space, $p(x \mid h)$, a training sample and its latent vector]
16 Means and Covariances: PPCA. Given $h_t \sim p(h \mid x_t)$, the conditional $p(x \mid h_t)$ specifies a distribution for each data point. [diagram as above]
17 Means and Covariances: PPCA. $p(x) = \int_h p(x \mid h)\, p(h)\, dh$: the conditional $p(x \mid h)$ specifies a distribution for each data point. [diagram as above]
18 Conditional Distribution Over Input: $p(x \mid h) = N(\mathrm{mean}(h), D)$ - examples: PPCA, Factor Analysis, ICA, Gaussian. [diagram: input space vs. latent space, $p(x \mid h)$, training sample, latent vector]
19 Conditional Distribution Over Input: $p(x \mid h) = N(\mathrm{mean}(h), D)$ - a poor model of the dependency between input variables. [diagram as above]
20 Conditional Distribution Over Input: $p(x \mid h) = N(0, \mathrm{Covariance}(h))$ - examples: PoT, cRBM. [diagram as above] Welling et al. NIPS 2003, Ranzato et al. AISTATS 10
21 Conditional Distribution Over Input: $p(x \mid h) = N(0, \mathrm{Covariance}(h))$ - a poor model of the mean intensity. [diagram as above] Welling et al. NIPS 2003, Ranzato et al. AISTATS 10
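The contrast between the two families can be made concrete with a toy sketch (illustrative only, not the paper's code): a "mean" model such as FA/PPCA moves a Gaussian with a fixed covariance $D$ around as the latents change, while a "covariance" model such as PoT keeps the mean at zero and lets the latents reshape the covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.eye(2)                                   # fixed covariance for the mean model

def sample_mean_model(h, n):
    # p(x | h) = N(W h, D): latents set the mean, covariance is fixed
    W = np.eye(2)
    return rng.multivariate_normal(W @ h, D, size=n)

def sample_cov_model(h, n):
    # p(x | h) = N(0, Sigma(h)): latents gate the (here diagonal) covariance
    Sigma = np.diag(1.0 + h)
    return rng.multivariate_normal(np.zeros(2), Sigma, size=n)

h = np.array([3.0, 0.0])
xm = sample_mean_model(h, 20000)   # mean near (3, 0), unit variance
xc = sample_cov_model(h, 20000)    # mean near (0, 0), variance (4, 1)
```

Neither family alone captures both effects, which is the motivation for modeling the mean and covariance jointly in the next slides.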
22 input image
23 input image Which image has similar mean (but different covariance)? Which image has same covariance (but different mean)?
24 input image: same covariance but different mean - equivalent image according to PoT; similar mean but different covariance - equivalent image according to FA
25 Conditional Distribution Over Input: $p(x \mid h) = N(\mathrm{mean}(h), \mathrm{covariance}(h))$ - this is what we propose: mcRBM, mPoT. [diagram: input space vs. latent space] Ranzato et al. CVPR 10, Ranzato et al. NIPS 2010, Ranzato et al. CVPR 11
27 mcRBM: two sets of latent variables modulate the mean and the covariance of the conditional distribution over the input. Energy-based model: $p(x, h^m, h^c) \propto \exp(-E(x, h^m, h^c))$, with $x \in \mathbb{R}^D$, $h^c \in \{0,1\}^M$, $h^m \in \{0,1\}^N$. Ranzato Hinton CVPR 10
28 Easy generation, slow inference: energy $E = \frac{1}{2}(x - m)' \Sigma^{-1} (x - m)$ gives $p(x \mid h^m, h^c) = N(m(h^m), \Sigma(h^c))$.
29 Easy generation, slow inference: $E = \frac{1}{2}(x - m)' \Sigma^{-1} (x - m)$, i.e. $p(x \mid h^m, h^c) = N(m(h^m), \Sigma(h^c))$. Less easy generation, fast inference: $E = \frac{1}{2} x' \Sigma^{-1} x - x' m$, i.e. $p(x \mid h^m, h^c) = N(\Sigma\, m(h^m), \Sigma)$.
30 Geometric interpretation If we multiply them, we get...
31 Geometric interpretation: if we multiply them, we get... a sharper and shorter Gaussian $p(x \mid h^m, h^c)$
32 Covariance part of the energy function: $E(x, h^c, h^m) = \frac{1}{2} x' \Sigma^{-1} x$, with $x \in \mathbb{R}^D$, $\Sigma^{-1} \in \mathbb{R}^{D \times D}$ - a pair-wise MRF over pixel pairs $(x_p, x_q)$. Ranzato Hinton CVPR 10
33 Covariance part of the energy function, factorized: $E(x, h^c, h^m) = \frac{1}{2} x' C C' x$, with $C \in \mathbb{R}^{D \times F}$ - pair-wise MRF. Ranzato Hinton CVPR 10
34 Covariance part of the energy function, factorization + hiddens: $E(x, h^c, h^m) = \frac{1}{2} x' C\,\mathrm{diag}(h^c)\, C' x$, with $h^c \in \{0,1\}^F$ - gated MRF. Ranzato Hinton CVPR 10
35 Covariance part of the energy function, factorization + pooled hiddens: $E(x, h^c, h^m) = \frac{1}{2} x' C\,\mathrm{diag}(P h^c)\, C' x$, with $P \in \mathbb{R}^{F \times M}$, $h^c \in \{0,1\}^M$ - gated MRF. Ranzato Hinton CVPR 10
36 Overall energy function: $E(x, h^c, h^m) = \frac{1}{2} x' C\,\mathrm{diag}(P h^c)\, C' x + \frac{1}{2} x' x - x' W h^m$ (covariance part + mean part), with $W \in \mathbb{R}^{D \times N}$, $h^m \in \{0,1\}^N$ - gated MRF. Ranzato Hinton CVPR 10
37 Overall energy function, generation: $p(x \mid h^c, h^m) = N(\Sigma W h^m, \Sigma)$ with $\Sigma^{-1} = C\,\mathrm{diag}(P h^c)\, C' + I$. Ranzato Hinton CVPR 10
38 Overall energy function, inference: $p(h^c_k = 1 \mid x) = \sigma\!\big(-\frac{1}{2} P_k' (C' x)^2 + b^c_k\big)$ and $p(h^m_j = 1 \mid x) = \sigma(W_j' x + b^m_j)$. Ranzato Hinton CVPR 10
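The two inference equations are cheap, closed-form sigmoids. A minimal sketch, with random weights standing in for learned ones (the shapes follow the slides: $x \in \mathbb{R}^D$, filters $C \in \mathbb{R}^{D \times F}$, non-negative pooling $P \in \mathbb{R}^{F \times M}$, mean weights $W \in \mathbb{R}^{D \times N}$; the minus sign in the covariance-unit activation is the convention implied by $p \propto e^{-E}$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infer(x, C, P, W, b_c, b_m):
    factor = (C.T @ x) ** 2                       # squared filter responses
    p_hc = sigmoid(-0.5 * (P.T @ factor) + b_c)   # covariance hiddens h^c
    p_hm = sigmoid(W.T @ x + b_m)                 # mean hiddens h^m
    return p_hc, p_hm

rng = np.random.default_rng(0)
D, F, M, N = 16, 32, 8, 8
x = rng.normal(size=D)
C = rng.normal(size=(D, F))
P = rng.random((F, M))                            # non-negative pooling matrix
W = rng.normal(size=(D, N))
p_hc, p_hm = infer(x, C, P, W, np.zeros(M), np.zeros(N))
```

With zero biases the covariance units can only be suppressed by strong filter responses, which is the gating behaviour the model relies on.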
39 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
40 Learning: maximum likelihood, $p(x) = \dfrac{\sum_{h^m, h^c} e^{-E(x, h^m, h^c)}}{\int_x \sum_{h^m, h^c} e^{-E(x, h^m, h^c)}}$ - Fast Persistent Contrastive Divergence - Hybrid Monte Carlo to draw samples. $E = \frac{1}{2} x' C\,\mathrm{diag}(P h^c)\, C' x + \frac{1}{2} x' x - x' W h^m + \dots$ [diagram: gated MRF with $h^c$, $P$, $C$, $W$, $h^m$]
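Hybrid (Hamiltonian) Monte Carlo can be sketched on a toy quadratic energy; this is a generic HMC step, not the paper's implementation, which runs HMC on the gated MRF's free energy with persistent chains:

```python
import numpy as np

def hmc_step(x, energy, grad, rng, eps=0.1, n_leapfrog=20):
    # One HMC transition for p(x) ∝ exp(-E(x)): sample a momentum, simulate
    # leapfrog dynamics, then Metropolis accept/reject on the Hamiltonian.
    p = rng.normal(size=x.shape)
    x_new, p_new = x.copy(), p.copy()
    p_new -= 0.5 * eps * grad(x_new)          # half momentum step
    for _ in range(n_leapfrog - 1):
        x_new += eps * p_new                  # full position step
        p_new -= eps * grad(x_new)            # full momentum step
    x_new += eps * p_new
    p_new -= 0.5 * eps * grad(x_new)          # final half momentum step
    h_old = energy(x) + 0.5 * p @ p
    h_new = energy(x_new) + 0.5 * p_new @ p_new
    return x_new if rng.random() < np.exp(h_old - h_new) else x

rng = np.random.default_rng(0)
E = lambda x: 0.5 * x @ x                     # toy energy: standard Gaussian
dE = lambda x: x
x = np.zeros(2)
samples = []
for _ in range(2000):
    x = hmc_step(x, E, dE, rng)
    samples.append(x.copy())
S = np.array(samples[500:])                   # discard burn-in
```

For this quadratic energy the chain should recover a standard Gaussian: mean near zero, unit variance.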
41-47 Learning: free energy $F = -\log p(x) + K$. The weight update follows $\partial F / \partial w$, lowering $F$ at the training sample, while HMC dynamics explore the surface to find samples. [animation over slides 41-47: $F(w)$ surface with the training sample and the HMC trajectory]
48 Learning: $F = -\log p(x) + K$. Weight update: $\Delta w \propto -\big( \partial F / \partial w \,\big|_{\text{data}} - \partial F / \partial w \,\big|_{\text{sample}} \big)$. [diagram: $F(w)$ with the data point and the sample point]
50 Learning: start the dynamics from the previous point (a persistent chain). $F = -\log p(x) + K$; the update compares $\partial F / \partial w$ at the data point and at the sample point.
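The "start from the previous point" idea is persistent contrastive divergence. A minimal sketch on a plain binary RBM (a simplification of the gated MRF, and all constants here are illustrative): the weight update is exactly the data-term minus sample-term difference from the slide, with the fantasy particles carried over between updates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, H, B = 6, 4, 16
W = 0.01 * rng.normal(size=(D, H))
W0 = W.copy()
X = (rng.random((64, D)) < 0.5).astype(float)        # toy binary "data"
chain = (rng.random((B, D)) < 0.5).astype(float)     # persistent fantasy particles

for _ in range(200):
    v = X[rng.integers(0, 64, B)]                    # minibatch of data
    h_data = sigmoid(v @ W)                          # positive phase
    # one Gibbs step on the persistent chain (starts from the previous point)
    h = (rng.random((B, H)) < sigmoid(chain @ W)).astype(float)
    chain = (rng.random((B, D)) < sigmoid(h @ W.T)).astype(float)
    h_model = sigmoid(chain @ W)                     # negative phase
    # Δw ∝ (data statistics) - (sample statistics)
    W += 0.01 * (v.T @ h_data - chain.T @ h_model) / B
```

Biases and a decaying "fast" weight copy (the F in FPCD) are omitted to keep the sketch short.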
51 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
52 Speech Recognition on TIMIT INPUT - frame: 25ms Hamming window - 10ms overlap between frames - 39 log-filterbank outputs + energy per frame - 15 frames - PCA whitening 384 components - no deltas, no delta-deltas... <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
54 Speech Recognition on TIMIT gated MRF with 1024 precision and 512 mean units trained on independent windows gmrf <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
56 Speech Recognition on TIMIT - For each training sample, infer latent variables of gated MRF - Train 2nd layer binary-binary (2048 units) gmrf <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
57 Speech Recognition on TIMIT - Train 1st layer gated MRF - Train 2nd layer binary-binary (2048 units) - Train 3rd layer binary-binary (2048 units) gmrf <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
58 Speech Recognition on TIMIT 8 layer (deep) model... gmrf <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
59 Speech Recognition on TIMIT 8 layer (deep) model Supervised training: predict states of aligned HMM... gmrf <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
60 Speech Recognition on TIMIT 8 layer (deep) model Test: predict states of HMM decoding (bigram language model)... gmrf <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
61 Speech Recognition on TIMIT
METHOD / PER
CRF: 34.8%
Large-Margin GMM: 33.0%
CD-HMM: 27.3%
Augmented CRF: 26.6%
RNN: 26.1%
Bayesian Triphone HMM: 25.6%
Triphone HMM discrim. trained: 22.7%
DBN with gated MRF: 20.5%
63 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
64 Generation of natural image patches. [panels: natural images; mcRBM samples (Ranzato and Hinton CVPR 2010); samples from the models of Osindero and Hinton NIPS 2008]
65 From Patches to High-Resolution Images Training by picking patches at random
66 From Patches to High-Resolution Images. But we could also take them from a grid. This is not a good way to extend the model to big images: block artifacts.
67 From Patches to High-Resolution Images IDEA: have one subset of filters applied to these locations,
68 From Patches to High-Resolution Images IDEA: have one subset of filters applied to these locations, another subset to these locations
69 From Patches to High-Resolution Images IDEA: have one subset of filters applied to these locations, another subset to these locations, etc. Train jointly all parameters. Gregor LeCun arxiv 2010 Ranzato, Mnih, Hinton NIPS 2010 No block artifacts Reduced redundancy
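The tiling idea above can be sketched as follows (an illustrative toy, not the paper's code; all sizes are made up): patches are taken on a stride grid, and which filter set a patch sees depends on its position modulo the tile size, so neighbouring patches use different filters and no single grid imposes block boundaries.

```python
import numpy as np

def tiled_responses(img, filters, patch=4, stride=2, tile=2):
    # filters: shape (tile*tile, n_filt, patch*patch); one filter set per
    # grid-position class. All parameters are trained jointly in the paper.
    H, W = img.shape
    out = {}
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            # position class decides which filter subset applies here
            set_idx = ((i // stride) % tile) * tile + (j // stride) % tile
            p = img[i:i + patch, j:j + patch].ravel()
            out[(i, j)] = filters[set_idx] @ p
    return out

rng = np.random.default_rng(0)
img = rng.normal(size=(12, 12))
filters = rng.normal(size=(4, 3, 16))   # 2x2 tile of filter sets, 3 filters each
resp = tiled_responses(img, filters)
```

With a 12x12 image, 4x4 patches and stride 2, this visits a 5x5 grid of patch locations, cycling through the four filter sets.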
70 Sampling High-Resolution Images Gaussian model from Simoncelli 2005 marginal wavelet
71 Sampling High-Resolution Images Gaussian model marginal wavelet from Simoncelli 2005 Pair-wise MRF FoE from Schmidt, Gao, Roth CVPR 2010
72 Sampling High-Resolution Images Gaussian model marginal wavelet Mean Covariance Model from Simoncelli 2005 Pair-wise MRF FoE Ranzato, Mnih, Hinton NIPS 2010 from Schmidt, Gao, Roth CVPR 2010
75 Making the model... DEEPER. Treat these units as data to train a similar model on top. SECOND STAGE: a field of binary units. Each hidden unit has a receptive field of 30x30 pixels in input space.
76 Sampling from the DEEPER model - sample from the 2nd layer - project the sample into image space through the 1st layer $p(x \mid h)$. Second-layer energy: $E(h^1, h^2) = -h^1{}' W h^2$, with $p(h^1_j = 1 \mid h^2) = \sigma(W_j h^2 + b^1_j)$ and $p(h^2_k = 1 \mid h^1) = \sigma(W_k' h^1 + b^2_k)$.
77 Samples from Deep Generative Model 1st stage model 3rd stage model
81 Sampling High-Resolution Images: Gaussian model (from Simoncelli 2005); marginal wavelet (from Simoncelli 2005); FoE (from Schmidt et al. CVPR 2010); Deep - 1 layer (Ranzato, Mnih, Hinton NIPS 2010); Deep - 3 layers (Ranzato et al. CVPR 2011)
82 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
83 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions anger Ranzato, et al. CVPR 2011
84 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions disgust
85 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions fear
86 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions happiness
87 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions neutral
88 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions sadness
89 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions surprise
90 Facial Expression Recognition - 1st layer using local (not shared) connectivity - layers above are fully connected - 5 layers in total - Results: - Linear classifier on raw pixels: 71.5% - Gaussian RBF SVM on raw pixels: 76.2% - Gabor + PCA + linear classifier: 80.1% (Dailey et al., J. Cog. Science) - Sparse coding: 74.6% (Wright et al., PAMI) - DEEP model (3 layers): 82.5%
91 Facial Expression Recognition - We can draw samples from the model (5th layer with 128 hiddens)... 5th layer... gmrf 1st layer
92 Facial Expression Recognition - We can draw samples from the model (5th layer with 128 hiddens)
93 Facial Expression Recognition - We can draw samples from the model (5th layer with 128 hiddens) gmrf gmrf... 5th layer
96 Facial Expression Recognition - 7 synthetic occlusions - use the generative model to fill in the missing pixels (conditional on the known pixels) gmrf gmrf gmrf gmrf gmrf...
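Conditioning on the known pixels can be illustrated with a plain joint Gaussian instead of the deep model (an illustrative stand-in, not the paper's method): the occluded pixels are replaced by their conditional mean given the observed ones, $x_m \mid x_o \sim N\big(\mu_m + \Sigma_{mo} \Sigma_{oo}^{-1}(x_o - \mu_o),\ \cdot\big)$.

```python
import numpy as np

def fill_in(x, observed, mu, Sigma):
    # Replace unobserved entries by the Gaussian conditional mean given
    # the observed entries; observed entries are left untouched.
    o = np.where(observed)[0]
    m = np.where(~observed)[0]
    Soo = Sigma[np.ix_(o, o)]
    Smo = Sigma[np.ix_(m, o)]
    x_filled = x.copy()
    x_filled[m] = mu[m] + Smo @ np.linalg.solve(Soo, x[o] - mu[o])
    return x_filled

# two strongly correlated "pixels"; the second one is occluded
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
x = np.array([1.0, 0.0])
observed = np.array([True, False])
x_filled = fill_in(x, observed, mu, Sigma)   # second entry becomes 0.9 * 1.0
```

The deep model plays the same role but with a far richer conditional than a single Gaussian, which is why it can restore whole occluded face regions.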
97 Facial Expression Recognition originals Type 1 occlusion: eyes Restored images
98 Facial Expression Recognition originals Type 2 occlusion: mouth Restored images
99 Facial Expression Recognition originals Type 3 occlusion: right half Restored images
100 Facial Expression Recognition originals Type 4 occlusion: bottom half Restored images
101 Facial Expression Recognition originals Type 5 occlusion: top half Restored images
102 Facial Expression Recognition originals Type 6 occlusion: nose Restored images
103 Facial Expression Recognition originals Type 7 occlusion: 70% of pixels at random Restored images
104 Facial Expression Recognition Original Input
105 Facial Expression Recognition Original Input 1st layer gmrf gmrf
106 Facial Expression Recognition Original Input 1st layer 2nd layer gmrf gmrf
107 Facial Expression Recognition Original Input 1st layer 2nd layer 3rd layer gmrf gmrf
108 Facial Expression Recognition Original Input 1st layer 2nd layer 3rd layer 4th layer gmrf gmrf
109 Facial Expression Recognition Original Input 1st layer 2nd layer 3rd layer gmrf gmrf gmrf gmrf gmrf 4th layer times
110 Facial Expression Recognition CASE 1: original images for training, occluded for test Dailey, et al. J. Cog. Neuros Wright, et al. PAMI 2008 Ranzato, et al. CVPR 2011
111 Facial Expression Recognition CASE 2: occluded images for both training and test Dailey, et al. J. Cog. Neuros Wright, et al. PAMI 2008 Ranzato, et al. CVPR 2011
112 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
113 Summary - Unsupervised Learning: regularize & learn representations - Deep Generative Model - 1st layer: gated MRF - Higher layers: binary units - fast inference - Realistic generation: natural images - Applications: - facial expression recognition robust to occlusion - speech recognition
114 THANK YOU
115 Learned Filters [figure: covariance filters C with pooling P, and mean filters W]
117 Using -Energy to Score Images: test images ordered from more likely to less likely. [figure]
118 Using Energy to Score Images: upside-down images, ordered by energy. [figure]
119 Using Energy to Score Images: average of those images for which the difference of energy is higher. [figure]
Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning Diederik (Durk) Kingma Danilo J. Rezende (*) Max Welling Shakir Mohamed (**) Stochastic Gradient Variational Inference Bayesian
More informationDeep Belief Networks are compact universal approximators
1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation
More informationLatent Variable Models and Expectation Maximization
Latent Variable Models and Expectation Maximization Oliver Schulte - CMPT 726 Bishop PRML Ch. 9 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15
More informationarxiv: v2 [cs.ne] 22 Feb 2013
Sparse Penalty in Deep Belief Networks: Using the Mixed Norm Constraint arxiv:1301.3533v2 [cs.ne] 22 Feb 2013 Xanadu C. Halkias DYNI, LSIS, Universitè du Sud, Avenue de l Université - BP20132, 83957 LA
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures
More informationSpeaker Representation and Verification Part II. by Vasileios Vasilakakis
Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation
More informationTensor Methods for Feature Learning
Tensor Methods for Feature Learning Anima Anandkumar U.C. Irvine Feature Learning For Efficient Classification Find good transformations of input for improved classification Figures used attributed to
More informationExpectation Maximization (EM)
Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted
More informationAn efficient way to learn deep generative models
An efficient way to learn deep generative models Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of Toronto Joint work with: Ruslan Salakhutdinov, Yee-Whye
More informationHierarchical Sparse Bayesian Learning. Pierre Garrigues UC Berkeley
Hierarchical Sparse Bayesian Learning Pierre Garrigues UC Berkeley Outline Motivation Description of the model Inference Learning Results Motivation Assumption: sensory systems are adapted to the statistical
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a
More informationLatent Variable Models
Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 5 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 1 / 31 Recap of last lecture 1 Autoregressive models:
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 8 Continuous Latent Variable
More informationDeep Learning Srihari. Deep Belief Nets. Sargur N. Srihari
Deep Belief Nets Sargur N. Srihari srihari@cedar.buffalo.edu Topics 1. Boltzmann machines 2. Restricted Boltzmann machines 3. Deep Belief Networks 4. Deep Boltzmann machines 5. Boltzmann machines for continuous
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationLatent Variable Models and Expectation Maximization
Latent Variable Models and Expectation Maximization Oliver Schulte - CMPT 726 Bishop PRML Ch. 9 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15
More informationStochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence
ESANN 0 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 7-9 April 0, idoc.com publ., ISBN 97-7707-. Stochastic Gradient
More informationUsing Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes Ruslan Salakhutdinov and Geoffrey Hinton Department of Computer Science, University of Toronto 6 King s College Rd, M5S 3G4, Canada
More informationLearning to Disentangle Factors of Variation with Manifold Learning
Learning to Disentangle Factors of Variation with Manifold Learning Scott Reed Kihyuk Sohn Yuting Zhang Honglak Lee University of Michigan, Department of Electrical Engineering and Computer Science 08
More informationSemi-Supervised Learning
Semi-Supervised Learning getting more for less in natural language processing and beyond Xiaojin (Jerry) Zhu School of Computer Science Carnegie Mellon University 1 Semi-supervised Learning many human
More informationApproximate inference in Energy-Based Models
CSC 2535: 2013 Lecture 3b Approximate inference in Energy-Based Models Geoffrey Hinton Two types of density model Stochastic generative model using directed acyclic graph (e.g. Bayes Net) Energy-based
More informationFeature Noising. Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager
Feature Noising Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager Outline Part 0: Some backgrounds Part 1: Dropout as adaptive
More informationMachine Learning, Midterm Exam
10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have
More informationWhy DNN Works for Acoustic Modeling in Speech Recognition?
Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang Department of Computer Science and Engineering York University, Toronto, Ont. M3J 1P3, CANADA Joint work with Y. Bao, J. Pan,
More informationNormalization Techniques in Training of Deep Neural Networks
Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,
More informationAuto-Encoders & Variants
Auto-Encoders & Variants 113 Auto-Encoders MLP whose target output = input Reconstruc7on=decoder(encoder(input)), input e.g. x code= latent features h encoder decoder reconstruc7on r(x) With bo?leneck,
More informationOverview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated
Fall 3 Computer Vision Overview of Statistical Tools Statistical Inference Haibin Ling Observation inference Decision Prior knowledge http://www.dabi.temple.edu/~hbling/teaching/3f_5543/index.html Bayesian
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2 Agenda Expectation-maximization
More informationHow to do backpropagation in a brain. Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto
1 How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto What is wrong with back-propagation? It requires labeled training data. (fixed) Almost
More informationBayesian Paragraph Vectors
Bayesian Paragraph Vectors Geng Ji 1, Robert Bamler 2, Erik B. Sudderth 1, and Stephan Mandt 2 1 Department of Computer Science, UC Irvine, {gji1, sudderth}@uci.edu 2 Disney Research, firstname.lastname@disneyresearch.com
More informationUndirected Graphical Models
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional
More informationClustering and Gaussian Mixtures
Clustering and Gaussian Mixtures Oliver Schulte - CMPT 883 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15 2 25 5 1 15 2 25 detected tures detected
More informationProbabilistic Graphical Models
10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem
More informationarxiv: v1 [stat.ml] 30 Mar 2016
A latent-observed dissimilarity measure arxiv:1603.09254v1 [stat.ml] 30 Mar 2016 Yasushi Terazono Abstract Quantitatively assessing relationships between latent variables and observed variables is important
More informationUsually the estimation of the partition function is intractable and it becomes exponentially hard when the complexity of the model increases. However,
Odyssey 2012 The Speaker and Language Recognition Workshop 25-28 June 2012, Singapore First attempt of Boltzmann Machines for Speaker Verification Mohammed Senoussaoui 1,2, Najim Dehak 3, Patrick Kenny
More informationUnsupervised Learning: K- Means & PCA
Unsupervised Learning: K- Means & PCA Unsupervised Learning Supervised learning used labeled data pairs (x, y) to learn a func>on f : X Y But, what if we don t have labels? No labels = unsupervised learning
More informationCS4495/6495 Introduction to Computer Vision. 8C-L3 Support Vector Machines
CS4495/6495 Introduction to Computer Vision 8C-L3 Support Vector Machines Discriminative classifiers Discriminative classifiers find a division (surface) in feature space that separates the classes Several
More informationSTA 414/2104: Lecture 8
STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable
More informationRecent Advances in Bayesian Inference Techniques
Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian
More informationStructured deep models: Deep learning on graphs and beyond
Structured deep models: Deep learning on graphs and beyond Hidden layer Hidden layer Input Output ReLU ReLU, 25 May 2018 CompBio Seminar, University of Cambridge In collaboration with Ethan Fetaya, Rianne
More informationThe Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models
Statistical NLP Spring 2009 The Noisy Channel Model Lecture 10: Acoustic Models Dan Klein UC Berkeley Search through space of all possible sentences. Pick the one that is most probable given the waveform.
More informationStatistical NLP Spring The Noisy Channel Model
Statistical NLP Spring 2009 Lecture 10: Acoustic Models Dan Klein UC Berkeley The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.
More informationBayesian Networks Inference with Probabilistic Graphical Models
4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationDeep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)
CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More information