Unsupervised Learning of Hierarchical Models, in collaboration with Josh Susskind and Vlad Mnih
1 Unsupervised Learning of Hierarchical Models Marc'Aurelio Ranzato Geoff Hinton in collaboration with Josh Susskind and Vlad Mnih Advanced Machine Learning, 9 March 2011
2 Example: facial expression recognition Unlabeled face images Few labeled face images happiness neutral anger fear sadness What is the expression? (unseen occluded images)
3 Example: facial expression recognition 1) training on unlabeled data: maximum likelihood layer 2 layer 1 Input image Ranzato, Susskind, Mnih, Hinton On deep generative models.. CVPR11
4 Example: facial expression recognition 1) training on unlabeled data: maximum likelihood 2) use the generative model to impute missing values Ranzato, Susskind, Mnih, Hinton On deep generative models.. CVPR11
5 Example: facial expression recognition 1) training on unlabeled data: maximum likelihood 2) use the generative model to impute missing values 3) use the latent representation as features 4) train a classifier on the labeled features Ranzato, Susskind, Mnih, Hinton On deep generative models.. CVPR11
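The four-step pipeline above can be sketched in a few lines. This is an illustrative stand-in, not the paper's code: PCA plays the role of the deep generative model, and reconstruction from the latent code plays the role of generative imputation; all function names here are hypothetical.

```python
import numpy as np

def fit_pca(X, k):
    # step 1 stand-in: "train" an unsupervised model on unlabeled data
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]                      # mean and k principal directions

def features(X, mu, W):
    # step 3: use the latent representation as features
    return (X - mu) @ W.T

def impute(x, mask, mu, W):
    # step 2: fill missing entries (mask == False) with the model's reconstruction
    h = (np.where(mask, x, mu) - mu) @ W.T
    recon = mu + h @ W
    return np.where(mask, x, recon)

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(100, 8))
mu, W = fit_pca(X_unlabeled, 3)
x = X_unlabeled[0].copy()
mask = np.ones(8, dtype=bool)
mask[2] = False                            # one "occluded" dimension
x_filled = impute(x, mask, mu, W)
feats = features(X_unlabeled, mu, W)       # step 4 would train a classifier on these
```

Observed entries pass through unchanged; only the occluded entry is replaced by the model's guess.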
6 The power of hierarchical representations MNIST digits 2nd layer sparse code 1st layer sparse code input image Receptive field of 2nd layer units Ranzato et al. Sparse feature learning for Deep Belief Nets NIPS07
8 The power of hierarchical representations MNIST digits 2nd layer sparse code 1st layer sparse code input image Receptive fields of 2nd layer units Hierarchical representations can discover complex dependencies! Ranzato et al. Sparse feature learning for Deep Belief Nets NIPS07
9 Unsupervised Learning - we will learn hierarchical representations in a greedy fashion using unlabeled data (little if any assumption on the input) - we can focus on single-layer unsupervised algorithms - how can we interpret unsupervised algorithms? - how can they learn representations? - how can we compare them?
10 Two Approaches to Unsupervised Learning 1st strategy: constrain latent representation & optimize score only at training samples - K-Means - sparse coding Ranzato et al. NIPS 06, Ranzato et al. CVPR 07, Ranzato et al. NIPS 07, Kavukcuoglu et al. CVPR 09 - use lower dimensional representations Ranzato et al. ICML 08
11 Two Approaches to Unsupervised Learning 1st strategy: constrain latent representation & optimize score only at training samples - K-Means - sparse coding - use lower dimensional representations 2nd strategy: optimize score for training samples while normalizing the score over the whole space (maximum likelihood) Ranzato et al. AISTATS 10, Ranzato et al. CVPR 10, Ranzato et al. NIPS 10, Ranzato et al. CVPR 11 p(x) x
12 Two Approaches to Unsupervised Learning. 1st strategy (constrain code): Pros - efficient learning; defined through an objective function (easy to monitor training). Cons - choice of functions; not robust to missing inputs; not good for generation. 2nd strategy (maximum likelihood): Pros - intuitive design; it can generate; compositionality. Cons - hard to train (variational inference, MCMC, etc.)
13 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
14 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
15 Means and Covariances: PPCA. The marginal $p(x)$ specifies the covariance across all data points. [diagram: input space vs. latent space, $p(x \mid h)$, a training sample and its latent vector]
16 Means and Covariances: PPCA. Given $h_t \sim p(h \mid x_t)$, the conditional $p(x \mid h_t)$ specifies a distribution for each data point. [diagram as above]
17 Means and Covariances: PPCA. $p(x) = \int_h p(x \mid h)\, p(h)\, dh$: the conditional $p(x \mid h)$ specifies a distribution for each data point. [diagram as above]
18 Conditional Distribution Over Input: $p(x \mid h) = N(\mathrm{mean}(h), D)$ - examples: PPCA, Factor Analysis, ICA, Gaussian. [diagram: input space vs. latent space, $p(x \mid h)$, training sample, latent vector]
19 Conditional Distribution Over Input: $p(x \mid h) = N(\mathrm{mean}(h), D)$ - a poor model of the dependency between input variables. [diagram as above]
20 Conditional Distribution Over Input: $p(x \mid h) = N(0, \mathrm{Covariance}(h))$ - examples: PoT, cRBM. [diagram as above] Welling et al. NIPS 2003, Ranzato et al. AISTATS 10
21 Conditional Distribution Over Input: $p(x \mid h) = N(0, \mathrm{Covariance}(h))$ - a poor model of the mean intensity. [diagram as above] Welling et al. NIPS 2003, Ranzato et al. AISTATS 10
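The contrast between the two families can be made concrete with a toy sketch (illustrative only, not the paper's code): a "mean" model such as FA/PPCA moves a Gaussian with a fixed covariance $D$ around as the latents change, while a "covariance" model such as PoT keeps the mean at zero and lets the latents reshape the covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.eye(2)                                   # fixed covariance for the mean model

def sample_mean_model(h, n):
    # p(x | h) = N(W h, D): latents set the mean, covariance is fixed
    W = np.eye(2)
    return rng.multivariate_normal(W @ h, D, size=n)

def sample_cov_model(h, n):
    # p(x | h) = N(0, Sigma(h)): latents gate the (here diagonal) covariance
    Sigma = np.diag(1.0 + h)
    return rng.multivariate_normal(np.zeros(2), Sigma, size=n)

h = np.array([3.0, 0.0])
xm = sample_mean_model(h, 20000)   # mean near (3, 0), unit variance
xc = sample_cov_model(h, 20000)    # mean near (0, 0), variance (4, 1)
```

Neither family alone captures both effects, which is the motivation for modeling the mean and covariance jointly in the next slides.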
22 input image
23 input image Which image has similar mean (but different covariance)? Which image has same covariance (but different mean)?
24 input image: same covariance but different mean - equivalent image according to PoT; similar mean but different covariance - equivalent image according to FA
25 Conditional Distribution Over Input: $p(x \mid h) = N(\mathrm{mean}(h), \mathrm{covariance}(h))$ - this is what we propose: mcRBM, mPoT. [diagram: input space vs. latent space] Ranzato et al. CVPR 10, Ranzato et al. NIPS 2010, Ranzato et al. CVPR 11
27 mcRBM: two sets of latent variables modulate the mean and the covariance of the conditional distribution over the input. Energy-based model: $p(x, h^m, h^c) \propto \exp(-E(x, h^m, h^c))$, with $x \in \mathbb{R}^D$, $h^c \in \{0,1\}^M$, $h^m \in \{0,1\}^N$. Ranzato Hinton CVPR 10
28 Easy generation, slow inference: energy $E = \frac{1}{2}(x - m)' \Sigma^{-1} (x - m)$ gives $p(x \mid h^m, h^c) = N(m(h^m), \Sigma(h^c))$.
29 Easy generation, slow inference: $E = \frac{1}{2}(x - m)' \Sigma^{-1} (x - m)$, i.e. $p(x \mid h^m, h^c) = N(m(h^m), \Sigma(h^c))$. Less easy generation, fast inference: $E = \frac{1}{2} x' \Sigma^{-1} x - x' m$, i.e. $p(x \mid h^m, h^c) = N(\Sigma\, m(h^m), \Sigma)$.
30 Geometric interpretation If we multiply them, we get...
31 Geometric interpretation: if we multiply them, we get... a sharper and shorter Gaussian $p(x \mid h^m, h^c)$
32 Covariance part of the energy function: $E(x, h^c, h^m) = \frac{1}{2} x' \Sigma^{-1} x$, with $x \in \mathbb{R}^D$, $\Sigma^{-1} \in \mathbb{R}^{D \times D}$ - a pair-wise MRF over pixel pairs $(x_p, x_q)$. Ranzato Hinton CVPR 10
33 Covariance part of the energy function, factorized: $E(x, h^c, h^m) = \frac{1}{2} x' C C' x$, with $C \in \mathbb{R}^{D \times F}$ - pair-wise MRF. Ranzato Hinton CVPR 10
34 Covariance part of the energy function, factorization + hiddens: $E(x, h^c, h^m) = \frac{1}{2} x' C\,\mathrm{diag}(h^c)\, C' x$, with $h^c \in \{0,1\}^F$ - gated MRF. Ranzato Hinton CVPR 10
35 Covariance part of the energy function, factorization + pooled hiddens: $E(x, h^c, h^m) = \frac{1}{2} x' C\,\mathrm{diag}(P h^c)\, C' x$, with $P \in \mathbb{R}^{F \times M}$, $h^c \in \{0,1\}^M$ - gated MRF. Ranzato Hinton CVPR 10
36 Overall energy function: $E(x, h^c, h^m) = \frac{1}{2} x' C\,\mathrm{diag}(P h^c)\, C' x + \frac{1}{2} x' x - x' W h^m$ (covariance part + mean part), with $W \in \mathbb{R}^{D \times N}$, $h^m \in \{0,1\}^N$ - gated MRF. Ranzato Hinton CVPR 10
37 Overall energy function, generation: $p(x \mid h^c, h^m) = N(\Sigma W h^m, \Sigma)$ with $\Sigma^{-1} = C\,\mathrm{diag}(P h^c)\, C' + I$. Ranzato Hinton CVPR 10
38 Overall energy function, inference: $p(h^c_k = 1 \mid x) = \sigma\!\big(-\frac{1}{2} P_k' (C' x)^2 + b^c_k\big)$ and $p(h^m_j = 1 \mid x) = \sigma(W_j' x + b^m_j)$. Ranzato Hinton CVPR 10
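The two inference equations are cheap, closed-form sigmoids. A minimal sketch, with random weights standing in for learned ones (the shapes follow the slides: $x \in \mathbb{R}^D$, filters $C \in \mathbb{R}^{D \times F}$, non-negative pooling $P \in \mathbb{R}^{F \times M}$, mean weights $W \in \mathbb{R}^{D \times N}$; the minus sign in the covariance-unit activation is the convention implied by $p \propto e^{-E}$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infer(x, C, P, W, b_c, b_m):
    factor = (C.T @ x) ** 2                       # squared filter responses
    p_hc = sigmoid(-0.5 * (P.T @ factor) + b_c)   # covariance hiddens h^c
    p_hm = sigmoid(W.T @ x + b_m)                 # mean hiddens h^m
    return p_hc, p_hm

rng = np.random.default_rng(0)
D, F, M, N = 16, 32, 8, 8
x = rng.normal(size=D)
C = rng.normal(size=(D, F))
P = rng.random((F, M))                            # non-negative pooling matrix
W = rng.normal(size=(D, N))
p_hc, p_hm = infer(x, C, P, W, np.zeros(M), np.zeros(N))
```

With zero biases the covariance units can only be suppressed by strong filter responses, which is the gating behaviour the model relies on.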
39 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
40 Learning: maximum likelihood, $p(x) = \dfrac{\sum_{h^m, h^c} e^{-E(x, h^m, h^c)}}{\int_x \sum_{h^m, h^c} e^{-E(x, h^m, h^c)}}$ - Fast Persistent Contrastive Divergence - Hybrid Monte Carlo to draw samples. $E = \frac{1}{2} x' C\,\mathrm{diag}(P h^c)\, C' x + \frac{1}{2} x' x - x' W h^m + \dots$ [diagram: gated MRF with $h^c$, $P$, $C$, $W$, $h^m$]
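Hybrid (Hamiltonian) Monte Carlo can be sketched on a toy quadratic energy; this is a generic HMC step, not the paper's implementation, which runs HMC on the gated MRF's free energy with persistent chains:

```python
import numpy as np

def hmc_step(x, energy, grad, rng, eps=0.1, n_leapfrog=20):
    # One HMC transition for p(x) ∝ exp(-E(x)): sample a momentum, simulate
    # leapfrog dynamics, then Metropolis accept/reject on the Hamiltonian.
    p = rng.normal(size=x.shape)
    x_new, p_new = x.copy(), p.copy()
    p_new -= 0.5 * eps * grad(x_new)          # half momentum step
    for _ in range(n_leapfrog - 1):
        x_new += eps * p_new                  # full position step
        p_new -= eps * grad(x_new)            # full momentum step
    x_new += eps * p_new
    p_new -= 0.5 * eps * grad(x_new)          # final half momentum step
    h_old = energy(x) + 0.5 * p @ p
    h_new = energy(x_new) + 0.5 * p_new @ p_new
    return x_new if rng.random() < np.exp(h_old - h_new) else x

rng = np.random.default_rng(0)
E = lambda x: 0.5 * x @ x                     # toy energy: standard Gaussian
dE = lambda x: x
x = np.zeros(2)
samples = []
for _ in range(2000):
    x = hmc_step(x, E, dE, rng)
    samples.append(x.copy())
S = np.array(samples[500:])                   # discard burn-in
```

For this quadratic energy the chain should recover a standard Gaussian: mean near zero, unit variance.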
41-47 Learning: free energy $F = -\log p(x) + K$. The weight update follows $\partial F / \partial w$, lowering $F$ at the training sample, while HMC dynamics explore the surface to find samples. [animation over slides 41-47: $F(w)$ surface with the training sample and the HMC trajectory]
48 Learning: $F = -\log p(x) + K$. Weight update: $\Delta w \propto -\big( \partial F / \partial w \,\big|_{\text{data}} - \partial F / \partial w \,\big|_{\text{sample}} \big)$. [diagram: $F(w)$ with the data point and the sample point]
50 Learning: start the dynamics from the previous point (a persistent chain). $F = -\log p(x) + K$; the update compares $\partial F / \partial w$ at the data point and at the sample point.
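The "start from the previous point" idea is persistent contrastive divergence. A minimal sketch on a plain binary RBM (a simplification of the gated MRF, and all constants here are illustrative): the weight update is exactly the data-term minus sample-term difference from the slide, with the fantasy particles carried over between updates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, H, B = 6, 4, 16
W = 0.01 * rng.normal(size=(D, H))
W0 = W.copy()
X = (rng.random((64, D)) < 0.5).astype(float)        # toy binary "data"
chain = (rng.random((B, D)) < 0.5).astype(float)     # persistent fantasy particles

for _ in range(200):
    v = X[rng.integers(0, 64, B)]                    # minibatch of data
    h_data = sigmoid(v @ W)                          # positive phase
    # one Gibbs step on the persistent chain (starts from the previous point)
    h = (rng.random((B, H)) < sigmoid(chain @ W)).astype(float)
    chain = (rng.random((B, D)) < sigmoid(h @ W.T)).astype(float)
    h_model = sigmoid(chain @ W)                     # negative phase
    # Δw ∝ (data statistics) - (sample statistics)
    W += 0.01 * (v.T @ h_data - chain.T @ h_model) / B
```

Biases and a decaying "fast" weight copy (the F in FPCD) are omitted to keep the sketch short.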
51 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
52 Speech Recognition on TIMIT INPUT - frame: 25ms Hamming window - 10ms overlap between frames - 39 log-filterbank outputs + energy per frame - 15 frames - PCA whitening 384 components - no deltas, no delta-deltas... <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
54 Speech Recognition on TIMIT gated MRF with 1024 precision and 512 mean units trained on independent windows gmrf <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
56 Speech Recognition on TIMIT - For each training sample, infer latent variables of gated MRF - Train 2nd layer binary-binary (2048 units) gmrf <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
57 Speech Recognition on TIMIT - Train 1st layer gated MRF - Train 2nd layer binary-binary (2048 units) - Train 3rd layer binary-binary (2048 units) gmrf <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
58 Speech Recognition on TIMIT 8 layer (deep) model... gmrf <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
59 Speech Recognition on TIMIT 8 layer (deep) model Supervised training: predict states of aligned HMM... gmrf <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
60 Speech Recognition on TIMIT 8 layer (deep) model Test: predict states of HMM decoding (bigram language model)... gmrf <-235 ms-> Dahl, Ranzato, Mohamed, Hinton, NIPS 2010
61 Speech Recognition on TIMIT
METHOD / PER
CRF: 34.8%
Large-Margin GMM: 33.0%
CD-HMM: 27.3%
Augmented CRF: 26.6%
RNN: 26.1%
Bayesian Triphone HMM: 25.6%
Triphone HMM discrim. trained: 22.7%
DBN with gated MRF: 20.5%
63 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
64 Generation of natural image patches. [panels: natural images; mcRBM samples (Ranzato and Hinton CVPR 2010); samples from the models of Osindero and Hinton NIPS 2008]
65 From Patches to High-Resolution Images Training by picking patches at random
66 From Patches to High-Resolution Images. But we could also take them from a grid. This is not a good way to extend the model to big images: block artifacts.
67 From Patches to High-Resolution Images IDEA: have one subset of filters applied to these locations,
68 From Patches to High-Resolution Images IDEA: have one subset of filters applied to these locations, another subset to these locations
69 From Patches to High-Resolution Images IDEA: have one subset of filters applied to these locations, another subset to these locations, etc. Train jointly all parameters. Gregor LeCun arxiv 2010 Ranzato, Mnih, Hinton NIPS 2010 No block artifacts Reduced redundancy
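The tiling idea above can be sketched as follows (an illustrative toy, not the paper's code; all sizes are made up): patches are taken on a stride grid, and which filter set a patch sees depends on its position modulo the tile size, so neighbouring patches use different filters and no single grid imposes block boundaries.

```python
import numpy as np

def tiled_responses(img, filters, patch=4, stride=2, tile=2):
    # filters: shape (tile*tile, n_filt, patch*patch); one filter set per
    # grid-position class. All parameters are trained jointly in the paper.
    H, W = img.shape
    out = {}
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            # position class decides which filter subset applies here
            set_idx = ((i // stride) % tile) * tile + (j // stride) % tile
            p = img[i:i + patch, j:j + patch].ravel()
            out[(i, j)] = filters[set_idx] @ p
    return out

rng = np.random.default_rng(0)
img = rng.normal(size=(12, 12))
filters = rng.normal(size=(4, 3, 16))   # 2x2 tile of filter sets, 3 filters each
resp = tiled_responses(img, filters)
```

With a 12x12 image, 4x4 patches and stride 2, this visits a 5x5 grid of patch locations, cycling through the four filter sets.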
70 Sampling High-Resolution Images Gaussian model from Simoncelli 2005 marginal wavelet
71 Sampling High-Resolution Images Gaussian model marginal wavelet from Simoncelli 2005 Pair-wise MRF FoE from Schmidt, Gao, Roth CVPR 2010
72 Sampling High-Resolution Images Gaussian model marginal wavelet Mean Covariance Model from Simoncelli 2005 Pair-wise MRF FoE Ranzato, Mnih, Hinton NIPS 2010 from Schmidt, Gao, Roth CVPR 2010
75 Making the model... DEEPER. Treat these units as data to train a similar model on top. SECOND STAGE: a field of binary units. Each hidden unit has a receptive field of 30x30 pixels in input space.
76 Sampling from the DEEPER model - sample from the 2nd layer - project the sample into image space through the 1st layer $p(x \mid h)$. Second-layer energy: $E(h^1, h^2) = -h^1{}' W h^2$, with $p(h^1_j = 1 \mid h^2) = \sigma(W_j h^2 + b^1_j)$ and $p(h^2_k = 1 \mid h^1) = \sigma(W_k' h^1 + b^2_k)$.
77 Samples from Deep Generative Model 1st stage model 3rd stage model
81 Sampling High-Resolution Images: Gaussian model (from Simoncelli 2005); marginal wavelet (from Simoncelli 2005); FoE (from Schmidt et al. CVPR 2010); Deep - 1 layer (Ranzato, Mnih, Hinton NIPS 2010); Deep - 3 layers (Ranzato et al. CVPR 2011)
82 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
83 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions anger Ranzato, et al. CVPR 2011
84 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions disgust
85 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions fear
86 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions happiness
87 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions neutral
88 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions sadness
89 Facial Expression Recognition Toronto Face Dataset (J. Susskind et al. 2010) ~ 100K unlabeled faces from different sources ~ 4K labeled images Resolution: 48x48 pixels 7 facial expressions surprise
90 Facial Expression Recognition - 1st layer using local (not shared) connectivity - layers above are fully connected - 5 layers in total - Results: - Linear classifier on raw pixels: 71.5% - Gaussian RBF SVM on raw pixels: 76.2% - Gabor + PCA + linear classifier: 80.1% (Dailey et al., J. Cog. Science) - Sparse coding: 74.6% (Wright et al., PAMI) - DEEP model (3 layers): 82.5%
91 Facial Expression Recognition - We can draw samples from the model (5th layer with 128 hiddens)... 5th layer... gmrf 1st layer
92 Facial Expression Recognition - We can draw samples from the model (5th layer with 128 hiddens)
93 Facial Expression Recognition - We can draw samples from the model (5th layer with 128 hiddens) gmrf gmrf... 5th layer
96 Facial Expression Recognition - 7 synthetic occlusions - use the generative model to fill in the missing pixels (conditional on the known pixels) gmrf gmrf gmrf gmrf gmrf...
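Conditioning on the known pixels can be illustrated with a plain joint Gaussian instead of the deep model (an illustrative stand-in, not the paper's method): the occluded pixels are replaced by their conditional mean given the observed ones, $x_m \mid x_o \sim N\big(\mu_m + \Sigma_{mo} \Sigma_{oo}^{-1}(x_o - \mu_o),\ \cdot\big)$.

```python
import numpy as np

def fill_in(x, observed, mu, Sigma):
    # Replace unobserved entries by the Gaussian conditional mean given
    # the observed entries; observed entries are left untouched.
    o = np.where(observed)[0]
    m = np.where(~observed)[0]
    Soo = Sigma[np.ix_(o, o)]
    Smo = Sigma[np.ix_(m, o)]
    x_filled = x.copy()
    x_filled[m] = mu[m] + Smo @ np.linalg.solve(Soo, x[o] - mu[o])
    return x_filled

# two strongly correlated "pixels"; the second one is occluded
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
x = np.array([1.0, 0.0])
observed = np.array([True, False])
x_filled = fill_in(x, observed, mu, Sigma)   # second entry becomes 0.9 * 1.0
```

The deep model plays the same role but with a far richer conditional than a single Gaussian, which is why it can restore whole occluded face regions.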
97 Facial Expression Recognition originals Type 1 occlusion: eyes Restored images
98 Facial Expression Recognition originals Type 2 occlusion: mouth Restored images
99 Facial Expression Recognition originals Type 3 occlusion: right half Restored images
100 Facial Expression Recognition originals Type 4 occlusion: bottom half Restored images
101 Facial Expression Recognition originals Type 5 occlusion: top half Restored images
102 Facial Expression Recognition originals Type 6 occlusion: nose Restored images
103 Facial Expression Recognition originals Type 7 occlusion: 70% of pixels at random Restored images
104 Facial Expression Recognition Original Input
105 Facial Expression Recognition Original Input 1st layer gmrf gmrf
106 Facial Expression Recognition Original Input 1st layer 2nd layer gmrf gmrf
107 Facial Expression Recognition Original Input 1st layer 2nd layer 3rd layer gmrf gmrf
108 Facial Expression Recognition Original Input 1st layer 2nd layer 3rd layer 4th layer gmrf gmrf
109 Facial Expression Recognition Original Input 1st layer 2nd layer 3rd layer gmrf gmrf gmrf gmrf gmrf 4th layer times
110 Facial Expression Recognition CASE 1: original images for training, occluded for test Dailey, et al. J. Cog. Neuros Wright, et al. PAMI 2008 Ranzato, et al. CVPR 2011
111 Facial Expression Recognition CASE 2: occluded images for both training and test Dailey, et al. J. Cog. Neuros Wright, et al. PAMI 2008 Ranzato, et al. CVPR 2011
112 Outline - mathematical formulation of the model - training - learning acoustic features for speech recognition - generation of natural images - recognition of facial expression under occlusion - conclusion
113 Summary - Unsupervised Learning: regularize & learn representations - Deep Generative Model - 1st layer: gated MRF - Higher layers: binary units - fast inference - Realistic generation: natural images - Applications: - facial expression recognition robust to occlusion - speech recognition
114 THANK YOU
115 Learned Filters [figure: covariance filters C with pooling P, and mean filters W]
117 Using -Energy to Score Images: test images ordered from more likely to less likely. [figure]
118 Using Energy to Score Images: upside-down images, ordered by energy. [figure]
119 Using Energy to Score Images: average of those images for which the difference of energy is higher. [figure]
Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning Diederik (Durk) Kingma Danilo J. Rezende (*) Max Welling Shakir Mohamed (**) Stochastic Gradient Variational Inference Bayesian
More informationDeep Belief Networks are compact universal approximators
1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation
More informationLatent Variable Models and Expectation Maximization
Latent Variable Models and Expectation Maximization Oliver Schulte - CMPT 726 Bishop PRML Ch. 9 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15
More informationarxiv: v2 [cs.ne] 22 Feb 2013
Sparse Penalty in Deep Belief Networks: Using the Mixed Norm Constraint arxiv:1301.3533v2 [cs.ne] 22 Feb 2013 Xanadu C. Halkias DYNI, LSIS, Universitè du Sud, Avenue de l Université - BP20132, 83957 LA
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures
More informationSpeaker Representation and Verification Part II. by Vasileios Vasilakakis
Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation
More informationTensor Methods for Feature Learning
Tensor Methods for Feature Learning Anima Anandkumar U.C. Irvine Feature Learning For Efficient Classification Find good transformations of input for improved classification Figures used attributed to
More informationExpectation Maximization (EM)
Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted
More informationAn efficient way to learn deep generative models
An efficient way to learn deep generative models Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of Toronto Joint work with: Ruslan Salakhutdinov, Yee-Whye
More informationHierarchical Sparse Bayesian Learning. Pierre Garrigues UC Berkeley
Hierarchical Sparse Bayesian Learning Pierre Garrigues UC Berkeley Outline Motivation Description of the model Inference Learning Results Motivation Assumption: sensory systems are adapted to the statistical
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a
More informationLatent Variable Models
Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 5 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 1 / 31 Recap of last lecture 1 Autoregressive models:
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 8 Continuous Latent Variable
More informationDeep Learning Srihari. Deep Belief Nets. Sargur N. Srihari
Deep Belief Nets Sargur N. Srihari srihari@cedar.buffalo.edu Topics 1. Boltzmann machines 2. Restricted Boltzmann machines 3. Deep Belief Networks 4. Deep Boltzmann machines 5. Boltzmann machines for continuous
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationLatent Variable Models and Expectation Maximization
Latent Variable Models and Expectation Maximization Oliver Schulte - CMPT 726 Bishop PRML Ch. 9 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15
More informationStochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence
ESANN 0 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 7-9 April 0, idoc.com publ., ISBN 97-7707-. Stochastic Gradient
More informationUsing Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes Ruslan Salakhutdinov and Geoffrey Hinton Department of Computer Science, University of Toronto 6 King s College Rd, M5S 3G4, Canada
More informationLearning to Disentangle Factors of Variation with Manifold Learning
Learning to Disentangle Factors of Variation with Manifold Learning Scott Reed Kihyuk Sohn Yuting Zhang Honglak Lee University of Michigan, Department of Electrical Engineering and Computer Science 08
More informationSemi-Supervised Learning
Semi-Supervised Learning getting more for less in natural language processing and beyond Xiaojin (Jerry) Zhu School of Computer Science Carnegie Mellon University 1 Semi-supervised Learning many human
More informationApproximate inference in Energy-Based Models
CSC 2535: 2013 Lecture 3b Approximate inference in Energy-Based Models Geoffrey Hinton Two types of density model Stochastic generative model using directed acyclic graph (e.g. Bayes Net) Energy-based
More informationFeature Noising. Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager
Feature Noising Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager Outline Part 0: Some backgrounds Part 1: Dropout as adaptive
More informationMachine Learning, Midterm Exam
10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have
More informationWhy DNN Works for Acoustic Modeling in Speech Recognition?
Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang Department of Computer Science and Engineering York University, Toronto, Ont. M3J 1P3, CANADA Joint work with Y. Bao, J. Pan,
More informationNormalization Techniques in Training of Deep Neural Networks
Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,
More informationAuto-Encoders & Variants
Auto-Encoders & Variants 113 Auto-Encoders MLP whose target output = input Reconstruc7on=decoder(encoder(input)), input e.g. x code= latent features h encoder decoder reconstruc7on r(x) With bo?leneck,
More informationOverview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated
Fall 3 Computer Vision Overview of Statistical Tools Statistical Inference Haibin Ling Observation inference Decision Prior knowledge http://www.dabi.temple.edu/~hbling/teaching/3f_5543/index.html Bayesian
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2 Agenda Expectation-maximization
More informationHow to do backpropagation in a brain. Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto
1 How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto What is wrong with back-propagation? It requires labeled training data. (fixed) Almost
More informationBayesian Paragraph Vectors
Bayesian Paragraph Vectors Geng Ji 1, Robert Bamler 2, Erik B. Sudderth 1, and Stephan Mandt 2 1 Department of Computer Science, UC Irvine, {gji1, sudderth}@uci.edu 2 Disney Research, firstname.lastname@disneyresearch.com
More informationUndirected Graphical Models
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional
More informationClustering and Gaussian Mixtures
Clustering and Gaussian Mixtures Oliver Schulte - CMPT 883 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15 2 25 5 1 15 2 25 detected tures detected
More informationProbabilistic Graphical Models
10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem
More informationarxiv: v1 [stat.ml] 30 Mar 2016
A latent-observed dissimilarity measure arxiv:1603.09254v1 [stat.ml] 30 Mar 2016 Yasushi Terazono Abstract Quantitatively assessing relationships between latent variables and observed variables is important
More informationUsually the estimation of the partition function is intractable and it becomes exponentially hard when the complexity of the model increases. However,
Odyssey 2012 The Speaker and Language Recognition Workshop 25-28 June 2012, Singapore First attempt of Boltzmann Machines for Speaker Verification Mohammed Senoussaoui 1,2, Najim Dehak 3, Patrick Kenny
More informationUnsupervised Learning: K- Means & PCA
Unsupervised Learning: K- Means & PCA Unsupervised Learning Supervised learning used labeled data pairs (x, y) to learn a func>on f : X Y But, what if we don t have labels? No labels = unsupervised learning
More informationCS4495/6495 Introduction to Computer Vision. 8C-L3 Support Vector Machines
CS4495/6495 Introduction to Computer Vision 8C-L3 Support Vector Machines Discriminative classifiers Discriminative classifiers find a division (surface) in feature space that separates the classes Several
More informationSTA 414/2104: Lecture 8
STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable
More informationRecent Advances in Bayesian Inference Techniques
Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian
More informationStructured deep models: Deep learning on graphs and beyond
Structured deep models: Deep learning on graphs and beyond Hidden layer Hidden layer Input Output ReLU ReLU, 25 May 2018 CompBio Seminar, University of Cambridge In collaboration with Ethan Fetaya, Rianne
More informationThe Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models
Statistical NLP Spring 2009 The Noisy Channel Model Lecture 10: Acoustic Models Dan Klein UC Berkeley Search through space of all possible sentences. Pick the one that is most probable given the waveform.
More informationStatistical NLP Spring The Noisy Channel Model
Statistical NLP Spring 2009 Lecture 10: Acoustic Models Dan Klein UC Berkeley The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.
More informationBayesian Networks Inference with Probabilistic Graphical Models
4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationDeep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)
CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More information