TUTORIAL PART 1 Unsupervised Learning


1 TUTORIAL PART 1 Unsupervised Learning Marc'Aurelio Ranzato Department of Computer Science Univ. of Toronto ranzato@cs.toronto.edu Co-organizers: Honglak Lee, Yoshua Bengio, Geoff Hinton, Yann LeCun, Andrew Ng Deep Learning and Unsupervised Feature Learning Workshop, 10 Dec. 2010

2 Feature Learning
[Figure: input images (motorbikes vs. non-motorbikes) plotted in input space (axes: color, brightness) and passed directly to the learning algorithm. Borrowed from Andrew Ng, ECCV10]

3 Feature Learning
[Figure: input -> feature extractor -> learning algorithm. Motorbikes vs. non-motorbikes are shown in input space (axes: color, brightness) and again after feature extraction (axes: wheel, handle). Borrowed from Andrew Ng, ECCV10]

4 How is computer perception done?
Object detection: Image -> Low-level vision features -> Recognition
Audio classification: Audio -> Low-level audio features -> Speaker identification
Helicopter control: Helicopter -> Low-level state features -> Action

5 Computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH (borrowed from Andrew Ng, ECCV10)

6 Audio features: MFCC, Spectrogram, Flux, ZCR, Rolloff (borrowed from Andrew Ng, ECCV10)

7 Engineering features: needs expert knowledge; sub-optimal; time-consuming and expensive; does not generalize to other domains

8 The goal of Unsupervised Feature Learning
Unlabeled images -> Learning algorithm -> Feature representation (borrowed from Andrew Ng, ECCV10)

9 Outline
What is Unsupervised Learning?
Unsupervised Learning Algorithms
Comparing Unsupervised Learning Algorithms
(Tutorial II) Deep Learning

10 Unsupervised Learning Data points belonging to 3 classes

11 Unsupervised Learning No labels are provided during training

12 Unsupervised Learning Fit mixture of 3 Gaussians: use responsibility to represent a data point (indicative of its class)
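
As an illustration of using responsibilities as a representation, here is a minimal NumPy sketch. It assumes the mixture parameters have already been fitted (for example by EM) and, for simplicity, that each Gaussian is isotropic; the names `responsibilities`, `means`, `variances`, and `weights` are illustrative and not taken from the slides.

```python
import numpy as np

def responsibilities(X, means, variances, weights):
    """Posterior p(component k | x) under a mixture of isotropic Gaussians.

    X: (n, d) data, means: (k, d), variances: (k,), weights: (k,).
    """
    d = X.shape[1]
    # Squared distance from every point to every component mean: (n, k)
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_pdf = -0.5 * sq_dist / variances - 0.5 * d * np.log(2 * np.pi * variances)
    log_joint = np.log(weights) + log_pdf
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    r = np.exp(log_joint)
    return r / r.sum(axis=1, keepdims=True)

# Hypothetical, already-fitted mixture of 3 Gaussians in 2D
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
variances = np.ones(3)
weights = np.full(3, 1.0 / 3.0)
X = np.array([[0.1, -0.2], [4.8, 5.1], [0.2, 4.9]])

R = responsibilities(X, means, variances, weights)  # one 3-dim feature per point
```

Each row of `R` is a 3-dimensional feature vector that sums to one; its largest entry indicates the most likely cluster for that point.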

13 Unsupervised Learning
Density estimation
Latent variables (features, possibly useful for discrimination)

14 Unsupervised Learning
Density estimation
Latent variables (features, possibly useful for discrimination)
Energy-based interpretation: each data point x has an associated energy E(x); training has to make E(x) lower for x in the training set
[Figure: energy E(x) plotted against x]

15 Unsupervised Learning
Density estimation
Latent variables (features, possibly useful for discrimination)
Energy-based interpretation: each data point x has an associated energy E(x); training has to make E(x) lower for x in the training set
[Figure: energy E(x) plotted against x, BEFORE TRAINING]

16 Unsupervised Learning
Density estimation
Latent variables (features, possibly useful for discrimination)
Energy-based interpretation: each data point x has an associated energy E(x); training has to make E(x) lower for x in the training set
[Figure: energy E(x) plotted against x, AFTER TRAINING]

17 Principal Component Analysis
$E(X, Z; W) = \|X - WZ\|^2$
Feature: $Z = W'X$, which must be lower dimensional
Training: minimize $E$ subject to an orthogonality constraint
[Figure: input data points and their reconstructions (1D feature space)]


19 Principal Component Analysis
$E(X, Z; W) = \|X - WZ\|^2$
Feature: $Z = W'X$
PROS: simple training (tuning free); unique solution; fast
CONS: feature must be lower dimensional; features are linear
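
The PCA energy $\|X - WZ\|^2$ with feature $Z = W'X$ can be sketched in a few lines of NumPy. This is an illustrative sketch only: it stores data points as rows, centers the data before projecting (centering is an assumption not stated on the slide), and obtains an orthonormal $W$ from the SVD. The function names are hypothetical.

```python
import numpy as np

def pca_fit(X, k):
    """X: (n, d), one data point per row. Returns the mean and W of shape (d, k)."""
    mean = X.mean(axis=0)
    # Rows of Vt are orthonormal principal directions, sorted by variance
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def pca_energy(X, mean, W):
    """Per-point reconstruction energy ||x - W z||^2 with z = W'x."""
    Xc = X - mean
    Z = Xc @ W        # features (n, k)
    recon = Z @ W.T   # reconstructions (n, d)
    return ((Xc - recon) ** 2).sum(axis=1)

# Toy 2D data lying close to a line, reduced to a 1D feature space
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t]) + 0.05 * rng.normal(size=(200, 2))

mean, W = pca_fit(X, k=1)
E = pca_energy(X, mean, W)
```

Points near the fitted line get low energy, while points far from it would get high energy, which matches the energy-based view used throughout the tutorial.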

20 Auto-encoder Neural Network
$E(X, Z; W) = \|X - f(WZ)\|^2$
Feature: $Z = g(AX)$, lower dimensional
Training: minimize $E$
[Figure: input data points and their reconstructions (1D feature space)]

21 Auto-encoder Neural Network
$E(X, Z; W) = \|X - f(WZ)\|^2$
Feature: $Z = g(AX)$, lower dimensional
PROS: non-linear features; pretty fast training
CONS: feature must be lower dimensional; a few hyper-parameters; optimization becomes hard if highly non-linear

22 Denoising Auto-encoder
$E(X, Z; W) = \|X - f(WZ)\|^2$
Feature: $Z = g(A(X + n))$, where $n$ is noise
Training: minimize $E$
[Figure: input data points and their reconstructions (1D feature space)]

23 Denoising Auto-encoder
$E(X, Z; W) = \|X - f(WZ)\|^2$
Feature: $Z = g(A(X + n))$
PROS: non-linear features; pretty fast training; robustness to noise in the input; possibly higher dimensional features; can check convergence
CONS: a few hyper-parameters; optimization becomes hard if highly non-linear; choice of noise distribution
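
A denoising auto-encoder of this form can be sketched with plain NumPy and hand-written gradients. Everything specific below is an assumption made for illustration: additive Gaussian noise, a sigmoid encoder $g$, an identity decoder $f$, no bias terms, full-batch gradient descent, and the particular learning rate and noise level.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def reconstruction_energy(X, A, W):
    """Mean of ||x - W g(A x)||^2 over the rows of X (no corruption)."""
    Z = sigmoid(X @ A.T)
    return np.mean(np.sum((X - Z @ W.T) ** 2, axis=1))

def train_denoising_autoencoder(X, k, noise_std=0.1, lr=0.1, n_steps=500, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = 0.1 * rng.normal(size=(k, d))  # encoder weights
    W = 0.1 * rng.normal(size=(d, k))  # decoder weights
    A0, W0 = A.copy(), W.copy()        # kept only to compare before/after
    for _ in range(n_steps):
        Xn = X + noise_std * rng.normal(size=X.shape)  # corrupted input X + n
        Z = sigmoid(Xn @ A.T)                          # feature Z = g(A(X + n))
        R = Z @ W.T                                    # reconstruction W Z
        dR = 2.0 * (R - X) / n                         # target is the clean X
        dW = dR.T @ Z                                  # gradient w.r.t. decoder
        dPre = (dR @ W) * Z * (1.0 - Z)                # back through the sigmoid
        dA = dPre.T @ Xn                               # gradient w.r.t. encoder
        W -= lr * dW
        A -= lr * dA
    return (A0, W0), (A, W)

rng = np.random.default_rng(1)
X = 0.5 * rng.normal(size=(256, 4))
(A0, W0), (A, W) = train_denoising_autoencoder(X, k=8)
before = reconstruction_energy(X, A0, W0)
after = reconstruction_energy(X, A, W)
```

Note that the code has 8 hidden units for 4 inputs: because the network must undo the corruption, the feature is not forced to be lower dimensional than the input.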

24 K-Means
$E(X, Z; W) = \|X - WZ\|^2$
Feature: $Z$ is a 1-of-N code
Training: minimize $E$
[Figure: input data points and their reconstructions (2 prototypes)]

25 K-Means
$E(X, Z; W) = \|X - WZ\|^2$
Feature: $Z$ is a 1-of-N code
PROS: simple training (tuning free); fast
CONS: one might need lots of prototypes to cover a high-dimensional space; representation is too sparse
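
K-Means fits the same energy $\|X - WZ\|^2$ once $Z$ is restricted to a 1-of-N code. The sketch below uses the standard alternating assignment and mean-update steps (Lloyd's algorithm). Data points are stored as rows, so the prototypes are the rows of `W`; the function name and the toy data are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Returns prototypes W (k, d) and 1-of-N codes Z (n, k)."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes with k distinct data points
    W = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)          # minimize E over Z
        for j in range(k):                  # minimize E over W
            members = X[assign == j]
            if len(members) > 0:
                W[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    Z = np.eye(k)[d2.argmin(axis=1)]        # 1-of-N code per data point
    return W, Z

# Two well-separated toy clusters, 2 prototypes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(10.0, 0.5, size=(100, 2))])
W, Z = kmeans(X, k=2)
energy = ((X - Z @ W) ** 2).sum(axis=1)     # ||x - W z||^2 per point
```

Each row of `Z` has a single non-zero entry, which is the sense in which the slide calls the representation too sparse.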

26 Sparse Coding
$E(X, Z; W) = \|X - WZ\|^2 + \lambda \|Z\|_1$
Feature: $Z$ is sparse
Training: minimize $E$ (coordinate descent)
[Figure: input data points and their reconstructions (2 components)]

27 Sparse Coding
$E(X, Z; W) = \|X - WZ\|^2 + \lambda \|Z\|_1$
Feature: $Z$ is sparse
PROS: possibly higher-dimensional features; often yields more interpretable features; biologically plausible (?)
CONS: expensive training; expensive inference; need to tune
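
To make the cost of sparse coding inference concrete, the sketch below minimizes the energy over $Z$ for a fixed dictionary $W$. It uses ISTA (iterative soft thresholding) in place of the coordinate descent mentioned on the slide, because it is short to write down. The name `lam` for the weight of the L1 penalty, the random dictionary, and the iteration count are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def energy(x, W, z, lam):
    """||x - W z||^2 + lam * |z|_1 for a single data point."""
    return np.sum((x - W @ z) ** 2) + lam * np.sum(np.abs(z))

def ista(x, W, lam, n_iter=200):
    """Infer a sparse code z for x with the dictionary W held fixed."""
    L = 2.0 * np.linalg.norm(W, 2) ** 2   # Lipschitz constant of the quadratic term
    z = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * W.T @ (W @ z - x)    # gradient of ||x - W z||^2
        z = soft_threshold(z - grad / L, lam / L)
    return z

# Overcomplete toy dictionary: 16 unit-norm atoms in 8 dimensions
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
W /= np.linalg.norm(W, axis=0)
x = 2.0 * W[:, 3]                          # signal built from a single atom

z = ista(x, W, lam=0.1)
e_start = energy(x, W, np.zeros(16), lam=0.1)
e_end = energy(x, W, z, lam=0.1)
```

Every new input requires running this iterative loop, which is the "expensive inference" listed under CONS; Predictive Sparse Coding addresses this by training a feed-forward predictor of $Z$.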

28 Predictive Sparse Coding
$E(X, Z; W) = \|X - WZ\|^2 + \lambda \|Z\|_1 + \|Z - g(A'X)\|^2$
Feature: $Z$ is sparse
PROS: possibly higher-dimensional features; fast inference
CONS: expensive training; need to tune

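29 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
All variables are binary: $X_i \in \{0,1\}$, $Z_j \in \{0,1\}$
$E = -w_{11} X_1 Z_1 - w_{12} X_1 Z_2 - w_{21} X_2 Z_1 - \dots$
[Figure: bipartite graph with hidden units Z1, Z2 connected through W to visible units X1, X2, X3]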

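30 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
$p(X, Z; W) = \dfrac{\exp(-E(X, Z; W))}{\sum_{x', z'} \exp(-E(x', z'; W))}$
[Figure: bipartite graph with hidden units Z1, Z2 connected through W to visible units X1, X2, X3]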

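31 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
$p(X, Z; W) = \dfrac{\exp(-E(X, Z; W))}{\sum_{x', z'} \exp(-E(x', z'; W))}$
The sum over all configurations $(x', z')$ in the denominator is INTRACTABLE.
[Figure: bipartite graph with hidden units Z1, Z2 connected through W to visible units X1, X2, X3]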
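
The intractability of the normalizing sum is easy to see by brute force. The sketch below enumerates every joint configuration of a tiny RBM with 3 visible and 2 hidden binary units, matching the diagram, and assumes the bias-free energy $-Z'W'X$ from the slide. The function names are illustrative.

```python
import itertools
import numpy as np

def energy(x, z, W):
    """E(x, z; W) = -z' W' x, with W of shape (n_visible, n_hidden)."""
    return -(z @ W.T @ x)

def partition_function(W):
    """Sum of exp(-E) over all 2**(n_visible + n_hidden) binary configurations."""
    n_visible, n_hidden = W.shape
    total = 0.0
    for x in itertools.product([0, 1], repeat=n_visible):
        for z in itertools.product([0, 1], repeat=n_hidden):
            total += np.exp(-energy(np.array(x), np.array(z), W))
    return total

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
Z_total = partition_function(W)   # 2**5 = 32 terms here

def joint_probability(x, z, W):
    return np.exp(-energy(x, z, W)) / Z_total

p = joint_probability(np.array([1, 0, 1]), np.array([0, 1]), W)
```

The number of terms doubles with every added unit, so the same loop over a model with a few hundred units is out of reach; this is why training relies on sampling-based approximations.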

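32 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
$\sigma(u) = \dfrac{1}{1 + \exp(-u)}$
$p(X_i = 1 \mid Z; W) = \sigma(W_{i\cdot} Z)$, $\quad p(Z_j = 1 \mid X; W) = \sigma(W_{\cdot j}' X)$, where $W_{i\cdot}$ is the $i$-th row and $W_{\cdot j}$ the $j$-th column of $W$
[Figure: bipartite graph with hidden units Z1, Z2 connected through W to visible units X1, X2, X3]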

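33 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
Easy conditionals, which give efficient Gibbs sampling: $p(Z \mid X; W) = \sigma(W'X)$ and $p(X \mid Z; W) = \sigma(WZ)$
[Figure: bipartite graph with hidden units Z1, Z2 connected through W to visible units X1, X2, X3]

34 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
Easy conditionals, which give efficient Gibbs sampling: $p(Z \mid X; W) = \sigma(W'X)$ and $p(X \mid Z; W) = \sigma(WZ)$
[Figure: Gibbs step sampling Z1, Z2 from $p(Z \mid X; W)$ given X1, X2, X3]

35 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
Easy conditionals, which give efficient Gibbs sampling: $p(Z \mid X; W) = \sigma(W'X)$ and $p(X \mid Z; W) = \sigma(WZ)$
[Figure: Gibbs step sampling X1, X2, X3 from $p(X \mid Z; W)$ given Z1, Z2]

36 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
Easy conditionals, which give efficient Gibbs sampling: $p(Z \mid X; W) = \sigma(W'X)$ and $p(X \mid Z; W) = \sigma(WZ)$
[Figure: Gibbs step sampling Z1, Z2 from $p(Z \mid X; W)$ given X1, X2, X3, and so on]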

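37 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
$p(X \mid Z; W) = \sigma(WZ)$
$p(Z \mid X; W) = \sigma(W'X)$, subsequently used as features
[Figure: bipartite graph with hidden units Z1, Z2 connected through W to visible units X1, X2, X3]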
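
The two conditionals are all that block Gibbs sampling needs. The following sketch assumes the bias-free RBM of the slides, with `W` of shape (visible, hidden) and the logistic function applied elementwise; the function names are illustrative.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sample_z_given_x(x, W, rng):
    """Sample all hidden units at once; also return p(Z_j = 1 | x)."""
    p = sigmoid(W.T @ x)
    return (rng.random(p.shape) < p).astype(float), p

def sample_x_given_z(z, W, rng):
    """Sample all visible units at once; also return p(X_i = 1 | z)."""
    p = sigmoid(W @ z)
    return (rng.random(p.shape) < p).astype(float), p

def gibbs_chain(x0, W, n_steps, rng):
    """Alternate between the two conditionals, starting from x0."""
    x = x0
    for _ in range(n_steps):
        z, _ = sample_z_given_x(x, W, rng)
        x, _ = sample_x_given_z(z, W, rng)
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))              # 3 visible units, 2 hidden units
x0 = np.array([1.0, 0.0, 1.0])

z, features = sample_z_given_x(x0, W, rng)   # `features` is p(Z = 1 | x0)
x_sample = gibbs_chain(x0, W, n_steps=10, rng=rng)
```

The vector `features` holds the hidden-unit probabilities, which is the quantity the slide describes as subsequently used as features.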

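38 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
$\mathrm{Loss} = -\log p(X; W)$
$W \leftarrow W - \eta \, \dfrac{\partial \mathrm{Loss}}{\partial W}$, with $\dfrac{\partial \mathrm{Loss}}{\partial W} = \sum_{z} p(z \mid X^t) \dfrac{\partial E(X^t, z; W)}{\partial W} - \sum_{x, z} p(x, z) \dfrac{\partial E(x, z; W)}{\partial W}$
Here $X^t$ is a training sample and $\eta$ is the learning rate.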

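39 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
$\mathrm{Loss} = -\log p(X; W)$
$W \leftarrow W - \eta \, \dfrac{\partial \mathrm{Loss}}{\partial W}$, with $\dfrac{\partial \mathrm{Loss}}{\partial W} = \sum_{z} p(z \mid X^t) \dfrac{\partial E(X^t, z; W)}{\partial W} - \sum_{x, z} p(x, z) \dfrac{\partial E(x, z; W)}{\partial W}$
Here $X^t$ is a training sample and $\eta$ is the learning rate.
[Figure: energy E(x) plotted against x, BEFORE UPDATE]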

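40 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
$\mathrm{Loss} = -\log p(X; W)$
$W \leftarrow W - \eta \, \dfrac{\partial \mathrm{Loss}}{\partial W}$, with $\dfrac{\partial \mathrm{Loss}}{\partial W} = \sum_{z} p(z \mid X^t) \dfrac{\partial E(X^t, z; W)}{\partial W} - \sum_{x, z} p(x, z) \dfrac{\partial E(x, z; W)}{\partial W}$
Here $X^t$ is a training sample and $\eta$ is the learning rate.
[Figure: energy E(x) plotted against x, AFTER UPDATE]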

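41 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
$\mathrm{Loss} = -\log p(X; W)$
USING MCMC: $W \leftarrow W - \eta \, \dfrac{\partial \mathrm{Loss}}{\partial W}$, with $\dfrac{\partial \mathrm{Loss}}{\partial W} \approx \sum_{z} p(z \mid X^t) \dfrac{\partial E(X^t, z; W)}{\partial W} - \sum_{z} p(z \mid X^m) \dfrac{\partial E(X^m, z; W)}{\partial W}$
Here $X^t$ is a training sample, $X^m$ is a sample drawn from the model by MCMC, and $\eta$ is the learning rate.
[Figure: energy E(x) plotted against x]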

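42 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
$\mathrm{Loss} = -\log p(X; W)$
$W \leftarrow W - \eta \, \dfrac{\partial \mathrm{Loss}}{\partial W}$, with $\dfrac{\partial \mathrm{Loss}}{\partial W} \approx \sum_{z} p(z \mid X^t) \dfrac{\partial E(X^t, z; W)}{\partial W} - \sum_{z} p(z \mid X^m) \dfrac{\partial E(X^m, z; W)}{\partial W}$
Here $X^t$ is a training sample, $X^m$ is a sample drawn from the model, and $\eta$ is the learning rate.
In practice, the Gibbs sampler takes too long to converge. Alternatives: Contrastive Divergence; Persistent Contrastive Divergence; Fast Persistent Contrastive Divergence; Score Matching; Ratio Matching; Margin-Based Losses; Variational Methods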
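
As one concrete instance of these approximations, here is a minimal sketch of a Contrastive Divergence update with a single Gibbs step (CD-1) for the bias-free RBM. Since $\partial E / \partial W = -XZ'$, the update adds the data statistics and subtracts the statistics of the one-step reconstruction. The batch layout (one data point per row), the learning rate, and the toy data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def cd1_update(X, W, lr, rng):
    """One CD-1 step. X: (n, n_visible) binary batch, W: (n_visible, n_hidden)."""
    # Positive phase: hidden probabilities given the training data
    p_z_data = sigmoid(X @ W)
    z = (rng.random(p_z_data.shape) < p_z_data).astype(float)
    # One Gibbs step: reconstruct the visibles, then recompute hidden probabilities
    p_x = sigmoid(z @ W.T)
    x_model = (rng.random(p_x.shape) < p_x).astype(float)
    p_z_model = sigmoid(x_model @ W)
    # Approximate gradient of -log p(X; W), using dE/dW = -x z'
    grad = -(X.T @ p_z_data - x_model.T @ p_z_model) / len(X)
    return W - lr * grad

rng = np.random.default_rng(0)
X = np.ones((100, 4))        # toy data: every visible unit is always on
W = np.zeros((4, 3))
for _ in range(50):
    W = cd1_update(X, W, lr=0.1, rng=rng)
```

On this toy data the weights grow positive, which lowers the energy of the all-ones pattern. Persistent Contrastive Divergence differs mainly in keeping the model samples from one update to the next instead of restarting the chain at the data.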

43 Restricted Boltzmann Machine
$E(X, Z; W) = -Z'W'X$
Feature: $Z = g(W'X)$
PROS: possibly higher-dimensional features; it can generate data; simple interpretation of the learning rule; simple to extend the variables to other distributions
CONS: exact learning is intractable; the approximations make it hard to assess convergence; a few hyper-parameters to tune

44 Comparison: linear features
PCA: yes
Auto-encoder: no
K-Means: no
Sparse Coding: no
RBM: no
Denoising Auto-enc.: no

45 Comparison: sparse features
PCA: no (but it can be added)
Auto-encoder: no (but it can be added)
K-Means: yes
Sparse Coding: yes
RBM: no (but it can be added)
Denoising Auto-enc.: no (but it can be added)

46 Comparison: energy pull-up
PCA: restriction on code
Auto-encoder: restriction on code
K-Means: restriction on code
Sparse Coding: restriction on code
RBM: partition function
Denoising Auto-enc.: noise to input

47 Comparison
[Figure panels: PCA on 8x8 patches; ICA on 8x8 patches]

48 Comparing Unsupervised Algorithms
Properties of features
Reconstruction error
Likelihood on test data
Discriminative performance of a classifier trained on the features
Statistical dependency of components
Denoising performance
Other tasks

49 References: RBM
Hinton, "Training Products of Experts by Minimizing Contrastive Divergence," Neural Computation 2002
Welling, Rosen-Zvi, Hinton, "Exponential Family Harmoniums with an Application to Information Retrieval," NIPS 2005
code (matlab); code (python using gnumpy module to run on a GPU)

50 References: sparse RBM
Lee, Ekanadham, Ng, "Sparse deep belief net model for visual area V2," NIPS 2008
Lee, Grosse, Ranganath, Ng, "Convolutional Deep Belief Networks for scalable unsupervised learning of hierarchical representations," ICML 2009

51 References: S-RBM
Osindero, Hinton, "Modeling image patches with a directed hierarchy of Markov random fields," NIPS 2008

52 References: mcRBM
Ranzato, Hinton, "Modeling pixel means and covariances using factorized third-order Boltzmann machines," CVPR 2010
Ranzato, Mnih, Hinton, "Generating more realistic images using gated MRF," NIPS 2010
code (python code using CUDAMAT to run on a GPU)

53 References: Sparse Coding
Olshausen, Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?" Vision Research 1997
Lee, Battle, Raina, Ng, "Efficient sparse coding algorithms," NIPS 2007
Lee, Raina, Teichman, Ng, "Exponential family sparse coding with applications to self-taught learning," IJCAI 2009
code (Julien Mairal's work on fast sparse coding methods with several extensions) (see also work by M. Elad et al. on K-SVD, another sparse coding algorithm)

54 References: Predictive Sparse Coding
Kavukcuoglu, Ranzato, LeCun, "Fast inference in sparse coding algorithms with applications to object recognition," CBLL Technical Report, December ArXiv
Kavukcuoglu, Sermanet, Boureau, Gregor, Mathieu, LeCun, "Learning Convolutional Feature Hierarchies for Visual Recognition," NIPS 2010
code (basic algorithm for predictive sparse decomposition)

55 References: Local Coordinate Coding
Yu, Zhang, Gong, "Nonlinear learning using local coordinate coding," NIPS 2009
Lin, Zhang, Zhu, Yu, "Deep coding networks," NIPS 2010

56 References: Product of Student's t
Osindero, Welling, Hinton, "Topographic product models applied to natural scene statistics," Neural Computation 2006

57 References: Denoising Auto-encoders
Vincent, Larochelle, Bengio, Manzagol, "Extracting and composing robust features with denoising autoencoders," ICML 2008
code: written in Theano, a python library with an interface to GPU, developed in Y. Bengio's lab, see more at:

58 End of Part 1 Any questions?
