Unsupervised Neural Nets

1 Unsupervised Neural Nets (and ICA)
Lyle Ungar, University of Pennsylvania (with contributions from Quoc Le, Socher & Manning)

2 Semi-Supervised Learning
Hypothesis: $P(c \mid x)$ can be more accurately computed using shared structure with $p(x)$.
[Figure: purely supervised; from Socher and Manning]

3 Semi-Supervised Learning
Hypothesis: $P(c \mid x)$ can be more accurately computed using shared structure with $p(x)$.
[Figure: semi-supervised; from Socher and Manning]

4 Unsupervised Neural Nets
- Autoencoders
  - Take the same image as input and output
    - often adding noise to the input
  - Learn weights to minimize the reconstruction error
  - Avoid perfect fitting
    - Pass through a bottleneck
    - Impose sparsity
      - Dropout
    - Add noise to the input
- Generalize PCA or ICA
Autoencoders_and_Sparsity
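The autoencoder recipe on this slide (noisy input, clean target, bottleneck, minimize reconstruction error) can be sketched in a few lines of NumPy. This is a minimal illustration and not the lecture's own code: the tanh hidden layer, linear decoder, plain gradient descent, the synthetic data, and every name and hyperparameter below are assumptions chosen for brevity.

```python
import numpy as np

def train_denoising_autoencoder(X, n_hidden, noise_std=0.1, lr=0.1,
                                n_steps=2000, seed=0):
    """Train a one-hidden-layer denoising autoencoder by gradient descent.

    Assumed architecture: tanh encoder, linear decoder, squared error.
    X has shape (n_samples, n_features); n_hidden is the bottleneck width.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(n_steps):
        # Corrupt the input, but reconstruct the clean data.
        X_noisy = X + noise_std * rng.normal(size=X.shape)
        H = np.tanh(X_noisy @ W1 + b1)           # encode through the bottleneck
        X_hat = H @ W2 + b2                      # decode
        err = X_hat - X
        losses.append(float(np.mean(np.sum(err ** 2, axis=1))))
        # Backpropagation (chain rule) for the mean squared reconstruction error.
        g_out = 2.0 * err / n
        g_pre = (g_out @ W2.T) * (1.0 - H ** 2)  # through the tanh
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X_noisy.T @ g_pre)
        b1 -= lr * g_pre.sum(axis=0)
    return (W1, b1, W2, b2), losses

# Hypothetical data: 8-dimensional points lying on a 2-dimensional subspace.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
B = 0.3 * rng.normal(size=(2, 8))
X = Z @ B
params, losses = train_denoising_autoencoder(X, n_hidden=2)
```

Comparing `losses[0]` with `losses[-1]` shows how much of the data the two-unit bottleneck manages to reconstruct.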

5 Independent Components Analysis (ICA)
- Given observations $X$, find $W$ such that the components $s_j$ of $S = XW$ are as independent of each other as possible
  - E.g. have maximum KL-divergence or low mutual information
  - Alternatively, find directions in $X$ that are most skewed
    - farthest from Gaussian
  - Usually mean center and whiten the data (make unit covariance) first
    - whiten: $X \leftarrow X (X^T X)^{-1/2}$ (observations are the rows of $X$, matching $S = XW$)
- Very similar to PCA
  - But the loss function is not quadratic
  - So optimization cannot be done by SVD
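The preprocessing step on this slide (mean center, then whiten to unit covariance) might look as follows. This is a sketch under stated assumptions: observations are the rows of `X`, the covariance is scaled by 1/n, and the inverse square root is taken through an eigendecomposition with a small `eps` for numerical safety. The function name and the test data are hypothetical.

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Mean-center X (n_samples x n_features) and whiten it so that the
    sample covariance of the result is approximately the identity."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    # Symmetric inverse square root of the covariance via eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(cov)
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ inv_sqrt

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))             # hypothetical mixing matrix
X = rng.laplace(size=(5000, 3)) @ A     # correlated, non-Gaussian observations
Z = whiten(X)
cov_Z = Z.T @ Z / Z.shape[0]            # close to the 3x3 identity
```

After whitening, ICA only has to search over rotations of `Z`, because any remaining linear dependence between the columns has been removed.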

6 Independent Components Analysis (ICA)
- Given observations $X$, find $W$ and $S$ such that the components $s_j$ of $S = XW$ are as independent of each other as possible
  - the sources $s_k$ should be independent
- Reconstruct $X \approx (XW)W^{+} = SW^{+}$
  - $S$ is like the principal components
  - $W^{+}$ is like the loadings
  - $x \approx \sum_j s_j w_j^{+}$
- An auto-encoder is a nonlinear generalization that encodes $X$ as $S$ and then decodes it

7 Reconstruction ICA (RICA)
- Reconstruction ICA: find $W$ to minimize
  - Reconstruction error
    - $\|X - SW^{+}\|^2 = \|X - (XW)W^{+}\|^2$
- And minimize
  - Mutual information between the sources $S = XW$
    - $I(s_1, s_2, \ldots, s_k) = \sum_{i=1}^{k} H(s_i) - H(s)$, where $H(y) = -\int p(y) \log p(y)\, dy$
    - This is the sum of the entropies of the individual sources $s_i$ minus the entropy of all of them together
- Note: this is a bit more complex than it looks, as we have real numbers, not distributions
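A rough sketch of a RICA-style cost in the notation of this slide is given below. Two substitutions are assumptions on my part and are not stated on the slide: the decoder uses tied weights $W^T$ in place of the pseudo-inverse $W^{+}$, and a smooth L1 penalty on the sources stands in for the mutual-information term, which is how I understand the formulation in the Le et al. NIPS 2011 paper cited later in these slides. The function name, `lam`, and `eps` are hypothetical.

```python
import numpy as np

def rica_cost(W, X, lam=1.0, eps=1e-6):
    """RICA-style cost. X: (n, d) with observations in rows; W: (d, k)."""
    n = X.shape[0]
    S = X @ W                         # sources / features, shape (n, k)
    X_hat = S @ W.T                   # tied-weight decoder (assumed stand-in for W^+)
    recon = np.sum((X_hat - X) ** 2) / n
    # Smooth L1 penalty: an assumed surrogate for independence of the sources.
    sparsity = np.sum(np.sqrt(S ** 2 + eps)) / n
    return lam * recon + sparsity, recon, sparsity

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
# For a square orthonormal W, W.T equals W^+ and reconstruction is exact.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
total, recon, sparsity = rica_cost(Q, X)
```

With an overcomplete `W` (k larger than d) the reconstruction term no longer vanishes automatically, and it is what keeps the learned features from collapsing onto each other.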

8 Mutual information
$I(y_1, y_2, \ldots, y_m) = \mathrm{KL}\big(p(y_1, y_2, \ldots, y_m) \,\|\, p(y_1)\, p(y_2) \cdots p(y_m)\big)$
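For intuition, the KL form of mutual information can be evaluated directly when the variables are discrete. The sketch below handles only the two-variable case with a joint probability table, which is a simplification of the $m$-variable, continuous setting on the slide; the function name and the example tables are illustrative.

```python
import numpy as np

def mutual_information(joint):
    """I(y1, y2) = KL(p(y1, y2) || p(y1) p(y2)) for a 2-D joint table, in nats."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                   # normalize to a distribution
    p1 = joint.sum(axis=1, keepdims=True)         # marginal of y1
    p2 = joint.sum(axis=0, keepdims=True)         # marginal of y2
    product = p1 * p2                             # product of the marginals
    mask = joint > 0                              # treat 0 * log 0 as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

independent = np.outer([0.3, 0.7], [0.5, 0.5])    # joint equals product of marginals
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])    # y2 is a copy of y1
mi_independent = mutual_information(independent)  # 0 up to rounding
mi_dependent = mutual_information(dependent)      # log 2
```

Independent variables give zero mutual information, which is the condition ICA tries to approach for the recovered sources.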

9 Unsupervised Neural Nets
- Auto-encoders
  - Take the same image as input and output
    - often adding noise to the input
  - Learn weights to minimize the reconstruction error
  - This can be done repeatedly (reconstructing features)
  - Use for semi-supervised learning

10 PCA = Linear Manifold = Linear Auto-Encoder
[Figure from Socher and Manning]
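The slide title can be made concrete: projecting onto the top $k$ principal directions is an encoder, and multiplying back by the same directions is a decoder. The sketch below, with hypothetical names and synthetic data, builds that linear autoencoder from an SVD and checks that its squared reconstruction error equals the sum of the discarded squared singular values.

```python
import numpy as np

def pca_autoencoder(X, k):
    """Build a linear encoder/decoder pair from the top-k principal directions."""
    mean = X.mean(axis=0)
    Xc = X - mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                                  # (d, k) encoder; decoder is W.T

    def encode(X_new):
        return (X_new - mean) @ W

    def decode(S):
        return S @ W.T + mean

    return encode, decode, s

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
encode, decode, s = pca_autoencoder(X, k=2)
codes = encode(X)                                 # shape (200, 2)
err = np.sum((decode(codes) - X) ** 2)            # squared reconstruction error
discarded = np.sum(s[2:] ** 2)                    # energy in the dropped directions
```

No linear encoder and decoder with a two-dimensional code can achieve a lower squared error on this data, which is the sense in which PCA is the optimal linear autoencoder.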

11 The Manifold Learning Hypothesis
[Figure from Socher and Manning]

12 Auto-Encoders are like nonlinear PCA
[Figure from Socher and Manning]

13 Stacking for deep learning
[Figure from Socher and Manning]

14 Stacking for deep learning
Now learn to reconstruct the features (using more abstract ones)
[Figure from Socher and Manning]

15 Stacking for deep learning
- Recurse many layers deep
- Can be used to initialize supervised learning
[Figure from Socher and Manning]
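Greedy layer-wise stacking, as described on the last three slides, can be sketched as a loop: train an autoencoder on the current features, keep its encoder, and feed the encoded features to the next autoencoder. The code below is a minimal illustration under assumed choices (tanh encoders, linear decoders, plain gradient descent, random synthetic data); the function names and layer sizes are hypothetical.

```python
import numpy as np

def train_autoencoder(X, n_hidden, lr=0.05, n_steps=300, seed=0):
    """Train one tanh-encoder / linear-decoder autoencoder on X.
    Returns only the encoder parameters (W, b)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, d))
    b2 = np.zeros(d)
    for _ in range(n_steps):
        H = np.tanh(X @ W1 + b1)
        g_out = 2.0 * (H @ W2 + b2 - X) / n       # gradient of mean squared error
        g_pre = (g_out @ W2.T) * (1.0 - H ** 2)   # backpropagate through tanh
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_pre)
        b1 -= lr * g_pre.sum(axis=0)
    return W1, b1

def train_stack(X, layer_sizes):
    """Greedy layer-wise training: each layer reconstructs the features
    produced by the layer below it."""
    encoders, features = [], X
    for i, size in enumerate(layer_sizes):
        W, b = train_autoencoder(features, size, seed=i)
        encoders.append((W, b))
        features = np.tanh(features @ W + b)      # input to the next layer
    return encoders, features

rng = np.random.default_rng(0)
X = 0.3 * rng.normal(size=(100, 10))              # hypothetical unlabeled data
encoders, top_features = train_stack(X, layer_sizes=[6, 3])
```

The list `encoders` is what would be copied into the lower layers of a supervised network before fine-tuning on labeled data.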

16 Tera-scale deep learning
Quoc V. Le, Stanford University and Google (now at Google)

17 Joint work with
Kai Chen, Greg Corrado, Jeff Dean, Matthieu Devin, Rajat Monga, Andrew Ng, Marc'Aurelio Ranzato, Paul Tucker, Ke Yang
Additional thanks: Samy Bengio, Zhenghao Chen, Tom Dean, Pangwei Koh, Mark Mao, Jiquan Ngiam, Patrick Nguyen, Andrew Saxe, Mark Segal, Jon Shlens, Vincent Vanhouke, Xiaoyun Wu, Peng Xe, Serena Yeung, Will Zou

18 TICA:
Warning: this x and W are the transpose of what we use
Reconstruction ICA:
Equivalence between Sparse Coding, Autoencoders, RBMs and ICA
Build deep architecture by treating the output of one layer as input to another layer
Le, et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011

19 Visualization of features learned
Most are local features

20 Challenges with 1000s of machines

21 Asynchronous Parallel SGDs
Parameter server
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

22 Local receptive field networks
[Figure labels: Machine #1, Machine #2, Machine #3, Machine #4; RICA features; Image]
Le, et al., Tiled Convolutional Neural Networks. NIPS 2010

23 10 million 200x200 images
1 billion parameters

24 Training
Dataset: 10 million 200x200 unlabeled images from YouTube/Web
Train on 2000 machines (16000 cores) for 1 week
1.15 billion parameters
- 100x larger than previously reported
- Small compared to visual cortex
[Figure labels: RICA, RICA, RICA, Image]
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

25 The face neuron
Top stimuli from the test set
Optimal stimulus by numerical optimization
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
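"Optimal stimulus by numerical optimization" means searching for the input that maximizes a chosen unit's activation. The toy sketch below shows the idea with projected gradient ascent on a unit-norm input. It is not the procedure from the cited paper: the single tanh unit, the unit-norm constraint, the step size, and all names are assumptions made for illustration, and a real network would need the gradient through all of its layers.

```python
import numpy as np

def neuron(x, w):
    """Toy stand-in for a learned unit: tanh of a linear filter."""
    return np.tanh(w @ x)

def optimal_stimulus(w, n_steps=200, lr=0.5, seed=0):
    """Maximize neuron(x, w) over inputs with ||x|| = 1 by projected gradient ascent."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=w.shape)
    x /= np.linalg.norm(x)
    for _ in range(n_steps):
        grad = (1.0 - np.tanh(w @ x) ** 2) * w   # d neuron / d x
        x = x + lr * grad                        # ascent step
        x /= np.linalg.norm(x)                   # project back onto the unit sphere
    return x

rng = np.random.default_rng(1)
w = 0.1 * rng.normal(size=16)                    # hypothetical filter weights
x_star = optimal_stimulus(w)
cosine = float(x_star @ w / np.linalg.norm(w))   # alignment between x_star and w
```

For this toy unit the answer is known in closed form (the input parallel to `w`), which makes it a convenient check that the optimization loop behaves as intended.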

26 The cat neuron
Top stimuli from the test set
Optimal stimulus by numerical optimization
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

27 What you should know
- Supervised neural nets
  - Generalize logistic regression
  - Often have built-in structure
    - local receptive fields, max pooling
  - Usually solved by minibatch stochastic gradient descent
    - with the chain rule (backpropagation)
- Unsupervised neural nets
  - Generalize PCA or ICA
  - Generally learn an overcomplete basis
  - Often trained recursively as nonlinear autoencoders
  - Used in semi-supervised learning
    - or to initialize supervised deep nets
