Unsupervised Neural Nets (and ICA). Lyle Ungar, University of Pennsylvania (with contributions from Quoc Le, Socher & Manning)
Semi-Supervised Learning. Hypothesis: P(c|x) can be more accurately computed using shared structure with p(x). [Figure: the purely supervised setting] (from Socher and Manning)
Semi-Supervised Learning. Hypothesis: P(c|x) can be more accurately computed using shared structure with p(x). [Figure: the semi-supervised setting] (from Socher and Manning)
Unsupervised Neural Nets
- Autoencoders (a minimal sketch follows this list)
  - Take the same image as input and output, often adding noise to the input
  - Learn weights to minimize the reconstruction error
  - Avoid a perfect (identity) fit:
    - Pass through a bottleneck
    - Impose sparsity (e.g., dropout)
    - Add noise to the input
- Generalize PCA or ICA
http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity
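Below is a minimal sketch of such a denoising autoencoder in Python, assuming tied weights, sigmoid hidden units, a linear decoder, Gaussian input noise, and plain gradient descent; the function names and hyperparameters are illustrative, not from the lecture.

# Minimal denoising autoencoder sketch (assumptions: tied weights, sigmoid hidden
# units, linear reconstruction, plain gradient descent; not the lecture's exact setup).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, n_hidden=32, noise=0.1, lr=0.01, epochs=200, seed=0):
    """X: (n_samples, n_features). Returns encoder weights W and biases b, c."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = 0.01 * rng.standard_normal((d, n_hidden))   # encoder weights (decoder = W.T, tied)
    b = np.zeros(n_hidden)                          # hidden bias
    c = np.zeros(d)                                 # output bias
    for _ in range(epochs):
        X_noisy = X + noise * rng.standard_normal(X.shape)   # corrupt the input
        H = sigmoid(X_noisy @ W + b)                          # encode
        X_hat = H @ W.T + c                                   # decode
        R = X_hat - X                                         # error against the *clean* input
        # Gradients of the mean squared reconstruction error
        dH = (R @ W) * H * (1 - H)
        gW = (X_noisy.T @ dH + R.T @ H) / n                   # encoder path + decoder path
        W -= lr * gW
        b -= lr * dH.mean(axis=0)
        c -= lr * R.mean(axis=0)
    return W, b, c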
Independent Components Analysis (ICA)
- Given observations X, find W such that the components s_j of S = XW are as independent of each other as possible
  - E.g., have maximal KL divergence from a Gaussian, or low mutual information with each other
  - Alternatively, find the directions in X that are most skewed, i.e., farthest from Gaussian
- Usually mean center and whiten the data (make it unit covariance) first
  - Whiten: replace X by (X^T X)^{-1/2} X (a sketch follows this list)
- Very similar to PCA
  - But the loss function is not quadratic
  - So the optimization cannot be done by SVD
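A minimal numpy sketch of the mean-centering and whitening step, assuming the rows of X are samples; the eigendecomposition route to the inverse square root is one standard choice, and the function name is just for illustration.

# Mean center and whiten X so that the whitened data has (approximately) unit covariance.
import numpy as np

def whiten(X, eps=1e-8):
    Xc = X - X.mean(axis=0)                  # mean center
    C = Xc.T @ Xc / Xc.shape[0]              # sample covariance (d x d)
    evals, evecs = np.linalg.eigh(C)         # C = V diag(evals) V^T
    C_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return Xc @ C_inv_sqrt                   # whitened data

# After whitening, an ICA solver (e.g. sklearn.decomposition.FastICA) only has to find
# the rotation W that makes the components of S = XW as non-Gaussian / independent as possible.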
Independent Components Analysis (ICA)
- Given observations X, find W and S such that the components s_j of S = XW are as independent of each other as possible
  - The s_k are the sources, which should be independent
- Reconstruct X ~ (XW)W^+ = SW^+
  - S is like the principal components
  - W^+ is like the loadings
  - x ~ Σ_j s_j w_j^+
- Auto-encoders are the nonlinear generalization that encodes X as S and then decodes it
Reconstruction ICA (RICA)
- Find W to minimize (a sketch of this objective follows this list)
  - the reconstruction error ||X - SW^+||^2 = ||X - (XW)W^+||^2
  - and the mutual information between the sources S = XW:
    I(s_1, s_2, ..., s_k) = Σ_i H(s_i) - H(s)
    where H(y) = -∫ p(y) log p(y) dy is the (differential) entropy
  - i.e., the difference between the entropy of each source s_i separately and the entropy of all of them together
http://mathworld.wolfram.com/differentialentropy.html
Note: this is a bit more complex than it looks, since we have real-valued samples, not distributions.
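A sketch of a RICA-style objective in this notation (S = XW, reconstruction through W^+). Assumption: as in Le et al. (2011), the mutual-information / independence term is replaced by a smoothed L1 sparsity penalty on the sources, which is the usual practical surrogate; the function name and the weight lam are illustrative.

# RICA-style loss: reconstruction error plus a sparsity surrogate for independence.
import numpy as np

def rica_loss(W, X, lam=0.1, eps=1e-6):
    S = X @ W                                   # sources / codes
    X_hat = S @ np.linalg.pinv(W)               # reconstruction through W^+
    recon = np.sum((X - X_hat) ** 2)            # ||X - (XW)W^+||^2
    sparsity = np.sum(np.sqrt(S ** 2 + eps))    # smoothed L1 penalty on the sources
    return recon + lam * sparsity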
Mutual information
I(y_1, y_2, ..., y_m) = KL( p(y_1, y_2, ..., y_m) || p(y_1) p(y_2) ... p(y_m) )
http://www.stat.ucla.edu/~yuille/courses/stat161-261-spring14/hyvo00-icatut.pdf
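A tiny numeric illustration of this identity for two discrete variables (the continuous case above uses densities and differential entropy instead); the joint distribution below is made up for the example.

# Mutual information as KL(joint || product of marginals) for a toy discrete joint.
import numpy as np

p_joint = np.array([[0.30, 0.10],     # rows: values of y1, columns: values of y2
                    [0.10, 0.50]])
p1 = p_joint.sum(axis=1, keepdims=True)   # marginal of y1
p2 = p_joint.sum(axis=0, keepdims=True)   # marginal of y2
mi = np.sum(p_joint * np.log(p_joint / (p1 * p2)))
print(mi)   # > 0 because y1 and y2 are dependent; equals 0 iff they are independent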
Unsupervised Neural Nets
- Auto-encoders
  - Take the same image as input and output, often adding noise to the input
  - Learn weights to minimize the reconstruction error
  - This can be done repeatedly (reconstructing features from features)
  - Use for semi-supervised learning
http://theanalyticsstore.com/deep-learning/
PCA = Linear Manifold = Linear Auto-Encoder (figure from Socher and Manning)
The Manifold Learning Hypothesis (figure from Socher and Manning)
Auto-Encoders are like nonlinear PCA (figure from Socher and Manning)
Stacking for deep learning (figure from Socher and Manning)
Stacking for deep learning: now learn to reconstruct the features (using more abstract ones) (from Socher and Manning)
Stacking for deep learning
- Recurse many layers deep (a sketch of the recursion follows this list)
- Can be used to initialize supervised learning
(from Socher and Manning)
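A sketch of the stacking recursion, reusing the train_autoencoder and sigmoid sketch from the autoencoder slide above (an assumption of this sketch): each layer is trained to reconstruct the codes produced by the layer below it, and the resulting weights can initialize a supervised deep net. The layer sizes are arbitrary.

# Greedy layer-wise stacking: train an autoencoder on the raw input, then train the
# next autoencoder on the hidden codes it produces, and so on.
import numpy as np

def stack_autoencoders(X, layer_sizes=(64, 32, 16), **kwargs):
    layers, H = [], X
    for n_hidden in layer_sizes:
        W, b, c = train_autoencoder(H, n_hidden=n_hidden, **kwargs)
        layers.append((W, b))
        H = sigmoid(H @ W + b)        # this layer's codes become the next layer's input
    return layers                     # weights that can initialize a supervised deep net

def encode(X, layers):
    H = X
    for W, b in layers:
        H = sigmoid(H @ W + b)
    return H                          # most abstract features, e.g. for a classifier on top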
Tera-scale deep learning. Quoc V. Le, Stanford University and Google (now at Google)
Joint work with: Kai Chen, Greg Corrado, Jeff Dean, Matthieu Devin, Rajat Monga, Andrew Ng, Marc'Aurelio Ranzato, Paul Tucker, Ke Yang.
Additional thanks: Samy Bengio, Zhenghao Chen, Tom Dean, Pangwei Koh, Mark Mao, Jiquan Ngiam, Patrick Nguyen, Andrew Saxe, Mark Segal, Jon Shlens, Vincent Vanhoucke, Xiaoyun Wu, Peng Xe, Serena Yeung, Will Zou.
TICA and Reconstruction ICA objectives (warning: this x and W are the transpose of what we use)
- Equivalence between sparse coding, autoencoders, RBMs, and ICA
- Build a deep architecture by treating the output of one layer as the input to another layer
Le et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011
Visualization of the features learned: most are local features
Challenges with 1000s of machines
Asynchronous parallel SGD with a parameter server (a toy sketch follows below)
Le et al., Building high-level features using large-scale unsupervised learning. ICML 2012
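A toy, single-process sketch of this scheme: each worker repeatedly pulls the current parameters, computes a gradient on its own shard of data, and pushes an update without waiting for the other workers. This uses Python threads and a least-squares gradient as a stand-in for the real model; the actual system also shards the model itself across many machines, which this sketch omits.

# Toy asynchronous SGD against a parameter server (not the actual DistBelief implementation).
import threading
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.01):
        self.w, self.lr = np.zeros(dim), lr
        self.lock = threading.Lock()
    def pull(self):
        with self.lock:
            return self.w.copy()
    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad          # apply whichever gradient arrives next

def worker(server, X_shard, y_shard, steps=100):
    for _ in range(steps):
        w = server.pull()                     # possibly stale parameters
        grad = X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)   # least-squares gradient
        server.push(grad)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((1000, 5)), rng.standard_normal(1000)
server = ParameterServer(dim=5)
threads = [threading.Thread(target=worker, args=(server, X[i::4], y[i::4]))
           for i in range(4)]                 # 4 asynchronous workers on disjoint data shards
for t in threads: t.start()
for t in threads: t.join()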
Local receptive field networks
[Figure: RICA features computed from local patches of the image, with the model partitioned across Machines #1 through #4]
Le et al., Tiled Convolutional Neural Networks. NIPS 2010
10 million 200x200 images; 1 billion parameters
Training
[Figure: a stack of RICA layers on top of the image]
- Dataset: 10 million 200x200 unlabeled images from YouTube/the web
- Train on 2000 machines (16,000 cores) for 1 week
- 1.15 billion parameters
  - ~100x larger than previously reported
  - Small compared to the visual cortex
Le et al., Building high-level features using large-scale unsupervised learning. ICML 2012
The face neuron
[Figure: top stimuli from the test set, and the optimal stimulus found by numerical optimization]
Le et al., Building high-level features using large-scale unsupervised learning. ICML 2012
The cat neuron
[Figure: top stimuli from the test set, and the optimal stimulus found by numerical optimization]
Le et al., Building high-level features using large-scale unsupervised learning. ICML 2012
What you should know
- Supervised neural nets
  - Generalize logistic regression
  - Often have built-in structure (local receptive fields, max pooling)
  - Usually trained by minibatch stochastic gradient descent, using the chain rule ("backpropagation")
- Unsupervised neural nets
  - Generalize PCA or ICA
  - Generally learn an overcomplete basis
  - Often trained recursively as nonlinear autoencoders
  - Used in semi-supervised learning, or to initialize supervised deep nets