Denoising Autoencoders
Oliver Worm, Daniel Leinfelder
20.11.2013
Introduction

Poor initialisation can lead to local minima.

- 1986 - Rumelhart, Hinton, Williams [RHW88]: random initialization and gradient descent show poor performance.
- 2006 - Hinton, Osindero, Teh [HOT06]: stacking Restricted Boltzmann Machines and tuning with the up-down algorithm shows very good performance.
- 2007 - Bengio, Lamblin, Popovici, Larochelle [BLP+07] [PCL06]: stacking autoencoders and tuning with gradient descent shows good performance.

Can we initialize it better?
Autoencoder

- The input $x \in [0,1]^d$ is mapped to the hidden representation $y \in [0,1]^{d'}$ by the encoder $y = f_\theta(x) = s(Wx + b)$, with parameters $\theta = \{W, b\}$ and $s$ the sigmoid.
- The decoder maps $y$ back to the reconstructed input $z \in [0,1]^d$: $z = g_{\theta'}(y) = s(W'y + b')$, with $\theta' = \{W', b'\}$.
- Squared error: $L(x, z) = \|x - z\|^2$.
- Training minimizes the average reconstruction error over the training set:
  $\theta^*, \theta'^* = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L\big(x^{(i)}, g_{\theta'}(f_\theta(x^{(i)}))\big)$
- Reconstruction cross-entropy:
  $L_H(x, z) = -\sum_{k=1}^{d} \big[x_k \log z_k + (1 - x_k) \log(1 - z_k)\big]$
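As a minimal sketch of these definitions (not from the slides; NumPy, with made-up initialization details), the encoder, decoder, and both losses could look like this:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class Autoencoder:
    """y = s(Wx + b), z = s(W'y + b'), as on the slide."""
    def __init__(self, d, d_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W  = rng.normal(0.0, 0.01, size=(d_hidden, d))  # encoder weights W
        self.b  = np.zeros(d_hidden)                         # encoder bias b
        self.Wp = rng.normal(0.0, 0.01, size=(d, d_hidden))  # decoder weights W'
        self.bp = np.zeros(d)                                # decoder bias b'

    def encode(self, x):   # y = f_theta(x)
        return sigmoid(self.W @ x + self.b)

    def decode(self, y):   # z = g_theta'(y)
        return sigmoid(self.Wp @ y + self.bp)

def squared_error(x, z):
    return np.sum((x - z) ** 2)

def cross_entropy(x, z, eps=1e-12):
    """Reconstruction cross-entropy L_H(x, z); eps guards against log(0)."""
    z = np.clip(z, eps, 1.0 - eps)
    return -np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z))
```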
Denoising Autoencoder

- The input $x \in [0,1]^d$ is partially destroyed, yielding the corrupted input $\tilde{x} \sim q_D(\tilde{x} \mid x)$.
- $\tilde{x}$ is mapped to the hidden representation $y = f_\theta(\tilde{x})$.
- Reconstruction from $y$ gives $z = g_{\theta'}(y)$.
- The reconstruction error $L_H(x, z)$ compares $z$ against the uncorrupted input $x$.
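Continuing the sketch above, the corruption $q_D$ can be implemented as masking noise, the choice made in [VLBM08]: a fraction $v$ of the components is forced to 0. The loss then scores the reconstruction of the corrupted input against the clean one:

```python
def corrupt(x, v, rng=np.random.default_rng(0)):
    """q_D(x_tilde | x): masking noise -- a random fraction v of the
    components of x is forced to 0 (the corruption used in [VLBM08])."""
    x_tilde = x.copy()
    zeroed = rng.choice(x.size, size=int(v * x.size), replace=False)
    x_tilde[zeroed] = 0.0
    return x_tilde

def denoising_loss(ae, x, v):
    """Encode the corrupted input, but measure L_H against the clean x."""
    z = ae.decode(ae.encode(corrupt(x, v)))
    return cross_entropy(x, z)
```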
Learning the layers

1. Learn $f_\theta$ with a denoising autoencoder on the first layer.
2. Remove the autoencoder construct and use the learned mapping $f_\theta$ directly on the input.
3. Learn the next layer ($f_\theta^{(2)}$, $g_{\theta'}^{(2)}$) by repeating these steps on the resulting representation.

A sketch of this greedy procedure follows below.
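Continuing the sketch, one denoising-autoencoder layer can be trained with plain stochastic gradient descent (learning rate and epoch count are made-up defaults), and the greedy stacking keeps only each trained encoder:

```python
def train_dae(X, d_hidden, v, lr=0.1, epochs=10):
    """Train one denoising autoencoder with SGD on inputs X (n x d)."""
    ae = Autoencoder(X.shape[1], d_hidden)
    for _ in range(epochs):
        for x in X:
            x_t = corrupt(x, v)
            y = ae.encode(x_t)
            z = ae.decode(y)
            # For sigmoid outputs with cross-entropy, dL/da' = z - x.
            dz = z - x
            dy = (ae.Wp.T @ dz) * y * (1.0 - y)   # backprop through s
            ae.Wp -= lr * np.outer(dz, y)
            ae.bp -= lr * dz
            ae.W  -= lr * np.outer(dy, x_t)
            ae.b  -= lr * dy
    return ae

def stack_layers(X, hidden_sizes, v):
    """Greedy layer-wise pretraining: keep each trained encoder and feed
    the (uncorrupted) mapped representation to the next layer."""
    encoders, H = [], X
    for d_hidden in hidden_sizes:
        ae = train_dae(H, d_hidden, v)
        encoders.append(ae)
        H = np.array([ae.encode(h) for h in H])
    return encoders
```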
Supervised fine tuning

- Initialize the network with the unsupervised layer-wise learning.
- Continue with supervised learning for an output layer $f_\theta^{sup}$ on the targets (supervised cost).
- Fine-tune the whole network ($f_\theta$, $f_\theta^{(2)}$, $f_\theta^{(3)}$, $f_\theta^{sup}$) with the supervised criterion.
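The forward pass of the fine-tuned network is just the pretrained encoders followed by the supervised output layer. A sketch, assuming a softmax output layer (the slides do not specify $f^{sup}$; the backward pass through all layers is omitted here):

```python
def predict(encoders, W_sup, b_sup, x):
    """Pretrained encoders f, f^(2), ... followed by a hypothetical
    softmax output layer f_sup; fine-tuning would backpropagate the
    supervised cost through all of these parameters."""
    for ae in encoders:
        x = ae.encode(x)
    a = W_sup @ x + b_sup
    e = np.exp(a - a.max())   # numerically stable softmax
    return e / e.sum()
```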
Perspective view: Manifold

There are several perspectives on denoising autoencoders; here: learning a manifold.

- The training data (x) lie near a low-dimensional manifold.
- A corrupted example ($\tilde{x}$) is obtained by applying $q_D(\tilde{X} \mid X)$ and tends to lie farther from the manifold.
- Learning the model $p(X \mid \tilde{X})$ projects corrupted examples back onto the manifold.
Results

Test error rate (%) with a 95% confidence interval on the MNIST variants [VLBM08]; the (v%) entry gives the corruption level used for SdA-3.

Dataset     | SVM_rbf       | SAA-3         | DBN-3         | SdA-3 (v%)
basic       |  3.03 ± 0.15  |  3.46 ± 0.16  |  3.11 ± 0.15  |  2.80 ± 0.14 (10)
rot         | 11.11 ± 0.28  | 10.30 ± 0.27  | 10.30 ± 0.27  | 10.29 ± 0.27 (10)
bg-rand     | 14.58 ± 0.31  | 11.28 ± 0.28  |  6.73 ± 0.22  | 10.38 ± 0.27 (40)
bg-img      | 22.61 ± 0.37  | 23.00 ± 0.37  | 16.31 ± 0.32  | 16.68 ± 0.33 (25)
rot-bg-img  | 55.18 ± 0.44  | 51.93 ± 0.44  | 47.39 ± 0.44  | 44.49 ± 0.44 (25)
rect        |  2.15 ± 0.13  |  2.41 ± 0.13  |  2.60 ± 0.14  |  1.99 ± 0.12 (10)
rect-img    | 24.04 ± 0.37  | 24.05 ± 0.37  | 22.50 ± 0.37  | 21.59 ± 0.36 (25)
convex      | 19.13 ± 0.34  | 18.41 ± 0.34  | 18.63 ± 0.34  | 19.06 ± 0.34 (10)
Results

[Figures for corruption levels v = 0%, 10%, 25%, and 50%; the images did not survive extraction.]
Summary

- Extending autoencoders to denoising autoencoders is simple.
- Denoising helps to capture interesting structure in the input distribution.
- Initialization with stacked denoising autoencoders performs better than with stacked basic autoencoders.
- Denoising autoencoders even outperform deep belief networks whose layers are initialized as Restricted Boltzmann Machines [VLBM08].
References

[BLP+07] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NIPS). MIT Press, 2007.

[HOT06] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, July 2006.

[PCL06] Christopher Poultney, Sumit Chopra, and Yann LeCun. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems (NIPS 2006). MIT Press, 2006.

[RHW88] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. In Neurocomputing: Foundations of Research, pages 696-699. MIT Press, Cambridge, MA, USA, 1988.

[VLBM08] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pages 1096-1103, New York, NY, USA, 2008. ACM.