Deep unsupervised learning Advanced data-mining Yongdai Kim Department of Statistics, Seoul National University, South Korea
Unsupervised learning
In machine learning, there are three kinds of learning paradigms: 1. Supervised learning 2. Unsupervised learning 3. Reinforcement learning. In this chapter, we only consider unsupervised learning.
What is unsupervised learning? Unsupervised learning is the problem of trying to find hidden structure in unlabeled data.
Examples: clustering (k-means clustering, mixture of Gaussians, etc.), factor analysis (PCA, etc.)
Unsupervised learning in deep learning
Stochastic case: Restricted Boltzmann Machines, Sigmoid Belief Networks, Deep Boltzmann Machines, Deep Belief Networks.
Deterministic case: Auto-Encoders, Denoising Auto-Encoders, Contractive Auto-Encoders.
Restricted Boltzmann Machines
A generative stochastic model that can learn a probability distribution over its set of inputs.
Initially invented under the name Harmonium (P. Smolensky, 1986).
Rose to prominence after Hinton and his collaborators launched deep learning with two important papers in 2006 (G. E. Hinton et al., 2006; G. E. Hinton and R. R. Salakhutdinov, 2006).
RBMs can be trained in either supervised or unsupervised ways, depending on the task.
An RBM is not itself a deep structure, but it is a building block of deep structures.
Restricted Boltzmann Machines
As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph.
Figure: Structures of BM and RBM
Restricted Boltzmann Machines
Structure of RBM: an RBM has binary-valued hidden and visible units.
$v = (v_1, \ldots, v_m)$: visible units. $h = (h_1, \ldots, h_n)$: hidden units.
Given these, the energy of $(v, h)$ is defined as
$E(v, h) = -b^T v - c^T h - v^T W h = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} w_{ij} v_i h_j$,
where $b$ and $c$ are called bias vectors and $W$ is called the weight matrix. Parameter $\theta := (W, b, c)$.
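As a concrete illustration, here is a minimal NumPy sketch of the energy function above; the function name and array shapes are my own choices for illustration, not part of the original slides.

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """Energy E(v, h) = -b'v - c'h - v'Wh of a binary RBM.

    v: (m,) visible vector, h: (n,) hidden vector,
    W: (m, n) weight matrix, b: (m,) visible bias, c: (n,) hidden bias.
    """
    return -(b @ v) - (c @ h) - (v @ W @ h)
```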
Restricted Boltzmann Machines
The probability distribution over visible and hidden vectors is defined in terms of the energy function:
$P(v, h) = \frac{1}{Z} e^{-E(v, h)}$,
where $Z$ is called the partition function and is just a normalizing constant, i.e. $Z = \sum_v \sum_h e^{-E(v, h)}$.
Figure: Example of RBM
Learning Restricted Boltzmann Machines
Because of the bipartite structure, the conditional distributions are easy to calculate, and therefore so is the posterior.
The conditional probability of the visible units $v$ given $h$ is
$P(v \mid h) = \prod_i P(v_i \mid h)$, where $P(v_i = 1 \mid h) = \mathrm{sigm}(b_i + \sum_j w_{ij} h_j)$.
Conversely, the conditional probability of $h$ given $v$ is
$P(h \mid v) = \prod_j P(h_j \mid v)$, where $P(h_j = 1 \mid v) = \mathrm{sigm}(c_j + \sum_i w_{ij} v_i)$.
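A minimal sketch of these conditionals in NumPy, reusing the shapes from the energy example above; the helper names are illustrative only.

```python
def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    """Vector of P(h_j = 1 | v) = sigm(c_j + sum_i w_ij v_i)."""
    return sigm(c + W.T @ v)

def p_v_given_h(h, W, b):
    """Vector of P(v_i = 1 | h) = sigm(b_i + sum_j w_ij h_j)."""
    return sigm(b + W @ h)

def sample_bernoulli(p, rng):
    """Draw a binary vector with elementwise success probabilities p."""
    return (rng.random(p.shape) < p).astype(float)
```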
Learning Restricted Boltzmann Machines
The basic idea is to minimize the negative log-likelihood. Computing the minimizer of $-\log P(v) = -\log \sum_h P(v, h)$ directly is hard, so we use a gradient descent algorithm:
$\frac{\partial [-\log P(v)]}{\partial \theta} = \sum_h P(h \mid v) \frac{\partial E(v, h)}{\partial \theta} - \sum_{\tilde v, h} P(\tilde v, h) \frac{\partial E(\tilde v, h)}{\partial \theta} = E_{h \mid v}\!\left[\frac{\partial E(v, h)}{\partial \theta}\right] - E_{\tilde v, h}\!\left[\frac{\partial E(\tilde v, h)}{\partial \theta}\right]$.
Computing these partial derivatives exactly is very slow, because an exponential number of summations is needed to evaluate $E_{\tilde v, h}\!\left[\frac{\partial E(\tilde v, h)}{\partial \theta}\right]$.
Learning Restricted Boltzmann Machines
Contrastive Divergence algorithm: to approximate these partial derivatives efficiently, the contrastive divergence (CD) algorithm, in particular k-step contrastive divergence (CD-k), is the standard method for training RBMs.
The core of the CD-k learning algorithm is a special case of Gibbs sampling, a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations that approximate a specified multivariate probability distribution when direct sampling is difficult.
Learning Restricted Boltzmann Machines
Run $k$ Gibbs steps starting from a training example (i.e., sampling from the empirical distribution $\hat P$):
$v^{(0)} \sim \hat P$, $h^{(0)} \sim P(h \mid v^{(0)})$, $v^{(1)} \sim P(v \mid h^{(0)})$, $h^{(1)} \sim P(h \mid v^{(1)})$, $\ldots$, $v^{(k)} \sim P(v \mid h^{(k-1)})$.
It is easy to sample from the conditional distributions because of the structure of the RBM.
Learning Restricted Boltzmann Machines
Using the Gibbs chain, we can approximately compute the gradient of the log-likelihood with respect to $\theta$ at the training sample $v$:
$\frac{\partial \log P(v)}{\partial \theta} \approx -\sum_h P(h \mid v) \frac{\partial E(v, h)}{\partial \theta} + \sum_h P(h \mid v^{(k)}) \frac{\partial E(v^{(k)}, h)}{\partial \theta}$.
Figure: Contrastive Divergence
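Putting the pieces together, a minimal CD-k update for a single training vector, built on the helpers sketched above; the learning rate and other hyperparameters are illustrative choices, not from the slides.

```python
def cd_k_update(v0, W, b, c, k=1, lr=0.01, rng=None):
    """One CD-k update for a binary RBM.

    The log-likelihood gradient w.r.t. W is approximated by the difference
    between the data term <v h'> under P(h | v0) and the reconstruction term
    under P(h | v_k); biases are updated analogously.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    ph0 = p_h_given_v(v0, W, c)          # positive phase
    vk = v0
    for _ in range(k):                   # k Gibbs steps starting from the data
        hk = sample_bernoulli(p_h_given_v(vk, W, c), rng)
        vk = sample_bernoulli(p_v_given_h(hk, W, b), rng)
    phk = p_h_given_v(vk, W, c)          # negative phase
    W += lr * (np.outer(v0, ph0) - np.outer(vk, phk))
    b += lr * (v0 - vk)
    c += lr * (ph0 - phk)
    return W, b, c
```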
Deep generative models
A deep generative model is a generative model with more than two hidden layers. Deep models often have millions of parameters.
There are three kinds of deep generative models: directed, undirected, and mixed. We study the directed and undirected versions briefly, and discuss the mixed model in depth.
Deep directed networks
The bottom layer contains the visible data, and the remaining layers are hidden. If all the nodes are binary, the model is called a sigmoid belief net (R. Neal, 1992).
Figure: Sigmoid belief network
Deep directed networks
In this case, the model defines the following joint distribution:
$P(v, h^{(1)}, \ldots, h^{(l)}) = P(v \mid h^{(1)}) P(h^{(1)} \mid h^{(2)}) \cdots P(h^{(l-1)} \mid h^{(l)}) P(h^{(l)})$,
where
$P(h^{(k)} \mid h^{(k+1)}) = \prod_{i=1}^{n_k} P(h_i^{(k)} \mid h^{(k+1)})$, $P(h_i^{(k)} = 1 \mid h^{(k+1)}) = \mathrm{sigm}(b_i^{(k+1)} + \sum_j W_{ij}^{(k+1)} h_j^{(k+1)})$,
with $h^{(0)} = v$, and
$P(h^{(l)}) = \prod_{i=1}^{n_l} \mathrm{Ber}(h_i^{(l)} \mid w_i^{(l)})$.
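Generating data from this model is straightforward by ancestral sampling, sketched below with the helpers from the RBM section; the argument layout (lists of per-layer weights and biases) is a hypothetical choice for illustration.

```python
def sample_sbn(Ws, bs, w_top, rng):
    """Ancestral sample from a sigmoid belief net.

    w_top: Bernoulli means of the top layer h^(l).
    Ws[k], bs[k]: parameters of P(h^(k) | h^(k+1)), ordered from the visible
    layer upwards, so we iterate over them in reverse.
    """
    h = sample_bernoulli(w_top, rng)               # h^(l) ~ Ber(w^(l))
    for W, b in zip(reversed(Ws), reversed(bs)):   # layers l-1, ..., 1, then v
        h = sample_bernoulli(sigm(b + W @ h), rng)
    return h                                       # the visible sample v
```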
Deep directed networks
Unfortunately, inference in directed models is intractable because the posterior on the hidden nodes is correlated due to explaining away. One can use mean field approximations, but these may not be very accurate, since they approximate the correlated posterior with a factorial posterior.
Remark: explaining away means that a priori independent causes of an event can become dependent given the observation of the event.
Deep Boltzmann Machines
A natural alternative to a directed model is to construct a deep undirected model. For example, we can stack a series of RBMs on top of each other; this is known as a deep Boltzmann machine (DBM) (R. R. Salakhutdinov and G. E. Hinton, 2009).
The energy function of a DBM with $l$ hidden layers is defined as
$E(v, h^{(1)}, \ldots, h^{(l)}) = -\sum_{k=1}^{l} h^{(k-1)T} W^{(k)} h^{(k)} - \sum_{k=0}^{l} b^{(k)T} h^{(k)}$,
where $h^{(0)} = v$. Correspondingly, the joint distribution is defined as
$P(v, h^{(1)}, \ldots, h^{(l)}) = \frac{1}{Z} e^{-E(v, h^{(1)}, \ldots, h^{(l)})}$,
where $Z$ is a partition function.
Deep Boltzmann Machines
The main advantage of a DBM is that one can perform efficient layer-wise Gibbs sampling, since all the nodes in each layer are conditionally independent of each other given the layers above and below. But training undirected models is more difficult, because the partition function $Z$ is intractable to calculate.
Figure: 3-hidden-layer DBM
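A minimal sketch of one such layer-wise Gibbs sweep, again reusing the earlier helpers; the list-based parameter layout is an illustrative assumption.

```python
def dbm_gibbs_sweep(layers, Ws, bs, rng):
    """One layer-wise Gibbs sweep for a binary DBM.

    layers[0] = v, layers[k] = h^(k); Ws[k] (shape n_k x n_{k+1}) connects
    layers[k] and layers[k+1]; bs[k] is the bias of layers[k]. Each layer is
    resampled given only the layers directly above and below it.
    """
    for k in range(len(layers)):
        act = bs[k].copy()
        if k > 0:                              # input from the layer below
            act += Ws[k - 1].T @ layers[k - 1]
        if k < len(layers) - 1:                # input from the layer above
            act += Ws[k] @ layers[k + 1]
        layers[k] = sample_bernoulli(sigm(act), rng)
    return layers
```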
Deep Belief Networks
A deep belief network (DBN) (G. E. Hinton et al., 2006) is a hybrid graphical model consisting of both undirected and directed parts. The top two layers form an undirected graph, an associative memory, which can also be viewed as an RBM. The remaining layers form a directed graph.
Figure: 3-hidden-layer DBN
Deep Belief Networks
A DBN with $l$ layers models the joint distribution between the observed variables $v$ and the hidden layers $h^{(k)}$, $k = 1, \ldots, l$, as follows:
$P(v, h^{(1)}, \ldots, h^{(l)}) = P(v \mid h^{(1)}) \cdots P(h^{(l-2)} \mid h^{(l-1)}) P(h^{(l-1)}, h^{(l)})$,
where
$P(h^{(k)} \mid h^{(k+1)}) = \prod_i P(h_i^{(k)} \mid h^{(k+1)})$, $P(h_i^{(k)} = 1 \mid h^{(k+1)}) = \mathrm{sigm}(b_i^{(k+1)} + \sum_j W_{ij}^{(k+1)} h_j^{(k+1)})$,
and $P(h^{(l-1)}, h^{(l)})$ is an RBM.
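To make the hybrid structure concrete, here is a minimal sketch of drawing an approximate sample from a DBN: Gibbs sampling in the top RBM followed by an ancestral pass down the directed layers. The argument layout and the number of Gibbs steps are illustrative assumptions.

```python
def sample_dbn(W_dir, b_dir, W_top, b_top, c_top, rng, n_gibbs=100):
    """Approximate sample from a DBN.

    (W_top, b_top, c_top): the top-level RBM over (h^(l-1), h^(l)).
    W_dir[k], b_dir[k]: parameters of the directed P(h^(k) | h^(k+1)),
    ordered from the visible layer upwards.
    """
    # alternating Gibbs sampling in the top RBM
    h_low = rng.integers(0, 2, size=b_top.shape[0]).astype(float)
    for _ in range(n_gibbs):
        h_high = sample_bernoulli(sigm(c_top + W_top.T @ h_low), rng)
        h_low = sample_bernoulli(sigm(b_top + W_top @ h_high), rng)
    # directed (ancestral) pass from h^(l-1) down to the visible layer
    h = h_low
    for W, b in zip(reversed(W_dir), reversed(b_dir)):
        h = sample_bernoulli(sigm(b + W @ h), rng)
    return h
```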
Deep Belief Networks Figure: The structure of DBN
Learning Deep Belief Networks
The learning procedure of a DBN can be divided into two stages: pre-training and fine-tuning (G. E. Hinton et al., 2006).
Unsupervised pre-training provides a good initialization of the network.
Supervised fine-tuning uses the up-down algorithm.
We only discuss the unsupervised pre-training algorithm.
Learning Deep Belief Networks
First step: construct an RBM with an input layer $v$ and a hidden layer $h$, and train the RBM.
Second step: stack another hidden layer on top of the RBM to form a new RBM. Fix $W^{(1)}$, sample $h^{(1)}$ from $Q(h^{(1)} \mid v)$ as input, and train $W^{(2)}$ as an RBM.
Third step: continue to stack layers on top of the network and train each as in the previous step, with samples drawn from $Q(h^{(2)} \mid h^{(1)})$. And so on.
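A minimal sketch of this greedy layer-wise loop, built on the cd_k_update helper from the RBM section; the layer sizes, epoch count, and the choice of passing sampled hidden states (rather than probabilities) upwards are illustrative assumptions.

```python
def pretrain_dbn(data, layer_sizes, k=1, lr=0.01, n_epochs=10, rng=None):
    """Greedy layer-wise pre-training: train an RBM on the data, then feed
    samples from Q(h | v) to the next RBM, and so on up the stack.

    data: (N, m) array of binary training vectors.
    layer_sizes: number of hidden units in each successive RBM.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    params, inputs = [], data
    for n_hidden in layer_sizes:
        m = inputs.shape[1]
        W = 0.01 * rng.standard_normal((m, n_hidden))
        b, c = np.zeros(m), np.zeros(n_hidden)
        for _ in range(n_epochs):
            for v0 in inputs:
                W, b, c = cd_k_update(v0, W, b, c, k=k, lr=lr, rng=rng)
        params.append((W, b, c))
        # hidden samples become the "data" for the next RBM
        inputs = np.array([sample_bernoulli(p_h_given_v(v, W, c), rng)
                           for v in inputs])
    return params
```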
Learning Deep Belief Networks
Why does greedy training work? It was shown that the greedy training process maximizes a lower bound of the log-likelihood of the data (G. E. Hinton et al., 2006).
Figure: DBN greedy training
Learning Deep Belief Networks
Let us consider a 2-hidden-layer DBN. A DBN with two hidden layers and tied weights is equivalent to an RBM, i.e. $P(v, h^{(1)} \mid W^{(1)}) = \sum_{h^{(2)}} P(v, h^{(1)}, h^{(2)} \mid W^{(1)})$ is an RBM (G. E. Hinton et al., 2006).
Figure: tied DBN equals RBM
Learning Deep Belief Networks
1. If we constrain $W^{(2)} = W^{(1)T}$, then $P(v, h^{(1)} \mid W^{(1)})$ is an RBM. Find the optimal $\hat W^{(1)}$ such that $\log P(v \mid \hat W^{(1)}) = \max_{W^{(1)}} \log P(v \mid W^{(1)})$ using the CD algorithm. Note that $\max_{W^{(1)}, W^{(2)}} P(v \mid W^{(1)}, W^{(2)}) \geq \max_{W^{(1)}} P(v \mid W^{(1)})$.
2. $P(h^{(1)}, h^{(2)} \mid \hat W^{(1)})$ is an RBM, so untie the weights from $\hat W^{(1)}$ to increase $P(h^{(1)})$. Find the optimal $\hat W^{(2)}$ such that $\log P(h^{(1)} \mid \hat W^{(2)}) = \max_{W^{(2)}} \log P(h^{(1)} \mid W^{(2)})$ using the CD algorithm. This increases the lower bound of the original log-likelihood.
Learning Deep Belief Networks
However, in practice we want to be able to use any number of hidden units in each level. This means we will not be able to initialize the weights so that $W^{(l)} = W^{(l-1)T}$, and this voids the theoretical guarantee. Nevertheless, the method works well in practice.
Auto-Encoders
An auto-encoder (AE) can be used for dimensionality reduction of high-dimensional data. The standard AE is a symmetric multilayer feed-forward network whose desired output is identical to its input; by learning this identity mapping, it extracts unsupervised features.
Figure: An example of AE
Auto-Encoders
After training, an AE can not only generate a hidden representation from an input, but also reconstruct the input from that hidden representation at the output. The part from the input to the middle layer is called the encoder, and the part from the middle layer to the output is called the decoder.
Figure: AE with encoder and decoder
Auto-Encoders
Encoding refers to the process of mapping an input $v \in \mathbb{R}^m$ to a hidden representation $h(v) \in \mathbb{R}^n$ ($n < m$), namely
$h(v) = \sigma(W v + b)$,
where $W \in \mathbb{R}^{n \times m}$ is a weight matrix, $b \in \mathbb{R}^n$ is a bias vector, and $\sigma(\cdot)$ is the sigmoid function.
Decoding refers to the process of mapping a hidden representation $h(v)$ to the output $o$ for reconstructing the input $v$,
$o = \sigma(\tilde W h(v) + \tilde b)$,
where $\tilde W \in \mathbb{R}^{m \times n}$ is a weight matrix and $\tilde b \in \mathbb{R}^m$ is a bias vector.
Auto-Encoders
$\tilde W$ is not necessarily equal to $W^T$. If $\tilde W = W^T$, we say that the weights are tied.
Figure: A single-hidden-layer AE
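A minimal sketch of the encoder and decoder above in NumPy, reusing the sigm helper from the RBM section; tied weights are shown in the usage comment, and the function names and shapes are illustrative.

```python
def ae_encode(v, W, b):
    """Encoder: h(v) = sigm(W v + b), with W of shape (n, m)."""
    return sigm(W @ v + b)

def ae_decode(h, W_dec, b_dec):
    """Decoder: o = sigm(W_dec h + b_dec), with W_dec of shape (m, n)."""
    return sigm(W_dec @ h + b_dec)

# Tied-weight reconstruction of an input v:
#   o = ae_decode(ae_encode(v, W, b), W.T, b_dec)
```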
Stacked AE (SAE)
We can construct a deep AE that has an input layer, $2l - 1$ hidden layers, and an output layer:
$h^{(1)} = \sigma(W^{(1)} v)$,
$h^{(k)} = \sigma(W^{(k)} h^{(k-1)} + b^{(k)})$, $2 \leq k \leq 2l - 1$,
$o = \sigma(W^{(2l)} h^{(2l-1)} + b^{(2l)})$,
where the $W^{(k)}$ are weight matrices and the $b^{(k)}$ are bias vectors.
Figure: A multi-hidden-layer AE
Learning AE and SAE
As a special kind of multilayer neural network, the weights and biases of an AE can be trained using the back-propagation (BP) algorithm. However, in deep cases the BP algorithm may produce very different results from different initializations of the weights and biases, and it generally converges slowly, or even fails to converge at all.
Learning AE and SAE
An effective strategy for alleviating the problem of local minima is the following two-stage training approach (G. E. Hinton and R. R. Salakhutdinov, 2006).
Unsupervised pre-training: take every two adjacent layers as an RBM, with its output as the input of the next higher RBM, and train all these layer-wise RBMs by an appropriate method, such as CD learning.
Supervised fine-tuning: after unsupervised pre-training, fine-tune all the network parameters using the BP algorithm.
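As a hedged sketch of how the pre-trained RBMs can initialize a deep AE before fine-tuning: the RBM weights become the encoder layers and their transposes are unrolled into the decoder, in the spirit of Hinton and Salakhutdinov (2006). The parameter layout follows the pretrain_dbn sketch above and is illustrative, not the exact recipe from the slides.

```python
def init_sae_from_rbms(rbm_params):
    """Unroll greedily pre-trained RBMs into a deep auto-encoder.

    rbm_params: list of (W, b, c) triples, one per RBM, as returned by
    pretrain_dbn. Encoder layers compute h = sigm(W.T x + c); decoder layers
    mirror them with the transposed weights and the visible biases.
    Fine-tuning of all layers then proceeds with back-propagation.
    """
    encoder = [(W.T, c) for (W, b, c) in rbm_params]
    decoder = [(W, b) for (W, b, c) in reversed(rbm_params)]
    return encoder + decoder
```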
Learning AE and SAE Figure: Training process of SAE
Denoising Auto-Encoders
A denoising auto-encoder (DAE) (P. Vincent et al., 2008) is obtained by training an AE in a special way, which gives it some resistance to noise and a more stable reconstruction of the data.
First, it corrupts the raw input $v$ to a partially destroyed version $\tilde v$ by means of a stochastic mapping $\tilde v \sim q_D(\tilde v \mid v)$ (e.g., set 25% of the inputs to 0).
Then, it takes $\tilde v$ as the input and $v$ as the target output to train the AE.
Finally, the output is computed as a function of $\tilde v$ instead of $v$.
Figure: A denoising AE
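A minimal sketch of the masking-noise corruption described above (one common choice of $q_D$); the corruption level and helper name are illustrative.

```python
def corrupt(v, p=0.25, rng=None):
    """Masking noise q_D(v~ | v): set a random fraction p of the inputs to 0."""
    if rng is None:
        rng = np.random.default_rng(0)
    keep = rng.random(v.shape) >= p
    return v * keep

# DAE training target: reconstruct the clean v from the corrupted input,
#   o = ae_decode(ae_encode(corrupt(v), W, b), W_dec, b_dec)
# and minimise the reconstruction error between o and the original v.
```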
Contractive Auto-Encoders
A contractive auto-encoder (CAE) (S. Rifai et al., 2011) is similar in function to a DAE: it is robust to small variations of the training samples. Assume there are $N$ samples. The objective function of a CAE is
$\sum_{i=1}^{N} L(v_i, o_i) + \lambda \sum_{i=1}^{N} \| J_f(v_i) \|_F^2$,
where $\| J_f(v) \|_F^2 = \sum_{j,k} \left( \frac{\partial h_k(v)}{\partial v_j} \right)^2$.
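For the sigmoid encoder $h = \mathrm{sigm}(Wv + b)$ used earlier, the Jacobian has the closed form $\partial h_k / \partial v_j = h_k (1 - h_k) W_{kj}$, so the penalty can be computed cheaply. A minimal sketch, with illustrative names:

```python
def cae_penalty(v, W, b):
    """Contractive penalty ||J_f(v)||_F^2 for h = sigm(W v + b).

    Since dh_k/dv_j = h_k (1 - h_k) W_kj, the squared Frobenius norm factorises
    as sum_k (h_k (1 - h_k))^2 * sum_j W_kj^2.
    """
    h = sigm(W @ v + b)
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))
```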
References
G. E. Hinton, S. Osindero and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation. 18(7). pp.1527-1554. 2006.
G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science. 313(5786). pp.504-507. 2006.
H. Larochelle, D. Erhan, A. Courville, J. Bergstra and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. Proceedings of the 24th International Conference on Machine Learning. pp.473-480. 2007.
R. Neal. Connectionist learning of belief networks. Artificial Intelligence. 56(1). pp.71-113. 1992.
References
S. Rifai, P. Vincent, X. Muller, X. Glorot and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp.833-840. 2011.
R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. International Conference on Artificial Intelligence and Statistics. pp.448-455. 2009.
P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Department of Computer Science, University of Colorado, Boulder. 1986.
P. Vincent, H. Larochelle, Y. Bengio and P. Manzagol. Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th International Conference on Machine Learning. pp.1096-1103. 2008.