NON-LINEAR DIMENSIONALITY REDUCTION USING NEURAL NETWORKS Ruslan Salakhutdinov Joint work with Geoff Hinton University of Toronto, Machine Learning Group
Overview Document Retrieval: Present layer-by-layer pretraining and fine-tuning of a multi-layer network that discovers binary codes in the top layer. This allows us to significantly speed up retrieval. We also show how we can use our model to allow retrieval in constant time (a time independent of the number of documents). Show how to perform nonlinear embedding by preserving class neighbourhood structure (supervised, semi-supervised).
Motivation For the document retrieval task, we want to retrieve a small set of documents relevant to a given query. A popular and widely used text retrieval algorithm is based on the TF-IDF (term frequency / inverse document frequency) word-weighting heuristic. Drawbacks: it computes document similarity directly in the word-count space, and it does not capture high-order correlations between words in a document. We want to extract semantic structure (topics) from documents. Latent Semantic Analysis is a simple and widely-used linear method. A network with multiple hidden layers and with many more parameters should be able to discover latent representations that work much better for retrieval.
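As a point of reference, here is a minimal sketch of the TF-IDF baseline and of computing document similarity directly in the (weighted) word-count space. The toy vocabulary and count matrix are illustrative assumptions, not the corpora used in the experiments.

```python
import numpy as np

def tfidf_matrix(counts):
    """counts: (num_docs, vocab_size) word-count matrix.
    Returns TF-IDF weighted document vectors."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)                      # document frequency per word
    idf = np.log(counts.shape[0] / np.maximum(df, 1))  # inverse document frequency
    return tf * idf

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Toy example: similarity is measured directly in the weighted word-count space,
# so two documents using different words for the same topic look dissimilar.
counts = np.array([[3, 0, 1, 0],
                   [0, 2, 0, 1],
                   [2, 1, 0, 0]], dtype=float)
X = tfidf_matrix(counts)
print(cosine_similarity(X[0], X[2]))
```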
Model Figure: three panels showing The Deep Generative Model, Recursive Pretraining (a stack of RBMs from the 2000-unit word-count layer up to the 128-unit top layer of binary codes), and Fine-tuning (the unrolled autoencoder, with Gaussian noise added to the inputs of the 128-unit code layer).
Document Retrieval: 20 newsgroup corpus Figure: autoencoder 2-D topic space with clusters for talk.politics.mideast, rec.sport.hockey, talk.religion.misc, sci.cryptography, comp.graphics, misc.forsale. Where should talk.politics.guns go?
Document Retrieval: 20 newsgroup corpus Figure: the same 2-D topic space, now including talk.politics.guns. Where should alt.atheism go?
Document Retrieval: 20 newsgroup corpus Figure: the final autoencoder 2-D topic space with talk.politics.mideast, alt.atheism, talk.politics.guns, rec.sport.hockey, talk.religion.misc, sci.cryptography, comp.graphics, misc.forsale.
Constrained Poisson Model Figure: an RBM with binary latent topic features h over a constrained Poisson visible layer v, relating the observed distribution over words to the modeled distribution over words. Hidden units are binary and the visible word counts are modeled by a constrained Poisson model. The conditional distributions over visible and hidden units are:
$$p(v_i = n \mid \mathbf{h}) = \mathrm{Poisson}\!\left(n,\; N\,\frac{\exp\big(\lambda_i + \sum_j h_j w_{ij}\big)}{\sum_k \exp\big(\lambda_k + \sum_j h_j w_{kj}\big)}\right), \qquad p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_i w_{ij} v_i\Big),$$
where $\mathrm{Poisson}(n, \lambda) = e^{-\lambda}\lambda^n / n!$, $N$ is the total length of the document, $\lambda_i$ and $b_j$ are biases, and $\sigma(x) = 1/(1+e^{-x})$.
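A minimal numpy sketch of these two conditionals, assuming a weight matrix W, visible biases lam, and hidden biases b; the variable names, sizes, and random toy document are illustrative, not taken from the talk.

```python
import numpy as np

def hidden_given_visible(v, W, b):
    """p(h_j = 1 | v): sigmoid of b_j + sum_i w_ij v_i."""
    return 1.0 / (1.0 + np.exp(-(b + v @ W)))

def poisson_rates_given_hidden(h, W, lam, N):
    """Mean word counts under p(v | h): a softmax over words scaled by document length N."""
    logits = lam + h @ W.T            # lam_i + sum_j h_j w_ij for each word i
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return N * p / p.sum()            # rates sum to N, the total document length

# Toy dimensions: 2000-word vocabulary, 500 binary topic features (illustrative).
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((2000, 500))
lam, b = np.zeros(2000), np.zeros(500)
v = rng.poisson(0.05, size=2000).astype(float)   # a fake word-count vector
h_prob = hidden_given_visible(v, W, b)
rates = poisson_rates_given_hidden((h_prob > 0.5).astype(float), W, lam, v.sum())
```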
Learning Multiple Layers - Pretraining A single layer of binary features generally cannot perfectly model the structure in the data. Perform greedy, layer-by-layer learning: learn and freeze the first layer of weights using the Constrained Poisson Model. Treat the existing feature detectors, driven by the training data, as if they were data. Learn and freeze the next layer. Proceed with recursive greedy learning as many times as desired (see the sketch below). Under certain conditions, adding an extra layer always improves a lower bound on the log probability of the data (in our case, these conditions are violated). Figure: recursive learning of a stack of RBMs from the 2000-unit input layer to the 128-unit top layer. Each layer of features captures strong high-order correlations between the activities of units in the layer below.
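A schematic sketch of the greedy recursive loop. The train_rbm helper stands in for one-layer RBM training (e.g. contrastive divergence) and, like the layer sizes, is an assumption for illustration.

```python
import numpy as np

def train_rbm(data, num_hidden):
    """Placeholder for training one RBM layer (e.g. by contrastive divergence).
    Returns (weights, transform), where transform maps data to hidden probabilities."""
    num_visible = data.shape[1]
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((num_visible, num_hidden))
    b = np.zeros(num_hidden)
    # ... CD updates of W and b would go here ...
    transform = lambda x: 1.0 / (1.0 + np.exp(-(b + x @ W)))
    return W, transform

def greedy_pretrain(data, layer_sizes):
    """Greedy layer-by-layer pretraining: learn a layer, freeze it, and treat its
    hidden activities as if they were data for the next layer."""
    weights, activities = [], data
    for size in layer_sizes:                # e.g. [500, 500, 128] on 2000-dim counts
        W, transform = train_rbm(activities, size)
        weights.append(W)                   # freeze this layer
        activities = transform(activities)  # feature detectors become the new "data"
    return weights

stack = greedy_pretrain(np.random.rand(100, 2000), [500, 500, 128])
```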
Learning Binary Representations Figure: the input distribution X to a code unit is convolved with Gaussian noise, giving X_n = X + noise; the resulting output distribution Y = F(X_n) of the code unit is pushed towards 0 and 1.
Document Retrieval: 20 Newsgroup Corpus We use a 2000---128 autoencoder to convert a document into a 128-bit vector. We corrupt the input signal to the code layer with Gaussian noise. Figure: empirical distributions of the 128 code units before (left) and after (right) fine-tuning; after fine-tuning, the activities pile up near 0 and 1. After fine-tuning, we binarize the codes by thresholding at 0.1.
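A small sketch of the noise-and-threshold idea at the code layer. Only the 0.1 threshold comes from the slide; the noise level, the logistic form of the code unit, and the toy pre-activations are assumptions.

```python
import numpy as np

def code_layer(pre_activations, noise_std=4.0, deterministic=False):
    """Logistic code units whose inputs are corrupted with Gaussian noise during
    training, which pressures the activities towards 0 or 1 to survive the noise."""
    x = pre_activations
    if not deterministic:
        x = x + np.random.normal(0.0, noise_std, size=x.shape)
    return 1.0 / (1.0 + np.exp(-x))

def binarize(codes, threshold=0.1):
    """After fine-tuning, threshold the code activities at 0.1 to get 128-bit vectors."""
    return (codes > threshold).astype(np.uint8)

activities = code_layer(np.random.randn(5, 128), deterministic=True)
bits = binarize(activities)
```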
Learning Binary Representations Figure: 2-dimensional embedding of the 128-bit codes using SNE for the 20 Newsgroup data (left panel: rec.sport.hockey, talk.politics.guns, comp.graphics, talk.politics.mideast, sci.cryptography, soc.religion.christian) and the Reuters RCV2 corpus (right panel: European Community Monetary/Economic, Disasters and Accidents, Government Borrowing, Energy Markets, Accounts/Earnings).
Semantic Address Space We could instead learn how to convert a document into a 20-bit code. We then have the ultimate retrieval tool: given a query document, compute its 20-bit address and retrieve all of the documents stored at similar addresses with no search at all. Essentially, we learn a semantic hashing table. We could also retrieve similar documents by looking at a Hamming ball of radius, for example, 4. The retrieved documents could then be given to a slower but more precise retrieval method, such as TF-IDF.
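A sketch of how such a semantic hashing table could be used at query time, assuming 20-bit integer addresses have already been computed for every document; the helper names and toy codes are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def build_hash_table(doc_codes):
    """doc_codes: {doc_id: 20-bit integer address}. Returns address -> list of doc ids."""
    table = defaultdict(list)
    for doc_id, code in doc_codes.items():
        table[code].append(doc_id)
    return table

def hamming_ball(code, radius, num_bits=20):
    """All addresses within the given Hamming distance of code."""
    addresses = [code]
    for r in range(1, radius + 1):
        for bits in combinations(range(num_bits), r):
            flipped = code
            for b in bits:
                flipped ^= 1 << b
            addresses.append(flipped)
    return addresses

def retrieve(query_code, table, radius=4):
    """Probe every address in the Hamming ball and collect the documents stored there.
    The number of probes depends only on the code length and radius, not on the corpus
    size; a slower method such as TF-IDF can then re-rank this short candidate list."""
    candidates = []
    for address in hamming_ball(query_code, radius):
        candidates.extend(table.get(address, []))
    return candidates

table = build_hash_table({0: 0b1010, 1: 0b1011, 2: 0b0000})
print(retrieve(0b1010, table, radius=1))
```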
Results Figure: retrieval accuracy versus number of retrieved documents (1 to 1023) on the Reuters RCV2 corpus. First panel: LSA 128 vs. fine-tuned 128-bit codes vs. 128-bit codes prior to fine-tuning. Second panel: LSA 128 vs. TF-IDF vs. TF-IDF using 128-bit codes for prefiltering. Third panel: TF-IDF vs. TF-IDF using 20 bits vs. TF-IDF using 20 and 128 bits.
Learning nonlinear embedding We tackle the problem of learning a similarity measure, or distance metric, over the input space. Given a distance metric $d$ (e.g. Euclidean), we can measure the similarity between two input vectors $x_1$ and $x_2$ by computing $d[f(x_1), f(x_2)]$, where $f$ is a function mapping input vectors into a feature space and is parameterized by $W$. Figure: two copies of the same multi-layer network map $x_1$ and $x_2$ from the 2000-dimensional input to a low-dimensional code, where $d[f(x_1), f(x_2)]$ is computed.
Learning nonlinear embedding Most of the previous algorithms studied the case where $d$ is the Euclidean measure and $f$ is a simple linear projection, $f(x) = Wx$; the Euclidean distance in the feature space is then the Mahalanobis distance in the input space. See Goldberger et al. 2004, Globerson and Roweis 2005, Kilian et al. 2005. We have a set of $N$ labeled training vectors $(x^a, c^a)$, $a = 1, \dots, N$, where $c^a \in \{1, \dots, C\}$. For each training vector $x^a$, define the probability that point $a$ selects point $b$ as one of its neighbours in the transformed space as:
$$p_{ab} = \frac{\exp(-d_{ab}^2)}{\sum_{z \neq a} \exp(-d_{az}^2)}, \qquad d_{ab} = \big\| f(x^a \mid W) - f(x^b \mid W) \big\|,$$
where $f(\cdot \mid W)$ is a multi-layer perceptron.
Learning nonlinear embedding The probability that point $a$ belongs to class $k$ is:
$$p(c^a = k) = \sum_{b\,:\, c^b = k} p_{ab}.$$
Maximize the expected number of correctly classified points on the training data:
$$O_{NCA} = \sum_{a=1}^{N} \sum_{b\,:\, c^a = c^b} p_{ab}.$$
One could alternatively minimize the KL divergence between the true class distribution and the predicted one, which induces the following objective function to maximize:
$$O_{KL} = \sum_{a=1}^{N} \log\Big( \sum_{b\,:\, c^a = c^b} p_{ab} \Big).$$
By considering a linear perceptron $f(x) = Wx$, we arrive at linear NCA.
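A small numpy sketch of the $O_{NCA}$ objective evaluated on pre-computed features $f(x)$; backpropagation through the multi-layer perceptron itself is omitted, and the toy features and labels are illustrative assumptions.

```python
import numpy as np

def nca_objective(features, labels):
    """O_NCA: expected number of correctly classified points, where point a selects
    neighbour b with probability proportional to exp(-||f(x_a) - f(x_b)||^2)."""
    # Squared pairwise distances in the transformed (feature) space.
    sq = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    logits = -sq
    np.fill_diagonal(logits, -np.inf)          # a point never selects itself
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)          # p_ab, each row sums to 1
    same_class = labels[:, None] == labels[None, :]
    return (p * same_class).sum()              # sum_a sum_{b: c_a = c_b} p_ab

features = np.random.randn(6, 2)               # pretend f(x) outputs, 2-D codes
labels = np.array([0, 0, 1, 1, 2, 2])
print(nca_objective(features, labels))
```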
Results Figure: 2-D embeddings with clusters labelled by digit classes 0-9 and by the newsgroup classes atheism, religion.christian, sci.cryptography, sci.space, rec.autos, rec.hockey, comp.hardware, comp.windows.
Results Figure: left panel, classification error (%) versus number of nearest neighbours (1-7) for Nonlinear NCA 30D, Linear NCA 30D, Autoencoder 30D, and PCA 30D; right panel, retrieval accuracy (%) versus number of retrieved documents (1 to 7531) for Nonlinear NCA + Auto 10D, Nonlinear NCA 10D, Linear NCA 10D, Autoencoder 10D, LDA 10D, and LSA 10D.
Semi-supervised Extension Figure: an autoencoder (2000-dimensional input, 10-dimensional code layer) with the NCA cost attached at the code layer. The combined objective we maximize is:
$$C = \lambda\, O_{NCA} - (1 - \lambda)\, E,$$
where $E$ is the reconstruction error. The derivative of the reconstruction error $E$ is backpropagated through the autoencoder and combined with the derivatives of $O_{NCA}$ at the code level.
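A schematic sketch of this combined objective, reusing the nca_objective function above. The encoder/decoder placeholders, the lambda value, and the convention that a label of -1 marks an unlabeled document (so it contributes only to the reconstruction term) are illustrative assumptions.

```python
import numpy as np

def combined_objective(x, labels, encode, decode, lam=0.99):
    """Semi-supervised objective: lam * O_NCA on the code-layer activities minus
    (1 - lam) * E, where E is the autoencoder's reconstruction error."""
    codes = encode(x)                                   # code-layer activities
    recon_error = ((decode(codes) - x) ** 2).sum()      # E
    labeled = labels >= 0                               # -1 marks unlabeled points
    nca = nca_objective(codes[labeled], labels[labeled]) if labeled.any() else 0.0
    return lam * nca - (1.0 - lam) * recon_error

# Toy usage with trivial encoder/decoder placeholders standing in for the network.
x = np.random.randn(8, 2000)
labels = np.array([0, 0, 1, 1, -1, -1, -1, -1])
value = combined_objective(x, labels,
                           encode=lambda z: z[:, :10],
                           decode=lambda c: np.tile(c, 200))
```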
Semi-supervised Extension Figure: Newsgroup dataset with 5% labels; retrieval accuracy (%) versus number of retrieved documents (1 to 7531) for Nonlinear NCA + Auto 10D, Nonlinear NCA 10D, Linear NCA 10D, and Autoencoder 10D.