NON-LINEAR DIMENSIONALITY REDUCTION USING NEURAL NETWORKS Ruslan Salakhutdinov Joint work with Geoff Hinton University of Toronto, Machine Learning Group

Overview Document Retrieval: Present layer-by-layer pretraining and the fine-tuning of the multi-layer network that discovers binary codes in the top layer. This allows us to significantly speed up retrieval. We also show how our model can be used to allow retrieval in constant time (a time independent of the number of documents). Show how to perform nonlinear embedding by preserving class neighbourhood structure (supervised and semi-supervised).

Motivation For the document retrieval task, we want to retrieve a small set of documents relevant to a given query. A popular and widely used text retrieval algorithm in practice is based on the TF-IDF (term frequency / inverse document frequency) word-weighting heuristic. Drawbacks: it computes document similarity directly in the word-count space, and it does not capture high-order correlations between words in a document. We want to extract semantic structure (topics) from documents. Latent Semantic Analysis is a simple and widely used linear method. A network with multiple hidden layers and many more parameters should be able to discover latent representations that work much better for retrieval.
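
(For reference, a minimal sketch of the TF-IDF baseline described above; the toy corpus, query, and the particular IDF smoothing are illustrative assumptions, not from the talk.)

```python
# Minimal TF-IDF retrieval baseline (illustrative sketch, not the talk's code).
import math
from collections import Counter

docs = ["the cat sat on the mat",            # toy corpus: placeholder documents
        "dogs and cats play outside",
        "stock markets fell sharply today"]
query = "cat on a mat"

vocab = sorted({w for d in docs for w in d.split()})
df = Counter(w for d in docs for w in set(d.split()))   # document frequencies
N = len(docs)

def tfidf(text):
    """Represent text as a TF-IDF vector over the corpus vocabulary."""
    tf = Counter(text.split())
    return [tf[w] * math.log((1 + N) / (1 + df[w])) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# Similarity is computed directly in the (weighted) word-count space,
# which is exactly the limitation pointed out above.
q = tfidf(query)
ranking = sorted(range(N), key=lambda i: -cosine(q, tfidf(docs[i])))
print(ranking)
```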

Model [Figure: the deep generative model, its recursive pretraining, and fine-tuning. A stack of RBMs over 2000-dimensional word-count vectors is unrolled into a deep autoencoder whose 128-unit code layer receives additive Gaussian noise during fine-tuning; the top layer produces the binary codes.]

Document Retrieval: 20 newsgroup corpus [Three slides showing the autoencoder's 2-D topic space: first asking where talk.politics.guns should go, then where alt.atheism should go, and finally showing both placed among the clusters for talk.politics.mideast, rec.sport.hockey, talk.religion.misc, sci.cryptography, comp.graphics, and misc.forsale.]

Constrained Poisson Model [Figure: binary latent topic features h connected by weights W to a constrained Poisson visible layer v; the observed distribution over words is matched to the modeled distribution over words through an N·softmax.] Hidden units are binary and the visible word counts are modeled by a constrained Poisson model. The conditional distributions over hidden and visible units are: p(v_i = n | h) = Poisson(n, N · exp(λ_i + Σ_j h_j W_ij) / Σ_k exp(λ_k + Σ_j h_j W_kj)) and p(h_j = 1 | v) = σ(b_j + Σ_i W_ij v_i), where N is the total length of the document and σ(x) = 1/(1 + e^{-x}) is the logistic function.
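
(A minimal numerical sketch of these conditionals; the weight matrix W, word biases λ, hidden biases b, and the layer sizes below are illustrative assumptions, not values from the talk.)

```python
# Sketch of the constrained Poisson RBM conditionals (names/sizes are assumptions).
import numpy as np

rng = np.random.default_rng(0)
V, H = 2000, 500                      # vocabulary size, number of binary topic features
W = 0.01 * rng.standard_normal((V, H))
lam = np.zeros(V)                     # visible (word) biases
b = np.zeros(H)                       # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v):
    """p(h_j = 1 | v): binary hidden units driven by the word counts."""
    return sigmoid(b + v @ W)

def poisson_rates(h, N):
    """Constrained Poisson rates N * softmax(lam_i + sum_j h_j W_ij);
    by construction the rates sum to the document length N."""
    a = lam + W @ h
    p = np.exp(a - a.max())
    return N * p / p.sum()

v = rng.poisson(0.05, size=V).astype(float)          # toy document: word counts
N = v.sum()                                          # total document length
h = (p_h_given_v(v) > rng.random(H)).astype(float)   # sample binary topic features
print(N, poisson_rates(h, N).sum())                  # the two numbers agree
```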

Learning Multiple Layers - Pretraining A single layer of binary features generally cannot perfectly model the structure in the data. Perform greedy, layer-by-layer learning: Learn and freeze W^1 using the Constrained Poisson Model. Treat the existing feature detectors, driven by the training data, as if they were data. Learn and freeze W^2. Proceed with recursive greedy learning as many times as desired. Under certain conditions, adding an extra layer always improves a lower bound on the log probability of the data (in our case, these conditions are violated). [Figure: recursive learning of a stack of RBMs between the 2000-dimensional input and the 128-dimensional code.] Each layer of features captures strong, high-order correlations between the activities of units in the layer below.
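
(A sketch of the recursive greedy procedure. Plain binary-binary RBMs trained with one-step contrastive divergence stand in for every layer; the talk's first layer is the constrained Poisson model above, and the layer sizes, learning rate, and toy data here are assumptions.)

```python
# Greedy layer-by-layer pretraining sketch with binary RBMs and CD-1
# (a stand-in: the talk's first layer is the constrained Poisson model).
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, epochs=5, lr=0.05):
    """One-step contrastive divergence for a binary-binary RBM."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)            # visible biases
    b = np.zeros(n_hidden)             # hidden biases
    for _ in range(epochs):
        v0 = data
        ph0 = sigmoid(b + v0 @ W)
        h0 = (ph0 > rng.random(ph0.shape)).astype(float)
        v1 = sigmoid(a + h0 @ W.T)     # mean-field reconstruction
        ph1 = sigmoid(b + v1 @ W)
        W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(data)
        a += lr * (v0 - v1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

# Assumed layer sizes; toy binary "word" data stands in for real documents.
layer_sizes = [500, 250, 128]
data = (rng.random((100, 2000)) < 0.02).astype(float)
stack = []
for n_hidden in layer_sizes:
    W, a, b = train_rbm_cd1(data, n_hidden)
    stack.append((W, a, b))
    data = sigmoid(b + data @ W)       # frozen features become the next layer's "data"
```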

Learning Binary Representations [Figure: X = input distribution; convolving with Gaussian noise gives X_n = X + noise; Y = F(X_n) is the resulting output distribution.]

Document Retrieval: 20 Newsgroup Corpus We use a 2000-...-128 autoencoder to convert a document into a 128-bit vector. We corrupted the input signal to the code layer with Gaussian noise. [Figure: empirical distributions of the 128 code units before and after fine-tuning.] After fine-tuning, we binarize the codes at 0.1.
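
(A small sketch of the noise injection and the 0.1 threshold, assuming a logistic code layer; the noise level and array shapes are illustrative assumptions.)

```python
# Noise injection at the code layer and binarization at 0.1 (illustrative values).
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Pre-sigmoid inputs to the 128 code units for a small batch of documents (toy numbers).
code_inputs = 4.0 * rng.standard_normal((4, 128))

# During fine-tuning, Gaussian noise is added to the code-unit inputs; to transmit
# information through the noise, the units are driven toward saturated (near 0/1) values.
noisy_codes = sigmoid(code_inputs + rng.normal(0.0, 4.0, size=code_inputs.shape))

# After fine-tuning, activities are thresholded at 0.1 to obtain 128-bit codes.
binary_codes = (sigmoid(code_inputs) > 0.1).astype(np.uint8)
print(binary_codes.shape)    # (4, 128)
```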

Learning Binary Representations [Figure: 2-dimensional embedding of 128-bit codes using SNE for the 20 Newsgroup data (left panel, with clusters such as rec.sport.hockey, talk.politics.guns, comp.graphics, talk.politics.mideast, sci.cryptography, soc.religion.christian) and the Reuters RCV2 corpus (right panel, with clusters such as European Community Monetary/Economic, Disasters and Accidents, Government Borrowing, Energy Markets, Accounts/Earnings).]

Semantic Address Space We could instead learn how to convert a document into a 20-bit code. We then have the ultimate retrieval tool: given a query document, compute its 20-bit address and retrieve all of the documents stored at similar addresses with no search at all. Essentially, we learn a semantic hashing table. We could also retrieve similar documents by looking at a Hamming ball of some radius, for example 4. The retrieved documents could then be given to a slower but more precise retrieval method, such as TF-IDF.
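
(A minimal sketch of semantic hashing with a Hamming-ball lookup of radius 4, assuming documents have already been mapped to 20-bit codes; the random codes below are stand-ins.)

```python
# Semantic hashing sketch: 20-bit addresses and a Hamming-ball lookup of radius 4.
import numpy as np
from collections import defaultdict
from itertools import combinations

rng = np.random.default_rng(3)

n_docs, n_bits = 10000, 20
codes = rng.integers(0, 2, size=(n_docs, n_bits), dtype=np.uint8)   # stand-in 20-bit codes

def to_int(bits):
    return int("".join(map(str, bits)), 2)

# Hash table: 20-bit address -> ids of the documents stored there.
table = defaultdict(list)
for doc_id, bits in enumerate(codes):
    table[to_int(bits)].append(doc_id)

def retrieve(query_bits, radius=4):
    """Return every document whose address is within `radius` bit flips of the query."""
    q = to_int(query_bits)
    hits = list(table.get(q, []))
    for r in range(1, radius + 1):
        for flips in combinations(range(n_bits), r):
            addr = q
            for i in flips:
                addr ^= 1 << (n_bits - 1 - i)
            hits.extend(table.get(addr, []))
    return hits

# The shortlist could then be re-ranked by a slower method such as TF-IDF.
print(len(retrieve(codes[0], radius=4)))
```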

Results [Figure: accuracy vs. number of retrieved documents (1 to 1023) on the Reuters RCV2 corpus. Left: LSA 128, fine-tuned 128-bit codes, and 128-bit codes prior to fine-tuning. Middle: LSA 128, TF-IDF, and TF-IDF using 128-bit codes for prefiltering. Right: TF-IDF, TF-IDF using 20 bits, and TF-IDF using 20 and 128 bits.]

Learning nonlinear embedding We tackle the problem of learning a similarity measure, or distance metric, over the input space X. Given a distance metric d (e.g. Euclidean), we can measure the similarity between two input vectors x_1 and x_2 by computing d[f(x_1|W), f(x_2|W)]. f(x|W) is a function mapping input vectors in X into a feature space Y and is parametrized by W. [Figure: two identical multi-layer networks map x_1 and x_2 from the 2000-dimensional input to 10-dimensional codes whose distance d[f(x_1), f(x_2)] is compared.]

Learning nonlinear embedding Most of the previous algorithms studied the case where d is the Euclidean measure and f is a simple linear projection f(x|W) = Wx. The Euclidean distance in the feature space is then the Mahalanobis distance in the input space. See Goldberger et al. 2004, Globerson and Roweis 2005, Kilian et al. 2005. We have a set of N labeled training data vectors (x_a, c_a), a = 1, ..., N, where c_a ∈ {1, ..., C}. For each training vector x_a, define the probability that point a selects one of its neighbours b in the transformed feature space as: p_ab = exp(-d_ab) / Σ_{z≠a} exp(-d_az), where d_ab = ||f(x_a|W) - f(x_b|W)||², and f(·|W) is a multi-layer perceptron.

Learning nonlinear embedding The probability that point a belongs to class k is: p(c_a = k) = Σ_{b: c_b = k} p_ab. Maximize the expected number of correctly classified points on the training data: O_NCA = Σ_{a=1}^{N} Σ_{b: c_a = c_b} p_ab. One could alternatively minimize the KL divergence between the induced distribution and a "true" distribution that places probability on b if c_a = c_b and zero if c_a ≠ c_b. This induces the following objective function to maximize: O_KL = Σ_{a=1}^{N} log( Σ_{b: c_a = c_b} p_ab ). By considering a linear perceptron f(x|W) = Wx we arrive at linear NCA.
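
(A small sketch of the neighbour probabilities p_ab and the two objectives O_NCA and O_KL; the toy features below stand in for the code-layer outputs f(x|W).)

```python
# Sketch of the NCA quantities: p_ab, O_NCA and O_KL (toy features stand in for f(x|W)).
import numpy as np

rng = np.random.default_rng(4)

features = rng.standard_normal((6, 2))        # f(x_a|W) for six points in a 2-D code space
labels = np.array([0, 0, 1, 1, 2, 2])

# Pairwise squared distances d_ab = ||f(x_a|W) - f(x_b|W)||^2.
diff = features[:, None, :] - features[None, :, :]
d = (diff ** 2).sum(-1)

# p_ab = exp(-d_ab) / sum_{z != a} exp(-d_az), with p_aa = 0.
expd = np.exp(-d)
np.fill_diagonal(expd, 0.0)
p = expd / expd.sum(axis=1, keepdims=True)

same = labels[:, None] == labels[None, :]
np.fill_diagonal(same, False)

O_NCA = (p * same).sum()                       # expected number of correctly classified points
O_KL = np.log((p * same).sum(axis=1)).sum()    # alternative log-probability objective
print(O_NCA, O_KL)
```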

Results [Figure: 2-D embeddings learned with nonlinear NCA, with clusters labelled by the digits 0-9 and by newsgroup classes including atheism, sci.cryptography, religion.christian, sci.space, rec.autos, comp.hardware, rec.hockey, and comp.windows.]

Results [Figure, left: error (%) vs. number of nearest neighbours (1 to 7) for Nonlinear NCA 30D, Linear NCA 30D, Autoencoder 30D, and PCA 30D. Right: accuracy (%) vs. number of retrieved documents (1 to 7531) for Nonlinear NCA + Auto 10D, Nonlinear NCA 10D, Linear NCA 10D, Autoencoder 10D, LDA 10D, and LSA 10D.]

Semi-supervised Extension [Figure: the encoder maps the 2000-dimensional input to a 10-dimensional code, with Gaussian noise (+ε) added at the layers and an NCA cost applied at the code level, while the decoder reconstructs the input.] The combined objective we maximize is C = λ · O_NCA + (1 - λ) · (-E), where E is the reconstruction error. So the derivative of the reconstruction error E is backpropagated through the autoencoder and combined with the derivatives of NCA at the code level.
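
(A one-line sketch of the combined objective C = λ·O_NCA + (1 - λ)·(-E); the trade-off λ and the numerical values below are assumed placeholders, not settings from the talk.)

```python
# Combined semi-supervised objective C = lam * O_NCA + (1 - lam) * (-E)  (sketch).
def combined_objective(O_NCA, reconstruction_error, lam=0.99):
    """lam trades off the NCA term against the autoencoder's reconstruction error E;
    the value 0.99 is an assumed placeholder, not a setting from the talk."""
    return lam * O_NCA - (1.0 - lam) * reconstruction_error

print(combined_objective(O_NCA=42.0, reconstruction_error=310.5))   # toy numbers
```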

Semi-supervised Extension [Figure: Newsgroup dataset with 5% of the labels. Accuracy (%) vs. number of retrieved documents (1 to 7531) for Nonlinear NCA + Auto 10D, Nonlinear NCA 10D, Linear NCA 10D, and Autoencoder 10D.]