Semi-Supervised Learning in Gigantic Image Collections. Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT)


Gigantic Image Collections. What does the world look like? High-level object recognition and image statistics for large-scale image search.

Spectrum of Label Information: human annotations, noisy labels, unlabeled data.

Semi-Supervised Learning. [Figure: data, supervised solution, semi-supervised solution.] The classification function should be smooth with respect to the data density.

Semi-Supervised Learning using Graph Laplacian. $W$ is the $n \times n$ affinity matrix ($n$ = # of points): $W_{ij} = \exp(-\|x_i - x_j\|^2 / 2\epsilon^2)$ [Zhu03, Zhou04]. Graph Laplacian: $L = I - D^{-1/2} W D^{-1/2}$, where $D_{ii} = \sum_j W_{ij}$.
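To make this construction concrete, here is a small dense sketch of the affinity matrix and normalized graph Laplacian; the function name and the dense formulation are mine, and eps is the kernel bandwidth from the formula above.

```python
import numpy as np

def normalized_laplacian(X, eps):
    """X: (n, d) data matrix. Returns W and L = I - D^-1/2 W D^-1/2."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.exp(-d2 / (2 * eps ** 2))                       # W_ij = exp(-||x_i - x_j||^2 / 2 eps^2)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))              # from D_ii = sum_j W_ij
    L = np.eye(len(X)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    return W, L
```

This dense version is only feasible for small $n$; the point of the talk is precisely that it does not scale to 80 million images.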

SSL using Graph Laplacian. Want to find the label function $f$ that minimizes $f^T L f + (f - y)^T \Lambda (f - y)$ (smoothness plus agreement with the labels), where $y$ are the labels and $\Lambda_{ii} = \lambda$ if point $i$ is labeled, $\Lambda_{ii} = 0$ otherwise. Solution: an $n \times n$ system ($n$ = # points).
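Setting the gradient of the objective to zero gives $(L + \Lambda) f = \Lambda y$. A minimal sketch of that exact solve (only viable for small $n$; the helper name and the default value of $\lambda$ are mine):

```python
import numpy as np

def ssl_exact(L, y, labeled_mask, lam=10.0):
    """Minimize f^T L f + (f - y)^T Lambda (f - y) via (L + Lambda) f = Lambda y."""
    Lam = np.diag(np.where(labeled_mask, lam, 0.0))   # Lambda_ii = lambda if labeled, else 0
    return np.linalg.solve(L + Lam, Lam @ y)          # n x n system
```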

Eigenvectors of Laplacian. Smooth vectors will be linear combinations of the eigenvectors $U = [\phi_1, \ldots, \phi_k]$ with small eigenvalues: $f = U\alpha$ [Belkin & Niyogi 06, Schölkopf & Smola 02, Zhu et al. 03, 08].

Rewrite System. Let $f = U\alpha$, where $U$ = smallest $k$ eigenvectors of $L$ and $\alpha$ = coefficients; $k$ is a user parameter (typically ~100). The optimal $\alpha$ is now the solution of a $k \times k$ system: $(\Sigma + U^T \Lambda U)\alpha = U^T \Lambda y$.
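With $f = U\alpha$ and $LU = U\Sigma$ (orthonormal $U$), the objective becomes $\alpha^T \Sigma \alpha + (U\alpha - y)^T \Lambda (U\alpha - y)$, whose minimizer is the $k \times k$ system on this slide. A sketch, assuming $\Sigma$ is passed as the vector of the $k$ smallest eigenvalues:

```python
import numpy as np

def ssl_reduced(U, eigvals, y, labeled_mask, lam=10.0):
    """Solve (Sigma + U^T Lambda U) alpha = U^T Lambda y and return f = U alpha."""
    Lam = np.where(labeled_mask, lam, 0.0)            # diagonal of Lambda, as a vector
    A = np.diag(eigvals) + U.T @ (U * Lam[:, None])   # Sigma + U^T Lambda U   (k x k)
    b = U.T @ (Lam * y)                               # U^T Lambda y
    alpha = np.linalg.solve(A, b)
    return U @ alpha
```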

Computational Bottleneck. Consider a dataset of 80 million images. Inverting $L$ means inverting an 80 million × 80 million matrix; finding the eigenvectors of $L$ means diagonalizing an 80 million × 80 million matrix.

Large-Scale SSL: Related Work. Nyström method: pick a small set of landmark points, compute exact eigenvectors on these, and interpolate the solution to the rest [see Zhu 08 survey]. Other approaches include mixture models (Zhu and Lafferty 05), sparse grids (Garcke and Griebel 05), and sparse graphs (Tsang and Kwok 06).

Our Approach

Overview of Our Approach. Compute approximate eigenvectors. Nyström reduces $n$ to a small set of landmarks and is polynomial in the number of landmarks; ours takes the limit as $n \to \infty$ (working with the data density) and is linear in the number of data points.

Consider Limit as $n \to \infty$. Consider $x$ to be drawn from a 2D distribution $p(x)$. Let $L_p(F)$ be a smoothness operator on $p(x)$ for a function $F(x)$; the smoothness operator penalizes functions that vary in areas of high density. Analyze the eigenfunctions of $L_p(F)$.

Eigenvectors & Eigenfunctions

Key Assumption: Separability of Input Data. Claim: if $p$ is separable, i.e. $p(x_1, x_2) = p(x_1)\,p(x_2)$, then eigenfunctions of the marginals are also eigenfunctions of the joint density, with the same eigenvalue [Nadler et al. 06, Weiss et al. 08].
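Stated as a formula (this is just a restatement of the slide's claim, with $L_p$ denoting the smoothness operator induced by a density $p$):

```latex
p(x_1, x_2) = p(x_1)\,p(x_2)
\quad\text{and}\quad
L_{p(x_1)}\,\phi = \lambda\,\phi
\;\;\Longrightarrow\;\;
L_{p(x_1,x_2)}\,\phi = \lambda\,\phi ,
```

i.e. an eigenfunction $\phi(x_1)$ of the marginal, viewed as a function of $(x_1, x_2)$ that ignores $x_2$, is an eigenfunction of the joint with the same eigenvalue.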

Numerical Approximations to Eigenfunctions in 1D. 300,000 points are drawn from a distribution $p(x)$. Consider the marginal $p(x_1)$, approximated by the data histogram $h(x_1)$.

Numerical Approximations to Eigenfunctions in 1D. Solve for the values of the eigenfunction at a set of discrete locations (the histogram bin centers) and the associated eigenvalues: a $B \times B$ system ($B$ = # histogram bins, e.g. 50).
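A hedged sketch of this step: the paper specifies the exact generalized eigenproblem over the bin centers; the density-weighted discretization below is my assumption for illustration, with the bandwidth heuristic and the small-count regularization chosen arbitrarily.

```python
import numpy as np
from scipy.linalg import eigh

def eigenfunctions_1d(x, n_bins=50, n_eig=4, eps=None):
    """Approximate 1-D eigenfunctions/eigenvalues of one data dimension from its histogram."""
    counts, edges = np.histogram(x, bins=n_bins, density=True)
    counts = counts + 1e-6 * counts.max()                  # avoid exactly empty bins
    centers = 0.5 * (edges[:-1] + edges[1:])
    eps = eps or 2 * (edges[1] - edges[0])                 # kernel bandwidth heuristic
    W = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / (2 * eps ** 2))
    P = np.diag(counts)                                    # density at the bin centers
    PWP = P @ W @ P
    D = np.diag(PWP.sum(axis=1))
    # Generalized B x B eigenproblem, smallest eigenvalues = smoothest eigenfunctions:
    # (D - P W P) g = sigma * P D g
    sigma, g = eigh(D - PWP, P @ D)
    return centers, g[:, :n_eig], sigma[:n_eig]
```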

1D Approximate Eigenfunctions. [Figure: 1st, 2nd, and 3rd eigenfunctions of $h(x_1)$.]

Separability over Dimension. Build a histogram over dimension 2, $h(x_2)$, and now solve for the eigenfunctions of $h(x_2)$. [Figure: 1st, 2nd, and 3rd eigenfunctions of $h(x_2)$.]

From Eigenfunctions to Approximate Eigenvectors. Take each data point and do a 1-D interpolation into each eigenfunction (eigenfunction value vs. histogram bin). A very fast operation.
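A minimal sketch of the interpolation step, using np.interp (my choice of 1-D linear interpolator) to read each point's value off the eigenfunction stored at the histogram bin centers:

```python
import numpy as np

def interpolate_eigenvector(x_dim, bin_centers, eigfunc_values):
    """x_dim: (n,) values of one data dimension.
    bin_centers: (B,) histogram bin centers; eigfunc_values: (B,) eigenfunction at those centers.
    Returns the (n,) approximate eigenvector: each point's interpolated eigenfunction value."""
    return np.interp(x_dim, bin_centers, eigfunc_values)
```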

Preprocessing. Need to make the data separable: rotate using PCA. [Figure: not separable → PCA → separable.]

Overall Algorithm (see the sketch after this list):
1. Rotate the data to maximize separability (currently using PCA).
2. For each of the $d$ input dimensions: construct a 1D histogram and solve numerically for the eigenfunctions/eigenvalues.
3. Order the eigenfunctions from all dimensions by increasing eigenvalue and take the first $k$.
4. Interpolate the data into the $k$ eigenfunctions, yielding approximate eigenvectors of the Laplacian.
5. Solve the $k \times k$ least-squares system to give the label function.
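A minimal end-to-end sketch of the five steps, reusing the hypothetical helpers sketched earlier (eigenfunctions_1d, and np.interp for step 4); all parameter names and defaults are mine, and for brevity the constant eigenfunction of each dimension is not treated specially.

```python
import numpy as np

def approximate_ssl(X, y, labeled_mask, k=100, n_bins=50, lam=10.0):
    # 1. Rotate the data with PCA to make the dimensions (approximately) separable.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Xr = Xc @ Vt.T
    # 2. Per-dimension 1-D histograms and numerical eigenfunctions.
    centers, funcs, sigmas = [], [], []
    for j in range(Xr.shape[1]):
        c, g, s = eigenfunctions_1d(Xr[:, j], n_bins=n_bins, n_eig=k)
        centers.append(c); funcs.append(g); sigmas.append(s)
    # 3. Order (dimension, eigenfunction) pairs by eigenvalue, keep the k smallest.
    pairs = sorted((s, j, e) for j, ss in enumerate(sigmas) for e, s in enumerate(ss))[:k]
    # 4. Interpolate the data into the chosen eigenfunctions -> approximate eigenvectors U.
    U = np.column_stack([np.interp(Xr[:, j], centers[j], funcs[j][:, e])
                         for _, j, e in pairs])
    eigvals = np.array([s for s, _, _ in pairs])
    # 5. Solve the k x k system (Sigma + U^T Lambda U) alpha = U^T Lambda y.
    Lam = np.where(labeled_mask, lam, 0.0)
    alpha = np.linalg.solve(np.diag(eigvals) + U.T @ (U * Lam[:, None]), U.T @ (Lam * y))
    return U @ alpha   # label function over all (labeled and unlabeled) points
```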

Experiments on Toy Data

Nyström Comparison. With Nyström, too few landmark points result in highly unstable eigenvectors.

Nyström Comparison. Eigenfunctions fail when the data has significant dependencies between dimensions.

Experiments on Real Data

Experiments. Images from 126 classes downloaded from Internet search engines, 63,000 images in total (e.g. dump truck, emu). Labels (correct/incorrect) provided by Alex Krizhevsky, Vinod Nair & Geoff Hinton (CIFAR & U. Toronto).

Input Image Representation. Pixels are not a convenient representation; use the Gist descriptor (Oliva & Torralba, 2001). L2 distance between Gist vectors is a rough substitute for human perceptual distance. The descriptor applies oriented Gabor filters over different scales and averages the filter energy in each spatial bin.
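For intuition only, here is a rough, hypothetical Gist-style descriptor, not the reference implementation of Oliva & Torralba (2001): oriented, multi-scale filter responses (a cheap oriented-gradient stand-in for true Gabor filters) with energy averaged over a coarse spatial grid.

```python
import numpy as np
from scipy import ndimage

def gist_like_descriptor(img, n_scales=3, n_orientations=4, grid=4):
    """img: 2-D grayscale array. Returns a vector of pooled oriented filter energies."""
    feats = []
    for s in range(n_scales):
        smooth = ndimage.gaussian_filter(img, sigma=2 ** s)   # crude scale pyramid
        dy = ndimage.sobel(smooth, axis=0)
        dx = ndimage.sobel(smooth, axis=1)
        for o in range(n_orientations):
            theta = np.pi * o / n_orientations
            resp = np.abs(np.cos(theta) * dx + np.sin(theta) * dy)  # oriented energy
            H, W = resp.shape
            for i in range(grid):                 # average energy in each grid cell
                for j in range(grid):
                    cell = resp[i*H//grid:(i+1)*H//grid, j*W//grid:(j+1)*W//grid]
                    feats.append(cell.mean())
    return np.array(feats)    # length = n_scales * n_orientations * grid**2
```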

Are Dimensions Independent? Joint histograms for pairs of dimensions from the raw 384-dimensional Gist, and after PCA to 64 dimensions. MI is the mutual information score; 0 = independent.

Real 1-D Eigenfunctions of PCA'd Gist Descriptors. [Figure: eigenfunctions plotted against input dimension, 1 to 64.]

Protocol. The task is to re-rank the images of each class (class/non-class). Use eigenfunctions computed on all 63,000 images, vary the number of labeled examples, and measure precision at 15% recall.
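A small helper (my own, not from the paper) making the evaluation measure concrete: walk down the ranked list until 15% of the positives have been retrieved and report the precision at that depth.

```python
import numpy as np

def precision_at_recall(scores, labels, recall_level=0.15):
    """scores: ranking scores; labels: 1 for class, 0 for non-class."""
    order = np.argsort(-np.asarray(scores))             # best-scoring images first
    labels = np.asarray(labels)[order]
    needed = int(np.ceil(recall_level * labels.sum()))  # positives needed for 15% recall
    hits = np.cumsum(labels)
    cut = np.searchsorted(hits, needed)                 # first rank reaching that recall
    return hits[cut] / (cut + 1)                        # precision at that rank
```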

[Plot: mean precision at 15% recall, averaged over 16 classes, versus log2 of the number of positive training examples per class; total number of images 4800, 5000, 6000, 8000. Curves: Eigenfunction, Nyström, Eigenvector, Least squares, SVM, NN, Chance.]

80 Million Images

Running on 80 Million Images. PCA to 32 dimensions, $k = 48$ eigenfunctions. For each class, labels propagate through 80 million images. The approximate eigenvectors are precomputed (~20 GB); label propagation itself is fast, < 0.1 s per keyword.
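A back-of-the-envelope check of the quoted ~20 GB (my arithmetic, assuming 4-8 bytes per stored value):

```python
n_points, k = 80_000_000, 48
print(n_points * k * 4 / 1e9, "GB in float32")   # ~15.4 GB
print(n_points * k * 8 / 1e9, "GB in float64")   # ~30.7 GB; ~20 GB sits in between
```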

Japanese Spaniel: 3 positive and 3 negative labels from the CIFAR set. [Figure: retrieval results.]

Airbus, Ostrich, Auto

Summary. A semi-supervised scheme that can scale to really large problems: linear in the number of points. Rather than sub-sampling the data, we take the limit of infinite unlabeled data. Assumes the input data distribution is separable. Can propagate labels in a graph with 80 million nodes in a fraction of a second. Related paper at this NIPS by Nadler, Srebro & Zhou; see the spotlights on Wednesday.