Semi-Supervised Learning in Gigantic Image Collections. Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT)

1 Semi-Supervised Learning in Gigantic Image Collections Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT)

2 Gigantic Image Collections: What does the world look like? High-level image statistics. Object recognition for large-scale image search.

3 Spectrum of Label Information: Human annotations, noisy labels, unlabeled.

4 Semi-Supervised Learning: Classification function should be smooth with respect to data density. [Figure panels: Data, Supervised, Semi-Supervised]

5 Semi-Supervised Learning using Graph Laplacian: $W$ is the n x n affinity matrix (n = # of points), $W_{ij} = \exp(-\|x_i - x_j\|^2 / 2\epsilon^2)$ [Zhu03, Zhou04]. Graph Laplacian: $L = I - D^{-1/2} W D^{-1/2}$, with $D_{ii} = \sum_j W_{ij}$.
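As a small illustration of the definitions on this slide, the sketch below builds the Gaussian affinity matrix and the normalized graph Laplacian with NumPy. The function name `normalized_laplacian`, the kernel width `eps=1.0` and the random toy points are assumptions made for the example, not part of the talk.

```python
import numpy as np

def normalized_laplacian(X, eps):
    """Return (W, L) for points X (n x d) using a Gaussian affinity of width eps."""
    # Squared Euclidean distance between every pair of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * eps ** 2))
    d = W.sum(axis=1)                          # D_ii = sum_j W_ij
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # L = I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return W, L

# Toy data (assumption): six random 2D points.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
W, L = normalized_laplacian(X, eps=1.0)
```

Both matrices are dense n x n arrays, so this direct construction is only practical for small n.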

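6 SSL using Graph Laplacian: Want to find label function $f$ that minimizes $f^T L f + (f - y)^T \Lambda (f - y)$, where the first term measures smoothness and the second measures agreement with labels. $y$ = labels; $\Lambda_{ii} = \lambda$ if point $i$ is labeled, otherwise $\Lambda_{ii} = 0$. Solution: n x n system (n = # points).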
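Setting the gradient of $f^T L f + (f - y)^T \Lambda (f - y)$ to zero gives the n x n linear system $(L + \Lambda) f = \Lambda y$. The sketch below solves it directly on a toy problem. The two-cluster data, `lam=100.0`, the kernel width and the name `propagate_labels` are assumptions for illustration only.

```python
import numpy as np

def propagate_labels(L, y, labeled, lam=100.0):
    """Minimise f^T L f + (f - y)^T Lambda (f - y) by solving (L + Lambda) f = Lambda y."""
    Lam = np.diag(np.where(labeled, lam, 0.0))   # Lambda_ii = lam if labelled, else 0
    return np.linalg.solve(L + Lam, Lam @ y)

# Toy data (assumption): two well-separated 2D clusters of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 0.5, size=(20, 2)),
               rng.normal(3, 0.5, size=(20, 2))])
sq = np.sum((X[:, None] - X[None]) ** 2, axis=-1)
W = np.exp(-sq / 2.0)                            # Gaussian affinity with eps = 1
d = W.sum(axis=1)
L = np.eye(len(X)) - W / np.sqrt(np.outer(d, d))

y = np.zeros(len(X))
labeled = np.zeros(len(X), dtype=bool)
y[0], labeled[0] = -1.0, True                    # one labelled point in the first cluster
y[20], labeled[20] = 1.0, True                   # one labelled point in the second cluster
f = propagate_labels(L, y, labeled)
```

With one label per cluster, the sign of `f` carries each label to the rest of its cluster. The cost is a dense n x n solve, which is what the later slides set out to avoid.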

7 Eigenvectors of Laplacian: Smooth vectors will be linear combinations of eigenvectors $U$ with small eigenvalues: $f = U\alpha$, $U = [\phi_1, \ldots, \phi_k]$ [Belkin & Niyogi 06, Schoelkopf & Smola 02, Zhu et al 03, 08].

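8 Rewrite System: Let $f = U\alpha$, where $U$ holds the eigenvectors of $L$ with the k smallest eigenvalues and $\alpha$ holds the coefficients. k is a user parameter (typically ~100). Optimal $\alpha$ is now the solution to a k x k system: $(\Sigma + U^T \Lambda U)\alpha = U^T \Lambda y$, where $\Sigma$ is the diagonal matrix of the corresponding k eigenvalues.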
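A minimal sketch of the reduced problem, assuming $\Sigma$ is the diagonal matrix of the k smallest eigenvalues of $L$ (so that $U^T L U = \Sigma$). Here the eigenvectors are computed exactly with `np.linalg.eigh` on a toy graph. The data, `k = 4`, `lam` and the name `solve_reduced` are illustrative assumptions.

```python
import numpy as np

def solve_reduced(U, sigma, y, labeled, lam=100.0):
    """Solve (Sigma + U^T Lambda U) alpha = U^T Lambda y and return f = U alpha."""
    lam_diag = np.where(labeled, lam, 0.0)               # diagonal of Lambda
    A = np.diag(sigma) + U.T @ (lam_diag[:, None] * U)   # k x k matrix
    b = U.T @ (lam_diag * y)
    alpha = np.linalg.solve(A, b)
    return U @ alpha

# Toy graph (assumption): two well-separated clusters, Gaussian affinity with eps = 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 0.5, size=(20, 2)),
               rng.normal(3, 0.5, size=(20, 2))])
sq = np.sum((X[:, None] - X[None]) ** 2, axis=-1)
W = np.exp(-sq / 2.0)
d = W.sum(axis=1)
L = np.eye(len(X)) - W / np.sqrt(np.outer(d, d))

k = 4
evals, evecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
sigma, U = evals[:k], evecs[:, :k]

y = np.zeros(len(X))
labeled = np.zeros(len(X), dtype=bool)
y[0], labeled[0] = -1.0, True
y[20], labeled[20] = 1.0, True
f = solve_reduced(U, sigma, y, labeled)
```

Once `U` is available the solve involves only a k x k matrix. Obtaining `U` is the remaining expensive step.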

9 Computational Bottleneck: Consider a dataset of 80 million images. Inverting L means inverting an 80 million x 80 million matrix. Finding eigenvectors of L means diagonalizing an 80 million x 80 million matrix.

10 Large Scale SSL - Related work: Nystrom method: pick a small set of landmark points, compute exact eigenvectors on these, and interpolate the solution to the rest [see Zhu 08 survey]. Other approaches include: Mixture models (Zhu and Lafferty 05), Sparse Grids (Garcke and Griebel 05), Sparse Graphs (Tsang and Kwok 06). [Figure panels: Data, Landmarks]

11 Our Approach

12 Overview of Our Approach: Compute approximate eigenvectors. Ours: take the limit as n → ∞ (work with the density); linear in number of data points. Nystrom: reduce n (work with landmarks); polynomial in number of landmarks. [Figure panels: Data, Density, Landmarks]

13 Consider Limit as n → ∞: Consider x to be drawn from a 2D distribution p(x). Let $L_p(F)$ be a smoothness operator on p(x), for a function F(x). The smoothness operator penalizes functions that vary in areas of high density. Analyze eigenfunctions of $L_p(F)$.

14 Eigenvectors & Eigenfunctions

15 Key Assumption: Separability of Input Data. Claim: if p is separable, then eigenfunctions of the marginals are also eigenfunctions of the joint density, with the same eigenvalue [Nadler et al. 06, Weiss et al. 08]. [Figure panels: $p(x_1)$, $p(x_2)$, $p(x_1, x_2)$]

16 Numerical Approximations to Eigenfunctions in 1D: 300,000 points drawn from distribution p(x). Consider the marginal $p(x_1)$, approximated by the histogram $h(x_1)$. [Figure panels: data p(x), marginal $p(x_1)$, histogram $h(x_1)$]

17 Numerical Approximations to Eigenfunctions in 1D: Solve for values of the eigenfunction at a set of discrete locations (histogram bin centers) and the associated eigenvalues. B x B system (B = # histogram bins, e.g. 50).
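The slide does not show the B x B system itself. The sketch below assumes one plausible discretisation: weight a Gaussian affinity between bin centers by the histogram density $p$, giving $A = P \tilde{W} P$ with $P = \mathrm{diag}(p)$, and solve the normalized-Laplacian generalized eigenproblem $(\tilde{D} - A) g = \sigma \tilde{D} g$, where $\tilde{D}$ holds the column sums of $A$. The density floor, `eps`, the toy data and the function name are all assumptions.

```python
import numpy as np

def eigenfunctions_1d(x, n_bins=50, eps=0.5, n_funcs=4):
    """Approximate eigenfunctions of a 1D density, evaluated at histogram bin centres."""
    counts, edges = np.histogram(x, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    p = counts + 1e-3                 # assumption: small floor keeps empty bins connected
    p = p / p.sum()
    W = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / (2 * eps ** 2))
    A = p[:, None] * W * p[None, :]   # density-weighted affinity P W P
    dt = A.sum(axis=0)                # diagonal of D-tilde
    # (D - A) g = sigma D g becomes a symmetric problem for v = D^{1/2} g.
    S = np.eye(n_bins) - A / np.sqrt(np.outer(dt, dt))
    sigma, V = np.linalg.eigh(S)      # ascending eigenvalues
    G = V / np.sqrt(dt)[:, None]      # back-transform: g = D^{-1/2} v
    return centers, sigma[:n_funcs], G[:, :n_funcs]

# Toy data (assumption): a 1D mixture of two Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 0.5, 2500), rng.normal(1, 0.5, 2500)])
centers, sigma, G = eigenfunctions_1d(x)
```

The first eigenfunction is constant with eigenvalue near zero. For bimodal data like this, the second one changes sign between the two modes.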

18 1D Approximate Eigenfunctions [Figure panels: 1st, 2nd and 3rd eigenfunctions of $h(x_1)$]

19 Separability over Dimension: Build histogram over dimension 2: $h(x_2)$. Now solve for eigenfunctions of $h(x_2)$. [Figure panels: 1st, 2nd and 3rd eigenfunctions of $h(x_2)$]

20 From Eigenfunctions to Approximate Eigenvectors: Take each data point. Do 1-D interpolation in each eigenfunction. Very fast operation. [Figure axes: eigenfunction value vs. histogram bin (1 to 50)]
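The interpolation step can be sketched with `np.interp`, which does piecewise-linear interpolation and clamps to the end values outside the tabulated range. The bin centers and eigenfunction values below are made-up numbers chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def interpolate_eigenfunction(x, centers, g):
    """Evaluate a 1D eigenfunction, tabulated at bin centres, at each data point."""
    return np.interp(x, centers, g)

centers = np.linspace(0.0, 1.0, 5)     # hypothetical bin centres: 0, 0.25, 0.5, 0.75, 1
g = centers ** 2                       # hypothetical eigenfunction values at those centres
x = np.array([0.0, 0.125, 0.5, 2.0])   # data coordinates along this dimension
u = interpolate_eigenfunction(x, centers, g)
# u is [0.0, 0.03125, 0.25, 1.0]; the last point lies outside the range and is clamped.
```

Each data point costs one table lookup per eigenfunction, which is why the cost is linear in the number of points.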

21 Preprocessing: Need to make data separable. Rotate using PCA. [Figure panels: not separable (before PCA), separable (after PCA)]
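A minimal PCA rotation in NumPy, assuming the data is an n x d array. The function name `pca_rotate` and the correlated toy data are illustrative assumptions.

```python
import numpy as np

def pca_rotate(X, n_components=None):
    """Centre X (n x d) and rotate it onto its principal axes, largest variance first."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    evals, evecs = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(evals)[::-1]           # reorder to descending variance
    R = evecs[:, order[:n_components]]        # n_components=None keeps all axes
    return Xc @ R

# Toy data (assumption): 2D Gaussian with strongly correlated coordinates.
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 2))
X = base @ np.array([[2.0, 1.5], [0.0, 0.5]])
Z = pca_rotate(X)
C = np.cov(Z, rowvar=False)                   # off-diagonal entries are ~0 after rotation
```

PCA removes linear correlation between dimensions. That is a weaker property than the statistical independence the separability assumption asks for, so it should be read as an approximation.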

22 Overall Algorithm
1. Rotate data to maximize separability (currently use PCA).
2. For each of the d input dimensions: construct a 1D histogram, then solve numerically for eigenfunctions/values.
3. Order eigenfunctions from all dimensions by increasing eigenvalue and take the first k.
4. Interpolate data into the k eigenfunctions. This yields approximate eigenvectors of the Laplacian.
5. Solve k x k least squares system to give label function.
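The five steps can be strung together as in the sketch below. It is an illustration under several assumptions, not the authors' implementation: the 1D discretisation, the density floor, dropping the constant eigenfunction of each dimension, normalising the interpolated columns to unit length, and all parameter values and names are choices made for this example.

```python
import numpy as np

def eigenfunctions_1d(x, n_bins=50, eps=0.5, n_funcs=4):
    """Eigenfunctions of a 1D histogram density at bin centres (assumed discretisation)."""
    counts, edges = np.histogram(x, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    p = counts + 1e-3                            # assumption: floor for empty bins
    p = p / p.sum()
    W = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / (2 * eps ** 2))
    A = p[:, None] * W * p[None, :]
    dt = A.sum(axis=0)
    sigma, V = np.linalg.eigh(np.eye(n_bins) - A / np.sqrt(np.outer(dt, dt)))
    return centers, sigma[:n_funcs], (V / np.sqrt(dt)[:, None])[:, :n_funcs]

def eigenfunction_ssl(X, y, labeled, k=4, lam=100.0, eps=0.5):
    # 1. Rotate the data with PCA (via SVD of the centred data).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt.T
    # 2. Per-dimension histogram and eigenfunctions; index 0 is the constant
    #    eigenfunction (eigenvalue ~0) and is skipped here (assumption).
    cands = []
    for dim in range(Z.shape[1]):
        centers, sigma, G = eigenfunctions_1d(Z[:, dim], eps=eps)
        for j in range(1, len(sigma)):
            cands.append((sigma[j], dim, centers, G[:, j]))
    # 3. Order by increasing eigenvalue and keep the first k.
    cands.sort(key=lambda c: c[0])
    cands = cands[:k]
    # 4. Interpolate every data point into each kept eigenfunction.
    U = np.column_stack([np.interp(Z[:, dim], centers, g)
                         for _, dim, centers, g in cands])
    U /= np.linalg.norm(U, axis=0)               # unit-norm columns (assumption)
    sig = np.array([c[0] for c in cands])
    # 5. Solve the k x k system (Sigma + U^T Lambda U) alpha = U^T Lambda y.
    lam_diag = np.where(labeled, lam, 0.0)
    A = np.diag(sig) + U.T @ (lam_diag[:, None] * U)
    alpha = np.linalg.solve(A, U.T @ (lam_diag * y))
    return U @ alpha

# Toy data (assumption): two clusters of 200 points, three labels per cluster.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], [0.5, 1.0], size=(200, 2)),
               rng.normal([2, 0], [0.5, 1.0], size=(200, 2))])
truth = np.concatenate([-np.ones(200), np.ones(200)])
labeled = np.zeros(400, dtype=bool)
labeled[[0, 1, 2, 200, 201, 202]] = True
y = np.where(labeled, truth, 0.0)
f = eigenfunction_ssl(X, y, labeled)
accuracy = np.mean(np.sign(f) == truth)
```

No n x n matrix is ever formed: the only eigenproblems are B x B (one per dimension), and the final solve is k x k.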

23 Experiments on Toy Data

24 Nystrom Comparison: With Nystrom, too few landmark points result in highly unstable eigenvectors.

25 Nystrom Comparison: Eigenfunctions fail when data has significant dependencies between dimensions.

26 Experiments on Real Data

27 Experiments: Images from 126 classes downloaded from Internet search engines, 63,000 images in total (examples shown: dump truck, emu). Labels (correct/incorrect) provided by Alex Krizhevsky, Vinod Nair & Geoff Hinton (CIFAR & U. Toronto).

28 Input Image Representation: Pixels are not a convenient representation. Use Gist descriptor (Oliva & Torralba, 2001): apply oriented Gabor filters over different scales and average the filter energy in each bin. L2 distance between Gist vectors is a rough substitute for human perceptual distance.

29 Are Dimensions Independent? Joint histograms for pairs of dimensions from raw 384-dimensional Gist, and for pairs of dimensions after PCA to 64 dimensions. MI is mutual information score; 0 = independent.
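The slide does not say how the MI score was computed. One common estimate, sketched here as an assumption, takes the joint histogram of two dimensions and compares it with the product of its marginals. The bin count, the synthetic samples and the function name are illustrative.

```python
import numpy as np

def mutual_information(a, b, n_bins=20):
    """Histogram estimate of the mutual information (in nats) between two 1D samples."""
    joint, _, _ = np.histogram2d(a, b, bins=n_bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)       # marginal of a, shape (n_bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of b, shape (1, n_bins)
    nz = pxy > 0                              # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# Synthetic samples (assumption): one independent pair and one strongly dependent pair.
rng = np.random.default_rng(0)
a = rng.normal(size=20000)
b_indep = rng.normal(size=20000)
b_dep = a + 0.1 * rng.normal(size=20000)
mi_indep = mutual_information(a, b_indep)     # close to 0
mi_dep = mutual_information(a, b_dep)         # clearly above 0
```

Histogram estimates are biased slightly upward for finite samples, so independent dimensions give a small positive score rather than exactly 0.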

30 Real 1-D Eigenfunctions of PCA'd Gist descriptors [Figure axes: eigenfunction, input dimension]

31 Protocol: Task is to re-rank images of each class (class/non-class). Use eigenfunctions computed on all 63,000 images. Vary number of labeled examples. Measure precision at 15% recall.
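For reference, precision at a fixed recall level can be computed from a ranked list as sketched below. The exact tie-breaking and rounding used in the talk's evaluation are not stated, so this is one reasonable definition. The function name and the small example are assumptions.

```python
import numpy as np

def precision_at_recall(scores, is_positive, recall=0.15):
    """Precision at the first rank where the given fraction of positives has been retrieved."""
    order = np.argsort(-scores)                     # rank by decreasing score
    hits = np.cumsum(is_positive[order])            # positives found up to each rank
    needed = int(np.ceil(recall * hits[-1]))        # positives required for this recall
    rank = int(np.searchsorted(hits, needed)) + 1   # first rank that reaches it
    return hits[rank - 1] / rank

# Hypothetical ranking of 10 images, 4 of which are true class members.
scores = np.linspace(1.0, 0.1, 10)
is_positive = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
p15 = precision_at_recall(scores, is_positive)              # 1 positive needed, found at rank 1
p50 = precision_at_recall(scores, is_positive, recall=0.5)  # 2 positives needed, found at rank 3
```

The function assumes at least one positive example is present.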

32 [Plot: mean precision at 15% recall, averaged over 16 classes, against log2 number of positive training examples per class (second axis: total number of images). Curves: Least squares, SVM, Chance.]

33 [Same plot with a Nystrom curve added.]

34 [Same plot with an Eigenfunction curve added.]

35 [Same plot with Eigenvector and NN curves added.]

36 80 Million Images

37 Running on 80 million images: PCA to 32 dims, k=48 eigenfunctions. For each class, labels propagate through 80 million images. Precompute approximate eigenvectors (~20Gb). Label propagation is fast: <0.1 secs/keyword.

38 Japanese Spaniel: 3 positive, 3 negative. Labels from CIFAR set.

39 Airbus, Ostrich, Auto

40 Summary: Semi-supervised scheme that can scale to really large problems, linear in # points. Rather than sub-sampling the data, we take the limit of infinite unlabeled data. Assumes input data distribution is separable. Can propagate labels in a graph with 80 million nodes in fractions of a second. Related paper in this NIPS by Nadler, Srebro & Zhou. See spotlights on Wednesday.
