Random Projections Lopez Paz & Duvenaud November 7, 2013

Random Outline: The Johnson-Lindenstrauss Lemma (1984); Random Kitchen Sinks (Rahimi and Recht, NIPS 2008); Fastfood (Le et al., ICML 2013).

Why random projections? Fast, efficient, and distance-preserving dimensionality reduction! A random matrix $W \in \mathbb{R}^{40500 \times 1000}$ maps points $x_1, x_2 \in \mathbb{R}^{40500}$ to $y_1, y_2 \in \mathbb{R}^{1000}$ so that a pairwise distance $\delta$ becomes $\delta(1 \pm \epsilon)$:
$(1-\epsilon)\,\|x_1 - x_2\|^2 \le \|y_1 - y_2\|^2 \le (1+\epsilon)\,\|x_1 - x_2\|^2.$
This result is formalized in the Johnson-Lindenstrauss Lemma.

The Johnson-Lindenstrauss Lemma. The proof is a great example of Erdős' probabilistic method (1947). [Photos: Paul Erdős (1913-1996), Joram Lindenstrauss (1936-2012), William B. Johnson (1944-).] The proof follows Section 12.5 of Foundations of Machine Learning (Mohri et al., 2012).

Auxiliary Lemma 1. Let $Q$ be a random variable following a $\chi^2$ distribution with $k$ degrees of freedom. Then, for any $0 < \epsilon < 1/2$:
$\Pr[(1-\epsilon)k \le Q \le (1+\epsilon)k] \ge 1 - 2e^{-(\epsilon^2-\epsilon^3)k/4}.$
Proof: we start by using Markov's inequality ($\Pr[X \ge a] \le E[X]/a$):
$\Pr[Q \ge (1+\epsilon)k] = \Pr[e^{\lambda Q} \ge e^{\lambda(1+\epsilon)k}] \le \frac{E[e^{\lambda Q}]}{e^{\lambda(1+\epsilon)k}} = \frac{(1-2\lambda)^{-k/2}}{e^{\lambda(1+\epsilon)k}},$
where $E[e^{\lambda Q}] = (1-2\lambda)^{-k/2}$ is the m.g.f. of a $\chi^2$ distribution, $\lambda < \tfrac{1}{2}$. To tighten the bound we minimize the right-hand side with $\lambda = \tfrac{\epsilon}{2(1+\epsilon)}$:
$\Pr[Q \ge (1+\epsilon)k] \le \left(1 - \tfrac{\epsilon}{1+\epsilon}\right)^{-k/2} e^{-\epsilon k/2} = \frac{(1+\epsilon)^{k/2}}{(e^{\epsilon})^{k/2}} = \left(\frac{1+\epsilon}{e^{\epsilon}}\right)^{k/2}.$

Auxiliary Lemma 1. Using $1+\epsilon \le e^{\epsilon - (\epsilon^2-\epsilon^3)/2}$ yields
$\Pr[Q \ge (1+\epsilon)k] \le \left(\frac{e^{\epsilon - (\epsilon^2-\epsilon^3)/2}}{e^{\epsilon}}\right)^{k/2} = e^{-\frac{k}{4}(\epsilon^2-\epsilon^3)}.$
$\Pr[Q \le (1-\epsilon)k]$ is bounded similarly, and the lemma follows by applying the union bound:
$\Pr[Q \ge (1+\epsilon)k \,\vee\, Q \le (1-\epsilon)k] \le \Pr[Q \ge (1+\epsilon)k] + \Pr[Q \le (1-\epsilon)k] = 2e^{-\frac{k}{4}(\epsilon^2-\epsilon^3)}.$
Then, $\Pr[(1-\epsilon)k \le Q \le (1+\epsilon)k] \ge 1 - 2e^{-\frac{k}{4}(\epsilon^2-\epsilon^3)}.$
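As a quick numerical sanity check of this bound (not from the original slides; the values k = 100, epsilon = 0.2 and the sample count are arbitrary), a few lines of MATLAB compare the empirical probability against the guaranteed lower bound:

% Monte Carlo check of Auxiliary Lemma 1 (illustrative parameter values).
k = 100; ep = 0.2; S = 20000;              % degrees of freedom, epsilon, #samples
Q = sum(randn(k, S).^2, 1);                % S draws of a chi-squared(k) variable
empirical = mean(Q >= (1-ep)*k & Q <= (1+ep)*k);
bound = 1 - 2*exp(-(ep^2 - ep^3)*k/4);
fprintf('empirical: %.4f   lower bound: %.4f\n', empirical, bound);

For these values the bound holds but is quite loose (roughly 0.10 against an empirical probability around 0.84), which is typical of such concentration inequalities.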

Auxiliary Lemma 2. Let $x \in \mathbb{R}^N$, $k < N$ and $A \in \mathbb{R}^{k \times N}$ with $A_{ij} \sim \mathcal{N}(0,1)$. Then, for any $0 < \epsilon < 1/2$:
$\Pr\!\left[(1-\epsilon)\|x\|^2 \le \tfrac{1}{k}\|Ax\|^2 \le (1+\epsilon)\|x\|^2\right] \ge 1 - 2e^{-(\epsilon^2-\epsilon^3)k/4}.$
Proof: let $\hat{x} = Ax$. Then
$E[\hat{x}_j^2] = E\!\left[\Big(\sum_{i=1}^N A_{ji}x_i\Big)^2\right] = E\!\left[\sum_{i=1}^N A_{ji}^2 x_i^2\right] = \sum_{i=1}^N x_i^2 = \|x\|^2.$
Note that $T_j = \hat{x}_j/\|x\| \sim \mathcal{N}(0,1)$. Then, $Q = \sum_{j=1}^k T_j^2 \sim \chi^2_k$. Remember the previous lemma?

Auxiliary Lemma 2. Remember: $\hat{x} = Ax$, $T_j = \hat{x}_j/\|x\| \sim \mathcal{N}(0,1)$, $Q = \sum_{j=1}^k T_j^2 \sim \chi^2_k$. Then
$\Pr\!\left[(1-\epsilon)\|x\|^2 \le \tfrac{1}{k}\|Ax\|^2 \le (1+\epsilon)\|x\|^2\right]
= \Pr\!\left[(1-\epsilon)\|x\|^2 \le \tfrac{\|\hat{x}\|^2}{k} \le (1+\epsilon)\|x\|^2\right]
= \Pr\!\left[(1-\epsilon)k \le \tfrac{\|\hat{x}\|^2}{\|x\|^2} \le (1+\epsilon)k\right]
= \Pr\!\left[(1-\epsilon)k \le \sum_{j=1}^k T_j^2 \le (1+\epsilon)k\right]
= \Pr\!\left[(1-\epsilon)k \le Q \le (1+\epsilon)k\right] \ge 1 - 2e^{-(\epsilon^2-\epsilon^3)k/4}.$

The Johnson-Lindenstrauss Lemma. For any $0 < \epsilon < 1/2$ and any integer $m > 4$, let $k = \frac{20 \log m}{\epsilon^2}$. Then, for any set $V$ of $m$ points in $\mathbb{R}^N$ there exists $f: \mathbb{R}^N \to \mathbb{R}^k$ such that for all $u, v \in V$:
$(1-\epsilon)\|u-v\|^2 \le \|f(u)-f(v)\|^2 \le (1+\epsilon)\|u-v\|^2.$
Proof: Let $f = \frac{1}{\sqrt{k}}A$, with $A \in \mathbb{R}^{k \times N}$, $k < N$ and $A_{ij} \sim \mathcal{N}(0,1)$. For fixed $u, v \in V$, we apply the previous lemma with $x = u - v$ to lower bound the success probability by $1 - 2e^{-(\epsilon^2-\epsilon^3)k/4}$. Union bound again! This time over the $m^2$ pairs in $V$, with $k = \frac{20 \log m}{\epsilon^2}$ and $\epsilon < 1/2$, to obtain:
$\Pr[\text{success}] \ge 1 - 2m^2 e^{-(\epsilon^2-\epsilon^3)k/4} = 1 - 2m^{5\epsilon - 3} > 1 - 2m^{-1/2} > 0.$
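To get a feel for the constants (example numbers, not from the slides): with $m = 10{,}000$ points and $\epsilon = 0.1$,
\[
k = \frac{20 \log m}{\epsilon^2} = \frac{20 \ln(10^4)}{0.1^2} \approx \frac{20 \times 9.21}{0.01} \approx 18{,}400,
\]
independent of the ambient dimension $N$. In practice much smaller $k$ is often used, as in the experiments below ($k$ between 300 and 10,000).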

JL Experiments. Data: 20-newsgroups, from 100,000 features to 300 (0.3%). MATLAB implementation: 1/sqrt(k) * randn(k,N) * X.

JL Experiments. Data: 20-newsgroups, from 100,000 features to 1,000 (1%). MATLAB implementation: 1/sqrt(k) * randn(k,N) * X.

JL Experiments. Data: 20-newsgroups, from 100,000 features to 10,000 (10%). MATLAB implementation: 1/sqrt(k) * randn(k,N) * X.
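A self-contained version of this kind of experiment (a sketch; the sizes and sparsity below are made-up stand-ins for the 20-newsgroups data) that measures the worst-case pairwise distortion:

% Random projection of sparse, high-dimensional points (illustrative sizes).
N = 5000; m = 200; k = 1000;               % ambient dim, #points, projected dim
X = full(sprand(N, m, 0.01));              % columns are data points
Y = 1/sqrt(k) * randn(k, N) * X;           % the projection from the slides
sq = @(Z) sum(Z.^2,1)' + sum(Z.^2,1) - 2*(Z'*Z);   % pairwise squared distances
D0 = sq(X); D1 = sq(Y);
mask = triu(true(m), 1);                   % distinct pairs only
fprintf('max squared-distance distortion: %.3f\n', max(abs(D1(mask)./D0(mask) - 1)));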

JL Conclusions. Do you have a huge feature space? Are you wasting too much time with PCA? Random Projections are fast, compact & efficient! Further reading: a monograph (Vempala, 2004), Sparse Random Projections (Achlioptas, 2003), and Random Projections can improve MoG! (Dasgupta, 2000). Code for the previous experiments: http://bit.ly/17ftfbh But... what about non-linear random projections? [Photos: Ali Rahimi, Ben Recht.]

Random Outline: The Johnson-Lindenstrauss Lemma (1984); Random Kitchen Sinks (Rahimi and Recht, NIPS 2008); Fastfood (Le et al., ICML 2013).

A Familiar Creature. $f(x) = \sum_{i=1}^{T} \alpha_i \phi(x; w_i)$: Gaussian process, kernel regression, SVM, AdaBoost, multilayer perceptron... Same model, different training approaches! Things get interesting when $\phi$ is non-linear...

A Familiar Solution

A Familiar Monster: The Kernel Trap. $k(x_i, x_j) \Rightarrow K$, the Gram matrix.

Greedy Approximation of Functions. Approximate $f(x) = \sum_{i=1}^{\infty} \alpha_i \phi(x; w_i)$ with $f_T(x) = \sum_{i=1}^{T} \alpha_i \phi(x; w_i)$.
Greedy fitting: $(\alpha, W) = \arg\min_{\alpha, W} \left\| \sum_{i=1}^{T} \alpha_i \phi(\cdot; w_i) - f \right\|_{\mu}$
Random Kitchen Sinks fitting: $w_1, \ldots, w_T \sim p(w)$, then $\alpha = \arg\min_{\alpha} \left\| \sum_{i=1}^{T} \alpha_i \phi(\cdot; w_i) - f \right\|_{\mu}$

Greedy Approximation of Functions. Let
$\mathcal{F} \equiv \left\{ f(x) = \sum_{i=1}^{\infty} \alpha_i \phi(x; w_i),\; w_i \in \Omega,\; \|\alpha\|_1 \le C \right\}$
and approximate $f \in \mathcal{F}$ with $f_T(x) = \sum_{i=1}^{T} \alpha_i \phi(x; w_i)$, $w_i \in \Omega$, $\|\alpha\|_1 \le C$. Then
$\|f_T - f\|_{\mu} = \sqrt{\int_X (f_T(x) - f(x))^2 \, \mu(dx)} = O\!\left(\frac{C}{\sqrt{T}}\right)$ (Jones, 1992).

RKS Approximation of Functions. Approximate $f(x) = \sum_{i=1}^{\infty} \alpha_i \phi(x; w_i)$ with $f_T(x) = \sum_{i=1}^{T} \alpha_i \phi(x; w_i)$.
Greedy fitting: $(\alpha, W) = \arg\min_{\alpha, W} \left\| \sum_{i=1}^{T} \alpha_i \phi(\cdot; w_i) - f \right\|_{\mu}$
Random Kitchen Sinks fitting: $w_1, \ldots, w_T \sim p(w)$, then $\alpha = \arg\min_{\alpha} \left\| \sum_{i=1}^{T} \alpha_i \phi(\cdot; w_i) - f \right\|_{\mu}$
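A minimal sketch of the Random Kitchen Sinks fitting recipe above (the target f, the cosine feature form, and the sampling distribution are illustrative choices, not from the slides): sample the w_i first, then solve a least-squares problem over alpha only.

% Fit a fixed 1-D target with T randomly drawn features; only alpha is learned.
T = 50;
xs = linspace(-3, 3, 400)';                % evaluation grid (stands in for mu)
f = @(z) exp(-z.^2) .* sin(4*z);           % target function to approximate
w = 2*randn(1, T); b = 2*pi*rand(1, T);    % w_i ~ p(w), plus random phases
Phi = cos(xs*w + b);                       % features phi(x; w_i), 400 x T
alpha = Phi \ f(xs);                       % least squares over alpha (w's stay fixed)
fprintf('RMS error: %.4f\n', sqrt(mean((Phi*alpha - f(xs)).^2)));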

Just an old idea? W. Maass, T. Natschläger, H. Markram, Real-time computing without stable states: A new framework for neural computation based on perturbations, Neural Computation, 14(11), 2531-2560, 2002.

How does RKS work? For functions $f(x) = \int_{\Omega} \alpha(w) \phi(x; w)\, dw$, define the $p$-norm as
$\|f\|_p = \sup_{w \in \Omega} \frac{|\alpha(w)|}{p(w)},$
and let $\mathcal{F}_p$ be all $f$ with $\|f\|_p \le C$. Then, for $w_1, \ldots, w_T \sim p(w)$, with probability at least $1-\delta$, $\delta > 0$, there exist some $\alpha$ such that $f_T$ satisfies
$\|f_T - f\|_{\mu} = O\!\left( \frac{C}{\sqrt{T}} \Big( 1 + \sqrt{2 \log \tfrac{1}{\delta}} \Big) \right).$
Why? Set $\alpha_i = \frac{1}{T}\alpha(w_i)$. Then $f_T(x) = \frac{1}{T}\sum_i \alpha(w_i) \phi(x; w_i)$, and the discrepancy is governed by $\|f\|_p$. With a dataset of size $N$, an error $O\!\left(\frac{1}{\sqrt{N}}\right)$ is added to all bounds.

RKS Approximation of Functions. Let
$\mathcal{F}_p \equiv \left\{ f(x) = \int_{\Omega} \alpha(w) \phi(x; w)\, dw,\; |\alpha(w)| \le C\, p(w) \right\}$
and approximate $f \in \mathcal{F}_p$ with $f_T(x) = \sum_{i=1}^{T} \alpha_i \phi(x; w_i)$, $w_i \sim p(w)$, $|\alpha_i| \le \tfrac{C}{T}$. Then
$\|f_T - f\|_{\mu} \le O\!\left( \frac{C}{\sqrt{T}} \right).$

Random Kitchen Sinks: Auxiliary Lemma. Let $X = \{x_1, \ldots, x_K\}$ be i.i.d. random variables drawn from a centered ball of radius $C$ in a Hilbert space $\mathcal{H}$. Let $\bar{X} = \frac{1}{K}\sum_{k=1}^K x_k$. Then, for any $\delta > 0$, with probability at least $1 - \delta$:
$\big\| \bar{X} - E[\bar{X}] \big\| \le \frac{C}{\sqrt{K}} \Big( 1 + \sqrt{2 \log \tfrac{1}{\delta}} \Big).$
Proof: First, show that $f(X) = \|\bar{X} - E[\bar{X}]\|$ is stable w.r.t. perturbations of a single element: $|f(X) - f(\tilde{X})| \le \|\bar{X} - \bar{\tilde{X}}\| = \frac{\|x_i - \tilde{x}_i\|}{K} \le \frac{2C}{K}$. Second, the variance of the average of i.i.d. random variables is $E[\|\bar{X} - E[\bar{X}]\|^2] = \frac{1}{K}\big( E[\|x\|^2] - \|E[x]\|^2 \big)$. Third, using Jensen's inequality and given that $\|x\| \le C$: $E[f(X)] \le \sqrt{E[f^2(X)]} = \sqrt{E[\|\bar{X} - E[\bar{X}]\|^2]} \le \frac{C}{\sqrt{K}}$. Fourth, use McDiarmid's inequality and rearrange.

Random Kitchen Sinks Proof. Let $\mu$ be a measure on $X$, and $f \in \mathcal{F}_p$. Let $w_1, \ldots, w_T \sim p(w)$. Then, w.p. at least $1-\delta$, with $\delta > 0$, there exists $f_T(x) = \sum_{i=1}^{T} \beta_i \phi(x; w_i)$ s.t.
$\|f_T - f\|_{\mu} \le \frac{C}{\sqrt{T}} \Big( 1 + \sqrt{2 \log \tfrac{1}{\delta}} \Big).$
Proof: Let $f_i = \beta_i \phi(\cdot; w_i)$, $1 \le i \le T$, and $\beta_i = \frac{\alpha(w_i)}{p(w_i)}$. Then $E[f_i] = f$:
$E[f_i] = E_w\!\left[ \frac{\alpha(w_i)}{p(w_i)} \phi(\cdot; w_i) \right] = \int_{\Omega} p(w) \frac{\alpha(w)}{p(w)} \phi(\cdot; w)\, dw = f.$
The claim is mainly completed by describing the concentration of the average $f_T = \frac{1}{T}\sum_i f_i$ around $f$ with the previous lemma.

Approximating Kernels with RKS. Bochner's Theorem: a kernel $k(x - y)$ on $\mathbb{R}^d$ is PSD if and only if $k(x - y)$ is the Fourier transform of a non-negative measure $p(w)$:
$k(x - y) = \int_{\mathbb{R}^d} p(w)\, e^{j w^\top (x-y)}\, dw \approx \frac{1}{T} \sum_{i=1}^{T} e^{j w_i^\top (x-y)}$  (Monte Carlo, $O(T^{-1/2})$)
$= \frac{1}{T} \sum_{i=1}^{T} e^{j w_i^\top x}\, \overline{e^{j w_i^\top y}} = \Big\langle \tfrac{1}{\sqrt{T}}\phi(x; W),\; \tfrac{1}{\sqrt{T}}\phi(y; W) \Big\rangle,$ with $\phi(x; W) = \big(e^{j w_1^\top x}, \ldots, e^{j w_T^\top x}\big)$.
Now solve least squares in the primal in $O(N)$ time!
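As a concrete check (a sketch, not from the slides), drawing frequencies from the spectral density of the squared-exponential kernel with lengthscale sigma, $p(w) = \mathcal{N}(0, \sigma^{-2} I)$, gives a Monte Carlo estimate of $k(x, y)$:

% Approximate k(x,y) = exp(-||x-y||^2 / (2*sigma^2)) with T random Fourier features.
D = 10; T = 2000; sigma = 1.5;
x = randn(D, 1); y = randn(D, 1);
W = randn(T, D) / sigma;                   % w_i ~ N(0, I/sigma^2)
k_exact  = exp(-norm(x - y)^2 / (2*sigma^2));
k_approx = real(mean(exp(1i*W*x) .* conj(exp(1i*W*y))));
fprintf('exact: %.4f   approx: %.4f\n', k_exact, k_approx);

The gap between the two numbers shrinks at the $O(T^{-1/2})$ Monte Carlo rate quoted above.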

Random Kitchen Sinks: Implementation
% Training
function ytest = kitchen_sinks(X, y, Xtest, T, noise)
    Z = randn(T, size(X,1));                 % Sample feature frequencies
    phi = exp(1i*Z*X);                       % Compute feature matrix
    % Linear regression with observation noise.
    w = (phi*phi' + eye(T)*noise)\(phi*y);
    % Testing
    ytest = w'*exp(1i*Z*Xtest);
from http://www.keysduplicated.com/~ali/random-features/
That's fast, approximate GP regression! (with a sq-exp kernel) Or linear regression with [sin(Zx), cos(Zx)] feature pairs. Show demo.
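A possible toy demo of the function above (the data and settings are made up, meant only to mirror the 1-D demo mentioned on the slide):

% Toy 1-D regression with kitchen_sinks (illustrative values).
X = linspace(-3, 3, 100);                  % training inputs, 1 x N
y = sin(2*X') + 0.1*randn(100, 1);         % noisy targets, N x 1
Xtest = linspace(-3, 3, 500);              % test inputs, 1 x Ntest
ytest = kitchen_sinks(X, y, Xtest, 200, 0.1);
plot(X, y, '.', Xtest, real(ytest), '-');  % data vs. random-feature fit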

Random Kitchen Sinks and Gram Matrices. How fast do we approach the exact Gram matrix? $k(X, X) \approx \Phi(X)^\top \Phi(X)$. [Sequence of plots: the approximate Gram matrix approaching the exact one as the number of random features grows.]

Random Kitchen Sinks and Posterior. How fast do we approach the exact posterior? [Plots: GP posterior mean, posterior uncertainty, and data, using 3, 4, 5, 10, 20, 200, and 400 random features.]

Kitchen Sinks in multiple dimensions. $D$ dimensions, $T$ random features, $N$ datapoints. First: sample $T \times D$ random numbers, $Z_{jd} \sim \mathcal{N}(0, \sigma^2)$. Each feature is $\phi_j(x) = \exp(i [Zx]_j)$, or $\Phi(x) = \exp(i Z x)$. Each feature $\phi_j(\cdot)$ is a sine/cosine wave varying along the direction given by one row of $Z$, with varying period. Can be slow for many features in high dimensions: for example, computing $\Phi(x)$ for all datapoints is $O(NTD)$.

But isn't linear regression already fast? $D$ dimensions, $T$ features, $N$ datapoints.
Computing features: $\Phi(X) = \frac{1}{\sqrt{T}} \exp(i Z X)$
Regression: $w = \left[ \Phi(X)^\top \Phi(X) \right]^{-1} \Phi(X)^\top y$
Train time complexity: $O(\underbrace{DTN}_{\text{computing features}} + \underbrace{T^3}_{\text{inverting covariance matrix}} + \underbrace{T^2 N}_{\text{multiplication}})$
Prediction: $y_* = w^\top \Phi(x_*)$
Test time complexity: $O(\underbrace{DTN}_{\text{computing features}} + \underbrace{T^2 N}_{\text{multiplication}})$
For images, $D$ is often $> 10{,}000$.

Random Outline: The Johnson-Lindenstrauss Lemma (1984); Random Kitchen Sinks (Rahimi and Recht, NIPS 2008); Fastfood (Le et al., ICML 2013).

Fastfood (Q. Le, T. Sarlós, A. Smola, 2013). Main idea: approximately compute $Zx$ quickly, by replacing $Z$ with some easy-to-compute matrices. Uses the Subsampled Randomized Hadamard Transform (SRHT) (Sarlós, 2006).
Train time complexity: $O(\underbrace{\log(D)\, T N}_{\text{computing features}} + \underbrace{T^3}_{\text{inverting feature matrix}} + \underbrace{T^2 N}_{\text{multiplication}})$
Test time complexity: $O(\underbrace{\log(D)\, T N}_{\text{computing features}} + \underbrace{T^2 N}_{\text{multiplication}})$
For images, if $D = 10{,}000$, $\log_{10}(D) = 4$.

Fastfood (Q. Le, T. Sarlós, A. Smola, 2013). Main idea: compute $Vx \approx Zx$ quickly, by building $V$ out of easy-to-compute matrices. $V$ has similar properties to the Gaussian matrix $Z$:
$V = \frac{1}{\sigma \sqrt{d}}\, S H G \Pi H B \qquad (1)$
$\Pi$ is a $D \times D$ permutation matrix; $G$ is a diagonal random Gaussian matrix; $B$ is a diagonal random $\{+1, -1\}$ matrix; $S$ is a diagonal random scaling matrix; $H$ is the Walsh-Hadamard matrix.

Fastfood (Q. Le, T. Sarlós, A. Smola, 2013). Main idea: compute $Vx \approx Zx$ quickly, by building $V$ out of easy-to-compute matrices:
$V = \frac{1}{\sigma \sqrt{d}}\, S H G \Pi H B \qquad (2)$
$H$ is the Walsh-Hadamard matrix. Multiplication in $O(T \log(D))$. $H$ is orthogonal and can be built recursively, $H_{m+1} = \begin{pmatrix} H_m & H_m \\ H_m & -H_m \end{pmatrix}$, e.g.
$H_3 = \frac{1}{2^{3/2}} \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\ 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ 1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\ 1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\ 1 & -1 & -1 & 1 & -1 & 1 & 1 & -1 \end{pmatrix} \qquad (3)$
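A minimal sketch of that recursive construction (not from the slides):

% Build the 8 x 8 Walsh-Hadamard matrix recursively (Sylvester construction).
H = 1;
for m = 1:3
    H = [H, H; H, -H];            % H_{m+1} = [H_m, H_m; H_m, -H_m]
end
disp(H / 2^(3/2));                % the normalized H_3 shown in Eq. (3)
disp(norm(H'*H - 8*eye(8)));      % orthogonality check: H'*H = 8*I, prints 0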

Fastfood (Q. Le, T. Sarlós, A. Smola, 2013). Main idea: compute $Vx \approx Zx$ quickly, by building $V$ out of easy-to-compute matrices:
$V x = \frac{1}{\sigma \sqrt{d}}\, S H G \Pi H B\, x \qquad (4)$
Intuition: scramble a single Gaussian random vector many different ways, to create a matrix with similar properties to $Z$. [Drawn on board.] $H G \Pi H B$ produces pseudo-random Gaussian vectors (of identical length); $S$ fixes the lengths to have the correct distribution.
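A rough dense-matrix sketch of the structure of Eq. (4), not the actual Fastfood implementation: it forms V explicitly (the real algorithm never does, see the next slide), builds H with MATLAB's hadamard rather than applying a fast transform, and absorbs the 1/(sigma*sqrt(d)) constant into an explicit row normalization; the exact distribution of S in the paper differs in detail from the simple recipe used here.

% Structure of the Fastfood product V*x (dense and illustrative only).
d = 8; sigma = 1.0;
x = randn(d, 1);
B = diag(sign(randn(d, 1)));       % random +/-1 diagonal
P = eye(d); P = P(randperm(d), :); % random permutation matrix (Pi)
G = diag(randn(d, 1));             % random Gaussian diagonal
H = hadamard(d);                   % Walsh-Hadamard matrix
M = H*G*P*H*B;                     % pseudo-random rows, all of identical length
Mn = M / norm(M(1, :));            % normalize the common row length away
s = sqrt(sum(randn(d, d).^2, 2));  % lengths of d standard Gaussian d-vectors
V = diag(s) * Mn / sigma;          % S fixes the row lengths, as on the slide
y = V * x;                         % behaves like (Z/sigma)*x for a Gaussian Z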

Fastfood results Computing V x: We never store V!

Fastfood results Regression MSE:

The Trouble with Kernels (Smola). Kernel expansion: $f(x) = \sum_{i=1}^m \alpha_i k(x_i, x)$. Feature expansion: $f(x) = \sum_{i=1}^T w_i \phi_i(x)$. $D$ dimensions, $T$ features, $N$ samples.

Method         | Train Time    | Test Time    | Train Mem    | Test Mem
Naive          | O(N^2 D)      | O(ND)        | O(ND)        | O(ND)
Low Rank       | O(NTD)        | O(TD)        | O(TD)        | O(TD)
Kitchen Sinks  | O(NTD)        | O(TD)        | O(TD)        | O(TD)
Fastfood       | O(NT log(D))  | O(T log(D))  | O(T log(D))  | O(T)

Can run on a phone.

Feature transforms of more interesting kernels. In high dimensions, all Euclidean distances become the same unless the data lie on a manifold, so we usually need a structured kernel. We can do any stationary kernel (Matérn, rational-quadratic) with Fastfood.
Sum of kernels is concatenation of features: $k^{(+)}(x, x') = k^{(1)}(x, x') + k^{(2)}(x, x') \;\Rightarrow\; \Phi^{(+)}(x) = \begin{bmatrix} \Phi^{(1)}(x) \\ \Phi^{(2)}(x) \end{bmatrix}$
Product of kernels is outer product of features: $k^{(\times)}(x, x') = k^{(1)}(x, x')\, k^{(2)}(x, x') \;\Rightarrow\; \Phi^{(\times)}(x) = \Phi^{(1)}(x)\, \Phi^{(2)}(x)^\top$
For example, we can build a translation-invariant kernel, with $f((x_1, x_2, \ldots, x_D)) = f((x_2, \ldots, x_D, x_1))$ (5), via
$k\big( (x_1, x_2, \ldots, x_D),\, (x'_1, x'_2, \ldots, x'_D) \big) = \sum_{i=1}^{D} \prod_{j=1}^{D} k(x_j, x'_{i+j \bmod D}) \qquad (6)$
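A small sketch of the two composition rules, using real [cos; sin] random features (an illustrative choice; nothing here is specific to the slides' kernels):

% Sum and product kernels via feature-map composition (illustrative features).
T = 100; D = 5;
x = randn(D, 1); y = randn(D, 1);
W1 = randn(T, D); W2 = randn(T, D)/2;            % two sets of random frequencies
feat = @(W, z) [cos(W*z); sin(W*z)] / sqrt(size(W, 1));
k1 = feat(W1, x)'*feat(W1, y);  k2 = feat(W2, x)'*feat(W2, y);
phis = @(z) [feat(W1, z); feat(W2, z)];               % concatenation -> sum kernel
phip = @(z) reshape(feat(W1, z)*feat(W2, z)', [], 1); % outer product -> product kernel
fprintf('sum:     %.6f = %.6f\n', phis(x)'*phis(y), k1 + k2);
fprintf('product: %.6f = %.6f\n', phip(x)'*phip(y), k1*k2);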

Takeaways. Random Projections preserve Euclidean distances while reducing dimensionality. Random Features allow for nonlinear mappings. RKS can approximate the GP posterior quickly. Fastfood can compute $T$ nonlinear basis functions in $O(T \log D)$ time, and can operate on structured kernels. Gourmet cuisine (exact inference) is nice, but often fastfood is good enough.