Fastfood: Approximating Kernel Expansions in Loglinear Time. Quoc Le, Tamas Sarlos, and Alex Smola. Presenter: Shuai Zheng (Kyle)

Large Scale Problem: ImageNet Challenge. Large scale data: the number of training examples is m = 1,200,000 in the ILSVRC 1000-class dataset. The dimension of the encoded features, d, is more than 20,000 for most algorithms. (You can get better performance with a bigger dictionary dimension, but you may run into memory limits.) The number of support vectors is usually n > 0.1 m.

SVM Tradeoffs

                      Linear kernel   Non-linear kernel
Training speed        Very fast       Very slow
Training scalability  Very high       Low
Testing speed         Very fast       Very slow
Testing accuracy      Lower           Higher

How do we get all the desirable characteristics at once? Additive kernels (e.g. Efficient Additive Kernels via Explicit Feature Maps, PAMI 2011). New direction: approximate the kernel with fake random Gaussian matrices (Fastfood).

When Kernel Methods Meet Large Scale Problems. In kernel methods, computing the decision function for large scale problems is expensive, especially at prediction time. So should we give up on nonlinear kernel methods altogether? No, we have a better approximate solution. The common alternative has been linear SVM + (features + complicated encoding (LLC, Fisher coding, group saliency coding) + sophisticated pooling (max pooling, average pooling, learned pooling)), and now neural networks.

High-dimensional Problems vs. Kernel Approximation. Kernel expansion: f(x) = <w, φ(x)> = Σ_{i=1}^{m} α_i <φ(x_i), φ(x)> = Σ_{i=1}^{m} α_i k(x_i, x). Here d is the input feature dimension, n is the number of nonlinear basis functions, and m is the number of samples.

Method                      CPU Training  RAM Training  CPU Test    RAM Test
Naive                       O(m^2 d)      O(md)         O(md)       O(md)
Reduced set                 O(m^2 d)      O(md)         O(nd)       O(nd)
Low rank                    O(mnd)        O(nd)         O(nd)       O(nd)
Random Kitchen Sinks (RKS)  O(mnd)        O(nd)         O(nd)       O(nd)
Fastfood                    O(mn log d)   O(n log d)    O(n log d)  O(n)

High-dimensional Problems vs. Fastfood. Kernel expansion: f(x) = <w, φ(x)> = Σ_{i=1}^{m} α_i <φ(x_i), φ(x)> = Σ_{i=1}^{m} α_i k(x_i, x). Here d is the input feature dimension, e.g. d = 20,000; n is the number of nonlinear basis functions, e.g. n = 120,000; m is the number of samples, e.g. m = 1,200,000.

Method                      CPU Training  RAM Training  CPU Test    RAM Test
Naive                       O(m^2 d)      O(md)         O(md)       O(md)
Reduced set                 O(m^2 d)      O(md)         O(nd)       O(nd)
Low rank                    O(mnd)        O(nd)         O(nd)       O(nd)
Random Kitchen Sinks (RKS)  O(mnd)        O(nd)         O(nd)       O(nd)
Fastfood                    O(mn log d)   O(n log d)    O(n log d)  O(n)

Random Kitchen Sinks (Rahimi & Recht, NIPS 2007). Given a training set {(x_i, y_i)}_{i=1}^{m}, the task is to fit a decision function f: X → R that minimizes the empirical risk R_emp(f) = (1/m) Σ_{i=1}^{m} l(f(x_i), y_i), where l(f(x_i), y_i) denotes the loss function, e.g. hinge loss. The decision function is f(x) = <w, φ(x)> = Σ_{i=1}^{m} α_i <φ(x_i), φ(x)> = Σ_{i=1}^{m} α_i k(x_i, x).

Random Kitchen Sinks (Rahimi & Recht, NIPS 2007). Most algorithms produce an approximate minimizer of the empirical risk by optimizing over both the basis functions and their weights: min_{w_1, ..., w_n; α} R_emp( Σ_j α_j φ(x; w_j) ).

Random Kitchen Sinks (Rahimi & Recht, NIPS 2007). Instead, RKS fixes the w_j by drawing them at random and approximates the decision function f via f(x) = Σ_{j=1}^{n} α_j φ(x; w_j), where φ(x; w_j) is a feature function obtained via randomization.
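To make the two-step recipe concrete (draw the w_j at random, then fit only the coefficients α by minimizing the empirical risk), here is a minimal NumPy sketch. The cosine features with random phases and the squared loss with ridge regularization are illustrative assumptions, not the slides' exact setting (which uses e.g. hinge loss and the complex exponential features introduced next); the helper name fit_rks is likewise hypothetical.

```python
# Minimal RKS-style fitting sketch (assumed details: cosine features, squared loss,
# ridge regularization). Only alpha is learned; the random w_j and b_j stay fixed.
import numpy as np

def fit_rks(X, y, n=512, sigma=1.0, lam=1e-3, rng=0):
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / sigma, size=(n, d))     # random directions w_j
    b = rng.uniform(0.0, 2 * np.pi, size=n)           # random phases
    Phi = np.sqrt(2.0 / n) * np.cos(X @ W.T + b)      # phi(x; w_j), shape (m, n)
    alpha = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)
    return lambda Xnew: np.sqrt(2.0 / n) * np.cos(Xnew @ W.T + b) @ alpha

# Usage: learn a noisy 1-D function and predict on new points.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=200)
f = fit_rks(X, y)
print(f(np.array([[0.5], [1.5]])))
```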

Random Kitchen Sinks (Rahimi & Recht, NIPS 2007). Rather than computing the Gaussian RBF kernel k(x, x') = exp(−‖x − x'‖² / (2σ²)) directly, this method computes random features φ_j(x) = exp(i [Zx]_j), drawing the rows z_i of Z from a normal distribution, so that the inner product of the feature maps approximates the kernel.

Randomization for Approximate Gaussian RBF Kernel Feature Maps
Input: data {x_i}_{i=1}^{m}, number of basis functions n, scale parameter σ².
Output: approximate Gaussian RBF kernel feature map φ(x).
1. Draw each entry of Z ∈ R^{n×d} i.i.d. from Normal(0, σ^{-2}).
2. For i = 1 to m do
3.   For j = 1 to n do
4.     Compute the empirical feature map φ_j(x_i) = (1/√n) exp(i [Zx_i]_j).
5.   End for
6. End for
Complexity: O(mnd)
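Below is a minimal NumPy sketch of this randomized feature map, vectorizing the two loops into a single matrix product. The function name rks_feature_map is hypothetical, and drawing the entries of Z with variance σ^{-2} follows the Gaussian RBF convention used in these slides.

```python
# Sketch of the O(mnd) Random Kitchen Sinks feature map (assumed helper name).
import numpy as np

def rks_feature_map(X, n, sigma, rng=None):
    """phi(x)_j = exp(i [Zx]_j) / sqrt(n) with Z a dense (n, d) Gaussian matrix."""
    rng = np.random.default_rng(rng)
    m, d = X.shape
    Z = rng.normal(0.0, 1.0 / sigma, size=(n, d))   # Z_ij ~ Normal(0, sigma^-2)
    return np.exp(1j * X @ Z.T) / np.sqrt(n)        # (m, n) complex feature map

# Sanity check: phi(x) . conj(phi(x')) should concentrate around the RBF kernel value.
rng = np.random.default_rng(0)
x, y = rng.normal(size=8), rng.normal(size=8)
Phi = rks_feature_map(np.stack([x, y]), n=20000, sigma=2.0, rng=1)
print((Phi[0] @ Phi[1].conj()).real)                 # approximate k(x, y)
print(np.exp(-np.sum((x - y) ** 2) / (2 * 2.0**2)))  # exact k(x, y)
```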

Fastfood. The dimension of the features is d and the number of basis functions is n. A dense Gaussian matrix costs O(nd) per multiplication. Assume d = 2^l (pad the vector with zeros until d = 2^l holds). The goal is to approximate Z by a product of diagonal and simple matrices: Z = (1/(σ√d)) S H G Π H B.

Fastfood. Assume d = 2^l (pad the vector with zeros until d = 2^l holds). The goal is to approximate Z by a product of diagonal and simple matrices: Z = (1/(σ√d)) S H G Π H B, where
S is a random diagonal scaling matrix;
H is the Walsh-Hadamard matrix, which admits an O(d log d) multiply: H_1 = [1], H_2 = [[1, 1], [1, −1]], and H_{2d} = [[H_d, H_d], [H_d, −H_d]];
G is a random diagonal Gaussian matrix;
Π ∈ {0, 1}^{d×d} is a random permutation matrix;
B is a diagonal matrix with random {−1, +1} entries on its diagonal.
Multiplication is now O(d log d) and storage is O(d). When n > d, draw independent blocks.

Fastfood. When n > d, we replicate Z = (1/(σ√d)) S H G Π H B for n/d independent random matrices Z_i and stack them via Z^T = [Z_1, Z_2, ..., Z_{n/d}] until we have enough dimensions.
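The sketch below assembles one Fastfood block V = (1/(σ√d)) S H G Π H B and stacks n/d independent blocks as described above, using a recursive fast Walsh-Hadamard transform. It is a sketch under assumptions: the helper names (fwht, fastfood_block, fastfood_features) are mine, and the scaling S (chi-distributed row lengths rescaled by ‖G‖_F so the rows mimic those of a Gaussian matrix) follows the paper's description, but the exact constants should be checked against the paper.

```python
# Hedged sketch of the Fastfood feature map; helper names and the exact S scaling
# are assumptions for illustration.
import numpy as np

def fwht(x):
    # Recursive fast Walsh-Hadamard transform, natural (Hadamard) ordering, O(d log d).
    d = len(x)
    if d == 1:
        return x
    a, b = fwht(x[: d // 2]), fwht(x[d // 2:])
    return np.concatenate([a + b, a - b])

def fastfood_block(d, rng):
    # Random ingredients of one d x d block V = (1/(sigma*sqrt(d))) S H G Pi H B.
    B = rng.choice([-1.0, 1.0], size=d)                         # diagonal of B
    Pi = rng.permutation(d)                                     # permutation Pi
    G = rng.normal(size=d)                                      # diagonal of G
    S = np.sqrt(rng.chisquare(d, size=d)) / np.linalg.norm(G)   # diagonal of S (assumed scaling)
    return B, Pi, G, S

def fastfood_features(x, n, sigma, rng=None):
    # phi(x)_j = exp(i [Vx]_j) / sqrt(n), with V built from ceil(n/d) stacked blocks.
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    d = len(x)                                  # assume d is a power of two (pad with zeros otherwise)
    out = []
    for _ in range(-(-n // d)):                 # ceil(n / d) independent blocks
        B, Pi, G, S = fastfood_block(d, rng)
        t = fwht(B * x)                         # H B x
        t = fwht(G * t[Pi])                     # H G Pi H B x
        out.append(S * t / (sigma * np.sqrt(d)))
    v = np.concatenate(out)[:n]                 # stacked [Z_1; Z_2; ...] x, truncated to n rows
    return np.exp(1j * v) / np.sqrt(n)

# Example: 16384 basis functions for a 1024-dimensional input, in O(n log d) per example.
phi = fastfood_features(np.random.default_rng(2).normal(size=1024), n=16384, sigma=1.0)
print(phi.shape)  # (16384,)
```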

Walsh-Hadamard Transform. The product of a Boolean function and a Walsh matrix is its Walsh spectrum [Wikipedia]: (1, 0, 1, 0, 0, 1, 1, 0) · H(8) = (4, 2, 0, −2, 0, 2, 0, 2).

Fast Walsh-Hadamard Transform. The fast Walsh-Hadamard transform is a faster, O(d log d) way to calculate the Walsh spectrum of (1, 0, 1, 0, 0, 1, 1, 0).
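As a quick check of the example above, the following snippet multiplies the Boolean vector by the order-8 Hadamard matrix (natural ordering) with SciPy; an O(d log d) butterfly implementation of the same transform is sketched in the Fastfood code above.

```python
import numpy as np
from scipy.linalg import hadamard

v = np.array([1, 0, 1, 0, 0, 1, 1, 0])
print(v @ hadamard(8))  # [ 4  2  0 -2  0  2  0  2]
```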

Key Observations. 1. When combined with diagonal Gaussian matrices, Hadamard matrices behave very similarly to dense Gaussian random matrices. 2. Hadamard matrices and diagonal matrices are inexpensive to multiply and store.

Key Observations. 1. When combined with diagonal Gaussian matrices, Hadamard matrices behave very similarly to dense Gaussian random matrices. 2. Hadamard matrices and diagonal matrices are inexpensive to multiply, O(d log d), and to store, O(d). Code: for C++/C users, there is a library that provides an extremely fast Walsh-Hadamard transform; for Matlab users, it is one line of code: y = fwht(x, n, ordering).

Experiments

d      n       Fastfood    RKS       Speedup  RAM
1024   16384   0.00058 s   0.0139 s  24x      256x
4096   32768   0.00136 s   0.1224 s  90x      1024x
8192   65536   0.00268 s   0.5360 s  200x     2048x

Runtime, speed, and memory improvements of Fastfood relative to Random Kitchen Sinks (RKS). d is the input feature dimension, e.g. a tiny 32×32×3 CIFAR-10 image has 3,072 dimensions; n is the number of nonlinear basis functions.

Summary. It is possible to compute n nonlinear basis functions in O(n log d) time. Kernel methods become more practical for problems that have large datasets and/or require real-time prediction. With Fastfood, we can compute nonlinear feature maps for large-scale problems.