Workshop on Web-scale Vision and Social Media, ECCV 2012, Firenze, Italy, October 2012. Linearized Smooth Additive Classifiers


1 Workshop on Web-scale Vision and Social Media, ECCV 2012, Firenze, Italy, October 2012. Linearized Smooth Additive Classifiers. Subhransu Maji, Research Assistant Professor, Toyota Technological Institute at Chicago.

2 Generalized Additive Models. $f(x_1, x_2, \ldots, x_n) = f_1(x_1) + f_2(x_2) + \ldots + f_n(x_n)$. Why use them? Efficiency: they can be evaluated efficiently. Interpretability: a simple generalization of linear classifiers, which may lead to interpretable models. Well known in the statistics community: Generalized Additive Models (Hastie & Tibshirani 90). However, traditional learning algorithms (e.g. the backfitting algorithm) do not scale well.
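
Concretely (a minimal sketch, not from the slides): because each $f_i$ depends on a single coordinate, an additive classifier can be evaluated with one table lookup plus interpolation per dimension. The `tables` array and the [0, 1] input range below are illustrative assumptions.

```python
import numpy as np

def eval_additive(x, tables, lo=0.0, hi=1.0):
    """Evaluate f(x) = sum_i f_i(x_i), where each f_i is stored as a
    lookup table of values on a uniform grid over [lo, hi].

    x      : (d,) input vector
    tables : (d, b) array; tables[i] samples f_i at b uniform points
    """
    d, b = tables.shape
    # fractional position of each coordinate on its grid
    t = np.clip((x - lo) / (hi - lo), 0.0, 1.0) * (b - 1)
    j = np.minimum(t.astype(int), b - 2)    # left bin index
    frac = t - j                            # position inside the bin
    rows = np.arange(d)
    # linear interpolation between adjacent table entries, then sum
    return np.sum((1 - frac) * tables[rows, j] + frac * tables[rows, j + 1])

# toy usage: 5 dimensions, 20-point tables of random piecewise-linear f_i
rng = np.random.default_rng(0)
tables = rng.standard_normal((5, 20))
print(eval_additive(rng.random(5), tables))
```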

3 Additive kernels in computer vision. Images are represented as histograms of low-level features such as color and texture [Swain and Ballard 91, Odone et al. 05]. Histogram-based similarity measures are typically additive: $K_{\min}(x, y) = \sum_i \min(x_i, y_i)$ and $K_{\chi^2}(x, y) = \sum_i \frac{2 x_i y_i}{x_i + y_i}$. Other examples of additive kernels based on approximate correspondence: Pyramid Match Kernel (Grauman and Darrell, CVPR 05); Spatial Pyramid Match Kernel (Lazebnik, Schmid and Ponce, CVPR 06).
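
For concreteness, here is a small numpy sketch of the two kernels above on histogram features (the `eps` guard against empty bins is an added assumption):

```python
import numpy as np

def k_min(x, y):
    """Histogram intersection kernel: sum_i min(x_i, y_i)."""
    return np.minimum(x, y).sum()

def k_chi2(x, y, eps=1e-12):
    """(Additive) chi-squared kernel: sum_i 2 x_i y_i / (x_i + y_i)."""
    return (2 * x * y / (x + y + eps)).sum()

# toy usage on two L1-normalized histograms
rng = np.random.default_rng(0)
x, y = rng.random(8), rng.random(8)
x, y = x / x.sum(), y / y.sum()
print(k_min(x, y), k_chi2(x, y))
```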

4 Directly learning additive classifiers. $f(x_1, x_2, \ldots, x_n) = f_1(x_1) + f_2(x_2) + \ldots + f_n(x_n)$. An SVM-like optimization framework: $\min_{f \in \mathcal{F}} \sum_k \ell(y_k, f(x_k)) + \lambda R(f)$, with data $x_k \in \mathbb{R}^d$ and labels $y_k \in \{+1, -1\}$. A loss function on the data, e.g. the hinge loss $\ell(y_k, f(x_k)) = \max(0, 1 - y_k f(x_k))$, and a regularizer, e.g. a derivative norm.

5 Strategy for learning additive classifiers. $\min_{f \in \mathcal{F}} \sum_k \ell(y_k, f(x_k)) + \lambda R(f)$, with $x_k \in \mathbb{R}^d$, $y_k \in \{+1, -1\}$.

6 Strategy for learning additive classifiers. $\min_{f \in \mathcal{F}} \sum_k \ell(y_k, f(x_k)) + \lambda R(f)$, with $x_k \in \mathbb{R}^d$, $y_k \in \{+1, -1\}$. Basis expansion: $f_i = \sum_j w_{ij} \phi_{ij}$.

7 Strategy for learning additive classifiers. $\min_{f \in \mathcal{F}} \sum_k \ell(y_k, f(x_k)) + \lambda R(f)$, with $x_k \in \mathbb{R}^d$, $y_k \in \{+1, -1\}$. Basis expansion: $f_i = \sum_j w_{ij} \phi_{ij}$, with a smoothness penalty on the weights.

8 Strategy for learning additive classifiers. $\min_{f \in \mathcal{F}} \sum_k \ell(y_k, f(x_k)) + \lambda R(f)$, with $x_k \in \mathbb{R}^d$, $y_k \in \{+1, -1\}$. Basis expansion: $f_i = \sum_j w_{ij} \phi_{ij}$, with a smoothness penalty on the weights. Search for representations of the function and regularizers for which the optimization can be solved efficiently. In particular, we want to leverage the latest methods for learning linear classifiers (almost linear-time algorithms).

9 Spline embeddings: motivation. $f(x_1, x_2, \ldots, x_n) = f_1(x_1) + f_2(x_2) + \ldots + f_n(x_n)$, with each $f_i = \sum_j w_j \phi_j$. Represent each function using a uniformly spaced spline basis. Motivated by our earlier analysis that splines approximate additive classifiers well. Popularized by Eilers and Marx (P-Splines 02, 04) for additive modeling; well known in graphics (de Boor 01). Question: what is the regularization on w? [Figure from Eilers & Marx 04]

10 Spline embeddings: representation. Representation: $f(x) = w^T \phi(x)$. Project the data onto a uniformly spaced spline basis. [Panels: linear, quadratic, and cubic B-spline bases]
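
A minimal sketch of this projection for the degree-1 (hat function) case, assuming inputs in [0, 1] and a uniform grid; only two entries per input coordinate are non-zero:

```python
import numpy as np

def spline_encode(x, n_bins=10):
    """Encode each coordinate of x with n_bins linear B-spline (hat)
    basis functions on a uniform grid over [0, 1]. Returns the
    concatenated embedding phi(x); at most 2 entries per coordinate
    are non-zero."""
    x = np.clip(x, 0.0, 1.0)
    t = x * (n_bins - 1)                      # position on the grid
    j = np.minimum(t.astype(int), n_bins - 2)
    frac = t - j
    phi = np.zeros((x.size, n_bins))
    phi[np.arange(x.size), j] = 1 - frac      # left hat function
    phi[np.arange(x.size), j + 1] = frac      # right hat function
    return phi.ravel()

print(spline_encode(np.array([0.23, 0.9]), n_bins=5))
```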

11 Spline embeddings: regularization. Representation: $f(x) = w^T \phi(x)$. Regularization: $R(f) = w^T H w$ with $H = D_d^T D_d$ ($D_0 = I$). This penalizes differences between adjacent weights and ensures that the learned function is smooth. Similar to the penalized splines of Eilers and Marx 02.

12 Spline embeddings: optimization. Representation: $f(x) = w^T \phi(x)$. Regularization: $R(f) = w^T H w$, $H = D_d^T D_d$. Optimization: $c(w) = \frac{\lambda}{2} w^T D_d^T D_d w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T \phi(x_k)\right)$.
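
To make the objective concrete, the following sketch builds the order-d difference matrix $D_d$ and evaluates $c(w)$; the exact conventions (e.g. dropping a row with each differencing) are assumptions in the spirit of Eilers & Marx:

```python
import numpy as np

def diff_matrix(n_basis, order):
    """Order-d difference operator D_d; order=0 returns the identity."""
    D = np.eye(n_basis)
    for _ in range(order):
        D = np.diff(D, axis=0)          # row i becomes w[i+1] - w[i]
    return D

def objective(w, Phi, y, lam, order):
    """c(w) = lam/2 * w' D'D w + mean hinge loss, as on the slide."""
    D = diff_matrix(w.size, order)
    hinge = np.maximum(0.0, 1.0 - y * (Phi @ w))
    return 0.5 * lam * w @ (D.T @ D @ w) + hinge.mean()

D1 = diff_matrix(5, 1)
print(D1)          # penalizes differences between adjacent weights
print(D1.T @ D1)   # the smoothness penalty H

Phi = np.array([[0.2, 0.8, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.5, 0.5, 0.0]])
y = np.array([1, -1])
print(objective(np.zeros(5), Phi, y, lam=0.1, order=1))  # 1.0: pure hinge
```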

13 Spline embeddings: linear case. Representation: $f(x) = w^T \phi(x)$. Regularization: zero-order differences, $D_0 = I$ [Maji and Berg, ICCV 09]. Reduces to a standard linear SVM; the projected features are sparse, with at most d + 1 non-zero entries per coordinate for a basis of degree d. Optimization: $c(w) = \frac{\lambda}{2} w^T w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T \phi(x_k)\right)$.
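
Since the zero-order case is a plain linear SVM on the embedded features, any fast linear solver applies. A sketch on toy data using scikit-learn's LinearSVC (the encoder repeats the earlier hat-function sketch so the block is self-contained):

```python
import numpy as np
from sklearn.svm import LinearSVC

def spline_encode(x, n_bins=10):
    """Degree-1 B-spline (hat) embedding of x in [0, 1]^d; at most two
    non-zero entries per input coordinate."""
    t = np.clip(x, 0.0, 1.0) * (n_bins - 1)
    j = np.minimum(t.astype(int), n_bins - 2)
    frac = t - j
    phi = np.zeros((x.size, n_bins))
    rows = np.arange(x.size)
    phi[rows, j] = 1.0 - frac
    phi[rows, j + 1] = frac
    return phi.ravel()

rng = np.random.default_rng(0)
X = rng.random((200, 5))                             # toy data in [0, 1]^5
y = np.where(np.sin(4 * X[:, 0]) + X[:, 1] > 1.0, 1, -1)

Phi = np.stack([spline_encode(x) for x in X])        # sparse in practice
clf = LinearSVC(C=1.0).fit(Phi, y)                   # standard linear SVM
print("train accuracy:", clf.score(Phi, y))
```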

14 Spline embeddings: general case. Representation: $f(x) = w^T \phi(x)$. Regularization: penalize first-order differences; a non-standard SVM due to the regularization [Eilers & Marx 02]. Can offer better smoothness when some bins have few data points. Optimization: $c(w) = \frac{\lambda}{2} w^T D_1^T D_1 w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T \phi(x_k)\right)$.

15 Spline embeddings: linearization... Original problem: $c(w) = \frac{\lambda}{2} w^T D_d^T D_d w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T \phi(x_k)\right)$.

16 Spline embeddings: linearization... Original problem: $c(w) = \frac{\lambda}{2} w^T D_d^T D_d w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T \phi(x_k)\right)$. The matrix $D_1^{-T}$ is upper triangular. Re-parameterization of the weights: $c(w) = \frac{\lambda}{2} w^T w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T D_1^{-T} \phi(x_k)\right)$.

17 Spline embeddings: linearization... Equivalently a change of basis: $\tilde{\phi}(x) = D_1^{-T} \phi(x)$, with $D_1^{-T}$ upper triangular. Original problem: $c(w) = \frac{\lambda}{2} w^T D_d^T D_d w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T \phi(x_k)\right)$. Re-parameterization of the features: $c(w) = \frac{\lambda}{2} w^T w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T D_1^{-T} \phi(x_k)\right)$.

18 Spline embeddings: linearization. The change of basis $\tilde{\phi}(x) = D_1^{-T} \phi(x)$ corresponds to an implicit kernel $K(x, y) = \phi(x)^T (D_1^T D_1)^{-1} \phi(y)$. Original problem: $c(w) = \frac{\lambda}{2} w^T D_d^T D_d w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T \phi(x_k)\right)$. Re-parameterization of the features: $c(w) = \frac{\lambda}{2} w^T w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T D_1^{-T} \phi(x_k)\right)$.

19 Spline embeddings: visualizing the kernel. Various bases and regularizations. Fixing order(regularization) = 1 and varying degree(basis) yields smooth variants of the min kernel.
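
One can reproduce this picture numerically. The sketch below assumes $D_1$ is made square and invertible by keeping an identity row ahead of the differences (one common convention) and evaluates the implied kernel $\phi(x)^T (D_1^T D_1)^{-1} \phi(y)$ on a 1-D grid; up to grid effects it tracks $\min(x, y)$:

```python
import numpy as np

n = 50                                 # number of basis functions
# square, invertible first-order difference operator:
# first row keeps w_0, remaining rows take adjacent differences
D1 = np.eye(n) - np.eye(n, k=-1)
H_inv = np.linalg.inv(D1.T @ D1)

def encode(x, n=n):
    """Linear B-spline (hat) embedding of a scalar x in [0, 1]."""
    t = np.clip(x, 0, 1) * (n - 1)
    j = min(int(t), n - 2)
    phi = np.zeros(n)
    phi[j], phi[j + 1] = 1 - (t - j), t - j
    return phi

xs = np.linspace(0, 1, 6)
K = np.array([[encode(a) @ H_inv @ encode(b) for b in xs] for a in xs])
print(np.round(K / (n - 1), 2))        # approximately min(x, y) on the grid
print(np.round(np.minimum.outer(xs, xs), 2))
```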

20 Spline embeddings: visualizing the basis. Various bases and regularizations. $\phi_i(x) = (x - x_i)_+^r$. The basis is equivalent to a truncated polynomial basis if degree(basis) - order(regularization) = 1 [de Boor 01].
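
As an illustration of the B-spline / truncated-polynomial connection (a sketch; the grid values are chosen arbitrarily): a linear B-spline hat function is a scaled second difference of truncated linear functions $(x - x_i)_+$:

```python
import numpy as np

# A hat function on a uniform grid as a second difference of
# truncated powers (x - x_i)_+, per the de Boor construction.
h = 0.1
knots = np.array([0.3, 0.4, 0.5])            # x_{i-1}, x_i, x_{i+1}
x = np.linspace(0, 1, 11)
tp = lambda c: np.maximum(x - c, 0.0)        # truncated linear (x - c)_+
hat = (tp(knots[0]) - 2 * tp(knots[1]) + tp(knots[2])) / h
print(np.round(hat, 2))   # 0 outside [0.3, 0.5], peak 1 at x = 0.4
```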

21 Solving the optimization efficiently. Original problem (non-standard regularization): $c(w) = \frac{\lambda}{2} w^T D_d^T D_d w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T \phi(x_k)\right)$.

22 Solving the optimization efficiently. Original problem (non-standard regularization): $c(w) = \frac{\lambda}{2} w^T D_d^T D_d w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T \phi(x_k)\right)$. Modified problem (dense features, a memory bottleneck): $c(w) = \frac{\lambda}{2} w^T w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T D_d^{-T} \phi(x_k)\right)$.

23 Solving the optimization efficiently. Original problem: non-standard regularization. Modified problem: dense features (memory bottleneck). Solution: an implicit representation. Maintain $w_d = D_d^{-1} w$; classification $f(x) = w_d^T \phi(x)$ costs O(d) vs O(n) per coordinate.

24 Solving the optimization efficiently. Original problem: non-standard regularization. Modified problem: dense features (memory bottleneck). Solution: an implicit representation. Maintain $w_d = D_d^{-1} w$; classification: $f(x) = w_d^T \phi(x)$, O(d) vs O(n); update: each gradient step updates $w_d$ through $(D_d^T D_d)^{-1} \phi(x_k)$.

25 Solving the optimization efficiently: computing the updates. Maintain $w_d = D_d^{-1} w$; classification: $f(x) = w_d^T \phi(x)$; update: $w_d$ via $(D_d^T D_d)^{-1} \phi(x_k)$.

26 Solving the optimization efficiently: computing the updates. Maintain $w_d = D_d^{-1} w$; classification: $f(x) = w_d^T \phi(x)$; update: $w_d$ via $(D_d^T D_d)^{-1} \phi(x_k)$. Given n linearly spaced basis functions of degree d, computing the updates to $w_d$ takes O(n²) time.

27 Solving the optimization efficiently: computing the updates. Maintain $w_d = D_d^{-1} w$; classification: $f(x) = w_d^T \phi(x)$; update: $w_d$ via $(D_d^T D_d)^{-1} \phi(x_k)$. Given n linearly spaced basis functions of degree d, computing the updates to $w_d$ takes O(n²) time. However, one can compute them in O(nd) by exploiting the structure of D (see the paper below). S. Maji, Linearized Smooth Additive Classifiers, ECCV Workshops 2012.

28 Solving the optimization efficiently: computing the updates. Maintain $w_d = D_d^{-1} w$; classification: $f(x) = w_d^T \phi(x)$; update: $w_d$ via $(D_d^T D_d)^{-1} \phi(x_k)$. Given n linearly spaced basis functions of degree d, computing the updates to $w_d$ takes O(n²) time. However, one can compute them in O(nd) by exploiting the structure of D (see the paper below). Hence these classifiers can be trained with very low memory overhead without compromising training time. S. Maji, Linearized Smooth Additive Classifiers, ECCV Workshops 2012.
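
For the first-order case the structure is especially simple. Assuming the invertible $D_1$ convention from the earlier sketch, applying $(D_1^T D_1)^{-1}$ is just two cumulative sums, one substitution pass per triangular factor; repeating such passes for higher-order $D_d$ is what yields the O(nd) bound:

```python
import numpy as np

def solve_DtD(phi):
    """Return z with (D1' D1) z = phi, where D1 is the invertible
    first-order difference matrix (identity row plus differences).
    Two triangular solves, each a single pass: O(n) instead of O(n^2)."""
    u = np.cumsum(phi[::-1])[::-1]   # back-substitution for D1' u = phi
    return np.cumsum(u)              # forward substitution for D1 z = u

# check against the dense inverse on a small example
n = 8
D1 = np.eye(n) - np.eye(n, k=-1)
phi = np.random.default_rng(0).standard_normal(n)
assert np.allclose(solve_DtD(phi), np.linalg.solve(D1.T @ D1, phi))

# an SGD-style update of the implicit weights w_d then costs O(n):
# w_d = (1 - eta * lam) * w_d + eta * y * solve_DtD(phi_x)
```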

29 Spline embeddings: computational tradeoffs. [Grid: regularization order ($D_0$, $D_1$, $D_2$) vs. spline degree (linear, quadratic, cubic).]

30 Spline embeddings: computational tradeoffs. With $D_0$ regularization and a standard linear solver, increasing the spline degree (linear, quadratic, cubic) increases accuracy (90.23%, 90.34%, 90.39%) while training and test time grow linearly (…, 7.26s, 10.08s). Experiments on the DC pedestrian dataset (n = 20); IKSVM training time: 360s.

31 Spline embeddings: computational tradeoffs. With a custom solver for order ≥ 1 and varying the regularization order, accuracy peaks at $D_1$ ($D_2$: 89.06%; $D_1$: 91.20%; $D_0$: 5.96s, 90.23%), which suggests that first-order smoothness is sufficient. Training time increases linearly; test time is constant. Experiments on the DC pedestrian dataset (n = 20); IKSVM training time: 360s.

32 Spline embeddings: computational tradeoffs. Useful regime for most datasets. [Grid: regularization order vs. spline degree.]

33 Spline embeddings: computational tradeoffs. Useful regime for most datasets. [Grid annotations: most accurate; closely approximates IKSVM; fastest.]

34 Spline embeddings: computational tradeoffs. Useful regime for most datasets. [Grid annotations: most accurate; closely approximates IKSVM; fastest.] Experiments on MNIST, INRIA pedestrians, Caltech 101, DC pedestrians, etc. All of these are still an order of magnitude faster than a traditional SVM solver, and they have the same memory overhead as linear SVMs since the features are computed online.

35 General basis expansion. $f(x_1, x_2, \ldots, x_n) = f_1(x_1) + f_2(x_2) + \ldots + f_n(x_n)$, with $f_i = \sum_j w_j \phi_j$. So far local splines were the basis; in general one can choose any orthonormal basis.

36 Fourier embeddings: representation. Representation: $f(x) = \sum_i w_i \phi_i(x)$, where $\phi_1(x), \ldots, \phi_n(x)$ is an orthonormal basis: $\int_a^b \phi_i(x)\, \phi_j(x)\, \omega(x)\, dx = \delta_{i,j}$. Examples: trigonometric functions, wavelets, etc.

37 Fourier embeddings: regularization. Representation: $f(x) = \sum_i w_i \phi_i(x)$ with an orthonormal basis. Regularization: penalize the d-th order derivative: $R(f) = \int_a^b \left(f^{(d)}(x)\right)^2 \omega(x)\, dx = \sum_{i,j} w_i w_j \int_a^b \phi_i^{(d)}(x)\, \phi_j^{(d)}(x)\, \omega(x)\, dx = w^T H w$, where $H_{i,j} = \int_a^b \phi_i^{(d)}(x)\, \phi_j^{(d)}(x)\, \omega(x)\, dx$. Similar to the spline case (with a different basis); requires the basis to be differentiable.

38 Fourier embeddings: optimization. Representation: $f(x) = \sum_i w_i \phi_i(x)$ with an orthonormal basis. Regularization: $R(f) = w^T H w$, $H_{i,j} = \int_a^b \phi_i^{(d)}(x)\, \phi_j^{(d)}(x)\, \omega(x)\, dx$. Optimization: $c(w) = \frac{\lambda}{2} w^T H w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T \phi(x_k)\right)$.

39 Fourier embeddings: optimization. $c(w) = \frac{\lambda}{2} w^T H w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T \phi(x_k)\right)$. $H$ is not diagonal, so we cannot directly use fast linear solvers; in general it is not structured either (unlike splines).

40 Fourier embeddings: optimization. $c(w) = \frac{\lambda}{2} w^T H w + \frac{1}{n} \sum_k \max\left(0, 1 - y_k w^T \phi(x_k)\right)$. $H$ is not diagonal, so we cannot directly use fast linear solvers; in general it is not structured either (unlike splines). Practical solution: pick an orthogonal basis with orthogonal derivatives. The cross terms disappear, i.e., $H$ is diagonal again: $\int_a^b \left(f^{(d)}(x)\right)^2 \omega(x)\, dx = \sum_{i,j} w_i w_j \int_a^b \phi_i^{(d)}(x)\, \phi_j^{(d)}(x)\, \omega(x)\, dx \propto \sum_i w_i^2$.
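
A quick numerical check of this property for one concrete choice (the cosine basis $\sqrt{2}\cos(j\pi x)$ on [0, 1] with uniform weight, an assumed instance):

```python
import numpy as np

# Check that the derivatives of sqrt(2) cos(j pi x) on [0, 1] are
# mutually orthogonal, so the penalty matrix H is diagonal.
xs = np.linspace(0.0, 1.0, 20001)
dx = xs[1] - xs[0]
J = 4
dphi = np.stack([-np.sqrt(2) * j * np.pi * np.sin(j * np.pi * xs)
                 for j in range(1, J + 1)])
H = dphi @ dphi.T * dx                   # Riemann-sum approximation
print(np.round(H, 2))                    # ~ diag((pi j)^2); off-diagonals ~ 0
```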

41 Fourier embeddings: optimization. Practical solution: pick an orthogonal basis with orthogonal derivatives, so the cross terms disappear and $H$ is diagonal again. Examples: trigonometric functions; one of the Jacobi, Laguerre, or Hermite polynomials. M. Webster, Orthogonal polynomials with orthogonal derivatives, Mathematische Zeitschrift, 39, 1935.

42 Fourier embeddings: two practical ones. Two families of orthogonal bases with orthogonal derivatives: embeddings that penalize the first and second order derivatives. Learning: project the data onto the first few basis functions and use a linear solver such as LIBLINEAR. Fourier features are low-dimensional and dense, as opposed to spline features, which are high-dimensional but sparse.
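
A minimal sketch of such an embedding, assuming the cosine basis above (the paper's exact basis and scaling may differ); scaling frequency $j$ by $1/(\pi j)$ folds the diagonal first-derivative penalty into a standard $\|w\|^2$ regularizer, so the encoded features can go straight into LIBLINEAR or any linear SVM:

```python
import numpy as np

def fourier_encode(x, n_freq=8):
    """Encode each coordinate of x in [0, 1] with a truncated cosine
    basis, scaling frequency j by 1/(pi j) so that a plain ||w||^2
    penalty on the linear weights acts like a first-derivative
    (smoothness) penalty on the learned 1-D functions."""
    j = np.arange(1, n_freq + 1)                 # frequencies
    # shape (d, n_freq): sqrt(2) cos(pi j x_i) / (pi j)
    phi = np.sqrt(2) * np.cos(np.pi * j * x[:, None]) / (np.pi * j)
    return phi.ravel()

x = np.random.default_rng(0).random(5)
z = fourier_encode(x)        # dense, low-dimensional: 5 * 8 = 40 dims
print(z.shape)               # feed z to any linear solver, e.g. LIBLINEAR
```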

43 Comparison of various additive classifiers: DC pedestrian dataset.

44 Comparison of various additive classifiers: MNIST dataset.

45 Comparison of various additive classifiers

46 Software. Code to train large-scale additive classifiers. Provides functions: train: takes (y, x) and outputs additive classifiers, with a choice of various encodings and regularizations; encodings are computed online; implements efficient weight updates for spline features. classify: takes a learned classifier and features, and outputs decision values. encode: returns encoded features which can be used directly with any linear solver. Download at:

47 Conclusions. We discussed methods to directly learn additive classifiers based on regularized loss minimization. We proposed two kinds of bases for which the learning problem can be solved efficiently: spline and Fourier embeddings. Spline embeddings are sparse, easy to compute, and can be used to learn classifiers with almost no memory overhead compared to learning linear classifiers. Fourier embeddings (trigonometric and Hermite) are low-dimensional but relatively expensive to compute, and hence useful in settings where the features can be stored in memory. More experimental details and code can be found on the author's website (ttic.uchicago.edu/~smaji).
