The CRO Kernel: Using Concomitant Rank Order Hashes for Sparse High Dimensional Randomized Feature Maps


The CRO Kernel: Using Concomitant Rank Order Hashes for Sparse High Dimensional Randomized Feature Maps

Kave Eshghi and Mehran Kafai
Hewlett Packard Labs, 1501 Page Mill Rd., Palo Alto, CA

Abstract. Kernel methods have been shown to be effective for many machine learning tasks such as classification, clustering and regression. In particular, support vector machines with the RBF kernel have proved to be powerful classification tools. The standard way to apply kernel methods is to use the kernel trick, where the inner product of the vectors in the feature space is computed via the kernel function. Using the kernel trick for SVMs, however, leads to training that is quadratic in the number of input vectors and classification that is linear in the number of support vectors. We introduce a new kernel, the CRO (Concomitant Rank Order) kernel, that approximates the RBF kernel for unit length input vectors. We also introduce a new randomized feature map, based on concomitant rank order hashing, that produces sparse, high dimensional feature vectors whose inner product asymptotically equals the CRO kernel. Using the Hadamard transform for computing the CRO hashes ensures that the cost of computing feature vectors is very low. Thus, for unit length input vectors, we get the accuracy of the RBF kernel with the efficiency of a sparse high dimensional linear kernel. We show the efficacy of our approach using a number of standard datasets.

I. INTRODUCTION

Kernel methods have been shown to be effective for many machine learning tasks such as classification, clustering and regression. The theory behind kernel methods relies on a mapping between the input space and the feature space such that the inner product of the vectors in the feature space can be computed via the kernel function, aka the kernel trick. The kernel trick is used because a direct mapping to the feature space is expensive or, in the case of the RBF kernel, impossible, since the feature space is infinite dimensional.

The canonical example is Support Vector Machine (SVM) classification with the Gaussian kernel. It has been shown that for many types of data, its classification accuracy far surpasses that of linear SVMs. For example, on the MNIST [1] handwritten digit recognition dataset, SVM with the RBF kernel achieves an accuracy of 98.6%, whereas linear SVM only achieves 92.7%.

For SVMs, the main drawback of the kernel trick is that both training and classification can be expensive. Training is expensive because the kernel function must be applied to each pair of training samples, making the training task at least quadratic in the number of training samples. Classification is expensive because for each classification task the kernel function must be applied to each of the support vectors, whose number may be large. As a result, kernel SVMs are rarely used when the number of training instances is large or for online applications where classification must happen very fast. Many approaches have been proposed in the literature to overcome these efficiency problems with non-linear kernel SVMs.

The situation is very different with sparse, high dimensional input vectors and linear kernels. For this class of problems, there are efficient algorithms for both training and classification. One implementation is LIBLINEAR [2], from the same group that implemented LIBSVM [3]. When the data fits this model, e.g. for text classification, these algorithms are very effective. But when the data does not fit this model, e.g. for image classification, these algorithms are not particularly efficient and the classification accuracy is low.

We introduce a new kernel K_γ(A, B) defined as

    K_γ(A, B) = Φ₂(γ, γ, cos(A, B))    (1)

where γ is a constant, and Φ₂(x, y, ρ) is the CDF of the standard bivariate normal distribution with correlation ρ. We also introduce the randomized feature map F_{γ,Q}(A), where Q ∈ ℝ^{U×N} is an instance of an iid matrix of standard normal random variables. We prove that

    E[ F_{γ,Q}(A) · F_{γ,Q}(B) ] / U = K_γ(A, B)    (2)

To perform the mapping from the input space to the feature space, we use a variant of the concomitant rank order hash function [4], [5]. Relying on a result first presented in [5], we use a random permutation followed by a Hadamard transform to compute the random projection that is at the heart of this operation. The resulting algorithm for computing the feature map is highly efficient.

The proposed kernel K and feature map F have interesting properties:
- The kernel approximates the RBF kernel on the unit sphere.
- The feature map is sparse and high dimensional.
- The feature map can be computed very efficiently.

Thus, for the class of problems where this kernel is effective, we have the best of both worlds: the accuracy of the RBF kernel, and the efficiency of sparse, high dimensional linear models.

Along the way, we prove a new result in the theory of concomitant rank order statistics for bivariate normal distributions, given in Theorem 2 in the Appendix.

We show the efficacy of our approach, in terms of classification accuracy, training time, and classification time, on a number of standard datasets. We also make a detailed comparison with alternative approaches for randomized mapping to the feature space presented in [6], [7], [8].

II. RELATED WORK

Reducing the training and classification cost of non-linear SVMs has attracted a great deal of attention in the literature. Joachims et al. [9] use basis vectors other than support vectors to find sparse solutions that speed up training and prediction. Segata et al. [10] use local SVMs on redundant neighborhoods and choose the appropriate model at query time. In this way, they divide the large SVM problem into many small local SVM problems. Tsang et al. [11] re-formulate kernel methods as minimum enclosing ball (MEB) problems in computational geometry, and solve them via an efficient approximate MEB algorithm, leading to the idea of core sets. Nandan et al. [12] choose a subset of the training data, called the representative set, to reduce the training time. This subset is chosen using an algorithm based on convex hulls and extreme points.

A number of approaches compute approximations to the feature vectors and use linear SVM on these vectors. Chang et al. [13] do an explicit mapping of the input vectors into a low degree polynomial feature space, and then apply fast linear SVMs for classification. Vedaldi et al. [14] introduce explicit feature maps for the additive kernels, such as the intersection, Hellinger's, and χ² kernels. Weinberger et al. [15] use hashing to reduce the dimensionality of the input vectors. Litayem et al. [16] use hashing to reduce the size of the input vectors and speed up the prediction phase of linear SVM. Su et al. [17] use sparse projection to reduce the dimensionality of the input vectors while preserving the kernel function.

Rahimi et al. [6] map the input data to a randomized low-dimensional feature space using sinusoids randomly drawn from the Fourier transform of the kernel function. Le et al. [7] replace the random matrix proposed in [6] with an approximation that allows for fast multiplication. Pham et al. [8] introduce Tensor Sketching, a method for approximating polynomial kernels which relies on fast convolution of Count Sketches. Both [7] and [8] improve upon Rahimi's work [6] in terms of time and storage complexity [18]. Raginsky et al. [19] compute locality sensitive hashes where the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel. They use the results in [6] for this purpose.

III. THE CONCOMITANT RANK ORDER (CRO) KERNEL AND FEATURE MAP

A. Notation

1) Φ(x): We use Φ(x) to denote the CDF of the standard normal distribution N(0, 1), and φ(x) to denote its PDF:

    φ(x) = (1 / √(2π)) e^{−x²/2}    (3)
    Φ(x) = ∫_{−∞}^{x} φ(u) du    (4)

2) Φ₂(x, y, ρ): We use Φ₂(x, y, ρ) to denote the CDF of the standard bivariate normal distribution N( (0, 0), [[1, ρ], [ρ, 1]] ), and φ₂(x, y, ρ) to denote its PDF:

    φ₂(x, y, ρ) = (1 / (2π √(1 − ρ²))) exp( −(x² − 2ρxy + y²) / (2(1 − ρ²)) )    (5)
    Φ₂(x, y, ρ) = ∫_{−∞}^{x} ∫_{−∞}^{y} φ₂(u, v, ρ) du dv    (6)

3) R(x, A): We use R(x, A) to denote the rank of x in A, defined as follows:

Definition 1. For the scalar x and vector A ∈ ℝ^N, R(x, A) is the count of the elements of A which are less than or equal to x.

B. The CRO Kernel

Definition 2. Let A, B ∈ ℝ^N. Let γ ∈ ℝ. Then the kernel K_γ(A, B) is defined as:

    K_γ(A, B) = Φ₂(γ, γ, cos(A, B))    (7)

In order for K_γ(A, B) to be admissible as a kernel for support vector machines, it needs to satisfy the Mercer conditions. Theorem 3 in the Appendix proves this. In Proposition 4 in the Appendix we derive the following equation for K_γ(A, B):

    K_γ(A, B) = Φ²(γ) + ∫₀^{cos(A,B)} exp( −γ² / (1 + ρ) ) / ( 2π √(1 − ρ²) ) dρ    (8)
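As a concrete illustration of eq. (8), the following sketch evaluates K_γ(A, B) both directly from Definition 2 and via the integral formula. It is not code from the paper: the value of γ and the vectors A and B are arbitrary illustrative choices, and mvncdf and normcdf assume the Statistics and Machine Learning Toolbox is available.

% Minimal sketch (not from the paper): check eq. (8) against eq. (7).
gamma = -2.4617;                        % illustrative value of gamma
A = [1 2 3]'; B = [2 1 4]';             % arbitrary example vectors
c = dot(A, B) / (norm(A) * norm(B));    % cos(A, B)
% Eq. (7): K = Phi_2(gamma, gamma, cos(A,B))
K_def = mvncdf([gamma gamma], [0 0], [1 c; c 1]);
% Eq. (8): K = Phi(gamma)^2 + integral of the simplified density term
f = @(rho) exp(-gamma^2 ./ (1 + rho)) ./ (2 * pi * sqrt(1 - rho.^2));
K_int = normcdf(gamma)^2 + integral(f, 0, c);
fprintf('K via eq.(7): %.6f   K via eq.(8): %.6f\n', K_def, K_int);

The two values agree up to numerical integration error, which is a useful sanity check when implementing the kernel.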

3 Thus logφ γ, γ, 0 = logα 1 logφ γ, γ, 1 = logα 13 If we linearly approximate logφ γ, γ, ρ between ρ = 0 and ρ = 1, we get the following: logφ γ, γ, ρ logα ρ 14 Figure 1 shows the two sides of eq. 14 as ρ goes from 0 to 1. The choice of γ =.4617 is for a typical use case of the kernel. α is related to γ through eq log[φ γ, γ, ρ ] logα ρ ρ Fig. 1. Comparison of the two sides of eq. 14 for γ =.4617 From eq. 14 we get Φ γ, γ, ρ e logα ρ 15 αe logα1 ρ 16 Let ρ = cosa, B. Then from Definition and eq. 16 we get K γ A, B αe logα1 cosa,b 17 Let A, B R N be on the unit sphere, i.e. Then A = B = 1 18 A B = A + B A B cosa, B 19 = 1 cosa, B 0 Replacing the rhs of eq. 0 in eq. 17 we get K γ A, B αe logα A B 1 But the rhs of eq. 1 is the definition of an RBF kernel with parameter logα. Notice that the purpose of the comparison with the RBF kernel is to give us an intuition about K γ A, B, to enable us to make meaningful comparisons with implementations that use RBF. To this end, how close is the approximation to the RBF kernel? The following diagram shows the two sides of eq. 1 as cosa, B goes from 0 to 1. Notice that when cosa, B Fig.. Comparison of the two sides of eq. 1 for γ =.4617 is negative then the values of both kernels are very close to zero, so we ignore such cases. We will now describe the feature map that allows us to use the kernel K γ A, B for tasks such SVM training. IV. THE CONCOMITANT RANK ORDER CRO HASH FUNCTION AND FEATURE MAP A. The CRO hash function Definition 3. Let U, τ be positive integers, with τ < U. Let A R N and Q R U N. Let P = QA. Then H Q,τ A = {j : RP [j], P τ} where R is the rank function defined in Definition 1. We call H Q,τ A the CRO hash set of A with respect to Q and τ. In words, the hash set of A with respect to Q and τ is the set of indexes of those elements of P whose rank is less than or equal to τ. If P does not have any repetitions, the following Matlab code returns the hash set: [, ix] = sortp; H = ix[1:tau]; According to the above definition, the universe from which the hashes are drawn is 1... U, where U is the number of rows of Q. Theorem 1. Let M be a U N matrix of iid N 0, 1 random variables. Let Q be an instance of M. Let A, B R N. Let 0 < λ < 1 be a constant. Let τ = λu 1. Let γ = Φ 1 τ U 1 3 Then lim E HQ,τ A H Q,τ B = Φ γ, γ, cosa, B U 4 Proof: In the Appendix. B. The CRO Feature Map In this section we introduce the feature map F γ,q A : R N R U 5

B. The CRO Feature Map

In this section we introduce the feature map

    F_{γ,Q}(A) : ℝ^N → ℝ^U    (25)

This is the function that maps vectors from the input space to sparse, high dimensional vectors in the feature space.

Definition 4. Let A, B, Q, γ, τ, U be defined as in Theorem 1. The feature map F_{γ,Q}(A) : ℝ^N → ℝ^U is defined as follows:

    F_{γ,Q}(A)[j] = 1 if j ∈ H_{Q,τ}(A), and 0 if j ∉ H_{Q,τ}(A)    (26)

Proposition 1 establishes the relationship between the feature map F_{γ,Q}(A) and the kernel K_γ(A, B).

Proposition 1.

    E[ F_{γ,Q}(A) · F_{γ,Q}(B) ] / U = K_γ(A, B)    (27)

Proof: It is a simple matter to show that

    F_{γ,Q}(A) · F_{γ,Q}(B) = |H_{Q,τ}(A) ∩ H_{Q,τ}(B)|    (28)

By Theorem 1,

    E[ F_{γ,Q}(A) · F_{γ,Q}(B) ] / U = Φ₂(γ, γ, cos(A, B))    (29)

Thus, from Definition 2 and eq. (29),

    E[ F_{γ,Q}(A) · F_{γ,Q}(B) ] / U = K_γ(A, B)    (30)

C. The Feature Map is Binary and Sparse

From Definition 3 and Definition 4 it follows that:
- The total number of elements in F_{γ,Q}(A) is U.
- The number of non-zero elements in F_{γ,Q}(A) is τ.
- All the non-zero elements of F_{γ,Q}(A) are 1.

Recall that α = τ/U. We call α the sparsity of the feature map. As discussed previously, K_γ(A, B) approximates the RBF kernel α e^{(log α / 2)‖A − B‖²} on the unit sphere, where α is the sparsity of the feature map. It follows that the relationship between the RBF parameter and the sparsity of the feature map is exponential.

We can interpret eq. (30) as follows: F_{γ,Q}(A) and F_{γ,Q}(B) are the projections of A and B into the feature space, and their expected inner product is U K_γ(A, B). What is more, F_{γ,Q}(A) and F_{γ,Q}(B) are sparse and high dimensional. Thus, as long as we can efficiently compute F_{γ,Q}(A) and F_{γ,Q}(B), we can use algorithms optimized for sparse linear computations on binary vectors in the feature space. For example, when the task we are trying to perform is to train a support vector machine, instead of using the kernel trick in the input space, we can use linear SVM with sparse high dimensional vectors in the feature space, using highly efficient sparse linear SVM solvers such as LIBLINEAR [2] for training and classification. A small sketch of this construction follows.
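The following Matlab fragment (not from the paper) makes the binary, sparse structure of the feature map concrete: it builds F_{γ,Q}(A) and F_{γ,Q}(B) as sparse vectors from two hash sets and checks eq. (28). The universe size, τ, and the stand-in hash sets are illustrative assumptions.

% Minimal sketch (not from the paper): sparse binary feature vectors.
U = 2^16; tau = 500;                 % illustrative universe size and sparsity
HA = randperm(U, tau)';              % stand-in hash set for A (tau indices in 1..U)
HB = randperm(U, tau)';              % stand-in hash set for B
FA = sparse(HA, 1, 1, U, 1);         % F_{gamma,Q}(A): U x 1, with tau ones
FB = sparse(HB, 1, 1, U, 1);         % F_{gamma,Q}(B)
overlap = full(FA' * FB);            % equals numel(intersect(HA, HB)), as in eq. (28)
sparsity = nnz(FA) / U;              % equals tau / U, i.e. alpha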
D. Computing the CRO Hash Set

The computation of the projection vector P = QA requires U × N operations, which can be very expensive when N is large. To avoid this, we use the scheme described in [5] to compute the CRO hash set for the input vectors.

The CRO hash function maps input vectors to a set of hashes chosen from a universe 1 ... U, where U is a large integer. We use τ to denote the number of hashes that we require per input vector. Let A ∈ ℝ^N be the input vector. The hash function takes as a second input a random permutation Π of 1 ... U. It should be emphasized that the random permutation Π is chosen once and used for hashing all input vectors. Table I shows the procedure for computing the CRO hash set. Here we use −A to represent the vector A multiplied by −1, and (A, B, C, ...) to represent the concatenation of vectors A, B, C, etc.

TABLE I
COMPUTING THE CRO HASH SET FOR INPUT VECTOR A

1) Let Â = (A, −A).
2) Create a repeated input vector A′ as follows: A′ = (Â, Â, ..., Â, Â[1..r]), where Â is repeated d times, d = U div |Â|, r = U mod |Â|, and div represents integer division. Thus |A′| = d|Â| + r = U.
3) Apply the random permutation Π to A′ to get the permuted input vector V.
4) Calculate the Hadamard transform of V to get S. If an efficient implementation of the Hadamard transform is not available, we can use another orthogonal transform, for example the DCT.
5) Find the indices of the smallest τ members of S. These indices are the hash set of the input vector A.

Table II presents an implementation of the CRO hash function in Matlab.

V. EXPERIMENTS

We present the experiment results in two sections. In Section V-A we compare the proposed CRO feature map with other approaches that perform random feature mappings with respect to the transformation time. In Section V-B we evaluate CROSVM (CRO kernel + linear SVM) and compare it to LIBLINEAR [2], LIBSVM [3], Fastfood [7] and Tensor Sketch [8].

We have implemented a highly optimized Walsh-Hadamard transform function which we call within the CRO hash function. The built-in Matlab implementation of the Walsh-Hadamard transform, fwht, is slow, so we recommend using the DCT function or a faster implementation of the WHT.
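Since the relative speed of the two transforms depends on the Matlab installation, a quick timing comparison along the following lines can guide the choice between dct and fwht. This is not from the paper; both functions are assumed to be available in the Signal Processing Toolbox, and the vector length is an illustrative power of two.

% Minimal sketch (not from the paper): compare dct and fwht on one machine.
V = randn(2^16, 1);
t_dct  = timeit(@() dct(V));
t_fwht = timeit(@() fwht(V));
fprintf('dct: %.4f s   fwht: %.4f s\n', t_dct, t_fwht);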

TABLE II
MATLAB CODE FOR THE CRO HASH FUNCTION

function hashes = CROHash(A, U, P, tau)
% A   is the input vector.
% U   is the size of the hash universe.
% P   is a random permutation of 1:U,
%     chosen once and for all and used
%     in all hash calculations.
% tau is the desired number of hashes.
E = zeros(1, U);
AHat = [A, -A];
N = length(AHat);
d = floor(U / N);
for i = 0:d-1
    E(i*N+1 : (i+1)*N) = AHat;
end
Q = E(P);
% If an efficient implementation of
% the Walsh-Hadamard transform is
% available, we can use it instead, i.e.
% S = fwht(Q);
S = dct(Q);
[~, ix] = sort(S);
hashes = ix(1:tau);

A. Transformation time comparison

The random feature mapping introduced by Rahimi et al. [6] works as follows. A projection matrix P is created where each row of P is drawn randomly from the Fourier transform of the kernel function. To calculate the feature vector for the input vector A, A_p = PA is computed. The feature vector Φ(A) is a vector whose length is twice the length of A_p, where for each coordinate i of A_p, Φ(A)[2i] = cos(A_p[i]) and Φ(A)[2i + 1] = sin(A_p[i]). The authors show that the expected inner product of the feature vectors is the kernel function applied to the input vectors.

Our approach has the following two advantages over the work by Rahimi et al. [6]:
1) We can use the Hadamard transform to compute the feature vectors, whereas they need to do the projection through multiplying the input vector with the projection matrix. This can be more expensive if the input vector is relatively high dimensional.
2) We generate a sparse feature vector (only τ entries out of U in the feature vector are non-zero, where τ << U), whereas their feature vector is dense. Having sparse feature vectors can be advantageous in many circumstances; for example, many training algorithms converge much more quickly with sparse feature vectors.

More recently, Fastfood [7] and Tensor Sketch [8] proposed approaches for random feature mappings which improved upon the work by Rahimi et al. [6] in terms of speed and efficiency. Fastfood generates an approximate mapping for the RBF kernel using the Walsh-Hadamard transform. Tensor Sketch approximates polynomial kernels using tensor sketch convolution.

Classification, or learning in general, using random feature mapping approaches requires three main steps: transformation, training, and testing. In this section we compare the CRO kernel with Fastfood [7] and Tensor Sketch [8] in terms of transformation time.

Figure 3 compares the CRO kernel, Fastfood, and Tensor Sketch in terms of transformation time with increasing input space dimensionality d. The input size n is equal to 10,000 and the feature dimensionality D is held fixed. The results show that the Tensor Sketch transformation time increases linearly with d, whereas Fastfood and the CRO kernel show a logarithmic dependency on d. Transformation via the CRO kernel is faster than Fastfood for the given parameter values for all experimented values of d.

Fig. 3. Transformation time with increasing input space dimensionality d.

Figure 4 illustrates the linear dependency between transformation time and input size n for all three approaches. For this experiment we used face image data with dimensionality d = 1024, and the feature space dimensionality D was held fixed. The results show that the transformation time for Fastfood increases at a higher linear rate compared to Tensor Sketch and the CRO kernel, and that transformation via the CRO kernel is 15% faster than Tensor Sketch.

Fig. 4. Transformation time in seconds with increasing input size n.

Figure 5 presents the transformation time comparison between the three approaches with increasing values of feature space dimensionality D. All three approaches show a linear increase in transformation time when increasing the feature space dimensionality D; however, transformation using the CRO kernel has a smaller average rate of change.

Fig. 5. Transformation time as a function of feature space dimensionality D.

B. Evaluation of CROSVM

CROSVM is the combination of the CRO kernel and linear SVM. We evaluated CROSVM on three publicly available datasets listed in Table IV, all downloaded from the LIBSVM [3] website.
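One possible way to wire the CRO hash function of Table II into the LIBLINEAR Matlab interface is sketched below. This is not the paper's experimental code: Xtrain and ytrain are placeholders for a training set, U, τ and the solver and cost options are illustrative assumptions, and the LIBLINEAR Matlab interface (train, predict) is assumed to be on the path.

% Minimal sketch (not from the paper): CROSVM = CRO feature map + linear SVM.
U = 2^17; tau = 1000;                      % illustrative parameters
P = randperm(U);                           % one permutation, reused for all vectors
n = size(Xtrain, 1);                       % Xtrain: n x d input matrix (assumed)
rows = zeros(n * tau, 1); cols = zeros(n * tau, 1);
for i = 1:n
    h = CROHash(Xtrain(i, :), U, P, tau);  % tau hash indices in 1..U (Table II)
    rows((i-1)*tau + (1:tau)) = i;
    cols((i-1)*tau + (1:tau)) = h;
end
Ftrain = sparse(rows, cols, 1, n, U);      % n x U binary feature matrix, tau ones per row
model = train(double(ytrain), Ftrain, '-s 2 -c 1');   % LIBLINEAR Matlab interface
% Test vectors are mapped with the same P and tau and classified with predict().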

TABLE III
CLASSIFICATION ACCURACY AND PROCESSING TIME COMPARISON (SECONDS)
dataset | LIBLINEAR: training, testing, accuracy | LIBSVM RBF: training, testing, accuracy | CROSVM: training, testing, accuracy
w8a | | |
covtype | | |

TABLE IV
DATASET INFORMATION
dataset | training size | testing size | attributes
mnist | 60,000 | 10,000 | 780
w8a | 49,749 | 14,951 | 300
covtype | 500,000 | 81,012 | 54

The largest dataset is the Covertype dataset (covtype), which includes 581,012 instances and 54 attributes. We chose these datasets because the RBF kernel shows significant improvement in accuracy over the linear kernel. Thus, for example, we did not include the adult dataset, since for this dataset linear SVMs are as good as non-linear ones.

Table V presents the classification results on the MNIST dataset [1] for LIBLINEAR, LIBSVM, CROSVM, Fastfood, and Tensor Sketch. LIBLINEAR is used as the linear classifier for CROSVM, Fastfood, and Tensor Sketch.

TABLE V
CLASSIFICATION RESULTS ON MNIST DATASET
method | trans. time | training time | testing time | error rate
LIBLINEAR | | | |
CROSVM | | | |
LIBSVM RBF | | | |
Fastfood | | | |
Tensor Sketch | | | |

In comparison with Fastfood and Tensor Sketch, CROSVM delivers the fastest transformation time; however, the main advantage of CROSVM is related to the training time. The output of the CRO kernel is a sparse vector, whereas Fastfood and Tensor Sketch generate dense vectors. As a result, the training time for CROSVM is significantly less than for Fastfood and Tensor Sketch. CROSVM performs similarly to LIBSVM in terms of classification accuracy, but in less time: CROSVM achieves 98.54% classification accuracy whereas LIBSVM achieves 98.57%; however, CROSVM takes 1.6 seconds for SVM training compared to 465 seconds for LIBSVM. LIBLINEAR has the fastest training and testing time; however, its error rate is higher compared to the other approaches.

Table III presents the classification accuracy and the training and testing time for LIBSVM, LIBLINEAR and CROSVM on w8a and covtype. The time reported in Table III for CROSVM includes only the time required for training and predicting. Table VI presents the processing time breakdown for the entire CROSVM process, which includes mapping the input vectors into the sparse, high dimensional feature vectors and then performing linear SVM. The total time required for CROSVM is still much less than that of LIBSVM with the RBF kernel.

TABLE VI
PROCESSING TIME BREAKDOWN FOR CROSVM (SECONDS)
dataset | Training: hash, liblinear train, total | Testing: hash, liblinear predict, total
w8a | |
covtype | |

On the covtype dataset CROSVM achieves higher accuracy than LIBSVM, while requiring less time for training and testing. The accuracy of CROSVM is not very sensitive to the values of U and τ. Table VII shows the CROSVM cross-validation accuracy on the MNIST dataset for different values of log U and τ.

We used the grid function from the LIBSVM toolbox to find the best cost (c) and gamma (g) parameter values for the experiments with LIBSVM. For LIBLINEAR, we chose the value of the parameter s (solver type) which achieves the highest accuracy; s = 1 is L2-regularized L2-loss support vector classification (dual) and s = 2 is L2-regularized L2-loss support vector classification (primal).

TABLE VII
MNIST CROSS-VALIDATION ACCURACY WITH CROSVM (rows: τ, columns: log U)

For CROSVM, we performed a cross-validation search for each dataset to find the optimal values for the parameters U and τ. Table VIII presents the parameter values used in the experiments for each dataset.

TABLE VIII
EXPERIMENT PARAMETERS FOR EACH DATASET
dataset | LIBLINEAR: s | LIBSVM: c, g | CROSVM: log U, τ
mnist | | |
w8a | | |
covtype | | |

C. Space Complexity

The space complexity of our approach is O(nτ), whereas Fastfood and Tensor Sketch both have a space complexity of O(nD). For example, in the results reported in Table V, τ = 1000, which is much smaller than the dense feature dimensionality D used by Fastfood and Tensor Sketch.

VI. CONCLUSION

We introduced a highly efficient feature map that transforms vectors in the input space to sparse, high dimensional vectors in the feature space, where the inner product in the feature space is the CRO kernel. We showed that the CRO kernel approximates the RBF kernel for unit length input vectors. This allows us to use very efficient linear SVM algorithms that have been optimized for high dimensional sparse vectors. The results show that we can achieve the same accuracy as non-linear RBF SVMs with linear training time and constant classification time. This approach can enable many new time-sensitive applications where accuracy does not need to be sacrificed for training and classification speed.

REFERENCES

[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, Nov. 1998.
[2] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, June 2008.
[3] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, 2011.
[4] K. Eshghi and S. Rajaram, "Locality sensitive hash functions based on concomitant rank order statistics," in KDD, 2008.
[5] M. Kafai, K. Eshghi, and B. Bhanu, "Discrete cosine transform locality-sensitive hashes for face retrieval," IEEE Transactions on Multimedia, no. 4, June 2014.
[6] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Advances in Neural Information Processing Systems, 2007.
[7] Q. Le, T. Sarlós, and A. Smola, "Fastfood: Approximate kernel expansions in loglinear time," in International Conference on Machine Learning, 2013.
[8] N. Pham and R. Pagh, "Fast and scalable polynomial kernels via explicit feature maps," in 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2013.
[9] T. Joachims and C.-N. J. Yu, "Sparse kernel SVMs via cutting-plane training," Machine Learning, vol. 76, no. 2-3, Sep. 2009.
[10] N. Segata and E. Blanzieri, "Fast and scalable local kernel machines," Journal of Machine Learning Research, vol. 11, 2010.
[11] I. W. Tsang, J. T. Kwok, and P.-M. Cheung, "Core vector machines: Fast SVM training on very large data sets," Journal of Machine Learning Research, 2005.
[12] M. Nandan, P. P. Khargonekar, and S. S. Talathi, "Fast SVM training using approximate extreme points," Journal of Machine Learning Research, vol. 15, no. 1, 2014.
[13] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin, "Training and testing low-degree polynomial data mappings via linear SVM," Journal of Machine Learning Research, vol. 11, Aug. 2010.
[14] A. Vedaldi and A. Zisserman, "Efficient additive kernels via explicit feature maps," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, 2012.
[15] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, "Feature hashing for large scale multitask learning," in International Conference on Machine Learning (ICML), 2009.
[16] S. Litayem, A. Joly, and N. Boujemaa, "Hash-based support vector machines approximation for large scale prediction," in British Machine Vision Conference, 2012.
[17] Y.-C. Su, T.-H. Chiu, Y.-H. Kuo, C.-Y. Yeh, and W. Hsu, "Scalable mobile visual classification by kernel preserving projection over high-dimensional features," IEEE Transactions on Multimedia, vol. 16, no. 6, Oct. 2014.
[18] J. von Tangen Sivertsen, "Scalable learning through linearithmic time kernel approximation techniques," Master's thesis, IT University of Copenhagen, 2014.
[19] M. Raginsky and S. Lazebnik, "Locality-sensitive binary codes from shift-invariant kernels," in Advances in Neural Information Processing Systems, 2009.
[20] M. M. Siddiqui, "Distribution of quantiles in samples from a bivariate population," Journal of Research of the National Institute of Standards and Technology, 1960.
[21] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, Aug. 2004.
[22] O. A. Vasicek, "A series expansion for the bivariate normal integral," Journal of Computational Finance, 1998.
[23] M. Sibuya, "Bivariate extreme statistics, I," Annals of the Institute of Statistical Mathematics, vol. 11, no. 2, 1960.

VII. APPENDIX

A. Concomitant Rank

Theorem 1. Let M be a U × N matrix of iid N(0, 1) random variables. Let Q be an instance of M. Let A, B ∈ ℝ^N. Let 0 < λ < 1 be a constant. Let τ = ⌈λ(U − 1)⌉. Let γ = Φ⁻¹( τ / (U − 1) ). Then

    lim_{U→∞} E[ |H_{Q,τ}(A) ∩ H_{Q,τ}(B)| ] / U = Φ₂(γ, γ, cos(A, B))    (31)

Proof: First, we note that for any scalar s > 0 and input vector A, H_{Q,τ}(sA) = H_{Q,τ}(A). This is because multiplying the input vector by a positive scalar simply scales the elements of QA, and does not change the order of the elements.

We also note that for any scalars s₁ > 0, s₂ > 0, cos(A, B) = cos(s₁A, s₂B).

Let Â = A/‖A‖ and B̂ = B/‖B‖. We will prove the following:

    lim_{U→∞} E[ |H_{Q,τ}(Â) ∩ H_{Q,τ}(B̂)| ] / U = Φ₂(γ, γ, cos(A, B))    (32)

which, by the argument above, is sufficient to prove eq. (31).

Consider the vectors

    x = QÂ    (33)
    y = QB̂    (34)

Let

    ρ = cos(A, B)    (35)

It is possible to prove that (x₁, y₁), (x₂, y₂), ..., (x_U, y_U) are iid samples from a bivariate normal distribution, where for all 1 ≤ i ≤ U,

    (x_i, y_i) ∼ N( (0, 0), [[1, ρ], [ρ, 1]] )    (36)

Now consider the following question: for any 1 ≤ i ≤ U, what is the probability that i ∈ H_{Q,τ}(Â) ∩ H_{Q,τ}(B̂)? From Definition 3 and the definition of x and y it follows that

    i ∈ H_{Q,τ}(Â) ⟺ R(x_i, x) ≤ τ    (37)
    i ∈ H_{Q,τ}(B̂) ⟺ R(y_i, y) ≤ τ    (38)

where R is the rank function introduced in Definition 1. Thus

    i ∈ H_{Q,τ}(Â) ∩ H_{Q,τ}(B̂) ⟺ R(x_i, x) ≤ τ ∧ R(y_i, y) ≤ τ    (39)

By Theorem 2,

    lim_{U→∞} P[ R(x_i, x) ≤ τ ∧ R(y_i, y) ≤ τ ] = Φ₂(γ, γ, ρ)    (40)

Substituting in eq. (39),

    lim_{U→∞} P[ i ∈ H_{Q,τ}(Â) ∩ H_{Q,τ}(B̂) ] = Φ₂(γ, γ, ρ)    (41)

Since this holds for all 1 ≤ i ≤ U, from eq. (41) it follows that

    lim_{U→∞} E[ |H_{Q,τ}(Â) ∩ H_{Q,τ}(B̂)| ] = U Φ₂(γ, γ, ρ)    (42)
    lim_{U→∞} E[ |H_{Q,τ}(Â) ∩ H_{Q,τ}(B̂)| ] / U = Φ₂(γ, γ, ρ)    (43)

Theorem 2. Let U be a positive integer. Let 0 < λ < 1 be a constant. Let t = ⌈λ(U − 1)⌉. Let γ = Φ⁻¹( t / (U − 1) ). Let

    S = ( (x₁, y₁), (x₂, y₂), ..., (x_U, y_U) )

be a sample from Φ₂(x, y, ρ). Let x = (x₁, x₂, ..., x_U)ᵀ and y = (y₁, y₂, ..., y_U)ᵀ. Let 1 ≤ i ≤ U. Then

    lim_{U→∞} P[ R(x_i, x) ≤ t ∧ R(y_i, y) ≤ t ] = Φ₂(γ, γ, ρ)    (44)

Proof: First, we note that since x and y are samples from a normal distribution, the probability that they have duplicate elements is zero. So we will assume that they do not have duplicate elements.

Consider the sample S′, which is S with (x_i, y_i) taken out. Let

    x′ = (x₁, ..., x_{i−1}, x_{i+1}, ..., x_U)ᵀ
    y′ = (y₁, ..., y_{i−1}, y_{i+1}, ..., y_U)ᵀ

Since there are no duplicates in x′ or y′, there are unique elements x̂ in x′ and ŷ in y′ such that

    R(x̂, x′) = t
    R(ŷ, y′) = t

It is not too difficult to show that

    R(x_i, x) ≤ t ∧ R(y_i, y) ≤ t ⟺ x_i < x̂ ∧ y_i < ŷ    (45)

That is,

    P[ R(x_i, x) ≤ t ∧ R(y_i, y) ≤ t ] = P[ x_i < x̂ ∧ y_i < ŷ ]    (46)

Let

    q = Φ⁻¹(λ)    (47)

Then, when we apply Proposition 3 to S′, we get the following:

    (x̂, ŷ) ∼ N( (q, q), (1/(U−1)) [[s², r], [r, s²]] )    (48)

with s and r constants solely dependent on λ. Now, by assumption,

    (x_i, y_i) ∼ N( (0, 0), [[1, ρ], [ρ, 1]] )    (49)

We know that (x_i, y_i) and (x̂, ŷ) are instances of independent random variables with the distributions of eq. (48) and eq. (49). Using the standard procedure for the subtraction of two independent bivariate normal random variables, we can derive

    (x_i − x̂, y_i − ŷ) ∼ N( (−q, −q), [[1 + s²/(U−1), ρ + r/(U−1)], [ρ + r/(U−1), 1 + s²/(U−1)]] )    (50)

Now, by definition,

    t = ⌈λ(U − 1)⌉    (51)

Thus, for some 0 < δ < 1,

    t = λ(U − 1) + δ    (52)

so that

    t / (U − 1) = λ + δ / (U − 1)    (53)

It is also clear that

    lim_{U→∞} t / (U − 1) = λ    (54)
    lim_{U→∞} Φ⁻¹( t / (U − 1) ) = Φ⁻¹(λ)    (55)

i.e.

    lim_{U→∞} γ = q    (56)

Also

    lim_{U→∞} ( 1 + s²/(U − 1) ) = 1    (57)
    lim_{U→∞} ( ρ + r/(U − 1) ) = ρ    (58)

Thus, from eq. (50), eq. (56), eq. (57) and eq. (58) we can derive

    (x_i − x̂, y_i − ŷ) ∼ N( (−γ, −γ), [[1, ρ], [ρ, 1]] )    (59)

Thus by Proposition 2,

    lim_{U→∞} P[ x_i − x̂ < 0 ∧ y_i − ŷ < 0 ] = Φ₂(γ, γ, ρ)    (60)
    lim_{U→∞} P[ x_i < x̂ ∧ y_i < ŷ ] = Φ₂(γ, γ, ρ)    (61)

Substituting the left hand side of eq. (61) in the right hand side of eq. (46) we get

    lim_{U→∞} P[ R(x_i, x) ≤ t ∧ R(y_i, y) ≤ t ] = Φ₂(γ, γ, ρ)    (62)

Proposition 2. Let

    (x, y) ∼ N( (−q, −q), [[1, ρ], [ρ, 1]] )    (63)

Then P[ x < 0 ∧ y < 0 ] = Φ₂(q, q, ρ).

Proof: Let (u, v) = (x + q, y + q). Then from eq. (63) it easily follows that

    (u, v) ∼ N( (0, 0), [[1, ρ], [ρ, 1]] )    (64)

Thus, by eq. (64) and the definition of Φ₂,

    P[ u < q ∧ v < q ] = Φ₂(q, q, ρ)    (65)

By definition, u = x + q and v = y + q. Substituting in eq. (65) we get

    P[ x + q < q ∧ y + q < q ] = Φ₂(q, q, ρ)    (66)
    P[ x < 0 ∧ y < 0 ] = Φ₂(q, q, ρ)    (67)

Proposition 3. Let (x₁, y₁), (x₂, y₂), ..., (x_n, y_n) be a sample from Φ₂(x, y, ρ). Let 0 < λ < 1 be a constant. Let

    t = ⌈λn⌉
    q = Φ⁻¹(λ)
    s² = λ(1 − λ) / φ²(q)
    r = ( Φ₂(q, q, ρ) − λ² ) / φ²(q)

Let X_{1:n}, X_{2:n}, ..., X_{n:n} be the order statistics on x₁, x₂, ..., x_n and Y_{1:n}, Y_{2:n}, ..., Y_{n:n} the order statistics on y₁, y₂, ..., y_n. Then, as n → ∞,

    (X_{t:n}, Y_{t:n}) ∼ N( (Φ⁻¹(λ), Φ⁻¹(λ)), [[s²/n, r/n], [r/n, s²/n]] )    (68)

Proof: This follows from the theorem on page 148 in [20] with appropriate substitutions.

B. K_γ(A, B) is an Admissible Support Vector Kernel

Theorem 3. Let A, B ∈ ℝ^n. Then K_γ(A, B) satisfies the Mercer conditions, and is an admissible SV kernel.

Proof: According to the criteria in Theorem 8 in [21], to prove that K_γ(A, B) satisfies the Mercer conditions, it is sufficient to prove that there is a convergent series

    K_γ(A, B) = Σ_{n=0}^{∞} α_n (A · B)^n    (69)

where α_n ≥ 0 for all n. By definition,

    K_γ(A, B) = Φ₂(γ, γ, cos(A, B))    (70)

Using the tetrachoric series expansion for Φ₂ [22] we can show that

    Φ₂(γ, γ, ρ) = Φ²(γ) + Σ_{k=0}^{∞} [φ(γ) He_k(γ)]² ρ^{k+1} / (k+1)!    (71)
                = Φ²(γ) + Σ_{k=1}^{∞} [φ(γ) He_{k−1}(γ)]² ρ^k / k!    (72)

where He_k(x) is the k-th Hermite polynomial. Let

    ρ = cos(A, B)    (73)
    z = 1 / (‖A‖ ‖B‖)    (74)

so that

    ρ = z (A · B)    (75)

Substituting in eq. (70) and eq. (72) and simplifying we get

    K_γ(A, B) = Φ²(γ) + Σ_{k=1}^{∞} [φ(γ) He_{k−1}(γ)]² (z A · B)^k / k!    (76)
              = Φ²(γ) + Σ_{k=1}^{∞} ( [φ(γ) He_{k−1}(γ)]² z^k / k! ) (A · B)^k    (77)

Since Φ²(γ) ≥ 0 and, for all k,

    [φ(γ) He_{k−1}(γ)]² z^k / k! ≥ 0    (78)

eq. (77) is sufficient to prove the theorem.

C. Formula for the CRO Kernel

Proposition 4.

    K_γ(A, B) = Φ²(γ) + ∫₀^{cos(A,B)} exp( −γ² / (1 + ρ) ) / ( 2π √(1 − ρ²) ) dρ    (79)

Proof: Let

    C(x, ρ) = Φ₂(x, x, ρ)    (80)

By definition,

    K_γ(A, B) = C(γ, cos(A, B))    (81)

Now, as proved in [23],

    d/dρ Φ₂(x, y, ρ) = φ₂(x, y, ρ)    (82)

i.e.

    d/dρ Φ₂(x, y, ρ) = exp( −(x² − 2ρxy + y²) / (2(1 − ρ²)) ) / ( 2π √(1 − ρ²) )    (83)

Thus

    d/dρ C(x, ρ) = exp( −(2x² − 2ρx²) / (2(1 − ρ²)) ) / ( 2π √(1 − ρ²) )    (84)
                 = exp( −x²(1 − ρ) / (1 − ρ²) ) / ( 2π √(1 − ρ²) )    (85)

Let

    −1 < ρ < 1    (86)

Then eq. (85) can be simplified as

    d/dρ C(x, ρ) = exp( −x² / (1 + ρ) ) / ( 2π √(1 − ρ²) )    (87)

We know that C(x, 0) = Φ²(x). Thus

    C(x, r) = Φ²(x) + ∫₀^{r} exp( −x² / (1 + ρ) ) / ( 2π √(1 − ρ²) ) dρ    (88)

From eq. (81) and eq. (88) we get

    K_γ(A, B) = Φ²(γ) + ∫₀^{cos(A,B)} exp( −γ² / (1 + ρ) ) / ( 2π √(1 − ρ²) ) dρ    (89)

Notice that condition (86) is satisfied inside the integral in eq. (89).


Linear Dependency Between and the Input Noise in -Support Vector Regression 544 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 14, NO. 3, MAY 2003 Linear Dependency Between the Input Noise in -Support Vector Regression James T. Kwok Ivor W. Tsang Abstract In using the -support vector

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines www.cs.wisc.edu/~dpage 1 Goals for the lecture you should understand the following concepts the margin slack variables the linear support vector machine nonlinear SVMs the kernel

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

ML (cont.): SUPPORT VECTOR MACHINES

ML (cont.): SUPPORT VECTOR MACHINES ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version

More information

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017 The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the

More information

Lecture Notes on Support Vector Machine

Lecture Notes on Support Vector Machine Lecture Notes on Support Vector Machine Feng Li fli@sdu.edu.cn Shandong University, China 1 Hyperplane and Margin In a n-dimensional space, a hyper plane is defined by ω T x + b = 0 (1) where ω R n is

More information

An Ensemble Ranking Solution for the Yahoo! Learning to Rank Challenge

An Ensemble Ranking Solution for the Yahoo! Learning to Rank Challenge An Ensemble Ranking Solution for the Yahoo! Learning to Rank Challenge Ming-Feng Tsai Department of Computer Science University of Singapore 13 Computing Drive, Singapore Shang-Tse Chen Yao-Nan Chen Chun-Sung

More information

Stochastic Quasi-Newton Methods

Stochastic Quasi-Newton Methods Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent

More information

Lecture 10: A brief introduction to Support Vector Machine

Lecture 10: A brief introduction to Support Vector Machine Lecture 10: A brief introduction to Support Vector Machine Advanced Applied Multivariate Analysis STAT 2221, Fall 2013 Sungkyu Jung Department of Statistics, University of Pittsburgh Xingye Qiao Department

More information

Sparse Support Vector Machines by Kernel Discriminant Analysis

Sparse Support Vector Machines by Kernel Discriminant Analysis Sparse Support Vector Machines by Kernel Discriminant Analysis Kazuki Iwamura and Shigeo Abe Kobe University - Graduate School of Engineering Kobe, Japan Abstract. We discuss sparse support vector machines

More information

Kernel Methods in Machine Learning

Kernel Methods in Machine Learning Kernel Methods in Machine Learning Autumn 2015 Lecture 1: Introduction Juho Rousu ICS-E4030 Kernel Methods in Machine Learning 9. September, 2015 uho Rousu (ICS-E4030 Kernel Methods in Machine Learning)

More information

Regularized Least Squares

Regularized Least Squares Regularized Least Squares Ryan M. Rifkin Google, Inc. 2008 Basics: Data Data points S = {(X 1, Y 1 ),...,(X n, Y n )}. We let X simultaneously refer to the set {X 1,...,X n } and to the n by d matrix whose

More information

Sparse Random Features Algorithm as Coordinate Descent in Hilbert Space

Sparse Random Features Algorithm as Coordinate Descent in Hilbert Space Sparse Random Features Algorithm as Coordinate Descent in Hilbert Space Ian E.H. Yen 1 Ting-Wei Lin Shou-De Lin Pradeep Ravikumar 1 Inderjit S. Dhillon 1 Department of Computer Science 1: University of

More information

Neural networks and support vector machines

Neural networks and support vector machines Neural netorks and support vector machines Perceptron Input x 1 Weights 1 x 2 x 3... x D 2 3 D Output: sgn( x + b) Can incorporate bias as component of the eight vector by alays including a feature ith

More information

Towards Deep Kernel Machines

Towards Deep Kernel Machines Towards Deep Kernel Machines Julien Mairal Inria, Grenoble Prague, April, 2017 Julien Mairal Towards deep kernel machines 1/51 Part I: Scientific Context Julien Mairal Towards deep kernel machines 2/51

More information

SVMs, Duality and the Kernel Trick

SVMs, Duality and the Kernel Trick SVMs, Duality and the Kernel Trick Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 26 th, 2007 2005-2007 Carlos Guestrin 1 SVMs reminder 2005-2007 Carlos Guestrin 2 Today

More information

12. Cholesky factorization

12. Cholesky factorization L. Vandenberghe ECE133A (Winter 2018) 12. Cholesky factorization positive definite matrices examples Cholesky factorization complex positive definite matrices kernel methods 12-1 Definitions a symmetric

More information

Multiple Similarities Based Kernel Subspace Learning for Image Classification

Multiple Similarities Based Kernel Subspace Learning for Image Classification Multiple Similarities Based Kernel Subspace Learning for Image Classification Wang Yan, Qingshan Liu, Hanqing Lu, and Songde Ma National Laboratory of Pattern Recognition, Institute of Automation, Chinese

More information

MULTIPLEKERNELLEARNING CSE902

MULTIPLEKERNELLEARNING CSE902 MULTIPLEKERNELLEARNING CSE902 Multiple Kernel Learning -keywords Heterogeneous information fusion Feature selection Max-margin classification Multiple kernel learning MKL Convex optimization Kernel classification

More information

Constrained Optimization and Support Vector Machines

Constrained Optimization and Support Vector Machines Constrained Optimization and Support Vector Machines Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/

More information