The CRO Kernel: Using Concomitant Rank Order Hashes for Sparse High Dimensional Randomized Feature Maps


The CRO Kernel: Using Concomitant Rank Order Hashes for Sparse High Dimensional Randomized Feature Maps

Kave Eshghi and Mehran Kafai
Hewlett Packard Labs, 1501 Page Mill Rd., Palo Alto, CA

Abstract. Kernel methods have been shown to be effective for many machine learning tasks such as classification, clustering and regression. In particular, support vector machines with the RBF kernel have proved to be powerful classification tools. The standard way to apply kernel methods is to use the kernel trick, where the inner product of the vectors in the feature space is computed via the kernel function. Using the kernel trick for SVMs, however, leads to training that is quadratic in the number of input vectors and classification that is linear in the number of support vectors. We introduce a new kernel, the CRO (Concomitant Rank Order) kernel, that approximates the RBF kernel for unit length input vectors. We also introduce a new randomized feature map, based on concomitant rank order hashing, that produces sparse, high dimensional feature vectors whose inner product asymptotically equals the CRO kernel. Using the Hadamard transform for computing the CRO hashes ensures that the cost of computing feature vectors is very low. Thus, for unit length input vectors, we get the accuracy of the RBF kernel with the efficiency of a sparse high dimensional linear kernel. We show the efficacy of our approach using a number of standard datasets.

I. INTRODUCTION

Kernel methods have been shown to be effective for many machine learning tasks such as classification, clustering and regression. The theory behind kernel methods relies on a mapping between the input space and the feature space such that the inner product of the vectors in the feature space can be computed via the kernel function, aka the kernel trick. The kernel trick is used because a direct mapping to the feature space is expensive or, in the case of the RBF kernel, impossible, since the feature space is infinite dimensional.

The canonical example is Support Vector Machine (SVM) classification with the Gaussian kernel. It has been shown that for many types of data, its classification accuracy far surpasses that of linear SVMs. For example, on the MNIST [1] handwritten digit recognition dataset, SVM with the RBF kernel achieves an accuracy of 98.6%, whereas linear SVM only achieves 92.7%.

For SVMs, the main drawback of the kernel trick is that both training and classification can be expensive. Training is expensive because the kernel function must be applied to each pair of training samples, making the training task at least quadratic in the number of training samples. Classification is expensive because for each classification task the kernel function must be applied to each of the support vectors, whose number may be large. As a result, kernel SVMs are rarely used when the number of training instances is large or for online applications where classification must happen very fast. Many approaches have been proposed in the literature to overcome these efficiency problems with non-linear kernel SVMs.

The situation is very different with sparse, high dimensional input vectors and linear kernels. For this class of problems, there are efficient algorithms for both training and classification. One implementation is LIBLINEAR [2], from the same group that implemented LIBSVM [3]. When the data fits this model, e.g. for text classification, these algorithms are very effective. But when the data does not fit this model, e.g. for image classification, these algorithms are not particularly efficient and the classification accuracy is low.

We introduce a new kernel K_γ(A, B) defined as

    K_γ(A, B) = Φ₂(γ, γ, cos(A, B))    (1)

where γ is a constant, and Φ₂(x, y, ρ) is the CDF of the standard bivariate normal distribution with correlation ρ. We also introduce the randomized feature map F_{γ,Q}(A), where Q ∈ ℝ^{U×N} is an instance of an iid matrix of standard normal random variables. We prove that

    E[ F_{γ,Q}(A) · F_{γ,Q}(B) ] / U = K_γ(A, B)    (2)

To perform the mapping from the input space to the feature space, we use a variant of the concomitant rank order hash function [4], [5]. Relying on a result first presented in [5], we use a random permutation followed by a Hadamard transform to compute the random projection that is at the heart of this operation. The resulting algorithm for computing the feature map is highly efficient.

The proposed kernel K and feature map F have interesting properties:
- The kernel approximates the RBF kernel on the unit sphere.
- The feature map is sparse and high dimensional.
- The feature map can be computed very efficiently.

Thus, for the class of problems where this kernel is effective, we have the best of both worlds: the accuracy of the RBF kernel, and the efficiency of sparse, high dimensional linear models.

Along the way, we prove a new result in the theory of concomitant rank order statistics for bivariate normal distributions, given in Theorem 2 in the Appendix.

We show the efficacy of our approach, in terms of classification accuracy, training time, and classification time, on a number of standard datasets. We also make a detailed comparison with alternative approaches for randomized mapping to the feature space presented in [6], [7], [8].

II. RELATED WORK

Reducing the training and classification cost of non-linear SVMs has attracted a great deal of attention in the literature. Joachims et al. [9] use basis vectors other than support vectors to find sparse solutions that speed up training and prediction. Segata et al. [10] use local SVMs on redundant neighborhoods and choose the appropriate model at query time. In this way, they divide the large SVM problem into many small local SVM problems. Tsang et al. [11] re-formulate kernel methods as minimum enclosing ball (MEB) problems in computational geometry, and solve them via an efficient approximate MEB algorithm, leading to the idea of core sets. Nandan et al. [12] choose a subset of the training data, called the representative set, to reduce the training time. This subset is chosen using an algorithm based on convex hulls and extreme points.

A number of approaches compute approximations to the feature vectors and use linear SVM on these vectors. Chang et al. [13] do an explicit mapping of the input vectors into a low degree polynomial feature space, and then apply fast linear SVMs for classification. Vedaldi et al. [14] introduce explicit feature maps for the additive kernels, such as the intersection, Hellinger's, and χ² kernels. Weinberger et al. [15] use hashing to reduce the dimensionality of the input vectors. Litayem et al. [16] use hashing to reduce the size of the input vectors and speed up the prediction phase of linear SVM. Su et al. [17] use sparse projection to reduce the dimensionality of the input vectors while preserving the kernel function.

Rahimi et al. [6] map the input data to a randomized low-dimensional feature space using sinusoids randomly drawn from the Fourier transform of the kernel function. Le et al. [7] replace the random matrix proposed in [6] with an approximation that allows for fast multiplication. Pham et al. [8] introduce Tensor Sketching, a method for approximating polynomial kernels which relies on fast convolution of Count Sketches. Both [7] and [8] improve upon Rahimi's work [6] in terms of time and storage complexity [18]. Raginsky et al. [19] compute locality sensitive hashes where the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel. They use the results in [6] for this purpose.

III. THE CONCOMITANT RANK ORDER (CRO) KERNEL AND FEATURE MAP

A. Notation

1) Φ(x): We use Φ(x) to denote the CDF of the standard normal distribution N(0, 1), and φ(x) to denote its PDF:

    φ(x) = (1 / √(2π)) e^{−x²/2}    (3)
    Φ(x) = ∫_{−∞}^{x} φ(u) du    (4)

2) Φ₂(x, y, ρ): We use Φ₂(x, y, ρ) to denote the CDF of the standard bivariate normal distribution N( (0, 0), [[1, ρ], [ρ, 1]] ), and φ₂(x, y, ρ) to denote its PDF:

    φ₂(x, y, ρ) = (1 / (2π √(1 − ρ²))) exp( −(x² − 2ρxy + y²) / (2(1 − ρ²)) )    (5)
    Φ₂(x, y, ρ) = ∫_{−∞}^{x} ∫_{−∞}^{y} φ₂(u, v, ρ) du dv    (6)

3) R(x, A): We use R(x, A) to denote the rank of x in A, defined as follows:

Definition 1. For the scalar x and vector A ∈ ℝ^N, R(x, A) is the count of the elements of A which are less than or equal to x.

B. The CRO Kernel

Definition 2. Let A, B ∈ ℝ^N. Let γ ∈ ℝ. Then the kernel K_γ(A, B) is defined as:

    K_γ(A, B) = Φ₂(γ, γ, cos(A, B))    (7)

In order for K_γ(A, B) to be admissible as a kernel for support vector machines, it needs to satisfy the Mercer conditions. Theorem 3 in the Appendix proves this. In Proposition 4 in the Appendix we derive the following equation for K_γ(A, B):

    K_γ(A, B) = Φ²(γ) + ∫₀^{cos(A,B)} exp( −γ² / (1 + ρ) ) / ( 2π √(1 − ρ²) ) dρ    (8)
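As a concrete illustration of eq. (8), the following sketch evaluates K_γ(A, B) both directly from Definition 2 and via the integral formula. It is not code from the paper: the value of γ and the vectors A and B are arbitrary illustrative choices, and mvncdf and normcdf assume the Statistics and Machine Learning Toolbox is available.

% Minimal sketch (not from the paper): check eq. (8) against eq. (7).
gamma = -2.4617;                        % illustrative value of gamma
A = [1 2 3]'; B = [2 1 4]';             % arbitrary example vectors
c = dot(A, B) / (norm(A) * norm(B));    % cos(A, B)
% Eq. (7): K = Phi_2(gamma, gamma, cos(A,B))
K_def = mvncdf([gamma gamma], [0 0], [1 c; c 1]);
% Eq. (8): K = Phi(gamma)^2 + integral of the simplified density term
f = @(rho) exp(-gamma^2 ./ (1 + rho)) ./ (2 * pi * sqrt(1 - rho.^2));
K_int = normcdf(gamma)^2 + integral(f, 0, c);
fprintf('K via eq.(7): %.6f   K via eq.(8): %.6f\n', K_def, K_int);

The two values agree up to numerical integration error, which is a useful sanity check when implementing the kernel.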

3 Thus logφ γ, γ, 0 = logα 1 logφ γ, γ, 1 = logα 13 If we linearly approximate logφ γ, γ, ρ between ρ = 0 and ρ = 1, we get the following: logφ γ, γ, ρ logα ρ 14 Figure 1 shows the two sides of eq. 14 as ρ goes from 0 to 1. The choice of γ =.4617 is for a typical use case of the kernel. α is related to γ through eq log[φ γ, γ, ρ ] logα ρ ρ Fig. 1. Comparison of the two sides of eq. 14 for γ =.4617 From eq. 14 we get Φ γ, γ, ρ e logα ρ 15 αe logα1 ρ 16 Let ρ = cosa, B. Then from Definition and eq. 16 we get K γ A, B αe logα1 cosa,b 17 Let A, B R N be on the unit sphere, i.e. Then A = B = 1 18 A B = A + B A B cosa, B 19 = 1 cosa, B 0 Replacing the rhs of eq. 0 in eq. 17 we get K γ A, B αe logα A B 1 But the rhs of eq. 1 is the definition of an RBF kernel with parameter logα. Notice that the purpose of the comparison with the RBF kernel is to give us an intuition about K γ A, B, to enable us to make meaningful comparisons with implementations that use RBF. To this end, how close is the approximation to the RBF kernel? The following diagram shows the two sides of eq. 1 as cosa, B goes from 0 to 1. Notice that when cosa, B Fig.. Comparison of the two sides of eq. 1 for γ =.4617 is negative then the values of both kernels are very close to zero, so we ignore such cases. We will now describe the feature map that allows us to use the kernel K γ A, B for tasks such SVM training. IV. THE CONCOMITANT RANK ORDER CRO HASH FUNCTION AND FEATURE MAP A. The CRO hash function Definition 3. Let U, τ be positive integers, with τ < U. Let A R N and Q R U N. Let P = QA. Then H Q,τ A = {j : RP [j], P τ} where R is the rank function defined in Definition 1. We call H Q,τ A the CRO hash set of A with respect to Q and τ. In words, the hash set of A with respect to Q and τ is the set of indexes of those elements of P whose rank is less than or equal to τ. If P does not have any repetitions, the following Matlab code returns the hash set: [, ix] = sortp; H = ix[1:tau]; According to the above definition, the universe from which the hashes are drawn is 1... U, where U is the number of rows of Q. Theorem 1. Let M be a U N matrix of iid N 0, 1 random variables. Let Q be an instance of M. Let A, B R N. Let 0 < λ < 1 be a constant. Let τ = λu 1. Let γ = Φ 1 τ U 1 3 Then lim E HQ,τ A H Q,τ B = Φ γ, γ, cosa, B U 4 Proof: In the Appendix. B. The CRO Feature Map In this section we introduce the feature map F γ,q A : R N R U 5

B. The CRO Feature Map

In this section we introduce the feature map

    F_{γ,Q}(A) : ℝ^N → ℝ^U    (25)

This is the function that maps vectors from the input space to sparse, high dimensional vectors in the feature space.

Definition 4. Let A, B, Q, γ, τ, U be defined as in Theorem 1. The feature map F_{γ,Q}(A) : ℝ^N → ℝ^U is defined as follows:

    F_{γ,Q}(A)[j] = 1 if j ∈ H_{Q,τ}(A), and 0 if j ∉ H_{Q,τ}(A)    (26)

Proposition 1 establishes the relationship between the feature map F_{γ,Q}(A) and the kernel K_γ(A, B).

Proposition 1.

    E[ F_{γ,Q}(A) · F_{γ,Q}(B) ] / U = K_γ(A, B)    (27)

Proof: It is a simple matter to show that

    F_{γ,Q}(A) · F_{γ,Q}(B) = |H_{Q,τ}(A) ∩ H_{Q,τ}(B)|    (28)

By Theorem 1,

    E[ F_{γ,Q}(A) · F_{γ,Q}(B) ] / U = Φ₂(γ, γ, cos(A, B))    (29)

Thus, from Definition 2 and eq. (29),

    E[ F_{γ,Q}(A) · F_{γ,Q}(B) ] / U = K_γ(A, B)    (30)

C. The Feature Map is Binary and Sparse

From Definition 3 and Definition 4 it follows that:
- The total number of elements in F_{γ,Q}(A) is U.
- The number of non-zero elements in F_{γ,Q}(A) is τ.
- All the non-zero elements of F_{γ,Q}(A) are 1.

Recall that α = τ/U. We call α the sparsity of the feature map. As discussed previously, K_γ(A, B) approximates the RBF kernel α e^{(log α / 2)‖A − B‖²} on the unit sphere, where α is the sparsity of the feature map. It follows that the relationship between the RBF parameter and the sparsity of the feature map is exponential.

We can interpret eq. (30) as follows: F_{γ,Q}(A) and F_{γ,Q}(B) are the projections of A and B into the feature space, and their expected inner product is U K_γ(A, B). What is more, F_{γ,Q}(A) and F_{γ,Q}(B) are sparse and high dimensional. Thus, as long as we can efficiently compute F_{γ,Q}(A) and F_{γ,Q}(B), we can use algorithms optimized for sparse linear computations on binary vectors in the feature space. For example, when the task we are trying to perform is to train a support vector machine, instead of using the kernel trick in the input space, we can use linear SVM with sparse high dimensional vectors in the feature space, using highly efficient sparse linear SVM solvers such as LIBLINEAR [2] for training and classification. A small sketch of this construction follows.
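The following Matlab fragment (not from the paper) makes the binary, sparse structure of the feature map concrete: it builds F_{γ,Q}(A) and F_{γ,Q}(B) as sparse vectors from two hash sets and checks eq. (28). The universe size, τ, and the stand-in hash sets are illustrative assumptions.

% Minimal sketch (not from the paper): sparse binary feature vectors.
U = 2^16; tau = 500;                 % illustrative universe size and sparsity
HA = randperm(U, tau)';              % stand-in hash set for A (tau indices in 1..U)
HB = randperm(U, tau)';              % stand-in hash set for B
FA = sparse(HA, 1, 1, U, 1);         % F_{gamma,Q}(A): U x 1, with tau ones
FB = sparse(HB, 1, 1, U, 1);         % F_{gamma,Q}(B)
overlap = full(FA' * FB);            % equals numel(intersect(HA, HB)), as in eq. (28)
sparsity = nnz(FA) / U;              % equals tau / U, i.e. alpha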
D. Computing the CRO Hash Set

The computation of the projection vector P = QA requires U × N operations, which can be very expensive when N is large. To avoid this, we use the scheme described in [5] to compute the CRO hash set for the input vectors.

The CRO hash function maps input vectors to a set of hashes chosen from a universe 1 ... U, where U is a large integer. We use τ to denote the number of hashes that we require per input vector. Let A ∈ ℝ^N be the input vector. The hash function takes as a second input a random permutation Π of 1 ... U. It should be emphasized that the random permutation Π is chosen once and used for hashing all input vectors. Table I shows the procedure for computing the CRO hash set. Here we use −A to represent the vector A multiplied by −1, and (A, B, C, ...) to represent the concatenation of vectors A, B, C, etc.

TABLE I
COMPUTING THE CRO HASH SET FOR INPUT VECTOR A

1) Let Â = (A, −A).
2) Create a repeated input vector A′ as follows: A′ = (Â, Â, ..., Â, Â[1..r]), where Â is repeated d times, d = U div |Â|, r = U mod |Â|, and div represents integer division. Thus |A′| = d|Â| + r = U.
3) Apply the random permutation Π to A′ to get the permuted input vector V.
4) Calculate the Hadamard transform of V to get S. If an efficient implementation of the Hadamard transform is not available, we can use another orthogonal transform, for example the DCT.
5) Find the indices of the smallest τ members of S. These indices are the hash set of the input vector A.

Table II presents an implementation of the CRO hash function in Matlab.

V. EXPERIMENTS

We present the experiment results in two sections. In Section V-A we compare the proposed CRO feature map with other approaches that perform random feature mappings with respect to the transformation time. In Section V-B we evaluate CROSVM (CRO kernel + linear SVM) and compare it to LIBLINEAR [2], LIBSVM [3], Fastfood [7] and Tensor Sketch [8].

We have implemented a highly optimized Walsh-Hadamard transform function which we call within the CRO hash function. The built-in Matlab implementation of the Walsh-Hadamard transform, fwht, is slow, so we recommend using the DCT function or a faster implementation of the WHT.
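Since the relative speed of the two transforms depends on the Matlab installation, a quick timing comparison along the following lines can guide the choice between dct and fwht. This is not from the paper; both functions are assumed to be available in the Signal Processing Toolbox, and the vector length is an illustrative power of two.

% Minimal sketch (not from the paper): compare dct and fwht on one machine.
V = randn(2^16, 1);
t_dct  = timeit(@() dct(V));
t_fwht = timeit(@() fwht(V));
fprintf('dct: %.4f s   fwht: %.4f s\n', t_dct, t_fwht);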

TABLE II
MATLAB CODE FOR THE CRO HASH FUNCTION

function hashes = CROHash(A, U, P, tau)
% A   is the input vector.
% U   is the size of the hash universe.
% P   is a random permutation of 1:U,
%     chosen once and for all and used
%     in all hash calculations.
% tau is the desired number of hashes.
E = zeros(1, U);
AHat = [A, -A];
N = length(AHat);
d = floor(U / N);
for i = 0:d-1
    E(i*N+1 : (i+1)*N) = AHat;
end
Q = E(P);
% If an efficient implementation of
% the Walsh-Hadamard transform is
% available, we can use it instead, i.e.
% S = fwht(Q);
S = dct(Q);
[~, ix] = sort(S);
hashes = ix(1:tau);

A. Transformation time comparison

The random feature mapping introduced by Rahimi et al. [6] works as follows. A projection matrix P is created where each row of P is drawn randomly from the Fourier transform of the kernel function. To calculate the feature vector for the input vector A, A_p = PA is computed. The feature vector Φ(A) is a vector whose length is twice the length of A_p, where for each coordinate i of A_p, Φ(A)[2i] = cos(A_p[i]) and Φ(A)[2i + 1] = sin(A_p[i]). The authors show that the expected inner product of the feature vectors is the kernel function applied to the input vectors.

Our approach has the following two advantages over the work by Rahimi et al. [6]:
1) We can use the Hadamard transform to compute the feature vectors, whereas they need to do the projection through multiplying the input vector with the projection matrix. This can be more expensive if the input vector is relatively high dimensional.
2) We generate a sparse feature vector (only τ entries out of U in the feature vector are non-zero, where τ << U), whereas their feature vector is dense. Having sparse feature vectors can be advantageous in many circumstances; for example, many training algorithms converge much more quickly with sparse feature vectors.

More recently, Fastfood [7] and Tensor Sketch [8] proposed approaches for random feature mappings which improved upon the work by Rahimi et al. [6] in terms of speed and efficiency. Fastfood generates an approximate mapping for the RBF kernel using the Walsh-Hadamard transform. Tensor Sketch approximates polynomial kernels using tensor sketch convolution.

Classification, or learning in general, using random feature mapping approaches requires three main steps: transformation, training, and testing. In this section we compare the CRO kernel with Fastfood [7] and Tensor Sketch [8] in terms of transformation time.

Figure 3 compares the CRO kernel, Fastfood, and Tensor Sketch in terms of transformation time with increasing input space dimensionality d. The input size n is equal to 10,000 and the feature dimensionality D is held fixed. The results show that the Tensor Sketch transformation time increases linearly with d, whereas Fastfood and the CRO kernel show a logarithmic dependency on d. Transformation via the CRO kernel is faster than Fastfood for the given parameter values for all experimented values of d.

Fig. 3. Transformation time with increasing input space dimensionality d.

Figure 4 illustrates the linear dependency between transformation time and input size n for all three approaches. For this experiment we used face image data with dimensionality d = 1024, and the feature space dimensionality D was held fixed. The results show that the transformation time for Fastfood increases at a higher linear rate compared to Tensor Sketch and the CRO kernel, and that transformation via the CRO kernel is 15% faster than Tensor Sketch.

Fig. 4. Transformation time in seconds with increasing input size n.

Figure 5 presents the transformation time comparison between the three approaches with increasing values of feature space dimensionality D. All three approaches show a linear increase in transformation time when increasing the feature space dimensionality D; however, transformation using the CRO kernel has a smaller average rate of change.

Fig. 5. Transformation time as a function of feature space dimensionality D.

B. Evaluation of CROSVM

CROSVM is the combination of the CRO kernel and linear SVM. We evaluated CROSVM on three publicly available datasets listed in Table IV, all downloaded from the LIBSVM [3] website.
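One possible way to wire the CRO hash function of Table II into the LIBLINEAR Matlab interface is sketched below. This is not the paper's experimental code: Xtrain and ytrain are placeholders for a training set, U, τ and the solver and cost options are illustrative assumptions, and the LIBLINEAR Matlab interface (train, predict) is assumed to be on the path.

% Minimal sketch (not from the paper): CROSVM = CRO feature map + linear SVM.
U = 2^17; tau = 1000;                      % illustrative parameters
P = randperm(U);                           % one permutation, reused for all vectors
n = size(Xtrain, 1);                       % Xtrain: n x d input matrix (assumed)
rows = zeros(n * tau, 1); cols = zeros(n * tau, 1);
for i = 1:n
    h = CROHash(Xtrain(i, :), U, P, tau);  % tau hash indices in 1..U (Table II)
    rows((i-1)*tau + (1:tau)) = i;
    cols((i-1)*tau + (1:tau)) = h;
end
Ftrain = sparse(rows, cols, 1, n, U);      % n x U binary feature matrix, tau ones per row
model = train(double(ytrain), Ftrain, '-s 2 -c 1');   % LIBLINEAR Matlab interface
% Test vectors are mapped with the same P and tau and classified with predict().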

TABLE III
CLASSIFICATION ACCURACY AND PROCESSING TIME COMPARISON (SECONDS)
dataset | LIBLINEAR: training, testing, accuracy | LIBSVM RBF: training, testing, accuracy | CROSVM: training, testing, accuracy
w8a | | |
covtype | | |

TABLE IV
DATASET INFORMATION
dataset | training size | testing size | attributes
mnist | 60,000 | 10,000 | 780
w8a | 49,749 | 14,951 | 300
covtype | 500,000 | 81,012 | 54

The largest dataset is the Covertype dataset (covtype), which includes 581,012 instances and 54 attributes. We chose these datasets because the RBF kernel shows significant improvement in accuracy over the linear kernel. Thus, for example, we did not include the adult dataset, since for this dataset linear SVMs are as good as non-linear ones.

Table V presents the classification results on the MNIST dataset [1] for LIBLINEAR, LIBSVM, CROSVM, Fastfood, and Tensor Sketch. LIBLINEAR is used as the linear classifier for CROSVM, Fastfood, and Tensor Sketch.

TABLE V
CLASSIFICATION RESULTS ON MNIST DATASET
method | trans. time | training time | testing time | error rate
LIBLINEAR | | | |
CROSVM | | | |
LIBSVM RBF | | | |
Fastfood | | | |
Tensor Sketch | | | |

In comparison with Fastfood and Tensor Sketch, CROSVM delivers the fastest transformation time; however, the main advantage of CROSVM is related to the training time. The output of the CRO kernel is a sparse vector, whereas Fastfood and Tensor Sketch generate dense vectors. As a result, the training time for CROSVM is significantly less than for Fastfood and Tensor Sketch. CROSVM performs similarly to LIBSVM in terms of classification accuracy, but in less time: CROSVM achieves 98.54% classification accuracy whereas LIBSVM achieves 98.57%; however, CROSVM takes 1.6 seconds for SVM training compared to 465 seconds for LIBSVM. LIBLINEAR has the fastest training and testing time; however, its error rate is higher compared to the other approaches.

Table III presents the classification accuracy and the training and testing time for LIBSVM, LIBLINEAR and CROSVM on w8a and covtype. The time reported in Table III for CROSVM includes only the time required for training and predicting. Table VI presents the processing time breakdown for the entire CROSVM process, which includes mapping the input vectors into the sparse, high dimensional feature vectors and then performing linear SVM. The total time required for CROSVM is still much less than that of LIBSVM with the RBF kernel.

TABLE VI
PROCESSING TIME BREAKDOWN FOR CROSVM (SECONDS)
dataset | Training: hash, liblinear train, total | Testing: hash, liblinear predict, total
w8a | |
covtype | |

On the covtype dataset CROSVM achieves higher accuracy than LIBSVM, while requiring less time for training and testing. The accuracy of CROSVM is not very sensitive to the values of U and τ. Table VII shows the CROSVM cross-validation accuracy on the MNIST dataset for different values of log U and τ.

We used the grid function from the LIBSVM toolbox to find the best cost (c) and gamma (g) parameter values for the experiments with LIBSVM. For LIBLINEAR, we chose the value of the parameter s (solver type) which achieves the highest accuracy; s = 1 is L2-regularized L2-loss support vector classification (dual) and s = 2 is L2-regularized L2-loss support vector classification (primal).

TABLE VII
MNIST CROSS-VALIDATION ACCURACY WITH CROSVM (rows: τ, columns: log U)

For CROSVM, we performed a cross-validation search for each dataset to find the optimal values for the parameters U and τ. Table VIII presents the parameter values used in the experiments for each dataset.

TABLE VIII
EXPERIMENT PARAMETERS FOR EACH DATASET
dataset | LIBLINEAR: s | LIBSVM: c, g | CROSVM: log U, τ
mnist | | |
w8a | | |
covtype | | |

C. Space Complexity

The space complexity of our approach is O(nτ), whereas Fastfood and Tensor Sketch both have a space complexity of O(nD). For example, in the results reported in Table V, τ = 1000, which is much smaller than the dense feature dimensionality D used by Fastfood and Tensor Sketch.

VI. CONCLUSION

We introduced a highly efficient feature map that transforms vectors in the input space to sparse, high dimensional vectors in the feature space, where the inner product in the feature space is the CRO kernel. We showed that the CRO kernel approximates the RBF kernel for unit length input vectors. This allows us to use very efficient linear SVM algorithms that have been optimized for high dimensional sparse vectors. The results show that we can achieve the same accuracy as non-linear RBF SVMs with linear training time and constant classification time. This approach can enable many new time-sensitive applications where accuracy does not need to be sacrificed for training and classification speed.

REFERENCES

[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, Nov. 1998.
[2] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, June 2008.
[3] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, 2011.
[4] K. Eshghi and S. Rajaram, "Locality sensitive hash functions based on concomitant rank order statistics," in KDD, 2008.
[5] M. Kafai, K. Eshghi, and B. Bhanu, "Discrete cosine transform locality-sensitive hashes for face retrieval," IEEE Transactions on Multimedia, no. 4, June 2014.
[6] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Advances in Neural Information Processing Systems, 2007.
[7] Q. Le, T. Sarlós, and A. Smola, "Fastfood: Approximate kernel expansions in loglinear time," in International Conference on Machine Learning, 2013.
[8] N. Pham and R. Pagh, "Fast and scalable polynomial kernels via explicit feature maps," in 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2013.
[9] T. Joachims and C.-N. J. Yu, "Sparse kernel SVMs via cutting-plane training," Machine Learning, vol. 76, no. 2-3, Sep. 2009.
[10] N. Segata and E. Blanzieri, "Fast and scalable local kernel machines," Journal of Machine Learning Research, vol. 11, 2010.
[11] I. W. Tsang, J. T. Kwok, and P.-M. Cheung, "Core vector machines: Fast SVM training on very large data sets," Journal of Machine Learning Research, 2005.
[12] M. Nandan, P. P. Khargonekar, and S. S. Talathi, "Fast SVM training using approximate extreme points," Journal of Machine Learning Research, vol. 15, no. 1, 2014.
[13] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin, "Training and testing low-degree polynomial data mappings via linear SVM," Journal of Machine Learning Research, vol. 11, Aug. 2010.
[14] A. Vedaldi and A. Zisserman, "Efficient additive kernels via explicit feature maps," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, 2012.
[15] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, "Feature hashing for large scale multitask learning," in International Conference on Machine Learning (ICML), 2009.
[16] S. Litayem, A. Joly, and N. Boujemaa, "Hash-based support vector machines approximation for large scale prediction," in British Machine Vision Conference, 2012.
[17] Y.-C. Su, T.-H. Chiu, Y.-H. Kuo, C.-Y. Yeh, and W. Hsu, "Scalable mobile visual classification by kernel preserving projection over high-dimensional features," IEEE Transactions on Multimedia, vol. 16, no. 6, Oct. 2014.
[18] J. von Tangen Sivertsen, "Scalable learning through linearithmic time kernel approximation techniques," Master's thesis, IT University of Copenhagen, 2014.
[19] M. Raginsky and S. Lazebnik, "Locality-sensitive binary codes from shift-invariant kernels," in Advances in Neural Information Processing Systems, 2009.
[20] M. M. Siddiqui, "Distribution of quantiles in samples from a bivariate population," Journal of Research of the National Institute of Standards and Technology, 1960.
[21] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, Aug. 2004.
[22] O. A. Vasicek, "A series expansion for the bivariate normal integral," Journal of Computational Finance, 1998.
[23] M. Sibuya, "Bivariate extreme statistics, I," Annals of the Institute of Statistical Mathematics, vol. 11, no. 2, 1960.

VII. APPENDIX

A. Concomitant Rank

Theorem 1. Let M be a U × N matrix of iid N(0, 1) random variables. Let Q be an instance of M. Let A, B ∈ ℝ^N. Let 0 < λ < 1 be a constant. Let τ = ⌈λ(U − 1)⌉. Let γ = Φ⁻¹( τ / (U − 1) ). Then

    lim_{U→∞} E[ |H_{Q,τ}(A) ∩ H_{Q,τ}(B)| ] / U = Φ₂(γ, γ, cos(A, B))    (31)

Proof: First, we note that for any scalar s > 0 and input vector A, H_{Q,τ}(sA) = H_{Q,τ}(A). This is because multiplying the input vector by a positive scalar simply scales the elements of QA, and does not change the order of the elements.

We also note that for any scalars s₁ > 0, s₂ > 0, cos(A, B) = cos(s₁A, s₂B).

Let Â = A/‖A‖ and B̂ = B/‖B‖. We will prove the following:

    lim_{U→∞} E[ |H_{Q,τ}(Â) ∩ H_{Q,τ}(B̂)| ] / U = Φ₂(γ, γ, cos(A, B))    (32)

which, by the argument above, is sufficient to prove eq. (31).

Consider the vectors

    x = QÂ    (33)
    y = QB̂    (34)

Let

    ρ = cos(A, B)    (35)

It is possible to prove that (x₁, y₁), (x₂, y₂), ..., (x_U, y_U) are iid samples from a bivariate normal distribution, where for all 1 ≤ i ≤ U,

    (x_i, y_i) ∼ N( (0, 0), [[1, ρ], [ρ, 1]] )    (36)

Now consider the following question: for any 1 ≤ i ≤ U, what is the probability that i ∈ H_{Q,τ}(Â) ∩ H_{Q,τ}(B̂)? From Definition 3 and the definition of x and y it follows that

    i ∈ H_{Q,τ}(Â) ⟺ R(x_i, x) ≤ τ    (37)
    i ∈ H_{Q,τ}(B̂) ⟺ R(y_i, y) ≤ τ    (38)

where R is the rank function introduced in Definition 1. Thus

    i ∈ H_{Q,τ}(Â) ∩ H_{Q,τ}(B̂) ⟺ R(x_i, x) ≤ τ ∧ R(y_i, y) ≤ τ    (39)

By Theorem 2,

    lim_{U→∞} P[ R(x_i, x) ≤ τ ∧ R(y_i, y) ≤ τ ] = Φ₂(γ, γ, ρ)    (40)

Substituting in eq. (39),

    lim_{U→∞} P[ i ∈ H_{Q,τ}(Â) ∩ H_{Q,τ}(B̂) ] = Φ₂(γ, γ, ρ)    (41)

Since this holds for all 1 ≤ i ≤ U, from eq. (41) it follows that

    lim_{U→∞} E[ |H_{Q,τ}(Â) ∩ H_{Q,τ}(B̂)| ] = U Φ₂(γ, γ, ρ)    (42)
    lim_{U→∞} E[ |H_{Q,τ}(Â) ∩ H_{Q,τ}(B̂)| ] / U = Φ₂(γ, γ, ρ)    (43)

Theorem 2. Let U be a positive integer. Let 0 < λ < 1 be a constant. Let t = ⌈λ(U − 1)⌉. Let γ = Φ⁻¹( t / (U − 1) ). Let

    S = ( (x₁, y₁), (x₂, y₂), ..., (x_U, y_U) )

be a sample from Φ₂(x, y, ρ). Let x = (x₁, x₂, ..., x_U)ᵀ and y = (y₁, y₂, ..., y_U)ᵀ. Let 1 ≤ i ≤ U. Then

    lim_{U→∞} P[ R(x_i, x) ≤ t ∧ R(y_i, y) ≤ t ] = Φ₂(γ, γ, ρ)    (44)

Proof: First, we note that since x and y are samples from a normal distribution, the probability that they have duplicate elements is zero. So we will assume that they do not have duplicate elements.

Consider the sample S′, which is S with (x_i, y_i) taken out. Let

    x′ = (x₁, ..., x_{i−1}, x_{i+1}, ..., x_U)ᵀ
    y′ = (y₁, ..., y_{i−1}, y_{i+1}, ..., y_U)ᵀ

Since there are no duplicates in x′ or y′, there are unique elements x̂ in x′ and ŷ in y′ such that

    R(x̂, x′) = t
    R(ŷ, y′) = t

It is not too difficult to show that

    R(x_i, x) ≤ t ∧ R(y_i, y) ≤ t ⟺ x_i < x̂ ∧ y_i < ŷ    (45)

That is,

    P[ R(x_i, x) ≤ t ∧ R(y_i, y) ≤ t ] = P[ x_i < x̂ ∧ y_i < ŷ ]    (46)

Let

    q = Φ⁻¹(λ)    (47)

Then, when we apply Proposition 3 to S′, we get the following:

    (x̂, ŷ) ∼ N( (q, q), (1/(U−1)) [[s², r], [r, s²]] )    (48)

with s and r constants solely dependent on λ. Now, by assumption,

    (x_i, y_i) ∼ N( (0, 0), [[1, ρ], [ρ, 1]] )    (49)

We know that (x_i, y_i) and (x̂, ŷ) are instances of independent random variables with the distributions of eq. (48) and eq. (49). Using the standard procedure for the subtraction of two independent bivariate normal random variables, we can derive

    (x_i − x̂, y_i − ŷ) ∼ N( (−q, −q), [[1 + s²/(U−1), ρ + r/(U−1)], [ρ + r/(U−1), 1 + s²/(U−1)]] )    (50)

Now, by definition,

    t = ⌈λ(U − 1)⌉    (51)

Thus, for some 0 < δ < 1,

    t = λ(U − 1) + δ    (52)

so that

    t / (U − 1) = λ + δ / (U − 1)    (53)

It is also clear that

    lim_{U→∞} t / (U − 1) = λ    (54)
    lim_{U→∞} Φ⁻¹( t / (U − 1) ) = Φ⁻¹(λ)    (55)

i.e.

    lim_{U→∞} γ = q    (56)

Also

    lim_{U→∞} ( 1 + s²/(U − 1) ) = 1    (57)
    lim_{U→∞} ( ρ + r/(U − 1) ) = ρ    (58)

Thus, from eq. (50), eq. (56), eq. (57) and eq. (58) we can derive

    (x_i − x̂, y_i − ŷ) ∼ N( (−γ, −γ), [[1, ρ], [ρ, 1]] )    (59)

Thus by Proposition 2,

    lim_{U→∞} P[ x_i − x̂ < 0 ∧ y_i − ŷ < 0 ] = Φ₂(γ, γ, ρ)    (60)
    lim_{U→∞} P[ x_i < x̂ ∧ y_i < ŷ ] = Φ₂(γ, γ, ρ)    (61)

Substituting the left hand side of eq. (61) in the right hand side of eq. (46) we get

    lim_{U→∞} P[ R(x_i, x) ≤ t ∧ R(y_i, y) ≤ t ] = Φ₂(γ, γ, ρ)    (62)

Proposition 2. Let

    (x, y) ∼ N( (−q, −q), [[1, ρ], [ρ, 1]] )    (63)

Then P[ x < 0 ∧ y < 0 ] = Φ₂(q, q, ρ).

Proof: Let (u, v) = (x + q, y + q). Then from eq. (63) it easily follows that

    (u, v) ∼ N( (0, 0), [[1, ρ], [ρ, 1]] )    (64)

Thus, by eq. (64) and the definition of Φ₂,

    P[ u < q ∧ v < q ] = Φ₂(q, q, ρ)    (65)

By definition, u = x + q and v = y + q. Substituting in eq. (65) we get

    P[ x + q < q ∧ y + q < q ] = Φ₂(q, q, ρ)    (66)
    P[ x < 0 ∧ y < 0 ] = Φ₂(q, q, ρ)    (67)

Proposition 3. Let (x₁, y₁), (x₂, y₂), ..., (x_n, y_n) be a sample from Φ₂(x, y, ρ). Let 0 < λ < 1 be a constant. Let

    t = ⌈λn⌉
    q = Φ⁻¹(λ)
    s² = λ(1 − λ) / φ²(q)
    r = ( Φ₂(q, q, ρ) − λ² ) / φ²(q)

Let X_{1:n}, X_{2:n}, ..., X_{n:n} be the order statistics on x₁, x₂, ..., x_n and Y_{1:n}, Y_{2:n}, ..., Y_{n:n} the order statistics on y₁, y₂, ..., y_n. Then, as n → ∞,

    (X_{t:n}, Y_{t:n}) ∼ N( (Φ⁻¹(λ), Φ⁻¹(λ)), [[s²/n, r/n], [r/n, s²/n]] )    (68)

Proof: This follows from the theorem on page 148 in [20] with appropriate substitutions.

B. K_γ(A, B) is an Admissible Support Vector Kernel

Theorem 3. Let A, B ∈ ℝ^n. Then K_γ(A, B) satisfies the Mercer conditions, and is an admissible SV kernel.

Proof: According to the criteria in Theorem 8 in [21], to prove that K_γ(A, B) satisfies the Mercer conditions, it is sufficient to prove that there is a convergent series

    K_γ(A, B) = Σ_{n=0}^{∞} α_n (A · B)^n    (69)

where α_n ≥ 0 for all n. By definition,

    K_γ(A, B) = Φ₂(γ, γ, cos(A, B))    (70)

Using the tetrachoric series expansion for Φ₂ [22] we can show that

    Φ₂(γ, γ, ρ) = Φ²(γ) + Σ_{k=0}^{∞} [φ(γ) He_k(γ)]² ρ^{k+1} / (k+1)!    (71)
                = Φ²(γ) + Σ_{k=1}^{∞} [φ(γ) He_{k−1}(γ)]² ρ^k / k!    (72)

where He_k(x) is the k-th Hermite polynomial. Let

    ρ = cos(A, B)    (73)
    z = 1 / (‖A‖ ‖B‖)    (74)

so that

    ρ = z (A · B)    (75)

Substituting in eq. (70) and eq. (72) and simplifying we get

    K_γ(A, B) = Φ²(γ) + Σ_{k=1}^{∞} [φ(γ) He_{k−1}(γ)]² (z A · B)^k / k!    (76)
              = Φ²(γ) + Σ_{k=1}^{∞} ( [φ(γ) He_{k−1}(γ)]² z^k / k! ) (A · B)^k    (77)

Since Φ²(γ) ≥ 0 and, for all k,

    [φ(γ) He_{k−1}(γ)]² z^k / k! ≥ 0    (78)

eq. (77) is sufficient to prove the theorem.

C. Formula for the CRO Kernel

Proposition 4.

    K_γ(A, B) = Φ²(γ) + ∫₀^{cos(A,B)} exp( −γ² / (1 + ρ) ) / ( 2π √(1 − ρ²) ) dρ    (79)

Proof: Let

    C(x, ρ) = Φ₂(x, x, ρ)    (80)

By definition,

    K_γ(A, B) = C(γ, cos(A, B))    (81)

Now, as proved in [23],

    d/dρ Φ₂(x, y, ρ) = φ₂(x, y, ρ)    (82)

i.e.

    d/dρ Φ₂(x, y, ρ) = exp( −(x² − 2ρxy + y²) / (2(1 − ρ²)) ) / ( 2π √(1 − ρ²) )    (83)

Thus

    d/dρ C(x, ρ) = exp( −(2x² − 2ρx²) / (2(1 − ρ²)) ) / ( 2π √(1 − ρ²) )    (84)
                 = exp( −x²(1 − ρ) / (1 − ρ²) ) / ( 2π √(1 − ρ²) )    (85)

Let

    −1 < ρ < 1    (86)

Then eq. (85) can be simplified as

    d/dρ C(x, ρ) = exp( −x² / (1 + ρ) ) / ( 2π √(1 − ρ²) )    (87)

We know that C(x, 0) = Φ²(x). Thus

    C(x, r) = Φ²(x) + ∫₀^{r} exp( −x² / (1 + ρ) ) / ( 2π √(1 − ρ²) ) dρ    (88)

From eq. (81) and eq. (88) we get

    K_γ(A, B) = Φ²(γ) + ∫₀^{cos(A,B)} exp( −γ² / (1 + ρ) ) / ( 2π √(1 − ρ²) ) dρ    (89)

Notice that condition (86) is satisfied inside the integral in eq. (89).


Linear Dependency Between and the Input Noise in -Support Vector Regression 544 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 14, NO. 3, MAY 2003 Linear Dependency Between the Input Noise in -Support Vector Regression James T. Kwok Ivor W. Tsang Abstract In using the -support vector

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines www.cs.wisc.edu/~dpage 1 Goals for the lecture you should understand the following concepts the margin slack variables the linear support vector machine nonlinear SVMs the kernel

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

ML (cont.): SUPPORT VECTOR MACHINES

ML (cont.): SUPPORT VECTOR MACHINES ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version

More information

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017 The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the

More information

Lecture Notes on Support Vector Machine

Lecture Notes on Support Vector Machine Lecture Notes on Support Vector Machine Feng Li fli@sdu.edu.cn Shandong University, China 1 Hyperplane and Margin In a n-dimensional space, a hyper plane is defined by ω T x + b = 0 (1) where ω R n is

More information

An Ensemble Ranking Solution for the Yahoo! Learning to Rank Challenge

An Ensemble Ranking Solution for the Yahoo! Learning to Rank Challenge An Ensemble Ranking Solution for the Yahoo! Learning to Rank Challenge Ming-Feng Tsai Department of Computer Science University of Singapore 13 Computing Drive, Singapore Shang-Tse Chen Yao-Nan Chen Chun-Sung

More information

Stochastic Quasi-Newton Methods

Stochastic Quasi-Newton Methods Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent

More information

Lecture 10: A brief introduction to Support Vector Machine

Lecture 10: A brief introduction to Support Vector Machine Lecture 10: A brief introduction to Support Vector Machine Advanced Applied Multivariate Analysis STAT 2221, Fall 2013 Sungkyu Jung Department of Statistics, University of Pittsburgh Xingye Qiao Department

More information

Sparse Support Vector Machines by Kernel Discriminant Analysis

Sparse Support Vector Machines by Kernel Discriminant Analysis Sparse Support Vector Machines by Kernel Discriminant Analysis Kazuki Iwamura and Shigeo Abe Kobe University - Graduate School of Engineering Kobe, Japan Abstract. We discuss sparse support vector machines

More information

Kernel Methods in Machine Learning

Kernel Methods in Machine Learning Kernel Methods in Machine Learning Autumn 2015 Lecture 1: Introduction Juho Rousu ICS-E4030 Kernel Methods in Machine Learning 9. September, 2015 uho Rousu (ICS-E4030 Kernel Methods in Machine Learning)

More information

Regularized Least Squares

Regularized Least Squares Regularized Least Squares Ryan M. Rifkin Google, Inc. 2008 Basics: Data Data points S = {(X 1, Y 1 ),...,(X n, Y n )}. We let X simultaneously refer to the set {X 1,...,X n } and to the n by d matrix whose

More information

Sparse Random Features Algorithm as Coordinate Descent in Hilbert Space

Sparse Random Features Algorithm as Coordinate Descent in Hilbert Space Sparse Random Features Algorithm as Coordinate Descent in Hilbert Space Ian E.H. Yen 1 Ting-Wei Lin Shou-De Lin Pradeep Ravikumar 1 Inderjit S. Dhillon 1 Department of Computer Science 1: University of

More information

Neural networks and support vector machines

Neural networks and support vector machines Neural netorks and support vector machines Perceptron Input x 1 Weights 1 x 2 x 3... x D 2 3 D Output: sgn( x + b) Can incorporate bias as component of the eight vector by alays including a feature ith

More information

Towards Deep Kernel Machines

Towards Deep Kernel Machines Towards Deep Kernel Machines Julien Mairal Inria, Grenoble Prague, April, 2017 Julien Mairal Towards deep kernel machines 1/51 Part I: Scientific Context Julien Mairal Towards deep kernel machines 2/51

More information

SVMs, Duality and the Kernel Trick

SVMs, Duality and the Kernel Trick SVMs, Duality and the Kernel Trick Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 26 th, 2007 2005-2007 Carlos Guestrin 1 SVMs reminder 2005-2007 Carlos Guestrin 2 Today

More information

12. Cholesky factorization

12. Cholesky factorization L. Vandenberghe ECE133A (Winter 2018) 12. Cholesky factorization positive definite matrices examples Cholesky factorization complex positive definite matrices kernel methods 12-1 Definitions a symmetric

More information

Multiple Similarities Based Kernel Subspace Learning for Image Classification

Multiple Similarities Based Kernel Subspace Learning for Image Classification Multiple Similarities Based Kernel Subspace Learning for Image Classification Wang Yan, Qingshan Liu, Hanqing Lu, and Songde Ma National Laboratory of Pattern Recognition, Institute of Automation, Chinese

More information

MULTIPLEKERNELLEARNING CSE902

MULTIPLEKERNELLEARNING CSE902 MULTIPLEKERNELLEARNING CSE902 Multiple Kernel Learning -keywords Heterogeneous information fusion Feature selection Max-margin classification Multiple kernel learning MKL Convex optimization Kernel classification

More information

Constrained Optimization and Support Vector Machines

Constrained Optimization and Support Vector Machines Constrained Optimization and Support Vector Machines Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/

More information