On the Impact of Kernel Approximation on Learning Accuracy

Corinna Cortes (Google Research, New York, NY), Mehryar Mohri (Courant Institute and Google Research, New York, NY), Ameet Talwalkar (Courant Institute, New York, NY)

Abstract

Kernel approximation is commonly used to scale kernel-based algorithms to applications containing as many as several million instances. This paper analyzes the effect of such approximations in the kernel matrix on the hypothesis generated by several widely used learning algorithms. We give stability bounds based on the norm of the kernel approximation for these algorithms, including SVMs, KRR, and graph Laplacian-based regularization algorithms. These bounds help determine the degree of approximation that can be tolerated in the estimation of the kernel matrix. Our analysis is general and applies to arbitrary approximations of the kernel matrix. However, we also give a specific analysis of the Nyström low-rank approximation in this context and report the results of experiments evaluating the quality of the Nyström low-rank kernel approximation when used with ridge regression.

1 Introduction

The size of modern-day learning problems found in computer vision, natural language processing, systems design and many other areas is often in the order of hundreds of thousands and can exceed several million. Scaling standard kernel-based algorithms such as support vector machines (SVMs) (Cortes and Vapnik, 1995), kernel ridge regression (KRR) (Saunders et al., 1998), and kernel principal component analysis (KPCA) (Schölkopf et al., 1998) to such magnitudes is a serious issue since even storing the kernel matrix can be prohibitive at this size.

(Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.)

One solution suggested for dealing with such large-scale problems consists of a low-rank approximation of the kernel matrix (Williams and Seeger, 2000). Other variants of these approximation techniques based on the Nyström method have also been recently presented and shown to be applicable to large-scale problems (Belabbas and Wolfe, 2009; Drineas and Mahoney, 2005; Kumar et al., 2009a; Talwalkar et al., 2008; Zhang et al., 2008). Kernel approximations based on other techniques such as column sampling (Kumar et al., 2009b), incomplete Cholesky decomposition (Bach and Jordan, 2002; Fine and Scheinberg, 2002), or Kernel Matching Pursuit (KMP) (Hussain and Shawe-Taylor, 2008; Vincent and Bengio, 2000) have also been widely used in large-scale learning applications.

But how does the kernel approximation affect the performance of the learning algorithm? There exists some previous work on this subject. Spectral clustering with perturbed data was studied in a restrictive setting with several assumptions by Huang et al. (2008). In Fine and Scheinberg (2002), the authors address this question in terms of the impact on the value of the objective function to be optimized by the learning algorithm. However, we strive to take the question one step further and directly analyze the effect of an approximation in the kernel matrix on the hypothesis generated by several widely used kernel-based learning algorithms. We give stability bounds based on the norm of the kernel approximation for these algorithms, including SVMs, KRR, and graph Laplacian-based regularization algorithms (Belkin et al., 2004). These bounds help determine the degree of approximation that can be tolerated in the estimation of the kernel matrix.

Our analysis differs from previous applications of stability analysis as put forward by Bousquet and Elisseeff (2001). Instead of studying the effect of changing one training point, we study the effect of changing the kernel matrix. Our analysis is general and applies to arbitrary approximations of the kernel matrix. However, we also give a specific analysis of the Nyström low-rank approximation, given the recent interest in this method and the successful applications of this algorithm to large-scale problems. We also report the results of experiments evaluating the quality of this kernel approximation when used with ridge regression.

The remainder of this paper is organized as follows. Section 2 introduces the problem of kernel stability and gives a kernel stability analysis of several algorithms. Section 3 provides a brief review of the Nyström approximation method and gives error bounds that can be combined with the kernel stability bounds. Section 4 reports the results of experiments with kernel approximation combined with kernel ridge regression.

2 Kernel Stability Analysis

In this section we analyze the impact of kernel approximation on several common kernel-based learning algorithms: KRR, SVMs, and graph Laplacian-based regularization algorithms. Our stability analyses result in bounds on the hypotheses directly in terms of the quality of the kernel approximation. In our analysis we assume that the kernel approximation is only used during training, where the approximation may serve to reduce resource requirements. At testing time the true kernel function is used. This scenario is standard for the Nyström method and other approximations.

We consider the standard supervised learning setting where the learning algorithm receives a sample of m labeled points S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m, where X is the input space and Y the set of labels, with Y = R and |y| ≤ M in the regression case, and Y = {−1, +1} in the classification case. Throughout the paper, the kernel matrix K and its approximation K′ are assumed to be symmetric, positive, and semi-definite (SPSD).

2.1 Kernel Ridge Regression

We first provide a stability analysis of kernel ridge regression. The following is the dual optimization problem solved by KRR (Saunders et al., 1998):

    max_{α ∈ R^m}  −λ α⊤α − α⊤Kα + 2α⊤y,    (1)

where λ = mλ_0 > 0 is the ridge parameter. The problem admits the closed-form solution α = (K + λI)^{-1} y. We denote by h the hypothesis returned by kernel ridge regression when using the exact kernel matrix.

Proposition 1. Let h′ denote the hypothesis returned by kernel ridge regression when using the approximate kernel matrix K′ ∈ R^{m×m}. Furthermore, define κ > 0 such that K(x, x) ≤ κ and K′(x, x) ≤ κ for all x ∈ X. This condition is verified with κ = 1 for Gaussian kernels, for example. Then the following inequality holds for all x ∈ X:

    |h′(x) − h(x)| ≤ (κ M)/(λ_0^2 m) ‖K′ − K‖_2.    (2)

Proof. Let α′ denote the solution obtained using the approximate kernel matrix K′. We can write

    α′ − α = (K′ + λI)^{-1} y − (K + λI)^{-1} y    (3)
           = −[(K′ + λI)^{-1} (K′ − K) (K + λI)^{-1}] y,    (4)

where we used the identity M′^{-1} − M^{-1} = −M′^{-1}(M′ − M)M^{-1}, valid for any invertible matrices M, M′. Thus, ‖α′ − α‖ can be bounded as follows:

    ‖α′ − α‖ ≤ ‖(K′ + λI)^{-1}‖ ‖K′ − K‖_2 ‖(K + λI)^{-1}‖ ‖y‖ ≤ ‖K′ − K‖_2 ‖y‖ / (λ_min(K′ + λI) λ_min(K + λI)),    (5)

where λ_min(K′ + λI) is the smallest eigenvalue of K′ + λI and λ_min(K + λI) the smallest eigenvalue of K + λI. The hypothesis h derived with the exact kernel matrix is defined by h(x) = Σ_{i=1}^m α_i K(x, x_i) = α⊤ k_x, where k_x = (K(x, x_1), ..., K(x, x_m))⊤. By assumption, no approximation affects k_x, thus the approximate hypothesis h′ is given by h′(x) = α′⊤ k_x and

    |h′(x) − h(x)| ≤ ‖α′ − α‖ ‖k_x‖ ≤ √m κ ‖α′ − α‖.    (6)

Using the bound on ‖α′ − α‖ given by inequality (5), the fact that the eigenvalues of (K + λI) and (K′ + λI) are larger than or equal to λ = mλ_0 since K and K′ are SPSD matrices, and ‖y‖ ≤ √m M, yields

    |h′(x) − h(x)| ≤ (κ m M ‖K′ − K‖_2) / (λ_min(K′ + λI) λ_min(K + λI)) ≤ (κ M)/(λ_0^2 m) ‖K′ − K‖_2.

The generalization bounds for KRR, e.g., stability bounds (Bousquet and Elisseeff, 2001), are of the form R(h) ≤ R̂(h) + O(1/√m), where R(h) denotes the generalization error and R̂(h) the empirical error of a hypothesis h with respect to the square loss. The proposition shows that (h′(x) − h(x))^2 = O(‖K′ − K‖_2^2 / (λ_0^4 m^2)). Thus, it suggests that the kernel approximation tolerated should be such that ‖K′ − K‖_2^2 / (λ_0^4 m^2) ≤ O(1/√m), that is, such that ‖K′ − K‖_2 ≤ O(λ_0^2 m^{3/4}).

Note that the main bound used in the proof of the proposition, inequality (5), is tight in the sense that it can be matched for some kernels K and K′. Indeed, let K and K′ be the kernel functions defined by K(x, y) = β and K′(x, y) = β′ if x = y, and K′(x, y) = K(x, y) = 0 otherwise, with β, β′ ≥ 0. Then, the corresponding kernel matrices for a sample S are K = βI and K′ = β′I, and the dual parameter vectors are given by α = y/(β + λ) and α′ = y/(β′ + λ). Now, since λ_min(K′ + λI) = β′ + λ and λ_min(K + λI) = β + λ, and ‖K′ − K‖_2 = |β′ − β|, the following equality holds:

    ‖α′ − α‖ = |β − β′| ‖y‖ / ((β + λ)(β′ + λ))    (7)
             = ‖K′ − K‖_2 ‖y‖ / (λ_min(K + λI) λ_min(K′ + λI)).    (8)

This limits significant improvements of the bound of Proposition 1 using similar techniques.
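
To make the role of the different quantities concrete, the following minimal sketch (not the authors' code; the Gaussian kernel, the rank-r eigenvalue truncation playing the role of K′, and all numerical values are assumptions chosen for illustration) solves the KRR dual with an approximate SPSD matrix and compares the resulting pointwise gap with the bound of Proposition 1 for κ = 1.

```python
import numpy as np

def krr_dual(K, y, lam):
    """KRR dual solution: alpha = (K + lam*I)^{-1} y."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

rng = np.random.default_rng(0)
m, lam0 = 300, 0.1
X = rng.normal(size=(m, 2))
y = np.sin(X[:, 0]); M = np.max(np.abs(y))                     # |y_i| <= M
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)                                                # exact Gaussian kernel, kappa = 1

# SPSD approximation K' of K: keep only the top-r eigenvalues.
w, V = np.linalg.eigh(K)
r = 20
Kp = (V[:, -r:] * w[-r:]) @ V[:, -r:].T

lam = m * lam0
alpha, alpha_p = krr_dual(K, y, lam), krr_dual(Kp, y, lam)

# At test time the exact kernel is used, so h'(x) - h(x) = (alpha' - alpha)^T k_x;
# here the columns of K serve as the test vectors k_x.
gap = np.max(np.abs(K @ (alpha_p - alpha)))
bound = 1.0 * M * np.linalg.norm(Kp - K, 2) / (lam0 ** 2 * m)  # Proposition 1, kappa = 1
print(f"max pointwise gap = {gap:.4f}, bound = {bound:.4f}")
```
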

2.2 Support Vector Machines

This section analyzes the kernel stability of SVMs. For simplicity, we shall consider the case where the classification function sought has no offset. In practice, this corresponds to using a constant feature. Let Φ: X → F denote a feature mapping from the input space X to a Hilbert space F corresponding to some kernel K. The hypothesis set we consider is thus H = {h : ∃w ∈ F, ∀x ∈ X, h(x) = w⊤Φ(x)}. The following is the standard primal optimization problem for SVMs:

    min_{w ∈ F} F_K(w) = (1/2)‖w‖^2 + C_0 R̂_K(w),    (9)

where R̂_K(w) = (1/m) Σ_{i=1}^m L(y_i w⊤Φ(x_i)) is the empirical error, with L(y_i w⊤Φ(x_i)) = max(0, 1 − y_i w⊤Φ(x_i)) the hinge loss associated to the ith point.

In the following, we analyze the difference between the hypothesis h returned by SVMs when trained on the sample S of m points using a kernel K, versus the hypothesis h′ obtained when training on the same sample with the kernel K′. For a fixed x ∈ X, we shall compare more specifically h(x) and h′(x). Thus, we can work with the finite set X_{m+1} = {x_1, ..., x_m, x_{m+1}}, with x_{m+1} = x.

Different feature mappings Φ can be associated to the same kernel K. To compare the solutions w and w′ of the optimization problems based on F_K and F_{K′}, we can choose the feature mappings Φ and Φ′ associated to K and K′ such that they both map to R^{m+1} as follows. Let K_{m+1} denote the Gram matrix associated to K and K′_{m+1} that of kernel K′ for the set of points X_{m+1}. Then for all u ∈ X_{m+1}, Φ and Φ′ can be defined by

    Φ(u) = (K_{m+1}^†)^{1/2} [K(x_1, u), ..., K(x_{m+1}, u)]⊤    (10)
    Φ′(u) = (K′^†_{m+1})^{1/2} [K′(x_1, u), ..., K′(x_{m+1}, u)]⊤,    (11)

where K_{m+1}^† denotes the pseudo-inverse of K_{m+1} and K′^†_{m+1} that of K′_{m+1}. It is not hard to see then that for all u, v ∈ X_{m+1}, K(u, v) = Φ(u)⊤Φ(v) and K′(u, v) = Φ′(u)⊤Φ′(v) (Schölkopf and Smola, 2002). Since the optimization problem depends only on the sample S, we can use the feature mappings just defined in the expression of F_K and F_{K′}. This does not affect in any way the standard SVM optimization problem.

Let w ∈ R^{m+1} denote the minimizer of F_K and w′ that of F_{K′}. By definition, if we let Δw denote w′ − w, for all s ∈ [0, 1] the following inequalities hold:

    F_K(w) ≤ F_K(w + sΔw)    (12)
    F_{K′}(w′) ≤ F_{K′}(w′ − sΔw).    (13)

Summing these two inequalities, rearranging terms, and using the identity (‖w + sΔw‖^2 − ‖w‖^2) + (‖w′ − sΔw‖^2 − ‖w′‖^2) = −2s(1 − s)‖Δw‖^2, we obtain, as in (Bousquet and Elisseeff, 2001),

    s(1 − s)‖Δw‖^2 ≤ C_0 [ (R̂_K(w + sΔw) − R̂_K(w)) + (R̂_{K′}(w′ − sΔw) − R̂_{K′}(w′)) ].

Note that w + sΔw = sw′ + (1 − s)w and w′ − sΔw = sw + (1 − s)w′. Then, by the convexity of the hinge loss and thus of R̂_K and R̂_{K′}, the following inequalities hold:

    R̂_K(w + sΔw) − R̂_K(w) ≤ s (R̂_K(w′) − R̂_K(w))
    R̂_{K′}(w′ − sΔw) − R̂_{K′}(w′) ≤ s (R̂_{K′}(w) − R̂_{K′}(w′)).

Plugging these inequalities into the right-hand side, simplifying by s, and taking the limit s → 0 yields

    ‖Δw‖^2 ≤ C_0 [ (R̂_K(w′) − R̂_K(w)) + (R̂_{K′}(w) − R̂_{K′}(w′)) ]
           = (C_0/m) Σ_{i=1}^m [ (L(y_i w′⊤Φ(x_i)) − L(y_i w′⊤Φ′(x_i))) + (L(y_i w⊤Φ′(x_i)) − L(y_i w⊤Φ(x_i))) ],

where the last equality results from the definition of the empirical error.

Since the hinge loss is 1-Lipschitz, we can bound the terms on the right-hand side as follows:

    ‖Δw‖^2 ≤ (C_0/m) Σ_{i=1}^m [ |w′⊤(Φ′(x_i) − Φ(x_i))| + |w⊤(Φ′(x_i) − Φ(x_i))| ]    (14)
           ≤ (C_0/m) (‖w‖ + ‖w′‖) Σ_{i=1}^m ‖Φ′(x_i) − Φ(x_i)‖.    (15)

Let e_i denote the ith unit vector of R^{m+1}; then (K(x_1, x_i), ..., K(x_{m+1}, x_i))⊤ = K_{m+1} e_i. Thus, in view of the definition of Φ, for all i ∈ [1, m+1],

    Φ(x_i) = (K_{m+1}^†)^{1/2} [K(x_1, x_i), ..., K(x_{m+1}, x_i)]⊤ = (K_{m+1}^†)^{1/2} K_{m+1} e_i = K_{m+1}^{1/2} e_i,    (16)

and similarly Φ′(x_i) = K′^{1/2}_{m+1} e_i. K^{1/2}_{m+1} e_i is the ith column of K^{1/2}_{m+1} and similarly K′^{1/2}_{m+1} e_i is the ith column of K′^{1/2}_{m+1}. Thus, (15) can be rewritten as

    ‖w′ − w‖^2 ≤ (C_0/m) (‖w‖ + ‖w′‖) Σ_{i=1}^m ‖(K′^{1/2}_{m+1} − K^{1/2}_{m+1}) e_i‖.

As for the case of ridge regression, we shall assume that there exists κ > 0 such that K(x, x) ≤ κ and K′(x, x) ≤ κ for all x ∈ X_{m+1}. Now, since w can be written in terms of the dual variables 0 ≤ α_i ≤ C, with C = C_0/m, as w = Σ_{i=1}^m α_i K(x_i, ·), it can be bounded as ‖w‖ ≤ m (C_0/m) κ^{1/2} = C_0 κ^{1/2}, and similarly ‖w′‖ ≤ C_0 κ^{1/2}. Thus, we can write

    ‖w′ − w‖^2 ≤ (2C_0^2 κ^{1/2}/m) Σ_{i=1}^m ‖(K′^{1/2}_{m+1} − K^{1/2}_{m+1}) e_i‖
               ≤ (2C_0^2 κ^{1/2}/m) Σ_{i=1}^m ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2 ‖e_i‖
               = 2C_0^2 κ^{1/2} ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2.    (17)

Let K denote the Gram matrix associated to K and K′ that of kernel K′ for the sample S. Then, the following result holds.

Proposition 2. Let h′ denote the hypothesis returned by SVMs when using the approximate kernel matrix K′ ∈ R^{m×m}. Then, the following inequality holds for all x ∈ X:

    |h′(x) − h(x)| ≤ √2 κ^{3/4} C_0 ‖K′ − K‖_2^{1/4} [ 1 + (‖K′ − K‖_2/(4κ))^{1/4} ].    (18)

Proof. In view of (16) and (17), the following holds:

    |h′(x) − h(x)| = |w′⊤Φ′(x) − w⊤Φ(x)|
                   = |(w′ − w)⊤Φ′(x) + w⊤(Φ′(x) − Φ(x))|
                   ≤ ‖w′ − w‖ ‖Φ′(x)‖ + ‖w‖ ‖Φ′(x_{m+1}) − Φ(x_{m+1})‖
                   ≤ [2C_0^2 κ^{1/2} ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2]^{1/2} κ^{1/2} + C_0 κ^{1/2} ‖(K′^{1/2}_{m+1} − K^{1/2}_{m+1}) e_{m+1}‖
                   ≤ √2 C_0 κ^{3/4} ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2^{1/2} + C_0 κ^{1/2} ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2.

Now, by Lemma 1 (see Appendix), ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2 ≤ ‖K′_{m+1} − K_{m+1}‖_2^{1/2}. By assumption, the kernel approximation is only used at training time, so K′(x, x_i) = K(x, x_i) for all i ∈ [1, m], and since by definition x = x_{m+1}, the last rows and the last columns of the matrices K′_{m+1} and K_{m+1} coincide. Therefore, the matrix K′_{m+1} − K_{m+1} coincides with the matrix K′ − K bordered with a zero column and a zero row, and ‖K′^{1/2}_{m+1} − K^{1/2}_{m+1}‖_2 ≤ ‖K′ − K‖_2^{1/2}. Thus,

    |h′(x) − h(x)| ≤ √2 C_0 κ^{3/4} ‖K′ − K‖_2^{1/4} + C_0 κ^{1/2} ‖K′ − K‖_2^{1/2},    (19)

which is exactly the statement of the proposition.

Since the hinge loss L is 1-Lipschitz, Proposition 2 leads directly to the following bound on the pointwise difference of the hinge loss between the hypotheses h′ and h.

Corollary 1. Let h′ denote the hypothesis returned by SVMs when using the approximate kernel matrix K′ ∈ R^{m×m}. Then, the following inequality holds for all x ∈ X and y ∈ Y:

    |L(y h′(x)) − L(y h(x))| ≤ √2 κ^{3/4} C_0 ‖K′ − K‖_2^{1/4} [ 1 + (‖K′ − K‖_2/(4κ))^{1/4} ].    (20)

The bounds we obtain for SVMs are weaker than our bound for KRR. This is due mainly to the different loss functions defining the optimization problems of these algorithms.
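
To give a sense of the different dependencies on ‖K′ − K‖_2, the short sketch below (an illustration only; the helper names and the parameter values κ = 1, C_0 = 1, M = 1, λ_0 = 0.1, m = 1000 are assumptions) evaluates the pointwise bounds of Propositions 1 and 2 side by side, making the fourth-root versus linear behavior explicit.

```python
import numpy as np

def svm_bound(d, kappa=1.0, C0=1.0):
    """Proposition 2: sqrt(2)*kappa^(3/4)*C0*d^(1/4)*(1 + (d/(4*kappa))^(1/4)),
    where d = ||K' - K||_2."""
    return np.sqrt(2) * kappa ** 0.75 * C0 * d ** 0.25 * (1 + (d / (4 * kappa)) ** 0.25)

def krr_bound(d, kappa=1.0, M=1.0, lam0=0.1, m=1000):
    """Proposition 1: kappa*M*d / (lam0^2 * m), where d = ||K' - K||_2."""
    return kappa * M * d / (lam0 ** 2 * m)

for d in [1e-4, 1e-2, 1e0, 1e2]:
    print(f"||K'-K||_2 = {d:8.0e}   SVM bound = {svm_bound(d):9.4f}   KRR bound = {krr_bound(d):9.4f}")
```
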

2.3 Graph Laplacian regularization algorithms

We lastly study the kernel stability of graph Laplacian regularization algorithms such as that of Belkin et al. (2004). Given a connected weighted graph G = (X, E) in which edge weights can be interpreted as similarities between vertices, the task consists of predicting the vertex labels of u vertices using a labeled training sample S of m vertices. The input space X is thus reduced to the set of vertices, and a hypothesis h: X → R can be identified with the finite-dimensional vector h of its predictions h = [h(x_1), ..., h(x_{m+u})]⊤. The hypothesis set H can thus be identified with R^{m+u} here. Let h_S denote the restriction of h to the training points, [h(x_1), ..., h(x_m)]⊤ ∈ R^m, and similarly let y_S denote [y_1, ..., y_m]⊤ ∈ R^m. Then, the following is the optimization problem corresponding to this problem:

    min_{h ∈ H}  (m/C_0) h⊤Lh + (h_S − y_S)⊤(h_S − y_S)    (21)
    subject to   h⊤1 = 0,

where L is the graph Laplacian, 1 the column vector with all entries equal to 1, and C_0 > 0 a trade-off parameter. Thus, h⊤Lh = Σ_{i,j=1}^{m+u} w_{ij}(h(x_i) − h(x_j))^2 for some weight matrix (w_{ij}). The label vector y is assumed to be centered, which implies 1⊤y = 0. Since the graph is connected, the eigenvalue zero of the Laplacian has multiplicity one.

Define I_S ∈ R^{(m+u)×(m+u)} to be the diagonal matrix with [I_S]_{i,i} = 1 if i ≤ m and 0 otherwise. Maintaining the notation used in Belkin et al. (2004), we let P_H denote the projection on the hyperplane H orthogonal to 1 and let M = P_H((m/C_0)L + I_S) and M′ = P_H((m/C_0)L′ + I_S). We denote by h the hypothesis returned by the algorithm when using the exact Laplacian L, and by L′ an approximate graph Laplacian such that h⊤L′h = Σ_{i,j=1}^{m+u} w′_{ij}(h(x_i) − h(x_j))^2, based on a weight matrix (w′_{ij}) instead of (w_{ij}). We shall assume that there exists M > 0 such that |y_i| ≤ M for i ∈ [1, m].

Proposition 3. Let h′ denote the hypothesis returned by the graph-Laplacian regularization algorithm when using an approximate Laplacian L′ ∈ R^{(m+u)×(m+u)}. Then, the following inequality holds:

    ‖h′ − h‖ ≤ (m^{3/2} M)/(C_0 (λ̃_2 m/C_0 − 1)^2) ‖L′ − L‖_2,    (22)

where λ̃_2 = min{λ_2, λ′_2}, with λ_2 denoting the second smallest eigenvalue of the Laplacian L and λ′_2 the second smallest eigenvalue of L′.

Proof. The closed-form solution of Problem (21) is given by Belkin et al. (2004): h = (P_H((m/C_0)L + I_S))^{-1} y_S, where y_S is padded with zeros for the unlabeled points. Thus, we can use that expression and the matrix identity for M′^{-1} − M^{-1} that we already used in the analysis of KRR to write

    ‖h′ − h‖ = ‖M′^{-1} y_S − M^{-1} y_S‖    (23)
             = ‖(M′^{-1} − M^{-1}) y_S‖    (24)
             = ‖M′^{-1}(M − M′)M^{-1} y_S‖    (25)
             = (m/C_0) ‖M′^{-1} P_H(L − L′) M^{-1} y_S‖    (26)
             ≤ (m/C_0) ‖M′^{-1}‖ ‖M^{-1}‖ ‖y_S‖ ‖L′ − L‖_2.    (27)

For any column vector z ∈ R^{(m+u)×1} orthogonal to 1, by the triangle inequality and the projection property ‖P_H z‖ ≤ ‖z‖, the following inequalities hold:

    (m/C_0)‖P_H L z‖ = ‖P_H((m/C_0)L + I_S)z − P_H I_S z‖    (28)
                     ≤ ‖P_H((m/C_0)L + I_S)z‖ + ‖P_H I_S z‖    (29)
                     ≤ ‖P_H((m/C_0)L + I_S)z‖ + ‖I_S z‖.    (30)

This yields the lower bound

    ‖Mz‖ = ‖P_H((m/C_0)L + I_S)z‖    (31)
         ≥ (m/C_0)‖P_H L z‖ − ‖I_S z‖    (32)
         ≥ (λ_2 m/C_0 − 1)‖z‖,    (33)

which gives the following upper bounds on ‖M^{-1}‖ and ‖M′^{-1}‖:

    ‖M^{-1}‖ ≤ 1/(λ_2 m/C_0 − 1)   and   ‖M′^{-1}‖ ≤ 1/(λ′_2 m/C_0 − 1).

Plugging these inequalities into (27) and using ‖y_S‖ ≤ m^{1/2} M leads to

    ‖h′ − h‖ ≤ (m^{3/2} M)/(C_0 (λ_2 m/C_0 − 1)(λ′_2 m/C_0 − 1)) ‖L′ − L‖_2.

The generalization bounds for the graph-Laplacian algorithm are of the form R(h) ≤ R̂(h) + O(C_0/((λ_2 m/C_0 − 1)^2 √m)) (Belkin et al., 2004). In view of the bound given by the proposition, this suggests that the approximation tolerated should verify ‖L′ − L‖_2 ≤ O(1/√m).
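
As a concrete illustration of problem (21), the following sketch (a toy example under the formulation written above, with a hypothetical path graph, labels, and trade-off value; not the authors' code) solves the equality-constrained quadratic program directly through its KKT system.

```python
import numpy as np

def graph_laplacian_solution(W, y_S, m, C0):
    """Solve problem (21): min (m/C0)*h'Lh + ||h_S - y_S||^2 s.t. 1'h = 0.
    W is the (m+u) x (m+u) symmetric weight matrix; the first m vertices are labeled."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian
    a = m / C0
    I_S = np.zeros((n, n)); I_S[:m, :m] = np.eye(m)
    y_pad = np.zeros(n); y_pad[:m] = y_S           # labels padded with zeros
    # KKT system of the equality-constrained quadratic program.
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = 2 * a * L + 2 * I_S
    A[:n, n] = 1.0                                  # column for the multiplier of 1'h = 0
    A[n, :n] = 1.0
    b = np.concatenate([2 * y_pad, [0.0]])
    return np.linalg.solve(A, b)[:n]

# Hypothetical toy graph: a path on 6 vertices, the first 3 of them labeled (centered labels).
W = np.zeros((6, 6))
for i in range(5):
    W[i, i + 1] = W[i + 1, i] = 1.0
h = graph_laplacian_solution(W, y_S=np.array([1.0, -1.0, 0.0]), m=3, C0=1.0)
print(h, h.sum())   # the solution satisfies the constraint: its entries sum to zero
```
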

3 Application to the Nyström method

The previous section provided stability analyses for several common learning algorithms, studying the effect of using an approximate kernel matrix instead of the true one. The difference in hypothesis value is expressed simply in terms of the difference between the kernels measured by some norm. Although these are general bounds that are independent of how the approximation is obtained (so long as K′ remains SPSD), one relevant application of these bounds involves the Nyström method. As shown by Williams and Seeger (2000), and later by Drineas and Mahoney (2005), Talwalkar et al. (2008) and Zhang et al. (2008), low-rank approximations of the kernel matrix via the Nyström method can provide an effective technique for tackling large-scale data sets. However, all previous theoretical work analyzing the performance of the Nyström method has focused on the quality of the low-rank approximations, rather than on the performance of the kernel learning algorithms used in conjunction with these approximations. In this section, we first provide a brief review of the Nyström method and then show how we can leverage the analysis of Section 2 to present novel performance guarantees for the Nyström method in the context of kernel learning algorithms.

3.1 Nyström method

The Nyström approximation of a symmetric positive semi-definite (SPSD) matrix K is based on a sample of n columns of K (Drineas and Mahoney, 2005; Williams and Seeger, 2000). Let C denote the m × n matrix formed by these columns and W the n × n matrix consisting of the intersection of these n columns with the corresponding n rows of K. The columns and rows of K can be rearranged based on this sampling so that K and C can be written as follows:

    K = [ W, K_{21}⊤ ; K_{21}, K_{22} ]    and    C = [ W ; K_{21} ].    (34)

Note that W is also SPSD since K is SPSD. For a uniform sampling of the columns, the Nyström method generates a rank-k approximation K̃ of K, for k ≤ n, defined by

    K̃ = C W_k^+ C⊤ ≈ K,    (35)

where W_k is the best rank-k approximation of W for the Frobenius norm, that is W_k = argmin_{rank(V)=k} ‖W − V‖_F, and W_k^+ denotes the pseudo-inverse of W_k. W_k^+ can be derived from the singular value decomposition (SVD) of W, W = UΣU⊤, where U is orthonormal and Σ = diag(σ_1, ..., σ_n) is a real diagonal matrix with σ_1 ≥ ... ≥ σ_n ≥ 0. For k ≤ rank(W), it is given by W_k^+ = Σ_{i=1}^k σ_i^{-1} U^i U^{i⊤}, where U^i denotes the ith column of U. Since the running-time complexity of the SVD is O(n^3) and O(nkm) is required for the multiplication with C, the total complexity of the Nyström approximation computation is O(n^3 + nkm).
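
A minimal sketch of equation (35) follows (an illustration only; the RBF kernel, the sampling without replacement, and the matrix sizes are assumptions made here, not the setup of the experiments in Section 4).

```python
import numpy as np

def nystrom_approximation(K, n, k, rng):
    """Rank-k Nystrom approximation built from n uniformly sampled columns of K (eq. 35)."""
    m = K.shape[0]
    idx = rng.choice(m, size=n, replace=False)
    C = K[:, idx]                    # m x n sampled columns
    W = K[np.ix_(idx, idx)]          # n x n intersection block
    s, U = np.linalg.eigh(W)         # W is SPSD
    order = np.argsort(s)[::-1][:k]  # keep the top-k eigenvalues
    s_k, U_k = s[order], U[:, order]
    W_k_pinv = (U_k / s_k) @ U_k.T   # pseudo-inverse of the best rank-k approximation of W
    return C @ W_k_pinv @ C.T

# Hypothetical usage on a small RBF kernel matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)
K_tilde = nystrom_approximation(K, n=100, k=20, rng=rng)
print(np.linalg.norm(K - K_tilde, 2) / np.linalg.norm(K, 2))  # relative spectral distance
```
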
3.2 Nyström kernel ridge regression

The accuracy of low-rank Nyström approximations has been theoretically analyzed by Drineas and Mahoney (2005) and Kumar et al. (2009c). The following theorem, adapted from Drineas and Mahoney (2005) for the case of uniform sampling, gives an upper bound on the norm-2 error of the Nyström approximation of the form ‖K − K̃‖_2/‖K‖_2 ≤ ‖K − K_k‖_2/‖K‖_2 + O(1/√n). We denote by K_max the maximum diagonal entry of K.

Theorem 1. Let K̃ denote the rank-k Nyström approximation of K based on n columns sampled uniformly at random with replacement from K, and let K_k denote the best rank-k approximation of K. Then, with probability at least 1 − δ, the following inequality holds for any sample of size n:

    ‖K − K̃‖_2 ≤ ‖K − K_k‖_2 + (m/√n) K_max (2 + log(1/δ)).

Theorem 1 focuses on the quality of low-rank approximations. Combining the analysis from Section 2 with this theorem enables us to bound the relative performance of kernel learning algorithms when the Nyström method is used as a means of scaling them to large tasks. To illustrate this point, Theorem 2 uses Proposition 1 along with Theorem 1 to upper bound the relative performance of kernel ridge regression as a function of the approximation accuracy of the Nyström method.

Theorem 2. Let h′ denote the hypothesis returned by kernel ridge regression when using the approximate rank-k kernel matrix K̃ ∈ R^{m×m} generated using the Nyström method. Then, with probability at least 1 − δ, the following inequality holds for all x ∈ X:

    |h′(x) − h(x)| ≤ (κM)/(λ_0^2 m) [ ‖K − K_k‖_2 + (m/√n) K_max (2 + log(1/δ)) ].

A similar technique can be used to bound the error of the Nyström approximation when used with the other algorithms discussed in Section 2. The results are omitted due to space constraints.
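
The sketch below ties the two pieces together in the spirit of Theorem 2 (again an illustration with assumed sizes and an assumed Gaussian kernel, not the experimental protocol of Section 4): it builds a Nyström approximation, trains KRR on it, predicts with the exact kernel, and reports both sides of the comparison.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k, lam0 = 600, 120, 30, 0.1
X = rng.normal(size=(m, 3))
y = np.tanh(X[:, 0] - X[:, 1])
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # exact kernel, kappa = 1

# Rank-k Nystrom approximation from n uniformly sampled columns (eq. 35).
idx = rng.choice(m, size=n, replace=False)
C, W = K[:, idx], K[np.ix_(idx, idx)]
s, U = np.linalg.eigh(W)
s_k, U_k = s[-k:], U[:, -k:]
K_tilde = C @ ((U_k / s_k) @ U_k.T) @ C.T

# KRR trained with K (exact) and with K_tilde (approximate); prediction uses the exact k_x.
lam = m * lam0
alpha = np.linalg.solve(K + lam * np.eye(m), y)
alpha_t = np.linalg.solve(K_tilde + lam * np.eye(m), y)
gap = np.max(np.abs(K @ (alpha_t - alpha)))

# Proposition 1 applied with the measured approximation error (kappa = 1).
bound = np.max(np.abs(y)) * np.linalg.norm(K - K_tilde, 2) / (lam0 ** 2 * m)
print(f"max |h'(x) - h(x)| = {gap:.4f}   Proposition 1 bound = {bound:.4f}")
```
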

[Figure 1: Average absolute error of the kernel ridge regression hypothesis h′(·) generated from the Nyström approximation K̃, as a function of the relative spectral distance ‖K − K̃‖_2/‖K‖_2. Left panel: Abalone; right panel: Kin-8nm; x-axis: relative spectral distance; y-axis: average absolute error. For each dataset, the reported results show the average absolute error as a function of relative spectral distance for both the full dataset and for a subset of the data containing m = 2000 points. Results for the same value of m are connected with a line. The different points along the lines correspond to various numbers of sampled columns, i.e., n ranging from 1% to 50% of m.]

3.3 Nyström Woodbury Approximation

The Nyström method provides an effective algorithm for obtaining a rank-k approximation of the kernel matrix. As suggested by Williams and Seeger (2000) in the context of Gaussian Processes, this approximation can be combined with the Woodbury inversion lemma to derive an efficient algorithm for inverting the kernel matrix. The Woodbury inversion lemma states that the inverse of a rank-k correction of some matrix can be computed by doing a rank-k correction to the inverse of the original matrix. In the context of KRR, using the rank-k approximation K̃ given by the Nyström method instead of K and applying the inversion lemma yields

    (λI + K)^{-1} ≈ (λI + C W_k^+ C⊤)^{-1}    (36, 37)
                  = (1/λ) ( I − C [ λ I_n + W_k^+ C⊤C ]^{-1} W_k^+ C⊤ ).    (38)

Thus, since W_k^+ C⊤C has rank at most k, only the inversion of a small n × n matrix is needed, as opposed to the original problem of size m.
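
The identity used in (38) is easy to check numerically; the sketch below (an illustrative stand-in, with a synthetic SPSD matrix and k = n for brevity; not the authors' implementation) compares the Woodbury-based solve against the direct inversion with the approximate kernel.

```python
import numpy as np

def nystrom_woodbury_solve(C, W_pinv, y, lam):
    """Compute (lam*I + C W^+ C^T)^{-1} y via eq. (38); only an n x n system is solved."""
    n = C.shape[1]
    inner = lam * np.eye(n) + W_pinv @ (C.T @ C)
    return (y - C @ np.linalg.solve(inner, W_pinv @ (C.T @ y))) / lam

rng = np.random.default_rng(1)
m, n = 400, 60
B = rng.normal(size=(m, 8))
K = B @ B.T + np.eye(m)                     # synthetic SPSD stand-in for a kernel matrix
idx = rng.choice(m, size=n, replace=False)
C, W = K[:, idx], K[np.ix_(idx, idx)]
W_pinv = np.linalg.pinv(W)                  # k = n here; W_k^+ would be its rank-k truncation
y = rng.normal(size=m)
lam = 1.0

alpha_fast = nystrom_woodbury_solve(C, W_pinv, y, lam)
alpha_direct = np.linalg.solve(lam * np.eye(m) + C @ W_pinv @ C.T, y)
print(np.max(np.abs(alpha_fast - alpha_direct)))   # agreement up to rounding error
```
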

4 Experiments

For our experimental results, we focused on the kernel stability of kernel ridge regression, generating approximate kernel matrices using the Nyström method. We worked with the datasets listed in Table 1, and for each dataset we randomly selected 80% of the points to generate K and used the remaining 20% as the test set T. For each train-test split, we first performed a grid search to determine the optimal ridge for K, as well as the associated optimal hypothesis h(·). Next, using this optimal ridge, we generated a set of Nyström approximations, using various numbers of sampled columns, i.e., n ranging from 1% to 50% of m. For each Nyström approximation K̃, we computed the associated hypothesis h′(·) using the same ridge, and measured the distance between h′ and h as follows:

    average absolute error = (1/|T|) Σ_{x ∈ T} |h′(x) − h(x)|.    (39)

We measured the distance between K̃ and K as follows:

    relative spectral distance = ‖K − K̃‖_2 / ‖K‖_2.    (40)

Table 1: Summary of datasets used in our experiments (Asuncion and Newman, 2007; Ghahramani, 1996).

    Dataset   | Description                     | # Points (m) | # Features (d) | Kernel | Largest label (M)
    ABALONE   | physical attributes of abalones |      —       |       —        |  RBF   | 29
    KIN-8nm   | kinematics of a robot arm       |      —       |       —        |  RBF   | 1.5

Figure 1 presents results for each dataset using the full dataset and a subset of m = 2000 points. The plots show the average absolute error of h′(·) as a function of the relative spectral distance. Proposition 1 predicts a linear relationship between kernel approximation and relative error, which is corroborated by the experiments, as both datasets display this behavior for different sizes of training data.

5 Conclusion

Kernel approximation is used in a variety of contexts and its use is crucial for scaling many learning algorithms to very large tasks. We presented a stability-based analysis of the effect of kernel approximation on the hypotheses returned by several common learning algorithms. Our analysis is independent of how the approximation is obtained and simply expresses the change in hypothesis value in terms of the difference between the approximate kernel matrix and the true one, measured by some norm. We also provided a specific analysis of the Nyström low-rank approximation in this context and reported experimental results that support our theoretical analysis. In practice, the two steps of kernel matrix approximation and training of a kernel-based algorithm are typically executed separately. Work by Bach and Jordan (2005) suggested one possible method for combining these two steps. Perhaps more accurate results could be obtained by combining these two stages using the bounds we presented, or other similar ones based on our analysis.

A Lemma 1

The proof of Lemma 1 is given for completeness.

Lemma 1. Let M and M′ be two n × n SPSD matrices. Then, the following bound holds for the difference of the square-root matrices:

    ‖M′^{1/2} − M^{1/2}‖_2 ≤ ‖M′ − M‖_2^{1/2}.

Proof. By definition of the spectral norm, M′ − M ⪯ ‖M′ − M‖_2 I, where I is the n × n identity matrix. Hence, M′ ⪯ M + ‖M′ − M‖_2 I and M′^{1/2} ⪯ (M + λI)^{1/2}, with λ = ‖M′ − M‖_2. Thus, M′^{1/2} ⪯ (M + λI)^{1/2} ⪯ M^{1/2} + λ^{1/2} I, by sub-additivity of the square root on commuting SPSD matrices. This shows that M′^{1/2} − M^{1/2} ⪯ λ^{1/2} I and, by symmetry, M^{1/2} − M′^{1/2} ⪯ λ^{1/2} I; thus ‖M′^{1/2} − M^{1/2}‖_2 ≤ ‖M′ − M‖_2^{1/2}, which proves the statement of the lemma.
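
A quick numerical spot-check of Lemma 1 (a hypothetical sanity check on random SPSD matrices, not part of the paper's experiments):

```python
import numpy as np

def sqrtm_spsd(S):
    """Matrix square root of an SPSD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(0)
n = 50
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
M, Mp = A @ A.T, B @ B.T                       # two SPSD matrices
lhs = np.linalg.norm(sqrtm_spsd(Mp) - sqrtm_spsd(M), 2)
rhs = np.sqrt(np.linalg.norm(Mp - M, 2))
print(lhs <= rhs, lhs, rhs)                    # the inequality of Lemma 1 holds
```
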
References

A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
Francis Bach and Michael Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48, 2002.
Francis R. Bach and Michael I. Jordan. Predictive low-rank decomposition for kernel methods. In International Conference on Machine Learning, 2005.
M. A. Belabbas and P. J. Wolfe. Spectral methods in machine learning and new strategies for very large datasets. Proceedings of the National Academy of Sciences of the United States of America, 106(2), January 2009.
Mikhail Belkin, Irina Matveeva, and Partha Niyogi. Regularization and semi-supervised learning on large graphs. In Conference on Learning Theory, 2004.
Olivier Bousquet and André Elisseeff. Algorithmic stability and generalization performance. In Advances in Neural Information Processing Systems, 2001.
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2005.
Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 2002.
Zoubin Ghahramani. The kin datasets, 1996.
Ling Huang, Donghui Yan, Michael Jordan, and Nina Taft. Spectral clustering with perturbed data. In Advances in Neural Information Processing Systems, 2008.
Zakria Hussain and John Shawe-Taylor. Theory of matching pursuit. In Advances in Neural Information Processing Systems, 2008.
Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Ensemble Nyström method. In Advances in Neural Information Processing Systems, 2009.
Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. On sampling-based approximate spectral decomposition. In International Conference on Machine Learning, 2009.
Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling techniques for the Nyström method. In Conference on Artificial Intelligence and Statistics, 2009.
Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In International Conference on Machine Learning, 1998.
Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press: Cambridge, MA, 2002.
Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1998.
Ameet Talwalkar, Sanjiv Kumar, and Henry Rowley. Large-scale manifold learning. In Conference on Computer Vision and Pattern Recognition, 2008.
Pascal Vincent and Yoshua Bengio. Kernel Matching Pursuit. Technical Report 1179, Département d'Informatique et Recherche Opérationnelle, Université de Montréal, 2000.
Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, 2000.
Kai Zhang, Ivor Tsang, and James Kwok. Improved Nyström low-rank approximation and error analysis. In International Conference on Machine Learning, 2008.
