Sparse Semi-supervised Learning Using Conjugate Functions


Journal of Machine Learning Research 11 (2010) Submitted 12/09; Published 9/10

Sparse Semi-supervised Learning Using Conjugate Functions

Shiliang Sun, Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, P. R. China

John Shawe-Taylor, Department of Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom

Editor: Tony Jebara

Abstract

In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs, which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for use in large-scale problems, and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity in characterizing the roles of unlabeled data for semi-supervised learning.

Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multi-view regularization, support vector machine, statistical learning theory

1. Introduction

Semi-supervised learning, which considers how to estimate a target function from a few labeled examples and a large quantity of unlabeled examples, is one of the currently active research directions. If the unlabeled data are properly used, semi-supervised methods can achieve superior performance over their supervised counterparts. For an overview of semi-supervised learning methods, refer to Chapelle et al. (2006) and Zhu (2008).

Although semi-supervised learning was largely motivated by real-world applications where obtaining labels is expensive or time-consuming, a lot of theoretical outcomes have also been accomplished. Typical applications of semi-supervised learning include natural image classification and text classification, where it is inexpensive to collect large numbers of images and texts by automatic programs, but labeling them manually is costly.

Theoretical results on semi-supervised learning include PAC analysis (Balcan and Blum, 2005), manifold regularization (Belkin et al., 2006), and multi-view regularization theories (Sindhwani and Rosenberg, 2008), etc.

Among the methods proposed for semi-supervised learning, a family of them, for example, Laplacian regularized least squares (RLS), Laplacian support vector machines (SVMs), Co-RLS, Co-Laplacian RLS, Co-Laplacian SVMs, and manifold co-regularization (Belkin et al., 2006; Sindhwani et al., 2005; Sindhwani and Rosenberg, 2008), make use of the following representer theorem (Kimeldorf and Wahba, 1971) to represent the target function in a reproducing kernel Hilbert space (RKHS).

Theorem 1 (Representer theorem) Let H be an RKHS with kernel k: X × X → R. Fix any function V: R^n → R and any nondecreasing function Ψ: R → R. Define J(f) = V(f(x_1),...,f(x_n)) + Ψ(‖f‖²), and the linear space L = span{k(x_1,·),...,k(x_n,·)}. Then for any f ∈ H we have J(f_L) ≤ J(f), with f_L being the projection of f onto L, of the form f_L = Σ_{i=1}^{n} α_i k(x_i,·). Thus if J* = min_f J(f) exists, this minimum is attained for some f_L ∈ L. Moreover, if Ψ is strictly increasing, each minimizer of J(f) over H must be contained in L.

Generally, in the objective function J(f) of these semi-supervised learning methods, labeled examples are used to calculate an empirical loss of the target function while unlabeled examples are simultaneously used for some regularization purpose. By the representer theorem, the target function would involve kernel evaluations on all the labeled and unlabeled examples. This is computationally undesirable, because for semi-supervised learning a considerably large number of unlabeled examples are usually available. Consequently, sparsity in the number of unlabeled data used to represent target functions is crucial, and this constitutes the focus of this paper. However, little work has been done on this theme. In particular, no unified framework has yet been proposed to deal with this sparsity concern. While the sparse Laplacian core vector machines (Tsang and Kwok, 2007) touched on this problem, that method has a complicated optimization and is not generic enough to generalize to other similar semi-supervised learning methods. In contrast, the technique developed in this paper, based on Fenchel-Legendre conjugates, is computationally simple and widely applicable.

As far as multi-view learning is concerned, there has been work that introduces sparsity of the unlabeled data into the representation of the classifiers (Szedmak and Shawe-Taylor, 2007). This builds on the ideas developed for two-view learning known as the SVM-2K (Farquhar et al., 2006). The approach adopted is the use of an ε-insensitive loss function for the similarity constraint between the two functions from the two views. Unfortunately the resulting optimization is somewhat unmanageable and only scales to small data sets, despite interesting theoretical bounds that show the improvement gained using the unlabeled data. The work by Szedmak and Shawe-Taylor (2007) forms the starting point for the current paper, which aims to develop related methods that can be scaled to very large data sets.

Our approach is to go back to considering an ℓ2 loss between the outputs of the classifiers arising from the two views, and to show that this problem can be solved implicitly with variables only indexed by the labeled data.

To compute the value of this function on new data would still require a non-sparse dual representation in terms of the unlabeled data. However, we show that by optimizing weights of the unlabeled data, the solution of the ℓ2 problem converges to the solution of an ε-insensitive problem, ensuring that we subsequently obtain sparsity in the unlabeled data. Furthermore, we extend the generalization analysis of Szedmak and Shawe-Taylor (2007) to this case, giving computable expressions for the corresponding empirical Rademacher complexity.

To show the application of Fenchel-Legendre conjugates, in Section 2 we propose a novel sparse semi-supervised learning approach: sparse multi-view SVMs, where the conjugate functions play a central role in reformulating the optimization problem. The dual optimization of a subroutine of the sparse multi-view SVMs is converted to a quadratic programming problem in Section 3, whose scale only depends on the number of labeled examples, indicating the advantages of using conjugate functions. The generalization error of the sparse multi-view SVMs is given in Section 4 in terms of Rademacher complexity theory, followed by a derivation of the empirical Rademacher complexity of the class of functions induced by this new method in Section 5. Section 6 reports experimental results of the sparse multi-view SVMs, comparisons with related methods, and the possibility and potential for large-scale applications through sequential training. Extensions of the use of conjugate functions to a general convex loss and other related semi-supervised learning approaches are discussed in Section 7. Finally, Section 8 concludes this paper.

2. Sparse Multi-view SVMs

Multi-view semi-supervised learning, an important branch of semi-supervised learning, combines different sets of properties of an example to learn a target function. These different sets of properties are often referred to as views. Typical applications of multi-view learning are web-page categorization and content-based multimedia information retrieval. In web-page categorization, each web page can be simultaneously described by disparate properties such as the main text and inbound and outbound hyperlinks. In content-based multimedia information retrieval, a multimedia segment can include both audio and video components. For such scenarios, learning with multiple views is usually very beneficial. Even for problems with no natural multiple views, artificially generated views can still work favorably (Nigam and Ghani, 2000).

A useful assumption for multi-view learning is that features from each view are sufficient to train a good learner (Blum and Mitchell, 1998; Balcan et al., 2005; Farquhar et al., 2006). Making good use of this assumption through collaborative training or regularization between views can remove many false hypotheses from the hypothesis space, and thus facilitates effective learning.

For multi-view learning, an input x consists of multiple components from different views, for example, x = (x^1,...,x^m) for an m-view representation. A function f_j defined on view j only depends on x^j, that is, f_j(x) := f_j(x^j). Suppose we have a set of ℓ labeled examples {(x_i, y_i)}_{i=1}^{ℓ} with y_i ∈ {+1, −1}, and a set of u unlabeled examples {x_i}_{i=ℓ+1}^{ℓ+u}. The objective function of our sparse multi-view SVMs in the case of two views is given as follows, and can be readily extended to more than two views.

min_{f_1∈H_1, f_2∈H_2} (1/(2ℓ)) Σ_{i=1}^{ℓ} [(1 − y_i f_1(x_i))_+ + (1 − y_i f_2(x_i))_+] + γ_n (‖f_1‖² + ‖f_2‖²) + γ_v Σ_{i=1}^{ℓ+u} (|f_1(x_i) − f_2(x_i)| − ε)²_+,   (1)

where the nonnegative scalars γ_n, γ_v are respectively the norm regularization and multi-view regularization coefficients, and the last term is an ε-insensitive loss between the two views, with the function (·)_+ := max(0, ·) being the hinge loss. The final classifier for predicting the label of a new example is

f_c(x) = sgn( (f_1(x) + f_2(x)) / 2 ).   (2)

In the rest of this section, we will show that the use of the ε-insensitive loss indeed enforces sparsity, and that Fenchel-Legendre conjugates can be adopted to reformulate the optimization problem. We also show the saddle-point properties of the optimal solutions and give a (globally optimal) iterative optimization algorithm.
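The following NumPy sketch spells out the two loss ingredients of objective (1) and the combined classifier (2) on given prediction vectors. It is only an illustration; the function names and the array-based interface are ours, not part of the paper.

```python
import numpy as np

def labeled_hinge_term(y, f_vals):
    """Average hinge loss (1/l) * sum_i (1 - y_i f(x_i))_+ on the labeled points."""
    return np.mean(np.maximum(1.0 - y * f_vals, 0.0))

def multiview_eps_term(f1_vals, f2_vals, eps):
    """Squared eps-insensitive disagreement sum_i (|f1(x_i) - f2(x_i)| - eps)_+^2
    over all labeled and unlabeled points."""
    gap = np.maximum(np.abs(f1_vals - f2_vals) - eps, 0.0)
    return np.sum(gap ** 2)

def combined_predict(f1_vals, f2_vals):
    """Final classifier (2): the sign of the averaged view predictions."""
    return np.sign(0.5 * (f1_vals + f2_vals))
```

Note that objective (1) weighs the two per-view hinge terms by 1/(2ℓ), that is, one half of the sum of the two per-view averages computed above.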

2.1 Sparsity

In order to show the role of the ε-insensitive loss for sparsity pursuit, here we represent f_1(x) and f_2(x) in feature spaces as f_1(x) = w_1^⊤ φ_1(x) + b_1, f_2(x) = w_2^⊤ φ_2(x) + b_2, where φ_i(x) (i = 1, 2) is the image of x in the corresponding feature space. Problem (1) can be rewritten as

min_{w_1,w_2,ξ_1,ξ_2,b_1,b_2} P_0 = (1/(2ℓ)) Σ_{i=1}^{ℓ} (ξ_1^i + ξ_2^i) + γ_n (‖w_1‖² + ‖w_2‖²) + γ_v Σ_{i=1}^{ℓ+u} (|w_1^⊤ φ_1(x_i) + b_1 − w_2^⊤ φ_2(x_i) − b_2| − ε)²_+
s.t. y_i (w_1^⊤ φ_1(x_i) + b_1) ≥ 1 − ξ_1^i,  y_i (w_2^⊤ φ_2(x_i) + b_2) ≥ 1 − ξ_2^i,  ξ_1^i, ξ_2^i ≥ 0,  i = 1,...,ℓ,

where ξ_1 := [ξ_1^1,...,ξ_1^ℓ]^⊤ and ξ_2 := [ξ_2^1,...,ξ_2^ℓ]^⊤. The Lagrangian is

L = P_0 − Σ_{i=1}^{ℓ} [λ_1^i (y_i (w_1^⊤ φ_1(x_i) + b_1) − 1 + ξ_1^i) + λ_2^i (y_i (w_2^⊤ φ_2(x_i) + b_2) − 1 + ξ_2^i) + ν_1^i ξ_1^i + ν_2^i ξ_2^i],

where λ_1^i, λ_2^i, ν_1^i, ν_2^i ≥ 0 (i = 1,...,ℓ) are Lagrange multipliers. Suppose w_1^*, w_2^* are the optimal solutions. By the KKT conditions, the optimal solution w_1^* should satisfy ∂L/∂w_1 = 0. Therefore, we get

w_1^* = −(γ_v/γ_n) Σ_{i=1}^{ℓ+u} (|w_1^{*⊤} φ_1(x_i) + b_1 − w_2^{*⊤} φ_2(x_i) − b_2| − ε)_+ φ̃_1^i + (1/(2γ_n)) Σ_{i=1}^{ℓ} λ_1^i y_i φ_1(x_i),

where we suppose the derivative exists everywhere and φ̃_1^i := sgn{w_1^{*⊤} φ_1(x_i) + b_1 − w_2^{*⊤} φ_2(x_i) − b_2} φ_1(x_i). Now we can assess the sparsity of problem (1). From the above equation, w_1^* is a linear combination of the labeled examples with λ_1^i > 0 and of those unlabeled examples on which the difference of the predictions from the two views exceeds ε. In this sense, we can get sparse solutions by providing a non-zero ε. A similar analysis applies to w_2^*. Therefore, function f(x) is sparse in the number of used unlabeled examples. This analysis of sparsity is also well justified by the representer theorem.

2.2 Reformulation Using Conjugate Functions

Define t_i := [f_1(x_i) − f_2(x_i)]². Then the ε-insensitive loss term can be written as

f_ε(t) = Σ_{i=1}^{ℓ+u} (√t_i − ε)²_+,   (3)

where vector t := [t_1,...,t_{ℓ+u}]^⊤. We give a theorem affirming the convexity of function f_ε(t).

Theorem 2 Function f_ε(t) defined by (3) is convex.

Proof First, we show that f_ε(t_i) := (√t_i − ε)²_+ with the convex domain [0,+∞) is convex. When t_i ∈ (ε²,+∞), the second derivative ∂²f_ε(t_i)/∂t_i² = (ε/2) t_i^{−3/2} ≥ 0. Thus, function f_ε(t_i) is convex for t_i ∈ (ε²,+∞). Moreover, the value of function f_ε(t_i) for t_i ∈ (ε²,+∞) is larger than 0, which is the value of f_ε(t_i) for t_i ∈ [0,ε²], and function f_ε(t_i) with domain [0,+∞) is continuous at ε². Hence, f_ε(t_i) is convex on the domain [0,+∞). Then, being a nonnegative weighted sum of convex functions, f_ε(t) is indeed convex.

Define the conjugate vector z = [z_1,...,z_{ℓ+u}]^⊤ with entries being conjugate variables. The Fenchel-Legendre conjugate (which is also often called the convex conjugate or conjugate function) f_ε^*(z) is

f_ε^*(z) = sup_{t∈dom f_ε} (t^⊤ z − f_ε(t)) = sup_t Σ_{i=1}^{ℓ+u} [z_i t_i − (√t_i − ε)²_+] = Σ_{i=1}^{ℓ+u} sup_{t_i} [z_i t_i − (√t_i − ε)²_+].

The domain of the conjugate function consists of z ∈ R^{ℓ+u} for which the supremum is finite (i.e., bounded above) (Boyd and Vandenberghe, 2004). Define

f_ε^*(z_i) = sup_{t_i} [z_i t_i − (√t_i − ε)²_+].   (4)

Then, f_ε^*(z) = Σ_{i=1}^{ℓ+u} f_ε^*(z_i). As a pointwise supremum of a family of affine functions, f_ε^*(z_i) is convex. Being a nonnegative weighted sum of convex functions, f_ε^*(z) is also convex. Below we derive the explicit form of f_ε^*(z_i).

Theorem 3 Function f_ε^*(z_i) defined by (4) has the following form

f_ε^*(z_i) = { z_i ε² / (1 − z_i), for 0 < z_i < 1;  0, for z_i ≤ 0. }   (5)

Proof By definition, we have

f_ε^*(z_i) = max{ sup_{0 ≤ t_i ≤ ε²} z_i t_i,  sup_{t_i > ε²} [z_i t_i − (√t_i − ε)²] }.   (6)

The value of sup_{0 ≤ t_i ≤ ε²} z_i t_i is simple to characterize. We now characterize the second term sup_{t_i > ε²} [z_i t_i − (√t_i − ε)²] = sup_{t_i > ε²} (z_i t_i − t_i − ε² + 2ε√t_i). For 0 < z_i < 1, we set the first derivative equal to zero to find the supremum. For z_i ≤ 0 or z_i ≥ 1 the derivative has no zero, and thus we use function values at the end points to find the supremum. As a result, we have

sup_{t_i > ε²} [z_i t_i − (√t_i − ε)²] = { z_i ε² / (1 − z_i), for 0 < z_i < 1;  z_i ε², for z_i ≤ 0;  +∞, for z_i ≥ 1. }

According to (6), and further removing the range where f_ε^*(z_i) is unbounded above, we reach the conjugate given in (5).

Now the Fenchel-Legendre conjugate f_ε^*(z) can be represented by Σ_{i=1}^{ℓ+u} f_ε^*(z_i), which is also well justified by the following theorem.

Theorem 4 (Boyd and Vandenberghe, 2004) If ϕ(u,v) = ϕ_1(u) + ϕ_2(v), where ϕ_1 and ϕ_2 are independent convex functions (independent means they are functions of different variables) with conjugates ϕ_1^* and ϕ_2^*, respectively, then ϕ^*(ω,z) = ϕ_1^*(ω) + ϕ_2^*(z).

A nice property of the conjugate function concerns the conjugate of the conjugate, which is central to the reformulation of our optimization problem. This property is stated by Lemma 5.

Lemma 5 (Rifkin and Lippert, 2007) If function f is closed, convex, and proper, then the conjugate function of its conjugate is f itself, that is, f^{**} = f, where function f is said to be closed if its epigraph is closed, and proper if dom f ≠ ∅ and f > −∞.

It is true that function f_ε(t) is closed and proper. Moreover, we have proved the convexity of f_ε(t) in Theorem 2. Therefore, we can use Lemma 5 to get the following equality

f_ε(t) = sup_z (z^⊤ t − f_ε^*(z)).   (7)

That is,

Σ_{i=1}^{ℓ+u} (√t_i − ε)²_+ = sup_z (z^⊤ t − f_ε^*(z)) = sup_z Σ_{i=1}^{ℓ+u} [z_i t_i − f_ε^*(z_i)].

By (7), we have

Σ_{i=1}^{ℓ+u} (|f_1(x_i) − f_2(x_i)| − ε)²_+ = Σ_{i=1}^{ℓ+u} (√t_i − ε)²_+ = sup_z Σ_{i=1}^{ℓ+u} [z_i t_i − f_ε^*(z_i)].
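Because (4) and (5) are the hinge of the whole reformulation, it is easy to sanity-check the closed form numerically. The sketch below compares the conjugate of Theorem 3 with a brute-force evaluation of the supremum in (4) on a grid; it is an illustration only, and the grid range is an arbitrary choice of ours.

```python
import numpy as np

def conjugate_closed_form(z_i, eps):
    """f_eps^*(z_i) from (5): z_i * eps^2 / (1 - z_i) for 0 < z_i < 1, 0 for z_i <= 0."""
    if z_i >= 1.0:
        return np.inf          # unbounded above, outside the conjugate's domain
    return z_i * eps ** 2 / (1.0 - z_i) if z_i > 0.0 else 0.0

def conjugate_by_grid(z_i, eps, t_max=100.0, n=200001):
    """Brute-force sup_{t_i >= 0} [z_i t_i - (sqrt(t_i) - eps)_+^2] from (4)."""
    t = np.linspace(0.0, t_max, n)
    vals = z_i * t - np.maximum(np.sqrt(t) - eps, 0.0) ** 2
    return vals.max()

if __name__ == "__main__":
    eps = 0.1
    for z_i in (-0.5, 0.0, 0.3, 0.7, 0.95):
        print(z_i, conjugate_closed_form(z_i, eps), conjugate_by_grid(z_i, eps))
```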

Therefore, the objective function for sparse multi-view SVMs becomes

min_{f_1∈H_1, f_2∈H_2} (1/(2ℓ)) Σ_{i=1}^{ℓ} [(1 − y_i f_1(x_i))_+ + (1 − y_i f_2(x_i))_+] + γ_n (‖f_1‖² + ‖f_2‖²) + γ_v sup_z Σ_{i=1}^{ℓ+u} {z_i [f_1(x_i) − f_2(x_i)]² − f_ε^*(z_i)}.   (8)

As an application of Theorem 1, the solution to problem (8) has the following form

f_1(x) = Σ_{i=1}^{ℓ+u} α_1^i k_1(x_i, x),  f_2(x) = Σ_{i=1}^{ℓ+u} α_2^i k_2(x_i, x).   (9)

Applying the reproducing properties of kernels, we get ‖f_1‖² = α_1^⊤ K_1 α_1 and ‖f_2‖² = α_2^⊤ K_2 α_2, where K_1 and K_2 are the (ℓ+u)×(ℓ+u) Gram matrices from the two views V_1 and V_2, respectively, and α_1 = (α_1^1,...,α_1^{ℓ+u})^⊤, α_2 = (α_2^1,...,α_2^{ℓ+u})^⊤. Moreover, we have f_1 = K_1 α_1, f_2 = K_2 α_2, with f_1 := (f_1(x_1),...,f_1(x_{ℓ+u}))^⊤ and f_2 := (f_2(x_1),...,f_2(x_{ℓ+u}))^⊤.

Define the diagonal matrix U = diag(z_1,...,z_{ℓ+u}) with every element taking values in the range [0,1). Problem (8) can be reformulated as

min_{α_1,α_2,ξ_1,ξ_2,b_1,b_2} sup_z (1/(2ℓ)) Σ_{i=1}^{ℓ} (ξ_1^i + ξ_2^i) + γ_n (α_1^⊤ K_1 α_1 + α_2^⊤ K_2 α_2) + γ_v [(K_1 α_1 − K_2 α_2)^⊤ U (K_1 α_1 − K_2 α_2) − Σ_{i=1}^{ℓ+u} f_ε^*(z_i)]
s.t. y_i (Σ_{j=1}^{ℓ+u} α_1^j k_1(x_j, x_i) + b_1) ≥ 1 − ξ_1^i,
y_i (Σ_{j=1}^{ℓ+u} α_2^j k_2(x_j, x_i) + b_2) ≥ 1 − ξ_2^i,
ξ_1^i, ξ_2^i ≥ 0, i = 1,...,ℓ.   (10)

2.3 Saddle-Point Property

We present a theorem concerning the convexity and concavity of optimization problem (10).

Theorem 6 The objective function in problem (10) is convex with respect to α_1, α_2, ξ_1, ξ_2, b_1, and b_2, and concave with respect to z.

Proof First, we show the convexity. The standard form of this optimization problem is

min_{α_1,α_2,ξ_1,ξ_2,b_1,b_2} sup_z (1/(2ℓ)) Σ_{i=1}^{ℓ} (ξ_1^i + ξ_2^i) + γ_n (α_1^⊤ K_1 α_1 + α_2^⊤ K_2 α_2) + γ_v [(K_1 α_1 − K_2 α_2)^⊤ U (K_1 α_1 − K_2 α_2) − Σ_{i=1}^{ℓ+u} f_ε^*(z_i)]
s.t. −y_i (Σ_{j=1}^{ℓ+u} α_1^j k_1(x_j, x_i) + b_1) + 1 − ξ_1^i ≤ 0,
−y_i (Σ_{j=1}^{ℓ+u} α_2^j k_2(x_j, x_i) + b_2) + 1 − ξ_2^i ≤ 0,
−ξ_1^i ≤ 0, −ξ_2^i ≤ 0, i = 1,...,ℓ.

This problem involves one objective function and three sets of inequality constraint functions (on the left-hand side of each inequality). Clearly, the domain of each objective and constraint function is a convex set. It now suffices to prove the convexity of this problem by assessing the convexity of these functions. As all constraint functions are affine, they are convex. Then, we use the second-order condition, namely positive semidefiniteness of a function's Hessian or second derivative, to judge the convexity of the objective function (Boyd and Vandenberghe, 2004). According to this condition, the first two terms of the objective function are clearly convex. The third part can be rewritten as

(K_1 α_1 − K_2 α_2)^⊤ U (K_1 α_1 − K_2 α_2) = ‖ U^{1/2} (K_1  −K_2) (α_1; α_2) ‖²,

which is the convex function ‖·‖² composed with an affine mapping and thus also convex (Boyd and Vandenberghe, 2004). Being a nonnegative weighted sum of convex functions, the objective function is therefore convex with respect to α_1, α_2, ξ_1, ξ_2, b_1, b_2.

Then, we show the concavity using (8). As f_ε^*(z_i) is convex, z_i [f_1(x_i) − f_2(x_i)]² − f_ε^*(z_i) is concave with respect to z_i. The concavity of Σ_{i=1}^{ℓ+u} {z_i [f_1(x_i) − f_2(x_i)]² − f_ε^*(z_i)} follows from the fact that a nonnegative weighted sum of concave functions is concave (Boyd and Vandenberghe, 2004). Hence, the objective function in problem (10) is concave with respect to z.

Let θ denote the parameters α_1, α_2, ξ_1, ξ_2, b_1, b_2. We can simply denote the above optimization problem as

inf_θ sup_z f(θ, z)   (11)

associated with constraints on the labeled examples, where f(θ, z) is convex with respect to θ and concave with respect to z. We give the following theorem on the equivalence of swapping the infimum and supremum for our optimization problem, and include a proof for completeness.

Theorem 7 (Boyd and Vandenberghe, 2004) If f(θ, z) with domains Θ and Z is convex with respect to θ ∈ Θ and concave with respect to z ∈ Z, the following equality holds under some slight assumptions:

inf_θ sup_z f(θ, z) = sup_z inf_θ f(θ, z).

Proof The idea is first to represent the left-hand side as the value of a convex function, and then to show that the conjugate of its conjugate equals the right-hand side when the same input value is plugged in (Boyd and Vandenberghe, 2004). The left-hand side can be expressed as p(0), where

p(u) = inf_θ sup_z [f(θ, z) + u^⊤ z].

It is not difficult to show that p is a convex function. Being a pointwise supremum of the convex functions f(θ, z) + u^⊤ z, sup_z [f(θ, z) + u^⊤ z] is a convex function of (θ, u). Because sup_z [f(θ, z) + u^⊤ z] is convex with respect to (θ, u), and partial minimization over θ preserves convexity, p(u) is convex.

The Fenchel-Legendre conjugate of p(u) is

p^*(v) = sup_u [u^⊤ v − inf_θ sup_z (f(θ, z) + u^⊤ z)],

which would be +∞ if v ∉ Z. Therefore,

p^*(v) = { −inf_θ f(θ, v), for v ∈ Z;  +∞, otherwise. }

The conjugate of p^*(z) is given by

p^{**}(u) = sup_{z∈Z} (u^⊤ z − p^*(z)) = sup_{z∈Z} (u^⊤ z + inf_θ f(θ, z)) = sup_z inf_θ [f(θ, z) + u^⊤ z].

Suppose 0 ∈ dom p(u) and p(u) is closed and proper. Then by Lemma 5 we have p(0) = p^{**}(0), which completes the proof.

Now we give a theorem showing that the optimal pair (θ̃, z̃) is a saddle point.

Theorem 8 If the following equality holds for function f(θ, z),

inf_θ sup_z f(θ, z) = sup_z inf_θ f(θ, z) = f(θ̃, z̃),

then the optimal pair (θ̃, z̃) forms a saddle point.

Proof From the given equality, we have

inf_θ sup_z f(θ, z) = sup_z f(θ̃, z) = f(θ̃, z̃)

and

sup_z inf_θ f(θ, z) = inf_θ f(θ, z̃) = f(θ̃, z̃).

Therefore,

f(θ̃, z) ≤ f(θ̃, z̃) ≤ f(θ, z̃),

which indeed satisfies the definition of a saddle point. The proof is completed.

2.4 Iterative Optimization Algorithm

To solve the optimization problem sup_z inf_θ f(θ, z), which is respectively concave and convex with respect to z and θ, we give an algorithm whose convergence is guaranteed by the following theorem.

Theorem 9 Given an initial value z_0 for z, solve inf_θ f(θ, z_0) and obtain the globally optimal point θ_0. Then find argmax_z f(θ_0, z) to get z_1, from which we can get θ_1 as a result of optimizing inf_θ f(θ, z_1). Repeat this process until a convergence point (θ̂, ẑ) is reached. Suppose (θ̃, z̃) is a saddle point. We have f(θ̂, ẑ) = f(θ̃, z̃). That is, we obtain the optimal value of the objective function. If f is strictly concave and strictly convex with respect to the corresponding variables, we further have θ̂ = θ̃ and ẑ = z̃.

Proof According to the properties of function f and the algorithm procedure, we know that the convergence point is a saddle point. Thus, we have

f(θ̃, ẑ) ≥ f(θ̂, ẑ) ≥ f(θ̂, z̃).

By the saddle-point property of (θ̃, z̃), we have f(θ̃, z̃) ≥ f(θ̃, ẑ) and f(θ̃, z̃) ≤ f(θ̂, z̃). Therefore, the above inequalities must hold with equality, and we have f(θ̃, z̃) = f(θ̂, ẑ). Furthermore, if f is strictly concave and strictly convex with respect to the corresponding variables, it is true that θ̂ = θ̃ and ẑ = z̃.

On solving argmax_z f(θ, z) required in Theorem 9, we can maximize the term related to z, namely z^⊤ t − f_ε^*(z) = Σ_{i=1}^{ℓ+u} [z_i t_i − f_ε^*(z_i)]. For this purpose, we have the following theorem.

Theorem 10

sup_{z_i ∈ dom f_ε^*(z_i)} [z_i t_i − f_ε^*(z_i)] = (√t_i − ε)²_+,

and

arg sup_{z_i ∈ dom f_ε^*(z_i)} [z_i t_i − f_ε^*(z_i)] = { 1 − ε/√t_i, for t_i > ε²;  0, for 0 ≤ t_i ≤ ε². }

Without loss of generality, we can confine the range of z_i to [0,1).

Proof We have

sup_{z_i ∈ dom f_ε^*(z_i)} [z_i t_i − f_ε^*(z_i)] = max{ sup_{0 < z_i < 1} [z_i t_i − z_i ε²/(1 − z_i)],  sup_{z_i ≤ 0} z_i t_i }.

The first supremum can be found by setting the derivative with respect to z_i to zero. We have

sup_{0 < z_i < 1} [z_i t_i − z_i ε²/(1 − z_i)] = (√t_i − ε)²

where t_i > ε², and the supremum is attained at z_i = 1 − ε/√t_i. When t_i < 0, sup_{z_i ≤ 0} z_i t_i is unbounded above. When 0 ≤ t_i ≤ ε², sup_{z_i ≤ 0} z_i t_i = 0, with the supremum attained at z_i = 0. Therefore, max{ sup_{0 < z_i < 1} [z_i t_i − z_i ε²/(1 − z_i)], sup_{z_i ≤ 0} z_i t_i } = (√t_i − ε)²_+, with the supremum attained at some z_i ∈ [0,1), which completes the proof.

For sparsity pursuit, during each iteration we remove those unlabeled examples whose corresponding z_i's are zero. By the representer theorem, this does not influence the value of the objective function. For Theorem 9, this means that the elements of z whose values are zero in the last iteration will remain zero in the next iteration. When there are no unlabeled examples eligible for elimination, the iteration terminates and the convergence point (θ̂, ẑ) is reached.
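The z-update of Theorem 10 and the pruning rule just described have a direct vectorized form. The sketch below is our own illustration of that one step; the array layout, with the ℓ labeled points first, is an assumption of the sketch.

```python
import numpy as np

def update_z(f1_vals, f2_vals, eps):
    """Maximizer from Theorem 10: with t_i = (f1(x_i) - f2(x_i))^2,
    z_i = 1 - eps/sqrt(t_i) when t_i > eps^2 and z_i = 0 otherwise."""
    t = (f1_vals - f2_vals) ** 2
    z = np.zeros_like(t)
    mask = t > eps ** 2
    z[mask] = 1.0 - eps / np.sqrt(t[mask])
    return z

def keep_indices(z, n_labeled):
    """Indices retained after pruning: all labeled points plus the unlabeled
    points whose z_i is nonzero (those with z_i = 0 are removed)."""
    kept_unlabeled = np.nonzero(z[n_labeled:] > 0.0)[0] + n_labeled
    return np.concatenate([np.arange(n_labeled), kept_unlabeled])
```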

3. Dual Optimization

According to the iterative optimization algorithm, when optimizing problem (10) we start from an initial value z_0 and then solve for θ_0. In this section, we show how to solve this subroutine with fixed z. The optimization problem is now equivalent to

min_{α_1,α_2,ξ_1,ξ_2,b_1,b_2} F_0 = (1/(2ℓ)) Σ_{i=1}^{ℓ} (ξ_1^i + ξ_2^i) + γ_n (α_1^⊤ K_1 α_1 + α_2^⊤ K_2 α_2) + γ_v (K_1 α_1 − K_2 α_2)^⊤ U (K_1 α_1 − K_2 α_2)
s.t. y_i (Σ_{j=1}^{ℓ+u} α_1^j k_1(x_j, x_i) + b_1) ≥ 1 − ξ_1^i,
y_i (Σ_{j=1}^{ℓ+u} α_2^j k_2(x_j, x_i) + b_2) ≥ 1 − ξ_2^i,
ξ_1^i, ξ_2^i ≥ 0, i = 1,...,ℓ.   (12)

3.1 Lagrange Dual Function

We will solve problem (12) through its dual problem, which is simpler to solve. We now derive its Lagrange dual function. Let λ_1^i, λ_2^i, ν_1^i, ν_2^i ≥ 0 (i = 1,...,ℓ) be the Lagrange multipliers associated with the inequality constraints. Define λ_j = [λ_j^1,...,λ_j^ℓ]^⊤ and ν_j = [ν_j^1,...,ν_j^ℓ]^⊤ (j = 1,2). The Lagrangian L(α_1,α_2,ξ_1,ξ_2,b_1,b_2,λ_1,λ_2,ν_1,ν_2) can be written as

L = F_0 − Σ_{i=1}^{ℓ} [λ_1^i (y_i (Σ_{j=1}^{ℓ+u} α_1^j k_1(x_j, x_i) + b_1) − 1 + ξ_1^i) + λ_2^i (y_i (Σ_{j=1}^{ℓ+u} α_2^j k_2(x_j, x_i) + b_2) − 1 + ξ_2^i) + ν_1^i ξ_1^i + ν_2^i ξ_2^i].

Note that

(K_1 α_1 − K_2 α_2)^⊤ U (K_1 α_1 − K_2 α_2) = α_1^⊤ K_1 U K_1 α_1 − 2 α_1^⊤ K_1 U K_2 α_2 + α_2^⊤ K_2 U K_2 α_2.

To obtain the Lagrange dual function, L has to be minimized with respect to the primal variables α_1, α_2, ξ_1, ξ_2, b_1, b_2. To eliminate these variables, we compute the corresponding partial derivatives and set them to zero, obtaining the following conditions

2 J_1 α_1 − 2 γ_v K_1 U K_2 α_2 = Λ_1,   (13)
2 J_2 α_2 − 2 γ_v K_2 U K_1 α_1 = Λ_2,   (14)
λ_1^i + ν_1^i = 1/(2ℓ),   (15)
λ_2^i + ν_2^i = 1/(2ℓ),   (16)
Σ_{i=1}^{ℓ} λ_1^i y_i = 0,  Σ_{i=1}^{ℓ} λ_2^i y_i = 0,   (17)

where

J_1 := γ_n K_1 + γ_v K_1 U K_1,  J_2 := γ_n K_2 + γ_v K_2 U K_2,
Λ_1 := Σ_{i=1}^{ℓ} λ_1^i y_i K_1(:,i),  Λ_2 := Σ_{i=1}^{ℓ} λ_2^i y_i K_2(:,i),

with K_1(:,i) and K_2(:,i) being the ith columns of the corresponding Gram matrices. Substituting (13)–(17) into L results in the following expression for the Lagrange dual function g_L(λ_1,λ_2,ν_1,ν_2):

g_L = γ_n (α_1^⊤ K_1 α_1 + α_2^⊤ K_2 α_2) + γ_v (α_1^⊤ K_1 U K_1 α_1 − 2 α_1^⊤ K_1 U K_2 α_2 + α_2^⊤ K_2 U K_2 α_2) − α_1^⊤ Λ_1 − α_2^⊤ Λ_2 + Σ_{i=1}^{ℓ} (λ_1^i + λ_2^i)
= (1/2) α_1^⊤ Λ_1 + (1/2) α_2^⊤ Λ_2 − α_1^⊤ Λ_1 − α_2^⊤ Λ_2 + Σ_{i=1}^{ℓ} (λ_1^i + λ_2^i)
= −(1/2) α_1^⊤ Λ_1 − (1/2) α_2^⊤ Λ_2 + Σ_{i=1}^{ℓ} (λ_1^i + λ_2^i).   (18)

We obtain the following from (13) and (14):

α_1 = (1/2) J_1^{−1} (Λ_1 + 2 γ_v K_1 U K_2 α_2),   (19)
α_2 = (1/2) J_2^{−1} (Λ_2 + 2 γ_v K_2 U K_1 α_1).   (20)

From (13) and (20), we have

(2 J_1 − 2 γ_v² K_1 U K_2 J_2^{−1} K_2 U K_1) α_1 = Λ_1 + γ_v K_1 U K_2 J_2^{−1} Λ_2.

Define M_1 = 2 J_1 − 2 γ_v² K_1 U K_2 J_2^{−1} K_2 U K_1. Suppose the above linear system is well-posed (if ill-posed, we can employ approximate numerical analysis techniques). We get

α_1 = M_1^{−1} (Λ_1 + γ_v K_1 U K_2 J_2^{−1} Λ_2).

From (14) and (19), we have

(2 J_2 − 2 γ_v² K_2 U K_1 J_1^{−1} K_1 U K_2) α_2 = Λ_2 + γ_v K_2 U K_1 J_1^{−1} Λ_1.

Define M_2 = 2 J_2 − 2 γ_v² K_2 U K_1 J_1^{−1} K_1 U K_2. Thus we get

α_2 = M_2^{−1} (Λ_2 + γ_v K_2 U K_1 J_1^{−1} Λ_1).

Now, with α_1 and α_2 substituted into (18), the Lagrange dual function g_L(λ_1,λ_2,ν_1,ν_2) is

g_L = inf_{α_1,α_2,ξ_1,ξ_2,b_1,b_2} L = −(1/2) α_1^⊤ Λ_1 − (1/2) α_2^⊤ Λ_2 + Σ_{i=1}^{ℓ} (λ_1^i + λ_2^i)
= −(1/2) (Λ_1 + γ_v K_1 U K_2 J_2^{−1} Λ_2)^⊤ M_1^{−1} Λ_1 − (1/2) (Λ_2 + γ_v K_2 U K_1 J_1^{−1} Λ_1)^⊤ M_2^{−1} Λ_2 + Σ_{i=1}^{ℓ} (λ_1^i + λ_2^i).

3.2 Solving the Dual Problem

The Lagrange dual problem is given by

max_{λ_1,λ_2} g_L
s.t. 0 ≤ λ_1^i ≤ 1/(2ℓ), 0 ≤ λ_2^i ≤ 1/(2ℓ), i = 1,...,ℓ,
Σ_{i=1}^{ℓ} λ_1^i y_i = 0, Σ_{i=1}^{ℓ} λ_2^i y_i = 0.

As Lagrange dual functions are always concave (Boyd and Vandenberghe, 2004), we can formulate the above problem as the convex optimization problem

min_{λ_1,λ_2} −g_L
s.t. 0 ≤ λ_1^i ≤ 1/(2ℓ), 0 ≤ λ_2^i ≤ 1/(2ℓ), i = 1,...,ℓ,
Σ_{i=1}^{ℓ} λ_1^i y_i = 0, Σ_{i=1}^{ℓ} λ_2^i y_i = 0.   (21)

Define the matrix Y = diag(y_1,...,y_ℓ). Then Λ_1 = K_{1ℓ} Y λ_1 and Λ_2 = K_{2ℓ} Y λ_2 with K_{1ℓ} = K_1(:,1:ℓ) and K_{2ℓ} = K_2(:,1:ℓ). We have

−g_L = (1/2) (Λ_1 + γ_v K_1 U K_2 J_2^{−1} Λ_2)^⊤ M_1^{−1} Λ_1 + (1/2) (Λ_2 + γ_v K_2 U K_1 J_1^{−1} Λ_1)^⊤ M_2^{−1} Λ_2 − Σ_{i=1}^{ℓ} (λ_1^i + λ_2^i)
= (1/2) (λ_1^⊤ λ_2^⊤) ( A  B ; C  D ) (λ_1; λ_2) − 1^⊤ (λ_1 + λ_2),

where

A := Y K_{1ℓ}^⊤ M_1^{−1} K_{1ℓ} Y,
B := γ_v Y K_{1ℓ}^⊤ J_1^{−1} K_1 U K_2 M_2^{−1} K_{2ℓ} Y,
C := γ_v Y K_{2ℓ}^⊤ J_2^{−1} K_2 U K_1 M_1^{−1} K_{1ℓ} Y,
D := Y K_{2ℓ}^⊤ M_2^{−1} K_{2ℓ} Y,

and 1 = (1,...,1)^⊤ is the ℓ×1 all-ones vector. Substituting M_1 and M_2 into the expressions for B and C, we can prove that B = C^⊤. In addition, because of the convexity of −g_L, we affirm that the matrix ( A  B ; C  D ) is positive semi-definite. Hence, the optimization problem in (21) can be rewritten as

min_{λ_1,λ_2} (1/2) (λ_1^⊤ λ_2^⊤) ( A  B ; C  D ) (λ_1; λ_2) − 1^⊤ (λ_1 + λ_2)
s.t. 0 ≤ λ_1 ≤ 1/(2ℓ), 0 ≤ λ_2 ≤ 1/(2ℓ),
λ_1^⊤ y = 0, λ_2^⊤ y = 0,

where y = (y_1,...,y_ℓ)^⊤. After solving this problem using standard software, we obtain ν_1^i and ν_2^i by (15) and (16). We now state the advantages of optimizing this dual problem over optimizing the primal problem (12): it has fewer optimization variables, since for typical semi-supervised learning u ≫ ℓ, and it has simpler constraint functions.

The bias terms b_1 and b_2 can be obtained through support vectors. Due to the KKT conditions, the following equalities hold

λ_1^i (y_i (Σ_{j=1}^{ℓ+u} α_1^j k_1(x_j, x_i) + b_1) − 1 + ξ_1^i) = 0,
λ_2^i (y_i (Σ_{j=1}^{ℓ+u} α_2^j k_2(x_j, x_i) + b_2) − 1 + ξ_2^i) = 0,
ν_1^i ξ_1^i = 0,  ν_2^i ξ_2^i = 0,  i = 1,...,ℓ.

For support vectors x_i, we have ν_j^i > 0 (and thus ξ_j^i = 0) and λ_j^i > 0 (j = 1,2). Therefore, we can resolve the bias terms by averaging y_i (Σ_{j=1}^{ℓ+u} α_1^j k_1(x_j, x_i) + b_1) − 1 = 0 and y_i (Σ_{j=1}^{ℓ+u} α_2^j k_2(x_j, x_i) + b_2) − 1 = 0 over all support vectors.
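Once the QP in (21) has been solved for λ_1 and λ_2 (by any off-the-shelf QP solver), the expansion coefficients follow from the linear systems of Section 3.1. The sketch below is a plain NumPy illustration of that recovery step; it assumes J_1, J_2, M_1, M_2 are well conditioned and that λ_1, λ_2 are supplied by the solver.

```python
import numpy as np

def recover_alphas(K1, K2, U, y, lam1, lam2, gamma_n, gamma_v):
    """Recover alpha_1, alpha_2 from the dual variables via (19)-(20) and the
    M_1, M_2 systems of Section 3.1 (a sketch; assumes well-posed systems)."""
    ell = len(y)
    Lam1 = K1[:, :ell] @ (y * lam1)          # Lambda_1 = sum_i lam1_i y_i K1(:, i)
    Lam2 = K2[:, :ell] @ (y * lam2)
    J1 = gamma_n * K1 + gamma_v * K1 @ U @ K1
    J2 = gamma_n * K2 + gamma_v * K2 @ U @ K2
    C12 = K1 @ U @ K2                        # K1 U K2; its transpose equals K2 U K1
    M1 = 2.0 * J1 - 2.0 * gamma_v ** 2 * C12 @ np.linalg.solve(J2, C12.T)
    M2 = 2.0 * J2 - 2.0 * gamma_v ** 2 * C12.T @ np.linalg.solve(J1, C12)
    alpha1 = np.linalg.solve(M1, Lam1 + gamma_v * C12 @ np.linalg.solve(J2, Lam2))
    alpha2 = np.linalg.solve(M2, Lam2 + gamma_v * C12.T @ np.linalg.solve(J1, Lam1))
    return alpha1, alpha2
```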

3.3 Advantages of Using Conjugate Functions

In this subsection, we show that the direct optimization of problem (1) without the use of conjugate functions is large-scale and time-consuming, which justifies the advantages of using conjugate functions. The primal problem can be rewritten as

min_{α_1,α_2,ξ_1,ξ_2,b_1,b_2,δ} D_0 = (1/(2ℓ)) Σ_{i=1}^{ℓ} (ξ_1^i + ξ_2^i) + γ_n (α_1^⊤ K_1 α_1 + α_2^⊤ K_2 α_2) + γ_v Σ_{i=1}^{ℓ+u} δ_i²
s.t. y_i (Σ_{j=1}^{ℓ+u} α_1^j k_1(x_j, x_i) + b_1) ≥ 1 − ξ_1^i, i = 1,...,ℓ,
y_i (Σ_{j=1}^{ℓ+u} α_2^j k_2(x_j, x_i) + b_2) ≥ 1 − ξ_2^i, i = 1,...,ℓ,
ξ_1^i, ξ_2^i ≥ 0, i = 1,...,ℓ,
(Σ_{j=1}^{ℓ+u} α_1^j k_1(x_j, x_i) + b_1) − (Σ_{j=1}^{ℓ+u} α_2^j k_2(x_j, x_i) + b_2) ≥ −δ_i − ε, i = 1,...,ℓ+u,
(Σ_{j=1}^{ℓ+u} α_1^j k_1(x_j, x_i) + b_1) − (Σ_{j=1}^{ℓ+u} α_2^j k_2(x_j, x_i) + b_2) ≤ δ_i + ε, i = 1,...,ℓ+u,   (22)

where y_i ∈ {+1,−1} and γ_n, γ_v ≥ 0. We will solve problem (22) through its dual problem, which can be simpler to solve. Suppose λ_1^i, λ_2^i, ν_1^i, ν_2^i ≥ 0 (i = 1,...,ℓ) and μ_1^i, μ_2^i ≥ 0 (i = 1,...,ℓ+u) are the Lagrange multipliers associated with the inequality constraints of problem (22). Define δ = [δ_1,...,δ_{ℓ+u}]^⊤, λ_j = [λ_j^1,...,λ_j^ℓ]^⊤, ν_j = [ν_j^1,...,ν_j^ℓ]^⊤, and μ_j = [μ_j^1,...,μ_j^{ℓ+u}]^⊤ (j = 1,2). The Lagrangian L(α_1,α_2,ξ_1,ξ_2,b_1,b_2,δ,λ_1,λ_2,ν_1,ν_2,μ_1,μ_2) can be written as

L = D_0 − Σ_{i=1}^{ℓ} [λ_1^i (y_i (Σ_{j=1}^{ℓ+u} α_1^j k_1(x_j, x_i) + b_1) − 1 + ξ_1^i) + λ_2^i (y_i (Σ_{j=1}^{ℓ+u} α_2^j k_2(x_j, x_i) + b_2) − 1 + ξ_2^i) + ν_1^i ξ_1^i + ν_2^i ξ_2^i]
− Σ_{i=1}^{ℓ+u} μ_1^i [Σ_{j=1}^{ℓ+u} α_1^j k_1(x_j, x_i) + b_1 − Σ_{j=1}^{ℓ+u} α_2^j k_2(x_j, x_i) − b_2 + δ_i + ε]
+ Σ_{i=1}^{ℓ+u} μ_2^i [Σ_{j=1}^{ℓ+u} α_1^j k_1(x_j, x_i) + b_1 − Σ_{j=1}^{ℓ+u} α_2^j k_2(x_j, x_i) − b_2 − δ_i − ε].

To obtain the Lagrange dual function, L has to be minimized with respect to the primal variables α_1, α_2, ξ_1, ξ_2, b_1, b_2, δ. To eliminate these variables, we compute the corresponding partial derivatives and set them to zero, obtaining the following conditions

2 γ_n K_1 α_1 = Σ_{i=1}^{ℓ} λ_1^i y_i K_1(:,i) + Σ_{i=1}^{ℓ+u} (μ_1^i − μ_2^i) K_1(:,i),
2 γ_n K_2 α_2 = Σ_{i=1}^{ℓ} λ_2^i y_i K_2(:,i) − Σ_{i=1}^{ℓ+u} (μ_1^i − μ_2^i) K_2(:,i),
λ_1^i + ν_1^i = 1/(2ℓ), i = 1,...,ℓ,
λ_2^i + ν_2^i = 1/(2ℓ), i = 1,...,ℓ,
Σ_{i=1}^{ℓ} λ_1^i y_i + Σ_{i=1}^{ℓ+u} (μ_1^i − μ_2^i) = 0,
Σ_{i=1}^{ℓ} λ_2^i y_i − Σ_{i=1}^{ℓ+u} (μ_1^i − μ_2^i) = 0,
2 γ_v δ_i − μ_1^i − μ_2^i = 0, i = 1,...,ℓ+u.

Substituting these equations into the Lagrangian, as was done in Section 3.1, it is clear that L finally becomes a quadratic function involving λ_1, λ_2, μ_1, μ_2. The dual optimization problem would therefore be a quadratic program involving 2ℓ + 2(ℓ+u) parameters. We now see that this direct optimization is indeed large-scale and time-consuming.

4. Generalization Error

In this section, we analyze the generalization performance of the sparse multi-view SVMs, making use of Rademacher complexity theory and the margin bound.

4.1 Rademacher Complexity Theory

Important background on Rademacher complexity theory (Bartlett and Mendelson, 2002; Shawe-Taylor and Cristianini, 2004) is introduced below.

Definition 11 For a sample S = {x_1,...,x_ℓ} generated by a distribution D on a set X and a real-valued function class F with domain X, the empirical Rademacher complexity of F is the random variable

R̂_ℓ(F) = E_σ [ sup_{f∈F} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i f(x_i) |  |  x_1,...,x_ℓ ],

where σ = {σ_1,...,σ_ℓ} are independent uniform {±1}-valued (Rademacher) random variables. The Rademacher complexity of F is

R_ℓ(F) = E_S [R̂_ℓ(F)] = E_{Sσ} [ sup_{f∈F} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i f(x_i) | ].

Lemma 12 Fix δ ∈ (0,1) and let F be a class of functions mapping from an input space X̃ (X̃ = X × Y or X̃ = X) to [0,1]. Let (x̃_i)_{i=1}^{ℓ} be drawn independently according to a probability distribution D. Then with probability at least 1 − δ over random draws of samples of size ℓ, every f ∈ F satisfies

E_D [f(x̃)] ≤ Ê [f(x̃)] + R_ℓ(F) + √(ln(2/δ)/(2ℓ)) ≤ Ê [f(x̃)] + R̂_ℓ(F) + 3 √(ln(2/δ)/(2ℓ)),

where Ê[f(x̃)] is the empirical average of f over the ℓ examples.

4.2 Margin Bound for Sparse Multi-view SVMs

By (2), the prediction function of the sparse multi-view SVMs is derived from the average of the predictions from the two views. Define the soft prediction function as g(x) = (1/2)(f_1(x) + f_2(x)). We obtain the following margin bound on the generalization error of sparse multi-view SVMs. This bound is widely applicable to multi-view SVMs; for example, Szedmak and Shawe-Taylor (2007) independently provided a similar bound for the SVM-2K method.

Theorem 13 Fix δ ∈ (0,1) and let F be the class of functions mapping from X̃ = X × Y to R given by f̃(x,y) = −y g(x), where g = (1/2)(f_1 + f_2) ∈ G and f̃ ∈ F. Let S = {(x_1,y_1),...,(x_ℓ,y_ℓ)} be drawn independently according to a probability distribution D. Then with probability at least 1 − δ over samples of size ℓ, every g ∈ G satisfies

P_D (y ≠ sgn(g(x))) ≤ (1/(2ℓ)) Σ_{i=1}^{ℓ} (ξ_1^i + ξ_2^i) + 2 R̂_ℓ(G) + 3 √(ln(2/δ)/(2ℓ)),

where ξ_1^i := (1 − y_i f_1(x_i))_+ and ξ_2^i := (1 − y_i f_2(x_i))_+. The quantities y_i f_1(x_i) and y_i f_2(x_i) are called margins.

Proof Let H(·) be the Heaviside function that returns 1 if its argument is greater than 0 and zero otherwise. We have

P_D (y ≠ sgn(g(x))) = E_D [H(−y g(x))].   (23)

Consider a loss function A: R → [0,1], given by

A(a) = { 1, if a ≥ 0;  1 + a, if −1 ≤ a ≤ 0;  0, otherwise. }

By Lemma 12, and since function A dominates H, we have (Shawe-Taylor and Cristianini, 2004)

E_D [H(f̃(x,y)) − 1] ≤ E_D [A(f̃(x,y)) − 1] ≤ Ê [A(f̃(x,y)) − 1] + R̂_ℓ((A − 1) ∘ F) + 3 √(ln(2/δ)/(2ℓ)).

Therefore,

E_D [H(f̃(x,y))] ≤ Ê [A(f̃(x,y))] + R̂_ℓ((A − 1) ∘ F) + 3 √(ln(2/δ)/(2ℓ)).

In addition, we have

Ê [A(f̃(x,y))] ≤ (1/ℓ) Σ_{i=1}^{ℓ} (1 − y_i g(x_i))_+
= (1/ℓ) Σ_{i=1}^{ℓ} (1 − (1/2) y_i f_1(x_i) − (1/2) y_i f_2(x_i))_+
≤ (1/(2ℓ)) Σ_{i=1}^{ℓ} [(1 − y_i f_1(x_i))_+ + (1 − y_i f_2(x_i))_+]
= (1/(2ℓ)) Σ_{i=1}^{ℓ} (ξ_1^i + ξ_2^i),

where ξ_1^i denotes the amount by which function f_1 fails to achieve margin 1 for (x_i, y_i), and ξ_2^i applies similarly to function f_2. Since (A − 1)(0) = 0, we can apply the Lipschitz condition (Bartlett and Mendelson, 2002) of function (A − 1) to get

R̂_ℓ((A − 1) ∘ F) ≤ 2 R̂_ℓ(F).

It remains to bound the empirical Rademacher complexity of the class F. With y_i ∈ {+1,−1}, we have

R̂_ℓ(F) = E_σ [ sup_{f̃∈F} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i f̃(x_i, y_i) | ]
= E_σ [ sup_{g∈G} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i y_i g(x_i) | ]
= E_σ [ sup_{g∈G} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i g(x_i) | ] = R̂_ℓ(G).   (24)

Finally, combining (23)–(24) completes the proof.
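Evaluating the right-hand side of Theorem 13 only requires the labeled slacks, an estimate of R̂_ℓ(G), and the confidence level. A small sketch of that computation (our own helper function, with δ supplied by the user):

```python
import numpy as np

def margin_bound_rhs(xi1, xi2, rademacher_hat, delta=0.05):
    """Right-hand side of the bound in Theorem 13:
    (1/(2l)) sum_i (xi1_i + xi2_i) + 2 * hat{R}_l(G) + 3 sqrt(ln(2/delta) / (2l))."""
    ell = len(xi1)
    slack_term = float(np.sum(xi1) + np.sum(xi2)) / (2.0 * ell)
    confidence_term = 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * ell))
    return slack_term + 2.0 * rademacher_hat + confidence_term
```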

5. Empirical Rademacher Complexity

Our optimization algorithm iteratively updates z to solve for θ. In this section, we first derive the empirical Rademacher complexity R̂_ℓ(G) for the function class induced after one iteration with an initially fixed z, and then give its formulation applicable for any number of subsequent iterations, including the termination case. This Rademacher complexity is crucial for Theorem 13 when analyzing the performance of the corresponding classifiers obtained by our iterative optimization algorithm. Specifically, for the empirical Rademacher complexity we give the following theorem.

Theorem 14 Suppose

S = (1/γ_n) (K_{1ℓ} K_1^{−1} K_{1ℓ}^⊤ + K_{2ℓ} K_2^{−1} K_{2ℓ}^⊤),
Θ = (1/γ_n) U_u^{1/2} (K_{1u} K_1^{−1} K_{1u}^⊤ + K_{2u} K_2^{−1} K_{2u}^⊤) U_u^{1/2},
J = (1/γ_n) U_u^{1/2} (K_{1u} K_1^{−1} K_{1ℓ}^⊤ − K_{2u} K_2^{−1} K_{2ℓ}^⊤),

where K_{1ℓ} and K_{2ℓ} are respectively the first ℓ rows of the Gram matrices K_1 and K_2, K_{1u} and K_{2u} are respectively the last u rows of K_1 and K_2, and U_u is the diagonal matrix including the last u diagonal elements (the initially fixed z_{ℓ+1},...,z_{ℓ+u}) of U. Then the empirical Rademacher complexity R̂_ℓ(G) is bounded as

U/(√2 ℓ) ≤ R̂_ℓ(G) ≤ U/ℓ,

where U² = tr(S) − γ_v tr(J^⊤ (I + γ_v Θ)^{−1} J) for the first iteration of sparse multi-view SVMs, and U² = tr(S) for subsequent iterations.

The remainder of this section before Section 5.4 completes the proof of this theorem, which was partially inspired by Rosenberg and Bartlett (2007) in their analysis of co-regularized least squares. We use problem (8) to reason about R̂_ℓ(G). As a result of the fixed z, we can remove f_ε^*(z_i) without loss of generality when resolving f_1 and f_2. It is true that the loss function L̂: H_1 × H_2 → [0,∞) with

L̂ := (1/(2ℓ)) Σ_{i=1}^{ℓ} [(1 − y_i f_1(x_i))_+ + (1 − y_i f_2(x_i))_+]

satisfies L̂(0,0) = 1. Let Q(f_1, f_2) denote the objective function in (8) with f_ε^*(z_i) removed. Substituting in the trivial predictors f_1 ≡ 0 and f_2 ≡ 0 gives the upper bound

min_{(f_1,f_2)∈H_1×H_2} Q(f_1, f_2) ≤ Q(0,0) = L̂(0,0) = 1.

Since all terms of Q(f_1, f_2) are nonnegative, we conclude that any (f_1, f_2) minimizing Q(f_1, f_2) is contained in

Ĥ = {(f_1, f_2) : γ_n (‖f_1‖² + ‖f_2‖²) + γ_v Σ_{i=ℓ+1}^{ℓ+u} z_i [f_1(x_i) − f_2(x_i)]² ≤ 1}.   (25)

Therefore, the final predictor is chosen from the function class

G = {x ↦ (1/2)[f_1(x) + f_2(x)] : (f_1, f_2) ∈ Ĥ}.

The complexity R̂_ℓ(G) is

R̂_ℓ(G) = E_σ [ sup_{(f_1,f_2)∈Ĥ} | (1/ℓ) Σ_{i=1}^{ℓ} σ_i (f_1(x_i) + f_2(x_i)) | ].   (26)

As it only depends on the values of the functions f_1(·) and f_2(·) on the ℓ labeled examples, by the reproducing kernel property, which says that the projection of a function f onto a closed subspace containing k(x,·) has the same value at x as f itself does (Rosenberg and Bartlett, 2007), we can restrict the function class Ĥ to the span of the labeled and unlabeled data and thus write it as

Ĥ = {(f_1, f_2) : γ_n (α_1^⊤ K_1 α_1 + α_2^⊤ K_2 α_2) + γ_v (K_{1u} α_1 − K_{2u} α_2)^⊤ U_u (K_{1u} α_1 − K_{2u} α_2) ≤ 1}
= {(f_1, f_2) : (α_1^⊤ α_2^⊤) N (α_1; α_2) ≤ 1},

where K_{1u} and K_{2u} are respectively the last u rows of K_1 and K_2, U_u is the diagonal matrix including the last u diagonal elements of U, and

N := γ_n ( K_1  0 ; 0  K_2 ) + γ_v ( K_{1u}^⊤ ; −K_{2u}^⊤ ) U_u ( K_{1u}  −K_{2u} ).   (27)

5.1 Evaluating the Supremum in Euclidean Space

Since (f_1, f_2) ∈ Ĥ implies (−f_1, −f_2) ∈ Ĥ, we can drop the absolute value in (26). Now we can write

R̂_ℓ(G) = (1/ℓ) E_σ sup_{α_1,α_2 ∈ R^{ℓ+u}} { σ^⊤ K_{1ℓ} α_1 + σ^⊤ K_{2ℓ} α_2 : (α_1^⊤ α_2^⊤) N (α_1; α_2) ≤ 1 }
= (1/ℓ) E_σ sup_{α_1,α_2 ∈ R^{ℓ+u}} { σ^⊤ (K_{1ℓ}  K_{2ℓ}) (α_1; α_2) : (α_1^⊤ α_2^⊤) N (α_1; α_2) ≤ 1 },   (28)

where K_{1ℓ} and K_{2ℓ} represent the first ℓ rows of the Gram matrices K_1 and K_2, respectively. For a symmetric positive definite matrix M, it is simple to show that (Rosenberg and Bartlett, 2007)

sup_{α: α^⊤ M α ≤ 1} v^⊤ α = ‖M^{−1/2} v‖.

Without loss of generality, suppose the positive semi-definite matrix N in (28) is positive definite and thus has full rank. If N does not have full rank, we can use a subspace decomposition to rewrite R̂_ℓ(G) and obtain a similar representation. Thus, we can evaluate the supremum as described above to get

R̂_ℓ(G) = (1/ℓ) E_σ ‖ N^{−1/2} (K_{1ℓ}  K_{2ℓ})^⊤ σ ‖.

5.2 Bounding R̂_ℓ(G) Above and Below

We make use of the Kahane-Khintchine inequality (Latala and Oleszkiewicz, 1994), stated here for convenience, to bound R̂_ℓ(G).

Lemma 15 For any vectors a_1,...,a_n in a Hilbert space and independent Rademacher random variables σ_1,...,σ_n, we have

(1/2) E ‖ Σ_{i=1}^{n} σ_i a_i ‖² ≤ ( E ‖ Σ_{i=1}^{n} σ_i a_i ‖ )² ≤ E ‖ Σ_{i=1}^{n} σ_i a_i ‖².

By Lemma 15 we have

U/(√2 ℓ) ≤ R̂_ℓ(G) ≤ U/ℓ,   (29)

where

U² = E_σ ‖ N^{−1/2} (K_{1ℓ}  K_{2ℓ})^⊤ σ ‖²
= E_σ tr[ (K_{1ℓ}  K_{2ℓ}) N^{−1} (K_{1ℓ}  K_{2ℓ})^⊤ σ σ^⊤ ]
= tr[ (K_{1ℓ}  K_{2ℓ}) N^{−1} (K_{1ℓ}  K_{2ℓ})^⊤ ].

Recall that

N = γ_n ( K_1  0 ; 0  K_2 ) + γ_v ( K_{1u}^⊤ ; −K_{2u}^⊤ ) U_u ( K_{1u}  −K_{2u} ).

Define

Σ = γ_n ( K_1  0 ; 0  K_2 ),  R = ( K_{1u}^⊤ ; −K_{2u}^⊤ ) U_u^{1/2}.

Using the Sherman-Morrison-Woodbury formula (Golub and Van Loan, 1996), we expand N^{−1} as

N^{−1} = Σ^{−1} − γ_v Σ^{−1} R (I + γ_v R^⊤ Σ^{−1} R)^{−1} R^⊤ Σ^{−1}.

Define Ω = (K_{1ℓ}  K_{2ℓ}). We get

U² = tr(Ω Σ^{−1} Ω^⊤) − γ_v tr[ Ω Σ^{−1} R (I + γ_v R^⊤ Σ^{−1} R)^{−1} R^⊤ Σ^{−1} Ω^⊤ ].

Define

S = Ω Σ^{−1} Ω^⊤ = (1/γ_n) (K_{1ℓ} K_1^{−1} K_{1ℓ}^⊤ + K_{2ℓ} K_2^{−1} K_{2ℓ}^⊤),
Θ = R^⊤ Σ^{−1} R = (1/γ_n) U_u^{1/2} (K_{1u} K_1^{−1} K_{1u}^⊤ + K_{2u} K_2^{−1} K_{2u}^⊤) U_u^{1/2},
J = R^⊤ Σ^{−1} Ω^⊤ = (1/γ_n) U_u^{1/2} (K_{1u} K_1^{−1} K_{1ℓ}^⊤ − K_{2u} K_2^{−1} K_{2ℓ}^⊤).   (30)

Putting these expressions together, we get

U² = tr(S) − γ_v tr(J^⊤ (I + γ_v Θ)^{−1} J).   (31)

5.2.1 REGULARIZATION TERM ANALYSIS

From (29) and (31), it is clear what roles the regularization parameters γ_n and γ_v play in the empirical Rademacher complexity R̂_ℓ(G). The amount of reduction in the Rademacher complexity brought by γ_v is

Δ(γ_v) = γ_v tr(J^⊤ (I + γ_v Θ)^{−1} J).

This term has the property shown by the following lemma, given by Rosenberg and Bartlett (2007) in their analysis of co-regularized least squares. Here the meanings of J and Θ are different from those in Rosenberg and Bartlett (2007).

Lemma 16 (Rosenberg and Bartlett, 2007) Δ(0) = 0, Δ(γ_v) is nondecreasing on γ_v ≥ 0, and, given that Θ is positive definite, we have

lim_{γ_v → ∞} Δ(γ_v) = tr(J^⊤ Θ^{−1} J).
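Both U² from (31) and the resulting bracket on R̂_ℓ(G) in (29) are directly computable from the two Gram matrices and the current z-weights of the unlabeled points. The NumPy sketch below is our own illustration of that computation; it uses pseudo-inverses for K_1 and K_2 and assumes U_u is passed as a diagonal matrix with entries in [0,1).

```python
import numpy as np

def rademacher_bracket(K1, K2, U_u, ell, gamma_n, gamma_v):
    """Return (U/(sqrt(2)*l), U/l), the interval of (29) bracketing hat{R}_l(G),
    with U^2 = tr(S) - gamma_v * tr(J^T (I + gamma_v Theta)^{-1} J) from (31)."""
    K1l, K2l = K1[:ell, :], K2[:ell, :]      # first l rows
    K1u, K2u = K1[ell:, :], K2[ell:, :]      # last u rows
    K1inv, K2inv = np.linalg.pinv(K1), np.linalg.pinv(K2)
    Uu_sqrt = np.sqrt(U_u)                   # element-wise sqrt of the diagonal matrix
    S = (K1l @ K1inv @ K1l.T + K2l @ K2inv @ K2l.T) / gamma_n
    Theta = Uu_sqrt @ (K1u @ K1inv @ K1u.T + K2u @ K2inv @ K2u.T) @ Uu_sqrt / gamma_n
    J = Uu_sqrt @ (K1u @ K1inv @ K1l.T - K2u @ K2inv @ K2l.T) / gamma_n
    u = Theta.shape[0]
    reduction = gamma_v * np.trace(J.T @ np.linalg.solve(np.eye(u) + gamma_v * Theta, J))
    U = np.sqrt(np.trace(S) - reduction)
    return U / (np.sqrt(2.0) * ell), U / ell
```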

5.3 Extending to Iterative Optimization

As our sparse multi-view SVMs employ an iterative optimization procedure for sparsity pursuit, the former outcome for the empirical Rademacher complexity would not apply if we use more than one iteration to update z. However, we can extend the former analysis to this case. Recall that z_i ∈ [0,1) (i = ℓ+1,...,ℓ+u) and U_u = diag(z_{ℓ+1},...,z_{ℓ+u}). During the iterations, it is possible that U_u becomes a zero matrix or some other matrix with diagonal elements in the range [0,1). In any case, the resultant function class can be covered by

Ĥ = {(f_1, f_2) : γ_n (‖f_1‖² + ‖f_2‖²) ≤ 1},

which is obtained by omitting the term containing z_i in (25). Following a similar derivation, the matrix N in (27) becomes

N = γ_n ( K_1  0 ; 0  K_2 ).

Finally, we obtain a bound on the empirical Rademacher complexity R̂_ℓ(G) identical to (29), but now with U² = tr(S) and S defined in (30). The proof of Theorem 14 is completed.

5.4 Examining R̂_ℓ(G)

Here, we examine the role R̂_ℓ(G) plays in the margin bound. Since K_{1ℓ} and K_{2ℓ} are the first ℓ rows of K_1 and K_2, the formulation of tr(S) can be simplified as

tr(S) = (1/γ_n) tr(K_{1ℓ} K_1^{−1} K_{1ℓ}^⊤ + K_{2ℓ} K_2^{−1} K_{2ℓ}^⊤) = (1/γ_n) Σ_{i=1}^{ℓ} (K_1(i,i) + K_2(i,i)).   (32)

Now, we see that for the iterative optimization of sparse multi-view SVMs, the empirical Rademacher complexity R̂_ℓ(G) with U² = tr(S) only depends on the labeled examples and the chosen kernel functions. Consequently, the margin bound does not rely on the unlabeled training sets. In this case the margin bound is quite straightforward to reason about. If we do not use iterative optimization, the empirical Rademacher complexity R̂_ℓ(G) will involve the two other terms Θ and J. By a technique similar to (32), we can show that Θ only depends on the unlabeled data and the kernel functions, while J encodes the interaction between labeled and unlabeled data. As a result, the margin bound relies on both the labeled and unlabeled data. For this case, we will give an evaluation of the margin bound with different sizes of unlabeled sets later in Section 6.
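The simplification (32) is easy to verify numerically: the trace of S reduces to the labeled diagonal entries of the two Gram matrices. A small self-contained check, with randomly generated positive definite matrices standing in for the Gram matrices:

```python
import numpy as np

def trace_S_full(K1, K2, ell, gamma_n):
    """tr(S) computed from the definition of S in (30)."""
    K1l, K2l = K1[:ell, :], K2[:ell, :]
    return (np.trace(K1l @ np.linalg.solve(K1, K1l.T))
            + np.trace(K2l @ np.linalg.solve(K2, K2l.T))) / gamma_n

def trace_S_simplified(K1, K2, ell, gamma_n):
    """tr(S) via the simplification (32): only labeled diagonal entries are needed."""
    return (np.trace(K1[:ell, :ell]) + np.trace(K2[:ell, :ell])) / gamma_n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, ell, gamma_n = 30, 8, 0.1
    A1, A2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
    K1 = A1 @ A1.T + 1e-6 * np.eye(n)   # positive definite stand-ins for Gram matrices
    K2 = A2 @ A2.T + 1e-6 * np.eye(n)
    print(trace_S_full(K1, K2, ell, gamma_n), trace_S_simplified(K1, K2, ell, gamma_n))
```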

6. Experiments

We performed experiments on artificial data and real-world data to evaluate the proposed sparse multi-view SVMs (SpMvSVMs). For SpMvSVMs with ε > 0, the entries of z were fixed as 1 for labeled data and initialized as 1 for unlabeled data. The termination condition for the iterative optimization is that either no unlabeled examples can be removed or the maximum iteration number surpasses 50. Comparisons are made with supervised SVMs and with the SVM-2K method. Each accuracy/error reported in this paper is an average over ten random splits of the data into labeled, unlabeled and test sets. Later in this section, we also provide a sequential training strategy for SpMvSVMs, which shows an accuracy improvement as unlabeled data are gradually added, with roughly linear and sub-linear increases in running time. This indicates the possibility and potential of applying SpMvSVMs to large-scale data sets. At the end, margin bound evaluation results are reported.

[Figure 1: Examples in the two-moons-two-lines data set. (a) View 1; (b) View 2.]

6.1 Artificial Data

The two-moons-two-lines synthetic data set was generated according to Sindhwani et al. (2005). Examples in the two classes scatter like two moons in one view and like two parallel lines in the other. To link the two views, points on one moon were enforced to associate at random with points on one line. Each class has 400 examples, and a total of 800 examples were generated, as shown in Figure 1. For SpMvSVMs, the numbers of examples in the labeled training set, unlabeled training set and test set were fixed as four, 596, and 200, respectively. A Gaussian kernel with bandwidth 0.35 and the linear kernel were used for view 1 and view 2, respectively. The parameters γ_n and γ_v were selected from a small grid {10⁻⁶, 10⁻⁴, 10⁻², 1, 10, 100} by five-fold cross validation on the whole data set. The chosen values are γ_n = 10⁻⁴ and γ_v = 1. In this paper, γ_v is normalized by the number of labeled and unlabeled examples involved in the multi-view regularization term. For supervised SVMs, which concatenate features from the two views, we also selected the regularization coefficient from this grid by five-fold cross validation.

To evaluate SpMvSVMs, we varied the size of the unlabeled training set from 20% and 60% to 100% of the total number of unlabeled data, and used different values of the insensitivity parameter ε, ranging from 0 to 0.2 with an interval of 0.01 (when ε is zero, sparsity is not considered). The test accuracies and transductive accuracies (on the corresponding unlabeled set) are given in Figure 2(a) and Figure 2(b), respectively. It should be noted that the numbers of data used to calculate transductive accuracies are different for the three curves in Figure 2(b). The numbers of removed unlabeled examples for different ε values are shown in Figure 3. From Figure 2 and Figure 3, we find that as ε increases, more and more unlabeled data are removed, and the removal of a small number of unlabeled data hardly decreases the performance of the resultant classifiers, especially when the original size of the unlabeled set is large. Therefore, we can find a good balance between sparsity and accuracy using an appropriate ε. In addition, more unlabeled data benefit the performance of the learned classifiers for the same ε.

[Figure 2: Classification accuracies of SpMvSVMs with different sizes of the unlabeled set and different ε values on the artificial data: (a) test accuracy; (b) transductive accuracy. The accuracies of SVMs are also shown.]

[Figure 3: The numbers of unlabeled examples removed by SpMvSVMs for different ε values on the artificial data.]

6.2 Text Classification

We applied the SpMvSVMs to the WebKB text classification task studied in Blum and Mitchell (1998), Sindhwani et al. (2005) and Sun (2008). The data set consists of 1051 two-view web pages collected from the computer science department web sites of four U.S. universities: Cornell, University of Washington, University of Wisconsin, and University of Texas. The task is to predict whether a web page is a course home page or not. This problem has an unbalanced class distribution, since there are a total of 230 course home pages (positive examples). The first view of the data is the words appearing on the web page itself, whereas the second view is the underlined words in all links pointing to the web page from other pages.

[Figure 4: Classification accuracies of SpMvSVMs with different sizes of the unlabeled set and different ε values on text classification: (a) test accuracy; (b) transductive accuracy. The accuracies of SVMs are also shown.]

We preprocessed each view by removing stop words, punctuation and numbers, and then applied Porter's stemming to the text (Porter, 1980). In addition, words that occur in five or fewer documents were ignored. This resulted in 2332- and 87-dimensional vectors for the first and second view, respectively. Finally, document vectors were normalized to tf.idf (the product of term frequency and inverse document frequency) features (Salton and Buckley, 1988).

For SpMvSVMs, the numbers of examples in the labeled training set, unlabeled training set and test set were fixed as 32, 699, and 320, respectively. In the training set and test set, the numbers of negative examples are three times those of positive examples, to reflect the overall proportion of positive and negative examples. The linear kernel was used for both views. The parameters γ_n and γ_v for SpMvSVMs and the regularization coefficient for SVMs were selected using the same method as in Section 6.1. The chosen values for SpMvSVMs are γ_n = 10⁻⁶ and γ_v = 0.01.

To evaluate SpMvSVMs, we again varied the size of the unlabeled training set from 20% and 60% to 100% of the total number of unlabeled data, and used different values of the insensitivity parameter ε ranging from 0 to 0.2 with an interval of 0.01. The test accuracies and transductive accuracies are given in Figure 4(a) and Figure 4(b), respectively. The numbers of removed unlabeled examples for different ε values are shown in Figure 5. From Figure 5, we find that as ε increases, more and more unlabeled data can be removed. As reflected in Figure 4, the removal of unlabeled data only slightly decreases the performance of the resultant classifiers, and this decrease is smaller when more unlabeled data are used. We draw the same conclusion as before: an appropriate ε can be adopted to keep a good balance between sparsity and accuracy. We also observe a different phenomenon, namely that using 60% and 100% of the unlabeled data results in similar test accuracies, as shown in Figure 4(a). This is reasonable because the performance improvement of any classifier is always bounded, no matter how many data are used.
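The tf.idf normalization mentioned above can be sketched in a few lines; this is a generic illustration of the weighting scheme on a term-count matrix, not the exact preprocessing pipeline used for the WebKB data.

```python
import numpy as np

def tfidf(counts):
    """Turn a documents-by-terms count matrix into tf.idf features, each document
    vector normalized to unit length (a generic sketch of the weighting scheme)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    df = np.count_nonzero(counts > 0, axis=0)
    idf = np.log(n_docs / np.maximum(df, 1.0))
    x = tf * idf
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)
```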

[Figure 5: The numbers of unlabeled examples removed by SpMvSVMs for different ε values on text classification.]

6.3 Comparison with SVM-2K, and Sequential Training

The SVM-2K method proposed by Szedmak and Shawe-Taylor (2007) can exploit unlabeled data for multi-view learning. Similar to SpMvSVMs, it also combines the maximum margin and multi-view regularization principles. However, it adopts an ℓ1 norm for the multi-view regularization, and this regularization only uses the unlabeled data. Specifically, the SVM-2K method solves the following optimization problem for the classifier parameters w_1, w_2, b_1, and b_2 of the two views

min (1/2) (‖w_1‖² + ‖w_2‖²) + C_1 Σ_{i=1}^{ℓ} ξ_1^i + C_2 Σ_{i=1}^{ℓ} ξ_2^i + C_η Σ_{j=ℓ+1}^{ℓ+u} η_j
s.t. |w_1^⊤ φ_1(x_j) + b_1 − w_2^⊤ φ_2(x_j) − b_2| ≤ η_j + ε,
y_i (w_1^⊤ φ_1(x_i) + b_1) ≥ 1 − ξ_1^i,
y_i (w_2^⊤ φ_2(x_i) + b_2) ≥ 1 − ξ_2^i,
ξ_1^i ≥ 0, ξ_2^i ≥ 0, η_j ≥ 0 for all i = 1,...,ℓ and j = ℓ+1,...,ℓ+u,

where an ε-insensitive parameter is used to relax the prediction consistency between the views. In this subsection, we carry out an empirical comparison between SVM-2K and our SpMvSVMs for semi-supervised learning with identical data splits. Our first comparison takes the ε-insensitive parameter in both SpMvSVMs and SVM-2K as zero and uses the above two data sets with different sizes of unlabeled training sets, namely from 20% to 60% to 100%. For SVM-2K, we adopted the same parameter selection approach as in Szedmak and Shawe-Taylor (2007) through five-fold cross-validation. That is, the values of C_1 and C_2 were fixed to 1, and C_η was selected from the range {0.01 × 2^i} (i = 1,...,10). The experimental results are listed in Table 1, from which we see that both the test accuracies and the transductive accuracies of SpMvSVMs are superior to those of SVM-2K.

The second comparison considers sequential training of SpMvSVMs and SVM-2K. The purpose is to show the relationship between running time, classification accuracies and the number of unlabeled examples used.


T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA ON THE SYMMETRY OF THE POWER INE CHANNE T.C. Banwe, S. Gai {bct, sgai}@research.tecordia.com Tecordia Technoogies, Inc., 445 South Street, Morristown, NJ 07960, USA Abstract The indoor power ine network

More information

Some Properties of Regularized Kernel Methods

Some Properties of Regularized Kernel Methods Journa of Machine Learning Research 5 (2004) 1363 1390 Submitted 12/03; Revised 7/04; Pubished 10/04 Some Properties of Reguarized Kerne Methods Ernesto De Vito Dipartimento di Matematica Università di

More information

Efficient Generation of Random Bits from Finite State Markov Chains

Efficient Generation of Random Bits from Finite State Markov Chains Efficient Generation of Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

Efficiently Generating Random Bits from Finite State Markov Chains

Efficiently Generating Random Bits from Finite State Markov Chains 1 Efficienty Generating Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

Moreau-Yosida Regularization for Grouped Tree Structure Learning

Moreau-Yosida Regularization for Grouped Tree Structure Learning Moreau-Yosida Reguarization for Grouped Tree Structure Learning Jun Liu Computer Science and Engineering Arizona State University J.Liu@asu.edu Jieping Ye Computer Science and Engineering Arizona State

More information

Support Vector Machine and Its Application to Regression and Classification

Support Vector Machine and Its Application to Regression and Classification BearWorks Institutiona Repository MSU Graduate Theses Spring 2017 Support Vector Machine and Its Appication to Regression and Cassification Xiaotong Hu As with any inteectua project, the content and views

More information

SVM-based Supervised and Unsupervised Classification Schemes

SVM-based Supervised and Unsupervised Classification Schemes SVM-based Supervised and Unsupervised Cassification Schemes LUMINITA STATE University of Pitesti Facuty of Mathematics and Computer Science 1 Targu din Vae St., Pitesti 110040 ROMANIA state@cicknet.ro

More information

Adaptive Regularization for Transductive Support Vector Machine

Adaptive Regularization for Transductive Support Vector Machine Adaptive Reguarization for Transductive Support Vector Machine Zengin Xu Custer MMCI Saarand Univ. & MPI INF Saarbrucken, Germany zxu@mpi-inf.mpg.de Rong Jin Computer Sci. & Eng. Michigan State Univ. East

More information

Distributed average consensus: Beyond the realm of linearity

Distributed average consensus: Beyond the realm of linearity Distributed average consensus: Beyond the ream of inearity Usman A. Khan, Soummya Kar, and José M. F. Moura Department of Eectrica and Computer Engineering Carnegie Meon University 5 Forbes Ave, Pittsburgh,

More information

Componentwise Determination of the Interval Hull Solution for Linear Interval Parameter Systems

Componentwise Determination of the Interval Hull Solution for Linear Interval Parameter Systems Componentwise Determination of the Interva Hu Soution for Linear Interva Parameter Systems L. V. Koev Dept. of Theoretica Eectrotechnics, Facuty of Automatics, Technica University of Sofia, 1000 Sofia,

More information

6 Wave Equation on an Interval: Separation of Variables

6 Wave Equation on an Interval: Separation of Variables 6 Wave Equation on an Interva: Separation of Variabes 6.1 Dirichet Boundary Conditions Ref: Strauss, Chapter 4 We now use the separation of variabes technique to study the wave equation on a finite interva.

More information

C. Fourier Sine Series Overview

C. Fourier Sine Series Overview 12 PHILIP D. LOEWEN C. Fourier Sine Series Overview Let some constant > be given. The symboic form of the FSS Eigenvaue probem combines an ordinary differentia equation (ODE) on the interva (, ) with a

More information

Combining reaction kinetics to the multi-phase Gibbs energy calculation

Combining reaction kinetics to the multi-phase Gibbs energy calculation 7 th European Symposium on Computer Aided Process Engineering ESCAPE7 V. Pesu and P.S. Agachi (Editors) 2007 Esevier B.V. A rights reserved. Combining reaction inetics to the muti-phase Gibbs energy cacuation

More information

PHYS 110B - HW #1 Fall 2005, Solutions by David Pace Equations referenced as Eq. # are from Griffiths Problem statements are paraphrased

PHYS 110B - HW #1 Fall 2005, Solutions by David Pace Equations referenced as Eq. # are from Griffiths Problem statements are paraphrased PHYS 110B - HW #1 Fa 2005, Soutions by David Pace Equations referenced as Eq. # are from Griffiths Probem statements are paraphrased [1.] Probem 6.8 from Griffiths A ong cyinder has radius R and a magnetization

More information

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1654 March23, 1999

More information

SUPPLEMENTARY MATERIAL TO INNOVATED SCALABLE EFFICIENT ESTIMATION IN ULTRA-LARGE GAUSSIAN GRAPHICAL MODELS

SUPPLEMENTARY MATERIAL TO INNOVATED SCALABLE EFFICIENT ESTIMATION IN ULTRA-LARGE GAUSSIAN GRAPHICAL MODELS ISEE 1 SUPPLEMENTARY MATERIAL TO INNOVATED SCALABLE EFFICIENT ESTIMATION IN ULTRA-LARGE GAUSSIAN GRAPHICAL MODELS By Yingying Fan and Jinchi Lv University of Southern Caifornia This Suppementary Materia

More information

(This is a sample cover image for this issue. The actual cover is not yet available at this time.)

(This is a sample cover image for this issue. The actual cover is not yet available at this time.) (This is a sampe cover image for this issue The actua cover is not yet avaiabe at this time) This artice appeared in a journa pubished by Esevier The attached copy is furnished to the author for interna

More information

$, (2.1) n="# #. (2.2)

$, (2.1) n=# #. (2.2) Chapter. Eectrostatic II Notes: Most of the materia presented in this chapter is taken from Jackson, Chap.,, and 4, and Di Bartoo, Chap... Mathematica Considerations.. The Fourier series and the Fourier

More information

Multi-view Laplacian Support Vector Machines

Multi-view Laplacian Support Vector Machines Multi-view Laplacian Support Vector Machines Shiliang Sun Department of Computer Science and Technology, East China Normal University, Shanghai 200241, China slsun@cs.ecnu.edu.cn Abstract. We propose a

More information

Cryptanalysis of PKP: A New Approach

Cryptanalysis of PKP: A New Approach Cryptanaysis of PKP: A New Approach Éiane Jaumes and Antoine Joux DCSSI 18, rue du Dr. Zamenhoff F-92131 Issy-es-Mx Cedex France eiane.jaumes@wanadoo.fr Antoine.Joux@ens.fr Abstract. Quite recenty, in

More information

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones ASummaryofGaussianProcesses Coryn A.L. Baier-Jones Cavendish Laboratory University of Cambridge caj@mrao.cam.ac.uk Introduction A genera prediction probem can be posed as foows. We consider that the variabe

More information

On the Goal Value of a Boolean Function

On the Goal Value of a Boolean Function On the Goa Vaue of a Booean Function Eric Bach Dept. of CS University of Wisconsin 1210 W. Dayton St. Madison, WI 53706 Lisa Heerstein Dept of CSE NYU Schoo of Engineering 2 Metrotech Center, 10th Foor

More information

Week 6 Lectures, Math 6451, Tanveer

Week 6 Lectures, Math 6451, Tanveer Fourier Series Week 6 Lectures, Math 645, Tanveer In the context of separation of variabe to find soutions of PDEs, we encountered or and in other cases f(x = f(x = a 0 + f(x = a 0 + b n sin nπx { a n

More information

Some Measures for Asymmetry of Distributions

Some Measures for Asymmetry of Distributions Some Measures for Asymmetry of Distributions Georgi N. Boshnakov First version: 31 January 2006 Research Report No. 5, 2006, Probabiity and Statistics Group Schoo of Mathematics, The University of Manchester

More information

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1 Inductive Bias: How to generaize on nove data CS 478 - Inductive Bias 1 Overfitting Noise vs. Exceptions CS 478 - Inductive Bias 2 Non-Linear Tasks Linear Regression wi not generaize we to the task beow

More information

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization Scaabe Spectrum ocation for Large Networks ased on Sparse Optimization innan Zhuang Modem R&D Lab Samsung Semiconductor, Inc. San Diego, C Dongning Guo, Ermin Wei, and Michae L. Honig Department of Eectrica

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Approximated MLC shape matrix decomposition with intereaf coision constraint Thomas Kainowski Antje Kiese Abstract Shape matrix decomposition is a subprobem in radiation therapy panning. A given fuence

More information

VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES

VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES SIAM J. NUMER. ANAL. Vo. 0, No. 0, pp. 000 000 c 200X Society for Industria and Appied Mathematics VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES SARAH DAY, JEAN-PHILIPPE LESSARD, AND KONSTANTIN MISCHAIKOW

More information

Do Schools Matter for High Math Achievement? Evidence from the American Mathematics Competitions Glenn Ellison and Ashley Swanson Online Appendix

Do Schools Matter for High Math Achievement? Evidence from the American Mathematics Competitions Glenn Ellison and Ashley Swanson Online Appendix VOL. NO. DO SCHOOLS MATTER FOR HIGH MATH ACHIEVEMENT? 43 Do Schoos Matter for High Math Achievement? Evidence from the American Mathematics Competitions Genn Eison and Ashey Swanson Onine Appendix Appendix

More information

Partial permutation decoding for MacDonald codes

Partial permutation decoding for MacDonald codes Partia permutation decoding for MacDonad codes J.D. Key Department of Mathematics and Appied Mathematics University of the Western Cape 7535 Bevie, South Africa P. Seneviratne Department of Mathematics

More information

VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES

VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES SARAH DAY, JEAN-PHILIPPE LESSARD, AND KONSTANTIN MISCHAIKOW Abstract. One of the most efficient methods for determining the equiibria of a continuous parameterized

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

FFTs in Graphics and Vision. Spherical Convolution and Axial Symmetry Detection

FFTs in Graphics and Vision. Spherical Convolution and Axial Symmetry Detection FFTs in Graphics and Vision Spherica Convoution and Axia Symmetry Detection Outine Math Review Symmetry Genera Convoution Spherica Convoution Axia Symmetry Detection Math Review Symmetry: Given a unitary

More information

BALANCING REGULAR MATRIX PENCILS

BALANCING REGULAR MATRIX PENCILS BALANCING REGULAR MATRIX PENCILS DAMIEN LEMONNIER AND PAUL VAN DOOREN Abstract. In this paper we present a new diagona baancing technique for reguar matrix pencis λb A, which aims at reducing the sensitivity

More information

arxiv: v1 [cs.db] 1 Aug 2012

arxiv: v1 [cs.db] 1 Aug 2012 Functiona Mechanism: Regression Anaysis under Differentia Privacy arxiv:208.029v [cs.db] Aug 202 Jun Zhang Zhenjie Zhang 2 Xiaokui Xiao Yin Yang 2 Marianne Winsett 2,3 ABSTRACT Schoo of Computer Engineering

More information

14 Separation of Variables Method

14 Separation of Variables Method 14 Separation of Variabes Method Consider, for exampe, the Dirichet probem u t = Du xx < x u(x, ) = f(x) < x < u(, t) = = u(, t) t > Let u(x, t) = T (t)φ(x); now substitute into the equation: dt

More information

Global Optimality Principles for Polynomial Optimization Problems over Box or Bivalent Constraints by Separable Polynomial Approximations

Global Optimality Principles for Polynomial Optimization Problems over Box or Bivalent Constraints by Separable Polynomial Approximations Goba Optimaity Principes for Poynomia Optimization Probems over Box or Bivaent Constraints by Separabe Poynomia Approximations V. Jeyakumar, G. Li and S. Srisatkunarajah Revised Version II: December 23,

More information

Algorithms to solve massively under-defined systems of multivariate quadratic equations

Algorithms to solve massively under-defined systems of multivariate quadratic equations Agorithms to sove massivey under-defined systems of mutivariate quadratic equations Yasufumi Hashimoto Abstract It is we known that the probem to sove a set of randomy chosen mutivariate quadratic equations

More information

Appendix A: MATLAB commands for neural networks

Appendix A: MATLAB commands for neural networks Appendix A: MATLAB commands for neura networks 132 Appendix A: MATLAB commands for neura networks p=importdata('pn.xs'); t=importdata('tn.xs'); [pn,meanp,stdp,tn,meant,stdt]=prestd(p,t); for m=1:10 net=newff(minmax(pn),[m,1],{'tansig','purein'},'trainm');

More information

Physics 235 Chapter 8. Chapter 8 Central-Force Motion

Physics 235 Chapter 8. Chapter 8 Central-Force Motion Physics 35 Chapter 8 Chapter 8 Centra-Force Motion In this Chapter we wi use the theory we have discussed in Chapter 6 and 7 and appy it to very important probems in physics, in which we study the motion

More information

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION SAHAR KARIMI AND STEPHEN VAVASIS Abstract. In this paper we present a variant of the conjugate gradient (CG) agorithm in which we invoke a subspace minimization

More information

Unconditional security of differential phase shift quantum key distribution

Unconditional security of differential phase shift quantum key distribution Unconditiona security of differentia phase shift quantum key distribution Kai Wen, Yoshihisa Yamamoto Ginzton Lab and Dept of Eectrica Engineering Stanford University Basic idea of DPS-QKD Protoco. Aice

More information

Homogeneity properties of subadditive functions

Homogeneity properties of subadditive functions Annaes Mathematicae et Informaticae 32 2005 pp. 89 20. Homogeneity properties of subadditive functions Pá Burai and Árpád Száz Institute of Mathematics, University of Debrecen e-mai: buraip@math.kte.hu

More information

Section 6: Magnetostatics

Section 6: Magnetostatics agnetic fieds in matter Section 6: agnetostatics In the previous sections we assumed that the current density J is a known function of coordinates. In the presence of matter this is not aways true. The

More information

An explicit Jordan Decomposition of Companion matrices

An explicit Jordan Decomposition of Companion matrices An expicit Jordan Decomposition of Companion matrices Fermín S V Bazán Departamento de Matemática CFM UFSC 88040-900 Forianópois SC E-mai: fermin@mtmufscbr S Gratton CERFACS 42 Av Gaspard Coriois 31057

More information

Universal Consistency of Multi-Class Support Vector Classification

Universal Consistency of Multi-Class Support Vector Classification Universa Consistency of Muti-Cass Support Vector Cassification Tobias Gasmachers Dae Moe Institute for rtificia Inteigence IDSI, 6928 Manno-Lugano, Switzerand tobias@idsia.ch bstract Steinwart was the

More information

Coupling of LWR and phase transition models at boundary

Coupling of LWR and phase transition models at boundary Couping of LW and phase transition modes at boundary Mauro Garaveo Dipartimento di Matematica e Appicazioni, Università di Miano Bicocca, via. Cozzi 53, 20125 Miano Itay. Benedetto Piccoi Department of

More information

Stochastic Variational Inference with Gradient Linearization

Stochastic Variational Inference with Gradient Linearization Stochastic Variationa Inference with Gradient Linearization Suppementa Materia Tobias Pötz * Anne S Wannenwetsch Stefan Roth Department of Computer Science, TU Darmstadt Preface In this suppementa materia,

More information

Math 124B January 17, 2012

Math 124B January 17, 2012 Math 124B January 17, 212 Viktor Grigoryan 3 Fu Fourier series We saw in previous ectures how the Dirichet and Neumann boundary conditions ead to respectivey sine and cosine Fourier series of the initia

More information

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance Send Orders for Reprints to reprints@benthamscience.ae 340 The Open Cybernetics & Systemics Journa, 015, 9, 340-344 Open Access Research of Data Fusion Method of Muti-Sensor Based on Correation Coefficient

More information

8 Digifl'.11 Cth:uits and devices

8 Digifl'.11 Cth:uits and devices 8 Digif'. Cth:uits and devices 8. Introduction In anaog eectronics, votage is a continuous variabe. This is usefu because most physica quantities we encounter are continuous: sound eves, ight intensity,

More information

THE REACHABILITY CONES OF ESSENTIALLY NONNEGATIVE MATRICES

THE REACHABILITY CONES OF ESSENTIALLY NONNEGATIVE MATRICES THE REACHABILITY CONES OF ESSENTIALLY NONNEGATIVE MATRICES by Michae Neumann Department of Mathematics, University of Connecticut, Storrs, CT 06269 3009 and Ronad J. Stern Department of Mathematics, Concordia

More information

Investigation on spectrum of the adjacency matrix and Laplacian matrix of graph G l

Investigation on spectrum of the adjacency matrix and Laplacian matrix of graph G l Investigation on spectrum of the adjacency matrix and Lapacian matrix of graph G SHUHUA YIN Computer Science and Information Technoogy Coege Zhejiang Wani University Ningbo 3500 PEOPLE S REPUBLIC OF CHINA

More information

c 2007 Society for Industrial and Applied Mathematics

c 2007 Society for Industrial and Applied Mathematics SIAM REVIEW Vo. 49,No. 1,pp. 111 1 c 7 Society for Industria and Appied Mathematics Domino Waves C. J. Efthimiou M. D. Johnson Abstract. Motivated by a proposa of Daykin [Probem 71-19*, SIAM Rev., 13 (1971),

More information

arxiv: v1 [math.co] 17 Dec 2018

arxiv: v1 [math.co] 17 Dec 2018 On the Extrema Maximum Agreement Subtree Probem arxiv:1812.06951v1 [math.o] 17 Dec 2018 Aexey Markin Department of omputer Science, Iowa State University, USA amarkin@iastate.edu Abstract Given two phyogenetic

More information

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete Uniprocessor Feasibiity of Sporadic Tasks with Constrained Deadines is Strongy conp-compete Pontus Ekberg and Wang Yi Uppsaa University, Sweden Emai: {pontus.ekberg yi}@it.uu.se Abstract Deciding the feasibiity

More information

ORTHOGONAL MULTI-WAVELETS FROM MATRIX FACTORIZATION

ORTHOGONAL MULTI-WAVELETS FROM MATRIX FACTORIZATION J. Korean Math. Soc. 46 2009, No. 2, pp. 281 294 ORHOGONAL MLI-WAVELES FROM MARIX FACORIZAION Hongying Xiao Abstract. Accuracy of the scaing function is very crucia in waveet theory, or correspondingy,

More information

Discrete Techniques. Chapter Introduction

Discrete Techniques. Chapter Introduction Chapter 3 Discrete Techniques 3. Introduction In the previous two chapters we introduced Fourier transforms of continuous functions of the periodic and non-periodic (finite energy) type, as we as various

More information

Appendix for Stochastic Gradient Monomial Gamma Sampler

Appendix for Stochastic Gradient Monomial Gamma Sampler 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 3 3 33 34 35 36 37 38 39 4 4 4 43 44 45 46 47 48 49 5 5 5 53 54 Appendix for Stochastic Gradient Monomia Gamma Samper A The Main Theorem We provide the foowing

More information

Analysis of Emerson s Multiple Model Interpolation Estimation Algorithms: The MIMO Case

Analysis of Emerson s Multiple Model Interpolation Estimation Algorithms: The MIMO Case Technica Report PC-04-00 Anaysis of Emerson s Mutipe Mode Interpoation Estimation Agorithms: The MIMO Case João P. Hespanha Dae E. Seborg University of Caifornia, Santa Barbara February 0, 004 Anaysis

More information

A Novel Learning Method for Elman Neural Network Using Local Search

A Novel Learning Method for Elman Neural Network Using Local Search Neura Information Processing Letters and Reviews Vo. 11, No. 8, August 2007 LETTER A Nove Learning Method for Eman Neura Networ Using Loca Search Facuty of Engineering, Toyama University, Gofuu 3190 Toyama

More information

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract Stochastic Compement Anaysis of Muti-Server Threshod Queues with Hysteresis John C.S. Lui The Dept. of Computer Science & Engineering The Chinese University of Hong Kong Leana Goubchik Dept. of Computer

More information

Bourgain s Theorem. Computational and Metric Geometry. Instructor: Yury Makarychev. d(s 1, s 2 ).

Bourgain s Theorem. Computational and Metric Geometry. Instructor: Yury Makarychev. d(s 1, s 2 ). Bourgain s Theorem Computationa and Metric Geometry Instructor: Yury Makarychev 1 Notation Given a metric space (X, d) and S X, the distance from x X to S equas d(x, S) = inf d(x, s). s S The distance

More information

An approximate method for solving the inverse scattering problem with fixed-energy data

An approximate method for solving the inverse scattering problem with fixed-energy data J. Inv. I-Posed Probems, Vo. 7, No. 6, pp. 561 571 (1999) c VSP 1999 An approximate method for soving the inverse scattering probem with fixed-energy data A. G. Ramm and W. Scheid Received May 12, 1999

More information

Another Look at Linear Programming for Feature Selection via Methods of Regularization 1

Another Look at Linear Programming for Feature Selection via Methods of Regularization 1 Another Look at Linear Programming for Feature Seection via Methods of Reguarization Yonggang Yao, The Ohio State University Yoonkyung Lee, The Ohio State University Technica Report No. 800 November, 2007

More information

arxiv: v1 [math.fa] 23 Aug 2018

arxiv: v1 [math.fa] 23 Aug 2018 An Exact Upper Bound on the L p Lebesgue Constant and The -Rényi Entropy Power Inequaity for Integer Vaued Random Variabes arxiv:808.0773v [math.fa] 3 Aug 08 Peng Xu, Mokshay Madiman, James Mebourne Abstract

More information

NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS

NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS NOISE-INDUCED STABILIZATION OF STOCHASTIC DIFFERENTIAL EQUATIONS TONY ALLEN, EMILY GEBHARDT, AND ADAM KLUBALL 3 ADVISOR: DR. TIFFANY KOLBA 4 Abstract. The phenomenon of noise-induced stabiization occurs

More information

Nonlinear Analysis of Spatial Trusses

Nonlinear Analysis of Spatial Trusses Noninear Anaysis of Spatia Trusses João Barrigó October 14 Abstract The present work addresses the noninear behavior of space trusses A formuation for geometrica noninear anaysis is presented, which incudes

More information

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks ower Contro and Transmission Scheduing for Network Utiity Maximization in Wireess Networks Min Cao, Vivek Raghunathan, Stephen Hany, Vinod Sharma and. R. Kumar Abstract We consider a joint power contro

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Agorithmic Operations Research Vo.4 (29) 49 57 Approximated MLC shape matrix decomposition with intereaf coision constraint Antje Kiese and Thomas Kainowski Institut für Mathematik, Universität Rostock,

More information