A Brief Survey on Semi-supervised Learning with Graph Regularization
Hongyuan You
Department of Computer Science
University of California, Santa Barbara
hyou@cs.ucsb.edu

Abstract

In this survey, we go over a few historical works on semi-supervised learning that apply graph regularization to both labeled and unlabeled data to improve classification performance. These semi-supervised methods usually construct a nearest-neighbour graph on the instance space under a certain measure function, and then work under the smoothness assumption that class labels change quite slowly between instances connected on the graph. The graph Laplacian is used in the graph smoothness regularizer since it is a discrete analogue of the Laplace-Beltrami smoothness operator in a continuous space. A relationship between kernel methods and the graph Laplacian has also been discovered; it suggests an approach to designing specific kernels for classification that utilize the spectrum of the Laplacian matrix.

1 Smoothness Regularization via the Laplace Operator and the Graph Laplacian

1.1 Semi-supervised Learning Problem

Nowadays there are many practical applications of machine learning. Though in academic papers we often assume these problems come with nice conditions and a huge number of samples for inference and learning, such datasets are rarely available in reality. One of the biggest problems is that not all sample instances have class labels, since people can hardly label millions of instances manually. We therefore need learning methods that can deal with labeled and unlabeled data at the same time, utilizing the labels of the labeled instances and also the information from the unlabeled instances. Let X be the instance space, {x_1, x_2, ..., x_l} be the labeled training samples with labels {y_1, y_2, ..., y_l}, y_i ∈ {-1, +1}, and {x_{l+1}, x_{l+2}, ..., x_n} be the n - l unlabeled samples. Usually we have l ≪ n, which means only very few samples are labeled.
Our purpose is to learn a classification function f: X → {-1, +1} that predicts the class of testing samples. We restrict ourselves to the transductive setting, where the unlabelled data also serve as the test data in our performance analysis.

1.2 Manifold Assumption and Laplace Operator

The information from unlabeled instances is hard to incorporate into the classification model unless we can raise a reasonable assumption about the unlabeled data. The first assumption is that the data lies on a low-dimensional manifold within a high-dimensional representation space. We can easily find many examples to justify this assumption. For example, a handwritten digit can be represented as a matrix of image pixels with very high dimension; however, it can also be represented by only a few parameters in a low-dimensional space, and in this space similar handwritten digits lie close to each other and form a low-dimensional manifold. Similar examples can be found in document categorization problems, where people discovered that documents represented by word-identifier vectors (remark: the counts or frequencies) should lie on a manifold no matter how high the dimension of the original representation is. Realizing the existence of this intrinsic structure of the instance space is significant for developing classification methods for such kinds of problems. On such a low-dimensional manifold we make the second assumption, already suggested above: instances that have a small distance on the manifold should have the same labels. Under this assumption, we are able to use unlabeled instances as long as they are close to some labeled training samples.

Under the manifold setting, we can use the Laplace operator for regularization. Let the Laplace operator be Δ = -Σ_i ∂²/∂x_i², and let f be an L² function defined on an n-dimensional compact Riemannian manifold M. It is known that the spectrum of Δ is discrete (λ_0 = 0) and that its eigenfunctions e_i(·) form a basis of L²(M), which means any f can be represented as a linear combination of the e_i(·):

    f(x) = Σ_{i=0}^∞ α_i e_i(x)    (1)

where Δe_i(·) = λ_i e_i(·). Using the Laplace operator we obtain a natural measure of smoothness for any function f on M, which computes the variation of the function values:

    S(f) := ∫_M |∇f|² dµ = ∫_M f Δf dµ = ⟨f, Δf⟩_{L²(M)}    (2)

The smoothness of the eigenfunctions is

    S(e_i(·)) = ⟨e_i(·), Δe_i(·)⟩_{L²(M)} = λ_i    (3)

Then we can easily compute the smoothness from the eigenvalues λ_i and the decomposition factors α_i:

    S(f(·)) = ⟨Δf, f⟩_{L²(M)} = ⟨Σ_i λ_i α_i e_i(·), Σ_i α_i e_i(·)⟩_{L²(M)} = Σ_{i=0}^∞ λ_i α_i²    (4)

The last equation tells us that an eigenfunction with a large eigenvalue leads to a large smoothness value: it differs a lot between nearby instances and constitutes an oscillating component of the original function. An eigenfunction with a small eigenvalue, in contrast, has a smooth pattern under this measure. For the learning problem, we can make f the classifier we want. If we want close instances on the low-dimensional manifold to have similar classification results, then the smoothness measure S(f) should be small enough.
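The discrete analogy behind this smoothness measure can be checked numerically. Below is a minimal sketch in which the path-graph Laplacian stands in for the 1-D operator -d²/dx² (the grid size n is an illustrative choice); it verifies S(e_i) = λ_i and S(f) = Σ_i λ_i α_i²:

```python
import numpy as np

# Discretize the 1-D Laplace operator -d^2/dx^2 on a path of n points
# (a standard finite-difference sketch; n is illustrative).
n = 100
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1  # path-graph boundary: the constant vector gets eigenvalue 0

lam, E = np.linalg.eigh(L)  # eigenvalues ascending, eigenvectors orthonormal

# Smoothness of each eigenvector equals its eigenvalue: S(e_i) = <e_i, L e_i> = lambda_i
S = np.array([E[:, i] @ L @ E[:, i] for i in range(n)])
assert np.allclose(S, lam)

# A function f = sum_i alpha_i e_i has smoothness sum_i lambda_i alpha_i^2 (Eq. 4)
rng = np.random.default_rng(0)
alpha = rng.normal(size=n)
f = E @ alpha
assert np.isclose(f @ L @ f, np.sum(lam * alpha**2))
```

The low-index eigenvectors are the slowly varying (smooth) components; truncating or penalizing the high-index ones is exactly the regularization idea used below.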
So S(f) becomes an ideal regularizer.

1.3 Graph Approximation of the Manifold and the Graph Laplacian

However, how to recover the intrinsic structure, i.e. the low-dimensional manifold of the input data, is still an open problem. What we need here for semi-supervised learning is the second assumption, on the relationship between the similarity distance and the label. We can construct a nearest-neighbour graph G that connects similar instances (though distance on the graph differs from distance on the manifold), and make the graph a good approximation of the low-dimensional manifold of the input instance space. Each instance x_i adds its t nearest neighbours to the graph G. Let A be the adjacency matrix of G, with A_ij = ω_ij = 1 if instances x_i and x_j are close under some similarity function, such as the angle distance (one can also change the edge values into meaningful weights). People have developed many methods for computing the similarity distance, which we omit here. By now our assumption should be clear: two instances connected by an edge should have similar classification results.

Under the idea of the graph approximation of the manifold, we can use the graph Laplacian as a natural analogue of the Laplace operator. Let L = D - A be the Laplacian matrix of the adjacency matrix A of graph G, where D = diag{d_1, d_2, ..., d_n} and d_i is the degree of instance x_i in the nearest-neighbour graph G. We know that L has an eigensystem {(λ_i, v_i)}_{i=1}^n, and that {v_i}_{i=1}^n form an orthonormal basis of R^n. Any vector f = (f_1, f_2, ..., f_n)^T can be represented in this basis:

    f = Σ_{i=1}^n α_i v_i,    L v_i = λ_i v_i    (5)
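The construction just described is short in code. A sketch (the random data, t = 4, and the 0/1 edge weights are all illustrative choices) that builds a t-nearest-neighbour graph, forms L = D - A, and checks the graph smoothness identity f^T L f = ½ Σ_ij ω_ij (f_i - f_j)²:

```python
import numpy as np

# Sketch: t-nearest-neighbour graph and its (unnormalized) Laplacian L = D - A.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))          # 30 instances in R^5 (illustrative)
t = 4                                  # neighbours per instance

dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
A = np.zeros((len(X), len(X)))
for i in range(len(X)):
    nbrs = np.argsort(dist[i])[1:t + 1]   # skip self at position 0
    A[i, nbrs] = 1
A = np.maximum(A, A.T)                    # symmetrize: keep edge if either end chose it

D = np.diag(A.sum(axis=1))
L = D - A

# Graph smoothness functional: f^T L f = 1/2 * sum_ij w_ij (f_i - f_j)^2
f = rng.normal(size=len(X))
lhs = f @ L @ f
rhs = 0.5 * np.sum(A * (f[:, None] - f[None, :])**2)
assert np.isclose(lhs, rhs)
```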
More importantly, the natural measure of smoothness keeps the same form when we use L in place of Δ:

    S_L(f) := (1/2) Σ_{i,j} ω_ij (f_i - f_j)² = f^T L f = ⟨f, Lf⟩_G    (6)

From this smoothness definition we can see that if ω_ij is nonzero and large, a difference between the two predictions increases the smoothness value, while identical predictions make the term zero, which is exactly what we want in a regularizer. The smoothness of the eigenvectors again equals their eigenvalues:

    S_L(v_i) = v_i^T L v_i = λ_i v_i^T v_i = λ_i    (7)

Then we can easily compute the smoothness from the eigenvalues λ_i and decomposition factors α_i:

    S_L(f) = ⟨Lf, f⟩_G = ⟨Σ_i α_i L v_i, Σ_i α_i v_i⟩_G = Σ_i λ_i α_i²    (8)

As we can see, these properties are all the same as in the continuous version.

2 Learning Optimal Labels for Unlabeled Data

2.1 Classification as an Interpolation Problem

If we treat the entries of the vector f as the values of a classifier f(·) on the instances, and the model of the instance space satisfies the smoothness assumption, then f can be computed as a combination of the eigenvectors v_i of the Laplacian matrix L. The only things we need to know are the combination factors α_i. This early work on the semi-supervised learning problem was first done by Mikhail Belkin and Partha Niyogi [1]. The continuous formulation would be

    α = arg min_α ∫_M ( f(x) - Σ_{j=1}^p α_j e_j(x) )² dx    (9)

However, it is hard to get those eigenfunctions on the manifold M, and we usually have access to only a limited number of labeled instances, so they instead solve the discrete version of the formulation:

    α = arg min_α Σ_{i=1}^l ( y_i - Σ_{j=1}^p α_j v_j(i) )²    (10)

where {v_j}_{j=1}^p are the first p eigenvectors of L, corresponding to the smallest p eigenvalues, and f = Σ_j α_j v_j. Although they mentioned the smoothness measure of the graph Laplacian, they did not use it explicitly in this method, but instead chose the smallest p eigenvectors. Looking back at the formula for S_L(f), we find that such a component selection keeps the p smoothest eigenvectors and ensures the smoothness of the learned predictions.
One can solve the least-squares problem as follows:

    α = (E^T E)^{-1} E^T y    (11)

where α = (α_1, α_2, ..., α_p)^T, y = (y_1, y_2, ..., y_l)^T, and

    E = [ v_1(1)  v_2(1)  ...  v_p(1)
          v_1(2)  v_2(2)  ...  v_p(2)
          ...
          v_1(l)  v_2(l)  ...  v_p(l) ]    (12)

The final classifier for unlabeled data in the dataset is

    f(x_i) = sgn( Σ_{j=1}^p α_j v_j(i) )    (13)

One concern is the difficulty of approximating the Laplace operator decomposition with the graph Laplacian decomposition, because the decomposition into eigenfunctions of Δ should be much more meaningful and generalizable than the decomposition into eigenvectors of L.
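The whole procedure of Eqs. (10)-(13) fits in a few lines. A sketch on a hypothetical two-cluster toy graph (the graph, p = 2, and the choice of labeled nodes are all illustrative):

```python
import numpy as np

# Sketch of the eigenvector-interpolation classifier (Eqs. 10-13): fit the
# labeled entries of the p smoothest Laplacian eigenvectors by least squares.
# Toy graph: two cliques of 10 nodes joined by a single edge (illustrative).
n = 20
A = np.zeros((n, n))
A[:10, :10] = 1
A[10:, 10:] = 1
np.fill_diagonal(A, 0)
A[9, 10] = A[10, 9] = 1
L = np.diag(A.sum(axis=1)) - A

lam, V = np.linalg.eigh(L)
p = 2
E_p = V[:, :p]                       # p smoothest eigenvectors

labeled = [0, 19]                    # one labeled node per cluster
y = np.array([-1.0, +1.0])
E = E_p[labeled, :]                  # Eq. 12: eigenvector entries at labeled nodes
alpha, *_ = np.linalg.lstsq(E, y, rcond=None)   # Eq. 11

pred = np.sign(E_p @ alpha)          # Eq. 13: classify every node
assert (pred[:10] == -1).all() and (pred[10:] == +1).all()
```

With p = 2 the fit uses the constant vector plus the Fiedler vector, whose sign pattern separates the two cliques, so the two labels propagate to the whole graph.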
2.2 Tikhonov Regularization

The optimization problem of this method changes only a little from the one above: it explicitly adds the smoothness measure to the objective function. But the theoretical result of interest shows that the error bound is controlled by the Fiedler number of the graph, which means the generalization ability of graph-regularized learning is related to geometric invariants of the graph [2]. The setting here is a little different: labeled instances are randomly chosen with replacement, so each labeled instance may appear more than once, with the same or different labels. And we preprocess the labels like

    ỹ = (y_1 - ȳ, y_2 - ȳ, ..., y_l - ȳ),    ȳ = (1/l) Σ_{i=1}^l y_i    (14)

They learn the labels by solving the following optimization problem:

    f = arg min_{1^T f = 0} (1/l) Σ_{i=1}^l (ỹ_i - f_i)² + γ f^T S f    (15)

where f^T S f is the regularization term, and S = L can be replaced by S = L^p for p ∈ N or S = exp(tL) for t > 0. Let n_i be the multiplicity of x_i, and let y_ij be the j-th label of x_i, where j = 1, 2, ..., n_i. Then we can rewrite the objective as a quadratic program (dropping the constant Σ_j y_ij² and, for notational simplicity, absorbing the multiplicities so that each instance contributes once):

    arg min_{1^T f = 0} Σ_i ( n_i f_i² - 2 f_i Σ_{j=1}^{n_i} y_ij + Σ_{j=1}^{n_i} y_ij² ) + γ f^T S f
    = arg min_{1^T f = 0} ( f^T I f - 2 ỹ^T f ) + γ f^T S f
    = arg min_{1^T f = 0} f^T (I + γS) f - 2 ỹ^T f

Setting the gradient of the Lagrangian L(f, µ) to zero gives (I + γS) f = ỹ + (µ/2) 1; redefining µ/2 → µ, the optimal label vector is

    f = (I + γS)^{-1} (ỹ + µ 1)    (16)

where µ is selected so that 1^T f = 0:

    µ = - (1^T (I + γS)^{-1} ỹ) / (1^T (I + γS)^{-1} 1)    (17)

Defining the empirical error on the labeled training samples as

    R_l(f) = (1/l) Σ_{i=1}^l (f_i - ỹ_i)²    (18)

and the generalization error on all labeled and unlabeled samples as

    R(f) = E_{x~D, y} (f(x) - ỹ(x))²    (19)

we have the following performance bound [2], based on a stability analysis [3]: let γ be the regularization parameter, and T be a set of l > 4 vertices x_1, ..., x_l, where each vertex occurs no more than t times, together with values y_1, ..., y_l, |y_i| ≤ M. Let f be the regularized solution using the smoothness functional L, whose second smallest eigenvalue is λ_2. Assuming |f_i| ≤ K, the following bound holds with probability at least 1 - δ:

    |R_l(f) - R(f)| ≤ β + (β + (K + M)) √( 2 log(2/δ) / l )    (20)

where

    β = 3M√t / (γλ_2 l - √t)² + 4M / (γλ_2 l - √t)    (21)
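The estimator of Eqs. (15)-(17) with S = L has a direct closed form. A sketch on a hypothetical two-clique toy graph (the graph, γ, and the labels are illustrative):

```python
import numpy as np

# Sketch of the Tikhonov-regularized solution with S = L:
# f = (I + gamma*S)^(-1) (y_tilde + mu*1), mu chosen so that 1^T f = 0.
n = 20
A = np.zeros((n, n))
A[:10, :10] = 1
A[10:, 10:] = 1
np.fill_diagonal(A, 0)
A[9, 10] = A[10, 9] = 1                # bridge between the two cliques
S = np.diag(A.sum(axis=1)) - A         # smoothness matrix S = L

gamma = 0.1
y_tilde = np.zeros(n)
y_tilde[[0, 19]] = [-1.0, +1.0]        # centered labels (mean already zero here)
ones = np.ones(n)

Minv = np.linalg.inv(np.eye(n) + gamma * S)
mu = -(ones @ Minv @ y_tilde) / (ones @ Minv @ ones)   # enforces 1^T f = 0 (Eq. 17)
f = Minv @ (y_tilde + mu * ones)                        # Eq. 16

assert np.isclose(ones @ f, 0.0)       # the constraint holds
assert f[0] < 0 < f[19]                # the two labeled nodes keep their signs
```

Unlike the interpolation method, no eigendecomposition is needed here; a single linear solve produces the smoothed label vector.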
Here the Fiedler number λ_2 shows up in the generalization error bound. λ_2 gives an estimate of the connectivity of the graph: if λ_2 is small, a small graph cut is possible, and if λ_2 is large, the graph is more connected. We can see from the bound that the larger λ_2 is, the better the smoothness regularization performs, because an overly sparse graph can hardly support the smoothness assumption; the prediction for a single instance is then influenced by only a very few neighbours, which may lead to overfitting and worse generalization performance.

One drawback of these two methods is that they can only predict the labels of one fixed set of unlabeled instances at a time. For a new set of unlabeled data, they have to recompute the minimization problem; that is, they learn only the best predictions f(x_i), not an explicit classifier f(x) for instances that do not appear in the dataset. To overcome this shortcoming, we now look into kernel methods and the relationship between the graph Laplacian and the kernel matrix.

3 Kernel Methods and Graph Regularization

3.1 Kernel function and kernel matrix

Kernel methods are a class of algorithms that includes the Support Vector Machine (SVM). The basic idea is to predict testing samples from training samples through pairwise similarity. The pairwise similarity matrix is the kernel matrix K, whose entry K_ij = K(x_i, x_j) gives the similarity between x_i and x_j; K(·,·) should be a kernel function over X × X, where X is the instance space. Formally, a function K: X × X → R is a kernel function if: (i) K is symmetric, i.e. K(x, y) = K(y, x); (ii) K is positive semi-definite, i.e. for any set {x_1, ..., x_n}, the kernel matrix K (with K_ij = K(x_i, x_j)) is positive semi-definite. For every valid kernel function K, Mercer's theorem guarantees that there is a function φ: X → V satisfying K(x, y) = ⟨φ(x), φ(y)⟩_V.
This means the pairwise similarity computed by the kernel function actually comes from an inner product in some (usually high-dimensional) feature space V, and φ is a kind of feature transformation. This feature-space formulation will help later.

3.2 Reproducing kernel Hilbert spaces

Let us look deeper into kernel methods and reproducing kernel Hilbert spaces, which are helpful for solving the semi-supervised learning problem with graph regularization. Let X be the instance space and K: X × X → R be a kernel function as described above. Consider functions f: X → R, from which the classifier should come, and let H_K be a Hilbert space of such functions with inner product ⟨·,·⟩_K. We say H_K is a reproducing kernel Hilbert space (RKHS) if H_K and ⟨·,·⟩_K satisfy the following two conditions: (i) under the mapping Φ: x → K(x, ·), for any x ∈ X we have K(x, ·) ∈ H_K; (ii) f(x) = ⟨f(·), K(x, ·)⟩_K for any f ∈ H_K and x ∈ X. The first condition maps each x to a function K(x, ·). The second condition is the reproducing property, which implies K(x, y) = ⟨K(x, ·), K(y, ·)⟩_K. Φ maps a space of instances into a space of functions, and makes the value of f at a point x the inner product of f and K(x, ·).

Regularization in an RKHS is usually formulated as the following minimization problem, which learns a function f ∈ H_K:

    f = arg min_{f ∈ H_K} E_γ(f) = arg min_{f ∈ H_K} (1/l) Σ_{i=1}^l L(f(x_i), y_i) + γ ⟨f, f⟩_K    (22)
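For the squared loss, the minimization of Eq. (22) reduces to kernel ridge regression, a convenient concrete case. A sketch (the RBF kernel, toy data, and γ are illustrative choices, not from the survey):

```python
import numpy as np

# Sketch: RKHS regularization (Eq. 22) under squared loss, i.e. kernel ridge
# regression. The minimizer is f(x) = sum_i c_i K(x_i, x) with
# c = (K + gamma*l*I)^(-1) y.
rng = np.random.default_rng(5)
l = 15
X = rng.normal(size=(l, 2))
y = np.sign(X[:, 0])                       # toy labels from the first coordinate

def k_rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

K = np.array([[k_rbf(xi, xj) for xj in X] for xi in X])
gamma = 1e-3
c = np.linalg.solve(K + gamma * l * np.eye(l), y)   # combination factors
assert np.allclose(K @ c + gamma * l * c, y)        # the linear system is satisfied

def f(x):
    # The learned classifier is defined for ANY x, not only dataset points.
    return sum(ci * k_rbf(xi, x) for ci, xi in zip(c, X))

score = f(np.array([0.5, -0.2]))                    # a previously unseen instance
```

Note that `f` scores instances outside the training set; this is exactly the inductive ability the earlier transductive methods lacked.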
where L is the loss function and γ > 0 is the regularization parameter. It is known that the minimizer has the following form [10]:

    f(x) = Σ_i c_i K(x_i, x),    x ∈ X    (23)

which is a linear combination of the reproducing mappings K(x_i, ·) of the training sample instances x_i. This form helps a lot in our learning problem because it reduces learning over all possible functions to learning a set of combination factors. Thus it becomes possible to generalize the previous methods by learning a classifier f(x) for all instances x in the space X, rather than just the labels of the unlabeled data in the training dataset.

3.3 Constructing an RKHS and Regularization

A specific setting is to let H_K consist of linear combinations of the following kind of function:

    f(·) = Σ_{i=1}^l α_i K(x_i, ·)    (24)

where l ∈ N, α_i ∈ R, K is a valid kernel function, and the x_i are labeled or unlabeled training samples. Let the inner product of H_K be defined as

    ⟨f(·), g(·)⟩_K = ⟨Σ_i α_i K(x_i, ·), Σ_j α'_j K(x_j, ·)⟩ = Σ_{i,j} α_i α'_j K(x_i, x_j)    (25)

Under this setting one can easily check that H_K is an RKHS with kernel K and inner product ⟨·,·⟩_K. People then proved the following equivalence [7][8]: given a valid kernel K on all the labeled and unlabeled instances, the problem of learning the labels of the unlabeled instances (focusing on labels),

    f = arg min_f Σ_{i=1}^l L(ỹ_i, f_i) + γ f^T K^{-1} f    (26)

is equivalent to the problem

    f = arg min_{f ∈ H_K} Σ_{i=1}^l L(ỹ_i, f(x_i)) + γ ⟨f, f⟩_K    (27)

Note that (i) the second problem is the same as an SVM in the feature-space formulation mentioned earlier, and can be solved efficiently in the dual space; (ii) the first problem optimizes a label vector while the second optimizes a function; (iii) the equivalence means f(x_i) = f_i for the unlabeled instances in the dataset. This equivalence means we can learn a classifier that predicts new instances without extra computational cost.

3.4 From the Graph Laplacian to a Kernel

Now the only question is how to obtain a valid kernel from the graph Laplacian matrix L that we know. Assume we have the graph Laplacian matrix L over n instances, of which l are labeled training samples. Let us define a semi-norm on R^n with L:

    ⟨f, g⟩ := f^T L g,    f, g ∈ R^n    (28)

It is only a semi-norm because ⟨f, f⟩ = 0 whenever f is a constant vector. However, if we define H_0 to be the space of real-valued functions on the instances of the graph, and let H be the subspace of H_0 orthogonal to the constant vector (so no nonzero function in H takes the same value on all instances), then ⟨f, g⟩ is a well-defined inner product on H [6]. We can claim that the pseudoinverse L^+ is a valid kernel matrix, because we know the following equation:

    f^T L L^+ = f^T,    for all f ∈ H    (29)
which can be interpreted as follows: for f ∈ H,

    f_i = ⟨f, L^+_i⟩ = f^T L L^+_i    (30)

where L^+_i is the i-th column of L^+. This shows a reproducing property of the inner product defined with L on H; thus L^+ is a valid reproducing kernel on H. If the graph Laplacian matrix is invertible, we simply use L^{-1} rather than the pseudoinverse L^+. We can then use K = L^+ in the minimization problem of the last subsection. This is a simple justification for designing kernels from the graph Laplacian matrix. Furthermore, the kernel matrix has the following spectral form:

    K = Σ_{i: λ_i > 0} (1/λ_i) v_i v_i^T    (31)

where the v_i are eigenvectors of L.

4 Spectral Transforms for Graph Kernels

4.1 Parametric Spectral Transforms

We can further improve the kernel for better performance; the basic idea is to encourage the spectral components with small eigenvalues and penalize those with large eigenvalues. Here we add a function r(·) defined on the eigenvalues and perform a parametric spectral transform, obtaining the following form of kernel K:

    K = Σ_i (1/r(λ_i)) v_i v_i^T    (32)

As shown in [4], many kernels can be designed under this spectral graph kernel framework by transforming the Laplacian spectrum:

    Regularized Laplacian kernel:  K = (I + σ²L)^{-1},         r(λ) = 1 + σ²λ    (33)
    Diffusion process kernel:      K = exp(-σ²L/2),            r(λ) = exp(σ²λ/2)    (34)
    p-step random walk kernel:     K = (αI - L)^p, α ≥ 2,      r(λ) = (α - λ)^{-p}    (35)
    Inverse cosine kernel:         K = cos(Lπ/4),              r(λ) = (cos(λπ/4))^{-1}    (36)

4.2 Nonparametric Spectral Transforms

Many methods use the framework above to design kernels from the graph Laplacian matrix L. But the spectral transformation itself does not address the question of which parametric family to use for r. The number of degrees of freedom of r (usually one or two) might be too few for modelling the learning problem, leaving the kernel overly constrained. People have therefore tried to optimize a nonparametric spectral transformation that imposes only an ordering constraint (see the last constraint below) [5].
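Both constructions above admit a compact numerical check: that L^+ reproduces centered functions (Eqs. 29-31), and that the parametric transforms of Eq. (32) recover the closed matrix forms of Eqs. (33) and (35). The small cycle graph and the hyperparameters σ², α, p below are illustrative; α is taken larger than the top eigenvalue so that (αI - L) is positive definite on this particular graph:

```python
import numpy as np

# Toy graph: an 8-node cycle (illustrative).
n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
L = np.diag(A.sum(axis=1)) - A
lam, V = np.linalg.eigh(L)            # eigenvalues ascending, lam[0] ~ 0

# --- L^+ as a reproducing kernel on centered functions (Eqs. 29-31) ---
K = np.linalg.pinv(L)
rng = np.random.default_rng(6)
f = rng.normal(size=n)
f -= f.mean()                          # project onto H: no constant component
assert np.allclose(f @ L @ K, f)       # reproducing property, Eq. 30

K_spec = sum(np.outer(V[:, i], V[:, i]) / lam[i] for i in range(n) if lam[i] > 1e-10)
assert np.allclose(K, K_spec)          # spectral form, Eq. 31

# --- parametric spectral transforms (Eqs. 32-35) ---
def kernel_from_r(r):
    # Eq. 32: K = sum_i (1 / r(lambda_i)) v_i v_i^T
    return (V * (1.0 / r(lam))) @ V.T

sigma2, alpha, p = 0.5, 5.0, 2         # alpha above the top eigenvalue (here 4)

K_reg = kernel_from_r(lambda lm: 1 + sigma2 * lm)           # regularized Laplacian
K_dif = kernel_from_r(lambda lm: np.exp(sigma2 / 2 * lm))   # diffusion kernel
K_rw = kernel_from_r(lambda lm: (alpha - lm) ** -p)         # p-step random walk

# Closed matrix forms agree with the spectral construction:
assert np.allclose(K_reg, np.linalg.inv(np.eye(n) + sigma2 * L))
assert np.allclose(K_rw, np.linalg.matrix_power(alpha * np.eye(n) - L, p))
```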
    max_µ  Â(K_tr, T) = ⟨K_tr, T⟩_F / √( ⟨K_tr, K_tr⟩_F ⟨T, T⟩_F )
    subject to  K = Σ_{i=1}^n µ_i K_i,    K_i = v_i v_i^T
                µ_i ≥ 0
                trace(K) = 1
                µ_i ≥ µ_{i+1},  i = 2, ..., n - 1    (37)
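For a fixed coefficient vector µ, the alignment objective above can be evaluated directly. The sketch below does only that, for one hand-picked feasible µ; it is not the full QCQP solver, and the toy graph, labels, and µ are illustrative:

```python
import numpy as np

# Sketch: evaluating the empirical alignment objective of Eq. 37 for a fixed
# spectral-coefficient vector mu on a toy 8-node cycle graph.
n, l = 8, 4
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
L = np.diag(A.sum(axis=1)) - A
lam, V = np.linalg.eigh(L)

K_i = [np.outer(V[:, i], V[:, i]) for i in range(n)]    # base spectral kernels
mu = np.array([0.0, 4, 3, 2, 1, 0.5, 0.25, 0.1])         # nonincreasing from i = 2 on
K = sum(m * Ki for m, Ki in zip(mu, K_i))
K /= np.trace(K)                                         # trace(K) = 1

y = np.array([+1.0, +1, -1, -1])                         # labels of the l labeled nodes
T = np.outer(y, y)
K_tr = K[:l, :l]                                         # kernel restricted to labeled part

def frob(M, N):
    return np.sum(M * N)                                 # Frobenius product <M, N>_F

alignment = frob(K_tr, T) / np.sqrt(frob(K_tr, K_tr) * frob(T, T))
assert -1.0 <= alignment <= 1.0                          # Cauchy-Schwarz bound
```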
The objective function is called the empirical kernel alignment, which evaluates the fitness of a kernel K to the labels of the labeled training samples; its good properties are shown in [9]. ⟨·,·⟩_F is the Frobenius product, defined as ⟨M, N⟩_F = Σ_{i,j} M_ij N_ij. K_tr in the objective is the kernel matrix K restricted to the first l labeled training samples, and T is the outer product of y, that is, T = y y^T. Finally, µ = (µ_1, µ_2, ..., µ_n)^T is the vector of combination factors of the kernel's spectrum.

Such an ordering constraint is reasonable. As mentioned before, considering the smoothness measure S_G, the smaller the eigenvalue λ_i, the smoother its corresponding eigenvector v_i over the graph G. The ordering constraint maintains a decreasing order of significance in the transformed spectrum of the kernel, which means it encourages smooth functions in the end. Note that we do not constrain the constant eigenvector v_1 with zero eigenvalue λ_1 (assuming a connected graph G), which makes sense since it is simply a bias term. The constraint trace(K) = 1 fixes the scale invariance of the kernel alignment.

The optimization problem above has an equivalent form:

    max_µ  ⟨K_tr, T⟩_F
    subject to  ⟨K_tr, K_tr⟩_F ≤ 1
                µ_i ≥ 0
                µ_i ≥ µ_{i+1},  i = 2, ..., n - 1    (38)

Under this subtle configuration, the problem can be further transformed into an equivalent quadratically constrained quadratic program (QCQP), for which feasible and fast optimization methods exist:

    max_µ  vec(T)^T M µ
    subject to  ‖M µ‖ ≤ 1
                µ_i ≥ 0
                µ_i ≥ µ_{i+1},  i = 2, ..., n - 1    (39)

References

[1] Belkin, M., & Niyogi, P. (2002). Semi-supervised learning on manifolds. Advances in Neural Information Processing Systems (NIPS).
[2] Belkin, M., Matveeva, I., & Niyogi, P. (2004). Regularization and semi-supervised learning on large graphs. In Learning Theory. Springer Berlin Heidelberg.
[3] Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2.
[4] Smola, A. J., & Kondor, R. (2003). Kernels and regularization on graphs. In Learning Theory and Kernel Machines. Springer Berlin Heidelberg.
[5] Zhu, X., Kandola, J. S., Ghahramani, Z., & Lafferty, J. D. (2004). Nonparametric transforms of graph kernels for semi-supervised learning. Advances in Neural Information Processing Systems (NIPS), Vol. 17.
[6] Argyriou, A., Herbster, M., & Pontil, M. (2005). Combining graph Laplacians for semi-supervised learning. Advances in Neural Information Processing Systems (NIPS).
[7] Johnson, R., & Zhang, T. (2008). Graph-based semi-supervised learning and spectral kernel design. IEEE Transactions on Information Theory, 54(1).
[8] Zhang, T., & Ando, R. (2006). Analysis of spectral kernel design based semi-supervised learning. Advances in Neural Information Processing Systems, 18.
[9] Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. (2001). On kernel-target alignment. Advances in Neural Information Processing Systems, 14.
[10] Herbster, M., Pontil, M., & Wainer, L. (2005). Online learning over graphs. Proceedings of the 22nd International Conference on Machine Learning (ICML).
More informationLocality Preserving Projections
Locality Preserving Projections Xiaofei He Department of Computer Science The University of Chicago Chicago, IL 60637 xiaofei@cs.uchicago.edu Partha Niyogi Department of Computer Science The University
More informationA GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong
A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,
More informationLeast-Squares Spectral Analysis Theory Summary
Least-Squares Spectral Analysis Theory Summary Reerence: Mtamakaya, J. D. (2012). Assessment o Atmospheric Pressure Loading on the International GNSS REPRO1 Solutions Periodic Signatures. Ph.D. dissertation,
More information2.6 Two-dimensional continuous interpolation 3: Kriging - introduction to geostatistics. References - geostatistics. References geostatistics (cntd.
.6 Two-dimensional continuous interpolation 3: Kriging - introduction to geostatistics Spline interpolation was originally developed or image processing. In GIS, it is mainly used in visualization o spatial
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationFinding Consensus in Multi-Agent Networks Using Heat Kernel Pagerank
Finding Consensus in Multi-Agent Networs Using Heat Kernel Pageran Olivia Simpson and Fan Chung 2 Abstract We present a new and eicient algorithm or determining a consensus value or a networ o agents.
More informationImprovement of Sparse Computation Application in Power System Short Circuit Study
Volume 44, Number 1, 2003 3 Improvement o Sparse Computation Application in Power System Short Circuit Study A. MEGA *, M. BELKACEMI * and J.M. KAUFFMANN ** * Research Laboratory LEB, L2ES Department o
More informationLearning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking p. 1/31
Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking Dengyong Zhou zhou@tuebingen.mpg.de Dept. Schölkopf, Max Planck Institute for Biological Cybernetics, Germany Learning from
More informationSemi-Supervised Learning
Semi-Supervised Learning getting more for less in natural language processing and beyond Xiaojin (Jerry) Zhu School of Computer Science Carnegie Mellon University 1 Semi-supervised Learning many human
More informationOutline. Motivation. Mapping the input space to the feature space Calculating the dot product in the feature space
to The The A s s in to Fabio A. González Ph.D. Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá April 2, 2009 to The The A s s in 1 Motivation Outline 2 The Mapping the
More informationImproving Semi-Supervised Target Alignment via Label-Aware Base Kernels
Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence Improving Semi-Supervised Target Alignment via Label-Aware Base Kernels Qiaojun Wang, Kai Zhang 2, Guofei Jiang 2, and Ivan Marsic
More informationDifferentiation. The main problem of differential calculus deals with finding the slope of the tangent line at a point on a curve.
Dierentiation The main problem o dierential calculus deals with inding the slope o the tangent line at a point on a curve. deinition() : The slope o a curve at a point p is the slope, i it eists, o the
More informationNumerical Methods - Lecture 2. Numerical Methods. Lecture 2. Analysis of errors in numerical methods
Numerical Methods - Lecture 1 Numerical Methods Lecture. Analysis o errors in numerical methods Numerical Methods - Lecture Why represent numbers in loating point ormat? Eample 1. How a number 56.78 can
More informationLecture : Feedback Linearization
ecture : Feedbac inearization Niola Misovic, dipl ing and Pro Zoran Vuic June 29 Summary: This document ollows the lectures on eedbac linearization tought at the University o Zagreb, Faculty o Electrical
More informationGraph-Based Semi-Supervised Learning
Graph-Based Semi-Supervised Learning Olivier Delalleau, Yoshua Bengio and Nicolas Le Roux Université de Montréal CIAR Workshop - April 26th, 2005 Graph-Based Semi-Supervised Learning Yoshua Bengio, Olivier
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationCurve Sketching. The process of curve sketching can be performed in the following steps:
Curve Sketching So ar you have learned how to ind st and nd derivatives o unctions and use these derivatives to determine where a unction is:. Increasing/decreasing. Relative extrema 3. Concavity 4. Points
More informationFinite Dimensional Hilbert Spaces are Complete for Dagger Compact Closed Categories (Extended Abstract)
Electronic Notes in Theoretical Computer Science 270 (1) (2011) 113 119 www.elsevier.com/locate/entcs Finite Dimensional Hilbert Spaces are Complete or Dagger Compact Closed Categories (Extended bstract)
More informationData dependent operators for the spatial-spectral fusion problem
Data dependent operators for the spatial-spectral fusion problem Wien, December 3, 2012 Joint work with: University of Maryland: J. J. Benedetto, J. A. Dobrosotskaya, T. Doster, K. W. Duke, M. Ehler, A.
More informationOn a Closed Formula for the Derivatives of e f(x) and Related Financial Applications
International Mathematical Forum, 4, 9, no. 9, 41-47 On a Closed Formula or the Derivatives o e x) and Related Financial Applications Konstantinos Draais 1 UCD CASL, University College Dublin, Ireland
More informationSupport Vector Machines
Support Vector Machines Reading: Ben-Hur & Weston, A User s Guide to Support Vector Machines (linked from class web page) Notation Assume a binary classification problem. Instances are represented by vector
More informationUnsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto
Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationSpectral Feature Selection for Supervised and Unsupervised Learning
Spectral Feature Selection for Supervised and Unsupervised Learning Zheng Zhao Huan Liu Department of Computer Science and Engineering, Arizona State University zhaozheng@asu.edu huan.liu@asu.edu Abstract
More informationThe intrinsic structure of FFT and a synopsis of SFT
Global Journal o Pure and Applied Mathematics. ISSN 0973-1768 Volume 12, Number 2 (2016), pp. 1671-1676 Research India Publications http://www.ripublication.com/gjpam.htm The intrinsic structure o FFT
More informationSemi-Supervised Learning in Gigantic Image Collections. Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT)
Semi-Supervised Learning in Gigantic Image Collections Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT) Gigantic Image Collections What does the world look like? High
More informationSIO 211B, Rudnick. We start with a definition of the Fourier transform! ĝ f of a time series! ( )
SIO B, Rudnick! XVIII.Wavelets The goal o a wavelet transorm is a description o a time series that is both requency and time selective. The wavelet transorm can be contrasted with the well-known and very
More informationRegularization on Discrete Spaces
Regularization on Discrete Spaces Dengyong Zhou and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics Spemannstr. 38, 72076 Tuebingen, Germany {dengyong.zhou, bernhard.schoelkopf}@tuebingen.mpg.de
More informationRandom Walks, Random Fields, and Graph Kernels
Random Walks, Random Fields, and Graph Kernels John Lafferty School of Computer Science Carnegie Mellon University Based on work with Avrim Blum, Zoubin Ghahramani, Risi Kondor Mugizi Rwebangira, Jerry
More information8.4 Inverse Functions
Section 8. Inverse Functions 803 8. Inverse Functions As we saw in the last section, in order to solve application problems involving eponential unctions, we will need to be able to solve eponential equations
More informationSemi-Supervised Learning in Reproducing Kernel Hilbert Spaces Using Local Invariances
Semi-Supervised Learning in Reproducing Kernel Hilbert Spaces Using Local Invariances Wee Sun Lee,2, Xinhua Zhang,2, and Yee Whye Teh Department of Computer Science, National University of Singapore. 2
More informationSupport Vector Machines
Support Vector Machines Support vector machines (SVMs) are one of the central concepts in all of machine learning. They are simply a combination of two ideas: linear classification via maximum (or optimal
More informationThe concept of limit
Roberto s Notes on Dierential Calculus Chapter 1: Limits and continuity Section 1 The concept o limit What you need to know already: All basic concepts about unctions. What you can learn here: What limits
More informationCS798: Selected topics in Machine Learning
CS798: Selected topics in Machine Learning Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS798: Selected topics in Machine Learning
More informationLinear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction
Linear vs Non-linear classifier CS789: Machine Learning and Neural Network Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Linear classifier is in the
More informationThe Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017
The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the
More informationDETC A GENERALIZED MAX-MIN SAMPLE FOR RELIABILITY ASSESSMENT WITH DEPENDENT VARIABLES
Proceedings o the ASME International Design Engineering Technical Conerences & Computers and Inormation in Engineering Conerence IDETC/CIE August 7-,, Bualo, USA DETC- A GENERALIZED MAX-MIN SAMPLE FOR
More informationCombining Classifiers and Learning Mixture-of-Experts
38 Combining Classiiers and Learning Mixture-o-Experts Lei Xu Chinese University o Hong Kong Hong Kong & Peing University China Shun-ichi Amari Brain Science Institute Japan INTRODUCTION Expert combination
More informationITERATED SRINKAGE ALGORITHM FOR BASIS PURSUIT MINIMIZATION
ITERATED SRINKAGE ALGORITHM FOR BASIS PURSUIT MINIMIZATION Michael Elad The Computer Science Department The Technion Israel Institute o technology Haia 3000, Israel * SIAM Conerence on Imaging Science
More informationThe Space of Negative Scalar Curvature Metrics
The Space o Negative Scalar Curvature Metrics Kyler Siegel November 12, 2012 1 Introduction Review: So ar we ve been considering the question o which maniolds admit positive scalar curvature (p.s.c.) metrics.
More informationSTAT 801: Mathematical Statistics. Hypothesis Testing
STAT 801: Mathematical Statistics Hypothesis Testing Hypothesis testing: a statistical problem where you must choose, on the basis o data X, between two alternatives. We ormalize this as the problem o
More informationProbabilistic Model of Error in Fixed-Point Arithmetic Gaussian Pyramid
Probabilistic Model o Error in Fixed-Point Arithmetic Gaussian Pyramid Antoine Méler John A. Ruiz-Hernandez James L. Crowley INRIA Grenoble - Rhône-Alpes 655 avenue de l Europe 38 334 Saint Ismier Cedex
More informationML (cont.): SUPPORT VECTOR MACHINES
ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version
More informationRobust Laplacian Eigenmaps Using Global Information
Manifold Learning and its Applications: Papers from the AAAI Fall Symposium (FS-9-) Robust Laplacian Eigenmaps Using Global Information Shounak Roychowdhury ECE University of Texas at Austin, Austin, TX
More information1 Relative degree and local normal forms
THE ZERO DYNAMICS OF A NONLINEAR SYSTEM 1 Relative degree and local normal orms The purpose o this Section is to show how single-input single-output nonlinear systems can be locally given, by means o a
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2016 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationDiffeomorphic Warping. Ben Recht August 17, 2006 Joint work with Ali Rahimi (Intel)
Diffeomorphic Warping Ben Recht August 17, 2006 Joint work with Ali Rahimi (Intel) What Manifold Learning Isn t Common features of Manifold Learning Algorithms: 1-1 charting Dense sampling Geometric Assumptions
More informationBasis Expansion and Nonlinear SVM. Kai Yu
Basis Expansion and Nonlinear SVM Kai Yu Linear Classifiers f(x) =w > x + b z(x) = sign(f(x)) Help to learn more general cases, e.g., nonlinear models 8/7/12 2 Nonlinear Classifiers via Basis Expansion
More informationEvent-triggered Model Predictive Control with Machine Learning for Compensation of Model Uncertainties
Event-triggered Model Predictive Control with Machine Learning or Compensation o Model Uncertainties Jaehyun Yoo, Adam Molin, Matin Jaarian, Hasan Esen, Dimos V. Dimarogonas, and Karl H. Johansson Abstract
More informationA Rough Sets Based Classifier for Induction Motors Fault Diagnosis
A ough Sets Based Classiier or Induction Motors Fault Diagnosis E. L. BONALDI, L. E. BOGES DA SILVA, G. LAMBET-TOES AND L. E. L. OLIVEIA Electronic Engineering Department Universidade Federal de Itajubá
More informationProduct Matrix MSR Codes with Bandwidth Adaptive Exact Repair
1 Product Matrix MSR Codes with Bandwidth Adaptive Exact Repair Kaveh Mahdaviani, Soheil Mohajer, and Ashish Khisti ECE Dept, University o Toronto, Toronto, ON M5S3G4, Canada ECE Dept, University o Minnesota,
More informationA Comparative Study of Non-separable Wavelet and Tensor-product. Wavelet; Image Compression
Copyright c 007 Tech Science Press CMES, vol., no., pp.91-96, 007 A Comparative Study o Non-separable Wavelet and Tensor-product Wavelet in Image Compression Jun Zhang 1 Abstract: The most commonly used
More informationKernel methods for comparing distributions, measuring dependence
Kernel methods for comparing distributions, measuring dependence Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Principal component analysis Given a set of M centered observations
More informationGraphs in Machine Learning
Graphs in Machine Learning Michal Valko Inria Lille - Nord Europe, France TA: Pierre Perrault Partially based on material by: Mikhail Belkin, Jerry Zhu, Olivier Chapelle, Branislav Kveton October 30, 2017
More informationThe construction of two dimensional Hilbert Huang transform and its
The construction o two dimensional Hilbert Huang transorm and its application in image analysis 1 Lihong Qiao, Sisi Chen *1, Henan univ. o technology,iaolihong@16.com, XingTai University, chssxt@163.com
More informationSpectral Clustering. by HU Pili. June 16, 2013
Spectral Clustering by HU Pili June 16, 2013 Outline Clustering Problem Spectral Clustering Demo Preliminaries Clustering: K-means Algorithm Dimensionality Reduction: PCA, KPCA. Spectral Clustering Framework
More informationIn many diverse fields physical data is collected or analysed as Fourier components.
1. Fourier Methods In many diverse ields physical data is collected or analysed as Fourier components. In this section we briely discuss the mathematics o Fourier series and Fourier transorms. 1. Fourier
More informationTLT-5200/5206 COMMUNICATION THEORY, Exercise 3, Fall TLT-5200/5206 COMMUNICATION THEORY, Exercise 3, Fall Problem 1.
TLT-5/56 COMMUNICATION THEORY, Exercise 3, Fall Problem. The "random walk" was modelled as a random sequence [ n] where W[i] are binary i.i.d. random variables with P[W[i] = s] = p (orward step with probability
More information