New Regularized Algorithms for Transductive Learning


Partha Pratim Talukdar and Koby Crammer
Computer & Information Science Department, University of Pennsylvania, Philadelphia, PA

Abstract. We propose a new graph-based label propagation algorithm for transductive learning. Each example is associated with a vertex in an undirected graph, and a weighted edge between two vertices represents similarity between the two corresponding examples. We build on Adsorption, a recently proposed algorithm, and analyze its properties. We then state our learning algorithm as a convex optimization problem over multi-label assignments and derive an efficient algorithm to solve this problem. We state the conditions under which our algorithm is guaranteed to converge. We provide experimental evidence on various real-world datasets demonstrating the effectiveness of our algorithm over other algorithms for such problems. We also show that our algorithm can be extended to incorporate additional prior information, and demonstrate it by classifying data where the labels are not mutually exclusive.

Key words: label propagation, transductive learning, graph based semi-supervised learning.

1 Introduction

Supervised machine learning methods have achieved considerable success in a wide variety of domains, ranging from Natural Language Processing and Speech Recognition to Bioinformatics. Unfortunately, preparing labeled data for such methods is often expensive and time consuming, while unlabeled data are widely available in many cases. This was the major motivation that led to the development of semi-supervised algorithms, which learn from limited amounts of labeled data and vast amounts of freely available unannotated data. Recently, graph-based semi-supervised algorithms have received considerable attention [2, 7, 11, 14, 17]. Such methods represent instances as vertices in a graph, with edges between vertices encoding similarities between them. Graph-based semi-supervised algorithms often propagate the label information from the few labeled vertices to the entire graph. Most of the algorithms trade off accuracy (initially labeled nodes should retain those labels; relaxations are allowed by some methods) against smoothness (adjacent vertices in the graph should be assigned similar labels). Most algorithms only output label information for the unlabeled data in a transductive setting, while some algorithms are designed for the semi-supervised framework and build a classification model which can be applied to out-of-sample examples.

Adsorption [1] is one such recently proposed graph-based semi-supervised algorithm, which has been successfully used for different tasks, such as recommending YouTube videos to users [1] and large-scale assignment of semantic classes to entities within Information Extraction [13]. Adsorption has many desirable properties: it can perform multiclass classification, and it can be parallelized and hence scaled to handle large data sets, which is of particular importance for semi-supervised algorithms. Even though Adsorption works well in practice, to the best of our knowledge it has never been analyzed before, and hence our understanding of it is limited. Hoping to fill this gap, we make the following contributions in this paper:

- We analyze the Adsorption algorithm [1] and show that there does not exist an objective function whose local optimization would be the output of the Adsorption algorithm.
- Motivated by this negative result, we propose a new graph-based semi-supervised algorithm (Modified Adsorption, MAD), which shares Adsorption's desirable properties, yet with some important differences.
- We state the learning problem as an optimization problem and develop efficient (iterative) methods to solve it. We also list the conditions under which the optimization algorithm MAD is guaranteed to converge.
- The transition to an optimization-based learning algorithm provides a flexible and general framework that enables us to specify a variety of requirements. We demonstrate this framework using data with non-mutually exclusive labels, resulting in the Modified Adsorption for Dependent Labels (MADDL, pronounced medal) algorithm.
- We provide experimental evidence demonstrating the effectiveness of our proposed algorithm on various real-world datasets.

2 Adsorption Algorithm

Adsorption [1] is a general algorithmic framework for transductive learning where the learner is often given a small set of labeled examples and a very large set of unlabeled examples. The goal is to label all the unlabeled examples, and possibly, under the assumption of label-noise, also to relabel the labeled examples. Like many other related algorithms [17, 12, 5], Adsorption assumes that the learning problem is given in graph form, where examples or instances are represented as nodes or vertices and edges encode similarity between examples. Some of the nodes are associated with a pre-specified label, which is correct in the noise-free case, or can be subject to label-noise. Additional information can be given in the form of weights over the labels. Adsorption propagates label information from the labeled examples to the entire set of vertices via the edges. The labeling is represented using a non-negative score for each label, with a high score for some label indicating high association of a vertex (or its corresponding instance) with that label. If the scores are additively normalized, they can be thought of as a conditional distribution over the labels given the node (or example) identity.

More formally, Adsorption is given an undirected graph G = (V, E, W), where a node v ∈ V corresponds to an example, an edge e = (a, b) ∈ V × V indicates that the labels of the two vertices a, b ∈ V should be similar, and the weight W_ab ∈ R_+ reflects the strength of this similarity. We denote the total number of examples or vertices by n = |V|, by n_l the number of examples for which we have prior knowledge of their label, and by n_u the number of unlabeled examples to be labeled. Clearly n_l + n_u = n. Let L be the set of possible labels; their total number is denoted by m = |L|, and without loss of generality we assume that the possible labels are L = {1, ..., m}. Each instance v ∈ V is associated with two row-vectors Y_v, Ŷ_v ∈ R^m_+. The l-th element of the vector Y_v encodes the prior knowledge for vertex v. The higher the value of Y_vl, the stronger we a-priori believe that the label of v should be l ∈ L, and a value of zero (Y_vl = 0) indicates no prior about the label l for vertex v. Unlabeled examples have all their elements set to zero, that is, Y_vl = 0 for l = 1, ..., m. The second vector Ŷ_v ∈ R^m_+ is the output of the algorithm, with semantics similar to Y_v. For example, a high value of Ŷ_vl indicates that the algorithm believes that the vertex v should have the label l. We denote by Y, Ŷ ∈ R^{n×m}_+ the matrices whose rows are Y_v and Ŷ_v respectively. Finally, we denote by 0_d the all-zeros row vector of dimension d.

2.1 Random-Walk View

The Adsorption algorithm can be viewed as a controlled random walk over the graph G. The control is formalized via three possible actions: inject, continue and abandon (denoted by inj, cont, abnd), with pre-defined probabilities p_v^inj, p_v^cont, p_v^abnd ≥ 0 per vertex v ∈ V. Clearly their sum is unit: p_v^inj + p_v^cont + p_v^abnd = 1. To label any vertex v ∈ V (either labeled or unlabeled) we initiate a random walk starting at v, facing three options. First, with probability p_v^inj the random walk stops and returns (i.e. injects) the pre-defined vector information Y_v; we constrain p_v^inj = 0 for unlabeled vertices. Second, with probability p_v^abnd the random walk abandons the labeling process and returns the all-zeros vector 0_m. Third, with probability p_v^cont the random walk continues to one of v's neighbors u, with probability proportional to W_uv ≥ 0. Note that by definition W_uv = 0 if (u, v) ∉ E. We summarize the above process with the following set of equations. The transition probabilities are

  Pr[u | v] = W_uv / Σ_{u': (u',v) ∈ E} W_u'v   if (u, v) ∈ E, and 0 otherwise.   (1)

The (expected) score Ŷ_v for node v ∈ V is given by

  Ŷ_v = p_v^inj Y_v + p_v^cont Σ_{u: (u,v) ∈ E} Pr[u | v] Ŷ_u + p_v^abnd 0_m.   (2)
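As a concrete illustration of Eqs. (1) and (2), the following minimal numpy sketch column-normalizes a weight matrix into the transition probabilities Pr[u | v] and iterates the expected-score recursion to a fixed point. The function names, the array layout (P[u, v] = Pr[u | v]) and the fixed iteration count are our own assumptions, not code from the paper.

```python
import numpy as np

def transition_probs(W):
    """Eq. (1): column-normalize W so that P[u, v] = Pr[u | v]."""
    col_sums = W.sum(axis=0).astype(float)
    col_sums[col_sums == 0] = 1.0          # guard isolated vertices
    return W / col_sums

def expected_scores(W, Y, p_inj, p_cont, n_iter=100):
    """Iterate Eq. (2); the abandon action contributes the zero vector 0_m."""
    P = transition_probs(W)
    Y_hat = Y.astype(float).copy()
    for _ in range(n_iter):
        neighbor_avg = P.T @ Y_hat         # row v holds sum_u Pr[u|v] * Y_hat_u
        Y_hat = p_inj[:, None] * Y + p_cont[:, None] * neighbor_avg
    return Y_hat
```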

2.2 Averaging View

For this view we add a designated symbol called the dummy label, denoted ν ∉ L. This additional label explicitly encodes ignorance about the correct label, and it means that a dummy label can be used instead. Explicitly, we add an additional column to all the vectors defined above, so that Y_v, Ŷ_v ∈ R^{m+1}_+ and Y, Ŷ ∈ R^{n×(m+1)}_+. We set Y_vν = 0, that is, a-priori no vertex is associated with the dummy label, and replace the zero vector 0_m with the vector r ∈ R^{m+1}_+, where r_l = 0 for l ≠ ν and r_ν = 1. In words, if the random walk is abandoned, then the corresponding labeling vector is zero for all true labels in L, and an arbitrary value of unit for the dummy label ν. This way, there is always a positive score for at least one label, either one in L or the dummy label.

The averaging view then defines a set of fixed-point equations to update the predicted labels. A summary of the equations appears in Algorithm 1. The algorithm is run until convergence, which is achieved when the label distribution on each node ceases to change within some tolerance value. Since Adsorption is memoryless, it scales to tens of millions of nodes with dense edges and can be easily parallelized [1].

Algorithm 1 Adsorption Algorithm
Input:
  - Graph: G = (V, E, W)
  - Prior labeling: Y_v ∈ R^{m+1} for v ∈ V
  - Probabilities: p_v^inj, p_v^cont, p_v^abnd for v ∈ V
Output:
  - Label Scores: Ŷ_v for v ∈ V
1: Ŷ_v ← Y_v for v ∈ V {Initialization}
2: repeat
3:   X_v ← Σ_u W_uv Ŷ_u  for v ∈ V
4:   D_v ← X_v / Σ_u W_uv  for v ∈ V
5:   for all v ∈ V do
6:     Ŷ_v ← p_v^inj Y_v + p_v^cont D_v + p_v^abnd r
7:   end for
8: until convergence

Baluja et al. [1] show that, up to the additional dummy label, these two views are equivalent.
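A direct numpy transcription of Algorithm 1 might look as follows. This is a sketch under the assumption that W is symmetric with zero diagonal; the tolerance-based stopping rule stands in for the unspecified convergence test.

```python
import numpy as np

def adsorption(W, Y, p_inj, p_cont, p_abnd, tol=1e-4, max_iter=1000):
    """Adsorption (Algorithm 1). Y is n x (m+1); the last column is the dummy label."""
    r = np.zeros(Y.shape[1]); r[-1] = 1.0      # abandonment vector r (Sec. 2.2)
    deg = W.sum(axis=1).astype(float)
    deg[deg == 0] = 1.0
    Y_hat = Y.astype(float).copy()
    for _ in range(max_iter):
        X = W @ Y_hat                          # line 3: X_v = sum_u W_uv * Y_hat_u
        D = X / deg[:, None]                   # line 4: weighted neighbor average
        Y_new = (p_inj[:, None] * Y            # line 6
                 + p_cont[:, None] * D
                 + p_abnd[:, None] * r)
        if np.abs(Y_new - Y_hat).max() < tol:  # convergence within tolerance
            return Y_new
        Y_hat = Y_new
    return Y_hat
```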

It remains to specify the values of p_v^inj, p_v^cont and p_v^abnd. For the experiments reported in Section 6, we set their values using the following heuristics (adapted from Baluja et al. [1]), which depend on a parameter β that we set to β = 2. For each node v we define two quantities, c_v and d_v, and define p_v^cont ∝ c_v and p_v^inj ∝ d_v. The first quantity c_v ∈ [0, 1] is monotonically decreasing with the number of neighbors of node v in the graph G. Intuitively, the higher the value of c_v, the lower the number of neighbors of vertex v and the higher the information they contain about the labeling of v. The other quantity d_v ≥ 0 is monotonically increasing with the entropy (for labeled vertices); in that case we prefer to use the prior information rather than the quantities computed from the neighbors.

Specifically, we first compute the entropy of the transition probabilities for each node,

  H[v] = − Σ_u Pr[u | v] log Pr[u | v],

and then pass it through the following monotonically decreasing function:

  f(x) = log(β) / log(β + e^x).

Note that f(0) = log(β) / log(β + 1) and that f(x) goes to zero as x goes to infinity. We define

  c_v = f(H[v]).

Next we define

  d_v = (1 − c_v) · √H[v]   if the vertex v is labeled,
  d_v = 0                   if the vertex v is unlabeled.

Finally, to ensure proper normalization of p_v^cont, p_v^inj and p_v^abnd, we define

  z_v = max(c_v + d_v, 1),

and

  p_v^cont = c_v / z_v ;  p_v^inj = d_v / z_v ;  p_v^abnd = 1 − p_v^cont − p_v^inj.

Thus, abandonment occurs only when the continuation and injection probabilities are low enough. This is most likely to happen at unlabeled nodes with high degree. Once the random walk reaches such a node v, the walk is terminated with probability p_v^abnd. This, in effect, prevents the Adsorption algorithm from propagating information through high-degree nodes. We note that the probabilities p_v^inj, p_v^cont and p_v^abnd for node v may be set with heuristics other than the fan-out entropy heuristics shown above, to suit specific application contexts.
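Putting the pieces of this heuristic together, one plausible numpy rendering (our sketch; the handling of 0 · log 0 and of isolated vertices is our assumption) is:

```python
import numpy as np

def walk_probabilities(W, is_labeled, beta=2.0):
    """Entropy-based heuristic for (p_inj, p_cont, p_abnd) per vertex."""
    col = W.sum(axis=0).astype(float)
    col[col == 0] = 1.0
    P = W / col                                     # P[u, v] = Pr[u | v]
    logP = np.where(P > 0, np.log(np.where(P > 0, P, 1.0)), 0.0)
    H = -(P * logP).sum(axis=0)                     # transition entropy H[v]
    c = np.log(beta) / np.log(beta + np.exp(H))     # c_v = f(H[v])
    d = np.where(is_labeled, (1.0 - c) * np.sqrt(H), 0.0)
    z = np.maximum(c + d, 1.0)                      # normalizer z_v
    p_cont, p_inj = c / z, d / z
    return p_inj, p_cont, 1.0 - p_cont - p_inj
```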

3 Analysis of the Adsorption Algorithm

Our next goal is to find an objective function that the Adsorption algorithm minimizes. Our starting point is line 6 of Algorithm 1. We note that when the algorithm converges, both sides of the assignment operator equal each other before the assignment takes place. Thus, when the algorithm terminates, we have for all v ∈ V:

  Ŷ_v = p_v^inj Y_v + p_v^cont (1/N_v) Σ_u W_uv Ŷ_u + p_v^abnd r,   where N_v = Σ_u W_uv.

The last set of equalities is equivalent to

  G_v({Ŷ_u}_{u∈V}) = 0   for v ∈ V,   (3)

where we define

  G_v({Ŷ_u}_{u∈V}) = p_v^inj Y_v + p_v^cont (1/N_v) Σ_u W_uv Ŷ_u + p_v^abnd r − Ŷ_v.

Now, if the Adsorption algorithm were minimizing some objective function denoted by Q({Ŷ_u}_{u∈V}), the termination condition of Eq. (3) would in fact be a condition on the vector of its partial derivatives, where we would identify

  G_v = ∂Q / ∂Ŷ_v.   (4)

Since the functions G_v are linear (and thus have continuous derivatives), a necessary condition for the existence of a function Q such that (4) holds is that the derivatives of G_v are symmetric [8], that is,

  ∂G_u / ∂Ŷ_v = ∂G_v / ∂Ŷ_u.

Computing and comparing the derivatives, we get

  ∂G_u / ∂Ŷ_v = p_u^cont W_vu / N_u − δ_{u,v}  ≠  p_v^cont W_uv / N_v − δ_{u,v} = ∂G_v / ∂Ŷ_u,

since in general N_u ≠ N_v and p_u^cont ≠ p_v^cont. We conclude:

Theorem 1. There does not exist a function Q with continuous second partial derivatives such that the Adsorption algorithm converges when the gradient of Q equals zero.

In other words, we searched for a (well-behaved) function Q such that its local optimum would be the output of the Adsorption algorithm, and showed that this search will always fail. We use this negative result to define a new algorithm, which builds on the Adsorption algorithm and optimizes a function of the unknowns Ŷ_v for v ∈ V.
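The asymmetry at the heart of Theorem 1 is easy to check numerically. In the sketch below (a toy example of our own, with scalar labels), the Jacobian J[v, u] = ∂G_v/∂Ŷ_u of the converged update fails to be symmetric as soon as degrees or continuation probabilities differ across vertices:

```python
import numpy as np

W = np.array([[0.0, 1.0, 2.0],      # small symmetric graph, zero diagonal
              [1.0, 0.0, 0.0],
              [2.0, 0.0, 0.0]])
p_cont = np.array([0.5, 0.9, 0.3])  # arbitrary continuation probabilities
N = W.sum(axis=1)                   # N_v = sum_u W_uv
J = p_cont[:, None] * W / N[:, None] - np.eye(3)   # J[v, u] = dG_v / dY_u
print(np.allclose(J, J.T))          # False: G is not the gradient of any Q
```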

4 New Algorithm: Modified Adsorption (MAD)

Our starting point is Sec. 2.2, where we assume to have been given a weighted graph G = (V, E, W) and a matrix Y ∈ R^{n×(m+1)}_+, and we seek a labeling matrix Ŷ ∈ R^{n×(m+1)}_+. In this section it is more convenient to decompose the matrices Y and Ŷ into their columns, rather than rows. Specifically, we denote by Y_l ∈ R^n_+ the l-th column of Y and similarly by Ŷ_l ∈ R^n_+ the l-th column of Ŷ. We distinguish the rows and columns of the matrices Y and Ŷ using their indices: the columns are indexed with the label index l, while the rows are indexed with a vertex index (u or v).

We build on previous research [3, 17, 11] and construct an objective that reflects three requirements, as follows. First, for the labeled vertices we would like the output of the algorithm to be close to the a-priori given labels, that is, Y_v ≈ Ŷ_v. Second, for pairs of vertices that are close according to the input graph, we would like their labelings to be close, that is, Ŷ_u ≈ Ŷ_v if W_uv is large. Third, we want the output to be as uninformative as possible; this serves as additional regularization, that is, Ŷ_v ≈ r.

We now develop the objective in light of the three requirements. We use the Euclidean distance to measure discrepancy between two quantities, and start with the first requirement above:

  Σ_v p_v^inj Σ_l (Y_vl − Ŷ_vl)^2 = Σ_l Σ_v p_v^inj (Y_vl − Ŷ_vl)^2 = Σ_l (Y_l − Ŷ_l)^T S (Y_l − Ŷ_l),

where we define the diagonal matrix S ∈ R^{n×n} with S_vv = p_v^inj if vertex v is labeled and S_vv = 0 otherwise. The matrix S captures the intuition that for different vertices we enforce the labeling of the algorithm to match the a-priori labeling to a different extent.

Next, we modify the similarity weights between vertices to take into account the difference in degree of the various vertices. In particular, we define W'_vu = p_v^cont × W_vu. Thus, a vertex u will not be similar to a vertex v if either the input weights W_vu are low or the vertex v has a large degree (p_v^cont is low). We write the second requirement as

  Σ_{v,u} W'_vu ||Ŷ_v − Ŷ_u||^2 = Σ_l Σ_{v,u} W'_vu (Ŷ_vl − Ŷ_ul)^2
                               = Σ_l ( Σ_v D_vv Ŷ_vl^2 + Σ_u D̄_uu Ŷ_ul^2 − 2 Σ_{v,u} W'_vu Ŷ_vl Ŷ_ul )
                               = Σ_l Ŷ_l^T L Ŷ_l,

where L = D + D̄ − W' − W'^T, and D, D̄ are n × n diagonal matrices with D_vv = Σ_u W'_vu and D̄_vv = Σ_u W'_uv.

Finally, we define the matrix R ∈ R^{n×(m+1)}_+, where the v-th row of R equals p_v^abnd · r (r is defined in Sec. 2.2). In other words, the first m columns of R equal zero, and the last ((m+1)-th) column holds the values p_v^abnd. The third requirement above is thus written as

  Σ_v ||Ŷ_v − R_v||^2 = Σ_l ||Ŷ_l − R_l||^2.
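The matrices S, W', L and R can be assembled directly from these definitions. A dense numpy sketch follows (sparse matrices would be used at scale, and the helper below is ours, not the authors' code):

```python
import numpy as np

def mad_matrices(W, p_inj, p_cont, p_abnd, is_labeled, m):
    """Build S, W', L and R of Sec. 4 from an n x n weight matrix W."""
    n = W.shape[0]
    S = np.diag(np.where(is_labeled, p_inj, 0.0))  # seed-matching weights
    Wp = p_cont[:, None] * W                       # W'_vu = p_cont_v * W_vu
    D = np.diag(Wp.sum(axis=1))                    # D_vv    = sum_u W'_vu
    Dbar = np.diag(Wp.sum(axis=0))                 # Dbar_vv = sum_u W'_uv
    L = D + Dbar - Wp - Wp.T
    R = np.zeros((n, m + 1))                       # zero for all true labels...
    R[:, -1] = p_abnd                              # ...p_abnd in the dummy column
    return S, Wp, L, R
```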

We combine the three terms above into a single objective (which we would like to minimize), giving each term a different importance using the weights µ1, µ2, µ3:

  C(Ŷ) = Σ_l [ µ1 (Y_l − Ŷ_l)^T S (Y_l − Ŷ_l) + µ2 Ŷ_l^T L Ŷ_l + µ3 ||Ŷ_l − R_l||^2 ].   (5)

The objective in Equation (5) is similar to the Quadratic Cost Criteria [3], with the exception that the matrices S and L have different constructions. We remind the reader that Ŷ_l, Y_l and R_l are the l-th columns (each of size n × 1) of the matrices Ŷ, Y and R, respectively.

4.1 Solving the Optimization Problem

We now develop an algorithm to optimize (5), similar to the Quadratic Cost Criteria [3]. Differentiating Equation (5) w.r.t. Ŷ_l, we get

  (1/2) ∂C(Ŷ)/∂Ŷ_l = µ1 S (Ŷ_l − Y_l) + µ2 L Ŷ_l + µ3 (Ŷ_l − R_l)
                   = (µ1 S + µ2 L + µ3 I) Ŷ_l − (µ1 S Y_l + µ3 R_l).   (6)

Differentiating once more, we get

  (1/2) ∂²C(Ŷ)/∂Ŷ_l ∂Ŷ_l = µ1 S + µ2 L + µ3 I,

and since both S and L are symmetric and positive semidefinite (PSD) matrices, the Hessian is PSD as well. Hence, the optimal minimum is obtained by setting the first derivative (i.e. Equation (6)) to 0, as follows:

  (µ1 S + µ2 L + µ3 I) Ŷ_l = µ1 S Y_l + µ3 R_l.

Hence, the new labels (Ŷ) can be obtained by a matrix inversion followed by a matrix multiplication. However, this can be quite expensive when large matrices are involved. A more efficient way to obtain the new label scores is to solve a set of linear equations using the Jacobi iteration, which we now describe.

4.2 Jacobi Method

Given the linear system (in x)

  M x = b,

the Jacobi iterative algorithm defines the approximate solution at the (t+1)-th iteration, given the solution at the t-th iteration, as follows:

  x_i^(t+1) = (1 / M_ii) ( b_i − Σ_{j≠i} M_ij x_j^(t) ).   (7)
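A generic Jacobi solver for Eq. (7), applied to all label columns at once, can be sketched as follows; the function and its stopping rule are our assumptions:

```python
import numpy as np

def jacobi(M, B, max_iter=1000, tol=1e-6):
    """Solve M X = B by Jacobi iteration (Eq. (7)); one column of X per label."""
    d = np.diag(M).copy()                  # M_ii, assumed nonzero
    off = M - np.diag(d)                   # off-diagonal part of M
    X = np.zeros_like(B, dtype=float)
    for _ in range(max_iter):
        X_new = (B - off @ X) / d[:, None]
        if np.abs(X_new - X).max() < tol:
            return X_new
        X = X_new
    return X
```

With M = µ1 S + µ2 L + µ3 I and B = µ1 S Y + µ3 R (e.g. `jacobi(mu1 * S + mu2 * L + mu3 * np.eye(n), mu1 * (S @ Y) + mu3 * R)`), this recovers the label scores Ŷ; convergence is guaranteed when M is strictly diagonally dominant, as shown in Sec. 4.3.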

We apply the iterative algorithm to our problem by substituting x = Ŷ_l, M = µ1 S + µ2 L + µ3 I and b = µ1 S Y_l + µ3 R_l in (7):

  Ŷ_vl^(t+1) = (1 / M_vv) ( µ1 (S Y_l)_v + µ3 R_vl − Σ_{u≠v} M_vu Ŷ_ul^(t) ).   (8)

Let us compute the values of (S Y_l)_v, M_vu (u ≠ v) and M_vv. First,

  M_vu = µ1 S_vu + µ2 L_vu + µ3 I_vu   (u ≠ v).

Note that since S and I are diagonal, we have S_vu = 0 and I_vu = 0 for u ≠ v. Substituting the value of L, we get

  M_vu = µ2 L_vu = µ2 (D_vu + D̄_vu − W'_vu − W'_uv),

and as before the matrices D and D̄ are diagonal, and thus D_vu + D̄_vu = 0 for u ≠ v. Finally, substituting the values of W'_vu and W'_uv, we get

  M_vu = −µ2 (p_v^cont W_vu + p_u^cont W_uv).   (9)

We now compute the second quantity:

  (S Y_l)_v = S_vv Y_vl + Σ_{t≠v} S_vt Y_tl = p_v^inj Y_vl,

where the second term equals zero since S is diagonal. Finally, the third term:

  M_vv = µ1 S_vv + µ2 L_vv + µ3 I_vv
       = µ1 p_v^inj + µ2 (D_vv + D̄_vv − W'_vv − W'_vv) + µ3
       = µ1 p_v^inj + µ2 Σ_{u≠v} (p_v^cont W_vu + p_u^cont W_uv) + µ3.

Plugging the above equations into (8), and using the fact that the diagonal elements of W are zero, we get

  Ŷ_v^(t+1) = (1 / M_vv) ( µ1 p_v^inj Y_v + µ2 Σ_u (p_v^cont W_vu + p_u^cont W_uv) Ŷ_u^(t) + µ3 p_v^abnd r ).   (10)

We call the new algorithm MAD, for Modified Adsorption; it is summarized in Algorithm 2. Note that for graphs G that are invariant to permutations of the vertices, and setting µ1 = 2µ2 = µ3 = 1, MAD reduces to the Adsorption algorithm.
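Specializing the Jacobi step to the quantities just derived gives the per-vertex update of Eq. (10), which avoids forming M explicitly. A compact numpy sketch of ours follows; it updates all vertices synchronously, whereas the pseudocode's for-loop would also permit in-place updates.

```python
import numpy as np

def mad(W, Y, p_inj, p_cont, p_abnd, mu1=1.0, mu2=1.0, mu3=1.0,
        max_iter=1000, tol=1e-4):
    """Modified Adsorption via the update of Eq. (10); W has zero diagonal."""
    r = np.zeros(Y.shape[1]); r[-1] = 1.0
    A = p_cont[:, None] * W                       # A_vu = p_cont_v W_vu ...
    A = A + A.T                                   # ... + p_cont_u W_uv
    M = mu1 * p_inj + mu2 * A.sum(axis=1) + mu3   # diagonal entries M_vv
    Y_hat = Y.astype(float).copy()
    for _ in range(max_iter):
        num = (mu1 * p_inj[:, None] * Y
               + mu2 * (A @ Y_hat)
               + mu3 * p_abnd[:, None] * r)
        Y_new = num / M[:, None]
        if np.abs(Y_new - Y_hat).max() < tol:
            return Y_new
        Y_hat = Y_new
    return Y_hat
```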

Algorithm 2 Modified Adsorption (MAD) Algorithm
Input:
  - Graph: G = (V, E, W)
  - Prior labeling: Y_v ∈ R^{m+1} for v ∈ V
  - Probabilities: p_v^inj, p_v^cont, p_v^abnd for v ∈ V
Output:
  - Label Scores: Ŷ_v for v ∈ V
1: Ŷ_v ← Y_v for v ∈ V {Initialization}
2: M_vv ← µ1 p_v^inj + µ2 Σ_{u≠v} (p_v^cont W_vu + p_u^cont W_uv) + µ3
3: repeat
4:   D_v ← Σ_u (p_v^cont W_vu + p_u^cont W_uv) Ŷ_u
5:   for all v ∈ V do
6:     Ŷ_v ← (1 / M_vv) (µ1 p_v^inj Y_v + µ2 D_v + µ3 p_v^abnd r)
7:   end for
8: until convergence

4.3 Convergence

A sufficient condition for the iterative process of Equation (7) to converge is that M is strictly diagonally dominant [10], that is, if

  |M_vv| > Σ_{u≠v} |M_vu|   for all values of v.

We have

  |M_vv| − Σ_{u≠v} |M_vu| = µ1 p_v^inj + µ2 Σ_{u≠v} (p_v^cont W_vu + p_u^cont W_uv) + µ3
                            − µ2 Σ_{u≠v} (p_v^cont W_vu + p_u^cont W_uv)
                          = µ1 p_v^inj + µ3.   (11)

Note that p_v^inj ≥ 0 for all v and that µ3 is a free parameter in (11). Thus we can guarantee strict diagonal dominance (and hence convergence) by setting µ3 > 0.

5 Extensions: Non-Mutually Exclusive Labels

In many learning settings, labels are not mutually exclusive. For example, in hierarchical classification, labels are organized in a tree. In this section, we extend the MAD algorithm to handle dependence among labels. This can be done easily using our new formulation, which is based on objective optimization. Specifically, we add additional terms to the objective for each pair of dependent labels. Let C be an m × m matrix, where m is the number of labels (excluding the dummy label) as before. Each entry C_ll' of this matrix C represents the dependence or similarity between the labels l and l'. By encoding dependence in this pairwise fashion, we can capture dependencies among labels represented as arbitrary graphs. The extended objective is shown in Equation (12):

  C(Ŷ) = Σ_l [ µ1 (Y_l − Ŷ_l)^T S (Y_l − Ŷ_l) + µ2 Ŷ_l^T L Ŷ_l + µ3 ||Ŷ_l − R_l||^2 ]
         + µ4 Σ_i Σ_{l,l'} C_ll' (Ŷ_il − Ŷ_il')^2.   (12)

The last term in Equation (12) penalizes the algorithm if similar labels (as determined by the matrix C) are assigned different scores, with the severity of the penalty controlled by µ4. Analyzing the objective in Equation (12) in the manner outlined in Section 4, we arrive at the update rule shown in Equation (13):

  Ŷ_vl^(t+1) = (1 / M_vl) ( µ1 p_v^inj Y_vl + µ2 Σ_u (p_v^cont W_vu + p_u^cont W_uv) Ŷ_ul^(t)
                            + µ3 p_v^abnd r_l + µ4 Σ_{l'} C_ll' Ŷ_vl'^(t) ),   (13)

where

  M_vl = µ1 p_v^inj + µ2 Σ_{u≠v} (p_v^cont W_vu + p_u^cont W_uv) + µ3 + µ4 Σ_{l'} C_ll'.

Replacing line 6 in MAD (Algorithm 2) with Equation (13), we end up with a new algorithm: Modified Adsorption for Dependent Labels (MADDL). In Section 6.4, we use MADDL to obtain smooth rankings for sentiment classification.
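Equation (13) differs from the MAD update only by the label-dependency terms, so the MAD sketch above extends naturally. In the version below (our sketch), the m × m matrix C is padded with a zero row and column so that its shape matches the (m+1) label columns, leaving the dummy label unaffected:

```python
import numpy as np

def maddl(W, Y, p_inj, p_cont, p_abnd, C,
          mu1=1.0, mu2=1.0, mu3=100.0, mu4=1e3, max_iter=1000, tol=1e-4):
    """MADDL: the MAD update of Eq. (10) plus the label terms of Eq. (13)."""
    n, m1 = Y.shape
    r = np.zeros(m1); r[-1] = 1.0
    Cp = np.zeros((m1, m1)); Cp[:-1, :-1] = C      # pad C for the dummy label
    A = p_cont[:, None] * W
    A = A + A.T
    # Normalizer M_vl now depends on the label through sum_l' C_ll'.
    M = (mu1 * p_inj + mu2 * A.sum(axis=1) + mu3)[:, None] + mu4 * Cp.sum(axis=1)
    Y_hat = Y.astype(float).copy()
    for _ in range(max_iter):
        num = (mu1 * p_inj[:, None] * Y
               + mu2 * (A @ Y_hat)
               + mu3 * p_abnd[:, None] * r
               + mu4 * (Y_hat @ Cp.T))             # sum_l' C_ll' * Y_hat_vl'
        Y_new = num / M
        if np.abs(Y_new - Y_hat).max() < tol:
            return Y_new
        Y_hat = Y_new
    return Y_hat
```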

6 Experimental Results

We compare MAD with various state-of-the-art learning algorithms on two tasks, text classification (Sec. 6.1) and sentiment analysis (Sec. 6.2), and demonstrate its effectiveness. In Sec. 6.3, we also provide experimental evidence showing that MAD is quite insensitive to wide variations in the values of its hyper-parameters. In Sec. 6.4, we present evidence showing how MADDL can be used to obtain smooth rankings for sentiment prediction, a particular instantiation of classification with non-mutually exclusive labels. For the experiments reported in this section involving Adsorption, MAD and MADDL, the a-priori label matrix Y was column-normalized so that all labels have equal overall injection score. Also, the dummy label was ignored during evaluation, as its main role is to add regularization during the learning phase only.

6.1 Text Classification

World Wide Knowledge Base (WebKB) is a text classification dataset widely used for evaluating transductive learning algorithms. Most recently, the dataset was used by Subramanya and Bilmes [11], who kindly shared their preprocessed complete WebKB graph with us. There are a total of 4,204 vertices in the graph, with the nodes labeled with one of four categories: course, faculty, project, student. A K-NN graph is created from this complete graph by retaining only the top K neighbors of each node, where the value of K is treated as a hyper-parameter.

We follow the experimental protocol in [11]. The dataset was randomly partitioned into four sets. A transduction set was generated by first selecting one of the four splits at random and then sampling n_l documents from it; the remaining three sets are used as the test set for evaluation. This process was repeated 21 times to generate as many training-test splits. The first split was used to tune the hyper-parameters, with search over the following: K ∈ {10, 50, 100, 500, 1000, 2000, 4204}, µ2, µ3 ∈ {1e-8, 1e-4, 1e-2, 1, 10, 1e2, 1e3}. The value of µ1 was set to 1 for this experiment. Both for Adsorption and MAD, the optimal value of K was 1,000. Furthermore, the optimal values of the other parameters were found to be µ2 = µ3 = 1. As in previous work [11], we use Precision-Recall Break Even Point (PRBEP) [9] as the evaluation metric. The same evaluation measure, dataset and experimental protocol make the results reported here directly comparable to those reported previously [11].

Table 1. PRBEP for the WebKB data set with n_l = 48 training and 3148 testing instances. All results are averages over 20 randomly generated transduction sets. The last row is the macro-average over all the classes. MAD is the proposed approach. Results for SVM, TSVM, SGT, LP and AM are reproduced from Table 2 of [11]. (Rows: course, faculty, project, student, average; columns: SVM, TSVM, SGT, LP, AM, Adsorption, MAD.)

For easier readability, the results from Table 2 of Subramanya and Bilmes [11] are cited in Table 1 of this paper, comparing the performance of the Adsorption-based methods (Adsorption and MAD) to many previously proposed approaches: SVM [6], Transductive-SVM [6], Spectral Graph Transduction (SGT) [7], Label Propagation (LP) [16] and Alternating Minimization (AM) [11]. The first four rows in Table 1 show PRBEP for individual categories, with the last row showing the macro-averaged PRBEP across all categories. The MAD algorithm achieves the best overall performance (for n_l = 48). Performance comparisons of MAD and Adsorption for increasing n_l are shown in Figure 1. Comparing these results against Fig. 2 in Subramanya and Bilmes [11], it seems that MAD outperforms all the other methods compared (except AM [11]) for all values of n_l. MAD performs better than AM for n_l = 48, but achieves the second best solution for the other three values of n_l. We are currently investigating why MAD is best in settings with fewer labeled examples.

Fig. 1. PRBEP (macro-averaged) for the WebKB dataset with 3148 testing instances. All results are averages over 20 randomly generated transduction sets.

6.2 Sentiment Analysis

The goal of sentiment analysis is to automatically assign polarity scores to text collections, with a high score reflecting positive sentiment (user likes) and a low score reflecting negative sentiment (user dislikes). In this section, we report results on sentiment classification in the transductive setting. From Section 6.1 and [11], we observe that Label Propagation (LP) [16] is one of the best performing L2-norm based transductive learning algorithms. Hence, we compare the performance of MAD against Adsorption and LP. For the experiments in this section, we use a set of 4,768 user reviews from the electronics domain [4]. Each review is assigned one of four scores: 1 (worst), 2, 3, 4 (best). We create a K-NN graph from these reviews using cosine similarity as the measure of similarity between reviews.
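The graph-construction step just described can be sketched as follows (a minimal version of ours; the symmetrization rule and tie handling are not specified in the paper):

```python
import numpy as np

def knn_cosine_graph(X, K):
    """Build a symmetric K-NN graph with cosine-similarity edge weights."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                      # pairwise cosine similarities
    np.fill_diagonal(S, 0.0)           # no self-loops
    W = np.zeros_like(S)
    for v in range(S.shape[0]):
        nbrs = np.argsort(S[v])[-K:]   # indices of the K most similar reviews
        W[v, nbrs] = S[v, nbrs]
    return np.maximum(W, W.T)          # keep an edge if either endpoint chose it
```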

We created 5 training-test splits from this data using the process described in Section 6.1. One split was used to tune the hyper-parameters, while the rest were used for training and evaluation. Hyper-parameter search was carried out over the following ranges: K ∈ {10, 100, 500}, µ1 ∈ {1, 100}, µ2 ∈ {1e-4, 1, 10}, µ3 ∈ {1e-8, 1, 100, 1e3}. Precision is used as the evaluation metric. Comparisons of the different algorithms for varying numbers of labeled instances are shown in Figure 2. From this, we note that MAD and Adsorption outperform LP, while Adsorption and MAD are competitive.

6.3 Parameter Sensitivity

We evaluated the sensitivity of MAD to variations of its µ2 and µ3 hyper-parameters, with all other hyper-parameters fixed. We used a 2000-NN graph constructed from the WebKB dataset and a 500-NN graph constructed from the Sentiment dataset. In both cases, 100 nodes were labeled. We tried three values each for µ2 and µ3, spanning at least 3 orders of magnitude. For WebKB, the PRBEP varied only within a narrow band, and for the sentiment data the precision likewise varied within a narrow range for µ2 ≤ µ3, while precision dropped to 25 with µ2 > µ3. This underscores the need for regularization in these models, which is enforced with a high µ3. We note that in both cases the algorithm is less sensitive to the value of µ2 than to the value of µ3. In general, we have found that setting µ3 one to two orders of magnitude larger than µ2 is a reasonable choice. We have also found that the MAD algorithm is quite insensitive to variations in µ1. For example, on the sentiment dataset we tried two values of µ1 spanning two orders of magnitude, with the other hyper-parameters fixed; in this case, precision again varied only within a narrow range.

Fig. 2. Precision for the Sentiment Analysis dataset with 3568 testing instances. All results are averages over 4 randomly generated transduction sets.

Table 2. Average prediction loss at ranks 1 & 2 (for various values of µ4) for sentiment prediction. All results are averaged over 4 runs. See Section 6.4 for details. (Rows: Prediction Loss (L1) at rank 1, Prediction Loss (L1) at rank 2; columns: increasing values of µ4, up to 1e3 and 1e4.)

6.4 Smooth Ranking for Sentiment Analysis

We revisit the sentiment prediction problem of Section 6.2, but with the additional requirement that the ranking of the labels (1, 2, 3, 4) generated by the algorithm should be smooth; i.e. we prefer the ranking 1 > 2 > 3 > 4 over the ranking 1 > 4 > 3 > 2, where 3 > 2 means that the algorithm ranks label 3 higher than label 2. The ranking 1 > 2 > 3 > 4 is smoother as it does not involve the rough transition 1 > 4, which is present in 1 > 4 > 3 > 2. We use the framework of stating requirements as an objective to be optimized. We use the MADDL algorithm of Sec. 5, initializing the matrix C as follows (assuming that labels 1 and 2 are related, while labels 3 and 4 are related): C_12 = C_21 = 1, C_34 = C_43 = 1, with all other entries in matrix C set to 0. Such constraints (along with an appropriate µ4 in Equation (12)) will force the algorithm to assign similar scores to dependent labels, thereby assigning them adjacent ranks in the final output. MAD and MADDL were then used to predict ranked labels for the vertices of a 1000-NN graph constructed from the sentiment data used in Sec. 6.2, with 100 randomly selected nodes labeled. For this experiment we set µ1 = µ2 = 1, µ3 = 100. The L1-loss between the gold label and the labels predicted at ranks r = 1, 2, for increasing values of µ4, is given in Table 2. Note that MADDL with µ4 = 0 corresponds to MAD.
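For reference, the label-dependency matrix used here is tiny; with 0-based label indices it can be written as below and passed to the MADDL sketch of Section 5 (the variable names are ours):

```python
import numpy as np

# Smooth-ranking dependencies of Sec. 6.4: labels 1 and 2 are related,
# as are labels 3 and 4 (shown 0-based: indices 0,1 and 2,3).
C = np.zeros((4, 4))
C[0, 1] = C[1, 0] = 1.0
C[2, 3] = C[3, 2] = 1.0
# Used with mu1 = mu2 = 1, mu3 = 100 and increasing mu4, per this section.
```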

From Table 2 we observe that with increasing µ4, MADDL ranks at r = 2 a label which is related (as per C) to the top-ranked label at r = 1, while at the same time maintaining the quality of prediction at r = 1 (first row of Table 2), thereby ensuring a smoother ranking. From Table 2 we also observe that MADDL is insensitive to variations of µ4 beyond a certain range. This suggests that µ4 may be set to a (high) value, and that tuning it may not be necessary.

Another view of the same phenomenon is shown in Fig. 3 and Fig. 4, where we plot the counts of the top predicted label pairs (order of prediction ignored for better readability) generated by the MAD and MADDL algorithms.

Fig. 3. Plot of counts of top predicted label pairs (order ignored) in MAD's predictions, with µ1 = µ2 = 1, µ3 = 100.

Fig. 4. Plot of counts of top label pairs (order ignored) in MADDL's predictions (Section 5), with µ1 = µ2 = 1, µ3 = 100, µ4 = 1e3.

By comparing these two figures we observe that label pairs (e.g. (2,1) and (4,3)) favored by C (above) are more frequent in MADDL's predictions than in MAD's. At the same time, non-smooth predictions (e.g. (4,1)) are virtually absent in MADDL's predictions, while they are quite frequent in MAD's. This clearly demonstrates MADDL's ability to generate smooth predictions in a principled way, and more generally its ability to handle data with non-mutually exclusive or dependent labels.

7 Related Work

LP [16] is one of the first graph-based semi-supervised algorithms. Even though there are several similarities between LP and MAD, there are important differences: (1) LP does not allow the labels on seeded nodes to change (while MAD does); as was pointed out previously [3], this can be problematic in the case of noisy seeds. (2) There is no way for LP to express label uncertainty about a node; MAD can accomplish this by assigning a high score to the dummy label. More recently, a KL-minimization based algorithm was presented in [11]. Further investigation is necessary to determine the merits of each approach. For a general introduction to the area of graph-based semi-supervised learning, the reader is referred to the survey by Zhu [15].

8 Conclusion

In this paper we have analyzed the Adsorption algorithm [1] and proposed a new graph-based semi-supervised learning algorithm, MAD. We have developed an efficient (iterative) solution for our convex optimization based learning problem. We have also listed the conditions under which the algorithm is guaranteed to converge. The transition to an optimization-based learning algorithm allows us to easily extend the algorithm to handle data with non-mutually exclusive labels, resulting in the MADDL algorithm. We have provided experimental evidence demonstrating the effectiveness of our proposed methods. As part of future work, we plan to evaluate the proposed methods further and to apply the MADDL method to problems with dependent labels (e.g. Information Extraction).

Acknowledgment

This research is partially supported by NSF grant #IIS. The authors would like to thank F. Pereira and D. Sivakumar for useful discussions.

References

1. S. Baluja, R. Seth, D. Sivakumar, Y. Jing, J. Yagnik, S. Kumar, D. Ravichandran, and M. Aly. Video suggestion and discovery for YouTube: taking random walks through the view graph. In WWW, 2008.
2. M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7, 2006.
3. Y. Bengio, O. Delalleau, and N. Le Roux. Semi-Supervised Learning, chapter Label Propagation and Quadratic Criterion. 2006.
4. J. Blitzer, M. Dredze, and F. Pereira. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, 2007.
5. P. Indyk and J. Matoušek. Low-distortion embeddings of finite metric spaces. Handbook of Discrete and Computational Geometry, 2004.
6. T. Joachims. Transductive inference for text classification using support vector machines. In ICML, 1999.
7. T. Joachims. Transductive learning via spectral graph partitioning. In ICML, 2003.
8. V. J. Katz. The history of Stokes' theorem. Mathematics Magazine, 52(3), 1979.
9. V. Raghavan, P. Bollmann, and G. Jung. A critical investigation of recall and precision as measures of retrieval system performance. ACM TOIS, 7(3), 1989.
10. Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2003.
11. A. Subramanya and J. Bilmes. Soft-supervised learning for text classification. In EMNLP, 2008.
12. M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. NIPS, 2001.
13. P. P. Talukdar, J. Reisinger, M. Pasca, D. Ravichandran, R. Bhagat, and F. Pereira. Weakly-supervised acquisition of labeled class instances using graph random walks. In EMNLP, 2008.
14. J. Wang, T. Jebara, and S. Chang. Graph transduction via alternating minimization. In ICML, 2008.
15. X. Zhu. Semi-supervised learning literature survey. 2005.
16. X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, CMU CALD, 2002.
17. X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. ICML, 2003.


More information

Derivation of 2D Power-Law Velocity Distribution Using Entropy Theory

Derivation of 2D Power-Law Velocity Distribution Using Entropy Theory Entrop 3, 5, -3; doi:.339/e54 Article OPEN ACCESS entrop ISSN 99-43 www.mdpi.com/jornal/entrop Deriation of D Power-Law Velocit Distribtion Using Entrop Theor Vija P. Singh,, *, stao Marini 3 and Nicola

More information

Solving a System of Equations

Solving a System of Equations Solving a System of Eqations Objectives Understand how to solve a system of eqations with: - Gass Elimination Method - LU Decomposition Method - Gass-Seidel Method - Jacobi Method A system of linear algebraic

More information

Model (In-)Validation from a H and µ perspective

Model (In-)Validation from a H and µ perspective Model (In-)Validation from a H and µ perspectie Wolfgang Reinelt Department of Electrical Engineering Linköping Uniersity, S-581 83 Linköping, Seden WWW: http://.control.isy.li.se/~olle/ Email: olle@isy.li.se

More information

Worst-case analysis of the LPT algorithm for single processor scheduling with time restrictions

Worst-case analysis of the LPT algorithm for single processor scheduling with time restrictions OR Spectrm 06 38:53 540 DOI 0.007/s009-06-043-5 REGULAR ARTICLE Worst-case analysis of the LPT algorithm for single processor schedling with time restrictions Oliver ran Fan Chng Ron Graham Received: Janary

More information

Reducing Conservatism in Flutterometer Predictions Using Volterra Modeling with Modal Parameter Estimation

Reducing Conservatism in Flutterometer Predictions Using Volterra Modeling with Modal Parameter Estimation JOURNAL OF AIRCRAFT Vol. 42, No. 4, Jly Agst 2005 Redcing Conservatism in Fltterometer Predictions Using Volterra Modeling with Modal Parameter Estimation Rick Lind and Joao Pedro Mortaga University of

More information

Applying Fuzzy Set Approach into Achieving Quality Improvement for Qualitative Quality Response

Applying Fuzzy Set Approach into Achieving Quality Improvement for Qualitative Quality Response Proceedings of the 007 WSES International Conference on Compter Engineering and pplications, Gold Coast, stralia, Janary 17-19, 007 5 pplying Fzzy Set pproach into chieving Qality Improvement for Qalitative

More information

Restricted cycle factors and arc-decompositions of digraphs. J. Bang-Jensen and C. J. Casselgren

Restricted cycle factors and arc-decompositions of digraphs. J. Bang-Jensen and C. J. Casselgren Restricted cycle factors and arc-decompositions of digraphs J. Bang-Jensen and C. J. Casselgren REPORT No. 0, 0/04, spring ISSN 0-467X ISRN IML-R- -0-/4- -SE+spring Restricted cycle factors and arc-decompositions

More information

arxiv: v2 [cs.si] 27 Apr 2017

arxiv: v2 [cs.si] 27 Apr 2017 Opinion Dynamics in Networks: Conergence, Stability and Lack of Explosion arxi:1607.038812 [cs.si] 27 Apr 2017 Tng Mai Georgia Institte of Technology maitng89@gatech.ed Vijay V. Vazirani Georgia Institte

More information

Convergence analysis of ant colony learning

Convergence analysis of ant colony learning Delft University of Technology Delft Center for Systems and Control Technical report 11-012 Convergence analysis of ant colony learning J van Ast R Babška and B De Schtter If yo want to cite this report

More information

Analysis of Enthalpy Approximation for Compressed Liquid Water

Analysis of Enthalpy Approximation for Compressed Liquid Water Analysis of Entalpy Approximation for Compressed Liqid Water Milioje M. Kostic e-mail: kostic@ni.ed Nortern Illinois Uniersity, DeKalb, IL 60115-2854 It is cstom to approximate solid and liqid termodynamic

More information

arxiv: v1 [math.co] 25 Sep 2016

arxiv: v1 [math.co] 25 Sep 2016 arxi:1609.077891 [math.co] 25 Sep 2016 Total domination polynomial of graphs from primary sbgraphs Saeid Alikhani and Nasrin Jafari September 27, 2016 Department of Mathematics, Yazd Uniersity, 89195-741,

More information

Concept of Stress at a Point

Concept of Stress at a Point Washkeic College of Engineering Section : STRONG FORMULATION Concept of Stress at a Point Consider a point ithin an arbitraril loaded deformable bod Define Normal Stress Shear Stress lim A Fn A lim A FS

More information

BLOOM S TAXONOMY. Following Bloom s Taxonomy to Assess Students

BLOOM S TAXONOMY. Following Bloom s Taxonomy to Assess Students BLOOM S TAXONOMY Topic Following Bloom s Taonomy to Assess Stdents Smmary A handot for stdents to eplain Bloom s taonomy that is sed for item writing and test constrction to test stdents to see if they

More information

Sensitivity Analysis in Bayesian Networks: From Single to Multiple Parameters

Sensitivity Analysis in Bayesian Networks: From Single to Multiple Parameters Sensitivity Analysis in Bayesian Networks: From Single to Mltiple Parameters Hei Chan and Adnan Darwiche Compter Science Department University of California, Los Angeles Los Angeles, CA 90095 {hei,darwiche}@cs.cla.ed

More information

Network Coding for Multiple Unicasts: An Approach based on Linear Optimization

Network Coding for Multiple Unicasts: An Approach based on Linear Optimization Network Coding for Mltiple Unicasts: An Approach based on Linear Optimization Danail Traskov, Niranjan Ratnakar, Desmond S. Ln, Ralf Koetter, and Mriel Médard Abstract In this paper we consider the application

More information

CHANNEL SELECTION WITH RAYLEIGH FADING: A MULTI-ARMED BANDIT FRAMEWORK. Wassim Jouini and Christophe Moy

CHANNEL SELECTION WITH RAYLEIGH FADING: A MULTI-ARMED BANDIT FRAMEWORK. Wassim Jouini and Christophe Moy CHANNEL SELECTION WITH RAYLEIGH FADING: A MULTI-ARMED BANDIT FRAMEWORK Wassim Joini and Christophe Moy SUPELEC, IETR, SCEE, Avene de la Bolaie, CS 47601, 5576 Cesson Sévigné, France. INSERM U96 - IFR140-

More information

Flood flow at the confluence of compound river channels

Flood flow at the confluence of compound river channels Rier Basin Management VIII 37 Flood flow at the conflence of compond rier channels T. Ishikawa 1, R. Akoh 1 & N. Arai 2 1 Department of Enironmental Science and Technology, Tokyo Institte of Technology,

More information

THE DIFFERENTIAL GEOMETRY OF REGULAR CURVES ON A REGULAR TIME-LIKE SURFACE

THE DIFFERENTIAL GEOMETRY OF REGULAR CURVES ON A REGULAR TIME-LIKE SURFACE Dynamic Systems and Applications 24 2015 349-360 TH DIFFRNTIAL OMTRY OF RULAR CURVS ON A RULAR TIM-LIK SURFAC MIN OZYILMAZ AND YUSUF YAYLI Department of Mathematics ge Uniersity Bornoa Izmir 35100 Trkey

More information

1 Undiscounted Problem (Deterministic)

1 Undiscounted Problem (Deterministic) Lectre 9: Linear Qadratic Control Problems 1 Undisconted Problem (Deterministic) Choose ( t ) 0 to Minimize (x trx t + tq t ) t=0 sbject to x t+1 = Ax t + B t, x 0 given. x t is an n-vector state, t a

More information

Spring, 2008 CIS 610. Advanced Geometric Methods in Computer Science Jean Gallier Homework 1, Corrected Version

Spring, 2008 CIS 610. Advanced Geometric Methods in Computer Science Jean Gallier Homework 1, Corrected Version Spring, 008 CIS 610 Adanced Geometric Methods in Compter Science Jean Gallier Homework 1, Corrected Version Febrary 18, 008; De March 5, 008 A problems are for practice only, and shold not be trned in.

More information

Section 7.4: Integration of Rational Functions by Partial Fractions

Section 7.4: Integration of Rational Functions by Partial Fractions Section 7.4: Integration of Rational Fnctions by Partial Fractions This is abot as complicated as it gets. The Method of Partial Fractions Ecept for a few very special cases, crrently we have no way to

More information

Chapter 5 Computational Methods for Mixed Models

Chapter 5 Computational Methods for Mixed Models Chapter 5 Comptational Methods for Mixed Models In this chapter we describe some of the details of the comptational methods for fitting linear mixed models, as implemented in the lme4 package, and the

More information

arxiv: v3 [math.na] 7 Feb 2014

arxiv: v3 [math.na] 7 Feb 2014 Anisotropic Fast-Marching on cartesian grids sing Lattice Basis Redction Jean-Marie Mirebea arxi:1201.15463 [math.na] 7 Feb 2014 Febrary 10, 2014 Abstract We introdce a modification of the Fast Marching

More information

Step-Size Bounds Analysis of the Generalized Multidelay Adaptive Filter

Step-Size Bounds Analysis of the Generalized Multidelay Adaptive Filter WCE 007 Jly - 4 007 London UK Step-Size onds Analysis of the Generalized Mltidelay Adaptive Filter Jnghsi Lee and Hs Chang Hang Abstract In this paper we analyze the bonds of the fixed common step-size

More information

Numerical Model for Studying Cloud Formation Processes in the Tropics

Numerical Model for Studying Cloud Formation Processes in the Tropics Astralian Jornal of Basic and Applied Sciences, 5(2): 189-193, 211 ISSN 1991-8178 Nmerical Model for Stdying Clod Formation Processes in the Tropics Chantawan Noisri, Dsadee Skawat Department of Mathematics

More information