Understanding the Yarowsky Algorithm


Understanding the Yarowsky Algorithm
Steven Abney
University of Michigan

Many problems in computational linguistics are well suited for bootstrapping (semi-supervised learning) techniques. The Yarowsky algorithm is a well-known bootstrapping algorithm, but it is not mathematically well understood. This paper analyzes it as optimizing an objective function. More specifically, a number of variants of the Yarowsky algorithm (though not the original algorithm itself) are shown to optimize either likelihood or a closely related objective function K.

1. Introduction

Bootstrapping, or semi-supervised learning, has become an important topic in computational linguistics. For many language-processing tasks, there is an abundance of unlabeled data, but labeled data is lacking and too expensive to create in large quantities, making bootstrapping techniques desirable.

The Yarowsky algorithm (Yarowsky, 1995) was one of the first bootstrapping algorithms to become widely known in computational linguistics. The Yarowsky algorithm, in brief, consists of two loops. The inner loop or base learner is a supervised learning algorithm. Specifically, Yarowsky uses a simple decision list learner that considers rules of the form "If instance x contains feature f, then predict label j," and selects those rules whose precision on the training data is highest.

The outer loop is given a seed set of rules to start with. In each iteration, it uses the current set of rules to assign labels to unlabeled data. It selects those instances on which the base learner's predictions are most confident, and constructs a labeled training set from them. It then calls the inner loop to construct a new classifier (that is, a new set of rules), and the cycle repeats.

An alternative algorithm, co-training (Blum and Mitchell, 1998), has subsequently become more popular, perhaps in part because it has proven amenable to theoretical analysis (Dasgupta, Littman, and McAllester, 2001), in contrast to the Yarowsky algorithm, which is as yet mathematically poorly understood. The current paper aims to rectify that lack, increasing the attractiveness of the Yarowsky algorithm as an alternative to co-training. The Yarowsky algorithm does have the advantage of placing less restriction on the data sets it can be applied to. Co-training requires data attributes to be separable into two views that are conditionally independent given the target label; the Yarowsky algorithm makes no such assumption about its data.

In previous work, I did propose an assumption about the data called precision independence under which the Yarowsky algorithm could be shown effective (Abney, 2002). That assumption is ultimately unsatisfactory, however, not only because it restricts the data sets on which the algorithm can be shown effective, but for additional internal reasons. A detailed discussion would take us too far afield here, but suffice it to say that precision independence is a property that it would be preferable not to assume, but rather to derive from more basic properties of a data set; and that closer empirical study shows

that precision independence fails to be satisfied in some data sets on which the Yarowsky algorithm is effective.

Y-1/DL-EM-Λ    EM inner loop that uses labeled examples only
Y-1/DL-EM-X    EM inner loop that uses all examples
Y-1/DL-1-R     Near-original Yarowsky inner loop, no smoothing
Y-1/DL-1-VS    Near-original Yarowsky inner loop, variable smoothing
YS-P           Sequential update, anti-smoothing
YS-R           Sequential update, no smoothing
YS-FS          Sequential update, original Yarowsky smoothing

Table 1
The Yarowsky algorithm variants. Y-1/DL-EM reduces H; the others reduce K.

This paper proposes a different approach. Instead of making assumptions about the data, it views the Yarowsky algorithm as optimizing an objective function. We will show that several variants of the algorithm (though not the algorithm in precisely its original form) optimize either negative log likelihood H or an alternative objective function, K, that upper bounds H.

Ideally, we would like to show that the Yarowsky algorithm minimizes H. Unfortunately, we are not able to do so. But we are able to show that a variant of the Yarowsky algorithm, which we call Y-1/DL-EM, decreases H in each iteration. It combines the outer loop of the Yarowsky algorithm with a different inner loop based on the Expectation-Maximization (EM) algorithm.

A second proposed variant of the Yarowsky algorithm, Y-1/DL-1, has the advantage that its inner loop is very similar to the original Yarowsky inner loop, unlike Y-1/DL-EM, whose inner loop bears little resemblance to the original. Y-1/DL-1 has the disadvantage that it does not directly reduce H, but we show that it does reduce the alternative objective function K.

We also consider a third variant, YS. It differs from Y-1/DL-EM and Y-1/DL-1 in that it does sequential update (adding a single rule in each iteration), rather than parallel update (updating all rules in each iteration). Besides the intrinsic interest of sequential update, YS can be proven effective when using exactly the same smoothing method as used in the original Yarowsky algorithm, in contrast to Y-1/DL-1, which uses either no smoothing or a nonstandard variable smoothing. YS is proven to decrease K.

The Yarowsky algorithm variants that we consider are summarized in table 1. To the extent that these variants capture the essence of the original algorithm, we have a better formal understanding of its effectiveness. Even if the variants are deemed to depart substantially from the original algorithm, we have at least obtained a family of new bootstrapping algorithms that are mathematically understood.

2. The Generic Yarowsky Algorithm

2.1 The Original Algorithm Y-0

The original Yarowsky algorithm, which we call Y-0, is given in table 2. It is an iterative algorithm. One begins with a seed set Λ^{(0)} of labeled examples, and a set V^{(0)} of unlabeled examples. At each iteration, a classifier is constructed from the labeled examples; then the classifier is applied to the unlabeled examples to create a new labeled set.

To discuss the algorithm formally, we require some notation. We assume first a set of examples X and a feature set F_x for each x ∈ X. The set of examples with feature f is X_f. Note that x ∈ X_f if and only if f ∈ F_x. We also require a series of labelings Y^{(t)}, where t represents the iteration number.

(1) Given: examples X, and initial labeling Y^{(0)}
(2) For t ∈ {0, 1, ...}:
    (2.1) Train classifier on labeled examples (Λ^{(t)}, Y^{(t)}), where Λ^{(t)} = {x ∈ X : Y_x^{(t)} ≠ ⊥};
          the resulting classifier predicts label j for example x with probability π_x^{(t+1)}(j)
    (2.2) For each example x ∈ X:
        (2.2.1) Set ŷ = arg max_j π_x^{(t+1)}(j)
        (2.2.2) Set Y_x^{(t+1)} to:
                Y_x^{(0)}  if x ∈ Λ^{(0)}
                ŷ          if π_x^{(t+1)}(ŷ) > ζ
                ⊥          otherwise
    (2.3) If Y^{(t+1)} = Y^{(t)}, stop

Table 2
The generic Yarowsky algorithm (Y-0).
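To make the control flow of table 2 concrete, here is a minimal executable sketch of Y-0 in Python. It is not from the paper: the base learner is passed in as a function, and the names (train_base, UNLABELED) and the dictionary representation of the labeling are illustrative assumptions.

```python
UNLABELED = None  # stands in for the undefined label of unlabeled examples

def yarowsky_y0(X, Y0, train_base, zeta, max_iters=100):
    """Generic Yarowsky algorithm Y-0 (table 2), a sketch.

    X          -- list of examples
    Y0         -- dict: example index -> seed label (absent means unlabeled)
    train_base -- base learner: (examples, labels) -> predict function,
                  where predict(x) returns a dict label -> probability
    zeta       -- labeling threshold
    """
    Y = {i: Y0.get(i, UNLABELED) for i in range(len(X))}
    for _ in range(max_iters):
        labeled = [i for i in Y if Y[i] is not UNLABELED]
        predict = train_base([X[i] for i in labeled], [Y[i] for i in labeled])
        Y_new = {}
        for i, x in enumerate(X):
            if i in Y0:                      # seed labels are indelible
                Y_new[i] = Y0[i]
                continue
            pi = predict(x)                  # prediction distribution over labels
            yhat = max(pi, key=pi.get)       # ties broken by dict order here
            Y_new[i] = yhat if pi[yhat] > zeta else UNLABELED
        if Y_new == Y:                       # converged: labeling unchanged
            break
        Y = Y_new
    return Y
```

Any confidence-weighted supervised learner can be plugged in as train_base; section 3 specializes it to the decision list learner DL-0.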

X               set of examples, both labeled and unlabeled
Y               the current labeling; Y^{(t)} is the labeling at iteration t
Λ               the (current) set of labeled examples
V               the (current) set of unlabeled examples
x               an example index
f, g            feature indices
j, k            label indices
F_x             the features of example x
Y_x             the label of example x; the value is undefined (⊥) if x is unlabeled
X_f, Λ_f, V_f   examples (labeled examples, unlabeled examples) that have feature f
Λ_j, Λ_{fj}     examples with label j; with feature f and label j
m               the number of features of a given example: |F_x| (cf. eq. 12)
L               the number of labels
φ_x(j)          labeling distribution (eq. 5)
π_x(j)          prediction distribution (eq. 12; except for DL-0, which uses eq. 11)
θ_{fj}          score for rule f → j; we view θ_{f·} as the prediction distribution of f
ŷ               the label that maximizes π_x(j) for given x (eq. 1)
[[Φ]]           truth value of Φ: the value is 0 or 1
H               objective function, negative log likelihood (eq. 6)
H(p)            entropy of distribution p
H(p ‖ q)        cross entropy: −Σ_x p(x) log q(x) (cf. eqs. 2 and 3)
K               objective function, upper bound on H (eq. 20)
q_f(j)          (raw) precision of rule f → j (eq. 9)
q̃_f(j)          smoothed precision (eq. 10)
q̂_f(j)          peaked precision (eq. 25)
j†              the label that maximizes precision q_f(j) for a given feature f (eq. 26)
j‡              the label that maximizes rule score θ_{fj} for a given feature f (eq. 28)
u(·)            the uniform distribution

Table 3
Summary of notation.

We write Y_x^{(t)} for the label of example x under labeling Y^{(t)}. An unlabeled example is one for which Y_x^{(t)} is undefined, in which case we write Y_x^{(t)} = ⊥. We write V^{(t)} for the set of unlabeled examples, and Λ^{(t)} for the set of labeled examples. It will also be useful to have a notation for the set of examples with label j:

Λ_j^{(t)} ≡ {x ∈ X : Y_x^{(t)} = j}

Note that Λ^{(t)} is the disjoint union of the sets Λ_j^{(t)}. When t is clear from context, we drop the superscript (t) and write simply Λ, V, Y_x, etc. At the risk of ambiguity, we will also sometimes write Λ_f for the set of labeled examples with feature f, trusting to the index to discriminate between Λ_f (labeled examples with feature f) and Λ_j (labeled examples with label j). We always use f and g to represent features, and j and k to represent labels. The reader may wish to refer to table 3, which summarizes notation used throughout the paper.

In each iteration, the Yarowsky algorithm uses a supervised learner to train a classifier on the labeled examples. Let us call this supervised learner the base learning algorithm; it is a function from (X, Y^{(t)}) to a classifier π drawn from a space of classifiers Π. It is assumed that the classifier makes confidence-weighted predictions. That is, the classifier defines a scoring function π(x, j), and the predicted label for example x is

ŷ ≡ arg max_j π(x, j)    (1)

Ties are broken arbitrarily. Technically, we assume a fixed order over labels, and define

the maximization to return the first label in the ordering, in case of a tie. It will be convenient to assume that the scoring function is nonnegative and bounded, in which case we can normalize it to make π(x, j) a conditional distribution over labels j for a given example x. Henceforward, we write π_x(j) instead of π(x, j), understanding π_x to be a probability distribution over labels. We call this distribution the prediction distribution of the classifier on example x.

To complete an iteration of the Yarowsky algorithm, one recomputes labels for examples. Specifically, the label ŷ is assigned to example x if the score π_x(ŷ) exceeds a threshold ζ, called the labeling threshold. The new labeled set Λ^{(t+1)} contains all examples for which π_x(ŷ) > ζ. Relabeling applies only to examples in V^{(0)}. The labels for examples in Λ^{(0)} are indelible, because Λ^{(0)} constitutes the original manually labeled data, as opposed to data that has been labeled by the learning algorithm itself.

The algorithm continues until convergence. The particular base learning algorithm that Yarowsky uses is deterministic, in the sense that the classifier induced is a deterministic function of the labeled data. Hence, the algorithm is known to have converged whenever the labeling remains unchanged.

Note that the algorithm as stated leaves the base learning algorithm unspecified. We can distinguish between the generic Yarowsky algorithm Y-0, for which the base learning algorithm is an open parameter, and the specific Yarowsky algorithm, which includes a specification of the base learner. Informally, we call the generic algorithm the outer loop and the base learner the inner loop of the specific Yarowsky algorithm. The base learner that Yarowsky assumes is a decision list induction algorithm. We postpone discussion of it until section 3.1.

2.2 An Objective Function

Machine learning algorithms are typically designed to optimize some objective function that represents a formal measure of performance. The maximum likelihood criterion is the most commonly used objective function. Suppose we have a set of examples Λ with labels Y_x for x ∈ Λ, and a parametric family of models π_θ such that π(j | x; θ) represents the probability of assigning label j to example x, according to the model. The likelihood of θ is the probability of the full data set according to the model, viewed as a function of θ, and the maximum likelihood criterion instructs us to choose the parameter settings θ̂ that maximize likelihood, or equivalently, log likelihood:

l(θ) = log Π_{x∈Λ} π(Y_x | x; θ)
     = Σ_{x∈Λ} log π(Y_x | x; θ)
     = Σ_{x∈Λ} Σ_j [[j = Y_x]] log π(j | x; θ)

(The notation [[Φ]] represents the truth value of the proposition Φ; it is 1 if Φ is true and 0 otherwise.)

Let us define:

φ_x(j) = [[j = Y_x]]  for x ∈ Λ

Note that φ_x satisfies the formal requirements of a probability distribution over labels j: specifically, it is a point distribution with all its mass concentrated on Y_x. We call it the labeling distribution. Now we can write:

l(θ) = Σ_{x∈Λ} Σ_j φ_x(j) log π(j | x; θ)

     = −Σ_{x∈Λ} H(φ_x ‖ π_x)    (2)

In (2) we have written π_x for the distribution π(· | x; θ), leaving the dependence on θ implicit. We have also used the nonstandard notation H(p ‖ q) for what is sometimes called cross entropy. It is easy to verify that:

H(p ‖ q) = H(p) + D(p ‖ q)    (3)

where H(p) is the entropy of p and D is Kullback-Leibler divergence. Note that, when p is a point distribution, H(p) = 0 and hence H(p ‖ q) = D(p ‖ q). In particular:

l(θ) = −Σ_{x∈Λ} D(φ_x ‖ π_x)    (4)

Thus when, as here, φ_x is a point distribution, we can restate the maximum likelihood criterion as instructing us to choose the model that minimizes the total divergence between the empirical labeling distributions φ_x and the model's prediction distributions π_x.

To extend l(θ) to unlabeled examples, we need only observe that unlabeled examples are ones about whose labels the data provide no information. Accordingly, we revise the definition of φ_x to treat unlabeled examples as ones whose labeling distribution is the maximally uncertain distribution, which is to say, the uniform distribution:

φ_x(j) = [[j = Y_x]]  for x ∈ Λ
φ_x(j) = 1/L          for x ∈ V    (5)

where L is the number of labels. Equivalently:

φ_x(j) = [[x ∈ Λ]] [[j = Y_x]] + [[x ∈ V]] (1/L)

When we replace Λ with X, the expressions (2) and (4) are no longer equivalent; we must use (2). Since H(φ_x ‖ π_x) = H(φ_x) + D(φ_x ‖ π_x), and H(φ_x) is minimized when x is labeled, minimizing H(φ_x ‖ π_x) forces one to label unlabeled examples. On labeled examples, H(φ_x ‖ π_x) = D(φ_x ‖ π_x), and D(φ_x ‖ π_x) is minimized when the labels of examples agree with the predictions of the model. In short, we adopt as objective function

H ≡ Σ_{x∈X} H(φ_x ‖ π_x) = −l(φ, θ)    (6)

We seek to minimize H.

2.3 The Modified Algorithm Y-1

We can show that a modified version of the Yarowsky algorithm finds a local minimum of H. Two modifications are necessary. First, the labeling function Y is recomputed in each iteration as before, but with the constraint that an example once labeled stays labeled. The label may change, but a labeled example cannot become unlabeled again. Second, we eliminate the threshold ζ, or (equivalently) fix it at 1/L. As a result, the only examples that remain unlabeled after the labeling step are those for which π_x is the uniform distribution. The problem with an arbitrary threshold is that it prevents the algorithm from converging to a minimum of H. A threshold that gradually decreases to 1/L would also address the problem, but would complicate the analysis.
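For a concrete labeling and classifier, the objective H of equation (6) is a few lines of code. The sketch below is illustrative, not from the paper: distributions are represented as dicts from labels to probabilities, and unlabeled examples (marked None) get the uniform labeling distribution, per equation (5).

```python
import math

def objective_H(phi, pi):
    """H = sum_x H(phi_x || pi_x), with H(p||q) = -sum_j p(j) log q(j).

    phi, pi -- lists of dicts mapping label -> probability, one per example.
    Assumes pi assigns nonzero probability wherever phi has mass.
    """
    total = 0.0
    for p, q in zip(phi, pi):
        total -= sum(p[j] * math.log(q[j]) for j in p if p[j] > 0.0)
    return total

def labeling_distribution(label, labels):
    """phi_x per equation (5): point mass if labeled, uniform if not."""
    if label is not None:
        return {j: 1.0 if j == label else 0.0 for j in labels}
    return {j: 1.0 / len(labels) for j in labels}
```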

(1) Given: X, Y^{(0)}
(2) For t ∈ {0, 1, ...}:
    (2.1) Train classifier on (Λ^{(t)}, Y^{(t)}); result is π^{(t+1)}
    (2.2) For each example x ∈ X:
        (2.2.1) Set ŷ = arg max_j π_x^{(t+1)}(j)
        (2.2.2) Set Y_x^{(t+1)} to:
                Y_x^{(0)}  if x ∈ Λ^{(0)}
                ŷ          if x ∈ Λ^{(t)} ∨ π_x^{(t+1)}(ŷ) > 1/L
                ⊥          otherwise
    (2.3) If Y^{(t+1)} = Y^{(t)}, stop

Table 4
The modified generic Yarowsky algorithm (Y-1).

The modified algorithm, Y-1, is given in table 4. To obtain a proof, it will be necessary to make an assumption about the supervised classifier π^{(t+1)} induced by the base learner in step (2.1). A natural assumption is that the base learner chooses π^{(t+1)} so as to minimize Σ_{x∈Λ^{(t)}} D(φ_x^{(t)} ‖ π_x^{(t+1)}). A weaker assumption will suffice, however. We assume that the base learner reduces divergence, if possible. That is, we assume:

ΔD_Λ ≡ Σ_{x∈Λ^{(t)}} D(φ_x^{(t)} ‖ π_x^{(t+1)}) − Σ_{x∈Λ^{(t)}} D(φ_x^{(t)} ‖ π_x^{(t)}) ≤ 0    (7)

with equality only if there is no classifier π^{(t+1)} ∈ Π that makes ΔD_Λ < 0. Note that any learning algorithm that minimizes Σ_{x∈Λ^{(t)}} D(φ_x^{(t)} ‖ π_x^{(t+1)}) satisfies the weaker assumption (7), inasmuch as the option of setting π_x^{(t+1)} = π_x^{(t)} is always available.

We also consider a somewhat stronger assumption, namely, that the base learner reduces divergence over all examples, not just over labeled examples:

ΔD_X ≡ Σ_{x∈X} D(φ_x^{(t)} ‖ π_x^{(t+1)}) − Σ_{x∈X} D(φ_x^{(t)} ‖ π_x^{(t)}) ≤ 0    (8)

If a base learning algorithm satisfies (8), the proof of theorem 1 is shorter; but (7) is the more natural condition for a base learner to satisfy.

We can now state the main theorem of this section.

Theorem 1
If the base learning algorithm satisfies (7) or (8), algorithm Y-1 decreases H at each iteration until it reaches a critical point of H.

There is a lemma that we require in order to prove the theorem.

Lemma 1
For all distributions p:

H(p) ≥ −log max_j p(j)

with equality iff p is the uniform distribution.

Proof
By definition, for all j:

max_k p(k) ≥ p(j)
log p(j) ≤ log max_k p(k)

Since this is true for all j, it is true if we take the expectation with respect to p:

Σ_j p(j) log p(j) ≤ Σ_j p(j) log max_k p(k)
H(p) ≥ −log max_k p(k)

We have equality only if p(j) = max_k p(k) for all j, that is, only if p is the uniform distribution.

We now prove the theorem.

Proof of theorem 1
The algorithm produces a sequence of labelings φ^{(0)}, φ^{(1)}, ... and a sequence of classifiers π^{(1)}, π^{(2)}, .... The classifier π^{(t+1)} is trained on φ^{(t)}, and the labeling φ^{(t+1)} is created using π^{(t+1)}. Recall that:

H = Σ_{x∈X} [ H(φ_x) + D(φ_x ‖ π_x) ]

In the training step (2.1), we hold φ fixed and change π, and in the labeling step (2.2), we hold π fixed and change φ. We will show that the training step minimizes H as a function of π, and the labeling step minimizes H as a function of φ except on examples where it is at a critical point of H. Hence, H is non-increasing in each iteration of the algorithm, and is strictly decreasing unless (φ^{(t)}, π^{(t)}) is a critical point of H.

Let us consider the labeling step first. In this step, π is held constant, but φ (possibly) changes, and we have:

ΔH = Σ_{x∈X} ΔH(x)

where

ΔH(x) ≡ H(φ_x^{(t+1)} ‖ π_x^{(t+1)}) − H(φ_x^{(t)} ‖ π_x^{(t+1)})

We can show that ΔH is nonpositive if we can show that ΔH(x) is nonpositive for all x. We can guarantee that ΔH(x) ≤ 0 if φ_x^{(t+1)} minimizes H(p ‖ π_x^{(t+1)}) viewed as a function of p. By definition:

H(p ‖ π_x^{(t+1)}) = Σ_j p(j) log (1/π_x^{(t+1)}(j))

We wish to find the distribution p that minimizes H(p ‖ π_x^{(t+1)}). Clearly, we accomplish that by placing all the mass of p in p(j*), where j* minimizes log(1/π_x^{(t+1)}(j)). If there is more than one minimizer, H(p ‖ π_x^{(t+1)}) is minimized by any distribution p that distributes all its mass among the minimizers of log(1/π_x^{(t+1)}(j)). Observe further that:

arg min_j log (1/π_x^{(t+1)}(j)) = arg max_j π_x^{(t+1)}(j) = ŷ

That is, we minimize H(p ‖ π_x^{(t+1)}) by setting p(j) = [[j = ŷ]], which is to say, by labeling x as predicted by π^{(t+1)}. That is how algorithm Y-1 defines φ_x^{(t+1)} for all examples x ∈ Λ^{(t+1)} whose labels are modifiable (that is, excluding x ∈ Λ^{(0)}).

Note that φ_x^{(t+1)} does not minimize H(p ‖ π_x^{(t+1)}) for examples x ∈ V^{(t+1)}, that is, for examples x that remain unlabeled at t+1. However, in algorithm Y-1, any example that is unlabeled at t+1 is necessarily also unlabeled at t, so for any such example, ΔH(x) = 0. Hence, if any label changes in the labeling step, H decreases, and if no label changes, H remains unchanged; in either case, H does not increase.

We can show further that even for examples x ∈ V^{(t+1)}, the labeling distribution φ_x^{(t+1)} assigned by Y-1 represents a critical point of H. For any example x ∈ V^{(t+1)}, the prediction distribution π_x^{(t+1)} is the uniform distribution (otherwise Y-1 would have labeled x). Hence the divergence between φ_x^{(t+1)} and π_x^{(t+1)} is zero, and thus at a minimum. It would be possible to decrease H(φ_x^{(t+1)} ‖ π_x^{(t+1)}) by decreasing H(φ_x^{(t+1)}) at the cost of an increase in D(φ_x^{(t+1)} ‖ π_x^{(t+1)}), but all directions of motion (all ways of selecting labels to receive increased probability mass) are equally good. That is to say, the gradient of H is zero; we are at a critical point. Essentially, we have reached a saddle point. We have minimized H with respect to φ_x(j) along those dimensions with a non-zero gradient. Along the remaining dimensions, we are actually at a local maximum, but without a gradient to choose a direction of descent.

Now let us consider the training step (2.1). In this step, φ is held constant, so the change in H is equal to the change in D; recall that H(φ ‖ π) = H(φ) + D(φ ‖ π). By the hypothesis of the theorem, there are two cases: either the base learner satisfies (7) or (8). If it satisfies (8), the base learner reduces D as a function of π, hence it follows immediately that it reduces H as a function of π.

Suppose instead that the base learner satisfies (7). We can express H as:

H = Σ_{x∈X} H(φ_x) + Σ_{x∈Λ^{(t)}} D(φ_x ‖ π_x) + Σ_{x∈V^{(t)}} D(φ_x ‖ π_x)

In the training step, the first term remains constant. The second term decreases, by hypothesis. But the third term may increase. However, we can show that any increase in the third term is more than offset in the labeling step.

Consider an arbitrary example x in V^{(t)}. Since it is unlabeled at time t, we know that φ_x^{(t)} is the uniform distribution u:

u(j) = 1/L

Moreover, π_x^{(t)} must also be the uniform distribution; otherwise example x would have been labeled in a previous iteration. Therefore the value of H(x) = H(φ_x ‖ π_x) at the beginning of iteration t is H_0:

H_0 = Σ_j φ_x^{(t)}(j) log (1/π_x^{(t)}(j)) = Σ_j u(j) log (1/u(j)) = H(u)

After the training step, the value is H_1:

H_1 = Σ_j φ_x^{(t)}(j) log (1/π_x^{(t+1)}(j))

If π_x remains unchanged in the training step, then the new distribution π_x^{(t+1)}, like the old one, is the uniform distribution, and the example remains unlabeled. Hence there is no change in H, and in particular, H is non-increasing, as desired. On the other hand,

if π_x does change, then the new distribution π_x^{(t+1)} is non-uniform, and the example is labeled in the labeling step. Hence the value of H(x) at the end of the iteration, after the labeling step, is H_2:

H_2 = Σ_j φ_x^{(t+1)}(j) log (1/π_x^{(t+1)}(j)) = log (1/π_x^{(t+1)}(ŷ))

By lemma 1, H_2 < H(u); hence H_2 < H_0. As we observed above, H_1 may exceed H_0, but if we consider the change overall, we find that the increase in the training step is more than offset in the labeling step:

ΔH(x) = H_2 − H_1 + H_1 − H_0 < 0

3. The Specific Yarowsky Algorithm

3.1 The Original Decision List Induction Algorithm DL-0

When one speaks of the Yarowsky algorithm, one often has in mind not just the generic algorithm Y-0 (or Y-1), but an algorithm whose specification includes the particular choice of base learning algorithm made by Yarowsky. Specifically, Yarowsky's base learner constructs a decision list, that is, a list of rules of form f → j, where f is a feature and j is a label, with score θ_{fj}. A rule f → j matches example x if x possesses the feature f. The label predicted for a given example x is the label of the highest-scoring rule that matches x.

Yarowsky uses smoothed precision for rule scoring. As the name suggests, smoothed precision q̃_f(j) is a smoothed version of (raw) precision q_f(j), which is the probability that rule f → j is correct given that it matches:

q_f(j) ≡ |Λ_{fj}| / |Λ_f|  if |Λ_f| > 0
         1/L              otherwise    (9)

where Λ_f is the set of labeled examples that possess feature f, and Λ_{fj} is the set of labeled examples with feature f and label j. Smoothed precision q̃(j | f; ε) is defined as follows:

q̃(j | f; ε) ≡ (|Λ_{fj}| + ε) / (|Λ_f| + Lε)    (10)

We also write q̃_f(j) when ε is clear from context.

Yarowsky defines a rule's score to be its smoothed precision:

θ_{fj} = q̃_f(j)

Anticipating later needs, we will also consider raw precision as an alternative: θ_{fj} = q_f(j). Both raw and smoothed precision have the properties of a conditional probability distribution. Generally, we view θ_{fj} as a conditional distribution over labels j for a fixed feature f.

Yarowsky defines the confidence of the decision list to be the score of the highest-scoring rule that matches the instance being classified. This is equivalent to defining:

π_x(j) ∝ max_{f∈F_x} θ_{fj}    (11)

(0) Given: a fixed value for ε > 0
    Initialize arrays N[f, j] = 0, Z[f] = 0 for all f, j
(1) For each example x ∈ Λ:
    (1.1) Let j be the label of x
    (1.2) Increment N[f, j] and Z[f], for each feature f of x
(2) For each feature f and label j:
    (2.1) Set θ_{fj} = (N[f, j] + ε) / (Z[f] + Lε)
(*) Define π_x(j) ∝ max_{f∈F_x} θ_{fj}

Table 5
The decision list induction algorithm DL-0. The value accumulated in N[f, j] is |Λ_{fj}| and the value accumulated in Z[f] is |Λ_f|.

(Recall that F_x is the set of features of x.) Since the classifier's prediction for x is defined, in equation (1), to be the label that maximizes π_x(j), definition (11) implies that the classifier's prediction is the label of the highest-scoring rule matching x, as desired. We have written "∝" in (11) rather than "=" because maximizing θ_{fj} across f ∈ F_x for each label j will not in general yield a probability distribution over labels, though the scores will be positive and bounded, hence normalizable. Considering only the final predicted label ŷ for a given example x, the normalization will have no effect, inasmuch as all scores θ_{fj} being compared will be scaled in the same way.

As characterized by Yarowsky, a decision list contains only those rules f → j whose score q̃_f(j) exceeds the labeling threshold ζ. This can be seen purely as an efficiency measure. Including rules whose score falls below the labeling threshold will have no effect on the classifier's predictions, as the threshold will be applied when the classifier is applied to examples. For this reason, we do not prune the list. That is, we represent a decision list as a set of parameters {θ_{fj}}, one for every possible rule f → j in the cross product of the set of features and the set of labels.

The decision-list induction algorithm used by Yarowsky is summarized in table 5; we call it DL-0. Note that the step labeled (*) is not actually a step of the induction algorithm, but rather specifies how the decision list is used to compute a prediction distribution π_x for a given example x.

Unfortunately, we cannot prove anything about DL-0 as it stands. In particular, we are unable to show that DL-0 reduces divergence between prediction and labeling distributions (7). In the next section, we describe an alternative decision list induction algorithm, called DL-EM, that does satisfy (7); hence we can apply theorem 1 to the combination Y-1/DL-EM to show that it reduces H. However, a disadvantage of DL-EM is that it does not resemble the algorithm DL-0 used by Yarowsky. We return in section 3.4 to a close variant of DL-0 called DL-1, and show that, though it does not directly reduce H, it does reduce the upper bound K.
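The following is a direct transcription of table 5 into Python, a sketch under the illustrative assumption that examples are (feature set, label) pairs. Note that predict returns the unnormalized scores of equation (11), which is all the arg max of equation (1) needs.

```python
from collections import defaultdict

def train_dl0(labeled_examples, labels, eps):
    """DL-0 (table 5): score every rule f -> j by smoothed precision (eq. 10)."""
    N = defaultdict(float)   # N[f, j] accumulates |Lambda_{fj}|
    Z = defaultdict(float)   # Z[f]    accumulates |Lambda_f|
    for features, j in labeled_examples:
        for f in features:
            N[f, j] += 1.0
            Z[f] += 1.0
    L = len(labels)
    theta = {(f, j): (N[f, j] + eps) / (Z[f] + L * eps)
             for f in Z for j in labels}

    def predict(features):
        # eq. (11): score of label j is that of the best-scoring matching rule;
        # rules for unseen features default to 1/L, as smoothing would give
        return {j: max(theta.get((f, j), 1.0 / L) for f in features)
                for j in labels}
    return predict
```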

3.2 The Decision List Induction Algorithm DL-EM

The algorithm DL-EM is a special case of the Expectation-Maximization (EM) algorithm. We consider two versions of the algorithm: DL-EM-Λ and DL-EM-X. They differ in that DL-EM-Λ is trained on labeled examples only, whereas DL-EM-X is trained on labeled and unlabeled examples. However, the basic outline of the algorithm is the same for both.

First, the DL-EM algorithms do not assume the definition of π that Yarowsky assumed, given in (11). As mentioned above, the parameters θ_{fj} can be thought of as defining a prediction distribution θ_f(j) over labels j for each feature f. Hence equation (11) specifies how the prediction distributions θ_f for the features of example x are to be combined to yield a prediction distribution π_x for x. Instead of combining distributions by maximizing θ_{fj} across f ∈ F_x as in equation (11), DL-EM takes a mixture of the θ_f:

π_x(j) = (1/m) Σ_{f∈F_x} θ_{fj}    (12)

Here m = |F_x| is the number of features that x possesses; for the sake of simplicity, we assume that all examples have the same number of features. Since θ_f is a probability distribution for each f, and since any convex combination of distributions is also a distribution, it follows that π_x as defined in (12) is a probability distribution.

The two definitions for π_x(j), (11) and (12), will often have the same mode ŷ, but that is guaranteed only in the rather severely restricted case of two features and two labels. Under definition (11), the prediction is determined entirely by the strongest θ_f, whereas definition (12) permits a block of weaker θ_f to outvote the strongest one. Yarowsky explicitly wished to avoid the possibility of such interactions. Nonetheless, the definition (12) used by DL-EM turns out to make analysis of other base learners more manageable, and we will assume it henceforth, not only for DL-EM, but also for the algorithms DL-1 and YS discussed in subsequent sections.

DL-EM also differs from DL-0 in that DL-EM does not construct a classifier from scratch, but rather seeks to improve on a previous classifier. In the context of the Yarowsky algorithm, the previous classifier is the one from the previous iteration of the outer loop. We write θ^{old}_{fj} for the parameters and π^{old}_x for the prediction distributions of the previous classifier. Conceptually, DL-EM considers the label j assigned to an example x to be generated by choosing a feature f ∈ F_x and then assigning the label j according to the feature's prediction distribution θ_f(j). The choice of feature f is a hidden variable. The degree to which an example labeled j is imputed to feature f is determined by the old distribution:

π^{old}(f | x, j) = [[f ∈ F_x]] θ^{old}_{fj} / Σ_g [[g ∈ F_x]] θ^{old}_{gj} = [[f ∈ F_x]] θ^{old}_{fj} / (m π^{old}_x(j))

One can think of π^{old}(f | x, j) either as the posterior probability that feature f was responsible for the label j, or as the portion of the labeled example (x, j) that is imputed to feature f. We also write π^{old}_{xj}(f) as a synonym for π^{old}(f | x, j).

The new estimate θ_{fj} is obtained by summing imputed occurrences of (f, j) and normalizing across labels. For DL-EM-Λ, this takes the form:

θ_{fj} = Σ_{x∈Λ_j} π^{old}(f | x, j) / Σ_k Σ_{x∈Λ_k} π^{old}(f | x, k)

The algorithm is summarized in table 6.

The second version of the algorithm, DL-EM-X, is summarized in table 7. It is like DL-EM-Λ, except that it uses the update rule:

θ_{fj} = [ Σ_{x∈Λ_j} π^{old}(f | x, j) + (1/L) Σ_{x∈V} π^{old}(f | x, j) ] / Σ_k [ Σ_{x∈Λ_k} π^{old}(f | x, k) + (1/L) Σ_{x∈V} π^{old}(f | x, k) ]    (13)

The update rule (13) includes unlabeled examples as well as labeled examples. Conceptually, it divides each unlabeled example equally among the labels, then divides the resulting fractional labeled example among the example's features.

We note that both variants of the DL-EM algorithm constitute a single iteration of an EM-like algorithm. A single iteration suffices to prove the following theorem, though multiple iterations would also be effective.

(0) Initialize N[f, j] = 0 for all f, j
(1) For each example x labeled j:
    (1.1) Let Z = Σ_{g∈F_x} θ^{old}_{gj}
    (1.2) For each f ∈ F_x, increment N[f, j] by θ^{old}_{fj} / Z
(2) For each feature f:
    (2.1) Let Z = Σ_k N[f, k]
    (2.2) For each label j, set θ_{fj} = N[f, j] / Z

Table 6
DL-EM-Λ decision list induction algorithm.

(0) Initialize N[f, j] = 0 and U[f, j] = 0, for all f, j
(1) For each example x labeled j:
    (1.1) Let Z = Σ_{g∈F_x} θ^{old}_{gj}
    (1.2) For each f ∈ F_x, increment N[f, j] by θ^{old}_{fj} / Z
(2) For each unlabeled example x:
    (2.1) For each label j, let Z_j = Σ_{g∈F_x} θ^{old}_{gj}
    (2.2) For each f ∈ F_x and label j, increment U[f, j] by θ^{old}_{fj} / Z_j
(3) For each feature f:
    (3.1) Let Z = Σ_k ( N[f, k] + (1/L) U[f, k] )
    (3.2) For each label j, set θ_{fj} = ( N[f, j] + (1/L) U[f, j] ) / Z

Table 7
DL-EM-X decision list induction algorithm.
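A sketch of DL-EM-Λ (table 6) in Python, under the same illustrative data representation as before. The E-step spreads each labeled example over its features in proportion to the old scores; the M-step renormalizes the imputed counts across labels.

```python
from collections import defaultdict

def train_dl_em_lambda(labeled_examples, labels, theta_old):
    """One DL-EM-Lambda update (table 6). theta_old: dict (f, j) -> probability;
    missing rules default to the uniform score 1/L."""
    L = len(labels)
    N = defaultdict(float)
    for features, j in labeled_examples:
        # E-step: impute the example (x, j) to its features (pi_old(f | x, j))
        Z = sum(theta_old.get((g, j), 1.0 / L) for g in features)
        for f in features:
            N[f, j] += theta_old.get((f, j), 1.0 / L) / Z
    # M-step: normalize imputed counts across labels for each feature
    theta = {}
    for f in {f for f, _ in N}:
        total = sum(N[f, j] for j in labels)
        for j in labels:
            theta[f, j] = N[f, j] / total
    return theta
```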

Theorem 2
The classifier produced by the DL-EM-Λ algorithm satisfies equation (7), and the classifier produced by the DL-EM-X algorithm satisfies equation (8).

Combining theorems 1 and 2 yields:

Corollary 1
The Yarowsky algorithm Y-1, using DL-EM-Λ or DL-EM-X as its base learning algorithm, decreases H at each iteration until it reaches a critical point of H.

Proof of theorem 2
Let θ^{old} represent the parameter values at the beginning of the call to DL-EM, let θ represent a family of free variables that we will optimize, and let π^{old} and π be the corresponding prediction distributions. The labeling distribution φ is fixed. For any set of examples α, let ΔD_α be the change in Σ_{x∈α} D(φ_x ‖ π_x) resulting from the change in θ. We are obviously particularly interested in the two cases where α is the set of all examples X (for DL-EM-X) and where α is the set of labeled examples Λ (for DL-EM-Λ). In either case, we will show that ΔD_α ≤ 0, with equality only if no choice of θ decreases D.

We first derive an expression for ΔD_α that we will put to use shortly:

ΔD_α = Σ_{x∈α} [ D(φ_x ‖ π_x) − D(φ_x ‖ π^{old}_x) ]
     = Σ_{x∈α} [ H(φ_x ‖ π_x) − H(φ_x) − H(φ_x ‖ π^{old}_x) + H(φ_x) ]
     = −Σ_{x∈α} Σ_j φ_x(j) [ log π_x(j) − log π^{old}_x(j) ]    (14)

The EM algorithm is based on the fact that divergence is non-negative, and strictly positive if the distributions compared are not identical:

0 ≤ Σ_{x∈α} Σ_j φ_x(j) D(π^{old}_{xj} ‖ π_{xj})
  = Σ_{x∈α} Σ_j φ_x(j) Σ_{f∈F_x} π^{old}_{xj}(f) log ( π^{old}_{xj}(f) / π_{xj}(f) )
  = Σ_{x∈α} Σ_j φ_x(j) Σ_{f∈F_x} π^{old}_{xj}(f) log ( (θ^{old}_{fj} / π^{old}_x(j)) (π_x(j) / θ_{fj}) )

which yields the inequality:

Σ_{x∈α} Σ_j φ_x(j) [ log π_x(j) − log π^{old}_x(j) ] ≥ Σ_{x∈α} Σ_j φ_x(j) Σ_{f∈F_x} π^{old}_{xj}(f) [ log θ_{fj} − log θ^{old}_{fj} ]    (15)

By (14), this can be written as:

ΔD_α ≤ Σ_{x∈α} Σ_j φ_x(j) Σ_{f∈F_x} π^{old}_{xj}(f) [ log θ^{old}_{fj} − log θ_{fj} ]

Since θ^{old}_{fj} is constant, by maximizing

Σ_{x∈α} Σ_j φ_x(j) Σ_{f∈F_x} π^{old}_{xj}(f) log θ_{fj}    (16)

we maximize a lower bound on the reduction −ΔD_α; equivalently, we minimize an upper bound on ΔD_α. It is easy to see that ΔD_α is then bounded above by 0: the choice θ_{fj} = θ^{old}_{fj} is always available and makes both the bound and ΔD_α equal to zero, so the maximizer of (16) does at least as well. Since divergence is zero only if the two distributions compared are identical, we have strict inequality in (15) unless the best choice for θ is θ^{old}, in which case no choice of θ makes ΔD_α < 0.

It remains to show that DL-EM computes the parameter set θ that maximizes (16). We wish to maximize (16) under the constraints that the values {θ_{fj}} for fixed f sum to unity across choices of j, so we apply Lagrange's method. We express the constraints in the form C_f = 0, where:

C_f ≡ 1 − Σ_j θ_{fj}

We seek a solution to the family of equations that results from expressing the gradient of (16) as a linear combination of the gradients of the constraints:

∂/∂θ_{fj} [ Σ_{x∈α} Σ_k φ_x(k) Σ_{g∈F_x} π^{old}_{xk}(g) log θ_{gk} ] = λ_f ∂C_f/∂θ_{fj}    (17)

We derive an expression for the derivative on the left-hand side:

∂/∂θ_{fj} [ Σ_{x∈α} Σ_k φ_x(k) Σ_{g∈F_x} π^{old}_{xk}(g) log θ_{gk} ] = Σ_{x∈X_f∩α} φ_x(j) π^{old}_{xj}(f) / θ_{fj}

Similarly for the right-hand side:

∂C_f/∂θ_{fj} = −1

Substituting these into equation (17):

Σ_{x∈X_f∩α} φ_x(j) π^{old}_{xj}(f) / θ_{fj} = −λ_f

θ_{fj} = −(1/λ_f) Σ_{x∈X_f∩α} φ_x(j) π^{old}_{xj}(f)    (18)

Using the constraint C_f = 0 and solving for λ_f:

λ_f = −Σ_k Σ_{x∈X_f∩α} φ_x(k) π^{old}_{xk}(f)

Substituting back into (18):

θ_{fj} = Σ_{x∈X_f∩α} φ_x(j) π^{old}_{xj}(f) / Σ_k Σ_{x∈X_f∩α} φ_x(k) π^{old}_{xk}(f)    (19)

If we consider the case where α is the set of all examples, and expand out φ_x in (19), we obtain:

θ_{fj} = (1/Z) [ Σ_{x∈Λ_{fj}} π^{old}_{xj}(f) + (1/L) Σ_{x∈V_f} π^{old}_{xj}(f) ]

where Z normalizes θ_{f·} to sum to one across labels. It is not hard to see that this is the update rule that DL-EM-X computes, using the intermediate values:

N[f, j] = Σ_{x∈Λ_{fj}} π^{old}_{xj}(f)
U[f, j] = Σ_{x∈V_f} π^{old}_{xj}(f)

If we consider the case where α is the set of labeled examples, and expand out φ_x in (19), we obtain:

θ_{fj} = (1/Z) Σ_{x∈Λ_{fj}} π^{old}_{xj}(f)

This is the update rule that DL-EM-Λ computes. Thus we see that DL-EM-X reduces divergence over X, and DL-EM-Λ reduces divergence over Λ.

We note in closing that DL-EM-X can be simplified when used with algorithm Y-1, inasmuch as it is known that θ^{old}_{fj} = 1/L for all (f, j) where f ∈ F_x for some x ∈ V. Then the expression for U[f, j] simplifies as follows:

Σ_{x∈V_f} π^{old}_{xj}(f) = Σ_{x∈V_f} (1/L) / Σ_{g∈F_x} (1/L) = |V_f| / m

The dependence on j disappears, so we can replace U[f, j] with U[f] in algorithm DL-EM-X, delete step (2.1), and replace step (2.2) with the statement "For each f ∈ F_x, increment U[f] by 1/m."

3.3 The Objective Function K

Y-1/DL-EM is the only variation on the Yarowsky algorithm that we can show to reduce negative log likelihood H. The variants that we discuss in the remainder of the paper, Y-1/DL-1 and YS, reduce an alternative objective function, K, which we now define.

The value K (or, more precisely, the value K/m) is an upper bound on H, which we derive using Jensen's inequality, as follows:

H = Σ_{x∈X} Σ_j φ_x(j) log [ 1 / ( (1/m) Σ_{g∈F_x} θ_{gj} ) ]
  ≤ Σ_{x∈X} Σ_j φ_x(j) (1/m) Σ_{g∈F_x} log (1/θ_{gj})
  = (1/m) Σ_{x∈X} Σ_{g∈F_x} H(φ_x ‖ θ_g)

We define:

K ≡ Σ_{x∈X} Σ_{g∈F_x} H(φ_x ‖ θ_g)    (20)

By minimizing K, we minimize an upper bound on H. Moreover, it is in principle possible to reduce K to zero. Since H(φ_x ‖ θ_g) = H(φ_x) + D(φ_x ‖ θ_g), K is reduced to zero if all examples are labeled, each feature concentrates its prediction distribution in a single label,

and the label of every example agrees with the prediction of every feature it possesses. In this limiting case, any minimizer of K is also a minimizer of H.

We hasten to add a proviso: it is not possible to reduce K to zero for all data sets. The following provides a necessary and sufficient condition. Consider an undirected bipartite graph G whose nodes are examples and features. There is an edge between example x and feature f just in case f is a feature of x. Define examples x_1 and x_2 to be neighbors if they both belong to the same connected component of G. K is reducible to zero if and only if x_1 and x_2 have the same label according to Y^{(0)}, for all pairs of neighbors x_1 and x_2 in Λ^{(0)}.

3.4 Algorithm DL-1

We consider two variants of DL-0, called DL-1-R and DL-1-VS. They differ from DL-0 in two ways. First, the DL-1 algorithms assume the "mean" definition of π_x given in equation (12) rather than the "max" definition of equation (11). This is not actually a difference in the induction algorithm itself, but in the way the decision list is used to construct a prediction distribution π_x. Second, the DL-1 algorithms use update rules that differ from the smoothed precision of DL-0. DL-1-R (table 8) uses raw precision instead of smoothed precision. DL-1-VS (table 9) uses smoothed precision, but unlike DL-0, DL-1-VS does not use a fixed smoothing constant ε; rather, ε varies from feature to feature. Specifically, in computing the score θ_{fj}, DL-1-VS uses |V_f|/L as its value for ε.

(0) Initialize N[f, j] = 0, Z[f] = 0 for all f, j
(1) For each example-label pair (x, j):
    (1.1) For each feature f ∈ F_x, increment N[f, j], Z[f]
(2) For each feature f and label j:
    (2.1) Set θ_{fj} = N[f, j] / Z[f]
(*) Define π_x(j) = (1/m) Σ_{f∈F_x} θ_{fj}

Table 8
The decision list induction algorithm DL-1-R.

(0) Initialize N[f, j] = 0, Z[f] = 0, U[f] = 0 for all f, j
(1) For each example-label pair (x, j):
    (1.1) For each feature f ∈ F_x, increment N[f, j], Z[f]
(2) For each unlabeled example x:
    (2.1) For each feature f ∈ F_x, increment U[f]
(3) For each feature f and label j:
    (3.1) Set ε = U[f] / L
    (3.2) Set θ_{fj} = (N[f, j] + ε) / (Z[f] + U[f])
(*) Define π_x(j) = (1/m) Σ_{f∈F_x} θ_{fj}

Table 9
The decision list induction algorithm DL-1-VS.
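A sketch of DL-1-VS (table 9) together with the mean prediction rule of equation (12), again under the illustrative representation of examples as feature sets; defaulting missing rules to the uniform score 1/L in the predictor is our choice, not the paper's.

```python
from collections import defaultdict

def train_dl1_vs(labeled_examples, unlabeled_examples, labels):
    """DL-1-VS (table 9): smoothed precision with per-feature eps = |V_f| / L."""
    N = defaultdict(float)   # N[f, j] = |Lambda_{fj}|
    Z = defaultdict(float)   # Z[f]    = |Lambda_f|
    U = defaultdict(float)   # U[f]    = |V_f|
    for features, j in labeled_examples:
        for f in features:
            N[f, j] += 1.0
            Z[f] += 1.0
    for features in unlabeled_examples:
        for f in features:
            U[f] += 1.0
    L = len(labels)
    theta = {}
    for f in set(Z) | set(U):
        eps = U[f] / L                                       # step (3.1)
        for j in labels:
            theta[f, j] = (N[f, j] + eps) / (Z[f] + U[f])    # step (3.2)
    return theta

def predict_mean(features, theta, labels):
    """pi_x(j) = (1/m) sum_{f in F_x} theta_{fj} (eq. 12)."""
    L, m = len(labels), len(features)
    return {j: sum(theta.get((f, j), 1.0 / L) for f in features) / m
            for j in labels}
```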

The value of ε used by DL-1-VS can be expressed in another way that will prove useful. Let us define:

p(Λ|f) ≡ |Λ_f| / |X_f|
p(V|f) ≡ |V_f| / |X_f|

Lemma 2
The parameter values {θ_{fj}} computed by DL-1-VS can be expressed as:

θ_{fj} = p(Λ|f) q_f(j) + p(V|f) u(j)    (21)

where u(j) is the uniform distribution over labels.

Proof
If |Λ_f| = 0, then p(Λ|f) = 0 and the right-hand side of (21) reduces to u(j). Further, N[f, j] = Z[f] = 0, so DL-1-VS computes θ_{fj} = (|V_f|/L) / |V_f| = u(j), and the lemma is proved. Hence we need only consider the case |Λ_f| > 0.

First we show that smoothed precision can be expressed as a convex combination of raw precision (9) and the uniform distribution. Define δ = ε/|Λ_f|. Then:

q̃_f(j) = (|Λ_{fj}| + ε) / (|Λ_f| + Lε)
       = (|Λ_{fj}|/|Λ_f| + δ) / (1 + Lδ)
       = [1/(1+Lδ)] q_f(j) + [Lδ/(1+Lδ)] (1/L)
       = [1/(1+Lδ)] q_f(j) + [Lδ/(1+Lδ)] u(j)    (22)

Now we show that the mixing coefficient 1/(1+Lδ) of (22) is the same as the mixing coefficient p(Λ|f) of the lemma, when ε = |V_f|/L as in step (3.1) of DL-1-VS:

ε = |V_f|/L  implies  Lδ = Lε/|Λ_f| = |V_f|/|Λ_f| = p(V|f)/p(Λ|f)

and hence, since p(Λ|f) + p(V|f) = 1:

1/(1+Lδ) = p(Λ|f) / (p(Λ|f) + p(V|f)) = p(Λ|f)

The main theorem of this section (theorem 5) is that the specific Yarowsky algorithm Y-1/DL-1 decreases K in each iteration until it reaches a critical point. It is proved as a corollary of two theorems. The first (theorem 3) shows that DL-1 minimizes K as a function of θ, holding φ constant, and the second (theorem 4) shows that Y-1 decreases K as a function of φ, holding θ constant. More precisely, DL-1-R minimizes K over labeled examples Λ, and DL-1-VS minimizes K over all examples X. Either is sufficient for Y-1 to be effective.

Theorem 3
DL-1 minimizes K as a function of θ, holding φ constant. Specifically, DL-1-R minimizes K over labeled examples Λ, and DL-1-VS minimizes K over all examples X.

Proof
We wish to minimize K as a function of θ under the constraints:

C_f ≡ 1 − Σ_j θ_{fj} = 0

for each f. As before, to minimize K under the constraints C_f = 0, we express the gradient of K as a linear combination of the gradients of the constraints, and solve the resulting system of equations:

∂K/∂θ_{fj} = λ_f ∂C_f/∂θ_{fj}    (23)

First we derive expressions for the derivatives of C_f and K. The variable α represents the set of examples over which we are minimizing K.

∂C_f/∂θ_{fj} = −1

∂K/∂θ_{fj} = ∂/∂θ_{fj} [ Σ_{x∈α} Σ_k φ_x(k) Σ_{g∈F_x} log (1/θ_{gk}) ] = −Σ_{x∈X_f∩α} φ_x(j) / θ_{fj}

We substitute these expressions into (23) and solve for θ_{fj}:

−Σ_{x∈X_f∩α} φ_x(j) / θ_{fj} = −λ_f

θ_{fj} = (1/λ_f) Σ_{x∈X_f∩α} φ_x(j)

Substituting the latter expression into the equation C_f = 0 and solving for λ_f:

1 − Σ_j (1/λ_f) Σ_{x∈X_f∩α} φ_x(j) = 0

λ_f = Σ_j Σ_{x∈X_f∩α} φ_x(j) = |X_f ∩ α|

Substituting this back into the expression for θ_{fj}:

θ_{fj} = (1/|X_f ∩ α|) Σ_{x∈X_f∩α} φ_x(j)    (24)

If α = Λ, we have:

θ_{fj} = (1/|Λ_f|) Σ_{x∈Λ_f} [[x ∈ Λ_j]] = q_f(j)

This is the update computed by DL-1-R, showing that DL-1-R computes the parameter values {θ_{fj}} that minimize K over the labeled examples Λ.

If α = X, we have:

θ_{fj} = (1/|X_f|) [ Σ_{x∈Λ_f} [[x ∈ Λ_j]] + Σ_{x∈V_f} (1/L) ]
       = (|Λ_f|/|X_f|) (|Λ_{fj}|/|Λ_f|) + (|V_f|/|X_f|) (1/L)
       = p(Λ|f) q_f(j) + p(V|f) u(j)

By lemma 2, this is the update computed by DL-1-VS, hence DL-1-VS minimizes K over the complete set of examples X.
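Lemma 2, and with it the α = X half of theorem 3, is easy to check numerically. A small sanity check with made-up counts (the numbers below are assumptions for illustration, not data from the paper):

```python
# Counts for one feature f over a toy data set: 3 labeled examples with f
# (2 labeled 'a', 1 labeled 'b'), and 5 unlabeled examples with f.
labels = ['a', 'b']
L = len(labels)
N = {'a': 2.0, 'b': 1.0}      # |Lambda_{fj}|
Z = sum(N.values())           # |Lambda_f|
U = 5.0                       # |V_f|
X_f = Z + U                   # |X_f|

eps = U / L                   # DL-1-VS smoothing constant (table 9, step 3.1)
theta = {j: (N[j] + eps) / (Z + U) for j in labels}

# Lemma 2: theta_{fj} = p(Lambda|f) q_f(j) + p(V|f) u(j)
q = {j: N[j] / Z for j in labels}                          # raw precision (eq. 9)
mix = {j: (Z / X_f) * q[j] + (U / X_f) * (1.0 / L) for j in labels}

for j in labels:
    assert abs(theta[j] - mix[j]) < 1e-12
print(theta)   # {'a': 0.5625, 'b': 0.4375}
```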

Theorem 4
If the base learner decreases K over X or over Λ, where the prediction distribution is computed as

π_x(j) = (1/m) Σ_{f∈F_x} θ_{fj}

then algorithm Y-1 decreases K at each iteration until it reaches a critical point, considering K as a function of φ with θ held constant.

The proof has the same structure as the proof of theorem 1, so we give only a sketch here. We minimize K as a function of φ by minimizing it for each example separately:

K(x) = Σ_{g∈F_x} H(φ_x ‖ θ_g) = Σ_j φ_x(j) Σ_{g∈F_x} log (1/θ_{gj})

To minimize K(x), we choose φ_x so as to concentrate all mass in

arg min_j Σ_{g∈F_x} log (1/θ_{gj}) = arg max_j π_x(j)

This is the labeling rule used by Y-1. If the base learner minimizes over Λ only, rather than X, it can be shown that any increase in K on unlabeled examples is compensated for in the labeling step, as in the proof of theorem 1.

Theorem 5
The specific Yarowsky algorithms Y-1/DL-1-R and Y-1/DL-1-VS decrease K at each iteration until they reach a critical point.

Proof
Immediate from theorems 3 and 4.

4. Sequential Algorithms

4.1 The Family YS

The Yarowsky algorithm variants we have considered to now do parallel updates, in the sense that the parameters {θ_{fj}} are completely recomputed at each iteration. In this section, we consider a family YS of sequential variants of the Yarowsky algorithm, in which a single feature is selected for update at each iteration. The YS algorithms resemble the Yarowsky-Cautious algorithm of Collins and Singer (1999), though they differ from the Yarowsky-Cautious algorithm in that they update a single feature in each iteration, rather than a small set of features, as in Yarowsky-Cautious. The YS algorithms are intended to be as close to the Y-1/DL-1 algorithm as is consonant with single-feature updates.

The YS algorithms differ from each other, and from Y-1/DL-1, in the choice of update rule. An interesting range of update rules work in the sequential setting. In particular, smoothed precision with fixed ε, as in the original algorithm Y-0/DL-0, works in the sequential setting, though with a proviso that will be spelled out later.

Instead of an initial labeled set, there is an initial classifier consisting of a set of selected features S^{(0)} and an initial parameter set θ^{(0)} with θ^{(0)}_{fj} = 1/L for all f ∉ S^{(0)}.

(0) Given: S^{(0)}, θ^{(0)} with θ^{(0)}_{gj} = 1/L for g ∉ S^{(0)}
(1) Initialization:
    (1.1) Set S = S^{(0)}, θ = θ^{(0)}
    (1.2) For each example x ∈ X:
          If x possesses a feature in S^{(0)}, set Y_x = ŷ; else set Y_x = ⊥
(2) Loop:
    (2.1) Choose a feature f ∉ S^{(0)} such that Λ_f ≠ ∅ and θ_{f·} ≠ q_f(·)
          If there is none, stop
    (2.2) Add f to S
    (2.3) For each label j, set θ_{fj} = update(f, j)
    (2.4) For each example x possessing a feature in S, set Y_x = ŷ

Table 10
The sequential algorithm YS.

At each iteration, one feature is selected to be added to the selected set. A feature once selected remains in the selected set. It is permissible for a feature to be selected more than once; this permits us to continue reducing K even after all features have been selected. In short, there is a sequence of selected features f_0, f_1, ..., and:

S^{(t+1)} = S^{(t)} ∪ {f_t}

The parameters for the selected feature are also updated. At iteration t, the parameters θ_{gj} with g = f_t may be modified, but all other parameters remain constant. That is:

θ^{(t+1)}_{gj} = θ^{(t)}_{gj}  for g ≠ f_t

It follows that, for all t:

θ^{(t)}_{gj} = 1/L  for g ∉ S^{(t)}

However, parameters for features in S^{(0)} may not be modified, inasmuch as they play the role of manually labeled data.

In each iteration, one selects a feature f_t and computes (or recomputes) the prediction distribution θ_{f_t} for the selected feature f_t. Then labels are recomputed as follows. Recall that ŷ ≡ arg max_j π_x(j), where we continue to assume π_x(j) to have the mixture definition (equation 12). The label of example x is set to ŷ if any feature of x belongs to S^{(t+1)}. In particular, all previously labeled examples continue to be labeled (though their labels may change), and any unlabeled examples possessing feature f_t become labeled.

The algorithm is summarized in table 10. It is actually an algorithm schema; the definition for "update" needs to be supplied. We consider three different update functions: one that uses raw precision as prediction distribution, one that uses smoothed precision, and one that goes in the opposite direction, using what we might call peaked precision. As we have seen, smoothed precision can be expressed as a mixture of raw precision and the uniform (i.e., maximum-entropy) distribution (22). Peaked precision q̂_f(j) mixes in a certain amount of the point (i.e., minimum-entropy) distribution that has all its mass on the label that maximizes raw precision:

q̂_f(j) ≡ p(Λ^{(t)}|f) q_f(j) + p(V^{(t)}|f) [[j = j†]]    (25)

where:

j† ≡ arg max_k q_f(k)    (26)
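Here is a runnable sketch of the schema in table 10, with the three updates of section 4.1 plugged in through the update argument. The data structures and the update signature (raw precision q plus the counts |Λ_f| and |V_f|) are illustrative choices, not the paper's notation.

```python
def yarowsky_sequential(X, labels, S0, theta0, update, max_iters=1000):
    """Schema YS (table 10), a sketch. X: list of feature sets; S0: seed
    features; theta0: dict (f, j) -> score (1/L is the default for unselected
    features); update(q, nl, nu) -> {j: score} for the selected feature."""
    L = len(labels)
    S, theta = set(S0), dict(theta0)

    def yhat(features):
        # eq. (12): pi_x(j) = (1/m) sum_{f in F_x} theta_{fj}
        pi = {j: sum(theta.get((f, j), 1.0 / L) for f in features)
              for j in labels}
        return max(pi, key=pi.get)

    def relabel():
        # steps (1.2) and (2.4): label x iff it has a feature in S
        return [yhat(fx) if fx & S else None for fx in X]

    def stats(f, Y):
        # raw precision q_f (eq. 9) plus the counts |Lambda_f| and |V_f|
        ys = [y for fx, y in zip(X, Y) if f in fx]
        labeled = [y for y in ys if y is not None]
        q = ({j: labeled.count(j) / len(labeled) for j in labels}
             if labeled else None)
        return q, len(labeled), len(ys) - len(labeled)

    Y = relabel()
    candidates = {f for fx in X for f in fx} - set(S0)   # seed rules are fixed
    for _ in range(max_iters):
        # (2.1): any feature with labeled occurrences and theta_f != q_f will do
        chosen = None
        for f in candidates:
            q, nl, nu = stats(f, Y)
            if q and any(abs(theta.get((f, j), 1.0 / L) - q[j]) > 1e-12
                         for j in labels):
                chosen = (f, q, nl, nu)
                break
        if chosen is None:
            break                                        # nothing left to improve
        f, q, nl, nu = chosen
        S.add(f)                                         # (2.2)
        for j, v in update(q, nl, nu).items():           # (2.3)
            theta[f, j] = v
        Y = relabel()                                    # (2.4)
    return S, theta, Y

# The three instantiations of section 4.1:
def ys_r(q, nl, nu):                  # YS-R: raw precision
    return q

def ys_p(q, nl, nu):                  # YS-P: peaked precision (eq. 25)
    top = max(q, key=q.get)           # the precision-maximizing label (eq. 26)
    return {j: (nl * q[j] + nu * (j == top)) / (nl + nu) for j in q}

def ys_fs(q, nl, nu, eps=0.1, L=2):   # YS-FS: smoothed precision, fixed eps
    # eps and the label count L are illustrative defaults, not canonical values
    return {j: (nl * q[j] + eps) / (nl + L * eps) for j in q}
```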

Note that peaked precision involves a variable amount of "peaking"; the mixing parameters depend on the relative proportions of labeled and unlabeled examples. Note also that j† is a function of f, though we do not explicitly represent that dependence.

The three instantiations of algorithm YS that we consider are:

YS-P ("peaked")            θ_{fj} = q̂_f(j)
YS-R ("raw")               θ_{fj} = q_f(j)
YS-FS ("fixed smoothing")  θ_{fj} = q̃_f(j)

We will show that the first two algorithms reduce K in each iteration. We will show that the third algorithm, YS-FS, reduces K in iterations in which f_t is a new feature, not previously selected. Unfortunately, we are unable to show that YS-FS reduces K when f_t is a previously selected feature. This suggests using a mixed algorithm in which smoothed precision is used for new features but raw or peaked precision is used for previously selected features.

A final issue with the algorithm schema YS concerns the selection of features in step (2.1). The schema as stated does not specify which feature is to be selected. In essence, the manner in which rules are selected does not matter, as long as one selects rules that have room for improvement, in the sense that the current prediction distribution θ_{f·} differs from raw precision q_f(·). (The justification for this choice is given in theorem 9 below.) The theorems of the following sections show that K decreases in each iteration, so long as any such rule can be found. One could choose greedily by choosing the feature that maximizes gain G (eq. 27), though in the next section we give lower bounds for G that are rather more easily computed (theorems 6 and 7).

4.2 Gain

From this point on, we consider a single iteration of the YS algorithm, and discard the variable t. We write θ^{old} and φ^{old} for the parameter set and labeling at the beginning of the iteration, and we write simply θ and φ for the new parameter set and new labeling. The set Λ (resp. V) represents the examples that are labeled (resp. unlabeled) at the beginning of the iteration. The selected feature is f. We wish to choose a prediction distribution for f so as to guarantee that K decreases in each iteration. The gain in the current iteration is:

G = Σ_{x∈X} Σ_{g∈F_x} [ H(φ^{old}_x ‖ θ^{old}_g) − H(φ_x ‖ θ_g) ]    (27)

Gain is the negative change in K; it is positive when K decreases.

In considering the reduction in K from (φ^{old}, θ^{old}) to (φ, θ), it will be convenient to consider the following intermediate values:

K_0 = Σ_{x∈X} Σ_{g∈F_x} H(φ^{old}_x ‖ θ^{old}_g)
K_1 = Σ_{x∈X} Σ_{g∈F_x} H(ψ_x ‖ θ^{old}_g)
K_2 = Σ_{x∈X} Σ_{g∈F_x} H(ψ_x ‖ θ_g)
K_3 = Σ_{x∈X} Σ_{g∈F_x} H(φ_x ‖ θ_g)

where:

ψ_x(j) = [[j = j‡]]    if x ∈ V_f
         φ^{old}_x(j)  otherwise

and:

j‡ ≡ arg max_j θ_{fj}    (28)

One should note that:

- θ_{f·} is the new prediction distribution for the candidate f; θ_{g·} = θ^{old}_{g·} for g ≠ f.
- φ is the new label distribution, after relabeling. It is defined as:

  φ_x(j) = [[j = ŷ(x)]]  if x ∈ Λ ∪ X_f
           1/L           otherwise    (29)

- For x ∈ V_f, the only selected feature of x at t+1 is f, hence ŷ = j‡ for such examples. It follows that ψ and φ agree on examples in V_f. They also agree on examples that are unlabeled at t+1, assigning them the uniform label distribution. If ψ and φ differ, it is only on old labeled examples (Λ) that need to be relabeled, given the addition of f.

The gain G can be represented as the sum of three intermediate gains, corresponding to the intermediate values just defined:

G = G_V + G_θ + G_Λ    (30)

where:

G_V = K_0 − K_1
G_θ = K_1 − K_2
G_Λ = K_2 − K_3

The gain G_V intuitively represents the gain that is attributable to labeling previously unlabeled examples in accordance with the predictions of θ. The gain G_θ represents the gain that is attributable to changing the values θ_{fj} where f is the selected feature. The gain G_Λ represents the gain that is attributable to changing the labels of previously labeled examples to make labels agree with the predictions of the new model θ. The gain G_θ corresponds to step (2.3) of algorithm YS, in which θ is changed but φ is held constant; and the combined G_V and G_Λ gains correspond to step (2.4) of algorithm YS, in which φ is changed while holding θ constant.

In the remainder of this section, we derive two lower bounds for G. In following sections, we show that the updates YS-P, YS-R, and YS-FS guarantee that the lower bounds given below are non-negative, hence that G is non-negative.

Lemma 3
G_V = 0

Proof
We show that K remains unchanged if we substitute ψ for φ^{old} in K_0. The only property of ψ that we need is that it agrees with φ^{old} on previously labeled examples. Since ψ_x = φ^{old}_x for x ∈ Λ, we need only consider examples in V. Since these examples are unlabeled at the beginning of the iteration, none of their features have been selected,


More information

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based clustering Model estimation 1 Clustering A basic tool in data mining/pattern recognition: Divide

More information

Expectation Maximization (EM)

Expectation Maximization (EM) Expectation Maximization (EM) The Expectation Maximization (EM) algorithm is one approach to unsupervised, semi-supervised, or lightly supervised learning. In this kind of learning either no labels are

More information

Maximization of Submodular Set Functions

Maximization of Submodular Set Functions Northeastern University Department of Electrical and Computer Engineering Maximization of Submodular Set Functions Biomedical Signal Processing, Imaging, Reasoning, and Learning BSPIRAL) Group Author:

More information

Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes

Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lesson 1 5 October 2016 Learning and Evaluation of Pattern Recognition Processes Outline Notation...2 1. The

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Final Exam, Fall 2002

Final Exam, Fall 2002 15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

But if z is conditioned on, we need to model it:

But if z is conditioned on, we need to model it: Partially Unobserved Variables Lecture 8: Unsupervised Learning & EM Algorithm Sam Roweis October 28, 2003 Certain variables q in our models may be unobserved, either at training time or at test time or

More information

Expectation Maximization, and Learning from Partly Unobserved Data (part 2)

Expectation Maximization, and Learning from Partly Unobserved Data (part 2) Expectation Maximization, and Learning from Partly Unobserved Data (part 2) Machine Learning 10-701 April 2005 Tom M. Mitchell Carnegie Mellon University Clustering Outline K means EM: Mixture of Gaussians

More information

UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics. EECS 281A / STAT 241A Statistical Learning Theory

UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics. EECS 281A / STAT 241A Statistical Learning Theory UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics EECS 281A / STAT 241A Statistical Learning Theory Solutions to Problem Set 2 Fall 2011 Issued: Wednesday,

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Introduction: The Perceptron

Introduction: The Perceptron Introduction: The Perceptron Haim Sompolinsy, MIT October 4, 203 Perceptron Architecture The simplest type of perceptron has a single layer of weights connecting the inputs and output. Formally, the perceptron

More information

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global homas Laurent * 1 James H. von Brecht * 2 Abstract We consider deep linear networks with arbitrary convex differentiable loss. We provide a short and elementary proof of the fact that all local minima

More information

10-701/15-781, Machine Learning: Homework 4

10-701/15-781, Machine Learning: Homework 4 10-701/15-781, Machine Learning: Homewor 4 Aarti Singh Carnegie Mellon University ˆ The assignment is due at 10:30 am beginning of class on Mon, Nov 15, 2010. ˆ Separate you answers into five parts, one

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

The constructible universe

The constructible universe The constructible universe In this set of notes I want to sketch Gödel s proof that CH is consistent with the other axioms of set theory. Gödel s argument goes well beyond this result; his identification

More information

Multilayer Perceptron

Multilayer Perceptron Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4

More information

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Name (NetID): (1 Point)

Name (NetID): (1 Point) CS446: Machine Learning (D) Spring 2017 March 16 th, 2017 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains

More information

Lecture 14. Clustering, K-means, and EM

Lecture 14. Clustering, K-means, and EM Lecture 14. Clustering, K-means, and EM Prof. Alan Yuille Spring 2014 Outline 1. Clustering 2. K-means 3. EM 1 Clustering Task: Given a set of unlabeled data D = {x 1,..., x n }, we do the following: 1.

More information

Introduction to Machine Learning Spring 2018 Note 18

Introduction to Machine Learning Spring 2018 Note 18 CS 189 Introduction to Machine Learning Spring 2018 Note 18 1 Gaussian Discriminant Analysis Recall the idea of generative models: we classify an arbitrary datapoint x with the class label that maximizes

More information

Some Formal Analysis of Rocchio s Similarity-Based Relevance Feedback Algorithm

Some Formal Analysis of Rocchio s Similarity-Based Relevance Feedback Algorithm Some Formal Analysis of Rocchio s Similarity-Based Relevance Feedback Algorithm Zhixiang Chen (chen@cs.panam.edu) Department of Computer Science, University of Texas-Pan American, 1201 West University

More information

arxiv: v1 [math.co] 13 Dec 2014

arxiv: v1 [math.co] 13 Dec 2014 SEARCHING FOR KNIGHTS AND SPIES: A MAJORITY/MINORITY GAME MARK WILDON arxiv:1412.4247v1 [math.co] 13 Dec 2014 Abstract. There are n people, each of whom is either a knight or a spy. It is known that at

More information

CS540 ANSWER SHEET

CS540 ANSWER SHEET CS540 ANSWER SHEET Name Email 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 1 2 Final Examination CS540-1: Introduction to Artificial Intelligence Fall 2016 20 questions, 5 points

More information

Search Directions for Unconstrained Optimization

Search Directions for Unconstrained Optimization 8 CHAPTER 8 Search Directions for Unconstrained Optimization In this chapter we study the choice of search directions used in our basic updating scheme x +1 = x + t d. for solving P min f(x). x R n All

More information

Feedforward Neural Networks

Feedforward Neural Networks Feedforward Neural Networks Michael Collins 1 Introduction In the previous notes, we introduced an important class of models, log-linear models. In this note, we describe feedforward neural networks, which

More information

NOTES ON EXISTENCE AND UNIQUENESS THEOREMS FOR ODES

NOTES ON EXISTENCE AND UNIQUENESS THEOREMS FOR ODES NOTES ON EXISTENCE AND UNIQUENESS THEOREMS FOR ODES JONATHAN LUK These notes discuss theorems on the existence, uniqueness and extension of solutions for ODEs. None of these results are original. The proofs

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Notes on Machine Learning for and

Notes on Machine Learning for and Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori

More information

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

Lecture 4. 1 Learning Non-Linear Classifiers. 2 The Kernel Trick. CS-621 Theory Gems September 27, 2012

Lecture 4. 1 Learning Non-Linear Classifiers. 2 The Kernel Trick. CS-621 Theory Gems September 27, 2012 CS-62 Theory Gems September 27, 22 Lecture 4 Lecturer: Aleksander Mądry Scribes: Alhussein Fawzi Learning Non-Linear Classifiers In the previous lectures, we have focused on finding linear classifiers,

More information

Packing and Covering Dense Graphs

Packing and Covering Dense Graphs Packing and Covering Dense Graphs Noga Alon Yair Caro Raphael Yuster Abstract Let d be a positive integer. A graph G is called d-divisible if d divides the degree of each vertex of G. G is called nowhere

More information

Lecture 7: Passive Learning

Lecture 7: Passive Learning CS 880: Advanced Complexity Theory 2/8/2008 Lecture 7: Passive Learning Instructor: Dieter van Melkebeek Scribe: Tom Watson In the previous lectures, we studied harmonic analysis as a tool for analyzing

More information

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20. 10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the

More information

Partial cubes: structures, characterizations, and constructions

Partial cubes: structures, characterizations, and constructions Partial cubes: structures, characterizations, and constructions Sergei Ovchinnikov San Francisco State University, Mathematics Department, 1600 Holloway Ave., San Francisco, CA 94132 Abstract Partial cubes

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Lecture 8: Bayesian Estimation of Parameters in State Space Models

Lecture 8: Bayesian Estimation of Parameters in State Space Models in State Space Models March 30, 2016 Contents 1 Bayesian estimation of parameters in state space models 2 Computational methods for parameter estimation 3 Practical parameter estimation in state space

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

The simplex algorithm

The simplex algorithm The simplex algorithm The simplex algorithm is the classical method for solving linear programs. Its running time is not polynomial in the worst case. It does yield insight into linear programs, however,

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

Supplementary Information for cryosparc: Algorithms for rapid unsupervised cryo-em structure determination

Supplementary Information for cryosparc: Algorithms for rapid unsupervised cryo-em structure determination Supplementary Information for cryosparc: Algorithms for rapid unsupervised cryo-em structure determination Supplementary Note : Stochastic Gradient Descent (SGD) SGD iteratively optimizes an objective

More information

COM336: Neural Computing

COM336: Neural Computing COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

More information

3.10 Lagrangian relaxation

3.10 Lagrangian relaxation 3.10 Lagrangian relaxation Consider a generic ILP problem min {c t x : Ax b, Dx d, x Z n } with integer coefficients. Suppose Dx d are the complicating constraints. Often the linear relaxation and the

More information

Complexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler

Complexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler Complexity Theory Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien 15 May, 2018 Reinhard

More information

arxiv: v1 [cs.lg] 15 Aug 2017

arxiv: v1 [cs.lg] 15 Aug 2017 Theoretical Foundation of Co-Training and Disagreement-Based Algorithms arxiv:1708.04403v1 [cs.lg] 15 Aug 017 Abstract Wei Wang, Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing

More information

Outline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181.

Outline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181. Complexity Theory Complexity Theory Outline Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität

More information

Supplementary appendix to the paper Hierarchical cheap talk Not for publication

Supplementary appendix to the paper Hierarchical cheap talk Not for publication Supplementary appendix to the paper Hierarchical cheap talk Not for publication Attila Ambrus, Eduardo M. Azevedo, and Yuichiro Kamada December 3, 011 1 Monotonicity of the set of pure-strategy equilibria

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Notes on the framework of Ando and Zhang (2005) 1 Beyond learning good functions: learning good spaces

Notes on the framework of Ando and Zhang (2005) 1 Beyond learning good functions: learning good spaces Notes on the framework of Ando and Zhang (2005 Karl Stratos 1 Beyond learning good functions: learning good spaces 1.1 A single binary classification problem Let X denote the problem domain. Suppose we

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

The Perceptron Algorithm 1

The Perceptron Algorithm 1 CS 64: Machine Learning Spring 5 College of Computer and Information Science Northeastern University Lecture 5 March, 6 Instructor: Bilal Ahmed Scribe: Bilal Ahmed & Virgil Pavlu Introduction The Perceptron

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

On proximal-like methods for equilibrium programming

On proximal-like methods for equilibrium programming On proximal-lie methods for equilibrium programming Nils Langenberg Department of Mathematics, University of Trier 54286 Trier, Germany, langenberg@uni-trier.de Abstract In [?] Flam and Antipin discussed

More information

Linear Discrimination Functions

Linear Discrimination Functions Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach

More information

minimize x subject to (x 2)(x 4) u,

minimize x subject to (x 2)(x 4) u, Math 6366/6367: Optimization and Variational Methods Sample Preliminary Exam Questions 1. Suppose that f : [, L] R is a C 2 -function with f () on (, L) and that you have explicit formulae for

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /

More information

Cross entropy-based importance sampling using Gaussian densities revisited

Cross entropy-based importance sampling using Gaussian densities revisited Cross entropy-based importance sampling using Gaussian densities revisited Sebastian Geyer a,, Iason Papaioannou a, Daniel Straub a a Engineering Ris Analysis Group, Technische Universität München, Arcisstraße

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 2013

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 2013 COS 5: heoretical Machine Learning Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 203 Review of Zero-Sum Games At the end of last lecture, we discussed a model for two player games (call

More information

13 : Variational Inference: Loopy Belief Propagation

13 : Variational Inference: Loopy Belief Propagation 10-708: Probabilistic Graphical Models 10-708, Spring 2014 13 : Variational Inference: Loopy Belief Propagation Lecturer: Eric P. Xing Scribes: Rajarshi Das, Zhengzhong Liu, Dishan Gupta 1 Introduction

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26 Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information