
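CS 224N HW: #3
ARIA HAGHIGHI
SUID: #

1. Smoothing Probability Models

(a). Let $N_r$ be the number of words with $r$ counts and $p_r$ be the probability for a word with $r$ counts in the absolute discounting model, so that $p_r = (r - \delta)/N$ for $r > 0$ and $p_0 = \frac{(V - N_0)\delta}{N_0 N}$. Then

$$\sum_x P_{abs}(x) = \sum_r N_r p_r = \sum_{r>0} N_r p_r + N_0 \frac{(V - N_0)\delta}{N_0 N}$$

$$= \frac{1}{N}\left(\sum_{r>0} N_r (r - \delta) + (V - N_0)\delta\right)$$

$$= \frac{1}{N}\left(\sum_{r>0} N_r r - \delta \sum_{r>0} N_r + (V - N_0)\delta\right)$$

Now we note that $\sum_{r>0} N_r r$ accounts for all the counts seen, which is $N$. Similarly, $\sum_{r>0} N_r$ accounts for the $V - N_0$ words of the vocabulary we have seen. Plugging in these quantities for the summations:

$$\sum_x P_{abs}(x) = \frac{N - (V - N_0)\delta + (V - N_0)\delta}{N} = 1$$

In order to have $p_0 = \frac{(V - N_0)\delta}{N_0 N} \le 1$ we must have the restriction:

$$\delta \le \frac{N N_0}{V - N_0}$$

(b). Using the same symbols as last time, with $p_r = (1 - \alpha) r / N$ for $r > 0$ and $p_0 = \alpha / N_0$:

$$\sum_x P_{lin}(x) = \sum_r N_r p_r = \sum_{r>0} N_r p_r + N_0 \left(\frac{\alpha}{N_0}\right)$$

$$= \frac{1 - \alpha}{N} \sum_{r>0} N_r r + N_0 \left(\frac{\alpha}{N_0}\right)$$

$$= (1 - \alpha) + N_0 \left(\frac{\alpha}{N_0}\right) = (1 - \alpha) + \alpha = 1$$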
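As a numerical sanity check of (a) and (b), the following minimal Python sketch applies both discounting schemes to a toy count table and confirms that each sums to one. The function names, the toy vocabulary, and the values of δ and α are assumptions made for illustration; they are not part of the assignment.

```python
from collections import Counter

def absolute_discount(counts, vocab, delta):
    """P_abs(w) = (r - delta)/N if r > 0, else (V - N0)*delta/(N0*N)."""
    n_total = sum(counts.values())                        # N
    n_seen = sum(1 for w in vocab if counts.get(w, 0) > 0)  # V - N0
    n_unseen = len(vocab) - n_seen                        # N0 (assumed > 0)
    p_unseen = n_seen * delta / (n_unseen * n_total)
    return {w: (counts[w] - delta) / n_total if counts.get(w, 0) > 0 else p_unseen
            for w in vocab}

def linear_discount(counts, vocab, alpha):
    """P_lin(w) = (1 - alpha)*r/N if r > 0, else alpha/N0."""
    n_total = sum(counts.values())
    n_unseen = sum(1 for w in vocab if counts.get(w, 0) == 0)  # assumed > 0
    return {w: (1 - alpha) * counts[w] / n_total if counts.get(w, 0) > 0
            else alpha / n_unseen
            for w in vocab}

# Toy data (hypothetical): N = 6 tokens, V = 5 word types, N0 = 2 unseen types.
toy_counts = Counter("a a a b b c".split())
toy_vocab = ["a", "b", "c", "d", "e"]

abs_p = absolute_discount(toy_counts, toy_vocab, delta=0.5)
lin_p = linear_discount(toy_counts, toy_vocab, alpha=0.1)

print(sum(abs_p.values()), sum(lin_p.values()))
```

Both totals come out as 1 up to floating-point rounding, for any δ and α, as the derivations predict.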

(c). Consider the task of predicting $P(w_n \mid w_1, \ldots, w_{n-1})$ where $C(w_n, \ldots, w_1) > 0$. Write $r = C(w_n, \ldots, w_1)$ and $N = C(w_{n-1}, \ldots, w_1)$, so that the MLE is $r/N$.

For the absolute discounting model we have:

$$P_{abs}(w_n \mid w_1, \ldots, w_{n-1}) = \frac{C(w_n, \ldots, w_1) - \delta}{C(w_{n-1}, \ldots, w_1)} = \text{MLE} - \frac{\delta}{N}$$

For the linear discounting model we have:

$$P_{lin}(w_n \mid w_1, \ldots, w_{n-1}) = (1 - \alpha)\frac{r}{N} = (1 - \alpha)\,\text{MLE}$$

The absolute discounting method has the advantage that as $N \to \infty$, our predicted probability approaches the MLE. From a theoretical standpoint this is a desirable property. However, from a practical standpoint (where we often have a sparse, not infinite, amount of data), absolute discounting is a bad idea because, depending on the values of the MLE, $\delta$, and $N$, your predicted probability can go to zero (or even be negative). That is why an additive penalty to the MLE is in general a bad idea. The linear method applies the same scaling to your predicted probability independently of the value of $N$ and the MLE, and will never produce a zero (unless you use $\alpha = 1$) or a negative number. Additionally, the choice of the $\alpha$ parameter is more intuitive; it represents the total probability you want to assign to all the non-occurring words in this context. Therefore, for practical purposes, linear discounting is a better smoothing method for natural language tasks.

2. Linguistic and Mathematical Issues

(a). i. The obvious way in which the assumption doesn't hold is for long-distance dependencies. For instance, a verb which occurs after an unlimited number of intervening modifying phrases must agree with the number of the head of the NP, which can occur arbitrarily far from the verb. A more technical way of stating this worry is that the conditional independence assumptions entailed by an n-order Markov assumption do not hold. Also, the structure of some natural languages does not depend on word order very much. For instance, in Arabic there are many constructions where any permutation of the functional words is, more or less, acceptable. There aren't different parameters constraining whether a follows b or b follows a, but under the Markov assumption the order matters (we can assign equal probabilities to both orders, but the point is they shouldn't be treated as separate parameters at all). This phenomenon is sort of present in English. We analyze sentences according to phrasal structure, and what matters is that for an S, an NP is complemented by a VP; we don't really care about the order of the words in the VP, just that something satisfying that category appears.

ii. Firstly, the Markov assumption makes a good computational model of natural language because it is a tractable one, which is important from a computational perspective. More importantly though, the chief objection to Markov models, that natural language does not satisfy conditional independence (e.g. long-distance dependencies), does not take into account that such unbounded dependencies are rare, even pathological, in natural language. Most dependencies among words have a short range, which can be captured by a sufficiently large Markov order assumption.
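To make the point about order concrete, here is a small Python sketch of a first-order (bigram) Markov model estimated by relative frequency. The toy corpus and the function name `bigram_mle` are assumptions chosen only for illustration. It shows that each ordered pair of words gets its own parameter, so P(b | a) and P(a | b) are estimated separately.

```python
from collections import Counter

def bigram_mle(sentences):
    """Estimate P(next | prev) by relative frequency from tokenized sentences."""
    pair_counts = Counter()
    prev_counts = Counter()
    for tokens in sentences:
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    return {pair: c / prev_counts[pair[0]] for pair, c in pair_counts.items()}

# Hypothetical toy corpus, already tokenized.
toy_corpus = [
    "the big dog barks".split(),
    "the dog barks".split(),
    "the big red dog sleeps".split(),
]

model = bigram_mle(toy_corpus)

# The two orders of the same word pair are distinct parameters.
p_the_big = model.get(("the", "big"), 0.0)   # "big" following "the"
p_big_the = model.get(("big", "the"), 0.0)   # "the" following "big" (never observed)

print(p_the_big, p_big_the)
```

In this corpus "the" is followed by "big" in two of its three occurrences, while "big" is never followed by "the", so the unsmoothed model assigns the reversed order probability zero.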

The Markov assumption tends to capture some rules about POS and order in English. For instance, we can encode the fact that a determiner does not follow an adjective, or that adjectives tend to precede either other adjectives or nouns (not necessarily the ones they modify). Sometimes these simple rules are enough to distinguish grammatical, or likely, sentences from ones which aren't.

(b). i. Note that it is sufficient to show that $p_H = N_H / N$ maximizes the log of the likelihood function, since log is a monotonic function. Let $l(p_H)$ denote the log-likelihood of $p_H$ on data set $D$, where $d_i = 1$ if flip $i$ lands heads and $0$ otherwise, $N_H$ is the number of heads, and $N$ is the total number of flips.

$$l(p_H) = \log \prod_{d_i \in D} p_H^{d_i} (1 - p_H)^{1 - d_i}$$

$$= \sum_{d_i \in D} (d_i) \log p_H + (1 - d_i) \log(1 - p_H)$$

$$= N_H \log p_H + (N - N_H) \log(1 - p_H)$$

Differentiating by $p_H$:

$$l'(p_H) = \frac{N_H}{p_H} - \frac{N - N_H}{1 - p_H}$$

Thus $p_H$ maximizes $l(p_H)$ whenever it satisfies

$$\frac{N_H}{p_H} - \frac{N - N_H}{1 - p_H} = 0$$

$$(1 - p_H) N_H - p_H (N - N_H) = 0$$

$$N_H = p_H (N_H + N - N_H)$$

$$N_H = p_H N$$

$$p_H = \frac{N_H}{N}$$

To see this is indeed a local maximum we need to check that the second derivative is non-positive at this point:

$$l''(p_H) = -\frac{N_H}{p_H^2} - \frac{N - N_H}{(1 - p_H)^2} \le 0$$

Furthermore, since $l''$ is non-positive on all of $(0, 1)$, $l$ is concave there, and since we are maximizing over the compact domain $p_H \in [0, 1]$, we have a global maximum.

ii. The MLE of the data sequence is just the fraction of the data cases that land heads. It overfits the data because it is based on maximizing the likelihood of that particular data sequence. For instance, if we flip one coin and it lands heads, the MLE will be 1, which seems unreasonable for a parameter. The problem with the maximum-likelihood estimate in general is that it does not incorporate our beliefs about what the parameters should be, and we can be misled by it if we don't have a large enough data set (as in the case of one coin flip) and base the estimate solely upon the sparse data. In order for the MLE to be a reasonable estimate we need a large amount of data, and since we don't have access to enough data, smoothing serves to reflect the uncertainty of not having enough data and to avoid fitting too closely to that data.
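The two parts of (b) can be checked numerically with a short Python sketch. It compares the closed-form estimate N_H / N against a brute-force grid search over the log-likelihood, and then shows the single-flip case. The flip sequence, the grid resolution, and the add-one (Laplace) estimate are assumptions introduced here for illustration only; the answer above does not commit to a particular smoothing method.

```python
import math

def log_likelihood(p, flips):
    """l(p) = N_H log p + (N - N_H) log(1 - p) for flips coded 1 = heads, 0 = tails."""
    n_heads = sum(flips)
    n = len(flips)
    return n_heads * math.log(p) + (n - n_heads) * math.log(1 - p)

def mle(flips):
    """Closed-form maximum-likelihood estimate N_H / N."""
    return sum(flips) / len(flips)

def add_one(flips):
    """Add-one smoothed estimate (N_H + 1) / (N + 2), an illustrative choice."""
    return (sum(flips) + 1) / (len(flips) + 2)

flips = [1, 1, 0, 1, 0, 1, 1, 0]            # hypothetical data: 5 heads in 8 flips
p_hat = mle(flips)

# Brute-force check: the best grid point should coincide with N_H / N.
grid = [i / 1000 for i in range(1, 1000)]   # open interval, avoids log(0)
p_grid = max(grid, key=lambda p: log_likelihood(p, flips))

single_flip_mle = mle([1])                  # one flip, heads
single_flip_smoothed = add_one([1])

print(p_hat, p_grid, single_flip_mle, single_flip_smoothed)
```

With a single heads flip the unsmoothed estimate is exactly 1, while the add-one estimate stays strictly between 0 and 1, which is the kind of hedging against sparse data that part ii describes.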

3. Practical Question

(a). The perplexity of recap.input is ppl = , and of recap.gold is ppl = , so the gold standard does indeed have a lower perplexity.

(b). I wrote the following Perl script to create the map, which assumes the text has already been tokenized (one token per line). The file is called recap.map.

    #!/usr/bin/perl -w
    # Count each lowercased token and each cased form it appears in.
    while (my $word = <>) {
        chomp $word;
        next unless length($word);
        $count{lc($word)}++;
        $map{lc($word)}{$word}++;
    }
    # One map line per lowercased token: the token, then each cased
    # form followed by its relative frequency.
    foreach (keys %map) {
        print $_;
        while (($key, $value) = each %{$map{$_}}) {
            print " $key " . ($value / $count{$_});
        }
        print "\n";
    }

(c). I used the following command to produce the disambig output:

    ./disambig -order 3 -keep-unk -lm ~/recap.lm -map ~/cs224n/recap.map -text ../../../hw3/recap.input

(d). I used the following command to tokenize files (and strip <s> tags from disambig output) to score for accuracy:

    perl -p -e 's/<\/?s>//g; s/\s+/\n/g; s/^\n$//g;'

I used the following script to score the accuracy of the TrUeCaSe model (the script assumes the input files are tokenized versions according to the command above). Note that I'm removing words which don't contain letters, because these numbers or blocks of punctuation give the models freebies they don't deserve.

    #!/usr/bin/perl
    open F1, shift or die "Couldn't Find\n";
    open F2, shift or die "Couldn't Find\n";
    my $count = 0;
    my $error = 0;
    while (my $word1 = <F1>) {
        $count++;
        my $word2 = <F2>;
        chomp $word1;
        chomp $word2;
        # skip tokens without letters when looking for errors
        next unless ($word1 =~ /[a-zA-Z]/);
        if ($word1 ne $word2) {
            $error++;
            print $word1 . " " . $word2 . "\n"; # print the pair of words that indicates an error
        }
    }
    # extra parentheses so the newline is part of what print receives
    print(($error / $count) . "\n");

The error rate of the model (which depends on how the files were tokenized) was 6.37%.

(e). By far most of the errors made were because a word had never been seen before. For instance, the model mapped matushita's to matushita's, because the model had never seen matushita's. However, matushita had been seen before and has probability one to map to Matushita. This could easily be fixed by stripping punctuation at the end of words.

Another error made was mapping "the new zealand radio network ltd." to "the New Zealand radio network ltd." (the string didn't appear at the beginning of a sentence). The problem here of course is that we didn't recognize a proper noun, and this problem may not be solvable by an n-gram language model unless it was trained specifically on a corpus that had this proper noun phrase in it. The only thing that could have tipped it off is that The was capitalized in the middle of a sentence, indicating that a proper noun phrase follows, and that New Zealand is modifying Radio Network.

Another error was mapping london to London when the correct answer was LONDON. Here LONDON was used as a header in an article, which isn't that common in the data the model was trained on. This problem could be solved with a more sophisticated technique where we somehow recognize the sort of text we are in and apply a different language model depending on it. But this involves learning to distinguish types of texts.

(f). I wrote this script to capitalize the first letter of each unknown word in the output of disambig (after it was tokenized). I noticed most of the mistakes were made on the first character of an unknown word. (The first word of each line of the map gives the vocabulary of what we have seen before.)

    #!/usr/bin/perl
    open I, "recap.map" or die "Couldn't Open";
    # Record every lowercased token that appears in the map.
    while (<I>) {
        $seen{$1} = 1 if /^(\w+)/;
    }
    # Uppercase the first character of any token not in the map.
    while (<>) {
        chomp;
        s/(^.)/\u$1/g unless defined $seen{$_};
        print $_ . "\n";
    }

This reduces the error to 5.94%, woohoo!
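The fix suggested for the matushita's error, stripping punctuation at the end of a word before looking it up, could be sketched roughly as follows. This is a Python illustration rather than a change to the Perl scripts used for the reported numbers. The dictionary `CASE_MAP`, the function `truecase`, and the choice of which suffixes to strip are all hypothetical, and the sketch has not been evaluated on the recap data.

```python
import re

# Hypothetical case map: lowercased token -> most likely cased form.
CASE_MAP = {"matushita": "Matushita", "london": "London"}

# Splits a token into a stem and a trailing "'s" or run of punctuation.
_TRAILING = re.compile(r"^(.*?)('s|[^\w\s]+)$")

def truecase(token, case_map):
    """Look the token up directly; if unseen, retry without trailing punctuation."""
    key = token.lower()
    if key in case_map:
        return case_map[key]
    match = _TRAILING.match(key)
    if match and match.group(1) in case_map:
        # Recase the known stem and reattach the stripped suffix unchanged.
        return case_map[match.group(1)] + match.group(2)
    return token  # still unknown: leave the token as it was

print(truecase("matushita's", CASE_MAP))
print(truecase("london,", CASE_MAP))
print(truecase("ltd.", CASE_MAP))
```

Tokens whose stem is also unknown fall through unchanged, so a fallback like this would only help in cases where the stem itself had been seen in training.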
