Part of Speech Tagging of Chinese Sentences Using Genetic Algorithm

K T Lua
Department of Information Systems and Computer Science
National University of Singapore
Kent Ridge, Singapore 119260
email: luakt@iscs.nus.sg
http://www.iscs.nus.sg/~luakt

Abstract

This is a short report on the use of a genetic algorithm (GA) for part-of-speech (POS) tagging of Chinese sentences. Our preliminary work shows that we can correctly tag 93.8% of the sentences using bigrams and word-POS frequencies. The hit rate increases to 95.8% if tri-grams are used. Compared with other approaches to tagging, such as rule-based methods, HMM and recurrent neural nets, GA tagging enjoys great flexibility in being able to incorporate all information (statistical and rule-based) into one integrated system. GA tagging would therefore provide the highest possible hit rates. However, GA is very computationally intensive. Our prototype GA, written in FOXPRO, spends more than 1 ms per word per generation; tagging a sentence over about 1000 generations takes several seconds.

1 Introduction

The problem of tagging Chinese sentences belongs to a class of general optimization problems. Every Chinese word has a number of possible tags associated with it, so for a typical sentence of about 10 words the total number of possible taggings is huge. For example, if each word has 5 possible tags, the number of ways of tagging the sentence is 5^10, or about 10 x 10^6.

Existing approaches to this problem include HMM, recurrent neural nets, annealing, relaxation, etc. (Brill, 1992; Kupiec, 1992; Lin and Tsai, 1987; Lin, Chiang and Su, 1992; Liu, Chen, Chang and Chin, 1995; Pong and Chang, 1993). All of them are designed to identify the most probable tag sequence based on the statistical occurrence of tag n-grams and word-tag frequencies.

In a bigram HMM model, the tag-bigram probabilities P(tag_i+1 | tag_i), with i = 1, 2, ..., and the word-tag probabilities P(tag_i | word_i) are used in the computation. The solution is obtained by maximizing the expression

    P = Π P(tag_i+1 | tag_i) P(tag_i | word_i)                    [1]

We may also convert the above expression into logarithms. Consider

    I(tag_i, tag_i+1) = -log2 P(tag_i+1 | tag_i)                  [2]
    I(tag_i, word_i)  = -log2 P(tag_i | word_i)                   [3]

and we have

    I = Σ I(tag_i, tag_i+1) + Σ I(tag_i, word_i)                  [4]

Since [2] and [3] are negative logarithms, we are now looking for the sequence of tags that minimizes the value of I in [4].
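As an illustration of how a candidate tag sequence is scored under [1]-[4], here is a minimal sketch in Python (the prototype described later was written in FoxPro); the probability tables below are hypothetical placeholders standing in for the bigram and word-POS tables described in Section 3.1.

    import math

    # Hypothetical probability tables; in practice they are estimated from a
    # hand-tagged corpus (Section 3.1): P(tag_i+1 | tag_i) and P(tag_i | word_i).
    P_BIGRAM = {("r", "v"): 0.30, ("v", "n"): 0.25}
    P_WORD_TAG = {("我", "r"): 0.95, ("爱", "v"): 0.80, ("你", "r"): 0.90}

    def info(p, floor=1e-9):
        """Information value I = -log2 P, floored so unseen events stay finite."""
        return -math.log2(max(p, floor))

    def sequence_information(words, tags):
        """Total information I of a tag sequence, equation [4].

        Minimising this value is equivalent to maximising P in equation [1],
        so it can serve directly as the quantity a tagger searches over.
        """
        total = sum(info(P_WORD_TAG.get((w, t), 0.0)) for w, t in zip(words, tags))
        total += sum(info(P_BIGRAM.get((a, b), 0.0)) for a, b in zip(tags, tags[1:]))
        return total

    # Example: score one candidate tagging of a three-word sentence.
    print(sequence_information(["我", "爱", "你"], ["r", "v", "r"]))

Any search procedure, whether exhaustive, HMM decoding or a GA, can use such a function as the quantity to minimize.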

Limitations of HMM

In a bigram model, the HMM looks at only two inputs at a time: it cares only about the connection between a tag and its immediate predecessor. To bring more tags into consideration, we have to move from bigrams to tri-grams, 4-grams and beyond. Then the problem of sparse data, or of an inadequate number of parameters, emerges. For example, with a tagset of 100 tags, a tri-gram model has 100^3, or 1,000,000, unique tri-grams; for 4-grams this number grows to 100,000,000. In practice, we may have a corpus big enough for tri-grams, but we will definitely have problems with 4-grams, as the corpus would have to contain more than 100 million words. A similar problem occurs in all other statistically based approaches.

The second problem of an HMM is that it is purely statistical. The HMM always selects the tag sequence with the maximum probability. This is fine if we have a corpus of infinite size in which all possible combinations of n-grams have occurred. If this is not the case, we have to rely on rules. However, at the present moment, incorporating rules into an HMM is still quite difficult.

Advantage of Genetic Algorithm

The situation is quite different from the angle of the genetic algorithm (GA; Fogel, 1995). In a GA, a fitness function is defined to guide the search for the best answer. The function, both its definition and its computation, is totally independent of the search process. We have the liberty of defining (and also changing) the fitness function in any way we like; we can even include an HMM or a complete RNN as part of the fitness function. The only requirements on the fitness function are:

1. The fitness function can be computed.
2. We have some knowledge about the solution.
3. The fittest can be defined.

Once we have a definition of the fittest for a fitness function, knowing that it can be expressed and computed, derived or experimentally obtained, we are ready to apply GA in the search for an optimal solution. One very important point to note is that we have to suggest the solutions ourselves, as the GA will not generate a solution for us. It simply assesses our suggestions and helps us search for the optimized one. (A sketch of such a pluggable fitness function is given below.)
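To illustrate this flexibility, the hypothetical sketch below wraps the statistical cost of equation [4] (the sequence_information helper from the previous sketch) in a fitness function that can also apply hand-written rule penalties; the rule shown and its weight are invented for illustration and are not part of our prototype.

    # A minimal sketch of a pluggable GA fitness function, assuming the
    # sequence_information(words, tags) helper from the previous sketch.
    # Lower fitness is better, matching the min(FITNESS) criterion in Section 3.3.

    def rule_penalties(words, tags):
        """Hypothetical hand-written rules; each violation adds a fixed cost.

        Example rule (invented for illustration): a classifier/measure tag
        (beginning with 'q') should not directly follow another such tag.
        """
        penalty = 0.0
        for a, b in zip(tags, tags[1:]):
            if a.startswith("q") and b.startswith("q"):
                penalty += 10.0
        return penalty

    def fitness(words, tags, rule_weight=1.0):
        """Statistical cost (equation [4]) plus weighted rule penalties.

        New attributes (more rules, other statistical scores) can be added here
        without touching the GA search itself.
        """
        return sequence_information(words, tags) + rule_weight * rule_penalties(words, tags)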

Heuristic Search

The search for an optimal solution in a GA is heuristic by nature. Possible solutions are suggested and fitness values are obtained for them; the GA, through generations of evolution, then provides the likely optimal solutions. In this way, the time complexity of O(p^n) is reduced to O(P*f(n)), where f(n) gives the dependency between the number of possible selections and the number of generations needed to reach an optimal solution, and n is the number of entities in the solution (equal to the number of genes in the chromosome). For a sentence of 10 words, each with 5 possible tags, the GA computation effort is 10 x 20 x 1000 = 200,000 evaluations, given a population size of 20 and 1000 generations. This compares very favourably with the 5^10 computations of an exhaustive approach.

Disadvantage of Genetic Algorithm

The GA's search for an optimal solution takes very much longer than those of HMM and RNN. In the latter cases, the time-consuming part is the training: after training, they normally provide the optimal solution in one round of computation. Typically, an HMM tags more than 100 sentences in one second, and is thus at least 100 times faster than GA tagging. However, we have to weigh this against the long training time of HMM and RNN; it sometimes takes several days and nights to train an RNN. A GA, once it has obtained the necessary parameters (such as the n-gram and word-tag frequency data), is ready to provide solutions. There is no training in a GA at all. The GA is therefore readily adaptable to changing inputs, whereas an RNN is more difficult to adapt to changes in its inputs. An HMM can sometimes be as fast to prepare, but it is not as flexible as a GA in incorporating rules and functions into its computation.

This short paper is our first on the use of GA in Chinese NLP. We use it to perform POS tagging on a medium-sized corpus of about 200,000 characters obtained from Tsing Hua University. We want to verify that a GA can tag Chinese sentences to the same degree of accuracy as HMM and RNN. We also wish to add rules to the system in the near future to improve the hit rates.

In the sections below, we first give a brief description of the corpus and tagset we used. In Section 3, we provide a detailed description of the GA program we developed for tagging. In Section 4, the performance of our GA tagger is given, and in the last section we present our conclusions.

2. Corpus and Tagset

We obtained a POS hand-tagged corpus from a researcher at Tsing Hua University. The corpus contains 179,159 words and tokens such as punctuation marks and begin- and end-of-sentence marks. The words were tagged manually using the following tagset (see Table 1):

Table 1: Tsing Hua Tagset (each fine-grained tag is shown with its tag group in braces)

    nf {nf}, npf {np}, npu {np}, npr {np}, ng {ng}, t {t}, s {s}, f {f},
    vg {vg}, vgo {vg}, vgn {vg}, vgv {vg}, vga {vg}, vgs {vg}, vgd {vg}, vgj {vg},
    va {va}, vc {vc}, vi {vi}, vy {vy}, vh {vh}, vv {vv}, vf {vf},
    ag {ag}, ac {ac}, z {z}, b {b},
    mx {m}, mw {m}, mg {m}, mm {m}, mh {m}, mo {m}, maf {m}, mam {m}, mab {m},
    qni {qn}, qnc {qn}, qnk {qn}, qng {qn}, qnm {qn}, qns {qn}, qnv {qn}, qnf {qn}, qnz {qn},
    qv {qv}, qt {qt},
    rm {rm}, rn {rn}, rp {rp}, rd {rd},
    pg {p}, pba {p}, pbei {p}, pzai {p},
    db {db}, dd {dd}, dr {dr},
    cf {cf}, cmw {cm}, cmc {cm}, cms {cm}, cbw {cb}, cbc {cb}, cbs {cb},
    usde {usde}, uszh {uszh}, ussi {ussi}, usdi {usdi}, usdf {usdf}, ussu {ussu}, ussb {ussb},
    utl {ut}, utz {ut}, utg {ut}, ur {ur},
    y {y}, o {o}, e {e}, h {h},
    kn {k}, kv {k}, i {i}, j {j}, l {l}, x {x}, xch {x}, xfl {x}, @

Readers may like to compare the Tsing Hua Tagset with the Beijing University Tagset from Yu Shiwen and the Taiwan Tagset from ROCLING (Yu Shiwen, 1995; Liu et al., 1995).

3. Our Genetic Algorithm

To evaluate the performance of GA, we start by developing a GA in its simplest form. Our GA consists of only the following necessary parts: (1) preliminary data setup, (2) population generation, (3) fitness computation and evaluation, and (4) mutation.

3.1 Preliminary Data Setup

As this is a bigram and tri-gram POS tagger, we have to prepare the following three tables: (1) POS bigrams, (2) POS tri-grams and (3) word-POS frequencies. The POS bigrams and tri-grams are extracted from the corpus, stored in a file, sorted and counted; the probabilities of occurrence are then computed. We find this to be the only practical way of computing the n-gram probabilities. Using an array could speed up the computation a little for bigrams, but the array would grow so huge that no computer could handle it if we were to calculate n-gram data for n >= 4.

3.2 Population Generation

We use a chromosome of the same length (in words) as the sentence. Each word is given a value according to its possible tag. This number is between 1 and 94, as there are 94 POS tags. An array is created for the chromosome, chrom[i], with 1 <= chrom[i] <= 94 and 1 <= i <= word length of the sentence. A random number generator is called to generate the preliminary population:

    chrom[i] = INT(94*RAND() + 1)

Under Visual FoxPro Ver 3.0, RAND() takes 9 µs to execute on a 120 MHz Pentium PC. The total time spent on random number generation for a sentence of 10 words over 1000 generations is 9 µs x 20 x 1000 = 0.18 sec.

3.3 Fitness Computation

In this preliminary work, we include only the bigram (or tri-gram) probabilities and the word-POS frequencies in the fitness function. The fitness function is defined as:

    FITNESS = Σ I(tag_i, word_i) + Σ I(tag_i, tag_i+1)

We convert multiplication into addition by using the entropy values taken directly from the tables. However, even with this approach, each evaluation takes about 5 ms, so tagging a sentence over 1000 generations takes several seconds in total. Detailed timings are given in Table 2.

Table 2: Analysis of Computation Time

    Computation          Time (ms)
    Mutation             2.7
    Evaluation           0.9
    Fitness (bigram)     4.9
    Fitness (monogram)   5.1
    Whole                15.8

About 2/3 of the computation time is spent on computing the fitness values. In this work, we select the minimum entropy value as the optimal solution, i.e., min(FITNESS).
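The following is a minimal sketch of the data setup and population generation steps of Sections 3.1 and 3.2, in Python for illustration (our prototype was written in Visual FoxPro); the corpus format and helper names are assumptions made for the example.

    import math
    import random
    from collections import Counter

    NUM_TAGS = 94   # size of the Tsing Hua tagset; chromosome genes are numbers 1..94
    POP_SIZE = 20   # population size used in our experiments

    def build_tables(tagged_sentences):
        """Estimate the word-POS and POS-bigram information tables (I = -log2 P).

        tagged_sentences is assumed to be a list of sentences, each a list of
        (word, tag) pairs; this corpus format is an assumption for the example.
        """
        word_tag = Counter()
        word_count = Counter()
        tag_count = Counter()
        tag_bigram = Counter()
        for sent in tagged_sentences:
            tags = [t for _, t in sent]
            word_tag.update(sent)
            word_count.update(w for w, _ in sent)
            tag_count.update(tags)
            tag_bigram.update(zip(tags, tags[1:]))
        i_word_tag = {(w, t): -math.log2(c / word_count[w])
                      for (w, t), c in word_tag.items()}
        i_bigram = {(a, b): -math.log2(c / tag_count[a])
                    for (a, b), c in tag_bigram.items()}
        return i_word_tag, i_bigram

    def random_population(sentence_len, pop_size=POP_SIZE, num_tags=NUM_TAGS):
        """Initial population: one gene (a tag number 1..94) per word,
        mirroring chrom[i] = INT(94*RAND() + 1)."""
        return [[random.randint(1, num_tags) for _ in range(sentence_len)]
                for _ in range(pop_size)]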

3.4 Mutation

We have experimented with several approaches to mutation. In the end, we find that there is no need to keep any chromosome except the fittest one. We therefore keep the fittest chromosome and use it to generate another 19 new ones by randomly mutating one of its genes. The gene selected for mutation is assigned a random number between 1 and 94 (recall that we represent the tags as numbers).

We have also experimented with the number of genes to be mutated. Our conclusion is that there is no need to mutate more than one gene, perhaps because the sentences we are tagging are about 10 words or fewer. For longer sentences, we may have to allow more genes to mutate; otherwise the number of generations needed to produce an optimal solution becomes too large. We need to experiment more in this area.

3.5 GA Algorithm

The complete GA program functions as below (a runnable sketch of this loop is given after the pseudocode):

    do while fitness > a pre-determined value and fitness is still improving
        generate a population of size 20
        compute fitness and select the fittest chromosome
        mutation
    enddo
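Here is a minimal runnable sketch of this loop, in Python for illustration, not the FoxPro prototype itself; it assumes a chromosome-level fitness callable (e.g. built from the fitness and tag-numbering helpers sketched earlier, via a hypothetical TAG_NAMES table), and its parameter values follow the ones quoted above.

    import random

    def ga_tag(sentence_len, chrom_fitness, pop_size=20, max_generations=1000,
               num_tags=94, target=0.0):
        """Keep-the-fittest GA described in Sections 3.2-3.5.

        chrom_fitness maps a chromosome (a list of tag numbers 1..num_tags) to a
        cost; lower is better. Each generation keeps the fittest chromosome and
        derives pop_size - 1 mutants from it by changing one random gene.
        """
        best = [random.randint(1, num_tags) for _ in range(sentence_len)]
        best_cost = chrom_fitness(best)
        for _ in range(max_generations):
            population = [best]
            for _ in range(pop_size - 1):
                mutant = list(best)
                mutant[random.randrange(sentence_len)] = random.randint(1, num_tags)
                population.append(mutant)
            candidate = min(population, key=chrom_fitness)
            candidate_cost = chrom_fitness(candidate)
            if candidate_cost < best_cost:
                best, best_cost = candidate, candidate_cost
            if best_cost <= target:
                break
        return best, best_cost

    # Hypothetical usage, assuming TAG_NAMES maps numbers 1..94 to tag strings and
    # fitness(words, tags) is the rule-augmented function sketched earlier:
    #   words = ["我", "爱", "你"]
    #   chrom_fitness = lambda chrom: fitness(words, [TAG_NAMES[g] for g in chrom])
    #   best_chrom, cost = ga_tag(len(words), chrom_fitness)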

4. Results of POS Tagging

Our preliminary work shows that POS tagging with bigrams and word-POS frequencies has a hit rate of 93.8%. Adding POS tri-grams to the calculation increases the hit rate by only 2%, to 95.8%.

Figure 1 shows the hit rates versus generations for sentences of 5 words and 10 words respectively. For shorter sentences, the maximum hit rate is achieved earlier, at around the 600th generation. For longer sentences, the system converges much later, at around the 900th generation. Longer sentences also reach higher hit rates; we do not yet understand this, and more experiments are needed.

Fig 1. Hit rates (%) versus generation (0-1000) for 5-word and 10-word sentences.

5 Conclusion

Our preliminary work shows that we can use a GA to POS tag a Chinese corpus with POS bigrams (or tri-grams) and word-POS frequencies. The hit rates are about the same as those of traditional approaches such as HMM and RNN. The advantage is that the GA allows the inclusion of more attributes: we can always add new attributes to the system without altering its basic structure, so the effect of each new attribute can be examined accurately. This property allows the selection of, and experimentation with, attributes that may have an effect on the tagging. Furthermore, rules can be added when necessary, so we can build a real hybrid system.

The problem of GA tagging is that it is very slow. An HMM normally tags a few hundred sentences in one second; this is a hundred or even a thousand times faster than the GA tagging, which requires several seconds to complete a sentence. We should therefore work harder at solving this problem.

References

Brill, Eric, 1992, A simple rule-based part-of-speech tagger, Proceedings of the 3rd Conference on Applied Natural Language Processing (ACL), pp 152-155.

Fogel, David B., 1995, Evolutionary Computation, IEEE Press.

Kupiec, J., 1992, Robust part-of-speech tagging using a hidden Markov model, Computer Speech and Language, Vol 6, No 3, pp 225-242.

Lin, M. Y. and W. H. Tsai, 1987, Removing the ambiguity of phonetic Chinese input by the relaxation technique, Computer Processing of Chinese and Oriental Languages, Vol 3, No 1, May, pp 1-24.

Lin, Y. C., T. H. Chiang and K. Y. Su, 1992, Discrimination oriented probabilistic tagging, Proceedings of ROCLING V, pp 87-96.

Liu, Shing-Huan, Keh-Jiann Chen, Li-ping Chang and Yeh-Hao Chin, 1995, Automatic part-of-speech tagging for Chinese corpora, Computer Processing of Chinese and Oriental Languages, Vol 9, No 1, pp 31-47.

Pong, T. Y. and J. S. Chang, 1993, A study of word segmentation and tagging for Chinese, Proceedings of ROCLING VI, pp 173-193 (in Chinese).

Yu, Shiwen, 1995, Tagged Singapore Chinese School Texts, Paper R95001, CommCOLIPS, Vol 5.