Question Classification Using Language Modeling
Wei Li
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts, Amherst, MA

ABSTRACT

Question classification assigns a particular class to a question based on the type of answer entity the question represents. In this report, I present two approaches: the traditional regular expression model, which is both efficient and effective for some questions but insufficient when dealing with others; and the language model, a probabilistic approach to solving the problem. Two types of language models have been constructed: unigram models and bigram models. Several issues are explored, such as how to smooth the probabilities and how to combine the two types of models. As expected, the language model outperforms the regular expression model. An even better result can be achieved by combining the two approaches.

1. INTRODUCTION

Question answering is a variant of information retrieval that retrieves specific information rather than documents. A QA system takes a natural language question as input, transforms the question into a query and forwards it to an IR module. When a set of relevant documents is retrieved, the QA system extracts an answer to the question. There are different ways of identifying answers. One of them makes use of a predefined set of entity classes. Given a particular question, the QA system classifies it into one of those classes based on the type of entity it is looking for, identifies entity instances in the documents, and selects the most likely one among all the entities with the same class as the question. This approach involves two tasks. First, we should be able to identify named entities. This is a problem in the Information Extraction area [1], and we can make use of an existing entity tagger. Second, we need to classify questions into different classes, and this is the problem I address here. One approach to question classification is to determine the question type based on the sentence structure and key words, which represent syntactic and semantic information respectively.
A set of patterns is defined and hard-coded, often with regular expressions. When a new question comes in, it is matched against those patterns to find the class it belongs to. As the pattern set becomes more complete and accurate, the performance of this approach improves; so to improve this model, we always face the problem of defining more and more question patterns. To make the process of question classification more dynamic and automatic, we make use of language modeling, a statistical approach that has recently gained much attention in the IR area [2]. In this approach, the models can be constructed automatically from the training set, and its performance is competitive with other approaches. For the QA task, we build one language model for every class of questions based on the training data set. To classify a question, the probability of generating it is calculated for each class based on its language model, and the highest probability determines the classification.
For the rest of this report, I will present the implementation of these two approaches and discuss their performance, with the focus on language modeling. In Section 2, I will talk about two preparation steps: defining question classes and preprocessing questions before classification; Section 3 is about the regular expression model, its pros and cons; Section 4 discusses the language models, two experiments and the combination with the regular expression model; Section 5 examines performance in different cases; Section 6 introduces related work and Section 7 concludes.

2. PREPARATION

Defining question classes is the first step in classification. One important principle when defining these classes is that all the classes we use to mark questions should be recognizable as entities in the documents. This is because question classification is not an independent job, but a component of the QA task. Two kinds of classes are used. Some entity classes are naturally related to question classes, such as person, location, number and so on. Other classes are created for particular types of questions. For example, a frequently asked type of question is: "Who is sb.?" Typically, people want to find quite detailed information about this person with such a question. We don't have a good object class corresponding to this type of question, so we use the term biography to denote the type of answers for it and add it to the question class set.

Another preprocessing step is to re-form the question to make its underlying pattern clearer. For example, the "Who is sb.?" questions always ask for a biography entity no matter what person's name appears in the question. In other words, the important thing is to know that the question contains a person entity; we do not care about the specific entity. So we can safely change questions of this pattern into "Who is <PERSON>?" without losing any information useful in determining the question type. What we actually do is run an entity recognizer, the major part of which is IdentiFinder [3], against the questions and replace all entities with their entity class names.

3.
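The re-forming step can be sketched as follows. This is a minimal illustration only: the report uses BBN's IdentiFinder to find the entities, while here a small hand-made dictionary of surface forms stands in for the tagger, so the entity list is purely hypothetical.

```python
import re

# Hypothetical stand-in for an entity tagger such as IdentiFinder:
# a map from surface strings to entity class names.
ENTITIES = {
    "George Washington": "PERSON",
    "Amherst": "LOCATION",
}

def reform(question):
    """Replace every known entity with its class name, e.g. <PERSON>."""
    for surface, cls in ENTITIES.items():
        question = re.sub(re.escape(surface), "<" + cls + ">", question)
    return question
```

With this, "Who is George Washington?" becomes "Who is <PERSON>?", which exposes the underlying question pattern regardless of which person is asked about.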
REGULAR EXPRESSION MODEL

The basic idea of this model is to determine a question type based on the sentence pattern, which includes the interrogative word, certain sequences of words and some representative terms of particular question classes. Those patterns are defined with regular expressions. For example, a question starting with "how many" is very likely to be looking for a number, and a question starting with "where" is probably a location question. For a "what" question, we can look for some key words to make our decision; for example, "agency", "company" and "university" are related to the organization class. Here are some regular expressions used for certain classes of questions:

Questions that start with "what" and ask for a person entity:
(actor|actresse?|attorney|...|teacher|...|senator)s?

Questions that start with "how" and ask for a length entity:
(long|short|wide|far|close|big.*|diameter|radius)

This approach is very efficient and effective on some question patterns, such as "how many" questions; it seldom makes mistakes for this type of question. But there are difficult cases that it can hardly handle. For instance, the answer to a "who" question might be a person, an organization, or even a location. Take the question "Who is the largest producer of laptop computers in the world?" as an example. People can easily tell this is asking for an organization, but our program cannot decide its type just based on the question pattern. We need additional semantic
information, which is not available in the regular expression model. The same problem occurs with the "where" questions: many "where" questions are classified as location while they are actually organization questions. The only way to solve this kind of problem is to build a more complete and accurate pattern set, which involves a great deal of human work. Instead of building a larger and larger question pattern model, we turned to a more automatic and flexible approach: language modeling.

4. LANGUAGE MODEL

The basic idea of language modeling is that every piece of text can be viewed as being generated from a language model. If we have two pieces of text, we can define the degree of relevance between them as the probability that they are generated by the same language model. In the information retrieval area, we build one language model for each document. Given a query, we can decide whether a document is relevant based on the probability that its language model generates such a query. Suppose that the query Q is composed of n tokens w_1, w_2, ..., w_n. We can calculate the probability as:

P(Q|D) = P(w_1|D) * P(w_2|D, w_1) * ... * P(w_n|D, w_1, w_2, ..., w_{n-1})

So to build the language model of a document, we need to estimate those term probabilities. Usually, a k-gram assumption is made to simplify the estimation:

P(w_i|D, w_1, w_2, ..., w_{i-1}) = P(w_i|D, w_{i-k+1}, w_{i-k+2}, ..., w_{i-1})

It means that the probability that w_i occurs in the document D depends only on the preceding (k-1) tokens [4]. Similar ideas have been introduced into the question classification task. We build one language model for each category C of sample questions. When a new question Q comes in, we calculate the probability P(Q|C) for each C and pick the category with the highest probability. The major advantage of the language model over the regular expression model is its flexibility. The regular expression model is composed of hard-coded rules, which need to be modified to handle new cases. The language model, however, can be maintained automatically.
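As a concrete, simplified sketch of this classification rule, the following builds one unigram model per category and picks the category with the highest probability. Add-one smoothing is used here purely to keep the sketch short; it is not the report's method (the smoothing techniques actually used are described in Section 4).

```python
import math
from collections import Counter

def train(questions_by_class):
    """Build one unigram model per category, with add-one smoothing."""
    vocab = {w for qs in questions_by_class.values()
               for q in qs for w in q.lower().split()}
    models = {}
    for cls, qs in questions_by_class.items():
        counts = Counter(w for q in qs for w in q.lower().split())
        total = sum(counts.values())
        models[cls] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return models

def classify(models, question):
    """Pick the category whose model gives the question the highest probability."""
    def log_prob(cls):
        probs = models[cls]
        floor = min(probs.values())  # fallback for out-of-vocabulary words
        return sum(math.log(probs.get(w, floor)) for w in question.lower().split())
    return max(models, key=log_prob)
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long questions; the argmax is unchanged.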
And we believe that, with larger sets of training data, the performance of the language model can be improved. Two experiments have been conducted, and both of them include two language models: a unigram and a bigram model. The differences between them are the smoothing technique and the combination method. However, the two experiments provide similar performance. The details are discussed below.

4.1 EXPERIMENT 1

The unigram and bigram models are the two simplest to construct, where

P(Q|C) = P(w_1|C) * P(w_2|C) * ... * P(w_n|C)

and

P(Q|C) = P(w_1|C) * P(w_2|C, w_1) * ... * P(w_n|C, w_{n-1})

respectively. For the unigram model, we need to estimate the probability of a token w occurring in the category C, P(w|C). Intuitively, it should be proportional to the term frequency F(w|C). The tricky part is how to deal with tokens that never occurred in this category. We don't want them to have a probability of 0, so some probability mass must be assigned to them and the probabilities of the other words adjusted accordingly. This kind of smoothing can be done in several ways, and for this experiment, we used an absolute discount method. A small
constant amount of probability is assigned to all 0-occurrence tokens, and the probabilities of the other tokens are discounted accordingly [4]. Here is the formula. Let Total0(C) be the number of 0-occurrence tokens in category C and S be the smoothing discount. So we have:

P(w|C) = F(w|C) * (1 - S)    if F(w|C) > 0
P(w|C) = S / Total0(C)       if F(w|C) = 0

The bigram model is built similarly, where we need to estimate the conditional probability P(w2|C, w1). Let Total0(C, w1) be the number of tokens that never occur after w1 in category C, and let S again be the smoothing discount. There are two cases to consider:

Case 1: F(w1|C) > 0, where the total probability mass of all unseen w2 is S. So we have:

P(w2|C, w1) = F(w2|C, w1) * (1 - S)    if F(w2|C, w1) > 0
P(w2|C, w1) = S / Total0(C, w1)        if F(w2|C, w1) = 0

Case 2: F(w1|C) = 0, where all w2 are unseen. So P(w2|C, w1) should be the same for every w2, which is calculated as follows:

P(w2|C, w1) = 1 / Total0(C, w1)

To make the estimation more accurate, we try to combine the two models. Linear combination is a straightforward way, where

P(Q|C) = λ * Pu(Q|C) + (1 - λ) * Pb(Q|C)

Different values for λ have been tested, and the best one is chosen.

4.2 EXPERIMENT 2

In this experiment, we still build the unigram and bigram models, but a different smoothing technique and combination method are used. For the unigram model, we make use of the Good-Turing estimate [5] for tokens that occur a small number of times or never occur. According to the Good-Turing estimate, P(w|C) should have the following structure:

P(w|C) = α * F(w|C)    if Count(w|C) > M
P(w|C) = q_i           if Count(w|C) = i and 0 <= i <= M

The choices of α, the q_i and M must satisfy Σ_w P(w|C) = 1 and q_{i-1} < q_i. There are several ways to derive the formula, and the result is as follows. Let N be the size of the corpus and n_i the number of tokens that occur i times in C. (Strictly, we should use E(n_i), the expected value of n_i, but this value is not available, so we can only use the directly observed counts instead.) Then

q_i = (i + 1) * n_{i+1} / (n_i * N)

and M is the largest number that satisfies

(i + 1) * n_{i+1} < i * n_i    for i = 1, ..., M

with the inequality failing at i = M + 1.

While the unigram model is built on the Good-Turing estimate, a Back-Off model [4] is developed for the bigrams. The basic idea of the Back-Off model is that P(w2|C, w1) should be proportional to F(w2|C, w1) only when the occurrence count of (w1, w2) in C is larger than a certain number K; otherwise, we just use P(w2|C) to estimate P(w2|C, w1). Here is the formula:

P(w2|C, w1) = α * F(w2|C, w1)    if Count(w2|C, w1) > K
P(w2|C, w1) = β(w1) * P(w2|C)    if Count(w2|C, w1) <= K

α is a discount that subtracts probability mass from the large-occurrence bigrams, and we used the same discount as in the Good-Turing estimate. β is chosen for normalization, so that Σ_{w2} P(w2|C, w1) = 1; it is a function of w1. K should be a small number, and we found that 0 provides the best performance for our data. The Back-Off model naturally combines the unigram and bigram models, so to calculate the probability P(Q|C) we can just use the bigram result, i.e., P(Q|C) = Pb(Q|C).

4.3 COMBINED WITH RE MODEL

Although the language model seems more attractive, it still has drawbacks. One of them is unpredictability. For example, since we place no restriction on the classification result of the language model, it is possible to classify a question that starts with "how many" as a person question. On the other hand, this kind of pattern is easy to capture with the regular expression model. So we tried to combine them to improve performance. The language model is modified to generate a ranked list of categories based on the belief score, and the regular expression model returns all categories compatible with the question pattern. The combination policy is that the highest-ranked category accepted by the regular expression model is the final answer. In this way, the mistake mentioned above is avoided.

5. EVALUATION

A set of 693 TREC questions has been used for evaluation. They belong to the following classes:

Class Name      # of questions
PERSON          116
LOCATION        126
DATE            73
ORGANIZATION    64
NUMBER          74
OBJECT          121
REFERENCE       119

When testing the language models, we need a training set to build the models.
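The combination policy of Section 4.3 can be sketched as follows. The regular-expression side is reduced to a few illustrative rules (the report's actual pattern set is much larger), and the language-model ranking is passed in as a plain list.

```python
import re

# Illustrative stand-ins for the report's hand-coded patterns: each rule
# maps a question pattern to the set of categories compatible with it.
RE_RULES = [
    (re.compile(r"^how (many|much)\b", re.I), {"NUMBER"}),
    (re.compile(r"^when\b", re.I), {"DATE"}),
    (re.compile(r"^where\b", re.I), {"LOCATION", "ORGANIZATION"}),
]

def compatible(question):
    """Return the categories the RE model accepts for this question."""
    for pattern, classes in RE_RULES:
        if pattern.search(question):
            return classes
    return set()  # no pattern matched: the RE model constrains nothing

def combine(lm_ranking, question):
    """Pick the highest-ranked LM category that the RE model accepts."""
    accepted = compatible(question)
    if not accepted:
        return lm_ranking[0]
    for cls in lm_ranking:
        if cls in accepted:
            return cls
    return lm_ranking[0]
```

For example, if the language model mistakenly ranks PERSON above NUMBER for a "how many" question, the RE model accepts only NUMBER, so the combined answer is corrected to NUMBER.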
So we did the experiments in the following way. The whole question set was randomly divided into five equally large disjoint parts. One part is chosen to be the test data, while the other four serve as the training data. Accuracy is calculated by comparing the classification result with the manually assigned classes. The same process is repeated five times, each time with a different test set, and the average accuracy is used to measure the performance.
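This evaluation protocol can be sketched as below. Here `train` and `classify` stand for any model-building and prediction functions, so the five-fold splitting and averaging logic is the only substantive part.

```python
import random

def cross_validate(questions, labels, train, classify, folds=5, seed=0):
    """Average accuracy over `folds` random, equal, disjoint test splits."""
    idx = list(range(len(questions)))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::folds] for i in range(folds)]
    accuracies = []
    for k in range(folds):
        test = parts[k]
        train_idx = [i for p, part in enumerate(parts) if p != k for i in part]
        model = train([questions[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(classify(model, questions[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / folds
```

Fixing the shuffle seed makes the five splits reproducible across runs, which keeps accuracy comparisons between models fair.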
Here are the test results for all the models discussed above:

Model                                      Accuracy
Regular Expression Model only              57.57%
Experiment 1: LM only                      81.54%
Experiment 1: LM combined with RE Model    85.43%
Experiment 2: LM only                      80.96%
Experiment 2: LM combined with RE Model    83.56%

The results show that the language model performs better than the regular expression model, and that the performance can be further improved if we combine them. A little surprisingly, the language model in the first experiment outperforms the second one. We were expecting the reverse result, since both the Good-Turing estimate and the Back-Off model have been shown to perform well in practice. One possible explanation is that our data set is insufficient for the Good-Turing estimate. As discussed above, we used n_i in place of E(n_i). These two values should be close when the data set is large enough, but in our case, with only around 700 questions, this estimate might be quite bad.

6. RELATED WORK

Question classification is a common part of QA systems. The basic idea is the same: to classify questions and identify corresponding entities in documents, but it can be achieved in different ways. Many systems use techniques that are similar to the regular expression model just described. MURAX is an earlier QA system that makes use of an online encyclopedia [6]. Its heuristic is simple: classify questions based on the interrogative words. For "what" questions, which may ask for several types of entities, the encyclopedia is searched for the noun phrase after "what", and the question type is determined accordingly. Another QA system using named entities and question classification is the GuruQA system described by Prager [7]. It maintains a set of patterns and compares questions with them to determine their types. The question type is used as a query term, and the documents have been processed to add types to the named entities. In this way, a document containing a named entity with the same type as the question is more likely to be retrieved.

7.
CONCLUSION

Question answering differs from information retrieval in that it needs to retrieve specific factual information rather than whole documents. This might involve excessive computation if there is no guidance for possible answers. By classifying questions and named entities into the same set of classes, we can eliminate a large amount of irrelevant information. This report has investigated two approaches to question classification: the regular expression model and language modeling. The regular expression model is a simplistic approach and has been put into practice in many systems. Language modeling is a probabilistic approach imported from IR systems; its models are constructed in a more flexible and automatic way. We have built two types of models: a linear combination of unigram and bigram models with an absolute-discount smoothing technique, and a Back-Off bigram model with the Good-Turing estimate. The test results show that the language model outperforms the regular expression model, and an even better result can be achieved when the two models are combined. Although the Good-Turing and Back-Off models have proved effective in practice, the
second language model doesn't improve the performance over the first one.

ACKNOWLEDGMENTS

This material is based on work supported in part by the Center for Intelligent Information Retrieval and in part by NSF grant #EIA. The author would like to thank David Pinto, Bruce Croft, Andres Corrada-Emmanuel and David Fisher for their help and support. Any opinions, findings and conclusions or recommendations expressed in this material are the author's and do not necessarily reflect those of the sponsors.

REFERENCES

[1] R. Srihari and W. Li, Information Extraction Supported Question Answering.
[2] J. M. Ponte and W. B. Croft, A Language Modeling Approach to Information Retrieval.
[3] BBN official site about IdentiFinder.
[4] C. Manning and H. Schutze, Foundations of Statistical Natural Language Processing.
[5] F. Jelinek, Statistical Methods for Speech Recognition.
[6] J. Kupiec, MURAX: A Robust Linguistic Approach For Question Answering Using An On-Line Encyclopedia.
[7] J. Prager, E. Brown and A. Coden, Question-Answering by Predictive Annotation.
Resource Allocaton and Decson Analss (ECON 800) Sprng 04 Foundatons of Regresson Analss Readng: Regresson Analss (ECON 800 Coursepak, Page 3) Defntons and Concepts: Regresson Analss statstcal technques
More informationP R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /
Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons
More informationComputation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models
Computaton of Hgher Order Moments from Two Multnomal Overdsperson Lkelhood Models BY J. T. NEWCOMER, N. K. NEERCHAL Department of Mathematcs and Statstcs, Unversty of Maryland, Baltmore County, Baltmore,
More informationChapter 6. Supplemental Text Material
Chapter 6. Supplemental Text Materal S6-. actor Effect Estmates are Least Squares Estmates We have gven heurstc or ntutve explanatons of how the estmates of the factor effects are obtaned n the textboo.
More informationDepartment of Statistics University of Toronto STA305H1S / 1004 HS Design and Analysis of Experiments Term Test - Winter Solution
Department of Statstcs Unversty of Toronto STA35HS / HS Desgn and Analyss of Experments Term Test - Wnter - Soluton February, Last Name: Frst Name: Student Number: Instructons: Tme: hours. Ads: a non-programmable
More informationOn the correction of the h-index for career length
1 On the correcton of the h-ndex for career length by L. Egghe Unverstet Hasselt (UHasselt), Campus Depenbeek, Agoralaan, B-3590 Depenbeek, Belgum 1 and Unverstet Antwerpen (UA), IBW, Stadscampus, Venusstraat
More informationCHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE
CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng
More informationFREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced,
FREQUENCY DISTRIBUTIONS Page 1 of 6 I. Introducton 1. The dea of a frequency dstrbuton for sets of observatons wll be ntroduced, together wth some of the mechancs for constructng dstrbutons of data. Then
More informationA Robust Method for Calculating the Correlation Coefficient
A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal
More informationANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)
Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of
More informationLecture 13 APPROXIMATION OF SECOMD ORDER DERIVATIVES
COMPUTATIONAL FLUID DYNAMICS: FDM: Appromaton of Second Order Dervatves Lecture APPROXIMATION OF SECOMD ORDER DERIVATIVES. APPROXIMATION OF SECOND ORDER DERIVATIVES Second order dervatves appear n dffusve
More informationAppendix B: Resampling Algorithms
407 Appendx B: Resamplng Algorthms A common problem of all partcle flters s the degeneracy of weghts, whch conssts of the unbounded ncrease of the varance of the mportance weghts ω [ ] of the partcles
More informationCS-433: Simulation and Modeling Modeling and Probability Review
CS-433: Smulaton and Modelng Modelng and Probablty Revew Exercse 1. (Probablty of Smple Events) Exercse 1.1 The owner of a camera shop receves a shpment of fve cameras from a camera manufacturer. Unknown
More informationSupport Vector Machines. Vibhav Gogate The University of Texas at dallas
Support Vector Machnes Vbhav Gogate he Unversty of exas at dallas What We have Learned So Far? 1. Decson rees. Naïve Bayes 3. Lnear Regresson 4. Logstc Regresson 5. Perceptron 6. Neural networks 7. K-Nearest
More informationCIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M
CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute
More information10-701/ Machine Learning, Fall 2005 Homework 3
10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40
More informationChapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems
Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons
More informationKernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan
Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems
More informationTHE CHINESE REMAINDER THEOREM. We should thank the Chinese for their wonderful remainder theorem. Glenn Stevens
THE CHINESE REMAINDER THEOREM KEITH CONRAD We should thank the Chnese for ther wonderful remander theorem. Glenn Stevens 1. Introducton The Chnese remander theorem says we can unquely solve any par of
More informationHidden Markov Models
Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,
More informationNotes on Frequency Estimation in Data Streams
Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to
More informationComposite Hypotheses testing
Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter
More informationMAKING A DECISION WHEN DEALING WITH UNCERTAIN CONDITIONS
Luca Căbulea, Mhaela Aldea-Makng a decson when dealng wth uncertan condtons MAKING A DECISION WHEN DEALING WITH UNCERTAIN CONDITIONS. Introducton by Luca Cabulea and Mhaela Aldea The decson theory offers
More informationLecture 3 Stat102, Spring 2007
Lecture 3 Stat0, Sprng 007 Chapter 3. 3.: Introducton to regresson analyss Lnear regresson as a descrptve technque The least-squares equatons Chapter 3.3 Samplng dstrbuton of b 0, b. Contnued n net lecture
More informationLinear Approximation with Regularization and Moving Least Squares
Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...
More informationExpected Value and Variance
MATH 38 Expected Value and Varance Dr. Neal, WKU We now shall dscuss how to fnd the average and standard devaton of a random varable X. Expected Value Defnton. The expected value (or average value, or
More informationChapter 11: Simple Linear Regression and Correlation
Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests
More informationHidden Markov Models
CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte
More informationCHAPTER IV RESEARCH FINDING AND DISCUSSIONS
CHAPTER IV RESEARCH FINDING AND DISCUSSIONS A. Descrpton of Research Fndng. The Implementaton of Learnng Havng ganed the whole needed data, the researcher then dd analyss whch refers to the statstcal data
More informationProblem Set 9 Solutions
Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem
More informationNeryškioji dichotominių testo klausimų ir socialinių rodiklių diferencijavimo savybių klasifikacija
Neryškoj dchotomnų testo klausmų r socalnų rodklų dferencjavmo savybų klasfkacja Aleksandras KRYLOVAS, Natalja KOSAREVA, Julja KARALIŪNAITĖ Technologcal and Economc Development of Economy Receved 9 May
More informationThe optimal delay of the second test is therefore approximately 210 hours earlier than =2.
THE IEC 61508 FORMULAS 223 The optmal delay of the second test s therefore approxmately 210 hours earler than =2. 8.4 The IEC 61508 Formulas IEC 61508-6 provdes approxmaton formulas for the PF for smple
More informationOnline Appendix to: Axiomatization and measurement of Quasi-hyperbolic Discounting
Onlne Appendx to: Axomatzaton and measurement of Quas-hyperbolc Dscountng José Lus Montel Olea Tomasz Strzaleck 1 Sample Selecton As dscussed before our ntal sample conssts of two groups of subjects. Group
More informationBayesian predictive Configural Frequency Analysis
Psychologcal Test and Assessment Modelng, Volume 54, 2012 (3), 285-292 Bayesan predctve Confgural Frequency Analyss Eduardo Gutérrez-Peña 1 Abstract Confgural Frequency Analyss s a method for cell-wse
More informationChapter 8 Indicator Variables
Chapter 8 Indcator Varables In general, e explanatory varables n any regresson analyss are assumed to be quanttatve n nature. For example, e varables lke temperature, dstance, age etc. are quanttatve n
More informationLinear Feature Engineering 11
Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19
More informationNatural Language Processing and Information Retrieval
Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support
More informationLecture 4: November 17, Part 1 Single Buffer Management
Lecturer: Ad Rosén Algorthms for the anagement of Networs Fall 2003-2004 Lecture 4: November 7, 2003 Scrbe: Guy Grebla Part Sngle Buffer anagement In the prevous lecture we taled about the Combned Input
More informationMDL-Based Unsupervised Attribute Ranking
MDL-Based Unsupervsed Attrbute Rankng Zdravko Markov Computer Scence Department Central Connectcut State Unversty New Brtan, CT 06050, USA http://www.cs.ccsu.edu/~markov/ markovz@ccsu.edu MDL-Based Unsupervsed
More informationECE559VV Project Report
ECE559VV Project Report (Supplementary Notes Loc Xuan Bu I. MAX SUM-RATE SCHEDULING: THE UPLINK CASE We have seen (n the presentaton that, for downlnk (broadcast channels, the strategy maxmzng the sum-rate
More informationLecture 2: Prelude to the big shrink
Lecture 2: Prelude to the bg shrnk Last tme A slght detour wth vsualzaton tools (hey, t was the frst day... why not start out wth somethng pretty to look at?) Then, we consdered a smple 120a-style regresson
More information