Intelligent Typo Correction and Text Categorization Using Machine Learning and Ontology Networks. Yinghao Huang


Intelligent Typo Correction and Text Categorization Using Machine Learning and Ontology Networks
by Yinghao Huang

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Information Systems Engineering) in the University of Michigan-Dearborn, 2014

Doctoral Committee:
Professor Yi Lu Murphey, Chair
Professor William Grosky
Assistant Professor Hafiz Malik
Associate Professor Paul Watta

2014 Yinghao Huang. All Rights Reserved

DEDICATION

I dedicate this dissertation to my beloved family: to my parents, as without their love and support over the years this work would not be possible, and to my loving wife, Yi Gao, and my dear daughter, Mackenzie, who make our life an inspirational journey.

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Yi Lu Murphey, for the valuable guidance, advice, support, and the inspirational discussions we had. I thank her not only for training me in professional research abilities, but also for teaching me rules and principles that will become knowledge and wisdom in my future. I would also like to thank the members of my doctoral committee, Professors Bill Grosky, Paul Watta and Hafiz Malik, for their suggestions and guidance.

I have been lucky to be a part of a group of enthusiastic researchers at the UM Dearborn ISL lab. I appreciate the opportunities for collaboration and discussion with a variety of people, including Jungme Park, Chen Fang, Qi ai, Wenduo Wang, Shen Xu, ai Li, Xipeng Wang, etc. It has been a precious experience to work with them and learn from them. Special thanks to Xipeng Wang for helping me with experiments, discussing research problems with me, and exchanging ideas.

Finally, I would like to thank my wife, Yi Gao, for her support over the years and her encouragement to continue my research work.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
CHAPTER 1. INTRODUCTION
  Motivation
  Problem description and research focus
  Major research contributions
  Structure of dissertation
CHAPTER 2. BACKGROUND AND RELATED WORK
  Text mining
  Typo correction
  Ontology networks
  Text representation models
  Vector Space Model (VSM)
  Latent Semantic Indexing (LSI)
  Statistical topic models
  Text categorization
  Text categorization based on machine learning algorithms
  Text categorization based on statistical topic models
  Text categorization based on ontology network
CHAPTER 3. AUTOMATIC TYPO CORRECTION USING MACHINE LEARNING AND EXTERNAL KNOWLEDGE BASES
  Machine learning algorithms for extracting knowledge for typo correction
  Extracting knowledge of domain-specific terms and acronyms
  3.1.2 Building a lexicon of similar typos and domain-specific abbreviations
  Extracting contextual knowledge
  Assessing typo correction candidates
  Intelligent Typo Detection and Correction (ITDC)
  Typo detection and correction candidate generation
  Word boundary error detection and correction
  Abbreviation processing
  Correction candidate weight generation, ranking and selection
  Empirical study
  Building general knowledge bases
  Building domain specific knowledge bases
  Typo detection and correction
CHAPTER 4. TEXT CATEGORIZATION BASED ON MACHINE LEARNING, STATISTICAL MODELING AND ONTOLOGY NETWORK
  Text categorization based on VSM and LSA topic modeling
  A VSM Model with a new global weighting scheme
  A VSM augmented with LSA topic modeling
  A VSM augmented with semi-supervised LSA topic modeling
  A VSM augmented with WordNet ontology
  An augmented T matrix generated using WordNet
  Generate document-document connection for semi-supervised LSA using WordNet
  A step-by-step example of proposed text representation model generation procedure
  Build VSM model with CE_W global weighting scheme
  Build VSM model with LSA topic modeling
  Build VSM model with WordNet ontology
  Build VSM model with semi-supervised LSA
  System evaluation based on document distance measure
  Generate hybrid VSM model for classification
CHAPTER 5. EMPIRICAL CASE STUDY AND EXPERIMENTAL RESULTS EVALUATION ON TEXT CATEGORIZATION
  5.1 Datasets
  Experiment setup
  Build VSM model with LSA
  Build VSM model with WordNet ontology
  Build VSM model with semi-supervised LSA
  Text categorization performance summary & analysis
CHAPTER 6. CONCLUSION AND FUTURE WORK
REFERENCES

LIST OF FIGURES

Fig. 1. Examples of well-structured and unstructured text documents
Fig. 2. Results of POS tagging on well-structured and unstructured text documents
Fig. 3. Example of WordNet ontology instance visualization
Fig. 4. Example of T matrix generation
Fig. 5. Mathematical Representation of Singular Value Decomposition
Fig. 6. Graphical model representation of LSA
Figure 7. Graphical model representation of LSA
Figure 8. Graphical model representation of LSA
Fig. 9. Extracting knowledge for typo detection and correction
Fig. 10. Four types of typographical spelling errors
Fig. 11. Example of n-gram statistics
Fig. 12. NN_TC_Conf: a neural network for measuring the confidence about a correction candidate of a typo
Fig. 13. QWERTY keyboard distance matrix
Fig. 14. Overview of ITDC (Intelligent Typo Detection and Correction) system
Fig. 15. Candidate ranking and selection
Fig. 16. Example of text categorization accuracy w/o typo correction
Fig. 17. Example of text categorization feature size w/o typo correction
Figure 18. Proposed text categorization model framework
Figure 19. Graphical model representation of semi-supervised LSA
Figure 20. Proposed text categorization model framework
Figure 21. Example of generating related words in WordNet
Figure 22. Example of weighting edges in the tree structure generated for synset
Figure 23. Example of concept generation in add rule
Figure 24. Example of EM algorithm Initialization
Figure 25. Example of EM algorithm E-Step
Figure 26. Example of EM algorithm M-Step
Figure 27. Example of EM algorithm result after convergence
Figure 28. EM algorithm result for testing document u
Figure 29. Example of EM algorithm Initialization for semi-supervised LSA
Figure 30. Example of EM algorithm E-Step for semi-supervised LSA
Figure 31. Example of EM algorithm M-Step for semi-supervised LSA
Figure 32. Example of EM algorithm result for semi-supervised LSA after convergence
Figure 33. Semi-supervised LSA EM algorithm result for testing document u
Figure 34. T generated for semi-supervised LSA
Figure 35. Comparison between LSA and semi-supervised LSA
Figure 36. VSM matrix generation and combination
Figure 37. Example of log-likelihood maximization for LSA
Figure 38. Text categorization performance based on different number of topics generated by LSA
Figure 39. Text categorization performance based on different number of topics generated by LSA
Figure 40. Text categorization performance based on different number of topics generated by LSA

LIST OF TABLES

Table 1 Example of similar term groups
Table 2 Example of Levenshtein distance
Table 3 Example of entries in T&C_L
Table 4 Example of entries in T&C_L
Table 5 Example of entries in T&C_L
Table 6 Example of word boundary errors
Table 7 Example of uncommon ABBREVIATIONS
Table 8 Example of Typos recognized by GTKB
Table 9 Example of Typos recognized by B
Table 10 Typos recognized by B
Table 11 Performance Comparison with State-of-art Spell Checkers on T
Table 12 Performance comparison with state-of-art spell checkers on T
Table 13 Correction candidate List comparison
Table 14 Tf-idf representation for HCI_GF
Table 15 Tf-CE_W representation for HCI_GF
Table 16 C matrix generation for T and u
Table 17 Tf-CE_W representation for T and u after using WordNet replace rule
Table 18 Connection matrix for HCI_GF
Table 19 Euclidean distance between u and training documents based on different text representation - I
Table 20 Euclidean distance between u and training documents based on different text representation - II
Table 21 Text categorization performance based on different number of topics generated by LSA
Table 22 Text categorization performance using WordNet based on different word class
Table 23 Text categorization performance using WordNet based on different hypernym/hyponym and meronym/holonym weighting
Table 24 Text categorization accuracy using semi-supervised LSA based on different hyperweight values
Table 25 Text categorization accuracy using semi-supervised LSA based on different hyperweight values
Table 26 Text categorization accuracy comparison
Table 27 Text categorization average F-1 measure comparison

ABSTRACT

In this dissertation, we present our research work in the text mining field, mainly focusing on accurately and efficiently performing the tasks of text preprocessing and text categorization, using machine learning techniques and ontology networks. Specifically, an innovative intelligent typo correction system, ITDC, is proposed to automatically correct misspellings in text documents using general language knowledge and domain specific knowledge extracted by machine learning algorithms. It has the capability of correcting a broad range of typos, from simple typos such as duplicated, omitted, transposed, or substituted characters, to complex spelling errors, such as word boundary errors, unconventional use of acronyms, and multiple versions of abbreviations of the same words. It uses the generated knowledge for identifying unconventional acronyms, grouping similar words (both correctly spelled and misspelled), and ranking correction candidates.

An innovative text categorization model, VSM_WN_TM, is also presented. VSM_WN_TM is a special Vector Space Model (VSM) that incorporates word frequencies, ontology networks and latent semantic information. Unlike the traditional text representation using only Bag-of-words (BOW) features, it also incorporates semantic and syntactic relationships among words such as synonymy, co-occurrence and context, with the purpose of providing a more inclusive and accurate text representation.

The results from the performed experiments are highly encouraging. The ITDC system is evaluated through a case study that involves the automatic processing of automotive fault diagnostic text documents. The performance measured on automotive fault diagnostic documents provided by two different automotive manufacturers shows that the proposed system outperforms state-of-art spell checking systems. The proposed VSM_WN_TM model is evaluated on three publicly available datasets and one domain-specific dataset. Experiment results show that our approach significantly improves text classification by outperforming baseline approaches such as using only latent features and traditional VSM approaches.

Indexed Keywords: text mining, text preprocessing, text categorization, typo correction, machine learning, ontology networks, statistical language modeling.

CHAPTER 1. INTRODUCTION

1.1. Motivation

We have witnessed the blooming of web-based text resource volume in recent decades within different domains, such as social networks (Twitter, micro blogs, customer reviews, etc.), automotive industries (vehicle diagnostics, customer queries, etc.) and medical science (physician diagnostics, patient medical records, etc.). With the proliferation of text data, text mining techniques have drawn significant attention from people in both academic and industrial fields for quite a long time. Text mining is generally defined as finding interesting patterns and trends in data. Especially in the age of Big Data [1], when data management systems are overloaded daily by data in free text form, text document processing increasingly requires automated, accurate and efficient text mining algorithms. These algorithms can be used to fulfill different objectives, such as data preprocessing, text clustering and categorization, information retrieval, etc.

Text mining is characterized as the process of analyzing text to extract information that is useful for particular purposes. Two key points make text mining special within the field of data mining, which is already a well-developed area that is entering a mature phase [2]. First, text needs to be transferred into numerical representations before conducting further analysis. Second, in most text mining applications, such as clinical document analysis [3,4], instant messages, and automotive diagnostic text mining [5], text has an unstructured and casually written form that causes difficulties for extracting useful information. For example, a lot of spelling errors occur in text data and can be problematic for text data mining.

One of the most crucial tasks in the text mining field is text categorization. This is the task of building a classifier that assigns a pre-defined category to each text document in the text collection, the source of which highly depends on the problem and the application domain. Because the document category is usually defined based on various application requirements, the typical approach to text categorization is to derive numerical features that represent text. After that, supervised machine learning techniques are used to train a classifier based on those extracted features and the categories of pre-classified text documents, to later perform categorization tasks on previously unseen text documents. As a result, the performance of a text categorization task depends on the feature generation method used and the machine learning techniques adopted.

In this dissertation, we present our research work in intelligent text mining that focuses on accurately and efficiently performing the task in two main areas. The first area includes developing a system that automatically corrects misspellings in text documents so that they are much more comprehensible to machines and much more accurate for further processing such as text categorization. The second area is to develop a text categorization model that incorporates word based text features with semantic relationships learned from ontology networks and latent semantic structure information from statistical topic modeling. This hybrid text representation model could significantly help improve text categorization accuracy.

1.2. Problem description and research focus

As discussed in Section 1.1, unstructured and free-style written text documents are commonly found in many text mining applications. These documents often do not follow grammar rules, and contain misspelled words, abbreviations and specific terminologies that are not found in standard English dictionaries and may be barely comprehensible to people outside the application field. We use an example to illustrate the difference in problem complexity between well-structured text documents and unstructured text documents. Fig. 1 illustrates those two document examples. In both documents, non-word misspellings are marked in red. It is obvious that unstructured documents are much harder to understand, and the typos are much more difficult to identify and correct. In terms of accuracy, misspelling correction in such applications is indisputably essential for automatic text retrieval, categorization or clustering systems, since typos can be misinterpreted in many different ways by automatic text processing systems. In terms of efficiency, automated intelligent typo correction is also necessary in text mining applications that deal with vast amounts of data, which makes manual inspection almost impossible.

Fig. 1. Examples of well-structured and unstructured text documents

Many automatic spelling correction systems have already been developed to help people with their typing tasks, such as texting, web searching, etc. However, the current automatic typo correction technologies still fall short in accuracy [6,7].

In text categorization, the document category is defined by users based on the application requirements, such as the topic that a news article discusses, the importance of a vehicle diagnostic record that describes vehicle repair details [5], or the fact stated in a medical diagnostic document of whether or not an injury condition is sustained [4]. Consider the following automotive diagnostic records in two categories. The first record is defined as category A, which is an important document because it describes the root cause of the vehicle problem: connector corroded. The second is defined as category B, which is unimportant, because it only describes the vehicle inspection process in general.

Category-A document: perform abs self oades found rear wheel speeds sensor connector corroded into sensor replace sensor and connector road ese ok clear code

Category-B document: road oades traction control lamp on eec oades code c1280 u415 om cm contact hot line check connection at cm check mounting bolts ok clear code

To solve the above categorization problem, a typical approach to representing a text document is the Vector Space Model (VSM) [8], in which each document is represented by a weighted vector that provides a mathematical representation which is convenient for computation and analysis. However, this approach does not consider the semantic relationships between words, such as synonyms, hyponyms (IS-A relationships between words), etc. The classification accuracy especially deteriorates in document sets where different categories have similar word occurrences [4, 9]. As a result, a text categorization model that captures underlying semantic and syntactic information besides single word occurrence features is greatly needed.

The major research focuses in this dissertation include:
- Automatic typo detection and correction for an unstructured and large-scale text corpus by incorporating general language knowledge and domain specific knowledge generated by machine learning algorithms.
- Incorporating ontology network information and machine learning techniques into automatic text categorization, by capturing semantic and syntactic relationships between words.
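The weighted-vector idea behind the VSM, and the limitation noted above, can be illustrated with a minimal sketch. It assumes plain tf-idf weighting and cosine similarity, which are common choices but not necessarily the weighting used later in this dissertation. The three records are simplified, hypothetical examples, and the function names are illustrative only.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf * idf} vector."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)
        vectors.append({t: f * math.log(n_docs / doc_freq[t])
                        for t, f in term_freq.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Hypothetical, simplified diagnostic records (not taken from the corpus).
records = [
    "sensor connector corroded replace sensor and connector clear code",
    "traction control lamp on check connection check mounting bolts clear code",
    "sensor connector rusted replace sensor",
]
vec_a, vec_b, vec_c = tf_idf_vectors([r.split() for r in records])
print(round(cosine(vec_a, vec_b), 3), round(cosine(vec_a, vec_c), 3))
```

In this sketch "corroded" and "rusted" occupy unrelated dimensions, so the similarity between the first and third records comes only from the words they literally share. This is the gap that semantic relationships between words are meant to close.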

1.3. Major research contributions

The major original contributions of this dissertation include research work in intelligent typo correction and text categorization. First of all, we developed and implemented an intelligent typo detection and correction algorithm, which is a significant step forward towards fully-automatic spelling correction for processing large corpora of unstructured text documents. Secondly, we propose a systematic way of building accurate text representations using single word information, as well as syntactic and semantic relationships between words, to improve the performance of various text mining tasks, such as text categorization, text clustering, predictive analysis, information extraction, etc. Third, our approaches in these two fields can be combined for other real-world applications that require text preprocessing and text categorization. Last but not least, they can be easily transplanted and applied to other text corpora besides those discussed in this dissertation, e.g. text used in social networks such as instant messages and Twitter.

A summary of the major research achievements is presented here:
- Proposed an automated intelligent typo correction framework for unstructured and large-scale text collections, using general language knowledge and domain specific knowledge extracted by machine learning algorithms for identifying unconventional acronyms, grouping similar words (both correctly spelled and misspelled), and ranking correction candidates.
- Proposed a hybrid text categorization approach that focuses on building VSM models with both ontology networks and statistical language models. We first generate the original BOW representation by using an intuitive global weight scheme, and then build VSM models based on the WordNet ontology and statistical topic modeling.
- Evaluated our methods on publicly available datasets and domain-specific datasets:
  - Free-form technician verbatim problem descriptions (VR)
  - Reuters
  - NIST Topic Detection and Tracking corpus (TDT2)
  - 20 Newsgroups
- Evaluated our system by comparing with state-of-art systems and baseline approaches:
  - Google spell checker
  - Aspell spell checker
  - Text categorization based on VSM only
  - Text categorization based on latent semantic features only
- Analyzed text categorization performance, including the influence of the number of latent topics, the hyper weight of document connectivity in the topic model, the word class used for synonym set generation, and ontology weighting between words.
- Created open source packages for the typo correction system, called ITDC, and the text categorization system, called VSM_WN_TM, which can be adapted to other text mining applications.

1.4. Structure of dissertation

The remainder of this dissertation is organized as follows: Chapter 2 discusses the technical background and literature survey in the area of typo correction and text categorization. Chapter 3 introduces the ITDC typo correction system. Chapter 4 presents the VSM_WN_TM text categorization model and details of how to build a text representation model using an ontology network and a statistical language model. Chapter 5 discusses the details of our empirical case study, the evaluation of our system and the performance analysis. Finally, Chapter 6 gives a concise conclusion and discusses future work that can build on the ideas presented here.

CHAPTER 2. BACKGROUND AND RELATED WORK

This chapter describes the background information related to text mining, typo correction and ontology networks. Text representation models are also discussed, including the vector space model and statistical language models. The chapter also gives a literature survey on relevant text categorization approaches, including machine learning algorithms, statistical language models and ontology networks.

2.1 Text mining

Text mining, also known as knowledge discovery from text (KDT), was first mentioned in Feldman et al. [10]. It is defined as the application of text processing and analysis that utilizes combined techniques from information retrieval and natural language processing (NLP) that extract data from text, as well as machine learning and data mining algorithms, with the goal of finding useful patterns in text [11]. Recent estimates show that more than 80% of information is represented in the form of text [12], and this percentage will likely increase due to the continuous availability of online textual information. As a result, there has been significant development of text mining techniques in the previous two decades.

A typical text mining tool first extracts a text document from a text collection and conducts preprocessing steps, such as spell checking, removing special symbols or punctuation, tokenizing sentences into a stream of words, etc. These procedures aim at providing a clean and understandable format of text for both humans and machines. Even though plenty of effort has been made to explore syntactic and semantic information from text at the document level, most approaches are based on the concept that a text document is represented by a set of tokenized words, that is, a bag-of-words (BOW) [13].

The next step is to convert the cleaned text into numeric representations that are more appropriate for further automated processing and analysis by machines. This step is called text encoding [14], and the predominant state-of-art approaches include the vector space model [15] and statistical language models [16], which are discussed in detail later in this chapter.
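As a rough illustration of the preprocessing steps described in this section, the following sketch strips punctuation, tokenizes the text, and builds a bag-of-words count. The regular expression, the function names, and the sample sentence are assumptions made for this example. Spell checking, which this dissertation treats separately, is deliberately left out.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only alphanumeric word tokens.

    Punctuation and special symbols act as separators and are discarded.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(text):
    """Return the bag-of-words representation: token -> occurrence count."""
    return Counter(tokenize(text))

# Hypothetical sample document.
doc = "Replaced sensor and connector; road tested OK. Cleared code, sensor OK!"
bow = bag_of_words(doc)
print(bow.most_common(3))
```

Word order is lost in this representation, which is why the later text encoding step can only rely on term occurrence unless extra semantic information is added.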

The text analysis phase, which follows text encoding, is used to extract interesting patterns or trends from text based on different application requirements. Examples include automatically labeling previously unseen documents based on user-defined categories (text classification), finding groups of documents with similar content (text clustering), and extracting parts of text and assigning specific attributes (information extraction). Common approaches here include machine learning, data mining and statistical analysis.

In summary, a typical work flow for text mining problems includes the following aspects:
- Extracting information for human consumption, including text summarization, document retrieval, information retrieval, etc.
- Assessing document similarity, including text categorization, document clustering, identifying key-phrases, etc.
- Extracting structured information, including entity extraction, information extraction, learning rules from text, etc.
- Mining structured text, including document clustering with links, wrapper induction, etc.

The scope of this dissertation, as mentioned in Chapter 1, mainly includes typo correction and text categorization based on machine learning, statistical language modeling techniques and external knowledge, which are also discussed in detail in Section 2.3 and later in this chapter.

2.2 Typo correction

As we mentioned earlier in Chapter 1, typo detection and correction is an important text processing step in many text mining applications. The major objective here is to reduce the errors made by machines due to the misinterpretation of misspelled text, and to provide a more understandable format for both humans and machines for text mining tasks. In this dissertation, our research focuses on correcting a broad range of typos, including simple duplicated, omitted, transposed, and substituted characters, spelling errors, and unusual use of shorthand and acronyms.

Typos can be divided into two major categories: non-word errors and real-word errors. Conventional spelling checkers typically use dictionaries to detect typos. Each term within a text document is compared against the valid words in a dictionary or a lexicon. Any term that does not match any word in the dictionary is flagged as an error. This kind of typo is called a non-word error [17,18], for instance "abreviate" instead of "abbreviate", or "veruified" instead of "verified". A typing error that results in a valid word, but not the one that the user intended, is called a real-word error [19,20], such as "font seat" instead of "front seat". The focus of our research in this work is mainly on detecting and correcting non-word typos.

In general, two types of applications require typo correction: online programs involving text as input, and offline automatic text document processing. Many online applications, such as web search engines and text based user interface programs, provide a list of correctly spelled words to the user as he/she is typing each word on computers/phones/handheld devices.
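The dictionary-based detection of non-word errors described in this section can be sketched in a few lines. The tiny lexicon, the sample strings, and the function name are hypothetical and chosen only for illustration. A real spelling checker would use a full dictionary.

```python
import re

def find_non_word_errors(text, lexicon):
    """Return the tokens that are absent from the lexicon.

    These are candidate non-word errors. Real-word errors pass unnoticed,
    because every token of such an error is itself a valid word.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in lexicon]

# Hypothetical miniature lexicon.
lexicon = {"please", "abbreviate", "the", "verified", "repair",
           "font", "front", "seat"}

print(find_non_word_errors("Please abreviate the veruified repair", lexicon))
print(find_non_word_errors("font seat", lexicon))
```

The second call returns an empty list even if "front seat" was intended, which shows why dictionary lookup alone cannot detect real-word errors.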

Many automatic spelling correction systems have already been developed to help people with their typing tasks, such as texting, web searching, etc. However, the current automatic typo correction technologies still fall short in accuracy [21,22]. In many text mining applications, such as text document analysis, retrieval, and categorization, typos need to be detected and corrected automatically in order to achieve accurate results efficiently. Our solution is to combine domain specific knowledge with general language knowledge to achieve accurate typo correction.

In this research, we focus on the application of unstructured text mining to vehicle diagnostic records. Discovering knowledge from unstructured text documents has many important applications including automotive fault diagnostics, medical document processing, and social networks [23,24,25,26]. Large amounts of data have been collected in daily operations in many corporations, hospitals and government agencies, much of which is unstructured text data. The sheer volume of data makes manual or even semi-manual categorization or classification cumbersome and fallible. Automatic text categorization technologies have been developed and applied to many application problems including finding answers to similar questions or queries, classifying news by subject or newsgroup, categorizing web pages, organizing messages, etc. Many challenges exist in automatic text processing technologies, including representing semantics and abstract concepts and processing words with semantic ambiguity, such as polysemy and synonymy. Typos add another layer of complexity to automatic text document processing.

The particular type of documents we are interested in has the following characteristics. These documents are short, and typically typed hastily in a very short time period, e.g. in seconds or minutes, by people with varying educational backgrounds and interests. Examples of such documents are short text messages, medical descriptions of patients' symptoms, and vehicle diagnostic documents such as customers' descriptions of vehicle problems and technicians' descriptions of repair processes. These documents often do not follow grammar rules, and contain misspelled words, abbreviations and domain specific terminologies that are not found in standard English dictionaries and are barely comprehensible to people outside the application field.

Most of the current state-of-art typo correction systems are interactive spelling checkers, which return multiple spelling correction candidates and allow the user to select the intended correction [21]. As early as the 1960s, researchers had already started working on text spelling error detection and correction. Many studies have been conducted with the purpose of developing correction techniques for non-word errors. Over time various approaches and successful systems were developed. Popular methods for finding misspellings and assessing suggestion candidates for misspellings include part-of-speech (POS) tagging [27,28], minimum editing distance [28,29], the nearest neighbor search procedure, similarity key methods such as SoundEX systems and Metaphone algorithms [28,30], and a modified version of the Longest Common Subsequence algorithm [19]. Error correction methods, especially for real-word error correction, are typically based on syntactic and semantic knowledge such as n-gram based techniques [17,32,33], and on statistical learning from training data sets, such as web documents that are used as context to help correct typos [33,34,20]. Many methods have been developed to create and rank correction candidate lists for detected typos, such as statistical language models [31], machine learning techniques [18,27,32,35], complex network approaches [36], and noisy channel models [33,37].

The spell checkers used for online applications mostly use online text as the resource for typo correction, such as news pages from the Web that are clean and well-spelled [33]; web queries input to search engines by Internet users [20]; and the Google n-gram dataset [34,39,40]. The interactive spelling checkers discussed above require the user to manually select a candidate to replace the typo. This is good for interactive software systems such as MS Office, but it is not applicable to automatic document processing, such as document retrieval or classification, which requires a system capable of performing fully automated error detection and correction.

A number of automatic typo correction algorithms have been developed so far. A typical approach is to select the best candidate generated for each typo based on word context or semantic information such as POS tagging [21,41,42]. Two interesting typo correction algorithms were proposed by Sebastian and David [41], one based on supervised learning and the other on unsupervised learning. The supervised learning algorithm uses a reverse edit distance method to generate a candidate list and then selects the one on the list that has the highest score based on word occurrence or the word's bigram stemmer score. The unsupervised algorithm attempts to correct spelling errors by first using low-frequency words as candidate typos, with the assumption that a low-frequency word is usually a typo and that a word will be misspelled in exactly the same way very few times. After the candidate typos are selected, the algorithm uses the other words in the document set as a valid lexicon, and selects the best candidate based on a predefined rule set of word context. They reported over 90% accuracy for the supervised learning system on 2200 misspelled words from a NASA database, and over 70% accuracy for the unsupervised system on 5833 misspellings from the Orbiter Structure database. An example of the text documents in the NASA database is shown in Fig. 1. However, for processing unstructured text documents, several issues need to be addressed. First, low-frequency words could also be valid words, especially in smaller document sets. For example, in the vehicle diagnostic corpus we processed, more than 10% of the valid words appear only once, such as "contaminate", "erratic", "unsecured" and so forth. Secondly, a word could be misspelled in the same way repeatedly due to the typing patterns of users. Last but not least, a single English dictionary as the external source for typo detection and correction is insufficient, especially for unstructured domain-specific text with valid words that are not in a typical lexicon.

Patrick et al. developed an automatic spelling correction system that takes advantage of the entire context surrounding a misspelling [42]. They also explore the use of part-of-speech for selecting candidates. The system they proposed has 97-98% accuracy on over 600 randomly selected medical documents. However, this approach is based on the assumption that the POS tagging result is reliable.

Although the systems discussed above all reported high accuracy on typo correction, they were developed and evaluated on text documents that were well written in terms of grammar, structure, and sentence and word boundaries. For unstructured and casually written text documents, such as the engineering diagnostic records [43] we are dealing with, these methods do not work well. We use an example to illustrate the difference in problem complexity between these two types of documents. Fig. 1 illustrates two document examples: one is a well-structured text, and the other is unstructured text. In both documents, misspellings are marked in red. The parsing results generated by using the Stanford POS tagger [38] on the two documents in Fig. 1 are shown in Fig. 2, in which incorrect POS tags are followed by the true POS tags marked in red. Obviously, POS tagging is not reliable when it is applied to the unstructured document, i.e. the vehicle diagnostic records, considering the ambiguity of sentence boundaries and the poor quality of grammar in the document.
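Among the candidate assessment methods surveyed above, minimum edit distance is the simplest to illustrate. The sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and ranks lexicon words by their distance to a typo. The lexicon, the `max_distance` threshold, and the function names are assumptions for this example. It is not the ranking scheme of any of the cited systems.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def rank_candidates(typo, lexicon, max_distance=2):
    """Return (distance, word) pairs within max_distance, closest first."""
    scored = ((levenshtein(typo, word), word) for word in lexicon)
    return sorted(pair for pair in scored if pair[0] <= max_distance)

# Hypothetical domain lexicon.
lexicon = {"connector", "corrector", "sensor"}
print(rank_candidates("conector", lexicon))
```

Edit distance alone often produces ties or near-ties between candidates, which is one reason the surveyed systems combine it with context, frequency, or learned scores.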

Fig. 2. Results of POS tagging on well-structured and unstructured text documents

The above discussion leads to our work in typo correction for processing unstructured documents, with a focus on typo correction techniques for three types of non-word typos: word boundary errors, self-invented abbreviations, and ambiguous acronyms. We attempt to fill in the gap between the interactive and fully-automatic spelling correction techniques for processing large corpora of unstructured text documents. The corrected text documents can then be used for further text processing, such as text document categorization and text document retrieval.
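To make the notion of a word boundary error concrete, the following sketch handles the two simplest cases: a run-on token that should be two words, and a word that was split by a stray space. It is a lexicon-lookup illustration under assumed names and data, not the detection and correction procedure developed in Chapter 3.

```python
def split_run_on(token, lexicon):
    """Return every (left, right) split of token whose halves are both
    lexicon words, e.g. a missing space between two valid words."""
    return [(token[:i], token[i:])
            for i in range(1, len(token))
            if token[:i] in lexicon and token[i:] in lexicon]

def merge_split_word(left, right, lexicon):
    """Return the joined word if two adjacent tokens form a lexicon word
    when concatenated (a stray space inside a word), otherwise None."""
    joined = left + right
    return joined if joined in lexicon else None

# Hypothetical miniature lexicon.
lexicon = {"check", "connection", "connector", "road", "test"}
print(split_run_on("checkconnection", lexicon))
print(merge_split_word("conn", "ector", lexicon))
```

A token can have several valid splits, so a practical system still needs a way to rank the alternatives, for example by context.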

30 2.3 Onology newoks Concening ha alhough machines can do a lo of hings unde human diecions, hey do no undesand human language. eople ae always ying o find a way ha makes machine pocess languages in a moe sophisicaed manne, besides he simple BOW epesenaion. The basic idea hee is ha, if evey documen is maked o eniched by some knowledge ha capues synacic o semanic infomaion beween wods, machines ae able o undesand ex bee. In he field of compue science and ex mining, onology newok is geneally defined as a fomal, explici specificaions of shaed concepualiaions of a domain of inees ha ae shaed by a goup of people [44]. Theefoe, i povides a soluion o faciliae ex undesanding and auomaic pocessing of exual esouces. I is explained in deail as follows by ing and Foo, in [45]: Concepualiaion efes o an absac model of phenomena in he wold by having idenified he elevan conceps of hose phenomena. Explici means ha he ype of conceps used, and he consains on hei use ae explicily defined. Fomal efes o he fac ha he onology should be machine eadable. Shaed eflecs ha onology should capue consensual knowledge acceped by he communiies. Moe specifically, he mos widely used ype of onologies in ex mining applicaions is called eminological onologies, which ae mainly specified by subype-supeype IS-A elaions and descibe conceps by using concep labels o synonyms [44]. Examples of well-known eminological onologies include WodNe [49], Semanic Wiki [50], ec. Fig. 3 shows an example of WodNe onology insance ha explicily povides elaionship beween wods such as synonym, hyponym, ec. Since ou focus in his disseaion is based on eminological 18

ontologies, this will be further discussed in detail in Chapter 4.

Fig. 3. Example of WordNet ontology instance visualization

In this research we define a concept as a set of synonyms that have similar semantic meanings in a specific application domain. In Fig. 3, concepts are highlighted in blue. The undirected connections connect a word to its semantic meanings, and the directed connections represent the IS-A relationship between words, which is also called the hyponym/hypernym relationship. An immediate question for using ontology in text mining applications is how to construct one that can be effectively used for data mining. Ontologies can be learned from various resources such as structured, semi-structured or unstructured text corpora in a specific domain, relational databases, publicly available taxonomies, etc. As a result, ontology learning techniques are divided into two groups: constructing ontologies from scratch using unsupervised learning

methods, and extending existing ontologies using supervised learning and classification methods [44,52,53]. The first approach usually requires a lot of manual or semi-manual work to build a word ontology from scratch, and the adaptability of such an ontology is usually restrained by domain-specific resources [50,54,55]. Therefore, to a large extent, current research work mainly focuses on learning ontologies from existing resources such as English lexicons [56]. Furthermore, most of the state-of-the-art approaches use only nouns for ontology building, and a large share of methods aim at constructing IS-A-related concept hierarchies [44,57,58]. Currently, a number of ontology learning algorithms have already been well developed, and powerful text ontology systems such as WordNet have been generated and made publicly available for academic or industrial use. However, how to utilize such information effectively to facilitate various text data mining applications is still open for exploration. For example, in the field of text categorization, a commonly used approach is to use a publicly available ontology to generate concept-level features as additional information for the text representation [46,47,48], while several issues still pose great difficulties, such as mapping ontology relationships to text feature representations, weighting word features based on ontology relationships, etc. [59,60]. As a result, this dissertation research focuses on how to incorporate information provided by ontology into text mining tasks such as text categorization, rather than on how to learn ontology from textual resources. We believe that a hybrid approach that combines conventional methods with appropriately generated ontology network information can enhance the quality of text representation, and thus improve the performance of text mining tasks using such representation. This leads to several well-known representation models developed by researchers for textual resources in recent decades, which will be discussed in the following section.

2.4 Text representation models

Vector Space Model (VSM)

As mentioned above in section 1.2, the conventional approach of representing a text document is the Vector Space Model (VSM) [8], where a text document is modeled as elements in a vector space. Each element represents an index term that is most useful for identifying the main theme of a document. Baeza-Yates and Ribeiro-Neto gave the definition of the vector space model as follows [61]: For the vector model, the pair $(k_i, d_j)$ represents the occurrence frequency of term $k_i$ in the document $d_j$. A weight $w_{i,j}$ associated with the pair $(k_i, d_j)$ is positive and non-binary. From the above definition, it is clear that VSM generation usually includes three stages: term selection, document indexing and weighting scheme selection. In the first stage, text is tokenized into a stream of words, and a bag of content-bearing terms, also known as indexed terms, is extracted from the document text. This term selection step, in the development of VSM, is considered the most important step, as it largely impacts the system's performance. Generally speaking, it is done in the following several ways:

Stop word removing: Words in a document that do not describe the content, which are called stop words (function words), are removed from the document. The stop words can be identified in some automatic way. For example, terms which have very high or very low frequency can be considered as function words [62].

Stemming: It is the process of reducing inflected words to their stem. For example, the words "accelerate", "acceleration", and "accelerating" are stemmed to the root word, "accel". As a result, it is considered as reducing the dimension of the selected words and helping identify similar words. The most widely used stemming algorithm is the Porter Stemming algorithm [63]. The idea is that the suffixes in English mostly consist of a combination of smaller and simpler suffixes. In this dissertation, the stemming process is not our focus, considering that although a stemming algorithm could reduce the feature dimension, it also loses the information of full terms, and additional storage might be required to store both the stemmed and unstemmed forms [64].

In the case of the classification problem, research work has been done on feature selection using labeled training documents. This process ensures that the selected terms are highly related to the presence of a particular class, using a variety of measures such as Gini Index, Information Gain, Mutual Information, etc. [65,66]. Again, this is not the focus of this dissertation, because these measures are mostly based on training data, and important features for previously unseen documents may be removed at this stage.

In a lot of domain-specific applications in which text documents are usually unstructured, casually written, with plenty of grammar and spelling mistakes, as mentioned in section 1.2, misspelling correction is also an essential step in removing noisy terms, which could significantly help feature selection. Our work in this field will be presented in detail in Chapter 3.
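The frequency-based stop word heuristic mentioned in the term selection stage can be illustrated with a minimal sketch. It assumes document frequency as the frequency measure, and the thresholds `low` and `high_ratio`, the function names, and the toy documents are hypothetical choices for illustration, not values taken from this dissertation.

```python
from collections import Counter


def frequency_stop_words(documents, low=1, high_ratio=0.8):
    """Flag terms whose document frequency is very low or very high.

    `low` and `high_ratio` are illustrative thresholds (assumptions).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document.
        doc_freq.update(set(doc.lower().split()))
    return {
        term
        for term, df in doc_freq.items()
        if df <= low or df / n_docs >= high_ratio
    }


def select_terms(documents, stop_words):
    """Keep only the tokens that are not flagged as stop words."""
    return [
        [w for w in doc.lower().split() if w not in stop_words]
        for doc in documents
    ]


docs = [
    "the engine makes a noise",
    "the brake pedal is soft",
    "the engine stalls at idle",
    "the brake light is on",
]
stop_words = frequency_stop_words(docs)
indexed = select_terms(docs, stop_words)
```

With such aggressive thresholds on a tiny collection most terms are filtered out, which also illustrates why frequency cut-offs need to be tuned to the corpus size.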

The second stage is to construct a term-document (TD) matrix using indexed terms. Each entry in the TD matrix indicates how many times one term occurs in one document. More specifically, a document collection containing a total number of N documents identified by K terms is represented as a K × N TD matrix. With the purpose of measuring how well each term describes the document contents, each entry of the TD matrix is represented by a local weight that is typically the occurrence frequency of an individual term in one document [67]. Note here, each document is represented as a row vector in the TD matrix. An example of a TD matrix generated from three documents is illustrated in Fig. 4. To simplify the problem, all the words within these three documents are selected as indexed terms.

Fig. 4. Example of TD matrix generation

Research work has shown that using only the local weight is insufficient to evaluate the importance of indexed terms [68]. For instance, some terms, due to their rare appearance in a few documents, do a better job of discriminating these documents from others. Some terms, on the contrary, appear too frequently in the whole collection to distinguish documents in different categories. As a result, after the TD matrix is generated, each entry needs to be weighted using a global weight, which is used to reflect the overall importance of the index term in the whole document collection. The idea is that a term occurring rarely should have a high global weight and frequently occurring terms should be weighted low. Several well-known global weights are introduced in [69]:

Normal: $\sqrt{1 / \sum_j tf_{ij}^2}$
GfIdf: $gf_i / df_i$
Idf: $\log_2(ndocs / df_i) + 1$
1-Entropy or Noise: $1 + \sum_j \frac{p_{ij} \log p_{ij}}{\log(ndocs)}$, where $p_{ij} = tf_{ij} / gf_i$

$tf_{ij}$: occurrence frequency of term i within document j;
$df_i$: document frequency, which is the total number of documents in the document collection that contain term i;
$gf_i$: global frequency at which term i occurs in the entire document collection;
$ndocs$: total number of documents in the whole document collection.

Generally speaking, each entry $w_{i,j}$ of the TD matrix is assigned a two-part value, $w_{i,j} = L_{ij} \times GW_i$, where $L_{ij}$ is the local weight and $GW_i$ is the term's global weight. The most commonly used term weighting scheme is tf-idf (term frequency-inverse document frequency) [67]. It assigns a high degree of importance to terms occurring rarely in a document collection. From the above discussion, when we take a close look at those global weighting schemes, it is clear that they are all focused on the entire document collection. Based on our observation, important term words or their synonyms appear frequently in documents in a specific category, especially when the user-defined category is determined by some specific keywords [4]. Therefore, we designed a new global weighting scheme for the generation of VSM models, which will be discussed in detail in Chapter 4.
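As a minimal sketch of the indexing and weighting stages, the following code builds a TD matrix with raw term frequencies as local weights and multiplies each entry by the Idf global weight listed above. It assumes whitespace tokenization and stores documents as rows; the function names and the toy documents are hypothetical and only serve as an illustration.

```python
import math
from collections import Counter


def build_td_matrix(documents):
    """Return the sorted index terms and a matrix of raw term frequencies.

    Rows correspond to documents, columns to index terms (an assumption
    made for this sketch).
    """
    terms = sorted({w for doc in documents for w in doc.split()})
    counts = [Counter(doc.split()) for doc in documents]
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


def idf_weights(matrix):
    """Idf global weight per term: log2(ndocs / df_i) + 1."""
    ndocs = len(matrix)
    weights = []
    for i in range(len(matrix[0])):
        df = sum(1 for row in matrix if row[i] > 0)
        weights.append(math.log2(ndocs / df) + 1)
    return weights


def apply_weights(matrix, global_weights):
    """Entry-wise product of local weight (tf) and global weight."""
    return [[tf * gw for tf, gw in zip(row, global_weights)] for row in matrix]


docs = ["engine noise engine", "brake noise", "brake pedal soft"]
terms, td = build_td_matrix(docs)
gw = idf_weights(td)
weighted = apply_weights(td, gw)
```

In this toy collection "engine" occurs in a single document and therefore receives a larger global weight than "noise", which occurs in two of the three documents.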

The advantage of using the VSM to represent text is that, instead of fully understanding the content of a document, the VSM simply generates a logical view of the document. For applications with short, poorly structured text documents or documents that have grammar errors or spelling errors, VSM simplifies the procedure of natural language processing and eases the system implementation. These simplifications have proved effective in many research results [2,4,11,13,68]. This is the major reason that our work on text categorization also uses VSM as the fundamental text representation approach. However, VSM also has several limitations [70], such as poor representation of long documents, ignoring the semantic relationship between documents with similar context or synonyms, informal weighting schemes, etc. As a result, this dissertation presents our effort in improving the conventional VSM. Considering the fact that the traditional VSM can be inaccurate when a given word is expressed in many ways (synonymy) or a word has multiple meanings (polysemy) in different contexts, approaches that dig out underlying semantic relationships between words are required in many text mining applications. Latent semantic indexing (LSI) is one of the earliest techniques that try to solve the above problem.

Latent Semantic Indexing (LSI)

The LSI algorithm, also called latent semantic analysis (LSA), was first introduced in 1988 for natural language processing [71]. It is a variant of the VSM that allows a low-rank approximation to the original TD matrix. In LSI each document is mapped into a lower dimensional space by decomposition of the TD matrix. The assumption made by LSI is that some latent semantic structure exists

according to the overall pattern of term occurrence. The low-rank approximation aims at merging the dimensions associated with similar terms, as well as enhancing or eliminating polysemy relationships based on the right word meaning used in the context. The lower dimensional space is built using singular value decomposition (SVD), which is associated with the latent semantic structure [72]. The major steps of LSI are presented as follows. Given an $m \times n$ matrix $A$, it is decomposed into the product of three matrices by using the SVD method:

$$A = U \Sigma V^T,$$

where $U^T U = V^T V = I_n$, $I_n$ is a size n identity matrix, $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$, $\sigma_i > 0$ for $1 \le i \le r$, and $\sigma_j = 0$ for $r + 1 \le j \le n$, $r$ is the rank of $A$, $U$ and $V$ contain the left and right singular vectors of $A$, respectively, and the diagonal matrix $\Sigma$ contains the singular values of $A$. Then we get new matrices $U_k$, $\Sigma_k$, $V_k$ by keeping only the k largest singular values of $\Sigma$, and a rank-k approximation matrix to $A$ is constructed with the following formula:

$$A \approx A_k = U_k \Sigma_k V_k^T.$$

The mathematical representation of SVD is shown in Fig. 5.

[Figure: $A_k = U_k \Sigma_k V_k^T$ for an $m \times n$ matrix, with the term vectors in $U_k$ and the document vectors in $V_k^T$]

Fig. 5. Mathematical Representation of Singular Value Decomposition
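The rank-k approximation can be sketched with NumPy's `numpy.linalg.svd`. This is only an illustration of the decomposition step under stated assumptions: the small matrix `A` is made-up data, and `rank_k_approximation` is a hypothetical helper name, not part of the system described in this dissertation.

```python
import numpy as np


def rank_k_approximation(A, k):
    """Keep the k largest singular values of A and rebuild A_k."""
    # full_matrices=False returns the reduced ("economy") decomposition.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    A_k = U_k @ np.diag(s_k) @ Vt_k
    return U_k, s_k, Vt_k, A_k


# Toy 5 x 4 matrix of term frequencies (illustrative values only).
A = np.array(
    [
        [2, 0, 1, 0],
        [0, 1, 0, 1],
        [1, 0, 1, 0],
        [0, 2, 0, 1],
        [0, 0, 1, 1],
    ],
    dtype=float,
)
U_k, s_k, Vt_k, A_k = rank_k_approximation(A, k=2)
```

The Frobenius norm of `A - A_k` equals the square root of the sum of the squared discarded singular values, which gives a simple way to judge how much information a given k discards.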

The above SVD method attempts to capture the underlying semantic structure in the association of terms and documents, as well as reducing feature dimensions by defining a value k much smaller than the number of indexed terms. It greatly reduces the memory requirement and the computing time of measuring the similarity between an unseen document and a known document. However, in this dissertation, we do not use this approach to capture the semantic relationship in text, due to the following drawbacks of LSI:

The core technique of LSI, SVD, is a mathematical method; therefore the resulting singular values are not interpretable. SVD is very sensitive to changes in the data. Sometimes the low-rank approximation has negative values which are meaningless [72].
LSI requires relatively high computational performance and memory in comparison to other information retrieval techniques. Without a distributed implementation it is not applicable to large-scale document corpora [73].
It still remains to be verified experimentally whether LSI outperforms the VSM in text mining tasks such as text categorization. The concept in the LSI reduced-dimension space is assumed to be a weighted average of multiple meanings, while losing single term information and some real meaning information [68]. In such cases, LSI might exhibit a less satisfying classifying ability.

Statistical topic models

We now move from LSI to the discussion of statistical topic models, which are derived from LSI but interpret LSI from a statistical point of view to provide a better understanding of text

data. This is a major focus of improving the conventional VSM in this dissertation, since it provides a solid statistical foundation for finding the hidden semantic structure of text documents.

Probabilistic Latent Semantic Analysis (PLSA)

Probabilistic latent semantic analysis (PLSA) is a well-known statistical topic model for text clustering and information retrieval [74]. It represents a document with a mixture distribution over latent topics, which are characterized by a distribution over the indexed terms. The latent topics provide a reduced dimension representation of documents in a given collection. It is a statistical variant of LSI developed based on a statistical generative model called the Aspect Model [74]. The starting point of PLSA is the term-document frequency (TDF) matrix, and it follows the bag-of-words assumption, in which each word appears independently, and the occurring order of each word is not considered. Fig. 6 shows the graphical model representation of PLSA, based on Bayesian Networks [109].

Fig. 6. Graphical model representation of PLSA

In the above graphical model, the solid circles $d$ and $w$ represent a document and a term that are observed, respectively. The PLSA model is a generative model that assumes there is a latent topic variable $z$ between documents and terms. The two rectangles, marked by K and N, represent the number of sample words and documents observed, respectively. $P(d)$, $P(z \mid d)$, $P(w \mid z)$

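represent the probabilities of observing a document $d$, a latent topic $z$ occurring in $d$, and word $w$ belonging to $z$, respectively. The typical approach of PLSA modeling is to estimate the probability functions $P(z \mid d)$ and $P(w \mid z)$, for all document-topic and term-topic pairs, $(d, z)$ and $(w, z)$, through machine learning. For each document-topic pair $(d, z)$ and term-topic pair $(w, z)$, we attempt to find the values of $P(z \mid d)$ and $P(w \mid z)$ that maximize the following log likelihood objective function:

$$L = \sum_{d} \sum_{w} n(d, w) \log P(d, w), \quad (1)$$

where $n(d, w)$ denotes the frequency with which term $w$ appears in document $d$, and $P(d, w) = P(d) \sum_{z} P(w \mid z) P(z \mid d)$. The variables $P(z \mid d)$ and $P(w \mid z)$ are what we are interested in and want to estimate. Since $P(d)$ is not related to the parameters we want to estimate and we assume that it is constant among documents in TD, we then have:

$$\arg\max L = \arg\max \sum_{d} \sum_{w} n(d, w) \log \sum_{z} P(w \mid z) P(z \mid d). \quad (2)$$

We will use the well-known Expectation Maximization (EM) algorithm [75] to solve this maximum likelihood estimation problem. Each iteration of the EM algorithm consists of an expectation step (E-step) and a maximization step (M-step). In the E-step, based on the current estimates of $P(z \mid d)$ and $P(w \mid z)$, the posterior probability of $z$, $P(z \mid d, w)$, is computed for each document-word pair. In the M-step, $P(z \mid d)$ and $P(w \mid z)$ are updated by maximizing equation (2). This is an unsupervised machine learning process, and the detailed steps of the EM algorithm will be discussed in Chapter 4.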
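To make the E-step and M-step concrete, the following is a minimal sketch of the textbook EM updates for PLSA written with NumPy. It is not the algorithm detailed in Chapter 4: the random initialization, the fixed iteration count, the small constant added to denominators, and the toy count matrix are all assumptions made for illustration.

```python
import numpy as np


def plsa(n_dw, n_topics, n_iter=50, seed=0):
    """Estimate P(z|d) and P(w|z) from a document-by-word count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = n_dw.shape
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w), shape (docs, topics, words).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M-step: re-estimate both distributions from expected counts.
        expected = n_dw[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z


def log_likelihood(n_dw, p_z_d, p_w_z):
    """Objective (2): sum over d, w of n(d,w) * log sum_z P(w|z) P(z|d)."""
    p_w_d = p_z_d @ p_w_z
    return float(np.sum(n_dw * np.log(p_w_d + 1e-12)))


# Toy counts: 4 documents x 5 terms (illustrative values only).
counts = np.array(
    [[3, 2, 0, 0, 1], [2, 3, 1, 0, 0], [0, 0, 3, 2, 2], [0, 1, 2, 3, 1]],
    dtype=float,
)
p_z_d, p_w_z = plsa(counts, n_topics=2)
```

Because EM does not decrease the likelihood, comparing `log_likelihood` after one iteration and after many iterations with the same seed is a quick sanity check on an implementation.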

Latent Dirichlet Allocation (LDA)

Similar to PLSA, Latent Dirichlet Allocation (LDA) is a generative probabilistic model for text collections [78]. In LDA, each document is modeled as a finite mixture over a set of latent topics. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words [78]. LDA assumes the following generative process for each document $d$ in a corpus TD:

1. Choose $K \sim \mathrm{Poisson}(\xi)$ for each of the N documents in TD.
2. Choose $\theta \sim \mathrm{Dir}(\alpha)$.
3. For each of the K words $w_n$ in $d$:
   a. Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
   b. Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on $z_n$.

There are several assumptions to be made in this model. First of all, the latent topic dimensionality G of the Dirichlet distribution is known and fixed. Secondly, the word probabilities are determined by a G × K matrix $\beta$, where $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$. Finally, K is independent of all the other data generating variables $\theta$ and $z$ [78]. The G-dimensional Dirichlet random variable $\theta$ has the following probability density function:

$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{G} \alpha_i\right)}{\prod_{i=1}^{G} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_G^{\alpha_G - 1}$$

where $\alpha$ is a size G vector with $\alpha_i > 0$, and $\Gamma(x)$ is the Gamma function. Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of K topics $z$, and a set of K words $w$ is given by:

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{K} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)$$

Therefore, the LDA model is represented as a probabilistic graphical model in Fig. 7. LDA is a three-level hierarchical Bayesian model.

Figure 7. Graphical model representation of LDA

Although LDA is claimed to overcome some shortcomings of PLSA, such as reducing the number of parameters being estimated, and treating the topic mixture weights as a hidden random variable rather than a large set of individual parameters linked to the training set [78], in this dissertation we mainly focus on extending the basic PLSA model rather than applying LDA, for the following reasons:

PLSA approximation is based on maximum likelihood estimation, while LDA is based on Bayesian estimation, using both prior knowledge and available data [77]. When the data size is large, their performance tends to be very similar [76].

Considering the complexity of variational inference for LDA, PLSA is much easier to implement and to extend in a semi-supervised manner. We would rather focus on how to combine a statistical topic model with supervised information from the training set as well as external ontology resources than find out which model or estimation method is better.
With parameters partially fixed after training, PLSA can also be applied to process previously unseen documents, which makes text categorization possible. This will be further discussed in Chapter 4.

The following section mainly discusses the research work that has been done in the field of text categorization, which is one of the major focuses of this dissertation.
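The LDA generative process described earlier in this section can be made concrete with a short sampling sketch. It assumes NumPy's random `Generator` API, and every parameter value below (`alpha`, `beta`, `xi`) is made up for illustration; the sketch only samples from the model and does not perform inference.

```python
import numpy as np


def generate_document(alpha, beta, xi, rng):
    """Sample one document following the LDA generative process.

    alpha: Dirichlet parameter, one entry per topic.
    beta:  topics-by-vocabulary matrix of word probabilities.
    xi:    Poisson mean for the document length.
    """
    n_words = rng.poisson(xi)                   # document length
    theta = rng.dirichlet(alpha)                # topic mixture
    topics = rng.choice(len(alpha), size=n_words, p=theta)
    words = np.array(
        [rng.choice(beta.shape[1], p=beta[z]) for z in topics], dtype=int
    )
    return theta, topics, words


rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])
beta = np.array(
    [
        [0.7, 0.1, 0.1, 0.1],
        [0.1, 0.7, 0.1, 0.1],
        [0.1, 0.1, 0.4, 0.4],
    ]
)
theta, topics, words = generate_document(alpha, beta, xi=8, rng=rng)
```

Each returned word is an index into the vocabulary, and `topics[n]` records which latent topic generated `words[n]`.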

2.5 Text categorization

Text categorization, as introduced in section 1.1, is one of the most popular tasks in the text mining field nowadays, due to the increased availability of documents in digital form and the ensuing need to organize and differentiate them for further analysis. Mathematically, as described in [79], if TD is a set of N documents, $TD = \{d_1, d_2, \ldots, d_N\}$, and C is a set of predefined categories, $C = \{c_1, c_2, \ldots, c_M\}$, the task is to approximate the classifier that maps each $d_i \in TD$ to a $c_j \in C$, so that the estimated target mapping function $\hat{\Phi}: TD \to C$ coincides with the real mapping function $\Phi: TD \to C$ as much as possible. Document categories defined by users always vary with application requirements, such as the topic a news article discusses, the importance of a vehicle diagnostic record that describes vehicle repair details [80], whether or not a medical diagnostic document states that an injury condition is sustained [4], etc. One way to solve the text categorization problem is to use human experts to manually classify documents. Of course, this approach is costly and time-consuming. In the research community today, the dominant approach to this problem is based on machine learning techniques, which have proved to be effective in analyzing large amounts of data and offer straightforward portability to different application domains [81].

Text categorization based on machine learning algorithms

The machine learning approach to text categorization has gained popularity since the early 1990s. In this approach, a general inductive process is adopted to develop a classifier that classifies previously unseen documents by gleaning the characteristics of available training documents. There is already plenty of research work that focuses on developing and improving machine learning techniques for building classification models, which is generally reviewed in [76,81]. A list of major techniques that have been applied in the text categorization literature is presented as follows:

Classifiers based on document clustering algorithms such as k-means clustering [82], hierarchical clustering [83], self-organizing maps [4,84], etc.
Example-based classifiers such as k-nearest-neighbor classifiers [92,93] that classify unseen data by finding the closest data samples in the training set using similarity measures.
Probabilistic classifiers such as Naïve Bayes classifiers [85,86] that measure the probability that a sample belongs to a certain category.
Decision tree classifiers, which are a hierarchical decomposition of the training data space. At each node of the tree, the attribute that most effectively splits data samples into respective subsets is selected, based on the information gain ratio. The splitting is recursively conducted until the leaf nodes contain a certain minimum number of samples, or some conditions on class purity are met [76].
Classifiers that are derived from regression-related algorithms, such as the Linear Least Squares Fit (LLSF) method [87], the logistic regression classifier [88], and the neural network classifier, which is based on logistic regression and is usually considered a nonlinear combination of a number of logistic regression classifiers [89,90,91].

Linear classifiers such as support vector machines (SVM) [94,95], which are a type of classifier that attempts to determine good linear separators among categories. SVM was first proposed by Vladimir Vapnik in 1979, but did not receive much attention until the late 1990s. The basic idea is to find a separating hyperplane for the data samples so that the normal distance of any of the data points from the hyperplane is the largest. Fig. 8 shows an example of the 2-dimensional case for SVM classifier learning. The crosses and circles represent training examples in different categories, whereas lines represent decision surfaces, and the thicker line represents the best separating hyperplane, since the distance from it to the nearest data point is maximized. Small squares indicate the support vectors, which are training samples lying on the maximum margin surface [81].

Figure 8. Example of SVM classifier learning in the 2-dimensional case

In this dissertation, we consistently use SVM as our text categorization classifier throughout the different experiments, for the following major reasons:

SVM provides much more robust performance as compared to many other machine learning techniques such as rule-based classifiers and decision trees [97].

SVM is quite robust to high dimensionality. It is ideally suited for text categorization because of the sparse high-dimensional nature of text [96].
Term selection is often not needed in SVM classification, as SVMs tend to be fairly robust to over-fitting and can scale up to considerable dimensionalities [76].

Whatever machine learning techniques are used to build text classifiers, the classification accuracy will be bottlenecked if the representation quality of the document is poor, i.e., the representation of a document does not reflect a close relationship with its assigned category. Furthermore, according to [98], research has shown that the sophistication of feature selection in text categorization is more important than choosing the best classifier. Therefore, our research focus in this dissertation on the text categorization task is on improving text representation and combining it with a consistent and promising machine learning approach for evaluation, rather than on choosing and improving the classifier itself.

Text categorization based on statistical topic models

Statistical topic models, including PLSA, LDA, etc., provide a solid probabilistic foundation for document modeling and representation, in terms of digging out latent semantic structures from text. In the literature, these models are mainly used for unsupervised text clustering, information retrieval and dimension reduction, and research work utilizing them for text categorization is still rare. Most of the work done on applying topic models to text categorization purely uses estimated latent features for classification, such as [99,100,101,102], which is claimed in [103] to be less accurate than using bag-of-words (BOW) features, especially when the training data size is large. The major limitation of the above work is

that text categorization requires a sophisticated feature selection approach that generates a well-developed text representation. Using semantic or syntactic structure learned from text is not sufficient to fully represent the content of the document, and different application requirements for text categorization make it essential to develop a hybrid classifier with enriched features from multiple resources such as single words, relationships between words, semantic structure, word context, etc.

Text categorization based on ontology network

Considering the necessity of improving BOW features for more accurate text categorization, as mentioned in Section 1.2, researchers have been working on utilizing external or background knowledge to help build text classifiers [104,105]. This external knowledge has a great advantage in helping extract semantic relationships, match important phrases, strengthen co-occurrences, etc. As discussed in section 2.3, WordNet is one of the best known sources of external knowledge used for text categorization. It is a large lexical database of English first developed and maintained by Princeton University [49]. The main relation among words in WordNet is synonymy, e.g., as between the words shut and close or car and automobile. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept, together with short explanations and general definitions. For the purposes of text categorization, it has been successfully used to unify the vocabulary across documents by modifying the document features with related words. The key rules defined in the most well-known and widely used approach in [46] are listed as follows:

Add rules: extend each row vector $W_d$ in the TD matrix (defined in the discussion of the Vector Space Model) by adding new entries for WordNet concepts c appearing in the document set. $W_d$ is replaced by

the concatenation of $W_d$ and $C_d$, where $C_d = [cf(d, c_1), cf(d, c_2), \ldots, cf(d, c_V)]$, $cf(d, c)$ denotes the frequency with which a concept c appears in document d, and V denotes the total number of concepts generated.
Replace: Each row vector $W_d$ in the TD matrix is replaced by the concatenation of $W'_d$ and $C_d$, where $W'_d$ denotes the new vector after removing all term entries from $W_d$ that appear in WordNet. Terms that do not appear in WordNet are not discarded.
Concept only: remove all terms from the vector representation that do not appear in WordNet. Only $C_d$ is used to represent document d.

The above rules have plenty of room for improvement. Several open issues include:

Which rule or combination of rules should be selected to apply to the document features?
What value should be assigned to the concept features added to the TD matrix after the global weighting scheme for indexed terms is applied?
How to determine the scope of a synset with multiple meanings, for a word with multiple word categories in a document?
How to make full use of word relationships other than synonymy, e.g., hypernym/hyponym?

These problems are important indications that text categorization using ontology networks is worth further investigation and refinement. Therefore, it is one of the major focuses of this dissertation, and will be discussed in detail in Chapter 4, with proposed solutions.
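The three feature rules can be sketched on a single tokenized document. The `CONCEPTS` dictionary below is a hypothetical stand-in for a WordNet synset lookup (it is not the WordNet API), and the `C:` prefix for concept features as well as the function name are illustrative assumptions.

```python
from collections import Counter

# Hypothetical term-to-concept lookup standing in for WordNet synsets.
CONCEPTS = {
    "car": "motor_vehicle",
    "automobile": "motor_vehicle",
    "shut": "close",
    "close": "close",
}


def concept_features(tokens, strategy="add"):
    """Build a feature dictionary under the add, replace, or concept-only rule."""
    term_counts = Counter(tokens)
    concept_counts = Counter(CONCEPTS[t] for t in tokens if t in CONCEPTS)
    concepts = {f"C:{c}": n for c, n in concept_counts.items()}
    if strategy == "add":
        terms = dict(term_counts)  # keep every term, append concepts
    elif strategy == "replace":
        # drop terms covered by the lookup, keep the uncovered ones
        terms = {t: n for t, n in term_counts.items() if t not in CONCEPTS}
    elif strategy == "concept_only":
        terms = {}  # concept frequencies only
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return {**terms, **concepts}


tokens = ["car", "automobile", "shut", "door"]
added = concept_features(tokens, "add")
replaced = concept_features(tokens, "replace")
only_concepts = concept_features(tokens, "concept_only")
```

Under the replace rule the synonyms "car" and "automobile" collapse into one concept feature with frequency 2, which is the vocabulary-unifying effect described above, while the uncovered term "door" is retained.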

CHAPTER 3. AUTOMATIC TYPO CORRECTION USING MACHINE LEARNING AND EXTERNAL KNOWLEDGE BASES

In this chapter, we present our research in typo correction for processing unstructured documents with a focus on three types of non-word typos: word boundary errors, self-invented abbreviations, and ambiguous acronyms. We present an innovative automatic typo correction system, ITDC (Intelligent Typo Detection and Correction), that uses hybrid knowledge, i.e. general language lexicon and domain-specific knowledge extracted through machine learning. We developed algorithms for misspelling detection and correction candidate generation, approximate word matching, topographically similar word grouping, extracting contextual knowledge from general and application domains, and candidate ranking based on machine learning and statistical analysis. The corrected text documents can then be used for further text processing, such as text document categorization and text document retrieval.

3.1 Machine learning algorithms for extracting knowledge for typo correction

Automatic typo detection and correction is a challenging problem, in particular in unstructured text documents that contain acronyms, abbreviations and symbols that are specific to application domains. We propose to use a hybrid of knowledge: general language-specific typo knowledge and domain-specific knowledge. Two general language knowledge bases, GTKB1 (General Typo Knowledge Base) and GTKB2, are important for typo detection and correction. GTKB1 is a valid word lexicon that can be automatically generated from online English dictionaries such as WinEdt English_US and/or English_UK [20]. GTKB1 can be used for typo detection and for generating typo correction candidates. The most challenging part in typo processing is to correct detected typos, which can be either typographical errors or non-word terms that were used as abbreviations or symbols meaningful only to a specific application domain. To correct typographical errors, a general language knowledge base of commonly misspelled words, GTKB2, can be automatically generated from internet sources including the Wikipedia Common Typo List, the Oxford English Corpus Misspelling List, and the Student's Book of College English Misspelling List [39-41]. The focus of our research is to explore the use of machine learning technologies to extract domain-specific knowledge that is useful for correcting challenging typos, such as word boundary errors and self-invented or domain-specific acronyms and abbreviations. Fig. 9 gives a summary of the knowledge useful for typo correction. In this section we introduce the machine learning algorithms developed to extract domain-specific typo

knowledge from a text corpus collected within a specific domain, including true words for the acronyms and abbreviations, topographically similar words/typos groups, and typos due to wrong word boundaries.

Fig. 9. Extracting knowledge for typo detection and correction

Extracting knowledge of domain-specific terms and acronyms

We propose to build an enhanced domain-specific dictionary that can be used to identify valid terms commonly used in an application domain, and acronyms that are valid and represent words meaningful within an application domain. For example, in automotive diagnostic applications, NPF is a common acronym known as No Problem Found, and CAN is known as Controller Area Network, which can be ambiguously interpreted. From a given training data set we first generate a list of valid words and their frequencies of occurrence in the training data set, which is denoted as $DB_1$. We developed a semi-automatic process to extract a list of acronyms from a given corpus of training documents collected from a specific application domain. It first extracts all typos with short lengths from the training

documents. For each of these typos, we search for the valid phrases in the training set that can form the typo with their initials. For example, for the typo NPF, we searched the training documents and found the phrase No Problem Found occurring 5 times, Noise Possibly From 5 times, and No Put Fuel 3 times. The extracted phrases are then ranked according to their frequencies of occurrence. The phrase with the highest frequency of occurrence is assigned to the typo as the correction candidate. If there is a tie, such as in the example given above, a domain expert manually identifies the best correction candidate. This knowledge base is denoted as $DB_2$:

$$DB_2 = \{(ac_1, ext\_ac_1), (ac_2, ext\_ac_2), \ldots, (ac_H, ext\_ac_H)\}, \quad (1)$$

where $ac_i$ denotes an acronym and $ext\_ac_i$ denotes the correct phrase that $ac_i$ represents.

Building a lexicon of similar typos and domain-specific abbreviations

One of the characteristics of unstructured text in a specific application domain, for instance, vehicle diagnostic text records, is that it contains many abbreviations and casual word patterns that cannot be easily corrected by only using common English dictionaries. For example, wheel -> whl, check -> chk, diagnose -> diagn, etc. As a result, it is necessary to build a domain-specific lexicon [22] to correct these word errors. The following is an algorithm we developed to build a domain-specific lexicon, which is represented in groups of words and typos such that the words and typos in each group share the same stem word. For example, customer, customers and their misspelled variants all share the same stem word custom. The algorithm uses four similarity measures to detect four basic types of typing/spelling errors: deletion (Type 1), insertion (Type 2), substitution (Type 3) and transposition (Type 4). Research

showed that more than 80% of typos belong to one of these four error types [38]. Suppose A and B are two words, $A = A_1 A_2 \ldots A_n$ is a typo and $B = B_1 B_2 \ldots B_m$ is a valid word, where $A_i$ and $B_j$, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m$, are letters within the alphabet. If A is formed by deleting one or more letters from B, such as A = ENINE and B = ENGINE, then A belongs to the Type 1 topographical spelling error. If A is formed by inserting one or more letters into B, such as A = FOUOND and B = FOUND, then A belongs to Type 2. If A is formed by substituting one or more letters with wrong letters, such as A = STSTEM and B = SYSTEM, then A belongs to Type 3. If A is formed by transposing two letters in B, such as A = PERFROM and B = PERFORM, then A belongs to Type 4. Fig. 10 illustrates the four types of misspelled words.

Fig. 10. Four types of topographical spelling errors

We designed the following algorithm to calculate the distance between two words, A and B, based on the above four types of spelling error.

Similarity measure calculated based on Type 1 spelling errors:
1 Let $s_1 = 0$. Starting from the first letter of A and B, x = 1, y = 1.
2 Compare $A_x$ and $B_y$:
2.1 If $A_x = B_y$, then $s_1 = s_1 + 1$, and x = x + 1, y = y + 1.
2.2 If $A_x \ne B_y$, then go to the next letter of B, y = y + 1.
2.3 If $x \le n$ and $y \le m$, go to step 2.1, otherwise exit.
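The Type 1 (deletion) similarity procedure above translates into a short loop. The sketch below uses 0-based indices instead of the 1-based indices of the description, and the function name `type1_similarity` is a hypothetical choice for illustration.

```python
def type1_similarity(typo, word):
    """Count letters of `typo` matched in order against `word`.

    On a mismatch only the position in `word` advances, which models
    letters that were deleted from `word` to form `typo`.
    """
    s1 = 0
    x = y = 0
    while x < len(typo) and y < len(word):
        if typo[x] == word[y]:
            s1 += 1
            x += 1
            y += 1
        else:
            y += 1  # skip the letter of `word` missing from `typo`
    return s1


score = type1_similarity("ENINE", "ENGINE")
```

For the deletion example ENINE against ENGINE, all five letters of the typo are matched, so the measure equals the length of the typo.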

Similarity measure calculated based on Type 2 spelling errors:
1 Let $s_2 = 0$. Starting from the first letter of A and B, x = 1, y = 1.
2 Compare $A_x$ and $B_y$:
2.1 If $A_x = B_y$, then $s_2 = s_2 + 1$, and x = x + 1, y = y + 1.
2.2 If $A_x \ne B_y$, then go to the next letter of A, x = x + 1.
2.3 If $x \le n$ and $y \le m$, go to step 2.1, otherwise exit.

Similarity measure calculated based on Type 3 spelling errors:
1 Let $s_3 = 0$. Starting from the first letter of A and B, x = 1, y = 1.
2 Compare $A_x$ and $B_y$:
2.1 If $A_x = B_y$, then $s_3 = s_3 + 1$, and x = x + 1, y = y + 1.
2.2 If $A_x \ne B_y$, then go to the next letter of both A and B, x = x + 1, y = y + 1.
2.3 If $x \le n$ and $y \le m$, go to step 2.1, otherwise exit.

Similarity measure calculated based on Type 4 spelling errors:
1 Let $s_4 = 0$. Starting from the first letter of A and B, x = 1, y = 1.
2 Compare $A_x$ and $B_y$:
2.1 If $A_x = B_y$, then $s_4 = s_4 + 1$, and x = x + 1, y = y + 1.
2.2 If $A_x \ne B_y$, $A_{x+1} \ne B_{y+1}$ and $A_x = B_{y+1}$, $A_{x+1} = B_y$, then $s_4 = s_4 + 1$, and x = x + 1, y = y + 1.
2.3 Else if $A_x \ne B_y$, then go to the next letters in both A and B, x = x + 1, y = y + 1.
2.4 If $x \le n$ and $y \le m$, go to step 2.1, otherwise exit.

The distance between two terms A and B is then calculated as:

$Dis(A, B) = 1 - \frac{s}{\max(|A|, |B|)}$,  (2)

where $s = \max_{i = 1, 2, 3, 4} s_i$. For example, for terms PEEFROM and PERFORM, we have similarity measures $s_1$, $s_2$, $s_3$ and $s_4$ equal to 2, 3, 4 and 5, respectively. As a result, $s = 5$, which means the best matching case of typo PEEFROM is type 4.

The term distance measure function, Dis, is used in the following algorithm for grouping similar terms, so that typographically similar terms, which can be valid words or typos, are placed in the same term group with the same correction candidate.

Typo correction algorithm based on grouping similar terms
Let D be a set of training documents.
1 Extract a list of distinct terms from D, denoted as T.
2 Generate groups of term words in T that are topographically similar. The following steps find the topographically similar terms in T and generate groups of similar terms, $g_1, g_2, \ldots, g_K$.
2.1 Let $j = 1$, $k = 1$.
2.2 Create a new group $g_k$. Take the term $t_j$ from term list T, and add it into group $g_k$.
2.3 Increment j. If $j > |T|$, go to step 3.
2.4 Take the term $t_j$ from term list T.
2.5 If $length(t_j) \le l$, where l is a threshold used for filtering out short terms, go to step 2.3.
2.6 Calculate $\delta_i$, the average distance between $t_j$ and each word $t_q$ in each existing term

group $g_i$:

$\delta_i = \frac{1}{Q_i} \sum_{q=1}^{Q_i} Dis(t_j, t_q)$, $t_q \in g_i$, $q = 1, 2, \ldots, Q_i$; $i = 1, 2, \ldots, \eta$,  (3)

where η is the number of the currently existing groups, and $Q_i$ denotes the number of terms in group $g_i$.
2.7 Find the closest group $g_C$ to $t_j$, i.e. $\delta_C \le \delta_i$ for all i, $1 \le i \le \eta$.
2.8 If $\delta_C \le \theta$, where $\theta$ is the maximum distance allowed between the terms of the same term group, add $t_j$ into $g_C$, and go to step 2.3. Otherwise, increment k, add $t_j$ to $g_k$ and go to step 2.3.
3 Assign a correct word to each group.
3.1 Let $i = 1$.
3.2 If $i > k$, output

$DB_3 = \{ (g_1, g\_w_1), (g_2, g\_w_2), \ldots, (g_K, g\_w_K) \}$  (4)

and then exit.
3.3 If there are valid words in $g_i$, find the valid word in $g_i$ that has the highest frequency of occurrences in the training data, and denote the word as $g\_w_i$. Add $(g_i, g\_w_i)$ to $DB_3$, and go to step 3.5.
3.4 Find the valid word in T that has the closest distance to each term in $g_i$, assign the word to $g\_w_i$, and add $(g_i, g\_w_i)$ to $DB_3$.
3.5 Increment i and go to step 3.2.

Table 1 shows two word/typo groups generated by the above algorithm from a set of vehicle fault diagnostic text documents. For instance, topographically similar terms such as

ENGI, ENGNIE are grouped together and labeled as the most frequently occurring valid word ENGINE.

Table 1 EXAMPLE OF SIMILAR TERM GROUPS
Group sample | Elements | Label
1 | ACCELERATION, ACCELLERATE, ACCELLERATING, ACCELLERATION, ACCELLING, ACCELING, ACCELORATION, ACCELRATE, ACCELRATION | ACCELERATE
2 | ENGI, ENGIEN, ENGIN, ENGINE, ENGINES, ENGING, ENGINR, ENGNIE, ENGNINE, ENIGINE, ENIGNE, ENINE | ENGINE

3.1.3 Extracting contextual knowledge

Contextual information such as n-grams has been found useful in detecting real-word errors [24, 25]. An n-gram is defined as a sequence of n adjacent words from a given text. A 1-gram is referred to as a "unigram"; a 2-gram is a "bigram"; a 3-gram is a "trigram". Google has been using n-gram word models in a variety of projects, including statistical machine translation, misspelling correction, entity detection, information retrieval, etc. We developed algorithms to extract two types of contextual knowledge and use the knowledge to rank the correction candidates for a given typo. The first type is generated based on the Google book n-gram corpus [23], which provides the frequencies of words and phrases appearing in American English books published from the 1500s to the 2000s. The second type of contextual knowledge is represented by the

probabilities of n-grams occurring in the training documents.

Algorithm for extracting contextual knowledge from Google book n-gram
1 Detect and extract all typos from the training document set D through dictionary search, and store them in a list, D_typos.
2 For each typo t in D_typos:
2.1 For n = 3 to G, extract three types of n-grams, prefix n-grams, suffix n-grams and centered n-grams, from training data D:
$ngram_{pre}(t) = (R_{n-1}, t)$, such that $R_{n-1}\, t \in D$, where $R_{n-1}$ represents the n-1 words that immediately precede t.
$ngram_{suf}(t) = (t, S_{n-1})$, such that $t\, S_{n-1} \in D$, where $S_{n-1}$ represents the n-1 words that immediately follow t.
$ngram_{cen}(t) = (R_m, t, S_m)$, such that $R_m\, t\, S_m \in D$, where $R_m$ represents the m words that immediately precede t, $S_m$ represents the m words that immediately follow t, and $2m + 1 = n$. Note that if either $R_m$ or $S_m$ is empty, then $ngram_{cen}$ is empty.
Extract $R_{n-1}\,*$, $*\,S_{n-1}$ and $R_m * S_m$ from the Google book n-gram corpus, where * represents any valid word.
2.2 The n-grams generated by step 2.1 are denoted as $ng_i$, $i = 1, 2, \ldots, L$. For each $ng_i$, we extract its normalized frequency of occurrences in the Google book n-gram corpus using the following formula:

$H(ng_i) = \sum_{j=1}^{Y} \lambda_{y_j} \frac{f_{y_j}(ng_i) + 1}{total\_count_{y_j}}$  (5)

where $j = 1, 2, \ldots, Y$, Y represents the number of years of the Google book n-gram corpus that are used, $y_Y$ represents the most recent year and $y_1$ represents the least recent year in the last Y years. $f_{y_j}(ng_i)$ represents the frequency with which $ng_i$ appears in year $y_j$ in the Google book n-gram corpus. Here, $f_{y_j}(ng_i)$ is increased by 1 to avoid 0 values. $total\_count_{y_j}$ represents the total count of words appearing in year $y_j$, and $\lambda_{y_j}$ is a weight coefficient that can be used to give different weights to different years, e.g., one can assign higher weights to n-gram frequencies of more recent years by using an exponential function of $y_j$, $y_1$ and Y. The output from the above step is presented as knowledge base $DB_4$:

$DB_4 = \{ (ng_1, H(ng_1)), (ng_2, H(ng_2)), \ldots, (ng_L, H(ng_L)) \}$.  (6)

We will use an example to illustrate the contextual knowledge generating process. Suppose a typo x = whele occurred in a training document, and the training data set contained a 3-gram phrase right front whele. We searched the Google book n-gram corpus from $y_1$ = 1980 to $y_Y$ = 2008, and found 3-gram phrases of the form right front *, where * represents a valid word. Examples of such 3-grams include right front panel, right front wall, right front wheel, right front where and right front window. Fig. 11 illustrates the frequencies of occurrences of these n-grams in each year in the Google book n-gram corpus. It is obvious that the 3-gram phrase right front wheel occurred far more frequently than the other phrases.
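The normalized frequency described above can be sketched as a weighted sum of add-one-smoothed yearly ratios. This is only an illustration of one reading of formula (5): the function name, the dictionary inputs and the default uniform weight are assumptions, and the exponential weight shown is a hypothetical choice, since the exact form used by the author is not legible in this copy.

```python
import math


def normalized_frequency(counts, totals, weight=lambda year: 1.0):
    """Weighted, add-one-smoothed frequency of one n-gram over the years in totals.

    counts: {year: occurrences of the n-gram in that year} (missing year = 0)
    totals: {year: total word count of the corpus in that year}
    weight: per-year weight coefficient (assumed uniform by default)
    """
    return sum(weight(y) * (counts.get(y, 0) + 1) / totals[y] for y in totals)


def exponential_weight(first_year, n_years):
    """Hypothetical weight that grows with recency; not the author's formula."""
    return lambda year: math.exp((year - first_year) / n_years)
```

For example, `normalized_frequency(counts, totals, exponential_weight(1980, 29))` would favour counts from years close to 2008 over counts from the early 1980s.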

Fig. 11. Example of n-gram statistics

The second type of contextual knowledge is represented by the probabilities of n-grams occurring in the training documents. Thus a knowledge base $DB_5$ is generated to contain domain-specific contextual information. Specifically, we extract three types of statistics for each valid word x in D:

Probability of prefix 2-grams, $P_{2gram}(Rx)$: This is the probability of the occurrences of all the 2-grams consisting of x and the preceding token R. For instance, for the word front in the 2-gram left front, R is the word left.

Probability of suffix 2-grams, $P_{2gram}(xS)$: This is the probability of the appearance of all the 2-grams consisting of x and the subsequent token S. For instance, for the word front in the 2-gram front wheel, S is the word wheel.

Probability of centered 3-grams, $P_{3gram}(RxS)$: This is the probability of the appearance of all the 3-grams consisting of the preceding token R, the word x, and the

subsequent token S. For instance, for the word front in the 3-gram left front wheel, R is the word left and S is the word wheel.

Using Bayes formula, we obtain:

$P_{2gram}(Rx) = P(x \mid R)\, P(R) = \frac{f(Rx)}{f(R*)} \cdot \frac{f(R)}{f(*)}$  (7)

$P_{2gram}(xS) = P(x \mid S)\, P(S) = \frac{f(xS)}{f(*S)} \cdot \frac{f(S)}{f(*)}$  (8)

$P_{3gram}(RxS) = P(R \mid xS)\, P(x \mid S)\, P(S) = \frac{f(RxS)}{f(*xS)} \cdot \frac{f(xS)}{f(*S)} \cdot \frac{f(S)}{f(*)}$  (9)

where $f(*)$, $f(R*)$ and $f(*S)$ denote the frequency of all words in D, the frequency of any 2-gram in D that starts with R, and the frequency of any 2-gram in D that ends with S, respectively. As a result, $DB_5$ is represented as:

$DB_5 = \{ (x, P_{2gram}(Rx), P_{2gram}(xS), P_{3gram}(RxS)) : x \in T \}$  (10)

$DB_5$ is used to generate statistical features for the neural network trained for assessing typo correction candidates, which is discussed in the following section.

3.1.4 Assessing typo correction candidates

In many cases, the closest correction candidate generated by topographical similarity matching may not be the correct one. The knowledge bases described above were generated based on various aspects of the characteristics of typos, and each of the candidates generated from these knowledge bases has a possibility of being the right correction. We developed a Neural Network

system, NN_TC_Conf, to measure the confidence about a candidate word for a given typo over a broad range of typo-word features. Fig. 12 illustrates the architecture of the NN_TC_Conf system. The neural network uses the following ten features to characterize the weight of a correction candidate.

Fig. 12. NN_TC_Conf: a neural network for measuring the confidence about a correction candidate of a typo

Typo length: We chose this feature to take into consideration the effect that typo length may have on its correction candidates. Based on our observation, the longer a typo is, the fewer correction candidates it might have, but the possibility of a candidate being the correct word increases.

Levenshtein distance: The second feature we considered is the Levenshtein distance [21] between the correction candidate and the typo, which is defined as follows. Let

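$A = A_1 A_2 \ldots A_n$ be a typo and $B = B_1 B_2 \ldots B_m$ be a valid word, where $A_i$ and $B_j$, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m$, are letters within the alphabet, and $A_0$ and $B_0$ denote nil. The Levenshtein distance between A and B is $L\_Dis(A, B) = L\_D(A_n, B_m)$, where $L\_D(A_i, B_j)$ is a recursive function defined as follows: for each i, j,

$L\_D(A_i, B_j) = \begin{cases} 0, & i = j = 0 \\ i, & j = 0 \ \&\ i > 0 \\ j, & i = 0 \ \&\ j > 0 \\ U, & \text{otherwise} \end{cases}$

$U = \min\{ L\_D(A_{i-1}, B_{j-1}) + V,\ L\_D(A_{i-1}, B_j) + 1,\ L\_D(A_i, B_{j-1}) + 1 \}$, $V = \begin{cases} 0, & A_i = B_j \\ 1, & \text{otherwise} \end{cases}$

where $L\_D(A_i, B_j)$ denotes the number of single-character edits (insertion, deletion, substitution) required to transform sequence $A_1 A_2 \ldots A_i$ into sequence $B_1 B_2 \ldots B_j$. For example, for the typo word A = engne, if the candidate B = engine, the distance between A and B is $L\_Dis(A, B) = L\_D(A_5, B_6) = 1$, as shown in Table 2, because only one insertion edit of the letter i is needed from A to B. Similarly, if the candidate B = engineer, $L\_Dis(A, B) = L\_D(A_5, B_8) = 3$. As a result, engine is more likely to be the correct word for typo engne. The smaller this distance is, the more likely the candidate is the correct word.

Table 2 EXAMPLE OF LEVENSHTEIN DISTANCE
      | j=0 | j=1 e | j=2 n | j=3 g | j=4 i | j=5 n | j=6 e
i=0   | 0 | 1 | 2 | 3 | 4 | 5 | 6
i=1 e | 1 | 0 | 1 | 2 | 3 | 4 | 5
i=2 n | 2 | 1 | 0 | 1 | 2 | 3 | 4
i=3 g | 3 | 2 | 1 | 0 | 1 | 2 | 3
i=4 n | 4 | 3 | 2 | 1 | 1 | 1 | 2
i=5 e | 5 | 4 | 3 | 2 | 2 | 2 | 1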

Topographic similarity based distance: This is the distance between a correction candidate, A, and the typo, B, calculated using the distance function Dis(A, B) introduced in (2) in section 2.2.

Error position: This feature, denoted as E, gives the position of the first error in the typo with respect to the correction candidate. Based on [7], this feature is important because studies have shown that errors were more frequent in certain positions. For a typo A and its correction candidate B, $E = \frac{e}{|A|}$, where e is the index of the first letter that satisfies $A_e \neq B_e$.

Frequency of a typo: This is the frequency of the typo occurring within the training document set, which can be extracted from $DB_1$.

Frequency of a correction candidate: This is the frequency with which the correction candidate occurred within the training document set, which can be extracted from $DB_1$. This feature is useful when a typo is actually an abbreviation or a term frequently used in the application domain. In this case, the typo's occurrence in the training documents may be high, but the suggested correct candidate may not occur very often.

Keyboard distance: The value of this feature, denoted as K_D, is calculated based on the distance between keys on a keyboard using the QWERTY keyboard mapping [26], as shown in Fig. 13.

Fig. 13. QWERTY keyboard distance matrix

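For a typo A and a correction candidate B, K_D is calculated as follows:
1 Calculate the four similarity measures, $s_1$, $s_2$, $s_3$ and $s_4$, between A and B as discussed in Section 2.2.
2 If $s_1$ is the largest measure when compared with $s_i$, i = 2, 3, 4, then A is a deletion (Type 1) error.
2.1 Detect the letters within candidate B such that after deleting these letters from B we can obtain A.
2.2 Obtain the coordinates of the deleted letters and of the letters adjacent to them in B based on Fig. 13. For example, for the letter "a" the coordinates are looked up in the keyboard matrix.
2.3 Calculate the K_D value between A and B from the keyboard distances between these coordinates (Eq. 11).
3 If $s_2$ is the largest measure when compared with $s_i$, i = 1, 3, 4, then A is an insertion (Type 2) error.
3.1 Detect the letters in A such that after deleting these letters from A we can obtain B.
3.2 Obtain the coordinates of the inserted letters and of the letters adjacent to them in A based on Fig. 13.
3.3 Calculate the K_D value between A and B from the keyboard distances between these coordinates (Eq. 12).
4 If $s_3$ is the largest measure when compared with $s_i$, i = 1, 2, 4, then A is a substitution (Type 3) error.
4.1 Detect the letters in B such that B will equal A after substituting these letters in B by the corresponding letters from A.
4.2 Get the coordinates of all these letters in A and in B based on Fig. 13.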

4.3 Calculate the K_D value between A and B by summing, over the substituted positions, the keyboard distance between the coordinates of the letter in A and the coordinates of the corresponding letter in B (Eq. 13).
5 If $s_4$ is the largest measure when compared with $s_i$, i = 1, 2, 3, then A is a transposition (Type 4) error.
5.1 Detect the two transposed letters in A.
5.2 Get the coordinates of the two transposed letters based on Fig. 13.
5.3 Calculate the K_D value between A and B as the keyboard distance between the coordinates of the two transposed letters (Eq. 14).

The K_D feature is useful when two candidates have the same distance to the misspelled word: the candidate whose differing character is closer on the keyboard to the erroneous character is more likely to be a good correction. For instance, for the misspelled word ans, and is more likely to be a good correction than ant because d is adjacent to s on the QWERTY keyboard.

Probability of prefix 2-grams: This is the probability of the appearance of the 2-gram made up of the correction candidate C and the preceding token R of the typo, which can be found in $DB_5$.

Probability of suffix 2-grams: This is the probability of the appearance of the 2-gram made up of the correction candidate C and the subsequent token S of the typo, which can be found in $DB_5$.

Probability of centered 3-grams: This is the probability of the appearance of the 3-gram made up of the preceding token R of the typo, the correction candidate C, and the subsequent token S of the typo, which can be found in $DB_5$.
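The substitution case of the keyboard distance feature listed above can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration: the (row, column) coordinates are derived from the three letter rows of a QWERTY layout, the Euclidean distance stands in for the keyboard distance matrix of Fig. 13, and the function names are hypothetical.

```python
import math

# Assumed layout: 1-based (row, column) positions over the three letter rows.
_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
COORD = {ch: (r, c)
         for r, row in enumerate(_ROWS, start=1)
         for c, ch in enumerate(row, start=1)}


def key_distance(a, b):
    """Euclidean distance between two keys under the assumed coordinates."""
    (r1, c1), (r2, c2) = COORD[a], COORD[b]
    return math.hypot(r1 - r2, c1 - c2)


def k_d_substitution(typo, candidate):
    """Sum of key distances over the positions where the two words differ."""
    if len(typo) != len(candidate):
        raise ValueError("substitution case expects words of equal length")
    return sum(key_distance(x, y)
               for x, y in zip(typo, candidate) if x != y)
```

Under these assumptions, `k_d_substitution("ans", "and")` is smaller than `k_d_substitution("ans", "ant")`, which matches the intuition in the text that and is the better correction for ans.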

The NN_TC_Conf is trained with a set of misspelled words and their correction candidates. The misspelled words were detected in the training documents, and each candidate of each misspelled word is labeled with a confidence value. Specifically, the training data consist of tuples in the form of (x, C(x), conf(C(x))): a typo x, its correction candidate C(x), and conf(C(x)), the confidence about the candidate C(x) being the correct word for typo x. For each pair (x, C(x)), we generate the above 10 features as the input vector to the neural network, and use conf(C(x)) as the target confidence value of C(x) being the correct word to replace typo x. After training, for any pair of typo x and correction candidate C(x), the neural network NN_TC_Conf is used to generate a real value $conf'(C(x)) \in [0, 1]$ based on the ten features extracted from (x, C(x)), where $conf'(C(x))$ represents the confidence about C(x) being the correct word to replace typo x.

3.2 Intelligent Typo Detection and Correction (ITDC)

Fig. 14 gives an overview of the proposed automatic typo correction system, Intelligent Typo Detection and Correction (ITDC). The ITDC system contains four major computational components: typo detection and correction candidate generation, word boundary error correction, abbreviation processing, and correction candidate confidence generation and candidate selection. For a given text document, the ITDC system first detects typos and generates a candidate list using the general and domain specific knowledge about typos described in Section 2. It then detects and corrects two types of spelling errors: word boundary errors and uncommonly used abbreviations. Correction candidates for the remaining non-word errors are then generated and weighted, and the best candidate is selected as the output from the ITDC system.

Fig. 14. Overview of the ITDC (Intelligent Typo Detection and Correction) system

3.2.1 Typo detection and correction candidate generation

When a document is sent to the ITDC system for typo detection and correction, each term within the document is checked for its validity using the general knowledge base $GTKB_1$, the lexicon. If a word is not in $GTKB_1$, then it is considered a typo. The ITDC system then generates a list of correction candidates based on $GTKB_2$, the general language typo list, $DB_2$, the list of acronyms commonly used in the application domain, and $DB_3$, the list of similar word groups and their correction candidates. The algorithm is described as follows.

Algorithm for typo detection and correction candidate generation
1 For each word x on the term list of the input document, check whether $x \in GTKB_1$. If it is, then exit, since x is not a typo.
2 If x is a commonly misspelled word found in $GTKB_2$, then use its correction candidate, C(x), provided by $GTKB_2$ as the correction candidate for x, add (x, C(x)) to $T\&C\_L_1$, and exit.
3 If there is an $(ac, ext\_ac) \in DB_2$ such that x = ac, then C(x) = ext_ac, add (x, C(x)) to $T\&C\_L_1$, and exit.
4 If there is a $(g, g\_w) \in DB_3$ such that $x \in g$, add (x, g_w) to $T\&C\_L_1$, and exit.
5 Search in $GTKB_1$ to find the N valid words that best match x using the Levenshtein distance function, which are denoted as $C_i(x)$, $i = 1, 2, \ldots, N$, add $(x, \{C_1(x), C_2(x), \ldots, C_N(x)\})$ to $T\&C\_L_1$, and exit.

The output from the above algorithm is $T\&C\_L_1$, which contains all the typos found in an input document along with up to N correction candidates for each typo. Examples of entries in the list $T\&C\_L_1$ are shown in Table 3.
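The five-step cascade above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the knowledge bases are passed in as plain Python containers (a set for the lexicon, dictionaries for the general typo list and the acronym list, and a list of (group, label) pairs for the word groups), ties in step 5 are broken alphabetically, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic edit distance (insertion, deletion, substitution), two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),  # substitute or match
                           prev[j] + 1,               # delete from a
                           cur[j - 1] + 1))           # insert into a
        prev = cur
    return prev[-1]


def generate_candidates(word, lexicon, common_typos, acronyms, groups, n=5):
    """Return None for a valid word, else a list of correction candidates."""
    if word in lexicon:                      # step 1: valid word, not a typo
        return None
    if word in common_typos:                 # step 2: general typo list
        return [common_typos[word]]
    if word in acronyms:                     # step 3: domain acronyms
        return [acronyms[word]]
    for members, label in groups:            # step 4: similar-term groups
        if word in members:
            return [label]
    # step 5: the n lexicon words closest in Levenshtein distance
    return sorted(lexicon, key=lambda w: (levenshtein(word, w), w))[:n]
```

The early exits mirror the order of the knowledge bases in the text, so a term that is covered by a curated list never reaches the more expensive lexicon search.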

Table 3 EXAMPLE OF ENTRIES IN T&C_L 1
Misspelling | Correction candidate list | Source
abandonned | abandoned | Generated from GTKB2, in Step 2 above
KOEC | Key On Engine Cranking | Generated from DB2, in Step 3 above
engien | Engine | Generated from DB3, in Step 4 above
ssem | system, systems, state, states, stem | Generated from GTKB1, in Step 5 above
noisehad | noseband | Generated from GTKB1, in Step 5 above, not a good candidate
diagn | diag, drag, drain, drags, dragon | Generated from GTKB1, in Step 5 above, not a good candidate list

3.2.2 Word boundary error detection and correction

Word boundary errors are defined as either multiple words missing the white space characters between them (run-on error), or a valid word split by a white space (split error) [16]. It is very important to process word boundary errors separately, since they are very different from the other typos. For instance, for the split error or dered, the correction candidate list could only be generated on typo dered, because or is a valid word. The following describes the two algorithms we developed to solve word boundary errors, one for split error correction, and another for run-on error correction.

Split Error Correction algorithm
1 For each typo x in $T\&C\_L_1$, find its adjacent words in the input document X, $w_1$ and $w_2$.
2 Check the dictionary for $wx = w_1 + x$, and $xw = x + w_2$.

3 If wx is a valid word, calculate the probability of appearances of $w_1$, x and wx, denoted as $p(w_1)$, $p(x)$ and $p(wx)$, where $p(w_1) = \frac{f(w_1)}{f(*)}$, $f(w_1)$ denotes the frequency of word $w_1$ appearing in the knowledge base $DB_1$, and $f(*)$ denotes the frequencies of all words in $DB_1$. $p(x)$ and $p(wx)$ are calculated in the same way.
4 If $w_1$ is a valid word and $p(wx) > p(x)$, update $T\&C\_L_1$ by setting wx as the only correction candidate of the two terms $w_1\, x$ and exit.
5 If $w_1$ is also a typo and $p(wx) > p(x)$, $p(wx) > p(w_1)$, update $T\&C\_L_1$ by setting wx as the only correction candidate for the two terms $w_1\, x$ and exit.
6 Repeat Step 3 to process xw.

We use one example to illustrate the rationale behind Steps 4 and 5. Let $w_1$ = or, x = dered. We have wx = ordered. We use wx as the correction for $w_1\, x$ only if p(ordered) > p(dered), that is, only if the combined word wx appears more often than the typo x in the training documents. If both $w_1$ and x are invalid, Step 5 makes sure that wx has to appear more often than both $w_1$ and x. The reason is that if wx never appeared in D but x appeared many times, then x might have its own correction candidate instead of wx.

Run-on Error Correction algorithm
To detect and correct run-on typos, the system checks each typo, x, on the list $T\&C\_L_1$ to see if it can be split into two valid words $w_1$ and $w_2$, with $p(w_1) > 0$, $p(w_2) > 0$. If so, let $w_1\, w_2$ be the only correction candidate for typo x, and update $T\&C\_L_1$ accordingly. If x can be separated into multiple pairs of valid words, for instance, both $x = w_1 w_2$ and $x = w_3 w_4$, we use Bayes formula to calculate their occurrence probability in training data D:

$P(w_1 w_2) = P(w_1 \mid w_2)\, P(w_2) = \frac{f(w_1 w_2)}{f(* w_2)} \cdot \frac{f(w_2)}{f(*)}$, $P(w_3 w_4) = P(w_3 \mid w_4)\, P(w_4) = \frac{f(w_3 w_4)}{f(* w_4)} \cdot \frac{f(w_4)}{f(*)}$  (15)

where $f(* w_2)$ denotes the frequency of any two-word sequence that has $w_2$ as the second word. The pair of words with the higher appearance probability is selected as the candidate to replace typo x.

The output of these two algorithms is an updated typo list of $T\&C\_L_1$, denoted as $T\&C\_L_2$, in which the word boundary typos are corrected. Table 4 shows the results after applying the two word boundary algorithms to the $T\&C\_L_1$ shown in Table 3.

Table 4 EXAMPLE OF ENTRIES IN T&C_L 2
Misspelling | Correction candidate list
abandonned | abandoned
KOEC | Key On Engine Cranking
engien | Engine
ssem | system, systems, state, states, stem
noisehad | noise had (updated after word boundary error correction)
diagn | diag, drag, drain, drags, dragon

3.2.3 Abbreviation processing

Abbreviations are defined as the segment of a valid word starting with the first character of the valid word [4]. For instance, diagn is the abbreviation of diagnose. For these abbreviations, simple similarity comparison methods may not find the correct candidates. For example, a

candidate list for diagn generated based on word distances may contain diag, drag, drain, drags, dragon, but no correct candidate. In many note-taking or record-keeping applications such as vehicle diagnostic records, people often type in the first several characters of a word as its abbreviation. For example, conn represents connection, comm represents communication, etc. In order to detect those uncommon abbreviations, we compare each typo, x, in D_typos with every valid word, w, in $DB_1$. If x matches the beginning |x| letters in w, add w to x's candidate list, where |x| denotes the number of letters in x. The output of this stage is a further updated typo list $T\&C\_L_3$ in which the abbreviations are processed. Table 5 shows the result of this process on the words shown in Table 4.

Table 5 EXAMPLE OF ENTRIES IN T&C_L 3
Misspelling | Correction candidate list
abandonned | abandoned
KOEC | Key On Engine Cranking
engien | Engine
ssem | system, systems, state, states, stem
noisehad | noise had
diagn | diagnose, diag, drag, drain, drags, dragon (updated after abbreviation processing)

3.2.4 Correction candidate weight generation, ranking and selection

In an automatic typo correction system, when more than one correction candidate is generated, a crucial task is to determine which word within the correction candidate list should be selected to

replace the misspelled word. Our solution to this problem is to use contextual knowledge to re-rank the candidates. The following algorithm is developed to rank the candidates of typo x based on the n-gram statistics knowledge contained in $DB_4$ and the neural network NN_TC_Conf.

Algorithm for ranking typo candidates based on contextual knowledge
For n = 3 ~ G, where G denotes the maximum length of n-gram we used in generating $DB_4$:
1 For a typo x detected in the testing document X, extract the prefix n-grams, suffix n-grams and centered n-grams of x from X, which are denoted as $ngram_{pre}(x) = (R_{n-1}, x)$, $ngram_{suf}(x) = (x, S_{n-1})$, and $ngram_{cen}(x) = (R_m, x, S_m)$. This is similar to the process of generating n-grams for typos in training documents discussed earlier.
2 For each candidate word C(x) generated for x, let $H_n(C(x)) = 0$, where $H_n(C(x))$ represents the normalized frequency with which C(x) appears in the Google book n-gram corpus.
2.1 If $R_{n-1}\, C(x)$ equals any $ng_i$ in $DB_4$, where $i = 1, 2, \ldots, L$, then $H_n(C(x)) = H_n(C(x)) + H(ng_i)$.
2.2 If $C(x)\, S_{n-1}$ equals any $ng_i$ in $DB_4$, where $i = 1, 2, \ldots, L$, then $H_n(C(x)) = H_n(C(x)) + H(ng_i)$.
2.3 If $R_m\, C(x)\, S_m$ equals any $ng_i$ in $DB_4$, where $i = 1, 2, \ldots, L$, then $H_n(C(x)) = H_n(C(x)) + H(ng_i)$.

The output from this process is a weighted frequency $F_x(C(x))$ for each C(x) generated for typo x, where $F_x(C(x)) = \sum_{n=3}^{G} H_n(C(x))$. Take the same example typo whele as we discussed in section 2.3: two candidates, where and wheel, generated for whele are found in phrases of $DB_4$. Suppose G = 3; we have $F_{whele}(where)$ = 1.63e-09 and $F_{whele}(wheel)$ = 2.68e-07, which indicates that wheel is better than where as the correction candidate for typo whele.
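The context lookup in the algorithm above can be sketched as follows. This is a minimal illustration, assuming the contextual knowledge base is available as a dictionary from word tuples to normalized frequencies and that the document is already tokenized; the function and parameter names are hypothetical, and the two frequencies in the usage note simply reuse the values quoted in the whele example.

```python
def weighted_frequency(tokens, pos, candidate, db4, g=3):
    """Sum the stored frequencies of the n-grams formed by putting
    candidate in place of the typo at tokens[pos], for n = 3..g."""
    total = 0.0
    for n in range(3, g + 1):
        contexts = []
        left = tokens[max(0, pos - (n - 1)):pos]
        if len(left) == n - 1:                       # prefix n-gram
            contexts.append(tuple(left) + (candidate,))
        right = tokens[pos + 1:pos + n]
        if len(right) == n - 1:                      # suffix n-gram
            contexts.append((candidate,) + tuple(right))
        if n % 2 == 1:                               # centered n-gram, 2m + 1 = n
            m = (n - 1) // 2
            before = tokens[pos - m:pos] if pos >= m else []
            after = tokens[pos + 1:pos + 1 + m]
            if len(before) == m and len(after) == m:
                contexts.append(tuple(before) + (candidate,) + tuple(after))
        total += sum(db4.get(c, 0.0) for c in contexts)
    return total
```

For the token list `["check", "right", "front", "whele", "noise"]` and a knowledge base holding `("right", "front", "wheel")` at 2.68e-07 and `("right", "front", "where")` at 1.63e-09, the candidate wheel scores higher than where, as in the text.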

Let $\{ C_i(x), i = 1, 2, \ldots, N \}$ be the N correction candidates for each typo x in $T\&C\_L_3$.
For each $C_i(x)$ generated for x, within the time period $y_1$ to $y_Y$, calculate the weighted frequency $F_x(C_i(x))$ in the Google n-gram corpus.
For each pair $(x, C_i(x))$, extract the 10 features discussed in section 2.4 and apply them to NN_TC_Conf, which outputs a confidence value $conf(C_i(x))$.
Calculate the weight $V(C_i(x))$ for each $C_i(x)$, where $V(C_i(x)) = conf(C_i(x)) * F_x(C_i(x))$. Re-rank these N correction candidates $C_i(x)$ in descending order based on $V(C_i(x))$.
For all $C_j(x)$ having the same value of $V(C_j(x))$, calculate $Dis(x, C_j(x))$ as discussed above in (2) of section 2.2, and re-rank these correction candidates $C_j(x)$ for typo x based on this similarity distance.
Select the top N' correction candidates from the re-ranked candidate list, as $\{ C_i(x), i = 1, 2, \ldots, N' \}$.

After correction candidate weight generation, ranking and selection, for each misspelled word x, the auto correction is quite straightforward: check whether the confidence $conf(C_1(x))$ of the first candidate $C_1(x)$ in the re-ranked candidate list reaches the acceptance threshold. If so, replace the misspelled word x in the text using $C_1(x)$. Otherwise, x is flagged but not corrected, and the top N' correction candidates are stored for manual correction afterwards. The flowchart of the candidate ranking and selection algorithm for each typo x in the testing document X is summarized in Fig. 15.
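The ranking and selection steps above can be summarised in a short Python sketch. It is an illustration only: the confidences and weighted frequencies are assumed to be precomputed dictionaries, the distance is passed in as a callable, the function name is hypothetical, and the default threshold of 0.5 is a placeholder because the value used in the study is not legible in this copy.

```python
def rank_and_select(typo, candidates, conf, freq, distance,
                    top_n=5, threshold=0.5):
    """Rank candidates by V = conf * F (descending), break ties by the
    distance to the typo (ascending), keep top_n, and auto-correct only
    when the best candidate is confident enough."""
    ranked = sorted(candidates,
                    key=lambda c: (-conf[c] * freq[c], distance(typo, c)))
    ranked = ranked[:top_n]
    best = ranked[0] if ranked else None
    auto = best if best is not None and conf[best] >= threshold else None
    return auto, ranked
```

Returning both values keeps the two outcomes in the text apart: `auto` is the replacement applied automatically (or `None` when the typo is only flagged), and `ranked` is the list stored for manual correction.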

Fig. 15. Candidate ranking and selection

3.3 Empirical study

We conducted an empirical study in the application domain of automotive fault diagnostics text data mining, where the text documents are warranty claims and repair descriptions recorded verbatim. These text records contain rich information about the various cases of vehicle malfunctions, root causes, and repair processes, yet it is difficult to manually extract knowledge from them to formulate reliable rules that associate an effective repair procedure with a given problem description. Moreover, the records have poor grammar structure, and contain many typos, self-invented abbreviations and domain specific terminologies, which are big challenges to data mining systems. We implemented the proposed ITDC system and applied it to the text documents in this application domain. The following subsections describe the constructed knowledge bases and experiment results.

3.3.1 Building general knowledge bases

In this study, we constructed the valid lexicon, $GTKB_1$, from the WinEdt English_US and English_UK dictionary developed by Patrick Daly [20]. WinEdt is a powerful and versatile text editor for Windows, with a strong predisposition towards the creation of LaTeX documents. $GTKB_1$ contains more than 150,000 valid English words, and is used to recognize non-word errors. A list of common typos collected by the Wikipedia Typo Team was used as our general knowledge base $GTKB_2$. $GTKB_2$ contains 4238 misspellings frequently appearing in online documents throughout Wikipedia.

3.3.2 Building domain specific knowledge bases

In this empirical study, we used a set of 200,000 customers' claim reports on various vehicle problems provided by an automotive company as the training document set D to generate the domain specific knowledge bases $DB_1$, $DB_2$ and $DB_3$. Initially, distinct index term words were extracted from D, and stored in $DB_1$ along with their term frequencies. By dictionary searching, we found 1763 distinct typos with length less than l_ac = 5, and, among them, 144 acronyms were finally identified and extracted. Examples of nontrivial acronyms in the $DB_2$ knowledge base are: CEL - Check Engine Light, CKT - Circuit, DLC - Data Link Connector; KOEC is used frequently in the context of DTC (Diagnostic Trouble Code) extraction, referring to Key On Engine Cranking, etc. We also applied the word grouping algorithm presented in section 2.2 to obtain 3972 word groups, which are used as $DB_3$.

In the experiments, we used the Google Book corpus: American English, which contains words that occurred in books published during 1980 ~ 2008, to generate the contextual knowledge base $DB_4$, and, due to the space limitation of storing the dataset, only the 3-gram phrases of each typo were extracted. The training data D was used to generate the domain specific contextual knowledge base $DB_5$.

3.3.3 Typo detection and correction

Two sets of test documents, $T_1$ and $T_2$, provided by two different automotive manufacturers were used to evaluate the ITDC system. The first testing set $T_1$ contains 603 free-form technician verbatim problem descriptions. The second testing set $T_2$ contains 580,000 vehicle fault diagnostic records. We set the thresholds for word length and average word grouping distance

discussed in section 2.2 as follows: $l = 2$, so that only words with length larger than 2 are processed, and, in approximate string similarity matching, $\theta = 0.3$, which is the average distance threshold for word comparison.

ITDC used $GTKB_1$ to detect typos in $T_1$ and $T_2$. There were 392 typos detected in $T_1$; typos were likewise detected in $T_2$. From those typos, the word boundary processing algorithms detected and corrected 74 split errors and 59 run-on errors in $T_1$, and 1292 split errors and 5273 run-on errors in $T_2$. The false alarm rate in $T_1$ and $T_2$ was 0% and 0.4%, respectively. Note that most of the wrong corrections in $T_2$ were typos incorrectly detected as run-on errors. For example, holtline, which should have been detected and corrected as hotline, was detected as a run-on error, so it was split into holt line; performace, which should have been corrected to performance, was detected incorrectly as a run-on error, and so it was split into perform ace.

The abbreviation process corrected 14 non-word errors in $T_1$, and 242 non-word errors in $T_2$. Examples of detected word boundary errors and abbreviations are shown in Table 6 and Table 7. These terms are domain specific, so the domain specific knowledge played an important role in correcting these typos. In particular, these abbreviations are commonly used neither in news articles nor in general web documents.

Table 6 EXAMPLE OF WORD BOUNDARY ERRORS
Split error | Correction
ho le | hole
repla ced | replaced
shie ld | shield
hea vily | heavily
exc essive | excessive

Run-on error | Correction
sensorand | sensor and
connectiona | connection a
roadtest | road test
differentialpossible | differential possible
drivebelt | drive belt

Table 7 EXAMPLE OF UNCOMMON ABBREVIATIONS
Abbreviation | Correction
conn | connection
diagn | diagnose
con | continue
comm | communication
veh | vehicle
diff | different
eng | engine
cus | customer
sig | signal

The knowledge bases $GTKB_2$ (general misspellings), $DB_2$ (domain-specific acronyms) and $DB_3$ (domain-specific word groups) were used to correct 29 non-word errors in $T_1$, and 1127 non-word errors in $T_2$. Some examples are listed in Tables 8, 9 and 10.

Table 8 EXAMPLE OF TYPOS RECOGNIZED BY GTKB 2
Error | Correction
recieved | received
continous | continuous
inegaion | integration
ocasionally | occasionally
beween | between
procede | proceed
fomed | formed
neccesary | necessary
taht | that
thsi | this
reponse | response

Table 9 EXAMPLE OF TYPOS RECOGNIZED BY DB 2
Error | Correction
ECU | Engine Control Unit
EVAP | Evaporative Emission
IPC | Instrument Panel Cluster
KOEC | Key On Engine Cranking
T | Torque Converter Clutch
VSS | Vehicle Speed Sensor
TCC | Torque Converter Clutch

Table 10 TYPOS RECOGNIZED BY DB 3
Error | Correction
eacic | erratic
ac | ack
whl | wheel
pogamed | programmed
manuves | maneuver
chekc | check
fse | fuse
exhuas | exhaust
syem | system
yed | ied

The training data for the neural network, NN_TC_Conf, are processed as follows. First we collected a set of 2230 free-form technician verbatim documents from the training data set, which contained 1547 typos. We used the Levenshtein distance algorithm to generate up to five candidates for each misspelled word by finding the closest valid words in $DB_1$ to the typo. These typos and their respective candidates were manually labeled as high or low confidence for being the correct candidates. This process generated 783 likely candidates and 6166 unlikely candidates.

As discussed in section 3.4, for each typo detected, we set G = 3, where G denotes the maximum length of n-gram we look into, and generated 20 correction candidates first, i.e. N = 20. After the candidate weight generation and ranking, we selected the top 5 correction candidates, i.e. N' = 5. Note that G, N and N' are the parameters used in the candidate weight generation, ranking and selection section.

For the purpose of comparison, we applied two state-of-the-art spell checkers, Google Spell check and Aspell check, to the same two testing document sets, $T_1$ and $T_2$. We use the following three Suggestion Intelligence First (SIF) measures [6,7] to evaluate the performances of the typo correction systems:

SIF = (Total correct suggestions found first) / (Total number of typos on list)
SIF3 = (Total correct suggestions found in top 3) / (Total number of typos on list)
SIF5 = (Total correct suggestions found in top 5) / (Total number of typos on list)

The performance results of the three systems are shown in Table 11 and Table 12. The proposed ITDC system made 3.83% false auto corrections on $T_1$, and 3.54% false auto corrections on $T_2$.

Table 11 Performance comparison with state-of-the-art spell checkers on T1
System | SIF5 | SIF3 | SIF
Google Spell | 50.77% | 48.72% | 45.41%
Aspell | 27.81% | 26.02% | 25.51%
ITDC without candidate ranking process | 54.85% | 52.81% | 50.77%
ITDC after candidate ranking process | 65.56% | 64.28% | 62.24%
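The three SIF measures defined above differ only in how many top suggestions are inspected, so they can be computed by one helper. The sketch below is illustrative: the function name and the dictionary-based inputs are assumptions, and the toy data in the usage note is made up for demonstration rather than taken from the experiments.

```python
def sif_at(k, suggestions, truth):
    """Fraction of typos whose correct word appears among the top k suggestions.

    suggestions: {typo: ranked list of suggested corrections}
    truth:       {typo: correct word}
    """
    hits = sum(1 for typo, correct in truth.items()
               if correct in suggestions.get(typo, [])[:k])
    return hits / len(truth)
```

Calling `sif_at(1, ...)`, `sif_at(3, ...)` and `sif_at(5, ...)` gives SIF, SIF3 and SIF5 respectively; a typo with no suggestions at all simply counts as a miss.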

Table 12 Performance comparison with state-of-the-art spell checkers on T2
System | SIF5 | SIF3 | SIF
Google Spell | 57.96% | 55.05% | 51.29%
Aspell | 44.25% | 42.20% | 39.51%
ITDC without candidate ranking process | 63.87% | 61.61% | 60.47%
ITDC after candidate ranking process | 68.43% | 67.34% | 65.10%

Table 13 shows a few examples of typo corrections made by the ITDC system and by the Google and Aspell systems. It is obvious that the ITDC system outperforms these two systems through abbreviation processing, word boundary error handling and correction candidate re-ranking based on neural learning of the confidence of candidate correctness and n-gram statistical analysis.

Table 13 Correction candidate list comparison
Typo | System | Correction candidates
nece | Google | neck, Becca, Mecca, mecca, enc
nece | Aspell |
nece | ITDC | neck, necessary
whele | Google | whale, wheel, while, whole, Wheeler
whele | Aspell | where, wheel, whale, whelp, while
whele | ITDC | wheel, while, where, whole, whelp
speend | Google | spend, speed, spent, spawned, spurned
speend | Aspell | spend, speed
speend | ITDC | speed, spread, spend, spent, seed
sevi | Google | servo, serve, sever, serf, servos
sevi | Aspell | serve
sevi | ITDC | service, serve, servo, swerve

The accuracy of all three spell checker systems evaluated above may not be considered very high. This is due to the fact that the testing set contains many non-word typos or domain-

specific abbreviations that cannot be recognized easily, such as bjb, bwn, ops, ssms, etc. These errors were detected and marked as typos by the ITDC system, but left without correction.

In order to test the assumption that typo correction could provide better quality and more comprehensible text for both humans and machines and help with further text processing tasks, we conducted text categorization on the same text collections, $T_1$ and $T_2$, as those used in the above typo correction experiments, in which each document has a category label. For both datasets, we use the conventional VSM model discussed in Chapter 2 as the text representation approach, with the same local and global weighting scheme, the tf-idf approach [67]. From each dataset, we choose 2/3 of the documents from each class as the training set and the remaining 1/3 as the testing set, and conduct 3-fold cross validation to get the average accuracy of the system. More details about text categorization can be found in Chapter 4. The accuracy of the text categorization is measured using the following evaluation metric:

Accuracy = (Total number of testing documents correctly classified) / (Total number of testing documents)

The experiment results are shown in Fig. 16 and Fig. 17, from which it is obvious that typo correction improves text categorization accuracy by 2% and 5%, and reduces the term feature space by 7.7% and 14%, respectively. We can observe that this improvement is more significant in the larger dataset where typos occur more frequently, e.g., text collection $T_2$, since typo correction greatly reduces the noisy terms and merges topographically similar terms.

In terms of the efficiency of ITDC, all the knowledge bases described above are generated by automated programs that extract and store general language knowledge and domain specific knowledge from digital resources effectively. The only manual work involved is to determine which

89 Tem feaue sie Accuacy % esouces o use and pepaing he daa, as well as labeling aining daa fo NN_TC_Conf. These seps usually ake only 3-4 hous befoe ypo deecion and coecion. 86 Tex caegoiaion accuacy w/o ypo coecion Befoe ypo coecion Afe ypo coecion 74 T1 aase T2 Fig. 16. Example of ex caegoiaion accuacy w/o ypo coecion Sie of indexed em feaues w/o ypo coecion 0 T1 aase T2 Befoe ypo coecion Afe ypo coecion Fig. 17. Example of ex caegoiaion feaue sie w/o ypo coecion 77

CHAPTER 4. TEXT CATEGORIZATION BASED ON MACHINE LEARNING, STATISTICAL MODELING AND ONTOLOGY NETWORK

As mentioned in section 2.5.1, the traditional VSM model does have its strength: it is efficient and provides a compact text representation, without requiring a full understanding of the content through natural language processing techniques, which is usually time and space consuming. However, since only single-word information is considered, text categorization accuracy may be affected if single words do not fully capture the document content: information such as word co-occurrence, word context within documents, and the semantic ambiguity of words, including synonymy and polysemy, is missing. Therefore, a text representation model that has a sophisticated structure and is as inclusive as possible in terms of text information is of great necessity.

In this chapter, we present our research work in text categorization by introducing VSM-based text representation models using machine learning, statistical modeling and ontology networks. We use an innovative hybrid text mining framework, which contains a global weighting scheme, a VSM model built from the WordNet ontology network, and a VSM model augmented with statistical topic modeling. Fig. 18 illustrates the proposed framework. Our system takes in the training document collection and generates a list of indexed terms. After that, each indexed term is weighted, and the document corpus is modeled by traditional VSM as a weighted TD matrix. The PLSA model is applied to generate a latent topic level LTD matrix, and the WordNet ontology is fed into the system to generate a new term-document matrix and a concept level CD matrix. These matrices are then combined for the final document representation and used for SVM classifier training. More details are discussed in the following sections.

Figure 18. Proposed text categorization model framework

4.1 Text categorization based on VSM and PLSA topic modeling

This section discusses our work in building a VSM model using an entropy-based global weighting scheme, using PLSA topic modeling to extract latent topics and build a VSM model that reflects word relationships, and using semi-supervised PLSA topic modeling to build a VSM model that reflects document relationships, based on pre-defined document connectivity information.

A VSM Model with a new global weighting scheme

As discussed in Section 2.4.1, a text document is usually represented by VSM for ease of computation and analysis. A vector space model should be built based on carefully selected terms and weighting schemes. More specifically, for a given set of training documents $T = \{d_1, d_2, \ldots, d_N\}$, $T = T_1 \cup T_2 \cup \cdots \cup T_C$, where $d_l$ is the $l$-th training document, $C$ is the number of document categories, and $T_c$ is the set of documents that belong to category $c$, $c = 1, 2, \ldots, C$, our vector space model is built through the following machine learning process.

First of all, we generate an indexed term list from T, denoted as T_L, $T\_L = \{t_1, t_2, \ldots, t_K\}$, where $t_i$ is the $i$-th indexed term, through a number of preprocessing tasks, including word tokenization, symbol and punctuation removal, automatic typo correction as discussed in Chapter 3, stop word removal and low-frequency term removal. While generating the list of indexed terms, we need to keep only content-bearing words, implying that function words with either very low or very high frequency have to be removed [68]. As a result, we removed the high-frequency stop words at the first stage, and then set up a term frequency threshold as a filter for low-frequency words.

Secondly, for a document $d_l \in T$, its vector representation is defined as follows:

$$W_l = [w_{1,l}, w_{2,l}, \ldots, w_{K,l}], \qquad w_{i,l} = tf_{i,l} * GW_i, \quad i = 1, 2, \ldots, K,$$

where $tf_{i,l}$ is the occurrence frequency of term $t_i$ within $d_l$, $GW_i$ is a global weight for term $t_i$, and $K$ is the number of term features.

VSM models based on appropriate term weighting schemes are particularly essential for information retrieval and text categorization [67]. An appropriate global weighting scheme should be applied to each indexed term with the purpose of reducing or enhancing the effect it has on particular documents. As mentioned in section 2.4.1, although there are plenty of global weighting approaches available, most of them are designed for the entire dataset. For example, in idf global weighting,

$$idf_i = \log_2\frac{ndocs}{df_i} + 1,$$

where $df_i$ denotes the total number of documents in the document collection that contain term $t_i$, and $ndocs$ represents the total number of documents in the whole document collection T. However, based on our observation, important terms or their synonyms often appear frequently in documents within a specific category, especially when the user-defined categories are highly relevant to some specific keywords [24]. As a result, we developed the following category-entropy global weighting scheme, denoted as CE_W:

For each term $t_i$ in the term list T_L, calculate the proportion of the documents in T that contain $t_i$ within the $C$ different categories:

$$p_{ij} = \frac{N\_c_{ij}}{N\_c_j}, \quad j = 1, 2, \ldots, C,$$

where $N\_c_{ij}$ is the number of documents within the $j$-th category that contain $t_i$, and $N\_c_j$ is the total number of documents in the $j$-th category.

Normalize $p_{ij}$, so that

$$\hat{p}_{ij} = \frac{p_{ij}}{\sum_{j=1}^{C} p_{ij}}.$$

Calculate the entropy with respect to $t_i$:

$$E_i = -\sum_{j=1}^{C} \hat{p}_{ij}\log\hat{p}_{ij}.$$

The entropy measure is a good indicator of how term $t_i$ is distributed over different document categories. The higher the entropy, the less important term $t_i$ is, since it is more evenly distributed among different document categories.

Calculate the global weight $CE\_W_i$ for $t_i$:

$$CE\_W_i = 1 - \frac{E_i}{\log C},$$

where $C$ is the total number of categories. This global weight function gives more weight to terms that have small entropy values. We will show in Chapter 5 that the category-entropy based global weight function performs much better than the most widely used inverse document frequency (idf) method.

At the end of the VSM generation step, the output is a TD matrix $M_0$, $M_0 = [W_1^T, W_2^T, \ldots, W_N^T]$. For a previously unseen document $d_u$, we generate its vector representation $W_u$ in the same manner, $W_u = [w_{1,u}, w_{2,u}, \ldots, w_{K,u}]$, for testing purposes.

A VSM augmented with PLSA topic modeling

In this section, we discuss the details of how to use PLSA topic modeling, introduced earlier, to extract latent semantic topics from text, which are used to generate an augmented VSM model for text representation and to help improve the accuracy of text categorization tasks.

Learning PLSA model from training documents

As already presented in section 2.4.3, the PLSA model is a well-known statistical language model mostly used for unsupervised text clustering and information retrieval. The starting point of PLSA is the term-document frequency (TF) matrix before applying the global weighting scheme, and it follows the bag-of-words assumption, in which each word appears independently and the order in which words occur is not considered. As shown in Fig. 6, $P(d)$, $P(z|d)$ and $P(w|z)$ represent the probabilities of observing a document $d$, a latent topic $z$ occurring in $d$, and a word $w$ belonging to $z$, respectively. The generative process of each document-word pair in the text corpus, T, is as follows:

1. Select a document $d$ from T based on $P(d)$.
2. Pick a topic $z$ according to $P(z|d)$.
3. Given $z$, generate a word $w$ based on $P(w|z)$.

The hidden variables in this process, $P(w|z)$ and $P(z|d)$, are what we are interested in and want to estimate, for each word-topic pair and topic-document pair.

Again, we know that the joint probability of each document-word pair can be derived as follows, based on Bayes' Theorem [40]:

$$P(d,w) = P(d)\,P(w|d)$$

Here, $P(w|d)$ is derived based on Bayes' rule, with the following two assumptions [74]. First, observations of document-word pairs $(d,w)$ are assumed to be generated independently, which corresponds to the bag-of-words approach. Second, given latent topic $z$, $w$ is generated independently of $d$. Therefore, we have:

$$P(w|d) = \sum_z P(w|z)P(z|d), \qquad P(d,w) = P(d)\sum_z P(w|z)P(z|d)$$

The likelihood function of the entire document collection, T, can also be derived as:

$$\mathcal{L} = \prod_d \prod_w P(d,w)^{n(d,w)},$$

based on the observation of all document-word pairs, where $n(d,w)$ denotes the frequency with which word $w$ appears in document $d$. Our objective is thus to estimate the hidden variables by maximizing this likelihood function. Because it is difficult to maximize the above exponential likelihood function, it is more convenient to work with its logarithm, called the log-likelihood. The objective of this estimation is thus to maximize the log-likelihood function, as shown in formula (1):

$$L = \sum_d \sum_w n(d,w)\log P(d,w) \tag{1}$$

Since $P(d)$ is not related to the parameters we want to estimate, and we assume that it takes a constant value for each document in T, we then have:

$$\arg\max L = \arg\max \sum_d \sum_w n(d,w)\log\sum_z P(w|z)P(z|d) \tag{2}$$

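As mentioned in section 2.4.3, (2) can be solved using the Expectation Maximization (EM) algorithm [75]. We now take a closer look at the derivation of why and how the EM algorithm can be applied for estimating $P(w|z)$ and $P(z|d)$.

Derivation of EM algorithm

From (2), we can see that it is still difficult to find the solution because of the logarithm of a sum. As a result, we introduce a distribution over topic $z$ into (2), denoted as $A(z)$, where $A(z) \ge 0$ and $\sum_z A(z) = 1$. We then have:

$$(2) = \arg\max \sum_{d,w} n(d,w)\log\sum_z A(z)\frac{P(w|z)P(z|d)}{A(z)} \tag{3}$$

Based on the law of the unconscious statistician [106], $E_A[g(z)] = \sum_z g(z)A(z)$. Therefore, letting $g(z) = \frac{P(w|z)P(z|d)}{A(z)}$, we have

$$(3) = \arg\max \sum_{d,w} n(d,w)\log E_A\!\left[\frac{P(w|z)P(z|d)}{A(z)}\right] \tag{4}$$

From (4), it is still difficult to find the solution in the $\log E[g]$ format. However, because $\log$ is a concave function, based on Jensen's inequality [110] we can find a lower bound function for the objective in (4):

$$\sum_{d,w} n(d,w)\log E_A\!\left[\frac{P(w|z)P(z|d)}{A(z)}\right] \ge \sum_{d,w} n(d,w)\,E_A\!\left[\log\frac{P(w|z)P(z|d)}{A(z)}\right] = \sum_{d,w} n(d,w)\sum_z A(z)\log\frac{P(w|z)P(z|d)}{A(z)}$$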

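Note here that when $\frac{P(w|z)P(z|d)}{A(z)} = c$, where $c$ is a constant, the bound holds with equality and we have:

$$(4) = \arg\max \sum_{d,w} n(d,w)\sum_z A(z)\log\frac{P(w|z)P(z|d)}{A(z)} \tag{5}$$

Because $\sum_z A(z) = 1$, it follows that

$$A(z) = \frac{P(w|z)P(z|d)}{\sum_{z'} P(w|z')P(z'|d)}.$$

Also, based on Bayes' rule and the independence of $w$ and $d$ given $z$, we have:

$$P(z|d,w) = \frac{P(w|z)P(z|d)}{\sum_{z'} P(w|z')P(z'|d)}$$

Thus, $A(z) = P(z|d,w)$, which shows that to find the maximum solution for (4), at every iteration $A(z)$ should be the posterior probability of $z$ given the observed document-word pair $(d,w)$. Therefore, the EM algorithm can be used as follows.

Computational steps of EM algorithm

Each iteration of the EM algorithm consists of an expectation step (E-step) and a maximization step (M-step). In the E-step, based on the current estimates of $P(w|z)$ and $P(z|d)$, the posterior probability of $z$, $P(z|d,w)$, is computed for each document-word pair. In the M-step, $P(w|z)$ and $P(z|d)$ are updated by maximizing (4), and the updated values are used in the E-step of the next iteration until convergence. Detailed steps of the EM algorithm are discussed below: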

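Initialization: Define the maximum number of iterations $R$ and the number of latent topics $G$ to be generated. For each document-topic and topic-word pair, assign random values between 0 and 1 to $P^0(z|d)$ and $P^0(w|z)$, with the constraints $\sum_z P^0(z|d) = 1$ and $\sum_w P^0(w|z) = 1$.

E-step: At iteration $r$, for each observed topic, word and document, $z_p$, $w_b$ and $d_a$, compute:

$$P^r(z_p|d_a,w_b) = \frac{P^{r-1}(w_b|z_p)\,P^{r-1}(z_p|d_a)}{\sum_{p'} P^{r-1}(w_b|z_{p'})\,P^{r-1}(z_{p'}|d_a)}, \tag{6}$$

where $P^{r-1}(z_p|d_a)$ and $P^{r-1}(w_b|z_p)$ are derived from iteration $r-1$.

M-step: At iteration $r$, for each document-topic and topic-word pair, compute $P^r(z_p|d_a)$ and $P^r(w_b|z_p)$ based on the following updating formulas:

$$P^r(w_b|z_p) = \frac{\sum_a n(d_a,w_b)\,P^r(z_p|d_a,w_b)}{\sum_{b'}\sum_a n(d_a,w_{b'})\,P^r(z_p|d_a,w_{b'})} \tag{7}$$

$$P^r(z_p|d_a) = \frac{\sum_b n(d_a,w_b)\,P^r(z_p|d_a,w_b)}{\sum_b n(d_a,w_b)} \tag{8}$$

A more detailed derivation of (7) and (8) can be found in [106,107,110].

The above E-step and M-step repeat until the maximum iteration $R$ is reached, or until the log-likelihood function $L$ in (1) meets the criterion $L^r - L^{r-1} < \varepsilon$, where $\varepsilon$ is set as the convergence goal of the model.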
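The initialization, E-step and M-step above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `plsa_em`, the dense list-of-lists count matrix, the seeded random initialization and the absolute-difference stopping test are all choices made here for the sketch. The recorded log-likelihood is the objective in (2), evaluated with the parameters from the previous iteration.

```python
import math
import random


def plsa_em(counts, n_topics, max_iter=500, tol=1e-5, seed=0):
    """Fit PLSA by EM; counts[a][b] is n(d_a, w_b).

    Returns (p_z_d, p_w_z, history), where p_z_d[a][p] = P(z_p | d_a),
    p_w_z[p][b] = P(w_b | z_p) and history holds the log-likelihood
    objective of (2) seen at each iteration.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def random_dist(size):
        # Random positive values normalised to sum to one.
        values = [rng.random() + 1e-3 for _ in range(size)]
        total = sum(values)
        return [v / total for v in values]

    # Initialization: one distribution per document and per topic.
    p_z_d = [random_dist(n_topics) for _ in range(n_docs)]
    p_w_z = [random_dist(n_words) for _ in range(n_topics)]
    history = []
    for _ in range(max_iter):
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        log_lik = 0.0
        for a in range(n_docs):
            for b in range(n_words):
                n_ab = counts[a][b]
                if n_ab == 0:
                    continue
                # E-step: posterior P(z_p | d_a, w_b), as in (6).
                joint = [p_w_z[p][b] * p_z_d[a][p] for p in range(n_topics)]
                norm = sum(joint)
                if norm == 0.0:
                    continue
                log_lik += n_ab * math.log(norm)
                for p in range(n_topics):
                    expected = n_ab * joint[p] / norm
                    acc_w_z[p][b] += expected  # numerator of (7)
                    acc_z_d[a][p] += expected  # numerator of (8)
        # M-step: normalise the expected counts.
        for p in range(n_topics):
            total = sum(acc_w_z[p])
            if total > 0.0:
                p_w_z[p] = [v / total for v in acc_w_z[p]]
        for a in range(n_docs):
            total = sum(acc_z_d[a])
            if total > 0.0:
                p_z_d[a] = [v / total for v in acc_z_d[a]]
        converged = bool(history) and abs(log_lik - history[-1]) < tol
        history.append(log_lik)
        if converged:
            break
    return p_z_d, p_w_z, history


# Tiny made-up count matrix: 4 documents, 4 words, 2 topics.
toy_counts = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1], [0, 0, 1, 2]]
toy_p_z_d, toy_p_w_z, toy_history = plsa_em(toy_counts, n_topics=2)
```

Accumulating the expected counts inside the E-step loop avoids storing the full posterior for every document-word-topic triple, which keeps the sketch short; a sparse representation of the counts would be the natural next step for real collections.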

The output from this stage after EM learning, which is used for building the VSM model, is a latent topic-document (LTD) matrix $M_d$, in which each document $d_l$ has a vector representation $H_l$ that is mapped from the indexed term space to the latent topic space, $H_l = [h_{1,l}, h_{2,l}, \ldots, h_{G,l}]$, with $h_{i,l} = P^R(z_i|d_l)$, $i = 1, 2, \ldots, G$, where $R$ denotes the maximum number of iterations EM went through, and $G$ denotes the number of topics generated. As a result, we have $M_d = [H_1^T, H_2^T, \ldots, H_N^T]$.

Generate topic-document vector for previously unseen document

Although PLSA is originally designed for unsupervised learning, it can be extended to previously unseen testing documents. For a testing document $d_u$, we run through the same EM algorithm to generate the conditional probability of each latent topic given $d_u$. However, during parameter estimation, all other parameters are kept fixed except $P(z|d_u)$. In the initialization step, only $P^0(z|d_u)$ is assigned random values between 0 and 1, with the constraint $\sum_z P^0(z|d_u) = 1$. In the E-step, based on $P^{r-1}(z|d_u)$, the posterior probability of $z$, $P(z|d_u,w)$, is computed. In the M-step, only $P^r(z|d_u)$ is calculated, by equation (8). Therefore, a vector representation $H_u$ is generated for $d_u$, with the same dimension as $H_l$.

A VSM augmented with semi-supervised PLSA topic modeling

The PLSA algorithm introduced in the above section generates latent topics by exploring the co-occurrence relationship of words in the document collection under a probabilistic framework, in order to discover the underlying semantic structure. However, it is originally designed for unsupervised learning, since it assumes that no prior knowledge of the text is available. Therefore, documents in the same category might have different latent topic distributions due to different word occurrences. In text categorization applications, we usually have some information about the inter-connectivity between documents, such as category labels, citation links and references, web page links and so forth [108,111]. As a result, a mixed probability model that couples the conditional probabilities for both words and inter-connectivity between documents could be extremely useful, in terms of providing more meaningful features and a better understanding of the text. The most well-known model that incorporates such information is proposed by Hofmann [108], which presents a joint probabilistic model of document content and connectivity. However, there are several issues that need to be solved. First of all, in the task of text categorization, usually only training documents have inter-connectivity available, so that model is not able to model previously unseen documents. Secondly, the inter-connectivity variable, denoted as $c$, increases the dimension of the parameters that need to be estimated, so the efficiency of the system is decreased. Last but not least, the connection between documents should also be weighted, instead of simply using binary values (1 as connected, 0 as not connected).

In this dissertation, we propose a semi-supervised PLSA algorithm that addresses the above issues, while incorporating the relationship between documents derived from both category labels and ontology networks. Details are discussed both in this section and later in this chapter.

Learning semi-supervised PLSA model from training documents

For a given latent topic, the probability of document connectivity is interpreted as the document's authority on that topic [108]. By introducing a joint probability model for document

content and connectivity, as well as a hyper-weight $\alpha$ that balances their effect, the semi-supervised PLSA model is presented in Figure 19. Similar to the PLSA algorithm, the generative process of each observed document-word pair and connected document-document pair in the text corpus, T, is as follows:

1. Select a document $d$ from T based on $P(d)$.
2. Pick a topic $z$ according to $P(z|d)$.
3. Given $z$, generate a word $w$ based on $P(w|z)$.
4. Given $z$, generate a document $d'$ based on $P(d'|z)$. This represents the probability of observing $d'$ that is connected with $d$, given latent topic $z$.

Figure 19. Graphical model representation of semi-supervised PLSA

The variables $P(w|z)$, $P(z|d)$ and $P(d'|z)$ are what we want to estimate. As a result, we come up with the following joint log-likelihood function:

$$L_t = \sum_{d,w} n(d,w)\log P(d,w), \qquad L_c = \sum_{d,d'} l(d,d')\log P(d,d'), \qquad L = \alpha L_t + (1-\alpha)L_c \tag{9}$$

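where $l(d,d')$ indicates whether documents $d$ and $d'$ are connected ($l(d,d') = 1$) or not ($l(d,d') = 0$). For now, to simplify our problem, $l(d,d')$ is a binary value, and whether $d$ is connected to $d'$ is based on whether they both fall into the same category. $l(d,d')$ will be further updated by the word connection between documents later in this chapter. Based on (9), we want to find $P(w|z)$, $P(z|d)$ and $P(d'|z)$ that maximize the log-likelihood function $L$. With $\alpha$ the hyper-weight in (9), we then have:

$$\arg\max L = \arg\max\left[\alpha L_t + (1-\alpha)L_c\right] = \arg\max\left[\alpha\sum_{d,w} n(d,w)\log\sum_z P(w|z)P(z|d) + (1-\alpha)\sum_{d,d'} l(d,d')\log\sum_z P(d'|z)P(z|d)\right] \tag{10}$$

Similarly, we introduce $A(z)$ and $A_c(z)$ as two probability distributions of $z$, so that

$$\arg\max L = \arg\max\left[\alpha\sum_{d,w} n(d,w)\log\sum_z A(z)\frac{P(w|z)P(z|d)}{A(z)} + (1-\alpha)\sum_{d,d'} l(d,d')\log\sum_z A_c(z)\frac{P(d'|z)P(z|d)}{A_c(z)}\right] \tag{11}$$

Based on the law of the unconscious statistician, equation (11) yields:

$$(11) = \arg\max\left[\alpha\sum_{d,w} n(d,w)\log E_A\!\left[\frac{P(w|z)P(z|d)}{A(z)}\right] + (1-\alpha)\sum_{d,d'} l(d,d')\log E_{A_c}\!\left[\frac{P(d'|z)P(z|d)}{A_c(z)}\right]\right] \tag{12}$$

Using Jensen's inequality, we find a lower bound function for equation (12), and when $\frac{P(w|z)P(z|d)}{A(z)} = C$ and $\frac{P(d'|z)P(z|d)}{A_c(z)} = C'$, we have: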

104 92 ' log ', 1 log, ag max ' log ', 1 log, ag max 12 ',, ',, c c c A A l A A n A E l A E n 13 Since we know ha A 1, c A 1, we have:, A, 14 ', ' ' A c. Hence, A and A c ae poseio pobabiliies of when we maximie 13, given he obsevaion of each documen-wod pai and conneced documen-documen pai, especively. Howeve, fom he above equaion 14 and 15, we could see he pobabiliy of and ' needs o be esimaed sepaaely fo each documen-opic pai, simila as he appoach in [108]. I will be much easie if we could make some ansfomaion so ha we could esimae he same paamee insead. Based on he Bayes Theoem, we have: c A A ' ' ',,. 15

105 This poblem is hen ansfeed ino esimaing, and, which significanly educes he sie of paamees. Suppose he sie of T is N and he numbe of opics geneaed is G, we now only need o esimae G addiional paamees fom insead of N G paamees fom. Simila o he LSA, fo semi-supevised LSA, he maximiaion likelihood esimaion in 9 can hus be solved using EM algoihm. uing EM algoihm leaning, in E-sep, based on he cuen esimaed, and, he poseio pobabiliy of, and, ' is compued fo each documen-wod pai and conneced documen-documen pai a each ieaion. In M-Sep,, and ae updaed by maximiing 13, which will be used in he E-sep of nex ieaion unil convegence. Consideing ha 1 1,, he final condiional pobabiliy fo each documen-opic pai can be calculaed using:, which means ha could be consideed as a nomaliaion consan fo. Compuaional seps of EM algoihm fo semi-supevised LSA Iniialiaion: efine he maximum numbe of ieaions R, and numbe of opics G o be geneaed. Fo each documen-opic and opic-wod pai, assign andom values o and 0 93

106 94 0 beween 0 and 1, wih he consains 1 0, and 1 0. Fo each opic, iniialie G 1 0, which evenly disibue he pobabiliy of each opic a he beginning. E-sep: A ieaion, fo each obseved opic, wod and documen, p, b, a, and m ha is conneced o a, compue: a b p p a p b b a p p A, , 16 m a p p m p a m a p p c A, , 17 whee 1 p a, 1 p m, 1 p b and 1 p ae deived fom ieaion -1. M-sep: A ieaion, fo each documen-opic and opic-wod pai, we wan o calculae p a, p b and p in ode o maximie 13. As a esul, a ieaion, ',, ',, log log ' log ', ', 1 log log log,, ag max ', ' 'log, ', 1, log,, ag max 13 l n l n 18 Le 18, since we know ha 1, 1 and 1, he above opimiaion poblem could be solved using Lagange muliplies [107], such ha:

107 Take deivaive fo 19 wih espec o p b, p a and p leads o he following saionay equaions:,, b p b p b n p 20 ' ', ', 1 2,, a p a a p a p a l n p 21 ', ', ', 1 p p l 22 Noe hee, a and ' ae acually epesening he same paamee, consideing hey ae conneced wih each ohe and exchangeable. Theefoe, in 21 hey ae meged ogehe, so ha 1 is muliplied by 2. By summing up 20 by, summing up 21 by, and summing up 22 by, we ae able o solve he Lagange muliplies, and, and finally we ge he updaing equaions fo p B, p a and p : p b p b p b n n,,,,, 23 ',, ' ', ', 1 2,, ', ', 1 2,, p a p a p a a p a p a l n l n 24

108 p, n, p,, 2 1 n, 2 1, ', ' l, ' l, ' p, ' 25 The above E-sep and M-sep keep ieaing unil he maximum ieaion R, o he log-likelihood funcion L in 9 me he cieion ha L L 1, whee is se as he convegence goal of he model. The final condiional pobabiliy fo each opic given l T can be l calculaed using: l l l. l Same as LSA algoihm, he oupu fom semi-supevised LSA afe EM leaning is a semisupevised opic-documen SST maix M ssd, in which each documen has a veco epesenaion H l ha is mapped fom indexed em space o laen opic space, H,,..., ], and l, i 1,2,... G, whee R denoes he maximum l [ 1, l 2, l G, l i, R i l numbe of ieaions EM wen hough, and G denoes he numbe of opics geneaed. As a esul, we have: T T T M H, H,..., H ]. ssd [ 1 2 N Geneae semi-supevised opic-documen veco fo peviously unseen documen In ode o exend semi-supevised LSA o peviously unseen esing documens, simila o LSA, fo a esing documen u, we un hough he EM algoihm o geneae he condiional pobabiliy fo u given laen opic. uing paamee esimaion, all ohe paamees ae kep 96

109 fixed, excep u. Howeve, fo a unseen documen u, u canno be iniialied diecly. The following seps explain how EM algoihm is applied o u : In he iniialiaion sep, since u u u u, we assign andom value o beween 0 and 1, wih he consain 1. Then 0 u 0 u u is calculaed using ha has aleady been esimaed duing aining, based 0 on he following fomula: 1 0 u 1 0 u 0 u. In E-sep, a ieaion, based on 1 u, he poseio pobabiliies,, and, ' ae compued, whee ' T, and ' is conneced o u In M-Sep, only is calculaed by equaion 24. u The final condiional pobabiliy u fo each opic given u can be calculaed u. u using: u u u u u. Theefoe, a veco epesenaion H u is geneaed fo u, wih he same dimension as I is obvious ha hee is a poblem abou how o find ou wha H l. ' in T is conneced o he esing documen, and how. This is done by using wod semanic infomaion exaced fom onology newok, which will be discussed in deail in secion As a esul, he sysem famewok of using semi-supevised LSA is shown in Fig. 20, whee he VSM augmened wih 97

PLSA topic modeling also takes the CD matrix generated from VSM augmented with WordNet ontology.

Figure 20. Proposed text categorization model framework

The above discussion in section 4.1 mainly focuses on how to generate latent semantic features from text documents, which are incorporated into the conventional VSM model. The connectivity between documents, together with the relationship between words, allows us to generate a mixed joint probabilistic model for a text collection that provides a solid foundation for an accurate and meaningful text representation.

4.2 A VSM augmented with WordNet ontology

Word ontology networks, as discussed in section 2.3, provide semantic word relationships that can be utilized to facilitate text mining applications. This section presents the details of building a text categorization model using the WordNet ontology network, in terms of generating an augmented TD matrix and a concept level CD matrix.

An augmented TD matrix generated using WordNet

In our proposed text categorization model, WordNet is used in two ways, derived from and modified based on the basic approaches introduced in [46]: the Add and Replace rules. The Concept only rule is not used, considering that it loses single-term information and did not achieve as good performance as the other two rules in text clustering tasks, as reported in [46]. This answers the question raised earlier about which rules should be selected to apply to VSM features.

For each indexed term $t_i$ generated by the VSM model, we first use POS tagging, such as the Stanford POS tagger [112], to identify its lexical category. After that, it is fed into the WordNet ontology to find its list of synonyms, $S_j$. Note that word sense disambiguation (WSD) can be applied to obtain more accurate synonym generation; however, it is beyond our research scope, and it is not our intention to find the most appropriate WSD model. As a result, we find the synset $S_j$ based on the first meaning of $t_i$, to simplify our problem.

In some applications, POS tagging may not be very reliable, e.g., when text documents are noisy and lack grammatical structure and sentence boundaries [5]. In those cases, we generate the synset for $t_i$ considering only one word class from Noun, Verb or Adjective, and choose the best one based on system performance evaluation. Through our experiments we found that Noun synsets always have the best accuracy, which will be presented in Chapter 5. The above discussion answers the question raised earlier about determining the scope of the synset for a word with multiple word categories.

Although the major relationship in WordNet is synonymy, we also find that most synsets are connected to other synsets via a number of semantic relations. These relations vary based on the type of word. For Noun synsets, the relations mainly include hypernym/hyponym (word A is a kind of word B or vice versa, e.g., dog vs. canine) and meronym/holonym (word A is a part of word B or vice versa, e.g., window vs. building), which are our major focus in this dissertation. For Verb or Adjective synsets, we only consider the synonymy relation. This solves the issue of making full use of word relationships other than synonymy, as mentioned earlier.

The ultimate goal of finding the related synsets for a given term, including synonymy, hypernym/hyponym and meronym/holonym, is to find all terms within these synsets, and to use this term relationship information to augment the VSM model. Fig. 21 illustrates an example of generating the list of related words, L_syn, in WordNet for an indexed term $t_i$, especially when $t_i$ is a noun. Here, $t_{x,y}$ denotes the $y$-th term included in the $x$-th synset. Starting from $S_j$, denoted as root level 0, we find its hypernym/hyponym and meronym/holonym synsets, extract all unique terms included in these synsets, and add them into L_syn. The next level starts from synsets $S_a$, $S_b$, $S_c$ and $S_d$, and their hypernym/hyponym and meronym/holonym synsets are extracted respectively. This procedure continues to explore the entire graph of $S_j$ until there are no synsets related to the current synset. All unique terms included in the list of related synsets generated for $S_j$ are then added into L_syn.

Figure 21. Example of generating related words in WordNet

In order to build a hierarchical semantic relationship between synsets, we assign a weight to each edge in the graph generated for $S_j$. The basic idea is that terms found at different levels of synsets should be assigned different semantic weights: the deeper the synset level is, the lower the weight we expect. One example is shown in Fig. 22. Two coefficients, $\gamma \in (0,1]$ and $\delta \in (0,0.5]$, are defined to represent the weight of the edge for hypernym/hyponym and meronym/holonym relationships, respectively. Here, considering that the meronym/holonym relationship is less significant than hypernym/hyponym in terms of semantic similarity (e.g., a document that talks about window might not have any relationship with a document that talks about building), we keep $\delta$ smaller than $\gamma$ throughout our experiments.

After assigning a weight to each edge, we then assign a weight to each synset that is related to $S_j$, in order to reflect the weight decrease along the path from $S_j$ to its related synsets. Starting from the root synset $S_j$, its weight, denoted as $\omega(S_j)$, equals 1. After that, the weight $\omega(S_x)$ for any synset $S_x$ that is related to $S_j$ is calculated by multiplying the weights of all the edges along the shortest path from $S_x$ to $S_j$. If multiple paths are found, then the maximum value is selected for $\omega(S_x)$. Fig. 22 gives an example, showing the weights obtained for synsets $S_f$ and $S_e$.

Figure 22. Example of weighting edges in the tree structure generated for synset

Generate concept-document CD matrix (add rule)

Under this rule, a CD matrix $M_c$ is generated using WordNet by introducing concept level features, which represent the $V$ related groups of terms generated from the term list T_L of document collection T, denoted as $L\_syn_i$, $i = 1, 2, \ldots, V$. Mathematically, for a document $d_l \in T$, its concept vector representation $Q_l$ is defined as follows:

$$Q_l = [q_{1,l}, q_{2,l}, \ldots, q_{V,l}], \qquad q_{i,l} = \sum_{t \in L\_syn_i,\; t \in S_x} w_{t,l}\,\omega(S_x), \quad i = 1, 2, \ldots, V, \tag{26}$$

where $V$ represents the total number of synsets generated from the term list T_L, and $q_{i,l}$ denotes the weight of concept $L\_syn_i$, which is calculated by first multiplying the weighted term frequency value $w_{t,l}$ (e.g., tf-idf) of each term $t$ in $L\_syn_i$ by the weight of the synset $S_x$ that includes $t$, and then summing the products. For example, in Fig. 23, assume the concept $L\_syn_1$ generated from the term car contains the following terms: car, motorcar, motorbus, bus, minibus, window and quarterlight. Then, for a document $d_l \in T$, $q_{1,l}$ is the sum of $w_{car,l}$, $w_{motorcar,l}$, $w_{bus,l}$, $w_{motorbus,l}$, $w_{window,l}$, $w_{quarterlight,l}$ and $w_{minibus,l}$, each multiplied by the weight of the synset that contains the term.
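The synset weighting and the concept feature in (26) can be sketched together as follows. This is only an illustration: the function names, the coefficient values 0.8 and 0.4, the small directed graph around the car synset and the document weights are all hypothetical choices made for the sketch, not values taken from Fig. 22 or Fig. 23. The weight of a synset is taken as the product of the edge coefficients along a shortest path from the root, with the maximum product chosen when several shortest paths exist.

```python
def synset_weights(root, edges):
    """Weight every synset reachable from root.

    edges maps a synset id to a list of (neighbour id, edge coefficient),
    listed in the direction away from the root. The root gets weight 1.
    """
    weights = {root: 1.0}
    frontier = [root]
    while frontier:
        next_level = {}
        for synset in frontier:
            for neighbour, coefficient in edges.get(synset, []):
                if neighbour in weights:
                    continue  # already reached through a shorter path
                candidate = weights[synset] * coefficient
                # Several equally short paths: keep the largest product.
                next_level[neighbour] = max(next_level.get(neighbour, 0.0), candidate)
        weights.update(next_level)
        frontier = list(next_level)
    return weights


def concept_feature(doc_weights, synset_terms, weights):
    """q_{i,l}: sum of w_{t,l} times the weight of the synset containing t."""
    return sum(
        doc_weights.get(term, 0.0) * weights[synset]
        for synset, terms in synset_terms.items()
        for term in terms
    )


# Hypothetical coefficients and graph, chosen only for illustration.
HYPER, MERO = 0.8, 0.4
edges = {
    "car": [("bus", HYPER), ("window", MERO)],
    "bus": [("minibus", HYPER)],
    "window": [("quarterlight", HYPER)],
}
synset_terms = {
    "car": ["car", "motorcar"],
    "bus": ["bus", "motorbus"],
    "minibus": ["minibus"],
    "window": ["window"],
    "quarterlight": ["quarterlight"],
}
weights = synset_weights("car", edges)
# Made-up weighted term frequencies w_{t,l} for one document.
doc_weights = {"car": 2.0, "minibus": 1.0, "window": 0.5}
q_1 = concept_feature(doc_weights, synset_terms, weights)
```

Processing the graph level by level means a synset is fixed at the first level where it is reached, which matches the shortest-path wording above; a plain maximum-product search over all paths would be a different rule.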

Figure 23. Example of concept generation in add rule

This procedure makes full use of synonymy, hypernym/hyponym and meronym/holonym relations to assign an appropriate weight to each concept feature used in the augmented VSM model. It answers the question raised earlier about what value should be assigned to the concept features added to the TD matrix. Thus, for the output of this stage, we have $M_c = [Q_1^T, Q_2^T, \ldots, Q_N^T]$. Similarly, for a previously unseen document $d_u$, we generate the concept vector representation $Q_u$, $Q_u = [q_{1,u}, q_{2,u}, \ldots, q_{V,u}]$, for testing purposes.

Generate augmented TD matrix (modified replace rule)

Under this rule, the term-document matrix $M_0$ introduced earlier is replaced by a new term-document matrix generated using WordNet, in such a way that for a term $t_i$ having synset $S_i$, its weight in document $d_l$ is updated using the following equation:

$$w'_{i,l} = \max_{t \in S_i,\; t \in T\_L} w_{t,l}.$$

The above equation ensures that semantically similar terms share the same weighting value, so that they are considered equally important. For example, if term $t_x$ = entire appears in document A and $t_y$ = total appears in document B, and suppose $S_x = S_y = \{t_x, t_y\}$, then we will have $w'_{x,A} = w'_{y,A}$ and $w'_{x,B} = w'_{y,B}$. The output from this stage is a TD matrix $M_1$ that has the same dimension as $M_0$, where $M_1 = [W_1'^T, W_2'^T, \ldots, W_N'^T]$, and $W'_l$ denotes the vector representation for document $d_l$ in T, $W'_l = [w'_{1,l}, w'_{2,l}, \ldots, w'_{K,l}]$.

Generate document-document connection for semi-supervised PLSA using WordNet

We mentioned earlier that when applying semi-supervised PLSA to a previously unseen document $d_u$, there is a problem of how to find which documents in T are connected to the testing document, and how. This leads to the problem of determining $l(d_u,d)$, as shown in the log-likelihood function in (9), which indicates whether $d_u$ and $d$ are connected or not. From the perspective of the joint probability of all observed document pairs that are connected with each other, we propose the following approach of generating $l(d_u,d)$ for the pair of $d_u$ and each $d \in T$.

Suppose we have $V$ related concepts generated from the term list T_L of document collection T, denoted as $L\_syn_i$, $i = 1, 2, \ldots, V$:

Extract the weighted frequency based vector representations for $d_u$ and $d$, denoted as $W_u$ and $W_d$, as discussed in section 4.1.1, where $W_d \in M_0$.

Generate a sub-list of T_L, denoted as T_L_sub, that has $O$ terms, where each term $t_o \in T\_L\_sub$, $o = 1, 2, \ldots, O$, does not belong to any of the $V$ concepts.

Generate the concept vector representations for $d_u$ and $d$, denoted as $Q_u$ and $Q_d$, respectively, based on equation (26), where $Q_d \in M_c$.

The connection value $l(d_u,d)$ between $d_u$ and $d$ is thus calculated as follows:

$$l(d_u,d) = \sum_{j=1}^{V}\min\big(Q_u(j), Q_d(j)\big) + \sum_{k=1}^{O}\min\big(W_u(k), W_d(k)\big),$$

where $Q(j)$ denotes the $j$-th concept in vector $Q$, and $W(k)$ denotes the $k$-th weighted term frequency in T_L_sub. The basic idea is that $l(d_u,d)$ is the weighted frequency with which we observe both $d_u$ and $d$ having concept occurrences or non-concept term occurrences, which represents the connection value between $d_u$ and $d$.

After generating $l(d_u,d)$ for the pair of $d_u$ and $d$, it follows that $l(d,d')$ generated for the training document collection T should also be updated. Instead of using binary values $l(d,d') = 1$ or $l(d,d') = 0$, for every pair of $d$ and $d'$ with $d, d' \in T$, if $l(d,d') = 1$, then $l(d,d')$ is updated using the same approach discussed above. With the procedures discussed above, we are able to apply the semi-supervised PLSA model to both the training set T and an unseen document $d_u$, by generating the semi-supervised topic-document (SSTD) matrix $M_{ssd}$ and a vector representation $H_u$, respectively.
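The connection value described above is a sum of element-wise minima over the concept vectors and over the non-concept term weights. A minimal sketch follows; the function name `connection_value` and the plain-list representation of the vectors are assumptions made for this illustration, and the inputs are expected to be already restricted to the $V$ concepts and the $O$ terms of T_L_sub.

```python
def connection_value(q_u, q_d, w_u_sub, w_d_sub):
    """l(d_u, d) as the sum of minima over concepts and non-concept terms.

    q_u, q_d: concept vectors of the two documents (length V).
    w_u_sub, w_d_sub: weighted term frequencies restricted to the terms
    that belong to no concept (length O).
    """
    if len(q_u) != len(q_d) or len(w_u_sub) != len(w_d_sub):
        raise ValueError("paired vectors must have the same length")
    concept_part = sum(min(a, b) for a, b in zip(q_u, q_d))
    term_part = sum(min(a, b) for a, b in zip(w_u_sub, w_d_sub))
    return concept_part + term_part
```

Because each minimum is zero whenever either document lacks the concept or term, two documents with no shared concepts and no shared non-concept terms get a connection value of 0, which recovers the "not connected" case of the binary scheme.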

4.3 A step-by-step example of proposed text representation model generation procedure

To provide a more comprehensive illustration of how the procedures in sections 4.1 and 4.2 work, we designed a toy dataset derived from [113] and walk through the whole procedure of VSM model generation using PLSA, semi-supervised PLSA and the WordNet ontology. This dataset is named Human Computer Interface and Graph Theory (HCI_GT). HCI_GT contains 9 documents as training data, separated into two categories, and one unseen document $d_u$ for testing, defined as follows (the indexed terms are the column headings of Table 14):

Category A: Human Computer Interface (HCI)
A1: Human machine Interface for ABC computer applications
A2: A survey of user opinion of computer system response time
A3: The EPS user interface management system
A4: System and human system engineering testing of EPS
A5: Relation of user perceived response time to error management

Category B: Graph Theory (GT)
B1: System of random, binary, ordered tree
B2: The intersection graph of paths in tree
B3: Graph minors IV: Widths of tree and well-quasi-ordering
B4: Graph minors: A study

$d_u$: A survey of decision tree system

4.3.1 Build VSM model with CE_W global weighting scheme

Hence, in this training set T, $N = 9$ and $K = 13$. As discussed in section 4.1.1, the vector representations generated for each document in T and for the testing document $d_u$, based on term frequency (tf) with idf global weighting and tf with CE_W global weighting, are shown in Table 14 and Table 15.

Table 14 Tf-idf representation for HCI_GT
Columns (indexed terms): human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors, study
Rows (documents): A1, A2, A3, A4, A5, B1, B2, B3, B4, $d_u$

Table 15 Tf-CE_W representation for HCI_GT
Columns (indexed terms): human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors, study
Rows (documents): A1, A2, A3, A4, A5, B1, B2, B3, B4, $d_u$
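The category-entropy global weights behind the Tf-CE_W representation can be sketched as follows, following the CE_W steps of section 4.1.1. This is an illustration under stated assumptions: the function name is hypothetical, the natural logarithm is used (the base cancels in the ratio), at least two categories are required, and the term sets per document are one reading of the HCI_GT titles against the 13 indexed terms, with "tree" mapped to the indexed term "trees".

```python
import math


def category_entropy_weights(docs, labels):
    """CE_W global weight for every term.

    docs: list of sets of indexed terms, one set per training document.
    labels: category label of each document (at least two categories).
    """
    categories = sorted(set(labels))
    docs_per_category = {c: labels.count(c) for c in categories}
    weights = {}
    for term in set().union(*docs):
        # p_ij: share of the documents in category j that contain the term.
        p = [
            sum(1 for d, lab in zip(docs, labels) if lab == c and term in d)
            / docs_per_category[c]
            for c in categories
        ]
        total = sum(p)
        p_hat = [v / total for v in p]  # normalised over categories
        entropy = -sum(v * math.log(v) for v in p_hat if v > 0.0)
        weights[term] = 1.0 - entropy / math.log(len(categories))
    return weights


# Assumed term sets for the nine HCI_GT training documents.
hci_gt_docs = [
    {"human", "interface", "computer"},                              # A1
    {"survey", "user", "computer", "system", "response", "time"},    # A2
    {"EPS", "user", "interface", "system"},                          # A3
    {"system", "human", "EPS"},                                      # A4
    {"user", "response", "time"},                                    # A5
    {"system", "trees"},                                             # B1
    {"graph", "trees"},                                              # B2
    {"graph", "minors", "trees"},                                    # B3
    {"graph", "minors", "study"},                                    # B4
]
hci_gt_labels = ["A"] * 5 + ["B"] * 4
ce_w = category_entropy_weights(hci_gt_docs, hci_gt_labels)
```

Under these assumed term sets, only "system" occurs in both categories, so it is the only term whose entropy is non-zero and whose weight drops below 1.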

Compared with idf global weighting, when applying CE_W, the word system is assigned a lower weight because it occurs in both category A and category B and is therefore not as important as other terms for differentiating documents in the A and B categories. All other terms are assigned a global weight equal to 1.

Build VSM model with PLSA topic modeling

Following the EM algorithm for PLSA discussed earlier, we first define the maximum number of iterations $R = 500$ and $\varepsilon$ = 1E-5. We found in our empirical case studies that, generally speaking, the number of latent topics should be defined as around $\sqrt{K}$, where $K$ is the number of indexed terms. We will further discuss the effect of selecting different numbers of topics in section 5.3. Here, because HCI_GT has 13 indexed terms, it is reasonable to define $G = 3$. The three latent topics are denoted as $z_1$, $z_2$ and $z_3$, respectively. The steps of building the LTD matrix for the training documents in T are described as follows:

At the initialization step, for each document-topic and topic-word pair, assign random values between 0 and 1 to $P^0(z|d)$ and $P^0(w|z)$, with the constraints $\sum_{p=1}^{3} P^0(z_p|d) = 1$ and $\sum_w P^0(w|z_p) = 1$, $p = 1, 2, 3$. As a result, we have the following two matrices, $M_{z|d}$ and $M_{w|z}$, as shown in Fig. 24:

Figure 24. Example of EM algorithm Initialization

At the E-step (expectation step), at iteration $r$, $P(z_1|d,w)$, $P(z_2|d,w)$ and $P(z_3|d,w)$ are calculated for each term-document pair using equation (6), which uses the $P(z|d)$ and $P(w|z)$ initialized before. Fig. 25 shows this E-step process in iteration 1.

Figure 25. Example of EM algorithm E-Step

At the M-step (maximization step), at iteration $r$, for each document-topic and topic-word pair, compute $P^r(z|d)$ and $P^r(w|z)$ based on the updating formulas in (7) and (8). This step uses the calculated conditional posterior probability, $P(z|d,w)$, to update the probabilities $P^{r-1}(z|d)$ and $P^{r-1}(w|z)$ into $P^r(z|d)$ and $P^r(w|z)$. Fig. 26 shows this M-step process in iteration 1, in $M_{z|d}$ and $M_{w|z}$.

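125 Figure 26. Example of EM algorithm M-step

The above E-step and M-step repeat until the maximum number of iterations R is reached, or until the log-likelihood function L in equation 1 meets the convergence criterion on its change between consecutive iterations. Here, the final estimate for each topic-word pair after the above training process is referred to as the trained topic-word probability. The final results of the document-topic and word-topic matrices after convergence, with sizes 9 × 3 and 13 × 3 respectively, are shown in Fig. 27.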
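The initialization, E-step, M-step and stopping test described above can be sketched in code. The following is a minimal pure-Python sketch, assuming the standard asymmetric PLSA parameterization with P(z|d) and P(w|z) and a relative-change stopping rule on the log-likelihood. The function names, the toy count matrix and the exact form of the stopping test are illustrative assumptions, not the dissertation's equations 1 and 6 to 8.

```python
import math
import random


def normalize(values):
    """Scale non-negative values so they sum to 1 (uniform if all are zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def log_likelihood(counts, p_z_d, p_w_z):
    """Sum over observed (d, w) of n(d, w) * log sum_z P(z|d) P(w|z)."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n > 0:
                mix = sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z)))
                total += n * math.log(mix)
    return total


def plsa_em(counts, n_topics, max_iter=500, tol=1e-5, seed=0):
    """Fit P(z|d) and P(w|z) by EM; returns both plus the log-likelihood trace."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Initialization: random values in [0, 1), normalized so each distribution sums to 1.
    p_z_d = [normalize([rng.random() for _ in range(n_topics)]) for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() for _ in range(n_words)]) for _ in range(n_topics)]
    history = [log_likelihood(counts, p_z_d, p_w_z)]
    for _ in range(max_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z | d, w) for this term-document pair.
                post = normalize([p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)])
                # M-step accumulation: expected counts per topic.
                for z in range(n_topics):
                    new_z_d[d][z] += n * post[z]
                    new_w_z[z][w] += n * post[z]
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
        history.append(log_likelihood(counts, p_z_d, p_w_z))
        prev, curr = history[-2], history[-1]
        # Assumed stopping rule: relative change in L below the threshold.
        if abs(curr - prev) <= tol * abs(prev):
            break
    return p_z_d, p_w_z, history


# Hypothetical 4-document, 4-term count matrix with two obvious groups.
toy_counts = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]
doc_topic, topic_word, trace = plsa_em(toy_counts, n_topics=2)
```

Each row of `doc_topic` and of `topic_word` sums to 1, and `trace` is non-decreasing, which is the property that the log-likelihood plots in this kind of experiment rely on.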

126 Figure 27. Example of EM algorithm result after convergence

In the document-topic matrix, latent topics 1 and 2 can be considered sub-categories of category A, and most category B documents contain only latent topic 3. The exception is document B1, which contains both the terms "system" and "tree", so topics 2 and 3 each have a probability of 0.5. In the word-topic matrix, latent topic 3 contains all terms that occur only in category B, while the terms "computer", "user" and "system" all have probability on latent topics 1 and 2. For the testing document u, we run through the EM algorithm to generate the conditional probability of each latent topic for u, using the following procedure with the same settings for R, G and the convergence threshold:

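127 At the initialization step, for each topic given u, assign a random value between 0 and 1 to the initial topic probability, with the constraint that the three values sum to 1.

At the E-step of each iteration, the conditional posterior probabilities of topics 1, 2 and 3 given u are calculated using equation 6 and the trained topic-word probabilities from the training process.

At the M-step of each iteration, compute the updated topic probabilities for u based on the updating formula in equation 8.

The above E-step and M-step repeat until the maximum number of iterations R is reached, or until the log-likelihood function L in equation 1 meets the convergence criterion on its change between consecutive iterations. Here, the estimate obtained for each topic-u pair at that point is the final estimate.

Therefore, for the testing document u, the final topic-document representation is a 1 × 3 vector V_u whose entries are the final probabilities of topics 1, 2 and 3 given u, as shown in Fig. 28.

Figure 28. EM algorithm result for testing document u

4.3.3 Build VSM model with WordNet ontology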

128 In this example dataset, HCI_GT, two terms are found in WordNet as synonyms: survey and study. These two terms occur in documents A2 and B4 and in the testing document u. No other hypernym/hyponym or meronym/holonym relationship is found in HCI_GT. As discussed earlier, based on the add rule, the CD matrix M_c generated for the training documents in T and the concept vector representation Q_u generated for the testing document u, using CE_W global weighting, therefore have only one concept, which contains survey and study, as shown in Table 16. Here M_c collects the concept vectors of the nine training documents A1, A2, ..., B4. Each concept vector has a single entry q_1, computed from the weights of survey and study in that document, and Q_u is defined in the same way for u.

Table 16. CD matrix generation for T and u
Document   Concept {survey, study}
A1         0
A2         1
A3         0
A4         0
A5         0
B1         0
B2         0
B3         0
B4         1
u          1

As discussed earlier, based on the replace rule, the vector representation generated for HCI_GT based on CE_W global weighting is updated using the following equation:

129 $w'_{\text{survey},d} = w'_{\text{study},d} = \max\left(w_{\text{survey},d},\ w_{\text{study},d}\right)$ for each document $d$.

The augmented TD vector for each document in HCI_GT is shown in Table 17, where both survey and study have a weight equal to 1 in documents A2, B4 and u:

Table 17. Tf-CE_W representation for T and u after using the WordNet replace rule
Columns: human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors, study
Rows: A1, A2, A3, A4, A5, B1, B2, B3, B4, u

4.3.4 Build VSM model with semi-supervised PLSA

Following the EM algorithm for semi-supervised PLSA discussed earlier, we first define the maximum number of iterations R = 500, the number of latent topics G = 3, and a convergence threshold of 1E-5. For simplicity, the additional parameter of the semi-supervised model is set to 0.9. The three latent topics are denoted as 1, 2 and 3, respectively. We also need to generate the connection matrix M_conn for each document-document pair in HCI_GT, where each entry denotes the connection value l between the two documents, as discussed earlier. The result is shown in Table 18:

130 Table 18. Connection matrix for HCI_GF
Columns: A1, A2, A3, A4, A5, B1, B2, B3, B4
Rows: A1, A2, A3, A4, A5, B1, B2, B3, B4, u

The steps for building the SSTD matrix for the training documents in T are as follows:

At the initialization step, for each document-topic and topic-word pair, assign random values between 0 and 1 to the initial probabilities, with the constraint that each distribution sums to 1. For each of the topics 1, 2 and 3, initialize the topic probability to 1/3. As a result, we have two matrices, as shown in Fig. 29.

At the E-step of each iteration, the conditional posterior probabilities of the three topics are calculated for each term-document pair using equation 16. The corresponding posterior probabilities are also calculated, using equation 17, for each document-document pair in T whose connection value l is nonzero. The E-step process in iteration 1 is shown in Fig. 30.

At the M-step of each iteration, for each document-topic and topic-word pair, compute the updated probabilities based on the updating formulas in equations 23 and 24, and compute the updated topic probabilities based on the updating formula in equation 25. This step uses the conditional posterior probabilities calculated in the E-step, for both term-document and document-document pairs, to update the estimates from the previous iteration. The M-step process in iteration 1 is shown in Fig. 31.

131 Figure 29. Example of EM algorithm initialization for semi-supervised PLSA

The above E-step and M-step repeat until the maximum number of iterations R is reached, or until the log-likelihood function L in equation 1 meets the convergence criterion on its change between consecutive iterations. Here, the final estimates for each topic-document pair, each topic-word pair and each of the latent topics 1, 2 and 3 after the above training process are referred to as the trained estimates.

We then generate the document-topic matrix, where each entry represents the final conditional probability of a latent topic given a document, computed from these trained estimates.

The final results of the document-topic and word-topic matrices after convergence, with sizes 9 × 3 and 13 × 3 respectively, are shown in Fig. 32.
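The conversion from the trained estimates to a topic-given-document probability can be illustrated with a short worked example. This is a sketch assuming the conversion is Bayes' rule applied to a document-given-topic estimate and a topic prior; the function name and the numbers are hypothetical and are not taken from Fig. 32.

```python
def topic_given_document(p_doc_given_topic, p_topic):
    """Bayes' rule: P(z|d) is proportional to P(d|z) * P(z), normalized over topics."""
    joint = [pd * pz for pd, pz in zip(p_doc_given_topic, p_topic)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical estimates for one document and three topics with a uniform prior.
posterior = topic_given_document([0.2, 0.1, 0.1], [1 / 3, 1 / 3, 1 / 3])
```

With a uniform topic prior the result is simply the document-given-topic values rescaled to sum to 1, so the first topic receives half of the probability mass in this example.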

132 Figure 30. Example of EM algorithm E-step for semi-supervised PLSA

133 Figure 31. Example of EM algorithm M-step for semi-supervised PLSA

134 Figure 32. Example of EM algorithm result for semi-supervised PLSA after convergence

In the document-topic matrix, compared with Fig. 27 for PLSA, we observe an increase in the probability of latent topic 1 given documents A1 and A3. This is because A1 is connected to A2 by the term "computer", and A3 is connected to A2 by the term "user". Both terms have probability of occurrence on topic 1, which contributes to the increase in the probability of topic 1 for A1 and A3. Also, the probability of latent topic 3 given document B1 increases, because B1 connects only to B2 and B3, through the term "tree", which occurs only in latent topic 3. For the testing document u, we run through the EM algorithm to generate the conditional probability of each latent topic for u, using the following procedure with the same settings for R, G and the convergence threshold:

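135 At the initialization step, for each topic given u, assign a random value between 0 and 1 to the initial topic probability, with the constraint that the three values sum to 1. The initial probability of u given each of the topics 1, 2 and 3 is then calculated from these values and the trained estimates.

At the E-step of each iteration, the conditional posterior probabilities of topics 1, 2 and 3 are calculated for each term given u using equation 16, which uses the trained estimates generated from the training process. The corresponding posterior probabilities are also calculated, using equation 17, for each pair of u and a training document in T whose connection value l with u is nonzero, again using the trained estimates generated from the training process.

At the M-step of each iteration, compute the updated estimate for u based on the updating formula in equation 24.

The above E-step and M-step repeat until the maximum number of iterations R is reached, or until the log-likelihood function L in equation 1 meets the convergence criterion on its change between consecutive iterations. Here, the estimate for u given each latent topic obtained at that point is the final estimate. The final conditional probability of each topic given u is then calculated by normalizing, over the three topics, the product of this final estimate and the trained topic probability.

Therefore, for the testing document u, the final topic-document representation is a 1 × 3 vector V_u whose entries are the final probabilities of topics 1, 2 and 3 given u, as shown in Fig. 33.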

136 Figure 33. Semi-supervised PLSA EM algorithm result for testing document u

Compared with Fig. 28 for PLSA, the probability of latent topic 3 increases, while the probability of latent topic 2 decreases, given the testing document u. Looking at the TD matrix generated for the training documents in Fig. 34 together with the connection matrix in Table 18, we see that u is connected, with high connection values, to B2, B3 and B4, which have a high probability on topic 3, and to A2, which has a high probability on topic 1. Also, although u is connected to A3 and A4, which have a high probability on topic 2, the term "system" has a low weight in the TD matrix, which contributes to the low values of l(u, A3) and l(u, A4). All of these factors contribute to the final estimate for u, in which the probability of topic 2 is much lower than the probabilities of topics 1 and 3.

137 Figure 34. TD generated for semi-supervised PLSA

Instead of considering only the word co-occurrence information used in PLSA, the semi-supervised PLSA approach also incorporates document connectivity information extracted from both the semantic information provided by WordNet and the document category labels, and thus provides more reasonable topic-document features than PLSA. An overview of the comparison between PLSA and semi-supervised PLSA is presented in Fig. 35.

138 Figure 35. Comparison between PLSA and semi-supervised PLSA

139 4.3.5 System evaluation based on document distance measure

In order to evaluate each procedure during VSM generation, we calculate the Euclidean distance between the vector representation of the testing document u and that of each of the 9 training documents in T, as shown in Tables 19 and 20.

Table 19. Euclidean distance between u and training documents based on different text representations - I
Distance columns: tf, tf-idf, tf-CE_W
Document   Content
A1         Human machine Interface for ABC computer applications
A2         A survey of user opinion of computer system response time
A3         The EPS user interface management system
A4         System and human system engineering testing of EPS
A5         Relation of user perceived response time to error management
B1         System of random, binary, ordered tree
B2         The intersection graph of paths in tree
B3         Graph minors IV: Widths of tree and well-quasi-ordering
B4         Graph minors: A study
u          A survey of decision tree system

From Table 19, the effect of the global weighting scheme is clearly significant. With only the tf representation, A2, A3 and A4 cannot be differentiated, and neither can A1, A5 and B4. With the tf-idf representation, A2, A3 and A4 are differentiated, and A2 is identified as the closest document to u because two terms, "survey" and "system", occur in both A2 and u, which is not a very good assignment. With the tf-CE_W representation, the closest document to u changes to B1 because of the low weight assigned to "system", which is more reasonable. However, A2 and A3, as well as A1, A5 and B4, still cannot be differentiated.

140 Table 20. Euclidean distance between u and training documents based on different text representations - II
Distance columns: tf-idf; tf-idf + PLSA; tf-CE_W + PLSA + WordNet; tf-CE_W + semi-supervised PLSA + WordNet
Document   Content
A1         Human machine Interface for ABC computer applications
A2         A survey of user opinion of computer system response time
A3         The EPS user interface management system
A4         System and human system engineering testing of EPS
A5         Relation of user perceived response time to error management
B1         System of random, binary, ordered tree
B2         The intersection graph of paths in tree
B3         Graph minors IV: Widths of tree and well-quasi-ordering
B4         Graph minors: A study
u          A survey of decision tree system

From Table 20, we can see the significant effect of PLSA and the WordNet ontology. With the tf-idf representation plus the three latent topic features generated by PLSA, the closest document to u changes to B1, similar to the effect of applying CE_W, because the two documents have similar topic probability distributions: both have probability on topics 2 and 3. Also, A2 and A3, as well as A1, A5 and B4, are differentiated from each other, based on the probabilities of the terms occurring in these documents on the three latent topics.

141 By adding the CD matrix and modifying the TD matrix using WordNet, A2 and B4 are ranked closer to u, due to the semantic similarity between "survey" and "study". After this stage, we have strong confidence that u has a higher probability of belonging to category B. From the final column of Table 20, we can see that semi-supervised PLSA has a similar effect to PLSA, showing that this approach is as robust as PLSA. Moreover, A1 and A3 are ranked closer to u than A5, because the connectivity among category A documents increases their probability on latent topic 1. The distance from u to A4 increases because of the decrease in the probability of topic 2 given u, and the distances from u to B2, B3 and B4 decrease because of the increase in the probability of topic 3 given u. Therefore, we may conclude that semi-supervised PLSA provides a more reasonable text representation by generating semi-supervised topic-document features based on the semantic relationships between words and the connectivity between documents.
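The distance-based comparison in this section can be reproduced for the raw tf representation directly from the document contents listed in Table 19. The sketch below is illustrative only: the term counts are hand-derived from those contents under the assumptions that "tree" is indexed under the term "trees" and that "system" counts twice in A4, and the helper names are hypothetical.

```python
import math

TERMS = ["human", "interface", "computer", "user", "system", "response", "time",
         "EPS", "survey", "trees", "graph", "minors", "study"]

# Term counts per document, derived by hand from the contents in Table 19.
TF = {
    "A1": {"human": 1, "interface": 1, "computer": 1},
    "A2": {"survey": 1, "user": 1, "computer": 1, "system": 1, "response": 1, "time": 1},
    "A3": {"EPS": 1, "user": 1, "interface": 1, "system": 1},
    "A4": {"system": 2, "human": 1, "EPS": 1},
    "A5": {"user": 1, "response": 1, "time": 1},
    "B1": {"system": 1, "trees": 1},
    "B2": {"graph": 1, "trees": 1},
    "B3": {"graph": 1, "minors": 1, "trees": 1},
    "B4": {"graph": 1, "minors": 1, "study": 1},
}
U = {"survey": 1, "trees": 1, "system": 1}


def to_vector(term_counts):
    """Expand a sparse term-count dict into a dense vector over TERMS."""
    return [term_counts.get(term, 0) for term in TERMS]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


u_vec = to_vector(U)
distances = {name: euclidean(u_vec, to_vector(tc)) for name, tc in TF.items()}
ranking = sorted(distances, key=distances.get)
```

Under these assumptions the raw tf vectors give equal distances for A2, A3 and A4, and equal distances for A1, A5 and B4, which is consistent with the ties noted above for the tf-only representation.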

142 4.4 Generate hybrid VSM model for classification

For the task of text categorization, the procedures discussed in sections 4.1 and 4.2 are combined to generate a hybrid VSM text representation. For the training set T, the process of VSM matrix generation and combination in our final system, using the WordNet ontology and semi-supervised PLSA, is shown in Fig. 36. Starting from the TD frequency matrix, our system generates the TD matrix M_0 with the global weighting scheme. M_0 is then used to generate the concept-document matrix M_c using the WordNet ontology, and M_0 is also replaced by M_1 based on the word relationships extracted from the WordNet ontology. The topic-document matrix M_Ld or M_ssd, using PLSA or semi-supervised PLSA modeling respectively, is then generated based on M_0 and M_c.

Note that the feature size of the final matrix is the sum of the number of original indexed terms, the number of concepts generated from T and the number of latent topics we defined. These features could be further selected using feature selection techniques such as Gini Index, Information Gain, Mutual Information, etc. [65,66]. However, to make the text representation features as inclusive as possible, we keep all of them. The hybrid text representation model is then used to train and evaluate classifiers, e.g., SVM, Neural Networks, Naïve Bayes classifier, etc.
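The combination of term, concept and topic features can be sketched for a single document as follows. This is a minimal illustration, not the system's implementation: the helper names and all weights are hypothetical, the replace rule follows the max-based update from section 4.3, and the concept weight is assumed here to be the largest weight among the concept's member terms.

```python
def apply_replace_rule(weights, synonym_group):
    """Give every term in the group the largest weight found among its members."""
    top = max(weights.get(term, 0.0) for term in synonym_group)
    updated = dict(weights)
    for term in synonym_group:
        updated[term] = top
    return updated


def concept_weight(weights, synonym_group):
    """Assumed add rule: one concept feature equal to the largest member weight."""
    return max(weights.get(term, 0.0) for term in synonym_group)


def hybrid_vector(weights, terms, synonym_groups, topic_probs):
    """Concatenate term features, concept features and topic features."""
    concept_part = [concept_weight(weights, group) for group in synonym_groups]
    replaced = weights
    for group in synonym_groups:
        replaced = apply_replace_rule(replaced, group)
    term_part = [replaced.get(term, 0.0) for term in terms]
    return term_part + concept_part + list(topic_probs)


# Hypothetical weighted terms and topic probabilities for one document.
features = hybrid_vector(
    weights={"system": 0.2, "survey": 1.0},
    terms=["system", "survey", "study"],
    synonym_groups=[("survey", "study")],
    topic_probs=[0.1, 0.2, 0.7],
)
```

The resulting vector has 3 + 1 + 3 = 7 entries, matching the rule that the feature size is the number of indexed terms plus the number of concepts plus the number of latent topics.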

143 Figure 36. VSM matrix generation and combination

144 CHAPTER 5. EMPIRICAL CASE STUDY AND EXPERIMENTAL RESULTS EVALUATION ON TEXT CATEGORIZATION

In this chapter, we present the experiments we conducted using our proposed text categorization framework in Fig. 18 and Fig. 20, together with an analysis of the classification results on several publicly available or domain-specific datasets. This extends the experiments on HCI_GT discussed in section 4.3 by investigating how the proposed text representation can help improve text categorization accuracy. The experiments are designed as comparisons of the proposed text representation with conventional text representation methods. We present the experimental results of tuning parameters for PLSA, semi-supervised PLSA and the WordNet ontology, and provide a detailed performance analysis of our proposed text representation approach.

145 5.1 Datasets

In this empirical study, we first use three publicly available and widely used datasets to evaluate our proposed system. These datasets are Reuters [115], the NIST Topic Detection and Tracking corpus TDT2 [114], and 20 newsgroups [116]. The Reuters corpus contains documents in 135 categories. After removing documents with multiple category labels, 8,293 documents in 65 categories remain. In TDT2, documents appearing in two or more categories were removed, and only the largest 30 categories were kept, leaving 9,394 documents in total. The 20 newsgroups dataset is a collection of 18,846 newsgroup documents, partitioned nearly evenly across 20 different newsgroups.

To evaluate the system performance on domain-specific datasets that have customized category definitions, such as [24,80], we also used a dataset named VR, which contains 600 vehicle diagnostic records. Documents containing descriptions that reveal systematic engineering or manufacturing failures are defined as of interest (Category-A), and all other documents belong to Category-B. The major challenge in this problem is that the documents of interest are not explicitly defined by either topics or general descriptions, as shown in the following examples:

Category-A document: pefom abs self oades found ea wheel speeds senso conneco cooded ino senso eplace senso and conneco oad ese ok clea code

146 Category-B document: oad oades acion conol lamp on eec oades code c1280 u415 om cm conac ho line check connecion a cm check mouning bols ok clea code

In all of the datasets discussed above, the preprocessing tasks described earlier are conducted and stop words are removed. Note that all words with an occurrence frequency lower than 5 are removed, except in the VR dataset. A TD matrix weighted by CE_W is then generated for each dataset. A TD matrix based on tf only is also generated for PLSA model learning, and a TD matrix weighted by idf is generated for evaluation purposes.

147 5.2 Experiment setup

Because the focus of this work is not on improving or comparing machine learning algorithms, we use SVM as our classification model throughout the experiments. SVM training is carried out with the LIBSVM package, developed by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University [117]. For each dataset, we perform 3-fold cross validation. In each fold, we choose 2/3 of the documents from each class as the training set and the remaining 1/3 as the testing set. We apply the Gaussian Radial Basis Function (RBF) kernel and tune the parameter gamma to 0.001, 0.001, 0.1 and 0.1 for Reuters, TDT2, 20 newsgroups and VR, respectively, based on the average testing accuracy over the 3 folds. All experiments are performed on a desktop with an Intel(R) Core i7 processor operating at 3.40 GHz and 16 GB of memory, running 64-bit Windows 7, JDK 7.0 + NetBeans 7.3.1, and Matlab 2009a.
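The per-class 2/3 versus 1/3 split can be sketched as a stratified 3-fold partition. This is a minimal pure-Python sketch under the assumption that the three folds are disjoint and that each fold serves once as the testing set; the function names and the seed are illustrative, and the SVM training itself (done with LIBSVM in the experiments) is not shown.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=3, seed=0):
    """Partition document indices into folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal the shuffled members of this class round-robin into the folds.
        for position, index in enumerate(members):
            folds[position % n_folds].append(index)
    return folds


def train_test_splits(labels, n_folds=3, seed=0):
    """Yield (train, test) index lists, using each fold once as the test set."""
    folds = stratified_folds(labels, n_folds, seed)
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# Hypothetical labels: six Category-A documents and three Category-B documents.
toy_labels = ["A"] * 6 + ["B"] * 3
splits = list(train_test_splits(toy_labels))
```

With these toy labels every testing fold holds two Category-A documents and one Category-B document, so each class contributes 1/3 of its documents to testing and 2/3 to training.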

148 5.3 Build VSM model with PLSA

Similar to the example discussed in section 4.3.2, in PLSA learning, for all datasets, we define the maximum number of iterations R = 500 for the training set and R = 200 for the testing set. The convergence goal is defined as 1E-5. An example of log-likelihood function maximization on the Reuters dataset is shown in Figure 37. The log-likelihood function converges very fast and becomes very stable after 300 iterations.

Figure 37. Example of log-likelihood maximization for PLSA

We performed text categorization experiments to investigate which number of topics is the most appropriate for each dataset, using the topic-document features generated by PLSA as the text representation. The results are shown in Table 21 and Fig. 38. From the results, we observe that for the Reuters, TDT2, 20news and VR datasets, 60, 140, 140 and 30 topics yield the best categorization accuracy, respectively.

149 As a result, we should generally define a larger number of topics during PLSA modeling on a dataset with a larger number of terms, but the relationship between term size and number of topics is not simply monotonic and linear. In order to obtain promising categorization performance, it is very important to select an appropriate number of topics that best differentiates documents in different classes. In practice, we may conclude that it is reasonable to set the number of topics to around √K, where K is the number of indexed terms. For the convenience of evaluation and analysis, in the later experiments we keep using 60 topics for Reuters, 140 topics for the TDT2 and 20news datasets, and 30 topics for the VR dataset.

Table 21. Text categorization performance based on different numbers of topics generated by PLSA (LTD matrix generated by PLSA)
# of topics | Reuters (8558 terms) | TDT2 | 20news | VR (1062 terms)
20  | 81.27% | 79.34% | 63.28% | 76.79%
30  | 80.91% | 84.39% | 63.67% | 82.32%
40  | 81.42% | 85.01% | 67.12% | 81.21%
60  | 84.60% | 87.01% | 74.24% | 80.11%
80  | 83.09% | 88.62% | 78.51% | 77.34%
100 | 83.33% | 89.28% | 78.07% | 74.03%
120 | 82.87% | 88.38% | 77.33% | 72.87%
140 | 81.43% | 90.24% | 79.46% | 70.25%
160 | 81.55% | 88.23% | 76.90% | 71.22%
180 | 80.12% | 89.97% | 72.13% | 69.55%
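The selection of the best topic number per dataset can be checked directly against the accuracies in Table 21. The short sketch below copies those values; the variable and function names are illustrative.

```python
TOPIC_COUNTS = [20, 30, 40, 60, 80, 100, 120, 140, 160, 180]

# Accuracy (%) per dataset for each topic count, copied from Table 21.
ACCURACY = {
    "Reuters": [81.27, 80.91, 81.42, 84.60, 83.09, 83.33, 82.87, 81.43, 81.55, 80.12],
    "TDT2": [79.34, 84.39, 85.01, 87.01, 88.62, 89.28, 88.38, 90.24, 88.23, 89.97],
    "20news": [63.28, 63.67, 67.12, 74.24, 78.51, 78.07, 77.33, 79.46, 76.90, 72.13],
    "VR": [76.79, 82.32, 81.21, 80.11, 77.34, 74.03, 72.87, 70.25, 71.22, 69.55],
}


def best_topic_count(accuracies):
    """Return (topic count, accuracy) for the highest accuracy in the list."""
    best_index = max(range(len(accuracies)), key=lambda i: accuracies[i])
    return TOPIC_COUNTS[best_index], accuracies[best_index]


best = {name: best_topic_count(values) for name, values in ACCURACY.items()}
```

The selected counts are 60 for Reuters, 140 for TDT2 and 20news, and 30 for VR, the same settings that are kept for the later experiments.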

150 Figure 38. Text categorization performance based on different numbers of topics generated by PLSA

CS 188: Artificial Intelligence Fall Probabilistic Models

CS 188: Artificial Intelligence Fall Probabilistic Models CS 188: Aificial Inelligence Fall 2007 Lecue 15: Bayes Nes 10/18/2007 Dan Klein UC Bekeley Pobabilisic Models A pobabilisic model is a join disibuion ove a se of vaiables Given a join disibuion, we can

More information

Representing Knowledge. CS 188: Artificial Intelligence Fall Properties of BNs. Independence? Reachability (the Bayes Ball) Example

Representing Knowledge. CS 188: Artificial Intelligence Fall Properties of BNs. Independence? Reachability (the Bayes Ball) Example C 188: Aificial Inelligence Fall 2007 epesening Knowledge ecue 17: ayes Nes III 10/25/2007 an Klein UC ekeley Popeies of Ns Independence? ayes nes: pecify complex join disibuions using simple local condiional

More information

Probabilistic Models. CS 188: Artificial Intelligence Fall Independence. Example: Independence. Example: Independence? Conditional Independence

Probabilistic Models. CS 188: Artificial Intelligence Fall Independence. Example: Independence. Example: Independence? Conditional Independence C 188: Aificial Inelligence Fall 2007 obabilisic Models A pobabilisic model is a join disibuion ove a se of vaiables Lecue 15: Bayes Nes 10/18/2007 Given a join disibuion, we can eason abou unobseved vaiables

More information

STUDY OF THE STRESS-STRENGTH RELIABILITY AMONG THE PARAMETERS OF GENERALIZED INVERSE WEIBULL DISTRIBUTION

STUDY OF THE STRESS-STRENGTH RELIABILITY AMONG THE PARAMETERS OF GENERALIZED INVERSE WEIBULL DISTRIBUTION Inenaional Jounal of Science, Technology & Managemen Volume No 04, Special Issue No. 0, Mach 205 ISSN (online): 2394-537 STUDY OF THE STRESS-STRENGTH RELIABILITY AMONG THE PARAMETERS OF GENERALIZED INVERSE

More information

Computer Propagation Analysis Tools

Computer Propagation Analysis Tools Compue Popagaion Analysis Tools. Compue Popagaion Analysis Tools Inoducion By now you ae pobably geing he idea ha pedicing eceived signal sengh is a eally impoan as in he design of a wieless communicaion

More information

Low-complexity Algorithms for MIMO Multiplexing Systems

Low-complexity Algorithms for MIMO Multiplexing Systems Low-complexiy Algoihms fo MIMO Muliplexing Sysems Ouline Inoducion QRD-M M algoihm Algoihm I: : o educe he numbe of suviving pahs. Algoihm II: : o educe he numbe of candidaes fo each ansmied signal. :

More information

Reinforcement learning

Reinforcement learning Lecue 3 Reinfocemen leaning Milos Hauskech milos@cs.pi.edu 539 Senno Squae Reinfocemen leaning We wan o lean he conol policy: : X A We see examples of x (bu oupus a ae no given) Insead of a we ge a feedback

More information

Lecture-V Stochastic Processes and the Basic Term-Structure Equation 1 Stochastic Processes Any variable whose value changes over time in an uncertain

Lecture-V Stochastic Processes and the Basic Term-Structure Equation 1 Stochastic Processes Any variable whose value changes over time in an uncertain Lecue-V Sochasic Pocesses and he Basic Tem-Sucue Equaion 1 Sochasic Pocesses Any vaiable whose value changes ove ime in an unceain way is called a Sochasic Pocess. Sochasic Pocesses can be classied as

More information

Lecture 18: Kinetics of Phase Growth in a Two-component System: general kinetics analysis based on the dilute-solution approximation

Lecture 18: Kinetics of Phase Growth in a Two-component System: general kinetics analysis based on the dilute-solution approximation Lecue 8: Kineics of Phase Gowh in a Two-componen Sysem: geneal kineics analysis based on he dilue-soluion appoximaion Today s opics: In he las Lecues, we leaned hee diffeen ways o descibe he diffusion

More information

On The Estimation of Two Missing Values in Randomized Complete Block Designs

On The Estimation of Two Missing Values in Randomized Complete Block Designs Mahemaical Theoy and Modeling ISSN 45804 (Pape ISSN 505 (Online Vol.6, No.7, 06 www.iise.og On The Esimaion of Two Missing Values in Randomized Complee Bloc Designs EFFANGA, EFFANGA OKON AND BASSE, E.

More information

An Automatic Door Sensor Using Image Processing

An Automatic Door Sensor Using Image Processing An Auomaic Doo Senso Using Image Pocessing Depamen o Elecical and Eleconic Engineeing Faculy o Engineeing Tooi Univesiy MENDEL 2004 -Insiue o Auomaion and Compue Science- in BRNO CZECH REPUBLIC 1. Inoducion

More information

, on the power of the transmitter P t fed to it, and on the distance R between the antenna and the observation point as. r r t

, on the power of the transmitter P t fed to it, and on the distance R between the antenna and the observation point as. r r t Lecue 6: Fiis Tansmission Equaion and Rada Range Equaion (Fiis equaion. Maximum ange of a wieless link. Rada coss secion. Rada equaion. Maximum ange of a ada. 1. Fiis ansmission equaion Fiis ansmission

More information

AN EVOLUTIONARY APPROACH FOR SOLVING DIFFERENTIAL EQUATIONS

AN EVOLUTIONARY APPROACH FOR SOLVING DIFFERENTIAL EQUATIONS AN EVOLUTIONARY APPROACH FOR SOLVING DIFFERENTIAL EQUATIONS M. KAMESWAR RAO AND K.P. RAVINDRAN Depamen of Mechanical Engineeing, Calicu Regional Engineeing College, Keala-67 6, INDIA. Absac:- We eploe

More information

Sections 3.1 and 3.4 Exponential Functions (Growth and Decay)

Sections 3.1 and 3.4 Exponential Functions (Growth and Decay) Secions 3.1 and 3.4 Eponenial Funcions (Gowh and Decay) Chape 3. Secions 1 and 4 Page 1 of 5 Wha Would You Rahe Have... $1million, o double you money evey day fo 31 days saing wih 1cen? Day Cens Day Cens

More information

On Control Problem Described by Infinite System of First-Order Differential Equations

On Control Problem Described by Infinite System of First-Order Differential Equations Ausalian Jounal of Basic and Applied Sciences 5(): 736-74 ISS 99-878 On Conol Poblem Descibed by Infinie Sysem of Fis-Ode Diffeenial Equaions Gafujan Ibagimov and Abbas Badaaya J'afau Insiue fo Mahemaical

More information

Lecture 22 Electromagnetic Waves

Lecture 22 Electromagnetic Waves Lecue Elecomagneic Waves Pogam: 1. Enegy caied by he wave (Poyning veco).. Maxwell s equaions and Bounday condiions a inefaces. 3. Maeials boundaies: eflecion and efacion. Snell s Law. Quesions you should

More information

Risk tolerance and optimal portfolio choice

Risk tolerance and optimal portfolio choice Risk oleance and opimal pofolio choice Maek Musiela BNP Paibas London Copoae and Invesmen Join wok wih T. Zaiphopoulou (UT usin) Invesmens and fowad uiliies Pepin 6 Backwad and fowad dynamic uiliies and

More information

The sudden release of a large amount of energy E into a background fluid of density

The sudden release of a large amount of energy E into a background fluid of density 10 Poin explosion The sudden elease of a lage amoun of enegy E ino a backgound fluid of densiy ceaes a song explosion, chaaceized by a song shock wave (a blas wave ) emanaing fom he poin whee he enegy

More information

Online Completion of Ill-conditioned Low-Rank Matrices

Online Completion of Ill-conditioned Low-Rank Matrices Online Compleion of Ill-condiioned Low-Rank Maices Ryan Kennedy and Camillo J. Taylo Compue and Infomaion Science Univesiy of Pennsylvania Philadelphia, PA, USA keny, cjaylo}@cis.upenn.edu Laua Balzano

More information

KINEMATICS OF RIGID BODIES

KINEMATICS OF RIGID BODIES KINEMTICS OF RIGID ODIES In igid body kinemaics, we use he elaionships govening he displacemen, velociy and acceleaion, bu mus also accoun fo he oaional moion of he body. Descipion of he moion of igid

More information

General Non-Arbitrage Model. I. Partial Differential Equation for Pricing A. Traded Underlying Security

General Non-Arbitrage Model. I. Partial Differential Equation for Pricing A. Traded Underlying Security 1 Geneal Non-Abiage Model I. Paial Diffeenial Equaion fo Picing A. aded Undelying Secuiy 1. Dynamics of he Asse Given by: a. ds = µ (S, )d + σ (S, )dz b. he asse can be eihe a sock, o a cuency, an index,

More information

[ ] 0. = (2) = a q dimensional vector of observable instrumental variables that are in the information set m constituents of u

[ ] 0. = (2) = a q dimensional vector of observable instrumental variables that are in the information set m constituents of u Genealized Mehods of Momens he genealized mehod momens (GMM) appoach of Hansen (98) can be hough of a geneal pocedue fo esing economics and financial models. he GMM is especially appopiae fo models ha

More information

Combinatorial Approach to M/M/1 Queues. Using Hypergeometric Functions

Combinatorial Approach to M/M/1 Queues. Using Hypergeometric Functions Inenaional Mahemaical Foum, Vol 8, 03, no 0, 463-47 HIKARI Ld, wwwm-hikaicom Combinaoial Appoach o M/M/ Queues Using Hypegeomeic Funcions Jagdish Saan and Kamal Nain Depamen of Saisics, Univesiy of Delhi,

More information

r P + '% 2 r v(r) End pressures P 1 (high) and P 2 (low) P 1 , which must be independent of z, so # dz dz = P 2 " P 1 = " #P L L,

r P + '% 2 r v(r) End pressures P 1 (high) and P 2 (low) P 1 , which must be independent of z, so # dz dz = P 2  P 1 =  #P L L, Lecue 36 Pipe Flow and Low-eynolds numbe hydodynamics 36.1 eading fo Lecues 34-35: PKT Chape 12. Will y fo Monday?: new daa shee and daf fomula shee fo final exam. Ou saing poin fo hydodynamics ae wo equaions:

More information

Quantum Algorithms for Matrix Products over Semirings

Quantum Algorithms for Matrix Products over Semirings CHICAGO JOURNAL OF THEORETICAL COMPUTER SCIENCE 2017, Aicle 1, pages 1 25 hp://cjcscsuchicagoedu/ Quanum Algoihms fo Maix Poducs ove Semiings Fançois Le Gall Haumichi Nishimua Received July 24, 2015; Revised

More information

Orthotropic Materials

Orthotropic Materials Kapiel 2 Ohoopic Maeials 2. Elasic Sain maix Elasic sains ae elaed o sesses by Hooke's law, as saed below. The sesssain elaionship is in each maeial poin fomulaed in he local caesian coodinae sysem. ε

More information

Unsupervised Segmentation of Moving MPEG Blocks Based on Classification of Temporal Information

Unsupervised Segmentation of Moving MPEG Blocks Based on Classification of Temporal Information Unsupevised Segmenaion of Moving MPEG Blocs Based on Classificaion of Tempoal Infomaion Ofe Mille 1, Ami Avebuch 1, and Yosi Kelle 2 1 School of Compue Science,Tel-Aviv Univesiy, Tel-Aviv 69978, Isael

More information

Dynamic Estimation of OD Matrices for Freeways and Arterials

Dynamic Estimation of OD Matrices for Freeways and Arterials Novembe 2007 Final Repo: ITS Dynamic Esimaion of OD Maices fo Feeways and Aeials Auhos: Juan Calos Heea, Sauabh Amin, Alexande Bayen, Same Madana, Michael Zhang, Yu Nie, Zhen Qian, Yingyan Lou, Yafeng

More information

Retrieval Models. Boolean and Vector Space Retrieval Models. Common Preprocessing Steps. Boolean Model. Boolean Retrieval Model

Retrieval Models. Boolean and Vector Space Retrieval Models. Common Preprocessing Steps. Boolean Model. Boolean Retrieval Model 1 Boolean and Vecor Space Rerieval Models Many slides in his secion are adaped from Prof. Joydeep Ghosh (UT ECE) who in urn adaped hem from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong) Rerieval

More information

Monochromatic Wave over One and Two Bars

Monochromatic Wave over One and Two Bars Applied Mahemaical Sciences, Vol. 8, 204, no. 6, 307-3025 HIKARI Ld, www.m-hikai.com hp://dx.doi.og/0.2988/ams.204.44245 Monochomaic Wave ove One and Two Bas L.H. Wiyano Faculy of Mahemaics and Naual Sciences,

More information

A Weighted Moving Average Process for Forecasting. Shou Hsing Shih Chris P. Tsokos

A Weighted Moving Average Process for Forecasting. Shou Hsing Shih Chris P. Tsokos A Weighed Moving Aveage Pocess fo Foecasing Shou Hsing Shih Chis P. Tsokos Depamen of Mahemaics and Saisics Univesiy of Souh Floida, USA Absac The objec of he pesen sudy is o popose a foecasing model fo

More information

Lecture 17: Kinetics of Phase Growth in a Two-component System:

Lecture 17: Kinetics of Phase Growth in a Two-component System: Lecue 17: Kineics of Phase Gowh in a Two-componen Sysem: descipion of diffusion flux acoss he α/ ineface Today s opics Majo asks of oday s Lecue: how o deive he diffusion flux of aoms. Once an incipien

More information
