NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION

CHIA LI SHI 1 AND LUA KIM TENG 2
School of Computing, National University of Singapore
3 Science Drive 2, Singapore 117543

Abstract

This report explores the possibility of creating a pinyin-character conversion system that is more dynamic in nature. The proposed system will undergo learning stages to build up its knowledge database. Subsequently, rules will be formulated to assist in the conversion stage. Learning takes place throughout the entire life cycle of the system. It begins in the knowledge-building stage, when the system starts out as a naïve system, and continues later as the system learns from the mistakes it makes while applying the rules. The purpose of this system is to break away from the conventional, purely statistical approach to pinyin-character conversion and to build a more intelligent system.

1 Introduction

The Chinese input method is the fundamental tool used in Chinese information processing. The accuracy and ease of use of the Chinese input method have a direct impact on the efficiency of Chinese information processing. Hence, it is important to have a Chinese input method that is accurate, efficient and has a short learning curve.

Chinese PC users can input Chinese using the speech method, which can achieve up to 95% accuracy [Lua, 1998], or the keyboard method, using either the radical approach or the phonetic approach [Lua, 1998]. The radical approach includes 5-Strokes (五笔) and Cangjie, while the phonetic approach includes Hanyu Pinyin. Input using Hanyu Pinyin is still the most popular method among Chinese PC users, since the radical approach requires more memorization.

The main challenges of Chinese information processing (specifically pinyin to character conversion) are the large number of Chinese characters and the existence of homonym characters and words. Because of homonym characters, there is no one-to-one mapping between a pinyin and a character.
This implies that when more than one character maps to a pinyin, the user has to select the desired character from a set of possible characters. This in turn results in a lower input speed and makes blind typing impossible.

The main goal of Chinese information processing in the field of pinyin to character conversion is to optimize the conversion process so that it achieves as high an accuracy rate as possible. In addition, the need for the user to manually select the correct character from a set of possible homonym characters should be minimized in order to improve the user's input speed.

1 Student
2 Supervisor

1.1 Contributions of My Research Findings

Given the difficulties of Chinese information processing mentioned above, the results of this research project alleviate some of these difficulties and improve on the current accuracy levels of conversion. In particular, my research findings will accomplish the following:

(a) Show that the incorporation of learning capabilities results in a more dynamic and more intelligent system, with knowledge that is better suited to a particular context.
(b) Demonstrate that the existence and use of rules to help predict and determine the output character yields better conversion accuracy.
(c) Present a pinyin-character system that does not need sophisticated statistical and artificial intelligence methods (for instance, neural networks) in order to achieve higher conversion accuracy.

2 Objectives of Project

My main objective in this project is to propose a pinyin-character conversion method that is able to accomplish the following:

(a) Develop and apply rules: rules are developed when the system is learning new knowledge and applied when the user is using the system. The use of these rules is one of the main factors that determine which character is the best choice of output for a particular pinyin. When no rules are available, a simple statistical method is used.
(b) Incorporation of learning capabilities: the system is able to learn new knowledge as well as learn from its mistakes. It is also able to adjust itself to different user usage patterns. This feature greatly enhances the dynamism of Chinese input.
(c) Improved accuracy: the frequency of producing the wrong character is reduced. In fact, accuracy should improve as the system becomes more customized to the usage pattern of the user. This improvement in accuracy is achieved through the rules mentioned in point (a).
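Objective (a) can be illustrated with a minimal sketch. The paper does not specify the form of a rule, so the sketch assumes a hypothetical rule that maps a (previous output character, current pinyin) pair to a character, and it assumes that the "simple statistical method" means choosing the most frequent candidate. The dictionary entries, frequency counts and the names `PINYIN_DICT`, `CHAR_FREQ`, `RULES` and `convert` are invented for illustration only.

```python
from collections import Counter

# Hypothetical pinyin-character dictionary (tiny, for illustration only).
PINYIN_DICT = {"bei": ["北", "被"], "jing": ["经", "京"]}

# Hypothetical character frequencies, as might be counted from a corpus.
CHAR_FREQ = Counter({"北": 40, "被": 30, "经": 50, "京": 10})

# Hypothetical rule form (an assumption, not the paper's definition):
# if the previous output character is X and the current pinyin is P,
# then output character C.
RULES = {("北", "jing"): "京"}


def convert(pinyins):
    """Convert a list of pinyin syllables to characters, rules first."""
    output = []
    for pinyin in pinyins:
        candidates = PINYIN_DICT[pinyin]
        previous = output[-1] if output else None
        chosen = RULES.get((previous, pinyin))
        if chosen is None:
            # No rule applies: fall back to the most frequent candidate.
            chosen = max(candidates, key=lambda c: CHAR_FREQ[c])
        output.append(chosen)
    return "".join(output)
```

With these assumed values, `convert(["jing"])` falls back to the most frequent candidate, while `convert(["bei", "jing"])` applies the rule for the second syllable.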
3 Current Pinyin Input Systems

Current pinyin input systems adopt more sophisticated methods to improve the accuracy of the conversion, including the following:

(a) Trigram language model and statistical model [Chen and Lee, 2000]
(b) Idiomatic phrase matching, adjacency constraint rules and statistical method [Lochovsky and Chung, 1994]
(c) Neural network [Yuan, Kunst, and Borchardt, 1994]

These systems are able to improve the accuracy of the conversion and hence the input speed of the user. However, they lack a dynamic facet, in which the system is always learning and evolving to improve itself.

4 The Proposed System

The system that I am proposing in this research project is a rule-based system with integrated learning capabilities and the application of simple statistics. Special features of the proposed system:

(a) Naïve Beginning
The system starts out as a naïve system with no built-in knowledge at all, except a pinyin-character dictionary, which is a collection of characters and their corresponding pinyins. A character dictionary, and not a word dictionary, is used since combinations of characters (resulting in words) can be learnt through rule formulation. This allows the system to be customized, right from the beginning, to the context in which it is used. The naivety of the system lowers the possibility of having wrong pre-loaded knowledge. It also enhances the dynamism of the system, since the knowledge that the system has is not restricted to the information that has been pre-loaded.

In fact, a naïve pinyin input system is analogous to a young child who has no knowledge of the Chinese language.

(b) Dynamic Learning of Knowledge
Knowledge, in the form of rules, is generated dynamically during several phases of learning. The system acquires its knowledge by processing text from the primary school syllabus of Singapore as well as articles from local newspapers. The texts are provided as input to the system in order of increasing difficulty, starting with the easiest (the Primary 2 syllabus), followed by the Primary 4 syllabus and then the newspaper articles.

(c) Dynamic Generation of Rules
The rules are a series of if-then clauses which are generated dynamically during the learning stages, with the system keeping only the rule parameters, as well as other variables useful for statistical purposes, in the database. Using the analogy of a young child, the generation of rules during the learning stage is similar to the learning process of a young child, who forms relationships between different characters and words.

(d) Housekeep Rules Database
The rules database must undergo housekeeping regularly to remove rubbish rules. This ensures that only those rules that can produce reasonably accurate outputs are kept in the database. In this way, the accuracy of the system can be maintained.

(e) Penalty for Wrong Output
A penalty is given to a rule if its application results in a wrong output. This helps to lower the chances of similar wrong outputs in the future. In other words, the system learns from its mistakes. The higher the frequency of the rule, the higher the penalty given. This is because a high frequency should indicate a higher accuracy rate, so if applying such a rule results in a wrong output, it is only right that the rule be given a heavier penalty. This penalty is analogous to a child remembering where he or she has made mistakes and trying not to make the same mistakes again.
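Features (d) and (e) can be sketched together as follows. This is an illustration under stated assumptions rather than the system's actual implementation: the paper gives neither the rule parameters nor the penalty formula, so the `Rule` fields, the `score` definition (frequency minus penalty), the penalty weight of 0.5 and the housekeeping threshold of 1.0 are all hypothetical. The only property taken from the text is that the penalty grows with the frequency of the rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    """A learnt if-then rule plus counters kept for statistical purposes.

    All field names are assumptions made for this sketch.
    """
    previous_char: str
    pinyin: str
    output_char: str
    frequency: int = 0    # how often the rule was observed during learning
    penalty: float = 0.0  # accumulated penalty for wrong outputs

    def score(self) -> float:
        # Assumed scoring: frequency reduced by accumulated penalties.
        return self.frequency - self.penalty


def penalise(rule: Rule, weight: float = 0.5) -> None:
    """Penalise a rule whose application produced a wrong output.

    The penalty is proportional to the rule's frequency, so a frequently
    seen rule is punished more heavily. The weight is an assumed value.
    """
    rule.penalty += weight * rule.frequency


def housekeep(rules: List[Rule], min_score: float = 1.0) -> List[Rule]:
    """Keep only rules whose score reaches an assumed minimum threshold."""
    return [rule for rule in rules if rule.score() >= min_score]


# Usage with invented rules: the second rule is wrong twice and is removed.
rules = [
    Rule("北", "jing", "京", frequency=10),
    Rule("北", "jing", "经", frequency=2),
]
penalise(rules[1])
penalise(rules[1])
kept = housekeep(rules)
```

Making the penalty proportional to frequency is one simple way to satisfy "the higher the frequency, the higher the penalty"; other monotonic functions would fit the description equally well.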
5 Testing Methodology and Results

5.1 Testing Objectives

The objectives of the testing phase are to accomplish the following:

(a) Determine the accuracy rates and error rates of my system.
(b) Demonstrate that the accuracy of the system improves as the system increases its knowledge base in the form of rules.
(c) Find out from the testers what future improvements could be made to my system.

5.2 Testing Methodology

I have designed two phases of testing, each to be done after one stage of learning. The first phase of testing is done after the Primary 2 corpus has been learnt, and the second phase is done after the Primary 4 corpus has been learnt.

I have 10 testers, and each tester enters 20 pinyin inputs on two different machines using the conversion systems under test, namely my proposed system and the Microsoft Pinyin Conversion System. Each tester types in the input that they desire and records the output that they see. The error rate is calculated using the following formula:

Error rate = (Total number of errors / Total number of characters) x 100%

The accuracy rate is then 100% minus the error rate.
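The calculation can be written out as a short sketch. The formula above divides the number of errors by the number of characters, which gives an error rate; the accuracy rate is taken here as 100% minus that value, which matches the way error and accuracy figures are paired in Section 5.3. The function names and the numbers in the example are hypothetical and are not taken from the test data.

```python
def error_rate(total_errors: int, total_characters: int) -> float:
    """Return the error rate as a percentage of characters typed."""
    if total_characters <= 0:
        raise ValueError("total_characters must be positive")
    return total_errors / total_characters * 100.0


def accuracy_rate(total_errors: int, total_characters: int) -> float:
    """Return the accuracy rate, assumed to be 100% minus the error rate."""
    return 100.0 - error_rate(total_errors, total_characters)


# Hypothetical example: 5 wrong characters out of 20 characters typed.
example_error = error_rate(5, 20)
example_accuracy = accuracy_rate(5, 20)
```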

After the entire set of test cases has been tested and recorded, the testers evaluate their typing speed, i.e., whether the system has enabled them to improve their typing speed. They rate this on a scale of one to ten, with higher scores indicating greater satisfaction with the system.

5.3 Test Results (Accuracy)

Overall, it can be seen that the error rate improves as the system acquires more knowledge.

[Figure 1: Comparison of Error Rates. Vertical axis: number of errors. Series: Error Rate (Microsoft), Error Rate (Pri2), Error Rate (Pri4).]

From Figure 1, it can be seen that the error rate for Microsoft is considerably smaller. This is understandable, as the Microsoft system is a more established and stable system.

As for my proposed system, the accuracy after the first stage of learning is not very satisfactory: as Figure 1 shows, the error rates are quite large, implying low accuracy. However, as the system picks up more knowledge from the next round of learning, the error rates fall. As seen from the figure, the Primary 4 error rates are lower than the Primary 2 error rates. In some cases, the system even performs better than the Microsoft system.

Hence, based on the formula given in the previous section, the error rates for the three systems, namely Microsoft, the proposed system after the Primary 2 corpus and the proposed system after the Primary 4 corpus, are as follows:

Microsoft = (20 / Total number of characters) x 100% = 9%
System (Pri2) = (79 / Total number of characters) x 100% = 35%
System (Pri4) = (54 / Total number of characters) x 100% = 24%

Given the above error rates, the accuracy rates are as below:

Microsoft = 91%
System (Pri2) = 65%
System (Pri4) = 76%

[Figure 2: Comparison of Accuracy Rates. Vertical axis: percentage. Horizontal axis: system type, System (Pri2) and System (Pri4).]

6 Conclusion

The test results show that the test objectives have been accomplished. Overall accuracy improved by about 11 percentage points, from 65% after the Primary 2 corpus to 76% after the Primary 4 corpus. This improvement is quite significant, since the systems tested are in the preliminary stages of learning. In addition, the system discovered important and correct rules, which helped it achieve this improvement.

References

Books:

Cormen, Thomas H. et al. (2001). Introduction to Algorithms. Second Edition. The Massachusetts Institute of Technology, United States of America.

Mitchell, Tom (1997). Machine Learning. First Edition. McGraw-Hill International, United States.

Conference Papers:

Chen, Zheng and Lee, Kai-Fu (2000). A New Statistical Approach to Chinese Pinyin Input. ACL-2000, The 38th Annual Meeting of the Association for Computational Linguistics (3-6 October 2000), Hong Kong.

Lua, Kim Teng (1998). Chinese Information Processing - Past, Present and Future. Proceedings of the 3rd International Workshop on Information Retrieval with Asian Languages (IRAL'98, 1-2, 15-16 October, 1998).

Brill, Eric (1997). Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging. To appear in Natural Language Processing Using Very Large Corpora. Kluwer Academic Press.