NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION


CHIA LI SHI 1 AND LUA KIM TENG 2
School of Computing, National University of Singapore
3 Science Drive 2, Singapore 117543

Abstract

This report explores the possibility of creating a pinyin-character conversion system which is more dynamic in nature. The proposed system will undergo learning stages to build up its knowledge database. Subsequently, rules will be formulated to assist in the conversion stage. Learning will take place throughout the entire life-cycle of the system. It takes place in the knowledge building stage, when the system starts out as a naïve system, and later on the system continues to learn through the mistakes that it makes during application of the rules. The purpose of this system is to break away from the conventional pure statistical approach to pinyin-character conversion and to build a more intelligent system.

1 Introduction

The Chinese input method is the fundamental tool that is used in Chinese information processing. The accuracy and ease of usage of the Chinese input method have a direct impact on the efficiency of Chinese information processing. Hence, it is important to have a Chinese input method that is accurate, efficient and has a short learning curve.

Chinese PC users can input Chinese using the speech method, which can achieve up to 95% accuracy [Lua, 1998], and the keyboard method by the radical approach or the phonetic approach [Lua, 1998]. The radical approach includes 5-Strokes (五笔) and Cangjie while the phonetic approach includes Hanyu Pinyin. Chinese input using Hanyu Pinyin is still the most popular method of input among Chinese PC users since the radical approach requires more memorization.

The main challenges of Chinese information processing (specifically pinyin to character conversion) are the large number of Chinese characters and the existence of homonym characters and words. The existence of homonym characters leads to the situation where there is no one-to-one mapping between a pinyin and a character.
This implies that where more than one character maps to a pinyin, the user will have to select the desired character from a set of possible characters. This situation, in turn, results in a lower input speed and the impossibility of blind-typing.

The main goal of Chinese information processing in the field of pinyin to character conversion is to optimize the conversion process so that the conversion is carried out with an accuracy rate that is as high as possible. In addition, the need for the user to manually select the correct character from a set of possible homonym characters should be minimized in order to improve the input speed of the user.

Footnotes: 1 Student. 2 Supervisor.

1.1 Contributions of My Research Findings

Given the difficulties of Chinese information processing mentioned above, the results of this research project are able to alleviate some of these difficulties and improve on the current accuracy levels of conversion. In particular, my research findings will accomplish the following:

(a) Show that the incorporation of learning capabilities results in a more dynamic and cleverer system, with knowledge that is more suitable to a particular context.
(b) Demonstrate that the existence and usage of rules to help predict and determine the output character will yield better conversion accuracy.
(c) Provide a pinyin-character system that does not need sophisticated statistical and artificial intelligence methods (for instance, neural networks) in order to achieve higher conversion accuracy.

2 Objectives of Project

My main objective in this project is to propose a pinyin-character conversion method that is able to accomplish the following:
(a) Develop and apply rules: rules are developed when the system is learning new knowledge and applied when the user is using the system. The usage of these rules is one of the main factors that determine which character is the best choice as output for a particular pinyin. In the event that rules are not available, a simple statistical method will be used.
(b) Incorporation of learning capabilities: the system will be able to learn new knowledge as well as learn from its mistakes. It is also able to adjust itself to adapt to different user usage patterns. This feature will greatly enhance the dynamism of Chinese input.
(c) Improved accuracy: the frequency of producing the wrong character will be reduced. In fact, this accuracy should improve as the system becomes more customized to the usage pattern of the user. This improvement in accuracy is achieved through the rules mentioned in point (a).
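Objective (a) can be illustrated with a minimal Python sketch. The paper does not specify the form of the rules, so the sketch assumes a rule keyed on the previous output character and the current pinyin. The dictionary entries, frequency counts and the names PINYIN_DICT, CHAR_FREQ, RULES and convert are all hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical pinyin-character dictionary: each pinyin lists its homonym candidates.
PINYIN_DICT = {
    "ta": ["他", "她", "它"],
    "men": ["们", "门"],
}

# Hypothetical character frequencies, standing in for the "simple statistical method".
CHAR_FREQ = Counter({"他": 50, "她": 30, "它": 10, "们": 20, "门": 25})

# Assumed rule form: IF previous character is X AND current pinyin is P THEN output C.
RULES = {("他", "men"): "们"}


def convert(pinyins):
    """Convert a list of pinyin syllables to characters, left to right."""
    output = []
    for p in pinyins:
        candidates = PINYIN_DICT.get(p)
        if not candidates:
            output.append(p)  # unknown pinyin is passed through unchanged
            continue
        prev = output[-1] if output else None
        choice = RULES.get((prev, p))
        if choice is None:
            # No rule applies: fall back to the most frequent candidate.
            choice = max(candidates, key=lambda c: CHAR_FREQ[c])
        output.append(choice)
    return "".join(output)
```

With these illustrative values, "men" on its own falls back to the more frequent 门, while "ta men" triggers the rule and yields 他们.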

3 Current Pinyin Input Systems

Current pinyin input systems adopt more sophisticated methods to improve the accuracy of the conversion, such as the following:
(a) Trigram language model and statistical model [Chen and Lee, 2000]
(b) Idiomatic phrase matching, adjacency constraint rules and statistical method [Lochovsky and Chung, 1994]
(c) Neural network [Yuan, Kunst, and Borchardt, 1994]

These systems are able to improve the accuracy of the conversion, hence improving the input speed of the user. However, these systems lack a dynamic facet, in which the system is always learning and evolving to improve itself.

4 The Proposed System

The system that I am proposing in this research project is a rule-based system that integrates learning capabilities with the application of simple statistics. Special features of the proposed system:

(a) Naïve Beginning
The system starts out as a naïve system with no built-in knowledge at all, except a pinyin-character dictionary, which is a collection of characters and their corresponding pinyins. A character dictionary, and not a word dictionary, is used since combinations of characters (resulting in words) can be learnt through rule formulation. This allows the system to be customized according to the context in which it is used right from the beginning. Naivety of the system means that the possibility of having wrong pre-loaded knowledge is lowered. It also enhances the dynamism of the system, since the knowledge that the system has is not restricted to the information that has been pre-loaded.

In fact, a naïve pinyin input system is analogous to a young child who has no knowledge of the Chinese language.

(b) Dynamic Learning of Knowledge
Knowledge, in the form of rules, is generated dynamically during several phases of learning. Knowledge of the system is acquired by processing text from the primary school syllabus of Singapore as well as newspaper articles from local newspapers. The texts are provided as input to the system, starting from the easiest (Primary Two syllabus). The difficulty level increases from the Primary Two syllabus to the Primary Four syllabus, and then to the newspaper articles.

(c) Dynamic Generation of Rules
The rules are a series of if-then clauses which are generated dynamically during the learning stages, with the system keeping only the rule parameters, as well as other variables useful for statistical purposes, in the database. Using the analogy of a young child, generation of rules during the learning stage is similar to the learning process of the young child, who will form relationships between different characters and words.

(d) Housekeeping of the Rules Database
The Rules Database must undergo a housekeeping stage regularly to remove rubbish rules. This ensures that only those rules that can produce considerably accurate outputs are kept in the database. In this way, the accuracy of the system can be maintained.

(e) Penalty for Wrong Output
Penalties are given to a rule if the application of the rule results in a wrong output. This helps to lower the chances of similar cases of wrong output in the future. In other words, the system is learning from its mistakes. The higher the frequency of the rule, the higher the penalty given. This is because a high frequency should indicate a higher accuracy rate, so if application of such a rule results in a wrong output, it is only right that a heavier penalty be given to the rule. This penalty is analogous to a child remembering where he or she has made mistakes and trying not to make the same mistakes again.
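Features (c) to (e) can be sketched together in Python. This is an illustration under stated assumptions, not the implementation used in the project: the rule form (previous character plus current pinyin), the penalty factor, the score definition and the housekeeping threshold are all hypothetical, as are the names Rule, learn, penalise and housekeep.

```python
class Rule:
    """Assumed rule form: IF previous character AND current pinyin THEN output."""

    def __init__(self, prev_char, pinyin, output):
        self.prev_char = prev_char
        self.pinyin = pinyin
        self.output = output
        self.frequency = 0    # how often the pattern was seen during learning
        self.penalty = 0.0    # accumulated penalty for wrong outputs

    def score(self):
        # Hypothetical quality measure used by housekeeping.
        return self.frequency - self.penalty


def learn(rules, text, char_to_pinyin):
    """Generate or reinforce one rule per adjacent character pair in the text."""
    for prev, cur in zip(text, text[1:]):
        pinyin = char_to_pinyin.get(cur)
        if pinyin is None:
            continue  # character missing from the dictionary: nothing to learn
        key = (prev, pinyin, cur)
        if key not in rules:
            rules[key] = Rule(prev, pinyin, cur)
        rules[key].frequency += 1


def penalise(rule, factor=0.5):
    """Penalty grows with the rule's frequency; factor is an assumed constant."""
    rule.penalty += factor * rule.frequency


def housekeep(rules, min_score=1.0):
    """Remove rules whose score has dropped below an assumed threshold."""
    for key in [k for k, r in rules.items() if r.score() < min_score]:
        del rules[key]
```

Because the penalty is proportional to frequency, a frequently seen rule that produces a wrong output loses more score than a rarely seen one, which follows the reasoning given in (e).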

5 Testing Methodology and Results

5.1 Testing Objectives

The objectives of the testing phase are to accomplish the following:
(a) Determine the accuracy rates and error rates of my system.
(b) Demonstrate that the accuracy of the system improves as the system increases its knowledge base in the form of rules.
(c) Find out from the testers the possible future improvements of my system.

5.2 Testing Methodology

I have designed two phases of testing, each to be done after one stage of learning. This means that the first phase of testing is done after the Primary 2 corpus has been learnt and the second phase of testing is done after the Primary 4 corpus has been learnt.

I have 10 testers, and each tester enters 20 pinyin inputs on two different machines using two conversion systems, namely my proposed system and the Microsoft Pinyin conversion system. Each tester types in the input that they desire and records the output that they see. The error rate is calculated using the following formula:

Error rate = (Total number of errors / Total number of characters) x 100%

The accuracy rate is then 100% minus the error rate.
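The calculation can be written as a short Python sketch. The function names and the counts in the usage note are hypothetical and are not taken from the test data.

```python
def error_rate(errors, total_chars):
    """Percentage of characters that were converted wrongly."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return 100.0 * errors / total_chars


def accuracy_rate(errors, total_chars):
    """Accuracy is the complement of the error rate."""
    return 100.0 - error_rate(errors, total_chars)
```

For example, with a hypothetical 5 errors in 20 characters, error_rate(5, 20) gives 25.0 and accuracy_rate(5, 20) gives 75.0.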

After the entire set of test cases has been tested and recorded, the testers will evaluate their typing speed, i.e., evaluate whether the system has enabled them to improve their typing speed. They will give this evaluation on a scale of one to ten, in increasing order of satisfaction with the system.

5.3 Test Results (Accuracy)

Overall, it can be seen that the error rate improves as the system acquires more knowledge.

[Figure 1: Comparison of Error Rates. Y-axis: Number of Errors; series: Error Rate (Microsoft), Error Rate (Pri2), Error Rate (Pri4).]

From Figure 1, it can be seen that the error rate for Microsoft is considerably smaller. This is understandable, as the Microsoft system is a more established and stable system. As for my proposed system, the accuracy after the first stage of learning is not very satisfactory: as can be seen from Figure 1, the error rates are quite large, implying low accuracy. However, as the system picks up more knowledge from the next round of learning, the error rates fall below the Primary 2 error rates. In some cases, the system can even perform better than the Microsoft system.

Hence, based on the formula given in the previous section, the error rates for the three systems, namely Microsoft, the proposed system after the Primary 2 corpus and the proposed system after the Primary 4 corpus, are as follows:

Microsoft = (20 / total number of characters) x 100% = 9%
System (Pri2) = (79 / total number of characters) x 100% = 35%
System (Pri4) = (54 / total number of characters) x 100% = 24%

Given the above error rates, the accuracy rates are as follows:

Microsoft = 91%
System (Pri2) = 65%
System (Pri4) = 76%

[Figure 2: Comparison of Accuracy Rates. Y-axis: Percentage; x-axis: System Type, showing System (Pri2) and System (Pri4).]

6 Conclusion

The test results show that the test objectives have been accomplished. The overall accuracy rate improves by about 11 percentage points, from 65% to 76%. This improvement is quite significant, since the systems tested are in the preliminary stages of learning. In addition, important and correct rules were discovered, and these rules helped the system achieve the 11-point improvement.

References

Books:
Cormen, Thomas H. et al. (2001). Introduction to Algorithms. Second Edition. The Massachusetts Institute of Technology, United States of America.
Mitchell, Tom (1997). Machine Learning. First Edition. McGraw-Hill International, United States.

Conference Papers:
Chen, Zheng and Lee, Kai-Fu (2000). A New Statistical Approach to Chinese Pinyin Input. ACL-2000: The 38th Annual Meeting of the Association for Computational Linguistics (3-6 October 2000), Hong Kong.
Lua, Kim Teng (1998). Chinese Information Processing - Past, Present and Future. Proceedings of the 3rd International Workshop on Information Retrieval with Asian Languages (IRAL'98, 1-2, 15-16 October 1998).
Brill, Eric (1997). Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging. To appear in Natural Language Processing Using Very Large Corpora. Kluwer Academic Press.